The next version of .NET (3.0) contains a new DLL called WindowsBase. This DLL contains a new namespace called System.IO.Packaging, which contains lots of useful classes for working with package files such as Word documents in the .docx format.
Essentially a .docx document is just a zip file containing a bunch of files, mostly XML. The System.IO.Packaging namespace allows you to open and manipulate these files.
I thought I would try to use these classes to create an HttpHandler which would allow you to preview a Word document in the .docx format. This would need to convert not only the text stored in the document, but also the images contained in it.
Inside the .docx file
If you take a .docx document and change the file extension to .zip you will be able to see the files contained within the document. When you look inside you will see a number of folders containing mostly XML. Below you see the document and the files contained in the zip (docx) file.

For a Word document you will see the 'word' folder, in which you will find the file 'document.xml'...this is your actual document in XML format. The XML is not exactly friendly, but there is plenty of documentation around. This file is what we will use to create the HTML preview. The other files we will be interested in are contained in the word\media folder...this is where the embedded images are stored.
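The same listing can be produced in code rather than by renaming the file. The following is a minimal sketch, assuming a console project that references WindowsBase.dll; the class name and the default path are hypothetical and only there for illustration.

```csharp
using System;
using System.IO;
using System.IO.Packaging; // requires a reference to WindowsBase.dll

class ListDocxParts
{
    static void Main(string[] args)
    {
        // Hypothetical default path; pass a real .docx path as the first argument.
        string sPath = args.Length > 0 ? args[0] : @"c:\temp\mydocument.docx";

        // Open the package read-only so the document is not modified.
        using (Package oPackage = Package.Open(sPath, FileMode.Open, FileAccess.Read))
        {
            foreach (PackagePart oPart in oPackage.GetParts())
            {
                // Print each part's URI together with its content type.
                Console.WriteLine("{0} ({1})", oPart.Uri, oPart.ContentType);
            }
        }
    }
}
```

For a typical Word document you would expect to see /word/document.xml in the output, plus one entry under /word/media/ for each embedded image.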
Converting from .docx to HTML
As the Word document is actually an XML file we can easily convert it to HTML using XSLT...at least in theory we can. The problem is that the OpenXML format is pretty complicated, and writing the XSLT for all but the simplest documents would be difficult and time consuming.
Fortunately for those of us using SharePoint, Microsoft has already done the hard work for us. MOSS comes with standard document converters found at C:\Program Files\Microsoft Office Servers\12.0\TransformApps. These applications are used when you convert a document into a web page in a publishing site. The docx->html converter uses XSLT to perform the conversion and so we will re-use this XSLT...with a couple of modifications.
Modifying the DocX2Html XSLT
The XSL used by SharePoint is called DocX2Html.xsl and can be found in the TransformApps folder, along with the document converters. By default this XSL ignores images as the document converter does not extract them. We need to add an <xsl:template> to handle the images and create the required <img> tag.
Take a copy of the DocX2Html.xsl and make the following changes...
1. Add the following XSL template to the stylesheet.
<xsl:template match="w:drawing">
  <xsl:variable name="w">
    <xsl:value-of select=".//wp:extent/@cx"/>
  </xsl:variable>
  <xsl:variable name="h">
    <xsl:value-of select=".//wp:extent/@cy"/>
  </xsl:variable>
  <img src="?image={.//a:blip/@r:embed[1]}">
    <xsl:attribute name="width">
      <xsl:value-of select="concat(number($w) div 9525, 'px')"/>
    </xsl:attribute>
    <xsl:attribute name="height">
      <xsl:value-of select="concat(number($h) div 9525, 'px')"/>
    </xsl:attribute>
  </img>
</xsl:template>
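This XSL generates the <img> tag, sets the src by creating a query string with the ID of the image and ensures the width and height are correct. The sizes in the document are stored in EMUs (English Metric Units), so dividing by 9525 converts them to pixels at 96 DPI.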
2. Add the following namespaces to the stylesheet element at the start of the XSLT.
xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"
The template also uses the a prefix (for a:blip). If your copy of the stylesheet does not already declare it, add xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" as well.
3. Add the following parameter just after the <xsl:output> element.
<xsl:param name="relsDoc"/>
4. Save the file.
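The division by 9525 in the template from step 1 may look like a magic number. It comes from the EMU unit used by DrawingML: there are 914400 EMUs to the inch, and at an assumed 96 pixels per inch that works out at 9525 EMUs per pixel. The small sketch below shows the same arithmetic in C#; the class and method names are hypothetical and not part of the handler.

```csharp
using System;

static class EmuConverter
{
    // 914400 EMUs per inch divided by 96 pixels per inch gives 9525 EMUs per pixel.
    const long EmusPerInch = 914400;
    const long AssumedDpi = 96;

    public static double ToPixels(long lEmu)
    {
        return (double)lEmu / (EmusPerInch / AssumedDpi);
    }

    static void Main()
    {
        // A picture 1905000 EMUs wide is 200 pixels wide at 96 DPI.
        Console.WriteLine(ToPixels(1905000)); // 200
    }
}
```

If your pages are rendered at a different resolution, the divisor would need to change accordingly.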
Creating the HttpHandler
As we are using .NET 3.0 we need to use Visual Studio 2008 to create the handler. You will need to create a standard class library and add a class which implements IHttpHandler.
Our handler will handle both the request for the HTML version of the docx document and the images contained within the document. To do this we implement two methods.
The first is to convert the document to HTML...
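private void WriteHtmlFromDocx(HttpContext context)
{
    // Load the main document part (word/document.xml) as XML.
    XmlDocument oWordDoc = new XmlDocument();
    using (System.IO.Stream oDocStream = CurrentDocument.GetStream())
    {
        oWordDoc.Load(oDocStream);
    }
    // Load the modified stylesheet and write the resulting HTML to the response.
    XslCompiledTransform oTransform = new XslCompiledTransform();
    oTransform.Load("c:\\temp\\DocX2Html.xsl");
    oTransform.Transform(oWordDoc, null, context.Response.OutputStream);
}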
Here we load the XML containing the docx document, load the modified DocX2Html XSL and perform the transformation, writing it out to the response stream.
The second is to respond with an image...
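private void RespondWithImage(HttpContext context)
{
    // Find the relationship for the image ID passed in the query string.
    PackageRelationship oRelationship = CurrentDocument.GetRelationship(context.Request["image"]);
    // The target is relative to /word/document.xml, e.g. media/image1.png.
    PackagePart oImage = CurrentPackage.GetPart(new Uri("/word/" + oRelationship.TargetUri, UriKind.Relative));
    using (System.IO.Stream oStream = oImage.GetStream())
    {
        // Open the stream once; Read may return fewer bytes than requested, so loop.
        byte[] arBytes = new byte[oStream.Length];
        int iRead = 0, iCount;
        while ((iCount = oStream.Read(arBytes, iRead, arBytes.Length - iRead)) > 0) iRead += iCount;
        // Tell the browser what kind of image this is.
        context.Response.ContentType = oImage.ContentType;
        context.Response.BinaryWrite(arBytes);
    }
}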
Here we get the relationship for the image ID passed in the query string. This ID was created in the previous XSLT. The relationship contains the relative path to the image, which is stored in the media folder of the .docx document. Using this path we retrieve the image itself in the form of a PackagePart. Next we read the image into a byte array and write it out to the response.
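The method above builds the part URI by prefixing the relationship target with "/word/". As an alternative, and not what the handler in this article does, System.IO.Packaging provides PackUriHelper.ResolvePartUri, which resolves a relative target against the URI of the source part. The sketch below uses a made-up target (media/image1.png) purely for illustration.

```csharp
using System;
using System.IO.Packaging; // requires a reference to WindowsBase.dll

class ResolveImageUri
{
    static void Main()
    {
        // The part that owns the relationship, and a relative target as stored in its .rels file.
        Uri oSource = new Uri("/word/document.xml", UriKind.Relative);
        Uri oTarget = new Uri("media/image1.png", UriKind.Relative);

        // Resolve the target relative to the source part's location in the package.
        Uri oResolved = PackUriHelper.ResolvePartUri(oSource, oTarget);
        Console.WriteLine(oResolved); // /word/media/image1.png
    }
}
```

This only makes sense for internal relationships. A relationship whose TargetMode is External points outside the package and should not be resolved this way.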
These two methods do the actual work, but we are also using some helper properties...
private Package CurrentPackage
{
    get
    {
        if (_currentPackage == null)
        {
            // Strip the query string and the trailing "html" (4 characters)
            // to turn the .docxhtml URL back into the .docx URL.
            string sUrl = _context.Request.Url.ToString();
            string sFileName = sUrl.Substring(0, sUrl.Length - 4 - _context.Request.Url.Query.Length);
            SPFile oFile = SPContext.Current.Web.GetFile(sFileName);
            _currentPackage = Package.Open(oFile.OpenBinaryStream());
        }
        return _currentPackage;
    }
}
The CurrentPackage property simply gets the Package (.docx file) from SharePoint. We use the current URL to know which document to get and SPWeb.GetFile() to retrieve it from SharePoint.
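The URL arithmetic in CurrentPackage can be hard to follow at a glance. A minimal sketch of the same mapping is shown below, using Uri.GetLeftPart to drop the query string instead of subtracting lengths. The helper name and the example host are hypothetical, and this is an alternative formulation rather than the code used in the handler.

```csharp
using System;

static class PreviewUrl
{
    // Hypothetical helper: maps the preview URL back to the .docx URL.
    public static string ToDocxUrl(Uri oPreviewUrl)
    {
        // GetLeftPart(UriPartial.Path) returns scheme, host and path without the query string.
        string sPath = oPreviewUrl.GetLeftPart(UriPartial.Path);
        const string sSuffix = "html";
        if (!sPath.EndsWith(sSuffix, StringComparison.OrdinalIgnoreCase))
            throw new ArgumentException("Expected a .docxhtml URL.", "oPreviewUrl");
        // Remove the trailing "html" so that .docxhtml becomes .docx.
        return sPath.Substring(0, sPath.Length - sSuffix.Length);
    }

    static void Main()
    {
        Uri oUrl = new Uri("http://server/documents/mydocument.docxhtml?image=rId4");
        Console.WriteLine(ToDocxUrl(oUrl)); // http://server/documents/mydocument.docx
    }
}
```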
The other property used is the CurrentDocument which gets the current Word document from the CurrentPackage (.docx file)...
private PackagePart CurrentDocument
{
    get
    {
        if (_currentDocument == null)
            _currentDocument = CurrentPackage.GetPart(new Uri("/word/document.xml", UriKind.Relative));
        return _currentDocument;
    }
}
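The listings above do not show how a request is routed to one method or the other. The following is a rough sketch of how ProcessRequest might tie them together, assuming the presence of the image query string parameter is what distinguishes an image request from a page request. The class name is taken from the web.config entry; the method bodies are left out, and the dispatch logic and clean-up are assumptions rather than the contents of the downloadable class.

```csharp
using System.IO.Packaging; // requires a reference to WindowsBase.dll
using System.Web;

public class HtmlPreviewHandler : IHttpHandler
{
    private HttpContext _context;
    private Package _currentPackage;

    // The handler keeps per-request state in fields, so instances are not reused.
    public bool IsReusable
    {
        get { return false; }
    }

    public void ProcessRequest(HttpContext context)
    {
        _context = context;
        try
        {
            // An "image" query string parameter means the browser is asking for an embedded image.
            if (string.IsNullOrEmpty(context.Request["image"]))
                WriteHtmlFromDocx(context);
            else
                RespondWithImage(context);
        }
        finally
        {
            // Close the package once the response has been written.
            if (_currentPackage != null)
                _currentPackage.Close();
        }
    }

    private void WriteHtmlFromDocx(HttpContext context) { /* body as shown earlier */ }
    private void RespondWithImage(HttpContext context) { /* body as shown earlier */ }
}
```

Returning false from IsReusable matters here because the package and context are cached in instance fields, which would otherwise leak between requests.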
Using the handler
To use the handler, add it to the httpHandlers section of web.config in your SharePoint site...
<add verb="GET" path="*.docxhtml" type="TheKid.DocX.HtmlPreviewHandler, TheKid.DocX, Version=1.0.0.0, Culture=neutral, PublicKeyToken=cc64d2386a72d4c3" />
You should now be able to access a document using...
http://<your site>/documents/mydocument.docx
And then preview it using...
http://<your site>/documents/mydocument.docxhtml

Probably the easiest way to add this to your SharePoint site would be with a custom action, as explained in a previous article.
You can download the full class file, just remember you need Visual Studio 2008 to build it.