Creating a docx -> Html Preview Handler for SharePoint


http://blog.thekid.me.uk

.NET 3.0 includes a new assembly called WindowsBase.dll. This assembly contains a new namespace called System.IO.Packaging, which has lots of useful classes for opening and manipulating packaged files such as Word documents in the docx format.

Essentially a .docx document is just a zip file containing a bunch of files, mostly XML. The System.IO.Packaging namespace allows you to open and manipulate these files.
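As a quick way to see this from code, here is a minimal sketch of a console program that lists the parts inside a .docx file. The class name ListDocxParts and the idea of passing the file path as the first command-line argument are assumptions made for illustration; the project needs a reference to WindowsBase.dll.

```csharp
using System;
using System.IO;
using System.IO.Packaging;

// Hypothetical console program: prints every part stored in a .docx package.
class ListDocxParts
{
    static void Main(string[] args)
    {
        // args[0] is assumed to be the path to a .docx file on disk.
        using (Package package = Package.Open(args[0], FileMode.Open, FileAccess.Read))
        {
            foreach (PackagePart part in package.GetParts())
            {
                // Each part has a URI inside the package (e.g. /word/document.xml)
                // and a content type describing what it holds.
                Console.WriteLine("{0}  ({1})", part.Uri, part.ContentType);
            }
        }
    }
}
```

For a typical Word document the output should include /word/document.xml and, if the document has embedded pictures, entries under /word/media.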

I thought I would try to use these classes to create an HttpHandler which would allow you to preview a Word document in the docx format. This would need to convert not only the text stored in the document, but also the images it contains.

Inside the .docx file

If you take a .docx document and change the file extension to .zip you will be able to see the files contained within the document. When you look inside you will see a number of folders containing mostly XML. Below you see the document and the files contained in the zip (docx) file.

[Images: the Word document and the files contained in the .docx zip file]

For a Word document you will see the 'word' folder, in which you will find the file 'document.xml'...this is your actual document in XML format. The XML is not exactly friendly, but there is plenty of documentation around. This file is what we will use to create the HTML preview. The other files we will be interested in are contained in the word\media folder...this is where the embedded images are stored.

Converting from .docx to HTML

As the Word document is actually an XML file we can easily convert it to HTML using XSLT...at least in theory we can. The problem is that the OpenXML format is pretty complicated, and writing the XSLT for all but the simplest documents would be difficult and time consuming.

Fortunately for those of us using SharePoint, Microsoft has already done the hard work for us. MOSS comes with standard document converters found at C:\Program Files\Microsoft Office Servers\12.0\TransformApps. These applications are used when you convert a document into a web page in a publishing site. The docx->html converter uses XSLT to perform the conversion and so we will re-use this XSLT...with a couple of modifications.

Modifying the DocX2Html XSLT

The XSL used by SharePoint is called DocX2Html.xsl and can be found in the TransformApps folder, along with the document converters. By default this XSL ignores images as the document converter does not extract them. We need to add an <xsl:template> to handle the images and create the required <img> tag.

Take a copy of the DocX2Html.xsl and make the following changes...

1. Add the following XSL template to the stylesheet.

<xsl:template match="w:drawing">
 <!-- Image extent, stored in EMUs (English Metric Units) -->
 <xsl:variable name="w" select=".//wp:extent/@cx"/>
 <xsl:variable name="h" select=".//wp:extent/@cy"/>

 <!-- 9525 EMUs = 1 pixel at 96 DPI. Attribute value templates keep
      stray line breaks out of the width and height values. -->
 <img src="?image={.//a:blip/@r:embed[1]}"
      width="{number($w) div 9525}px"
      height="{number($h) div 9525}px"/>
</xsl:template>


This template generates the <img> tag, sets the src to a query string carrying the relationship ID of the image, and converts the width and height from EMUs (the unit Word stores them in) to pixels. There are 9525 EMUs to a pixel at 96 DPI.
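The 9525 divisor comes from the definition of the EMU: there are 914,400 EMUs to an inch, so at 96 pixels per inch one pixel is 914,400 / 96 = 9,525 EMUs. A minimal sketch of the same arithmetic follows; the Emu class and the sample extent value are made up for illustration.

```csharp
using System;

// Hypothetical helper showing the EMU-to-pixel arithmetic used in the XSLT.
static class Emu
{
    // 914,400 EMUs per inch is fixed by the Open XML specification.
    public const int EmusPerInch = 914400;

    public static double ToPixels(long emus, double dpi)
    {
        return emus * dpi / EmusPerInch;
    }

    static void Main()
    {
        // Sample extent: cx = 5943600 EMUs, i.e. 6.5 inches.
        Console.WriteLine(ToPixels(5943600, 96)); // prints 624
    }
}
```

The stylesheet hard-codes 96 DPI, which is a reasonable default for screen output.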

2. Add the following namespaces to the stylesheet element at the start of the XSLT.

xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"

3. Add the following parameter just after the <xsl:output> element.

<xsl:param name="relsDoc"/>

4. Save the file.

Creating the HttpHandler

As we are using .NET 3.0 we need to use Visual Studio 2008 to create the handler. You will need to create a standard class library and add a class which implements IHttpHandler.

Our handler will handle both the request for the HTML version of the docx document and the images contained within the document. To do this we implement two methods.

The first is to convert the document to HTML...

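private void WriteHtmlFromDocx(HttpContext context)
{
    // Load /word/document.xml from the package (requires using System.IO).
    XmlDocument oWordDoc = new XmlDocument();
    using (Stream oDocStream = CurrentDocument.GetStream())
    {
        oWordDoc.Load(oDocStream);
    }

    // Load the modified stylesheet. The path is hard-coded for this sample.
    XslCompiledTransform oTransform = new XslCompiledTransform();
    oTransform.Load("c:\\temp\\DocX2Html.xsl");

    // Transform straight into the response.
    context.Response.ContentType = "text/html";
    oTransform.Transform(oWordDoc, null, context.Response.OutputStream);
}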

Here we load the XML containing the docx document, load the modified DocX2Html XSL and perform the transformation, writing it out to the response stream.

The second is to respond with an image...

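private void RespondWithImage(HttpContext context)
{
    // The XSLT put the relationship ID (e.g. rId6) in the "image" query string parameter.
    PackageRelationship oRelationship = CurrentDocument.GetRelationship(context.Request["image"]);
    PackagePart oImage = CurrentPackage.GetPart(new Uri("/word/" + oRelationship.TargetUri, UriKind.Relative));

    // Send the part's content type (e.g. image/png) so the browser renders an image, not raw bytes.
    context.Response.ContentType = oImage.ContentType;

    using (Stream oStream = oImage.GetStream())
    {
        // Read may return fewer bytes than requested, so loop until the buffer is full.
        byte[] arBytes = new byte[oStream.Length];
        int nRead = 0, nCount;
        while (nRead < arBytes.Length && (nCount = oStream.Read(arBytes, nRead, arBytes.Length - nRead)) > 0)
            nRead += nCount;

        context.Response.BinaryWrite(arBytes);
    }
}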

Here we get the relationship for the image ID passed in the query string. This ID was created in the previous XSLT. The relationship contains the relative path to the image, which is stored in the media folder of the .docx document. Using this path we retrieve the image itself in the form of a PackagePart. Next we read the image into a byte array and write it out to the response.
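To see which relationship IDs a given document actually exposes, the lookup can be tried outside SharePoint. The following is a minimal sketch of a console program that lists each embedded image relationship of /word/document.xml with the part it resolves to. The class name and the command-line argument are assumptions; it uses PackUriHelper.ResolvePartUri rather than string concatenation to resolve the relative target.

```csharp
using System;
using System.IO;
using System.IO.Packaging;

// Hypothetical console program: maps image relationship IDs to package parts.
class ListDocxImages
{
    // Relationship type used for embedded images in Office Open XML documents.
    const string ImageRelType =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/image";

    static void Main(string[] args)
    {
        // args[0] is assumed to be the path to a .docx file on disk.
        using (Package package = Package.Open(args[0], FileMode.Open, FileAccess.Read))
        {
            PackagePart document = package.GetPart(new Uri("/word/document.xml", UriKind.Relative));

            foreach (PackageRelationship rel in document.GetRelationshipsByType(ImageRelType))
            {
                // Linked (external) images have no part inside the package, so skip them.
                if (rel.TargetMode != TargetMode.Internal)
                    continue;

                // Resolve e.g. "media/image1.png" against /word/document.xml.
                Uri partUri = PackUriHelper.ResolvePartUri(document.Uri, rel.TargetUri);
                PackagePart image = package.GetPart(partUri);

                Console.WriteLine("{0} -> {1} ({2})", rel.Id, image.Uri, image.ContentType);
            }
        }
    }
}
```

The IDs printed on the left are the values that end up in the ?image= query string generated by the XSLT.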

These two methods do the actual work, but we are also using some helper properties...

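private Package CurrentPackage
{
    get
    {
        if (_currentPackage == null)
        {
            // Drop the query string and the trailing "html" (4 characters), turning
            // .../mydocument.docxhtml?image=rId6 into .../mydocument.docx.
            string sUrl = _context.Request.Url.ToString();
            string sFileName = sUrl.Substring(0, sUrl.Length - 4 - _context.Request.Url.Query.Length);

            // _context is assumed to have been set in ProcessRequest before this property is read.
            SPFile oFile = SPContext.Current.Web.GetFile(sFileName);
            _currentPackage = Package.Open(oFile.OpenBinaryStream());
        }
        return _currentPackage;
    }
}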


The CurrentPackage property simply gets the Package (.docx file) from SharePoint. We use the current URL to know which document to get and SPWeb.GetFile() to retrieve it from SharePoint.
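The URL arithmetic is easy to get wrong, so here is a minimal standalone sketch of the same mapping using Uri.GetLeftPart instead of counting characters. The PreviewUrl class and the server name are hypothetical; note that GetLeftPart returns the escaped form of the path, and whether SPWeb.GetFile accepts that form unchanged has not been checked here.

```csharp
using System;

// Hypothetical helper: maps a .docxhtml preview URL back to the .docx URL.
static class PreviewUrl
{
    public static string ToDocumentUrl(Uri requestUrl)
    {
        // Scheme, host and path only; the query string (?image=...) is dropped.
        string path = requestUrl.GetLeftPart(UriPartial.Path);

        if (!path.EndsWith(".docxhtml", StringComparison.OrdinalIgnoreCase))
            throw new ArgumentException("Not a .docxhtml preview URL.", "requestUrl");

        // Remove the trailing "html" so that .docxhtml becomes .docx.
        return path.Substring(0, path.Length - "html".Length);
    }

    static void Main()
    {
        Uri request = new Uri("http://server/documents/mydocument.docxhtml?image=rId6");
        Console.WriteLine(ToDocumentUrl(request)); // http://server/documents/mydocument.docx
    }
}
```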

The other property used is the CurrentDocument which gets the current Word document from the CurrentPackage (.docx file)...

private PackagePart CurrentDocument
{
    get
    {
        if (_currentDocument == null)
            _currentDocument = CurrentPackage.GetPart(new Uri("/word/document.xml", UriKind.Relative));

        return _currentDocument;
    }
}
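The listings above do not show how a request is routed to one method or the other. One plausible way to do it, sketched here as an assumption rather than as the code from the downloadable class file, is to check for the image query string parameter in ProcessRequest. The base class name is hypothetical, and the two methods are left abstract so that the sketch compiles on its own with a reference to System.Web.

```csharp
using System.Web;

// Hypothetical base class: only the dispatch logic is sketched here.
public abstract class DocxPreviewHandlerBase : IHttpHandler
{
    // Stored so that helper properties such as CurrentPackage can read the request.
    protected HttpContext _context;

    // The handler keeps per-request state, so instances must not be reused.
    public bool IsReusable
    {
        get { return false; }
    }

    public void ProcessRequest(HttpContext context)
    {
        _context = context;

        // No image parameter means the browser wants the HTML preview;
        // otherwise it is following an <img src="?image=rIdN"> link.
        if (string.IsNullOrEmpty(context.Request["image"]))
            WriteHtmlFromDocx(context);
        else
            RespondWithImage(context);
    }

    protected abstract void WriteHtmlFromDocx(HttpContext context);
    protected abstract void RespondWithImage(HttpContext context);
}
```

Setting _context before anything else matters because the CurrentPackage property reads it; leaving it null would produce a NullReferenceException there.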

Using the handler

To use the handler add it to the <httpHandlers> section of web.config in your SharePoint site...

<add verb="GET" path="*.docxhtml" type="TheKid.DocX.HtmlPreviewHandler, TheKid.DocX, Version=1.0.0.0, Culture=neutral, PublicKeyToken=cc64d2386a72d4c3" />

You should now be able to access a document using...

http://<your site>/documents/mydocument.docx

And then preview it using...

http://<your site>/documents/mydocument.docxhtml


Probably the easiest way to add this to your SharePoint site would be with a custom action, as explained in a previous article.

You can download the full class file, just remember you need Visual Studio 2008 to build it.

Posted by Vincent Rothwell on Saturday, 20 Oct 2007 09:04  - 24 Comments
Originally printed from http://thekid.me.uk - Copyright Vincent Rothwell 2007
 

Comments

Sunday, 27 Jul 2008 10:36 by Pankaj sharma
hi Kid, Awesome posting.You rock!!! This would really help me on one of my same kind of requirement. Regards Pankaj

Sunday, 27 Jul 2008 10:36 by Pankaj
Hi Kid, I have a requirement to pick the docx document from document library and convert that in to html and store that html in another document library. Can you guide me on this. Regards, Pankaj

Sunday, 27 Jul 2008 10:36 by Pankaj
Trying the code given by you. Its giving Error: Object reference not set to an instance of an object.

Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.

Exception Details: System.NullReferenceException: Object reference not set to an instance of an object.

Source Error: An unhandled exception was generated during the execution of the current web request. Information regarding the origin and location of the exception can be identified using the exception stack trace below.

Stack Trace:
[NullReferenceException: Object reference not set to an instance of an object.]
   DocXToHTMLHTTPHandler.DocToHTMLSyncHandler.get_CurrentPackage() +72
   DocXToHTMLHTTPHandler.DocToHTMLSyncHandler.get_CurrentDocument() +63
   DocXToHTMLHTTPHandler.DocToHTMLSyncHandler.ProcessRequest(HttpContext context) +101
   System.Web.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute() +358
   System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously) +64

Sunday, 27 Jul 2008 10:36 by Vince
Pankaj, You will need to attach a debugger to your code to see what is wrong. As this is just a sample it doesn't really have any error checking or validation. You may also want to look at http://blog.thekid.me.uk/archive/2007/07/25/debugging-tips-for-sharepoint-and-wss-exceptions.aspx --Vince

Sunday, 27 Jul 2008 10:36 by Pankaj
Hi Vince, Thanks for your response. Just a query: do we need to save the file to the Temp directory after making the mentioned modifications? As per the code, the path it refers to is c:\\temp\\DocX2Html.xsl. Are there some special permissions to be given for the folder that has the .docx file? Regards, Pankaj

Sunday, 27 Jul 2008 10:36 by Pankaj
Hi Vince, I tried debugging this and found the following. Program work fine till current package I have put my docx file under the _layouts/<sitename> directory. I checked the path as well its shows the correct path for file e.g http://<servername>/_layouts/doctohtml/ES_Template.docx But after that its just directly going to the catch of property named 'CurrentDocument' and returns ex: Object reference not set to an instance of an object. Any help will be appreciated. Regards, Pankaj

Sunday, 27 Jul 2008 10:36 by Vince
Pankaj, The documents in my example are stored in a document library, not in the _layouts folder. The handler will try to retrieve the document from the current library. --Vince

Sunday, 27 Jul 2008 10:36 by Aman
Hi Vince, Thanks for this useful information. I am able to get the code work fine , except the issue with images. Its not showing the images.I checked the 'View source', it produces the '<img' tag and assigns the relation id for each image also with width and size e.g <img src="?image=rId6" width="624px&#xA; " height="399px&#xA; ">. And also the context.request for image is unable to capture the image id and returns nothing. if directly pass the rid it just display the garbage characters. Please help. Regards, Aman

Sunday, 27 Jul 2008 10:36 by Amirz
Hello, you can get more previews by integrating the Vista Preview Pane with the QuickView Plus IE plug-in (www.avantstar.com). Unfortunately it's shareware, but there's always another "way". It supports more than 400 file types, including most known documents, images, and binaries. No need to create a specific preview handler any more. The same goes for other IE viewer ActiveX plug-ins.
1. Download and install QuickView Plus. Launch it, then configure your custom file by clicking View - Configure QuickView Plus - Applications - Internet Explorer (IE plug-in). Then add your custom files (e.g. .dwg, .psd, .dll etc.) with their correct content type (MIME).
2. Download MSDN Preview Handler. Install it.
3. Download Preview Handler Editor or Preview Handler Configure. Open it, then look for "MSDN Internet Explorer Preview Handler". Select it and add your custom files.
4. Activate the Vista preview pane. Select your file. The preview appears perfectly now.
That's it. Hopefully it works for you. Thanx and regards-Amirz

Sunday, 27 Jul 2008 10:37 by CJ
Hi Kid, Great article and thanks for the source. Also, great idea using a custom action feature. However, you only need .NET 3.0 to reference windowsbase.dll (it comes with the framework), so you should be fine using VS2005 with the .NET 3.0 framework. Cheers CJ

Sunday, 27 Jul 2008 10:37 by Mikey
Hi, I'm currently working on a similar thing - I'm trying to render DOCX documents stored in a database on the web. So, I've been working with the docx2html template, and your info helped with the images, but Ive noticed that it doesnt handle bullets and lists. I was wondering if you have had any luck with this? Cheers, Mikey

Thursday, 30 Oct 2008 03:37 by latau
Hi, i have been assigned a job in sharepoint at work for creating a document library. Basically, the work demands to have the same look and feel for every site i am creating. i don't want to use the downloadable templates available since i feel they are too much detailed. i created a html page, which i would like to use as the template. will this work? or what are the options available for me? Thanks, Latau

Wednesday, 5 Nov 2008 02:59 by Pankaj
Hi Latau, I think you need to create a master page from the design that's in your HTML page, starting from a minimal master page. Apply that master page to any of the team sites you created. You can then create a site template based on the team site that has your custom master page applied. Creating sites from this site template will help you keep the site design consistent. I hope this helps.

Monday, 17 Nov 2008 05:51 by IRfan
Hi, I am working on DOCX to HTML conversion soln. I am almost done with docx conversion, However I found difficulties for header and footer conversion. I have noticed that there are some xml files for header and footer information header1.xml & footer1.xml. and found relation between document.xml and these header and footer files. However the Docx2Html.xsl file does not convert it automatically. I have debug through the xsl and found there is no template for <w:w:sectPr>. Do anyone have any idea regarding this....... Thanks, Irfan Shaikh

Friday, 12 Dec 2008 04:00 by christel
is there an easy way to create an outlook preview pane type which will display document content when clicking on documents in a document library

Wednesday, 24 Dec 2008 08:09 by Vitalii
Is there a way to create a word document (docx is okay) using c#, but on shared hosting? no COM P/Invoke allowed :( these are simple documents (only few mergefields) but customer wants to modify them myself in word, so digging xml doesn't fit :( thanks in advance

Wednesday, 24 Dec 2008 08:19 by Vitalii
forget about last question, i've found invoke docx lib for this. it's even free :)

Monday, 2 Feb 2009 10:28 by Yaroslav Pentsarskyy
Here is a SharePoint component that allows you to generate a Word document based on the information from any list ... might be helpful in some cases: http://store.sharemuch.com/products/generate-word-documents-from-sharepoint-list

Monday, 2 Mar 2009 05:15 by Ram
I am trying to convert word 2007 docx file with content controls into html file. I tried to implement above mentioned approach. It is working for normal docx files but It is not working for docx files with Content Controls placeholder with databinding. Kindly help me how to modify the XSLT file to render the Content Placeholder text in HTML page. Note : I am using CustomXML to update values in Content Controls. Thanks,

Monday, 23 Mar 2009 06:25 by Amar
Hi, I am able to modify your code according to my need and am able to display docx documents as html in an iframe. The problem is the formatting is getting lost (Numbering, Bullets...) are getting removed and when this happens, the alignment changes and the font gets changed. I assume that this is because the xslt does not handle bullets and numbering. Can you please guide me on how to retain the formatting. Any suggestions would be of great help. Thanks in advance....

Wednesday, 6 Jan 2010 03:40 by Blazej
Hi, Do you think it's possible to realize paging in this approach? If so, what would be the best approach in your opinion? Thanks in advance.


