I'm trying to create a parser to find the tracked changes and author of a Word .docx file...
I found document.xml, but there are so many tags! Is there a glossary somewhere that explains what all those tags stand for?
I'd like to avoid brute forcing my way through this if possible.
For example, a .docx file is an Open XML formatted Microsoft Word document.
Double-click the folder you wish to inspect (for example, word). Double-click the file you wish to inspect (for example, document.xml). The document last selected should now appear in an Internet Explorer tab.
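The same inspection can be done from code, because a .docx file is a ZIP archive. The following is a minimal sketch using only Python's standard zipfile module; the helper names list_parts and read_part are made up for this illustration, and UTF-8 encoding of the XML parts is an assumption that holds for files written by Word but is not guaranteed for every producer.

```python
import zipfile


def list_parts(docx):
    """Return the names of all parts stored in a .docx package.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        return archive.namelist()


def read_part(docx, name="word/document.xml"):
    """Return one part of the package as text (UTF-8 is assumed here)."""
    with zipfile.ZipFile(docx) as archive:
        return archive.read(name).decode("utf-8")
```

Calling list_parts on a path of your own typically shows entries such as word/document.xml alongside styles, settings, and relationship parts, which is a quick way to see where the main document body lives before parsing it.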
DOCX was originally developed by Microsoft as an XML-based format to replace the proprietary binary format that uses the .doc file extension. Since Word 2007, DOCX has been the default format for the Save operation.
The "Office Open XML" format and its XML vocabularies are described in detail in the ECMA-376 standard: http://www.ecma-international.org/publications/standards/Ecma-376.htm
To give you an idea, the following piece of XSLT should extract just the effective result text, without tracked deletions, from a WordprocessingML document such as the one stored as word/document.xml in a .docx file (a ZIP archive).
<!-- Match and output text spans except when
     appearing in w:delText child content -->
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <xsl:output method="text"/>
  <xsl:template match="w:t">
    <xsl:value-of select="."/>
  </xsl:template>
  <xsl:template match="w:delText"/>
  <xsl:template match="*">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
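For your application to extract the changes themselves instead, you'd also have to take care of w:ins elements (tracked insertions) in addition to w:del (tracked deletions). Both carry a w:author attribute, and usually a w:date attribute, which tell you who made the change and when.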
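As a rough illustration of that idea, here is a minimal Python sketch that walks word/document.xml and reports each tracked insertion and deletion with its author. It uses only the standard library; the function name tracked_changes and the tuple layout are made up for this example, and it only looks at w:ins and w:del, ignoring other revision markup such as formatting changes or moves.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace of the main WordprocessingML vocabulary, in ElementTree's {uri}name form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def tracked_changes(docx):
    """Yield (kind, author, date, text) for each w:ins / w:del element.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    for element in root.iter():
        if element.tag == W + "ins":
            kind, text_tag = "insert", W + "t"        # inserted text lives in w:t
        elif element.tag == W + "del":
            kind, text_tag = "delete", W + "delText"  # deleted text lives in w:delText
        else:
            continue
        # Concatenate the text of all runs inside this change element.
        text = "".join(node.text or "" for node in element.iter(text_tag))
        yield kind, element.get(W + "author"), element.get(W + "date"), text
```

Iterating over tracked_changes("report.docx"), where report.docx stands for a file of your own, gives one tuple per change. Entries with empty text can occur, for example when only a paragraph mark was inserted or deleted, so a real parser would probably want to filter or label those separately.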