Is there a programmatic way to extract equations (and possibly images) from an MS Word document? I've googled all over, but have yet to find anything that I can sink my teeth into and work from. If possible, I'd like to be able to do this with VB.NET or C#, but I can pick up enough of any language to hack out a DLL. Thanks!
EDIT: Right now I'm looking at extracting the equations from Word 2003, but if converting it to 2007/Open XML is required, that's fine.
What Word format are your documents in? If they are in Open XML (file extension .docx) you could use the Open XML SDK available from Microsoft to extract images and embedded content.
An Open XML file is nothing but a zip archive using a special structure. You will find examples in the SDK how to access parts of that zip archive. Actually you could use any zip-capable library to extract the content from the document package.
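To illustrate the "any zip-capable library" route, here is a minimal Python sketch using only the standard library. It assumes the usual Word package layout: images under word/media/, embedded OLE objects under word/embeddings/, and native Word 2007 equations stored as m:oMath elements in word/document.xml. The function names are made up for this example. Equations created with the Word 2003 equation editor may survive conversion as embedded objects rather than as m:oMath elements, so it is worth checking both places.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by Office Math (OMML) elements inside word/document.xml.
M_NS = "http://schemas.openxmlformats.org/officeDocument/2006/math"


def list_embedded_parts(docx_path):
    """Return the names of image and embedded-object parts in the package."""
    with zipfile.ZipFile(docx_path) as package:
        return [
            name for name in package.namelist()
            if name.startswith(("word/media/", "word/embeddings/"))
        ]


def extract_omml_equations(docx_path):
    """Return every <m:oMath> element in the main document as an XML string."""
    with zipfile.ZipFile(docx_path) as package:
        root = ET.fromstring(package.read("word/document.xml"))
    return [
        ET.tostring(element, encoding="unicode")
        for element in root.iter(f"{{{M_NS}}}oMath")
    ]
```

Calling list_embedded_parts("report.docx") gives the part names, which can then be pulled out with ZipFile.extract; the strings returned by extract_omml_equations are raw OMML that still needs converting if you want MathML or LaTeX.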
If the documents still use the older binary format, things are a bit more complicated. I think the easiest way would be to convert the documents to the Open XML format. There are several ways to do this:
Install Microsoft's Compatibility Pack and use the following command line for conversion:
"C:\Program Files\Microsoft Office\Office12\wordconv.exe" -oice -nme input_file output_file
where input_file and output_file must be full path names.
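If you have many files to convert, the command line above can be wrapped in a small script. The following Python sketch is only an illustration: the function names are invented here, the converter path is the one quoted above and may differ on your machine, and it has to run on a Windows system with the Compatibility Pack installed. It expands both paths to full path names before calling the converter, since that is what the tool requires.

```python
import os
import subprocess

# Path taken from the command line above; it may differ on other installations.
WORDCONV = r"C:\Program Files\Microsoft Office\Office12\wordconv.exe"


def build_wordconv_command(input_file, output_file):
    """Return the argument list, expanding both paths to full path names."""
    return [
        WORDCONV, "-oice", "-nme",
        os.path.abspath(input_file),
        os.path.abspath(output_file),
    ]


def convert_to_docx(input_file, output_file):
    """Run the converter; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_wordconv_command(input_file, output_file), check=True)
```

Passing the arguments as a list avoids quoting problems with the space in "Program Files".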
I don't know if any of this will help, but the object model in Word 2000/2003 has an InlineShapes collection as part of the Document object which represents embedded images and possibly similar objects like equations.
Some VBA code to copy the first item onto the clipboard, which might help you extract them:
ThisDocument.InlineShapes(1).Select
Selection.Copy
The InlineShapes collection is accessible from .NET too, through the Microsoft.Office.Interop.Word assembly (see the MSDN documentation).