
How to use Apache HWPF to extract text and images out of a DOC file

I downloaded Apache HWPF and want to use it to read a DOC file and write its text to a plain text file. I don't know HWPF very well.

My very simple program is at the end of this question.

I have 3 problems now:

  1. Some of the packages have errors (the Apache HWPF classes can't be found). How can I fix them?

  2. How can I use the HWPF methods to find and extract the images?

  3. Parts of my program are incomplete and incorrect, so please help me complete it.

I have to complete this program in 2 days.

Once again, please help me complete this.

Thank you all very much for your help!

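This is my elementary code:

import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.hwpf.model.PicturesTable;
import org.apache.poi.hwpf.usermodel.Picture;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

public class test {
  public void m1() throws IOException {
    String filesname = "Hello.doc";
    POIFSFileSystem fs = new POIFSFileSystem(new FileInputStream(filesname));
    HWPFDocument doc = new HWPFDocument(fs);
    WordExtractor we = new WordExtractor(doc);
    String str = we.getText();
    String[] paragraphs = we.getParagraphText();
    // Everything below is the incomplete part: the ". . ." arguments are unknown.
    Picture pic = new Picture(. . .);
    pic.writeImageContent(. . .);
    PicturesTable picTable = new PicturesTable(. . .);
    if (picTable.hasPicture(. . .)) {
      picTable.extractPicture(..., ...);
      picTable.getAllPictures();
    }
  }
}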

1 Answer

Apache Tika will do this for you. It talks to POI to do the HWPF work and gives you the contents of the file as either XHTML or plain text. If you register a recursing parser, you'll also get all the embedded images.
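As a rough illustration of the Tika approach, here is a minimal sketch. It assumes Tika's core and parser modules (which pull in POI) are on the classpath. The class name DocExtractor, the method extract, and the file names Hello.txt and images are made up for this example. Rather than a recursing parser, the sketch registers an EmbeddedDocumentExtractor, which is one of Tika's hooks for embedded resources, so that each embedded resource can be copied to disk as it is encountered. The metadata key "resourceName" is written as a literal because the constant holding it lives in different classes in Tika 1.x and 2.x.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

import org.apache.tika.exception.TikaException;
import org.apache.tika.extractor.EmbeddedDocumentExtractor;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class DocExtractor {

    /**
     * Parses docFile with Tika, copies every embedded resource (such as an
     * image) into outDir, and returns the document body as plain text.
     */
    public static String extract(Path docFile, Path outDir)
            throws IOException, SAXException, TikaException {
        Files.createDirectories(outDir);

        Parser parser = new AutoDetectParser();
        // -1 disables BodyContentHandler's default character limit.
        BodyContentHandler text = new BodyContentHandler(-1);
        ParseContext context = new ParseContext();

        // Tika calls this extractor for each embedded resource it finds.
        context.set(EmbeddedDocumentExtractor.class, new EmbeddedDocumentExtractor() {
            private int count = 0;

            @Override
            public boolean shouldParseEmbedded(Metadata metadata) {
                return true; // accept every embedded resource
            }

            @Override
            public void parseEmbedded(InputStream stream, ContentHandler handler,
                                      Metadata metadata, boolean outputHtml)
                    throws IOException {
                String name = metadata.get("resourceName");
                if (name == null || name.isEmpty()) {
                    name = "embedded";
                }
                // Prefix a counter and replace unusual characters so the
                // target is always a unique plain file name inside outDir.
                String safe = count++ + "-" + name.replaceAll("[^A-Za-z0-9._-]", "_");
                Files.copy(stream, outDir.resolve(safe),
                        StandardCopyOption.REPLACE_EXISTING);
            }
        });

        try (InputStream in = Files.newInputStream(docFile)) {
            parser.parse(in, text, new Metadata(), context);
        }
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical file names, following the question's "Hello.doc".
        String text = extract(Paths.get("Hello.doc"), Paths.get("images"));
        Files.write(Paths.get("Hello.txt"), text.getBytes(StandardCharsets.UTF_8));
    }
}
```

If only the text is needed, the EmbeddedDocumentExtractor registration can be dropped and the rest of the method stays the same.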

Gagravarr answered Dec 08 '25 07:12
