
How can I extract raw text from PDFs using Apache POI?

I need to extract raw text from several files, some of which are PDFs and some of which are DOC files.

I have to use Apache POI to do this. I have found plenty of documentation on working with Word files (extracting text, writing to them, etc.), but I am unable to find any documentation on extracting text from a PDF.

Am I wrong in believing that Apache POI has this capability?

If so, can anyone recommend similar Java programs that allow text extraction from multiple file formats?

If not, can anyone point me to the documentation and/or the classes/methods that I should be looking at to do this?

Thank you in advance for any help.

superdemongob asked Oct 15 '25 07:10


1 Answer

Yes, you are wrong in believing that POI will do that. Apache POI works with Microsoft Office file formats, which PDF isn't.

You'll either want to use Apache PDFBox directly, or use Apache Tika, which handles both Microsoft Office and PDF file formats (amongst many others).
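As a rough illustration of the Tika route, here is a minimal sketch using the `Tika` facade class. It assumes Tika's core and parser modules are on the classpath (in Tika 2.x that is typically `tika-core` plus `tika-parsers-standard-package`; check the artifact names for your version). The class name `TextExtractor` and the method `extractText` are made up for this example.

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical helper class; the name is not part of Tika.
class TextExtractor {

    // The facade detects the file type itself, so the same call
    // can be used for PDF, DOC, and the other formats Tika supports.
    private static final Tika TIKA = new Tika();

    static String extractText(File file) throws IOException, TikaException {
        // By default parseToString truncates its output at a maximum
        // length; TIKA.setMaxStringLength(-1) lifts that limit.
        return TIKA.parseToString(file);
    }

    public static void main(String[] args) throws Exception {
        // Print the extracted text of every file given on the command line.
        for (String path : args) {
            System.out.println("== " + path + " ==");
            System.out.println(extractText(new File(path)));
        }
    }
}
```

Running it with a mix of files, e.g. `java TextExtractor report.pdf notes.doc` (file names are placeholders), should print the plain text of each. If you only ever need PDFs, PDFBox on its own is the lighter dependency.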

Gagravarr answered Oct 17 '25 20:10


