How can I extract raw text from PDFs using Apache POI?

Question

I need to extract raw text from several files, some of which are PDF and some of which are DOC file formats.

I have to use Apache POI to do this. I have found plenty of documentation on working with Word files (extracting text, writing to them, etc.), but I cannot find any documentation on extracting text from a PDF.

Am I wrong in believing that Apache POI has this capability?

If so, can anyone recommend similar Java programs that allow text extraction from multiple file formats?

If not, can anyone point me to the documentation and/or the classes/methods that I should be looking at to do this?

Thank you in advance for any help.

Gagravarr · Accepted Answer

Yes, you are wrong in believing that POI will do that. Apache POI works with Microsoft Office file formats, which PDF isn't.

You'll either want to use Apache PDFBox directly, or use Apache Tika, which will do both Microsoft Office and PDF file formats (amongst many others).
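For the Tika route, a minimal sketch might look like the following. It is not part of the original answer. It assumes `tika-core` and a Tika parsers artifact (which brings in PDFBox and POI) are on the classpath; the class name `ExtractText` and the helper `extract` are made up for illustration.

```java
import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class ExtractText {

    // One facade instance can be reused; it detects each file's format itself.
    private static final Tika TIKA = new Tika();

    // Hypothetical helper: returns the plain text of a PDF, DOC, or other
    // format for which a Tika parser is available on the classpath.
    static String extract(File file) throws IOException, TikaException {
        return TIKA.parseToString(file);
    }

    public static void main(String[] args) throws Exception {
        // Usage: java ExtractText report.pdf letter.doc
        for (String path : args) {
            File file = new File(path);
            // Print the detected media type, then the extracted text.
            System.out.println("== " + path + " (" + TIKA.detect(file) + ")");
            System.out.println(extract(file));
        }
    }
}
```

Note that `parseToString` truncates its result at the facade's maximum string length; for long documents you may need to raise that limit with `setMaxStringLength` on the `Tika` instance.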

Tags: java, pdf, text-extraction, apache-poi

superdemongob