I'm using Tika* to parse a PDF file. There are no problems to retrieve the document's text, but I don't figure out how to extract text:
Adobe Writer gives you different text edit options, but I'm not able to see where they are "hidden".
Is there a solution to extract these metadata information? (underline, highligh ...)
Do you know if Tika is able to extract this data?
*http://tika.apache.org/
Wow. 4 years is a long time to wait for an answer, and I figure you have found a solution by now. Anyways, for the sake of those who would visit this link, the answer is Yes. Apache Tika can extract not just text in a document, but also the formatting as well (e.g. bold, italicized). This was my Scenario:
//inputStream is the document you wish to parse from.
AutoDetectParser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(new ToXMLContentHandler());
Metadata metadata = new Metadata();
parser.parse(inputStream,handler,metadata);
System.out.println(handler.toString());
The print statement prints an XML of your document. With a little work of cleaning up the XML (really HTML tags), you would be left with tags like < b >text< /b> for bold text and < i >text < / i > for italicized text. Then you could find a way to render it. Good luck.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With