I'm trying to parse the pdf into an html, and then I would like to extract the headings and subheading from the tags. The pdf document was generated by Microsoft word so, I'm pretty sure there must be a way to get those headings.
So far, I have tried parsing with Apache Tika and PDFMiner.six but so far the html I have got doesn't have such tags which I could use to extract headings and subheadings of the document.
I wonder if there is a way to do it, would appreciate any help. Thank you
I suggest you to use GROBID which is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents with a particular focus on technical and scientific publications.
A simple python client for GROBID REST services is available at https://github.com/kermitt2/grobid-client-python
This Python client can be used to process a set of PDF in a given directory by the GROBID service. Results are written in a given output directory and include the resulting XML TEI representation of the PDF.
Hope this helps.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With