
Extract headings and sub headings from PDF Parsing with Python 3

I'm trying to convert a PDF to HTML and then extract the headings and subheadings from the resulting tags. The PDF was generated by Microsoft Word, so I'm fairly sure there must be a way to recover those headings.

So far I have tried parsing with Apache Tika and PDFMiner.six, but the HTML they produce doesn't contain any tags I could use to identify the document's headings and subheadings.

Is there a way to do this? I would appreciate any help. Thank you.

Ali Asad asked Oct 25 '25 00:10


1 Answer

I suggest using GROBID, a machine learning library for extracting, parsing, and restructuring raw documents such as PDFs into structured XML/TEI-encoded documents, with a particular focus on technical and scientific publications.

A simple Python client for the GROBID REST services is available at https://github.com/kermitt2/grobid-client-python

This Python client can send all the PDFs in a given directory to the GROBID service for processing. The results are written to a given output directory and include the XML TEI representation of each PDF.
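Once you have a TEI file, pulling out the headings could look roughly like the sketch below. It uses only the standard library and assumes that GROBID puts each section title in a `<head>` element directly under a `<div>` in the TEI `<body>`, with the section number (when it detects one) in the `n` attribute. The function name `extract_headings` and the idea of deriving the heading level from the depth of the section number are my own illustration, not part of GROBID or its client, so check them against your actual output.

```python
import xml.etree.ElementTree as ET

# TEI documents use this XML namespace for all of their elements.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_headings(tei_xml):
    """Return a list of (level, number, title) tuples from a TEI XML string.

    Assumption: section titles are <head> children of <div> elements inside
    <body>, and numbered sections carry their number in the "n" attribute.
    """
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return []

    headings = []
    # Only look at div/head so that <head> elements of figures are skipped.
    for head in body.iterfind(".//tei:div/tei:head", TEI_NS):
        number = head.get("n", "")
        title = "".join(head.itertext()).strip()
        # Heuristic: "2" is level 1, "2.1" is level 2, and so on.
        # Headings without a number are treated as level 1.
        level = len([part for part in number.split(".") if part]) or 1
        headings.append((level, number, title))
    return headings
```

To use it, read one of the TEI files from the client's output directory into a string and pass it to `extract_headings`; entries with a level greater than 1 would be your subheadings. If your document's headings are not numbered, this heuristic cannot tell headings and subheadings apart, and you would need another signal.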

Hope this helps.

Aswathy - Intel answered Oct 27 '25 12:10


