Extract headings and subheadings from a PDF with Python 3

Question

I'm trying to convert a PDF to HTML and then extract the headings and subheadings from the resulting tags. The PDF was generated by Microsoft Word, so I'm fairly sure there must be a way to recover those headings.

So far I have tried Apache Tika and pdfminer.six, but the HTML they produce doesn't contain any tags I could use to identify the document's headings and subheadings.

Is there a way to do this? I would appreciate any help. Thank you.

Aswathy - Intel · Accepted Answer

I suggest using GROBID, a machine learning library for extracting, parsing, and restructuring raw documents such as PDFs into structured XML/TEI-encoded documents, with a particular focus on technical and scientific publications.

A simple Python client for the GROBID REST services is available at https://github.com/kermitt2/grobid-client-python

This Python client can process a set of PDFs in a given directory through the GROBID service. Results are written to a given output directory and include the XML/TEI representation of each PDF.
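Once a TEI file has been produced, the headings still need to be pulled out of it. The following is a minimal sketch using only the standard library. It assumes that GROBID places each section title in a `<head>` element directly under a `<div>` in the TEI body, and that the optional `n` attribute carries the section number (for example `2.1.`), from which a nesting level can be guessed. The function name `extract_headings` and the sample document are made up for illustration, so check them against your own output.

```python
import xml.etree.ElementTree as ET

# TEI namespace used by TEI-encoded documents.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def extract_headings(tei_xml):
    """Return a list of (level, number, title) tuples for section headings.

    Assumption: headings are <head> children of <div> elements in the body,
    and the 'n' attribute (if present) holds a dotted section number.
    """
    root = ET.fromstring(tei_xml)
    body = root.find(".//tei:body", TEI_NS)
    if body is None:
        return []
    headings = []
    for head in body.findall(".//tei:div/tei:head", TEI_NS):
        number = (head.get("n") or "").strip()
        title = "".join(head.itertext()).strip()
        # "2.1." -> ["2", "1"] -> level 2; unnumbered headings default to level 1.
        parts = [p for p in number.split(".") if p]
        level = len(parts) if parts else 1
        headings.append((level, number, title))
    return headings


# Hypothetical, hand-written sample standing in for real GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <div><head n="1.">Introduction</head><p>Text.</p></div>
    <div><head n="2.">Method</head><p>Text.</p></div>
    <div><head n="2.1.">Data</head><p>Text.</p></div>
    <div><head>Acknowledgements</head></div>
  </body></text>
</TEI>"""

if __name__ == "__main__":
    for level, number, title in extract_headings(SAMPLE_TEI):
        print("  " * (level - 1) + f"{number} {title}".strip())
```

For a file on disk, read it with `open(path, encoding="utf-8").read()` and pass the string to `extract_headings`. If the section numbers are missing from your PDFs, every heading will come back as level 1, and the hierarchy would have to be inferred some other way.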

Hope this helps.

Tags:

python

html

python-3.x

pdf

Ali Asad
