
How to extract data from a table in a PDF file?

I have a PDF file containing a table. The format is like this:

pdf img

Now I need to extract the data from specific columns of each row to insert into a database. How can I extract only the columns I want, using either JavaScript or Python?

I already tried the manual way but that is not sufficient.

I expect to get the raw data put in a variable (array or list).

UPDATE:

I decided to go with Python. The library is called tabula, and I installed it using pip:

pip install tabula-py

You pass the pdf to the library and specify the page of the table. The output of the table in my question looks magically like this:

enter image description here
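
For anyone landing here, a minimal sketch of how this could look in code. It assumes tabula-py is installed along with Java (which tabula-py needs to run), that the table is the first one detected on the page, and that the column names passed in match the detected header. `extract_rows` and `pick_columns` are hypothetical helper names, not part of the library.

```python
def pick_columns(records, wanted):
    """Keep only the wanted columns from each row dict, as a list of lists."""
    return [[row.get(col) for col in wanted] for row in records]


def extract_rows(pdf_path, page, wanted):
    """Read the first table on `page` and return only the `wanted` columns."""
    # Imported lazily so pick_columns stays usable without tabula-py installed.
    import tabula

    # read_pdf returns a list of pandas DataFrames, one per detected table.
    tables = tabula.read_pdf(pdf_path, pages=page)
    if not tables:
        return []
    # Assumption: the table of interest is the first one on the page.
    records = tables[0].to_dict("records")
    return pick_columns(records, wanted)
```

Calling something like `extract_rows("file.pdf", 1, ["Name", "Amount"])` (file and column names are placeholders) would then give a plain list of lists that can be inserted into the database.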

Mohammed Baashar asked Jan 22 '26 10:01


2 Answers

I used pdfjs-dist to extract the items in a PDF and applied some rules to identify the table elements. The extracted items not only have the text information but also have an attribute called "transform" (a transformation matrix) that contains coordinate information, which can also be used to identify the table elements.

The first thing is to find the beginning of a table. In many cases the headers are the same, so you can use those words to find the beginning. The first element in each table row may share the same coordinate, which can also give a clue about where a table starts. Once the beginning of a table is identified, and because all the tables are fixed width, the items can be divided into columns. Just pay attention that there may be more than one line in a single cell, so you'll need to combine them.
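
The rules above can be sketched roughly as follows. This is an illustration in Python rather than the pdfjs-dist code itself: it assumes the items have already been dumped as dicts with a "str" text field and a six-element "transform" list whose last two entries are the x and y position, as pdf.js text items have. The function names, `column_starts` (the left edge of each column) and the "empty first column means continuation line" rule are assumptions for the example.

```python
from bisect import bisect_right


def group_into_rows(items, y_tolerance=2.0):
    """Group text items into rows by their y coordinate (transform[5])."""
    rows = []
    # PDF y grows upward, so sort by descending y to go top to bottom.
    for item in sorted(items, key=lambda it: -it["transform"][5]):
        y = item["transform"][5]
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["items"].append(item)
        else:
            rows.append({"y": y, "items": [item]})
    return rows


def row_to_cells(row_items, column_starts):
    """Put each item in the column whose left edge is at or before its x."""
    cells = [""] * len(column_starts)
    for item in sorted(row_items, key=lambda it: it["transform"][4]):
        x = item["transform"][4]
        col = max(bisect_right(column_starts, x) - 1, 0)
        cells[col] = (cells[col] + " " + item["str"]).strip()
    return cells


def extract_table(items, header_word, column_starts, y_tolerance=2.0):
    """Return the rows that follow the header row containing header_word."""
    table = []
    in_table = False
    for row in group_into_rows(items, y_tolerance):
        cells = row_to_cells(row["items"], column_starts)
        if not in_table:
            # The header row marks the start of the table and is skipped.
            in_table = header_word in cells
            continue
        if cells[0] == "" and table:
            # Assumed rule: an empty first column means this line continues
            # a multi-line cell, so merge it into the previous row.
            table[-1] = [(a + " " + b).strip() for a, b in zip(table[-1], cells)]
        else:
            table.append(cells)
    return table
```

The sketch does not detect where the table ends, so anything below it on the page would need to be filtered out with a similar rule.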

Roy Keane answered Jan 23 '26 23:01


You could try AWS Textract. It has a feature that extracts tables and gives you the data as CSV/JSON.

You can look up more about it here
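
As a rough sketch of what that could look like from Python: this assumes boto3 is installed, AWS credentials and a region are configured, and the document is small enough for the synchronous `analyze_document` call (multi-page PDFs generally need the asynchronous API instead). `analyze_pdf` and `tables_from_textract` are hypothetical helper names. The second function only walks the TABLE, CELL and WORD blocks of the JSON response.

```python
def analyze_pdf(path):
    """Send a document to Textract and ask for table analysis."""
    # Imported lazily so the parsing helper below works without boto3.
    import boto3

    client = boto3.client("textract")
    with open(path, "rb") as f:
        return client.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["TABLES"],
        )


def tables_from_textract(response):
    """Rebuild each table as a list of rows of cell text."""
    blocks = {b["Id"]: b for b in response["Blocks"]}

    def child_ids(block):
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                yield from rel["Ids"]

    tables = []
    for block in response["Blocks"]:
        if block["BlockType"] != "TABLE":
            continue
        cells = {}
        for cell_id in child_ids(block):
            cell = blocks[cell_id]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                blocks[i]["Text"]
                for i in child_ids(cell)
                if blocks[i]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based in the response.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)
        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
             for r in range(1, n_rows + 1)]
        )
    return tables
```

From there, picking specific columns is just indexing into each row, for example `[row[0] for row in tables[0]]`.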

koushikmln answered Jan 24 '26 00:01


