Problems to extract table data using camelot without error message

Question

I am trying to extract tables from this pdf link using camelot. However, when I try the following code:

import camelot

file = 'relacao_medicamentos_rename_2020.pdf'

tables = camelot.read_pdf(file)
tables.export('relacao_medicamentos_rename_2020.csv', f='csv', compress=False)

Simply nothing happens: no output and no error message. This is very strange, because the same code works very well with this other pdf link.

rosa b. · Accepted Answer

As Stefano suggested, you need to specify the relevant pages and set the option flavor='stream'. By default camelot only reads page 1 (pages='1'), and the default flavor='lattice' only works if there are lines between the cells.

Additionally, increasing row_tol helps to group rows together, so that, for example, the header of the first table is read as one row rather than as three separate rows. Specifically, 'Concentração/Composição' is then identified as one coherent piece of text.

Also, you may want to use strip_text='\n' to remove newline characters.

This results in (reading page 17 and 18 as an example):

import camelot
file = 'relacao_medicamentos_rename_2020.pdf'
tables = camelot.read_pdf(file, pages='17,18', flavor='stream', row_tol=20, strip_text='\n')
tables.export('foo.csv', f='csv', compress=False)
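
If you do not know in advance whether a page has ruled tables, the flavor choice can be automated. The following is only a sketch: read_with_fallback is a hypothetical helper (not part of camelot), and it assumes the reader function behaves like camelot.read_pdf and returns an object with an n attribute, as used elsewhere in this thread.

```python
def read_with_fallback(read_pdf, path, pages, **stream_kwargs):
    """Try flavor='lattice' first; fall back to flavor='stream' if it finds nothing.

    `read_pdf` is expected to behave like camelot.read_pdf and to return an
    object with an `n` attribute (the number of tables found).
    """
    tables = read_pdf(path, pages=pages, flavor='lattice')
    if tables.n > 0:
        return 'lattice', tables
    # No ruled tables detected: retry with whitespace-based detection.
    tables = read_pdf(path, pages=pages, flavor='stream', **stream_kwargs)
    return 'stream', tables


# Usage with camelot (not executed here):
#   import camelot
#   flavor, tables = read_with_fallback(
#       camelot.read_pdf, 'relacao_medicamentos_rename_2020.pdf',
#       pages='17,18', row_tol=20, strip_text='\n')
```

Passing the reader in as an argument keeps the helper easy to try out with a stub before running it on the real file.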

Still, this way you end up with one table per page and one csv file per table, i.e. in the example above you get two .csv files. Merging them has to be handled outside camelot. To merge tables spanning multiple pages using pandas:

import pandas as pd
dfs = []  # list to store dataframes
for table in tables:
    df = table.df
    df.columns = df.iloc[0]  # use first row as header
    df = df[1:]  # remove the first row from the dataframe
    dfs.append(df)
df = pd.concat(dfs, axis=0)  # concatenate all dataframes in list 
df.to_csv('foo.csv')  # export dataframe to csv
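
If the per-table csv files have already been exported, they can also be merged afterwards without pandas. This is a minimal sketch using only the standard library. merge_csv_files is a hypothetical helper, and it assumes that the first row of the first file is the header and that a later file either repeats that exact header or continues directly with data rows.

```python
import csv


def merge_csv_files(paths, out_path):
    """Concatenate CSV files, writing the header of the first file only once.

    Returns the number of data rows written.
    """
    header = None
    n_rows = 0
    with open(out_path, 'w', newline='', encoding='utf-8') as out:
        writer = csv.writer(out)
        for path in paths:
            with open(path, newline='', encoding='utf-8') as f:
                rows = list(csv.reader(f))
            if not rows:
                continue  # skip empty files
            if header is None:
                header = rows[0]
                writer.writerow(header)
                body = rows[1:]
            elif rows[0] == header:
                body = rows[1:]  # repeated header on a later page
            else:
                body = rows  # table continues without a header
            writer.writerows(body)
            n_rows += len(body)
    return n_rows
```

Call it with the list of exported files in page order, for example the two files produced by the export call above (check the output directory for the exact names camelot generated).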

Also, there are difficulties identifying table areas on pages containing both text and tables (e.g. pdf page 16). In these cases the table area can be specified. For the table on page 16, this would be:

tables = camelot.read_pdf(file, pages='16', flavor='stream', row_tol=20, strip_text='\n', table_areas=['35,420,380,65'])
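
Each table_areas entry is a string 'x1,y1,x2,y2', where (x1, y1) is the top-left and (x2, y2) the bottom-right corner in PDF coordinates (y grows upwards from the bottom of the page). The two helpers below are hypothetical and only sketch how such a string could be built and sanity-checked. The page height of 842 points used in the comment is an assumption (A4), not something taken from this file.

```python
def table_area(x1, y1, x2, y2):
    """Build one table_areas entry of the form 'x1,y1,x2,y2'.

    (x1, y1) is the top-left and (x2, y2) the bottom-right corner in PDF
    coordinates, where y grows upwards from the bottom of the page.
    """
    if not (x1 < x2 and y1 > y2):
        raise ValueError('expected x1 < x2 (left/right) and y1 > y2 (top/bottom)')
    return f'{x1},{y1},{x2},{y2}'


def from_top_origin(x1, top, x2, bottom, page_height):
    """Convert a box measured from the top edge of the page (as many image
    viewers report it) into a table_areas entry."""
    return table_area(x1, page_height - top, x2, page_height - bottom)


# table_area(35, 420, 380, 65) gives the string used for page 16 above.
# from_top_origin(35, 422, 380, 777, page_height=842) gives the same string,
# assuming an A4 page that is 842 points high.
```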

Note: Throughout the post I referenced pages by 'counting' the pages of the file and not by the page numbers printed on each page (the latter one starts at the second page of the document).
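
If you prefer to work with the printed page numbers, the conversion is a simple offset. A small sketch with a hypothetical helper: the default offset of 1 assumes the printed numbering starts at 1 on the second page of the file, so verify it against your pdf.

```python
def pages_argument(printed_pages, offset=1):
    """Convert printed page numbers into a comma-separated `pages` string.

    `offset` is the number of unnumbered pages before printed page 1
    (assumed to be 1 here).
    """
    return ','.join(str(page + offset) for page in printed_pages)


# pages_argument([16, 17]) can be passed as pages=... to camelot.read_pdf
```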

KlausGPaul · Answer

Further to Stefano's comment, you need to specify both "stream" and a page number. To get the number of pages, I used PyPDF2, which should be installed by camelot.

In addition, I also suppressed the "no tables found" warning (which is purely optional).

import camelot
import PyPDF2
import warnings

file = 'relacao_medicamentos_rename_2020.pdf'
reader = PyPDF2.PdfReader(file)  # PdfFileReader was removed in PyPDF2 3.0
num_pages = len(reader.pages)  # replaces reader.getNumPages()

warnings.simplefilter('ignore')

for page in range(1,num_pages+1):
    tables = camelot.read_pdf(file,flavor='stream',pages=f'{page}')
    if tables.n > 0:
        tables.export(f'relacao_medicamentos_rename_2020.{page}.csv', f='csv', compress=False)
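
Note that warnings.simplefilter('ignore') silences every warning for the rest of the process. If you only want to quiet the camelot call itself, the suppression can be scoped with warnings.catch_warnings. A minimal sketch, where call_quietly is a hypothetical helper name:

```python
import warnings


def call_quietly(func, *args, **kwargs):
    """Call func with all warnings suppressed, then restore the previous filters."""
    with warnings.catch_warnings():
        warnings.simplefilter('ignore')
        return func(*args, **kwargs)


# Usage with camelot (not executed here):
#   tables = call_quietly(camelot.read_pdf, file, flavor='stream', pages='17')
```

Warnings raised elsewhere in the script stay visible, which helps when debugging the cleaning steps later on.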

It is hard to tell why this works on one pdf file but not on the other. There are so many different ways in which pdfs are created and written by the authoring software that some trial and error is needed.

Also, not all tables found are actual tables, and the results are a bit messy and need some cleaning.

Consider doing this straight away with something like:

   tables = camelot.read_pdf(file,flavor='stream',pages=f'{page}')
   for table in tables:
       df = table.df
       # clean data
       ...
       df.to_csv(....)
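
What the cleaning step looks like depends on the pdf. As a rough sketch, the two hypothetical helpers below work on plain lists of rows (for example from table.df.values.tolist()). The thresholds in looks_like_table are arbitrary assumptions for filtering out text blocks that were misdetected as tables.

```python
def clean_rows(rows):
    """Collapse whitespace (including newlines) in every cell and drop empty rows."""
    cleaned = []
    for row in rows:
        cells = [' '.join(str(cell).split()) for cell in row]
        if any(cells):
            cleaned.append(cells)
    return cleaned


def looks_like_table(rows, min_rows=2, min_cols=2):
    """Heuristic: at least min_rows rows that have min_cols or more non-empty cells."""
    dense = [row for row in rows if sum(1 for cell in row if cell) >= min_cols]
    return len(dense) >= min_rows


# Usage inside the loop above (not executed here):
#   rows = clean_rows(table.df.values.tolist())
#   if looks_like_table(rows):
#       ...  # write rows to csv
```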

Tags: python, ghostscript, pdf-extraction, python-camelot

Gabriel Souto