I am using tabula-py 2.0.4 and pandas 1.17.4 on Python 3.7. I am trying to read PDF tables into a DataFrame with tabula.read_pdf:
from tabula import read_pdf
fn = "file.pdf"
print(read_pdf(fn, pages='all', multiple_tables=True)[0])
The problem is that the values are read as floats instead of strings.
I need them to be read as strings, so that if the value is 20.0000, I know the accuracy is to the fourth decimal place. Currently it returns 20.0 instead of 20.0000.
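The loss of trailing zeros can be reproduced without tabula at all. A minimal sketch in plain Python, using the sample value from above (the variable names are only illustrative):

```python
raw = "20.0000"        # the text as it appears in the PDF cell
as_float = float(raw)  # what a numeric (float) column would store

print(as_float)  # the float no longer carries the trailing zeros
print(raw)       # the string keeps every digit that was written

# Converting the float back to text does not restore the original form.
round_trip_matches = (str(as_float) == raw)
print(round_trip_matches)
```

So once a value has been parsed as a float, the number of written decimals cannot be recovered; the conversion has to be avoided at read time.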
Input data in PDF looks like
The output with above code is
You need to pass a couple of options to tabula.read_pdf. The example below parses a PDF file twice and interprets the columns differently each time, first as strings and then with the column types guessed:
import tabula
print(tabula.environment_info())
fname = ("https://github.com/chezou/tabula-py/raw/master/tests/resources/"
"data.pdf")
# Columns interpreted as str
col2str = {'dtype': str}
kwargs = {'output_format': 'dataframe',
'pandas_options': col2str,
'stream': True}
df1 = tabula.read_pdf(fname, **kwargs)
print(df1[0].dtypes)
print(df1[0].head())
# Guessing column type
col2val = {'dtype': None}
kwargs = {'output_format': 'dataframe',
'pandas_options': col2val,
'stream': True}
df2 = tabula.read_pdf(fname, **kwargs)
print(df2[0].dtypes)
print(df2[0].head())
With the following output:
Python version:
3.7.6 (default, Jan 8 2020, 13:42:34)
[Clang 4.0.1 (tags/RELEASE_401/final)]
Java version:
openjdk version "13.0.2" 2020-01-14
OpenJDK Runtime Environment (build 13.0.2+8)
OpenJDK 64-Bit Server VM (build 13.0.2+8, mixed mode, sharing)
tabula-py version: 2.0.4
platform: Darwin-19.3.0-x86_64-i386-64bit
uname:
uname_result(system='Darwin', node='MacBook-Pro-10.local', release='19.3.0', version='Darwin Kernel Version 19.3.0: Thu Jan 9 20:58:23 PST 2020; root:xnu-6153.81.5~1/RELEASE_X86_64', machine='x86_64', processor='i386')
linux_distribution: ('Darwin', '19.3.0', '')
mac_ver: ('10.15.3', ('', '', ''), 'x86_64')
None
'pages' argument isn't specified.Will extract only from page 1 by default.
Unnamed: 0 object
mpg object
cyl object
disp object
hp object
drat object
wt object
qsec object
vs object
am object
gear object
carb object
dtype: object
Unnamed: 0 mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
'pages' argument isn't specified.Will extract only from page 1 by default.
Unnamed: 0 object
mpg float64
cyl int64
disp float64
hp int64
drat float64
wt float64
qsec float64
vs int64
am int64
gear int64
carb int64
dtype: object
Unnamed: 0 mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
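Once the columns are strings, the number of written decimals can be read off each cell. A minimal sketch using only the standard library; the helper name decimal_places and the sample cells are illustrative assumptions, not part of tabula-py or pandas, and it assumes every cell is a finite number written in plain decimal notation:

```python
from decimal import Decimal


def decimal_places(text):
    """Return how many digits follow the decimal point in a numeric string."""
    # Decimal keeps trailing zeros, so "20.0000" has exponent -4.
    exponent = Decimal(text).as_tuple().exponent
    # Whole numbers have exponent 0 (or positive); report 0 places for them.
    return max(0, -exponent)


# Hypothetical cell values, as they would look in a str-typed column.
cells = ["20.0000", "3.90", "110"]
places = [decimal_places(c) for c in cells]
print(places)  # [4, 2, 0]
```

With a real DataFrame, the same helper could be applied to a string column, for example with df1[0]['drat'].map(decimal_places), provided the column contains no missing values.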