For machine learning purposes (sckit-learn), I need to extract the raw text from lots of PDF files. First off, I was using xpdf pdftotext to do this task:
exe = r'"'+os.path.join(xpdf_path,"pdftotext.exe")+'"'
cmd = exe+" "+"\""+pdf+"\""+" "+"\""+pdf+".txt"+"\""
subprocess.check_output(cmd)
with open(pdf+".txt") as f:
texto_converted = f.read()
But unfortunately, for few of them, I was unable to get the text because they are using "stream" on their pdf source, like this one.
The result is something like this:
59!"#$%&'()*+,-.#/#01"21"" 345667.0*(879:4$;<;4=<6>4?$@"12!/ 21#$@A$3A$>@>BCDCEFGCHIJKIJLMNIJILOCNPQRDS QPFTRPUCTCVQWBCTTQXFPYTO"21 "#/!"#(Z[12\&A+],$3^_3;9`Z &a# .2"#.b#"(#c#A(87*95d$d4?$d3e#Z"f#\"#2b?2"#`Z 2"!eb2"#H1TBRgF JhiO
jFK# 2"k#`Z !#212##"elf/e21m#*c!n2!!#/bZ!#2#`Z "eo ]$5<$@;A533> "/\ko/f\#e#e#p
I Even trying using zlib + regex:
import re
import zlib
pdf = open("pdfa.pdf", "rb").read()
stream = re.compile(b'.*?FlateDecode.*?stream(.*?)endstream', re.S)
for s in re.findall(stream,pdf):
s = s.strip(b'\r\n')
try:
print(zlib.decompress(s).decode('UTF-8'))
print("")
except:
pass
The result was something like this:
1 0 -10 -10 10 10 d1
0.01 0 0 0.01 0 0 cm
1 0 -10 -10 10 10 d1
0.01 0 0 0.01 0 0 cm
I even tried pdftopng (xpdf) to try tesseract after, without success So, Is there any way to extract pure text from a PDF like that using Python or a third party app?
If you want to decompress the streams in a PDF file, I can recommend using qdpf
, but on this file
qpdf --decrypt --stream-data=uncompress document.pdf out.pdf
doesn't help either.
I am not sure though why your efforts with xpdf
and tesseract
did not work out, using image-magick's convert
to create PNG files in a temporary directory and tesseract
, you can do:
import os
from pathlib import Path
from tempfile import TemporaryDirectory
import subprocess
DPI=600
def call(*args):
cmd = [str(x) for x in args]
return subprocess.check_output(cmd, stderr=subprocess.STDOUT).decode('utf-8')
def ocr(docpath, lang):
result = []
abs_path = Path(docpath).expanduser().resolve()
old_dir = os.getcwd()
out = Path('out.txt')
with TemporaryDirectory() as tmpdir:
os.chdir(tmpdir)
call('convert', '-density', DPI, abs_path, 'out.png')
index = -1
while True:
# names have no leading zeros on the digits, would be difficult to sort glob() output
# so just count them
index += 1
png = Path(f'out-{index}.png')
if not png.exists():
break
call('tesseract', '--dpi', DPI, png, out.stem, '-l', lang)
result.append(out.read_text())
os.chdir(old_dir)
return result
pages = ocr('~/Downloads/document.pdf', 'por')
print('\n'.join(pages[1].splitlines()[21:24]))
which gives:
DA NÃO REALIZAÇÃO DE AUDIÊNCIA DE AUTOCOMPOSIÇÃO NO CASO EM CONCRETO
Com vista a obter maior celeridade processual, assim como da impossibilidade de conciliação entre
If you are on Windows, make sure your PDF file is not open in a different process (like a PDF viewer), as Windows doesn't seem to like that.
The final print
is limited as the full output is quite large.
This converting and OCR-ing takes a while so you might want to uncomment the print
in call()
to get some sense of progress.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With