
Iterate through .PDFs and convert them to .txt using PDFMiner

I'm trying to merge two different things I've been able to accomplish independently. Unfortunately the PDFMiner docs are just not useful at all.

I have a folder containing hundreds of PDFs named "[0-9].pdf", in no particular order, and I don't care to sort them. I just need a way to go through them and convert each one to text.

Using this post: Extracting text from a PDF file using PDFMiner in python? - I was able to extract the text from one PDF successfully.

Some of this post: batch process text to csv using python - was useful in determining how to open a folder full of PDFs and work with them.

Now I just don't know how to combine the two: open each PDF in turn, convert it to a text object, save that to a text file named original-filename.txt, and then move on to the next PDF in the directory.

Here's my code:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO
import os
import glob

directory = r'./Documents/003/' #path
pdfFiles = glob.glob(os.path.join(directory, '*.pdf'))

resourceManager = PDFResourceManager()
returnString = StringIO()
codec = 'utf-8'
laParams = LAParams()
device = TextConverter(resourceManager, returnString, codec=codec, laparams=laParams)
interpreter = PDFPageInterpreter(resourceManager, device)

password = ""
maxPages = 0
caching = True
pageNums=set()

for one_pdf in pdfFiles:
    print("Processing file: " + str(one_pdf))
    fp = file(one_pdf, 'rb')
    for page in PDFPage.get_pages(fp, pageNums, maxpages=maxPages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)
    text = returnString.getvalue()
    filenameString = str(one_pdf) + ".txt"
    text_file = open(filenameString, "w")
    text_file.write(text)
    text_file.close()
    fp.close()

device.close()
returnString.close()

I get no compilation errors, but my code doesn't do anything.

Thanks for your help!

kabaname asked Dec 05 '25 10:12


1 Answer

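Answering my own question with the idea from @LaurentLAPORTE, which worked.

Set directory to an absolute path with os.path.abspath, like this: os.path.abspath("../Documents/003/"). A relative path is resolved against the current working directory; if it points to the wrong place, glob.glob returns an empty list and the for loop never runs, which matches the "doesn't do anything" symptom.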
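
For anyone landing here on Python 3, here is a minimal sketch of the same loop, not the exact code I ran. It assumes the pdfminer.six package is installed and uses its extract_text helper; convert_folder and pdfminer_extract are names I made up for illustration. It also differs from the code in the question in two ways: each PDF is extracted on its own, so text from earlier files does not pile up in a shared StringIO, and the output is named 1.txt rather than 1.pdf.txt.

```python
import glob
import os


def pdfminer_extract(pdf_path):
    """Return the text of one PDF (assumes pdfminer.six is installed)."""
    # Imported lazily so the rest of the sketch works without the package.
    from pdfminer.high_level import extract_text
    return extract_text(pdf_path)


def convert_folder(directory, extract=pdfminer_extract):
    """Write <name>.txt next to every <name>.pdf; return the .txt paths."""
    # Resolve the path once so the printed messages show where we really looked.
    directory = os.path.abspath(directory)
    pdf_files = sorted(glob.glob(os.path.join(directory, "*.pdf")))
    if not pdf_files:
        # An empty match is the silent failure from the question, so say so.
        print("No PDFs found in " + directory)
    written = []
    for pdf_path in pdf_files:
        print("Processing file: " + pdf_path)
        text = extract(pdf_path)  # fresh extraction per file, nothing shared
        txt_path = os.path.splitext(pdf_path)[0] + ".txt"
        with open(txt_path, "w", encoding="utf-8") as out:
            out.write(text)
        written.append(txt_path)
    return written
```

Calling convert_folder("../Documents/003/") should then process the folder from the answer above. The extract argument is only there so a different extractor can be swapped in; leaving it out uses pdfminer.six.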

kabaname answered Dec 07 '25 00:12


