
How to convert from PDF to TXT without unintended line breaks?

I am trying to convert a very clean PDF file into a .txt file using Python. I have tried PyPDF2 and PDFMiner, and both recognized the text perfectly.

However, because the lines are wrapped in the PDF, the extracted .txt file has an unintended line break at the end of each line, e.g. line 1: "is an account of the Elder \n Days, ". There should not be a line break between "Elder" and "Days".

txt file: [screenshot]

The PDF file: [screenshot]

When the file is edited with Acrobat, it can be clearly seen that the original text in the PDF contains no hard line breaks and can be edited as a paragraph instead of as single lines. [screenshot]

The code I have tried (adapted from an answer to "convert from pdf to text: lines and words are broken"):

from io import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


# Converts a PDF and returns its text content as a string.
def convert(fname, pages=None):
    # An empty set makes PDFPage.get_pages process every page.
    pagenums = set(pages) if pages else set()

    output = StringIO()
    manager = PDFResourceManager()
    converter = TextConverter(manager, output, laparams=LAParams())
    interpreter = PDFPageInterpreter(manager, converter)

    # The with-block closes the file even if a page fails to parse.
    with open(fname, 'rb') as infile:
        for page in PDFPage.get_pages(infile, pagenums):
            interpreter.process_page(page)
    converter.close()
    text = output.getvalue()
    output.close()  # the original had "output.close" without the call
    return text


# Raw strings keep the backslashes in Windows paths literal.
path = r'D:\Folder\File.pdf'
a = convert(path)
with open(r'D:\Folder\File.txt', 'a', encoding='utf-8') as f:
    f.write(a)
C.Ann.Sng asked Oct 17 '25 07:10


1 Answer

"A picture is worth a thousand words", and comments do not allow pictures! I am using the web archive of a different copy, but the gist is exactly the same.

You are working with "justified" content, but unlike reflowing justification in a word processor, the glyphs are generally tied to a line at a set position up from the page base. Adobe is working on reflowable PDFs and has the expertise to combine lines into a paragraph; other readers, however, will accept each line for what it is.

There are no paragraph boundary markers, as there are in, say, HTML.

Readers could in the future be augmented, like Acrobat, to combine the lines, but that is not needed for reading (aloud) one line at a time. Some audio readers will noticeably stutter on those "line at a time" returns, whilst others are intelligently programmed to simply ignore them.

In short, you need to add your own AI/regex to gather the lines and add indents, but beware of significant differences across human literature, such as hyphenation and oriental punctuation.
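As a rough illustration of the regex route, here is a minimal sketch that post-processes the extracted string. It rests on assumptions that will not hold for every PDF: that the extractor leaves a blank line between paragraphs, that any other single newline is a soft wrap, and that a lowercase word ending in a hyphen at a line end was split by the typesetter. The function name unwrap_lines is made up for this example.

```python
import re


def unwrap_lines(text: str) -> str:
    """Join soft-wrapped lines into paragraphs (heuristic sketch)."""
    # Normalise line endings and drop spaces left at the end of lines.
    text = text.replace("\r\n", "\n")
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Assumption: "hyphen-\nated" is one word split by the typesetter.
    # This also merges real compounds such as "well-\nknown".
    text = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)
    # Assumption: a blank line is a paragraph boundary; any other
    # single newline is a soft wrap and becomes a space.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and trim the ends.
    return re.sub(r" {2,}", " ", text).strip()


if __name__ == "__main__":
    sample = "is an account of the Elder \nDays, or the First Age of the \nWorld."
    print(unwrap_lines(sample))
```

In the question's script this would be applied to the string returned by convert() before writing the file. If the PDF has no blank lines between paragraphs, the paragraph rule needs a different cue, such as a leading indent or a short preceding line.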

K J answered Oct 19 '25 22:10


