How to split PDF into paragraphs using Tika

Question

I have a PDF document which I am currently parsing using Tika-Python. I would like to split the document into paragraphs.

My idea is to go through the extracted text line by line, use the isspace() function to detect blank lines, and build a list of paragraphs from the lines in between.

I also tried other ways of splitting the text, but nothing works.

This is my current code:

file_data = (parser.from_file('/Users/graziellademartino/Desktop/UNIBA/Research Project/UK cases/file1.pdf'))
file_data_content = file_data['content']

paragraph = ''
for line in file_data_content:
    if line.isspace():  
        if paragraph:
            yield paragraph
            paragraph = ''
        else:
            continue
    else:
        paragraph += ' ' + line.strip()
yield paragraph

Booboo · Accepted Answer

I can't be sure what file_data_content looks like because I do not know exactly what your PDF parser returns. But if it returns a plain string whose lines are separated by newline characters, such as 'Line1\nLine2\n' etc., then the code below should work. When you say:

for line in file_data_content:

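and file_data_content is a string, you are processing the string character by character rather than line by line, so line.isspace() is tested against single characters and never against a whole blank line. Note also that yield is only valid inside a function. So you need to split your text into a list of lines and process each element of that list inside a generator function: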

def create_paragraphs(file_data_content):
    lines = file_data_content.splitlines(True)
    paragraph = []
    for line in lines:
        if line.isspace():
            if paragraph:
                yield ''.join(paragraph)
                paragraph = []
        else:
            paragraph.append(line)
    if paragraph:
        yield ''.join(paragraph)

text="""Line1
Line2

Line3
Line4


Line5"""

print(list(create_paragraphs(text)))

Prints:

['Line1\nLine2\n', 'Line3\nLine4\n', 'Line5']
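
As a complement, here is a minimal sketch of an alternative that splits on blank lines with a regular expression and joins the lines of each paragraph with a single space, which seems to be what the question's code was aiming for with ' ' + line.strip(). The helper name split_paragraphs is hypothetical and this is not the method used above. With Tika you would presumably pass file_data['content'] to it; the empty-input check is there on the assumption that the content can be missing when no text is extracted.

```python
import re


def split_paragraphs(text):
    """Split text on blank lines and return a list of one-line paragraphs."""
    if not text:  # assumption: the extracted content may be None or empty
        return []
    # A newline, optional whitespace, then another newline separates paragraphs.
    blocks = re.split(r'\n\s*\n', text.strip())
    # Collapse line breaks and repeated spaces inside each paragraph.
    return [' '.join(block.split()) for block in blocks if block.strip()]


sample = "Line1\nLine2\n\nLine3\nLine4\n\n\nLine5"

# Iterating over a string yields single characters, not lines.
print(list(sample)[:6])

# Each paragraph comes back as one line of text.
print(split_paragraphs(sample))
```

Whether to keep the original line breaks (as the generator above does) or to flatten each paragraph to one line depends on what you do with the paragraphs afterwards.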

Tags: python, pdf, apache-tika

Graziella De Martino
