 

Python docx - AttributeError: 'bytes' object has no attribute 'seek'

Tags: python, docx

What I have as input: the raw bytes of a docx document, base64-encoded.
What I am trying to achieve: extract text from this document for further processing.
I tried to follow this answer: extracting text from MS word files in python

My code fragment:

base64_bytes = input.encode('utf-8')
decoded_data = base64.decodebytes(base64_bytes)
document = Document(decoded_data)
docText = '\n\n'.join([paragraph.text.encode('utf-8') for paragraph in document.paragraphs])

The document = Document(decoded_data) line gives me the following error: AttributeError: 'bytes' object has no attribute 'seek'
The decoded_data is in the following format: b'PK\\x03\\x04\\x14\\x00\\x08\\x08\\x08\\x00\\x87@CP\\x00...

How should I format the raw data to extract text from docx?

Michał Herman asked Apr 19 '26 02:04


1 Answer

From the official documentation, emphasis mine:

docx.Document(docx=None)

Return a Document object loaded from docx, where docx can be either a path to a .docx file (a string) or a file-like object. If docx is missing or None, the built-in default document “template” is loaded.

So if you provide a string or string-like parameter, it is interpreted as the path to a docx file. To provide the contents from memory, you need to pass in a file-like object, i.e. a BytesIO instance (the entire point of StringIO and BytesIO being to "convert" strings and bytes to file-like objects):

document = Document(io.BytesIO(decoded_data))
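
For completeness, here is a minimal sketch of the whole path from the base64 input to a file-like object, assuming the input is a base64 string as in the question. The helper names docx_stream_from_base64 and extract_text are made up for illustration, and extract_text assumes python-docx is installed:

```python
import base64
import io


def docx_stream_from_base64(b64_text: str) -> io.BytesIO:
    """Decode base64 text and wrap the resulting bytes in a file-like object."""
    raw = base64.b64decode(b64_text)
    # BytesIO provides seek()/read(), which plain bytes objects lack.
    return io.BytesIO(raw)


def extract_text(b64_text: str) -> str:
    """Hypothetical helper: return the paragraph text of a base64-encoded docx."""
    # Imported here so the helper above works without python-docx installed.
    from docx import Document

    document = Document(docx_stream_from_base64(b64_text))
    # paragraph.text is already str, so no .encode() is needed before joining.
    return "\n\n".join(paragraph.text for paragraph in document.paragraphs)
```

The only real change from the question's code is the io.BytesIO wrapper; the bytes themselves are untouched.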

Side note: you probably want to remove the .encode call in the list comprehension. In Python 3, text (str) and bytes (bytes) are not compatible at all, so the line is going to blow up when you try to join bytes (encoded text) with a textual separator.
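
To illustrate the side note, here is a small sketch with made-up paragraph strings standing in for paragraph.text. Joining encoded items with a str separator raises TypeError, while joining the text first and encoding once at the end (if bytes are needed at all) works:

```python
# Stand-ins for paragraph.text values; the actual content is hypothetical.
paragraph_texts = ["First paragraph", "Second paragraph"]

try:
    # Mirrors the question's line: a str separator joining bytes items.
    "\n\n".join(text.encode("utf-8") for text in paragraph_texts)
    join_failed = False
except TypeError:
    join_failed = True

# Join as text first, then encode once if bytes are really required.
doc_text = "\n\n".join(paragraph_texts)
doc_bytes = doc_text.encode("utf-8")
```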

Masklinn answered Apr 21 '26 15:04


