
Retrieve document content together with document structure using python-docx

I need to retrieve tables, together with the paragraphs immediately before and after them, from a .docx file, but I can't work out how to do this with python-docx.

I can get a list of paragraphs with document.paragraphs.

I can get a list of tables with document.tables.

How can I get a single list of document elements in document order, like this:

[
Paragraph1,
Paragraph2,
Table1,
Paragraph3,
Table2,
Paragraph4,
...
]?
Belegnar asked Mar 11 '26 23:03



1 Answer

python-docx doesn't yet have API support for this; interestingly, the Microsoft Word API doesn't either.

But you can work around this with the following code. Note that it's a bit brittle because it makes use of python-docx internals that are subject to change, but I expect it will work just fine for the foreseeable future:

#!/usr/bin/env python
# encoding: utf-8

"""
Testing iter_block_items()
"""

from __future__ import (
    absolute_import, division, print_function, unicode_literals
)

from docx import Document
from docx.document import Document as _Document
from docx.oxml.text.paragraph import CT_P
from docx.oxml.table import CT_Tbl
from docx.table import _Cell, Table
from docx.text.paragraph import Paragraph


def iter_block_items(parent):
    """
    Generate a reference to each paragraph and table child within *parent*,
    in document order. Each returned value is an instance of either Table or
    Paragraph. *parent* would most commonly be a reference to a main
    Document object, but also works for a _Cell object, which itself can
    contain paragraphs and tables.
    """
    if isinstance(parent, _Document):
        parent_elm = parent.element.body
        # print(parent_elm.xml)
    elif isinstance(parent, _Cell):
        parent_elm = parent._tc
    else:
        raise ValueError(
            "parent must be a Document or _Cell, got %r" % type(parent)
        )

    for child in parent_elm.iterchildren():
        if isinstance(child, CT_P):
            yield Paragraph(child, parent)
        elif isinstance(child, CT_Tbl):
            yield Table(child, parent)


document = Document('test.docx')
for block in iter_block_items(document):
    print('found one')
    print(block.text if isinstance(block, Paragraph) else '<table>')
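
Once the blocks are available in document order, pairing each table with the paragraph before and after it (what the question asks for) is a small step. The following is only a sketch: tables_with_neighbors and its is_table parameter are hypothetical names, not part of python-docx, and the helper is deliberately generic so it can be tried on plain strings.

```python
def tables_with_neighbors(blocks, is_table):
    """Return (previous, table, next) triples for every table in *blocks*.

    *blocks* is any iterable of block items in document order.
    *is_table* is a caller-supplied predicate (hypothetical helper argument).
    previous/next is the directly adjacent block, or None when there is no
    such block or when the adjacent block is itself a table.
    """
    items = list(blocks)
    result = []
    for i, block in enumerate(items):
        if not is_table(block):
            continue
        before = None
        if i > 0 and not is_table(items[i - 1]):
            before = items[i - 1]
        after = None
        if i + 1 < len(items) and not is_table(items[i + 1]):
            after = items[i + 1]
        result.append((before, block, after))
    return result


# Stand-in data: strings instead of real Paragraph/Table objects.
sample = ["p1", "p2", "T1", "p3", "T2", "T3"]
for triple in tables_with_neighbors(sample, lambda b: b.startswith("T")):
    print(triple)
```

With the code above, the call would presumably look like tables_with_neighbors(iter_block_items(document), lambda b: isinstance(b, Table)). Returning None for a missing or non-paragraph neighbor is a design choice of this sketch; skipping over consecutive tables to the nearest paragraph would be an equally reasonable variant.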

There is some more discussion of this here:
https://github.com/python-openxml/python-docx/issues/276
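
The workaround works because paragraphs and tables are sibling elements under the document body, in reading order. To illustrate that without relying on python-docx internals, here is a rough sketch using only the standard library. It is a different technique from the answer's code, and iter_body_blocks is a hypothetical name. It assumes the main document part is stored at word/document.xml, which is the usual location but is not guaranteed for every file.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by w:body, w:p, w:tbl and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def iter_body_blocks(docx_file):
    """Yield ("paragraph", text) or ("table", None) per top-level block.

    *docx_file* is a path or a binary file-like object.
    Assumption: the main part lives at word/document.xml.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    body = root.find("{%s}body" % W_NS)
    if body is None:
        return
    # Iterating an element visits its direct children in document order.
    for child in body:
        if child.tag == "{%s}p" % W_NS:
            # Join the text of every w:t run inside this paragraph.
            text = "".join(t.text or "" for t in child.iter("{%s}t" % W_NS))
            yield ("paragraph", text)
        elif child.tag == "{%s}tbl" % W_NS:
            yield ("table", None)
```

This only reports top-level blocks and plain run text; it does not return python-docx Paragraph or Table objects, so it is useful mainly for seeing the ordering, not as a replacement for the workaround above.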

scanny answered Mar 14 '26 13:03
