Extracting PDF annotations/comments [duplicate]

Question

We have a fairly complex print workflow in which the controlling department adds comments and annotations to draft versions of generated PDF documents using Adobe Reader or Adobe Acrobat. As part of the workflow, the annotated PDF documents should be parsed on import, and the annotations should be imported into a CMS (together with the PDF).

Q: Are there any reliable tools (preferably Python or Java) for extracting such data from PDF files in a clean and reliable way?

Marwan Alsabbagh · Accepted Answer

This code should do the job. One of the answers to the question "Parse annotations from a pdf" was very helpful in writing the code below. It uses the poppler library to parse the annotations. The example reads a sample file named annotations.pdf.

code

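import os.path

import poppler  # python-poppler bindings (see installation below)

# poppler expects a file:// URI rather than a plain path
path = 'file://%s' % os.path.realpath('annotations.pdf')
doc = poppler.document_new_from_file(path, None)  # None = no password
pages = [doc.get_page(i) for i in range(doc.get_n_pages())]

for page_no, page in enumerate(pages):
    # Collect the text of every annotation on the page, skipping empty ones
    items = [i.annot.get_contents() for i in page.get_annot_mapping()]
    items = [i for i in items if i]
    print("page: %s comments: %s " % (page_no + 1, items))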

output

page: 1 comments: ['This is an annotation'] 
page: 2 comments: [' Please note ', ' Please note ', 'This is a comment in the text']
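
Since the question mentions importing the annotations into a CMS, here is a minimal sketch of how the per-page lists printed above could be turned into flat records. It is not part of the poppler API: the function name `build_comment_records` and the record keys (`file`, `page`, `comment`) are hypothetical, and it assumes that surrounding whitespace and repeated text on the same page are unwanted, which may not hold for every workflow.

```python
import json


def build_comment_records(pdf_name, comments_by_page):
    """Turn per-page comment lists into flat records for a CMS import.

    comments_by_page is a list of lists, one inner list per page, in the
    same shape as the lists printed in the output above. The record layout
    is a hypothetical choice, not something the CMS or poppler dictates.
    """
    records = []
    for page_no, comments in enumerate(comments_by_page, start=1):
        seen = set()
        for text in comments:
            cleaned = text.strip()
            # Assumption: skip empty strings and repeats on the same page.
            if not cleaned or cleaned in seen:
                continue
            seen.add(cleaned)
            records.append(
                {"file": pdf_name, "page": page_no, "comment": cleaned}
            )
    return records


# Sample data copied from the output shown above.
sample = [
    ["This is an annotation"],
    [" Please note ", " Please note ", "This is a comment in the text"],
]
records = build_comment_records("annotations.pdf", sample)
print(json.dumps(records, indent=2))
```

If repeated comments are meaningful in your documents (for example, two separate notes with the same text), drop the `seen` check and keep every entry.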

installation

On Ubuntu, the installation is as follows.

apt-get install python-poppler
