We have a pretty complex print workflow where the controlling is adding comments and annotations for draft versions of generated PDF documents using Adobe Reader or Adobe Acrobat. As part of the workflow imported PDF documents with annotations and comments should be parsed and the annotations should be imported into a CMS system (together with the PDF).
Q: are there any reliable tools (preferred Python or Java) for extracting such data in clean and reliable way to PDF files?
This code should do the job. One of the answers to the question Parse annotations from a pdf was very helpful in getting me to write the code below. It uses the poppler library to parse the annotations. This is a link to annotations.pdf.
code
import poppler, os.path
path = 'file://%s' % os.path.realpath('annotations.pdf')
doc = poppler.document_new_from_file(path, None)
pages = [doc.get_page(i) for i in range(doc.get_n_pages())]
for page_no, page in enumerate(pages):
items = [i.annot.get_contents() for i in page.get_annot_mapping()]
items = [i for i in items if i]
print "page: %s comments: %s " % (page_no + 1, items)
output
page: 1 comments: ['This is an annotation']
page: 2 comments: [' Please note ', ' Please note ', 'This is a comment in the text']
installation
On Ubuntu the installation as as follows.
apt-get install python-poppler
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With