I am trying to extract comments from a PDF using Python. These are the two pieces of code that I have tested:
One using PyPDF2:
import PyPDF2
src = 'xxxx.pdf'
input1 = PyPDF2.PdfFileReader(open(src, "rb"))
nPages = input1.getNumPages()
df_comments = pd.DataFrame()
for i in range(nPages) :
annotation = []
page = []
page0 = input1.getPage(i)
try :
for annot in page0['/Annots'] :
annotation.append(annot.getObject())
page = [i+1] * len(annotation)
page = pd.DataFrame(page)
annotation = pd.DataFrame(annotation)
df_temp = pd.concat([page, annotation], axis=1)
df_comments = pd.concat([df_comments, df_temp], ignore_index=True)
except :
# there are no annotations on this page
pass
and the other using fitz:
import fitz
doc = fitz.open(src)
for i in range(doc.pageCount):
page = doc[i]
for annot in page.annots():
print(annot.info)
The comments are getting extracted, however when I check the PDF I see that the comments are not being extracted sequentially. I have tried to check other parameters like creation date, modification date but that is not helping me.
Is their a way I can extract them serially as they are appearing in the PDF? Or Can I extract the text as well from the PDF against which the comment has been tagged?
I'm the current maintainer of PyPDF2.
The annotations are currently extracted in the order they appear in the annotations dictionary.
If you have a sensible way to sort them, feel free to open a feature request in the PyPDF2 issue tracker on github.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With