
Extracting images from PDF using pypdfium2 (Python)

I am trying to extract images from a PDF document using this specific library: pypdfium2 (https://pypi.org/project/pypdfium2/).

I would love to use PyMuPDF instead (given its excellent speed and versatility), but because it uses a copyleft license I CANNOT use it for my workflow. So please don't provide an answer that advises me to use PyMuPDF.

Any suggestions are appreciated. I've looked through the docs but can't seem to find any image extraction methods.

To be clear, I am not trying to convert the PDF pages into images; I am trying to extract the images embedded in the document itself (assuming there are any). Images are typically embedded as either JPEGs or PNGs.

americanthinker asked Sep 15 '25 09:09



1 Answer

pypdfium2 maintainer here. Yes, this is possible, and also documented. Take a look at PdfPage.get_objects() and PdfImage.extract() (or PdfImage.get_bitmap()).

There's also a built-in CLI, pypdfium2 extract-images, as a testing utility. Its implementation demonstrates how to use the above APIs.
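For illustration, here is a minimal sketch of how PdfPage.get_objects() and PdfImage.extract() might be combined. It assumes a pypdfium2 4.x-style API, in which extract() takes a file path prefix and picks the extension itself. The function names, the output naming scheme, and the choice to skip images that fail are hypothetical and not taken from the library's CLI.

```python
from pathlib import Path


def image_prefix(out_dir, page_index, image_index):
    """Build a path prefix like out/page0001_img01 (hypothetical naming scheme)."""
    return Path(out_dir) / f"page{page_index + 1:04d}_img{image_index + 1:02d}"


def extract_images(pdf_path, out_dir):
    """Write every image object found on each page; return how many were written."""
    # Imported here so the naming helper above is usable without pypdfium2.
    import pypdfium2 as pdfium
    import pypdfium2.raw as pdfium_c

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    pdf = pdfium.PdfDocument(pdf_path)
    written = 0
    for page_index in range(len(pdf)):
        page = pdf[page_index]
        # Restrict the page object listing to image objects only.
        images = page.get_objects(filter=[pdfium_c.FPDF_PAGEOBJ_IMAGE])
        for image_index, image in enumerate(images):
            prefix = image_prefix(out_dir, page_index, image_index)
            try:
                # Assumption: extract() appends a suitable file extension to the prefix.
                image.extract(prefix)
                written += 1
            except pdfium.PdfiumError as exc:
                # Assumption: skipping is acceptable; some images cannot be extracted.
                print(f"Skipped {prefix}: {exc}")
    return written
```

Calling extract_images("input.pdf", "out") would then write one file per extractable image into out/. If extract() fails for an image, PdfImage.get_bitmap() followed by the bitmap's to_pil() may be worth trying as a fallback.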

However, due to limitations in pdfium's public interface, pypdfium2 is nowhere near as good at image extraction as would technically be possible. You may want to consider pikepdf (MPL2-licensed); it's the most sophisticated tool for this task IMHO.
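As a rough sketch of the pikepdf route, the snippet below iterates over each page's image resources and writes them out with PdfImage.extract_to(). The function names and the file naming scheme are hypothetical, and the sketch assumes that images listed directly in a page's resources are all that is needed.

```python
from pathlib import Path


def output_prefix(out_dir, page_number, name):
    """Build a path prefix like out/page0001_Im0 (hypothetical naming scheme)."""
    return Path(out_dir) / f"page{page_number:04d}_{str(name).lstrip('/')}"


def extract_images_pikepdf(pdf_path, out_dir):
    """Write the images found in each page's resources; return the written paths."""
    # Imported here so the naming helper above is usable without pikepdf.
    import pikepdf

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    with pikepdf.Pdf.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            # page.images maps resource names such as "/Im0" to image objects.
            for name, raw_image in page.images.items():
                image = pikepdf.PdfImage(raw_image)
                prefix = output_prefix(out_dir, page_number, name)
                # extract_to() chooses the extension and returns the final path.
                written.append(image.extract_to(fileprefix=str(prefix)))
    return written
```

Some images use encodings that pikepdf cannot extract directly, in which case extract_to() raises an exception; a real script would probably want to catch that per image rather than abort.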

(BTW, it's better to ask such questions on pypdfium2's discussions page on GitHub, where you're more likely to get a response.)

mara004 answered Sep 16 '25 21:09
