
Extract images from a Word document using Python

How can I extract images/logos from a Word document using Python and store them in a folder? The following code converts the docx to HTML, but it doesn't extract the images from the HTML. Any pointers or suggestions would be a great help.

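    import mammoth

    profile_path = <file path>  # path to the input .docx
    profile_html = <file path>  # path to the output .html (not defined in the original snippet)

    # Convert the .docx to HTML and write the result out as UTF-8.
    # mammoth.convert_to_html expects an open binary file, not a path string.
    with open(profile_path, 'rb') as f, open(profile_html, 'wb') as b:
        document = mammoth.convert_to_html(f)
        b.write(document.value.encode('utf8'))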
Softchamp asked Oct 30 '25 16:10

1 Answer

Native without any lib

A docx file is a variation on a zip file, so the source images can be extracted from it directly, without distortion or conversion.

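Shell out to the OS and run the command below. It needs a tar that can read zip archives, such as the bsdtar shipped with Windows 10 and macOS; GNU tar cannot.

    tar -m -xf DocxWithImages.docx word/media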


You will find the source images (JPEG, PNG, WMF, or others) extracted into a word/media folder. These are the unadulterated source embeddings, without scaling or cropping.

You may be surprised to find that the visible area is larger than the cropped version used in the docx itself. Be aware that Word does not always crop images as expected, which is a source of embarrassing redaction failures.
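Since the question asks for Python, the same idea can be sketched with the standard-library zipfile module, which avoids depending on which tar is installed. This is a minimal sketch, not a definitive implementation: the function name extract_docx_images and the output folder name are hypothetical, and it assumes the embedded images sit under word/media/ in the archive, as described above.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_docx_images(docx_path, out_dir):
    """Copy every file under word/media/ in a .docx into out_dir, unmodified.

    Hypothetical helper for illustration; returns the list of written paths.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Skip everything outside word/media/ and any directory entries.
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            # Use only the base file name so nothing is written outside out_dir.
            target = out / PurePosixPath(name).name
            target.write_bytes(archive.read(name))
            saved.append(target)
    return saved
```

Calling extract_docx_images("DocxWithImages.docx", "extracted_images") would write the raw image files into extracted_images. As with the tar approach, these are the stored source images, so any cropping applied in Word is not reflected in them.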

K J answered Nov 01 '25 07:11