
Python | Http - can't get the correct mime type

I am building a web crawler using urllib3. Example code:

from urllib3 import PoolManager

pool = PoolManager()
# url is the link currently being crawled
response = pool.request("GET", url)
mime_type = response.headers.get("content-type")

I have stumbled upon a few links to document files such as docx and epub, and the MIME type I'm getting from the server is text/plain. It is important to me to get the correct MIME type.

Example of a problematic URL:

http://lsa.mcgill.ca/pubdocs/files/advancedcommonlawobligations/523-gold_advancedcommonlawobligations_-2013.docx

Right now my logic for determining a file's MIME type is to take it from the server's response and, if that is not available, to fall back to the file's extension.

How come Firefox is not confused by these kinds of URLs and lets the user download the file right away? How does it know that this file is not plain text? How can I get the correct MIME type?

asked Nov 29 '25 05:11 by Montoya


1 Answer

I haven't read the Firefox source code, but I would guess that Firefox either tries to guess the filetype based on the URL, or refuses to render it inline if it's a specific Content-Type and larger than some maximum size, or perhaps it even inspects some of the file contents to figure out what it is based on a magic number at the start.
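The last of those guesses, inspecting the file contents, can also be tried from Python. What follows is only a minimal sketch and not a description of what Firefox actually does: it assumes the whole response body has already been downloaded as bytes, the function name sniff_zip_document and the two constants are hypothetical, and it only recognises the two formats mentioned in the question (both of which are ZIP containers).

```python
import io
import zipfile

DOCX = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
EPUB = "application/epub+zip"


def sniff_zip_document(body):
    """Guess a MIME type from the bytes of a response body, or return None."""
    # ZIP archives with at least one entry start with the magic number "PK\x03\x04".
    if not body.startswith(b"PK\x03\x04"):
        return None
    try:
        with zipfile.ZipFile(io.BytesIO(body)) as archive:
            names = set(archive.namelist())
            # EPUB containers carry a "mimetype" entry that names their type.
            if "mimetype" in names:
                declared = archive.read("mimetype").decode("ascii", "replace").strip()
                if declared == EPUB:
                    return EPUB
            # Word documents keep their main part at word/document.xml.
            if "word/document.xml" in names:
                return DOCX
    except zipfile.BadZipFile:
        # Looked like a ZIP at first, but is truncated or corrupt.
        return None
    # A ZIP archive of some other kind.
    return "application/zip"
```

With the crawler from the question this could be called as sniff_zip_document(response.data), at the cost of downloading the full file first.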

You can use the Python mimetypes module in the standard library to guess what the filetype is based on the URL:

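import mimetypes

url = "http://lsa.mcgill.ca/pubdocs/files/advancedcommonlawobligations/523-gold_advancedcommonlawobligations_-2013.docx"
# guess_type looks only at the extension in the URL; it makes no request
mime_type, encoding = mimetypes.guess_type(url)

In this case, mime_type is "application/vnd.openxmlformats-officedocument.wordprocessingml.document" (provided the MIME database available to Python on your system knows the .docx extension), which is probably what you want.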
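To combine this with the header the server sends, one option is to trust the Content-Type unless it is one of the generic values servers tend to fall back on, and only then guess from the URL. This is a rough sketch under stated assumptions: resolve_mime_type and GENERIC_TYPES are hypothetical names, and the choice of which types count as "generic" is mine, not something the server or urllib3 defines.

```python
import mimetypes

# Types a server may send when it does not know the real type.
# This set is an assumption; adjust it for your crawler.
GENERIC_TYPES = {"text/plain", "application/octet-stream"}


def resolve_mime_type(header_value, url):
    """Prefer the server's Content-Type unless it is missing or generic."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    declared = (header_value or "").split(";", 1)[0].strip().lower()
    if declared and declared not in GENERIC_TYPES:
        return declared
    # Fall back to a guess based on the extension in the URL.
    guessed, _encoding = mimetypes.guess_type(url)
    # If nothing could be guessed, keep the generic type (or None if absent).
    return guessed or declared or None
```

With the code from the question, this would be called as resolve_mime_type(response.headers.get("content-type"), url).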

answered Nov 30 '25 20:11 by shazow


