Pandas raises ssl.CertificateError when using method read_html for HTTPS resources

I have code that reads the contents of a web page from a URL.

My code used to work, but now there is a problem with the site's security certificate. In IE, I fixed this by importing the certificate into the trusted sites, which solved the problem there.

But when I run this code:

df = pd.read_html(i,header=0)[0]

I get an error:

Traceback (most recent call last):
  File "D:\Distrib\Load_Data_from_Flat_ver_1.py", line 95, in <module>
    df = pd.read_html(i,header=0)[0]
  File "C:\Program Files\Python36\lib\site-packages\pandas\io\html.py", line 915, in read_html
    keep_default_na=keep_default_na)
  File "C:\Program Files\Python36\lib\site-packages\pandas\io\html.py", line 749, in _parse
    raise_with_traceback(retained)
  File "C:\Program Files\Python36\lib\site-packages\pandas\compat\__init__.py", line 385, in raise_with_traceback
    raise exc.with_traceback(traceback)
ssl.CertificateError: hostname '10.89.174.12' doesn't match 'localhost'

Can anyone help me with this problem?

phillipwatts344 asked Sep 07 '25 08:09


1 Answer

What is the error

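The Python standard library documentation for the ssl package gives an example where this specific error occurs. (ssl.match_hostname was deprecated in Python 3.7 and removed in 3.12, but the check it illustrates is the same one that fails here.)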

>>> import ssl
>>> cert = {'subject': ((('commonName', 'example.com'),),)}
>>> ssl.match_hostname(cert, "example.com")
>>> ssl.match_hostname(cert, "example.org")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/py3k/Lib/ssl.py", line 130, in match_hostname
ssl.CertificateError: hostname 'example.org' doesn't match 'example.com'

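The first call succeeds because the requested hostname equals the certificate's Common Name; the second fails because it does not. This is exactly what happens in your case: you connect to 10.89.174.12, but the server presents a certificate issued for localhost.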
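Since ssl.match_hostname is no longer available in recent Python versions, here is a minimal sketch of the same idea. It is a simplified illustration, not the real algorithm: it only does exact, case-insensitive comparison and ignores wildcards. The helper names cert_names and hostname_matches are hypothetical, and the certificate dict is made up to mirror the error in the question.

```python
def cert_names(cert):
    """Collect the names a decoded certificate dict claims to be valid for."""
    # Prefer subjectAltName entries (DNS names and IP addresses)
    names = [
        value
        for kind, value in cert.get("subjectAltName", ())
        if kind in ("DNS", "IP Address")
    ]
    if not names:
        # Fall back to commonName only when no subjectAltName is present
        for rdn in cert.get("subject", ()):
            for key, value in rdn:
                if key == "commonName":
                    names.append(value)
    return names


def hostname_matches(cert, hostname):
    """Simplified check: exact, case-insensitive match, no wildcard support."""
    return hostname.lower() in (name.lower() for name in cert_names(cert))


# Hypothetical certificate resembling the one in the question
cert = {"subject": ((("commonName", "localhost"),),)}

print(hostname_matches(cert, "localhost"))     # name on the certificate
print(hostname_matches(cert, "10.89.174.12"))  # address actually requested
```

The second call is the situation from the traceback: the address used in the URL is not among the names on the certificate, so verification is refused.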

Python path

Referring to the Pandas documentation:

io : str or file-like A URL, a file-like object, or a raw string containing HTML. Note that lxml only accepts the http, ftp and file url protocols. If you have a URL that starts with 'https' you might try removing the 's'.

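So you cannot rely on read_html to fetch HTTPS resources by itself, at least with the lxml parser, and it gives you no way to supply your own SSL context.

To work around this problem, first download the resource over HTTPS with the Python standard library, without verifying the certificate: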

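from urllib import request
import ssl

url = "https://example.com/data.html"
# Unverified context: skips certificate and hostname checks (testing only)
context = ssl._create_unverified_context()
response = request.urlopen(url, context=context)
html = response.read()  # raw bytes of the page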

And then process it with Pandas:

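import io
import pandas as pd
# read_html returns a list of DataFrames; take the first table, as in the question
df = pd.read_html(io.BytesIO(html), header=0)[0]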

Create a Valid Context

As pointed out by @AlastairMcCormack:

context = ssl._create_unverified_context() should only be used for localhost or testing.

If accessing the resource without certificate verification solves your problem, the next step is to create a valid SSL context so that you can fetch the resource safely.
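As a rough sketch of what a verified fetch could look like, the following uses ssl.create_default_context, which enables both certificate verification and hostname checking. The function names make_verified_context and fetch_html are hypothetical, and the cafile argument assumes you have exported the server's CA certificate to a PEM file.

```python
import ssl
from urllib import request


def make_verified_context(cafile=None):
    """Build a context that verifies the certificate chain and the hostname.

    cafile: optional path to a PEM file with the CA certificate(s) to trust
    (an assumption for this sketch); when omitted, the system's default
    trust store is used.
    """
    return ssl.create_default_context(cafile=cafile)


def fetch_html(url, cafile=None):
    """Download a page over HTTPS with verification enabled; returns bytes."""
    context = make_verified_context(cafile)
    with request.urlopen(url, context=context) as response:
        return response.read()
```

Note that trusting the certificate is not enough on its own here: hostname checking stays enabled, so the request will still fail as long as the name on the certificate (localhost) differs from the address in the URL.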

Server path

You can also create a new certificate whose Common Name matches the server's domain (or its IP address). Here, localhost seems to come from a development certificate that was deployed to the production server, which cannot work properly.

In any case, this alone does not change the fact that read_html's handling of HTTPS connections is limited.

jlandercy answered Sep 09 '25 16:09