
Python web scraping from a company SharePoint site

I need to scrape data from my company's SharePoint site using Python, but I am stuck at the authentication step. I have tried HttpNtlmAuth from requests_ntlm, HttpNegotiateAuth from requests_negotiate_sspi, and mechanize, and none of them worked. I am new to web scraping and have been stuck on this for a few days. I just need the HTML source so I can start filtering for the data I need. Any guidance would be appreciated.

Methods I've tried:

import requests
from requests_negotiate_sspi import HttpNegotiateAuth

# this is the security certificate I downloaded using chrome
cert = 'certsharepoint.cer'

response = requests.get(
    r'https://company.sharepoint.com/xxx/xxx/xxx/xxx/xxx.aspx',
    auth=HttpNegotiateAuth(),
    verify=cert)

print(response.status_code)

Error:

[X509: NO_CERTIFICATE_OR_CRL_FOUND] no certificate or crl found (_ssl.c:4293)

Another method:

import sharepy
s = sharepy.connect(
    "https://company.sharepoint.com/xxx/xxx/xxx/xxx/xxx.aspx",
    username="username",
    password="password")

Error:

Invalid Request: AADSTS90023: Invalid STS request

There seems to be a problem with the certificate in the first method, and searching for the "Invalid STS request" error has not turned up any solution that works for me.
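One possible cause of the certificate error, offered as an assumption rather than a diagnosis: certificates exported from a browser as .cer are often DER-encoded (binary), whereas the verify argument of requests expects a PEM file. A minimal sketch of the conversion using only the standard library follows; the function name ensure_pem and the output filename are hypothetical.

```python
import ssl
from pathlib import Path

PEM_PREFIX = b"-----BEGIN CERTIFICATE-----"


def ensure_pem(cert_path, pem_path):
    """Write a PEM copy of cert_path to pem_path and return the PEM text.

    Assumption: cert_path holds a single X.509 certificate, either
    DER-encoded (binary) or already PEM-encoded (text).
    """
    raw = Path(cert_path).read_bytes()
    if raw.lstrip().startswith(PEM_PREFIX):
        # Already PEM: copy the text through unchanged.
        pem_text = raw.decode("ascii")
    else:
        # Treat the bytes as DER and wrap them in PEM armor.
        pem_text = ssl.DER_cert_to_PEM_cert(raw)
    Path(pem_path).write_text(pem_text, encoding="ascii")
    return pem_text
```

With a converted file, the first method would pass verify='certsharepoint.pem' (a hypothetical filename) instead of the .cer file. If verification still fails, the exported file may lack the issuing CA chain, which this sketch does not address.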

Another method:

import requests
from requests_ntlm import HttpNtlmAuth

r = requests.get("http://ntlm_protected_site.com",auth=HttpNtlmAuth('domain\\username','password'))

Error:

403 FORBIDDEN

Using requests.get with headers like so:

import requests
from requests_ntlm import HttpNtlmAuth

# url, username and password are defined elsewhere
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                  'AppleWebKit/537.11 (KHTML, like Gecko) '
                  'Chrome/23.0.1271.64 Safari/537.11',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
    'Accept-Encoding': 'none',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'}

auth = HttpNtlmAuth(username=username, password=password)

responseObject = requests.get(url, auth=auth, headers=headers)

This returns a 200 response, whereas requests.get without the headers returns 403 Forbidden. The returned HTML is of no use, however, because it is the HTML of the "We can't sign you in" error page (screenshot captioned "200 response").

Moreover, removing the auth parameter, i.e. responseObject = requests.get(url, headers=headers), changes nothing: it still returns a 200 response with the same HTML for the "We can't sign you in" page.
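Because the status code is 200 in both cases, it cannot be used to tell whether authentication succeeded. A minimal sketch of a check on the response body, assuming the error page contains the "We can't sign you in" text literally; the function name and marker list are hypothetical:

```python
import html

# Assumption: the sign-in error page contains this text (compared in lower case).
SIGN_IN_ERROR_MARKERS = ("we can't sign you in",)


def looks_like_sign_in_error(page_source):
    """Return True if page_source appears to be the sign-in error page."""
    # Decode entities such as &#39; and normalize the typographic apostrophe.
    text = html.unescape(page_source).replace("\u2019", "'").lower()
    return any(marker in text for marker in SIGN_IN_ERROR_MARKERS)
```

It could be called as looks_like_sign_in_error(responseObject.text) before any parsing is attempted.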

asked May 04 '26 19:05 by Sederfo


1 Answer

If you are doing this interactively, try Selenium (https://selenium-python.readthedocs.io/) together with webdriver_manager (https://pypi.org/project/webdriver-manager/), which saves you from downloading the browser driver yourself. Selenium not only lets you authenticate to your tenant interactively, it also makes it possible to collect dynamic content that requires interaction after the page loads, such as pushing a button to reveal a table.
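A minimal sketch of that approach, assuming Selenium 4, Chrome, and the webdriver-manager package are installed. The function names, the timeout, and the idea of detecting a finished sign-in by watching the browser's URL are illustrative assumptions, not something the libraries prescribe:

```python
import time


def wait_until_url_matches(driver, expected_prefix, timeout=300.0, poll=1.0):
    """Poll driver.current_url until it starts with expected_prefix.

    Returns True on a match, False if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if driver.current_url.startswith(expected_prefix):
            return True
        time.sleep(poll)
    return False


def fetch_html_interactively(url, site_prefix, timeout=300.0):
    """Open url in Chrome, wait for a manual sign-in, return the page HTML."""
    # Imported lazily so the helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    # webdriver_manager downloads a matching chromedriver if needed.
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get(url)
        # Assumption: the login flow redirects away from the site and
        # returns to a URL under site_prefix once sign-in completes.
        if not wait_until_url_matches(driver, site_prefix, timeout):
            raise TimeoutError("sign-in was not completed in time")
        return driver.page_source
    finally:
        driver.quit()
```

For the page in the question, the call might look like fetch_html_interactively('https://company.sharepoint.com/xxx/xxx/xxx/xxx/xxx.aspx', 'https://company.sharepoint.com/'). The returned string can then be handed to an HTML parser.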

answered May 07 '26 11:05 by Phil Lembo


