I need to scrape data from my company's SharePoint site using Python, but I'm stuck at the authentication phase. I have tried HttpNtlmAuth from requests_ntlm, HttpNegotiateAuth from requests_negotiate_sspi, and mechanize, and none of them worked. I'm new to web scraping and have been stuck on this for a few days already. I just need to get the HTML source so I can start filtering for the data I need. Can anyone give me some guidance?
Methods I've tried:
import requests
from requests_negotiate_sspi import HttpNegotiateAuth

# this is the security certificate I downloaded using Chrome
cert = 'certsharepoint.cer'

response = requests.get(
    r'https://company.sharepoint.com/xxx/xxx/xxx/xxx/xxx.aspx',
    auth=HttpNegotiateAuth(),
    verify=cert)

print(response.status_code)
Error:
[X509: NO_CERTIFICATE_OR_CRL_FOUND] no certificate or crl found (_ssl.c:4293)
Another method:
import sharepy
s = sharepy.connect("https://company.sharepoint.com/xxx/xxx/xxx/xxx/xxx.aspx",
                    username="username",
                    password="password")
Error:
Invalid Request: AADSTS90023: Invalid STS request
There seems to be a problem with the certificate in the first method, and researching the Invalid STS request does not bring up any solutions that work for me.
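One likely cause of the X509 error, assuming the .cer file was exported from Chrome in its default DER (binary) format: requests' verify= parameter expects a PEM-encoded certificate bundle, so the file may need converting first. A minimal sketch using only the standard library (the file names match the question; the helper name der_to_pem is mine):

```python
import ssl

def der_to_pem(der_path, pem_path):
    """Convert a DER-encoded certificate (Chrome's default export format)
    to PEM, the encoding that requests' verify= parameter expects."""
    with open(der_path, "rb") as f:
        der_bytes = f.read()
    with open(pem_path, "w") as f:
        f.write(ssl.DER_cert_to_PEM_cert(der_bytes))

# der_to_pem('certsharepoint.cer', 'certsharepoint.pem')
# then: requests.get(url, auth=HttpNegotiateAuth(), verify='certsharepoint.pem')
```

If the converted file opens with -----BEGIN CERTIFICATE----- it was indeed DER before; if the original already starts with that marker, the problem lies elsewhere.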
Another method:
import requests
from requests_ntlm import HttpNtlmAuth
r = requests.get("http://ntlm_protected_site.com",
                 auth=HttpNtlmAuth('domain\\username', 'password'))
Error:
403 FORBIDDEN
Using requests.get with headers like so:
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) '
                         'AppleWebKit/537.11 (KHTML, like Gecko) '
                         'Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'none',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}

auth = HttpNtlmAuth(username=username,
                    password=password)

responseObject = requests.get(url, auth=auth, headers=headers)
returns a 200 response, whereas using requests.get without headers returns a 403 Forbidden response. The returned HTML is of no use, however, because it is the HTML for Microsoft's "We can't sign you in" error page (screenshot omitted).

Moreover, removing the auth parameter (responseObject = requests.get(url, headers=headers)) changes nothing: it still returns a 200 response with the same "We can't sign you in" HTML, which suggests the credentials are never actually being used.
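One way to confirm that the request is being bounced to a sign-in page rather than reaching the site is to inspect the final URL after redirects. The helper name and the login.microsoftonline.com pattern below are my assumptions, based on where SharePoint Online typically redirects unauthenticated requests:

```python
import requests

def looks_like_login_redirect(resp):
    """Heuristic: a 200 whose final URL (after redirects) is on the
    Azure AD sign-in host means we got a login page, not the real site."""
    return resp.ok and "login.microsoftonline.com" in resp.url

# resp = requests.get(url, auth=auth, headers=headers)
# if looks_like_login_redirect(resp):
#     print("Got the sign-in page, not the SharePoint page:", resp.url)
```

Checking resp.url (or resp.history) this way makes the failure mode explicit instead of having to eyeball the returned HTML.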
If doing this interactively, try using Selenium (https://selenium-python.readthedocs.io/) together with webdriver_manager (https://pypi.org/project/webdriver-manager/), which spares you from downloading the browser driver by hand. Selenium not only lets you authenticate to your tenant interactively, it also makes it possible to collect dynamic content that may require interaction after the page loads, such as pushing a button to reveal a table.
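A minimal sketch of that approach (requires pip install selenium webdriver-manager; the URL is a placeholder, and the imports live inside the function so the module loads even where those packages are not installed):

```python
def fetch_page_html(url, wait_seconds=60):
    """Open a real browser window, give the user time to sign in
    interactively, then return the rendered page source."""
    # requires: pip install selenium webdriver-manager
    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    # webdriver_manager downloads a matching chromedriver automatically
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # sign in by hand while the window is open
        return driver.page_source
    finally:
        driver.quit()

# html = fetch_page_html('https://company.sharepoint.com/xxx/xxx/xxx/xxx/xxx.aspx')
```

In a real script you would replace the fixed sleep with an explicit wait on an element that only appears after sign-in, but the fixed delay is the simplest way to get unblocked.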