I'm trying to get the content of a website, but my requests return a 403 error.
After searching, I found the request headers in the browser's Network > Headers panel and tried sending them with my GET request.
from bs4 import BeautifulSoup as bs
import requests
url = "https://clutch.co/us/agencies/digital-marketing"
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"}
# Also tried the "Referer", "sec-ch-ua-platform" and "Origin" headers, but nothing changed.
response = requests.get(url, headers=HEADERS)
print("RESULT:", response)
But the result didn't change: I still get a 403 response.
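To make the failure easier to inspect than the bare `Response` repr, a small debugging helper along these lines can summarize the status and the `Server` header. This is only a sketch: `summarize_response` is a hypothetical name, not part of `requests`, and the wording of the verdicts is an assumption chosen for illustration.

```python
from typing import Mapping


def summarize_response(status_code: int, headers: Mapping[str, str]) -> str:
    """Return a one-line summary of an HTTP response for debugging."""
    # The Server header, when present, names the software that answered.
    server = headers.get("Server", "unknown")
    if status_code == 403:
        verdict = "forbidden: the server refused the request"
    elif 200 <= status_code < 300:
        verdict = "ok"
    else:
        verdict = "unexpected status"
    return f"{status_code} ({verdict}), Server: {server}"


# Usage with requests (assumed installed), mirroring the snippet above:
#   response = requests.get(url, headers=HEADERS)
#   print(summarize_response(response.status_code, response.headers))
```

The helper takes plain values rather than a `Response` object, so it can be tried with any mapping of headers.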
You can try loading the page from the Google cache instead of requesting it directly:
import requests
from bs4 import BeautifulSoup
url = "https://clutch.co/us/agencies/digital-marketing"
cache_URL = "https://webcache.googleusercontent.com/search?q=cache:"
def get_data(link):
    # Request the cached copy of the page with a mobile browser User-Agent.
    hdr = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Mobile Safari/537.36"
    }
    req = requests.get(cache_URL + link, headers=hdr)
    content = req.text
    return content

soup = BeautifulSoup(get_data(url), 'html.parser')
# Each company name sits in an <h3 class="company_info"> element.
for h3 in soup.select('h3.company_info'):
    print(h3.get_text(strip=True))
Prints:
WebFX
Ignite Visibility
SmartSites
Thrive Internet Marketing Agency
Lilo Social
NEWMEDIA.COM
Funnel Boost Media
Direct Online Marketing
SeedX Inc.
Impactable
...
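The snippet above parses whatever comes back, even if the cache request itself fails. A more defensive variant could look like the sketch below. It swaps `requests` for the standard library's `urllib` so that it runs without extra packages; `build_cache_url` and `fetch_cached` are hypothetical helper names, and percent-encoding the target link is an assumption about how to keep a query string in the link from mixing with the cache URL's own. The cache may not hold a copy of every page, or may not be reachable at all, so the caller should be prepared for `None`.

```python
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional

CACHE_PREFIX = "https://webcache.googleusercontent.com/search?q=cache:"


def build_cache_url(link: str) -> str:
    """Build the cache URL for a page, percent-encoding the target link."""
    # Keep ":" and "/" readable; encode "?", "&" and "=" so they stay
    # part of the target link instead of the cache query string.
    return CACHE_PREFIX + urllib.parse.quote(link, safe=":/")


def fetch_cached(link: str, user_agent: str, timeout: float = 10.0) -> Optional[str]:
    """Return the cached HTML as text, or None if the request fails."""
    request = urllib.request.Request(
        build_cache_url(link), headers={"User-Agent": user_agent}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as err:
        # Raised for 4xx/5xx answers, e.g. a 403 or 404 from the cache.
        print(f"Cache request failed with HTTP {err.code}")
    except urllib.error.URLError as err:
        # Raised for DNS, connection, or similar network-level failures.
        print(f"Cache request failed: {err.reason}")
    return None
```

The returned text can be passed to `BeautifulSoup(..., 'html.parser')` exactly as before, after checking that it is not `None`.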