I'm trying to get a website's HTML in my script so I can scrape it later, but I'm running into problems: for some reason I'm only getting part of the page's HTML when I request it.
First I tried requesting it with the requests library. When that didn't work, I tried adding some headers and sending them along with the request, but I got confused about cookies. Do I need to send those, and what should I use: a requests Session or a basic request?
Link to the website: https://stips.co.il/explore
Eventually I ended up with this function, which doesn't really get me what I want:
import requests

def get_page_html():
    link = 'https://stips.co.il/explore'
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'en-US,en;q=0.9',
        'Cache-Control': 'max-age=0',
        'Connection': 'keep-alive',
        'Host': 'stips.co.il',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
    }
    response = requests.post(link, headers=headers)
    return response.text
As I explained, the result I got was only part of the page.
It seems to me that the page must be loading its content dynamically. The solution I've found for this on other projects is to use the selenium module to load the page in a browser object and then grab the page source after interacting with the page in a specific way. An example you can mess around with would look something like this:
from selenium import webdriver

browser = webdriver.Chrome()  # requires ChromeDriver to be installed and on your PATH
browser.implicitly_wait(10)  # probably unnecessary, just makes sure pages you visit fully load
browser.get('https://stips.co.il/explore')

while True:
    input('Press Enter to print HTML')
    html = browser.page_source
    print(html)
This will let you see how the HTML changes in response to what you do on the page. Once you know which buttons you need to click, you can locate those elements and call .click() on them automatically from within the program. And once your script is scraping all the data you need, you can run selenium in headless mode and it won't even pop up a window on your screen; it all happens behind the scenes.