I have to download source code of a website like www.humkinar.pk in simple HTML form. Content on site is dynamically generated. I have tried driver.page_source function of selenium but it does not download page completely such as image and javascript files are left. How can I download complete page. Is there any better and easy solution in python available?
I know your question is about selenium, but from my experience I am telling you that selenium is recommended for testing and NOT for scraping. It is very SLOW. Even with multiple instances of headless browsers (chrome for your situation), the result is delaying too much.
Python 2, 3
This trio will help you a lot and save you a bunch of time.
Do not use the parser of dryscrape, it is very SLOW and buggy. For this situation, one can use BeautifulSoup with the
lxmlparser. Use dryscrape to scrape Javascript generated content, plain HTML and images.If you are scraping a lot of links simultaneously, i highly recommend using something like ThreadPoolExecutor
from dryscrape import start_xvfb
from dryscrape.session import Session
from dryscrape.mixins import WaitTimeoutError
from bs4 import BeautifulSoup
def new_session():
session = Session()
session.set_attribute('auto_load_images', False)
session.set_header('User-Agent', 'SomeUserAgent')
return session
def session_reset(session):
return session.reset()
def session_visit(session, url, check):
session.visit(url)
# ensure that the market table is visible first
if check:
try:
session.wait_for(lambda: session.at_css(
'SOME#CSS.SELECTOR.HERE'))
except WaitTimeoutError:
pass
body = session.body()
session_reset(session)
return body
# start xvfb in case no X is running (server)
start_xvfb()
SESSION = new_session()
URL = 'https://stackoverflow.com/questions/45796411/download-entire-webpage-html-image-js-by-selenium-python/45824047#45824047'
CHECK = False
BODY = session_visit(SESSION, URL, CHECK)
soup = BeautifulSoup(BODY, 'lxml')
RESULT = soup.find('div', {'id': 'answer-45824047'})
print(RESULT)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With