I am currently working on a web scraping project using Selenium in Python.
My code works as intended when the web driver runs in non-headless mode, but not when it runs in headless mode. For instance, when I try to extract text from a website, non-headless mode returns the text, while headless mode returns None. (I have included some code below for reference.)
First, I construct the webdriver with the following code (opts.headless is set to True or False to switch between headless and non-headless mode):
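from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

def getHeadlessDriver():
    opts = webdriver.ChromeOptions()
    # Set to True for headless mode, False for a visible browser window
    opts.headless = False
    driver = webdriver.Chrome(ChromeDriverManager().install(), options=opts)
    return driver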
Then, I use the find_elements_by_xpath function to extract text data from a website. Sample code is provided below:
driver = getHeadlessDriver()
feedbacks = driver.find_elements_by_xpath(
    "//div[contains(@class, 'LiveFeedbackSectionViewController__LiveFeedbackStatusItem-sc-1ahetk9-4 cUJPkM')]")
for feedback in feedbacks:
    print(feedback.text)
I did some googling to find an explanation for why headless mode does not work, but I am still not sure. From my understanding, headless mode "acts the same", just without a graphical user interface.
Could there be a problem with my code? Or does headless mode differ in ways other than lacking a graphical user interface?
Thank you.
If the website you are trying to scrape has dynamic elements rendered by JavaScript, you will need Xvfb.
sudo apt-get install -y xvfb
"Xvfb or X virtual framebuffer is a display server implementing the X11 display server protocol. In contrast to other display servers, Xvfb performs all graphical operations in virtual memory without showing any screen output."
In Python, there are two wrappers for Xvfb.
1- xvfbwrapper
pip install xvfbwrapper
Then add this to your Python file:
from xvfbwrapper import Xvfb
display = Xvfb()
display.start()
2- pyvirtualdisplay
pip install PyVirtualDisplay
And then in your code:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(1024, 768))
display.start()
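For context, here is a minimal sketch of how the virtual display and the driver might fit together. It assumes Selenium 4.6 or later (which can locate the Chrome driver on its own) and that Xvfb and PyVirtualDisplay are installed. The function names xvfb_available and scrape_texts are made up for illustration. Chrome is started in its normal, non-headless mode, and its window is drawn into the virtual display.

```python
import shutil


def xvfb_available():
    """Return True if an Xvfb executable is found on PATH."""
    return shutil.which("Xvfb") is not None


def scrape_texts(url, xpath):
    """Open url in regular Chrome inside a virtual display; return element texts."""
    # Imported here so this file still loads where the packages are missing.
    from pyvirtualdisplay import Display
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    display = Display(visible=0, size=(1024, 768))
    display.start()
    try:
        driver = webdriver.Chrome()  # not headless: it renders into Xvfb
        try:
            driver.get(url)
            return [el.text for el in driver.find_elements(By.XPATH, xpath)]
        finally:
            driver.quit()
    finally:
        # Always shut the virtual display down, even if scraping fails.
        display.stop()
```

Checking xvfb_available() before calling scrape_texts gives a clearer error on machines where the apt-get step was skipped.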
I can usually bypass this problem with time.sleep(10). However, I came across one particular website that I couldn't scrape with either time.sleep(10) or driver.implicitly_wait(10).
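As an aside, a fixed sleep can be replaced by polling until the content actually appears, which is roughly what Selenium's WebDriverWait does. The sketch below is a hand-rolled version in plain Python, not Selenium's own implementation, and the name poll_until is hypothetical.

```python
import time


def poll_until(fetch, timeout=10.0, interval=0.5):
    """Call fetch() repeatedly until it returns a truthy value.

    Returns that value, or None if timeout seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Possible use with an existing driver (By comes from
# selenium.webdriver.common.by; xpath is whatever you are looking for):
#   texts = poll_until(
#       lambda: [e.text for e in driver.find_elements(By.XPATH, xpath) if e.text]
#   )
```

This returns as soon as the text shows up instead of always waiting the full ten seconds. It does not help if the site serves different content to headless browsers.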
I think that the website has a system that checks the browser's user agent.
To get around this, I set a regular user agent on the headless browser, and it worked.
browser_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.30'
# options_edge is the browser options object (its creation is not shown here)
options_edge.add_argument(f'user-agent={browser_agent}')
You can get your user agent from websites like this: https://whatmyuseragent.com/ (not affiliated)
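For the Chrome setup in the question, the same idea might look like the sketch below. It assumes Selenium 4.6 or later. The helper names are made up for illustration, and the window size is an extra assumption: headless Chrome tends to start with a smaller default window, which can also change what a page renders. By default, headless Chrome's user agent contains the token "HeadlessChrome", which is one way a site can detect it.

```python
def headless_chrome_args(user_agent, width=1920, height=1080):
    """Build Chrome arguments for a headless session that sends user_agent."""
    return [
        "--headless",
        f"--window-size={width},{height}",
        f"--user-agent={user_agent}",
    ]


def looks_headless(user_agent):
    """Return True if the user agent string advertises headless Chrome."""
    return "HeadlessChrome" in user_agent


def make_headless_driver(user_agent):
    """Create a headless Chrome driver that reports the given user agent."""
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver

    opts = webdriver.ChromeOptions()
    for arg in headless_chrome_args(user_agent):
        opts.add_argument(arg)
    return webdriver.Chrome(options=opts)
```

After creating the driver, driver.execute_script("return navigator.userAgent") shows what the site actually sees, and looks_headless can confirm the override took effect.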