My goal is to get a list of the names of all the new items that have been posted on https://www.prusaprinters.org/prints during the full 24 hours of a given day.
Through a bit of reading I've learned that I should be using Selenium because the site I'm scraping is dynamic (loads more objects as the user scrolls).
The trouble is, I can't seem to get anything but an empty list from webdriver.find_elements_by_* with any of the suffixes listed at https://selenium-python.readthedocs.io/locating-elements.html.
When I inspect the element whose title I want (see screenshot), I see class="name" and class="clamp-two-lines", but I can't seem to return a list of all the elements on the page with either the "name" class or the "clamp-two-lines" class.
Here's the code I have so far (the lines commented out are failed attempts):
from timeit import default_timer as timer
start_time = timer()
print("Script Started")
import bs4, selenium, smtplib, time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Chrome(r'D:\PortableApps\Python Peripherals\chromedriver.exe')
url = 'https://www.prusaprinters.org/prints'
driver.get(url)
# foo = driver.find_elements_by_name('name')
# foo = driver.find_elements_by_xpath('name')
# foo = driver.find_elements_by_class_name('name')
# foo = driver.find_elements_by_tag_name('name')
# foo = [i.get_attribute('href') for i in driver.find_elements_by_css_selector('[id*=name]')]
# foo = [i.get_attribute('href') for i in driver.find_elements_by_css_selector('[class*=name]')]
# foo = [i.get_attribute('href') for i in driver.find_elements_by_css_selector('[id*=clamp-two-lines]')]
# foo = WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, '//*[@id="printListOuter"]//ul[@class="clamp-two-lines"]/li')))
print(foo)
driver.quit()
print("Time to run: " + str(round(timer() - start_time,4)) + "s")
My research:
To get the text, wait for visibility of the elements. The CSS selector for the titles is #printListOuter h3:
titles = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '#printListOuter h3')))
for title in titles:
    print(title.text)
Shorter version:
wait = WebDriverWait(driver, 10)
titles = [title.text for title in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '#printListOuter h3')))]
This is the XPath for the item names:
.//div[@class='print-list-item']/div/a/h3/span
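To cover the original goal of a full day's worth of items: the list keeps loading more entries as you scroll, so one possible approach is to scroll in a loop and re-read the titles until nothing new appears. This is only a rough sketch reusing the driver, imports, and the time module from the script above; the scroll-and-sleep loop and its stop condition are assumptions, not tested against the live site:

wait = WebDriverWait(driver, 10)
seen = []
while True:
    # Grab the titles that are currently rendered.
    titles = [t.text for t in wait.until(
        EC.visibility_of_all_elements_located((By.CSS_SELECTOR, '#printListOuter h3')))]
    if len(titles) == len(seen):
        break  # scrolling loaded nothing new, so stop
    seen = titles
    # Scroll to the bottom so the site fetches the next batch of prints.
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # crude pause while the new items render
print(seen)

Note that this loop only stops once scrolling stops producing new titles; to limit the results to a specific 24-hour window you would still need to check each item's posting date and cut the list off there.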