
Scraping web data using PhantomJS and Selenium

I am using PhantomJS with Selenium to scrape data from the link given in the snippet. When I extract the data with element.text from a PhantomJS web element, some values in between come back blank, whereas with chromedriver I am able to scrape all of the data.

I can only use a headless browser, since I am running this on an AWS Linux server.

How can I scrape all the data, without missing values, using PhantomJS? Any help is appreciated. Thank you in advance.

The snippet is attached below:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.common.exceptions import NoSuchElementException

# Start from the default PhantomJS capabilities and override the user agent.
dcap = dict(DesiredCapabilities.PHANTOMJS)
dcap["phantomjs.page.settings.userAgent"] = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53 "
    "(KHTML, like Gecko) Chrome/29.0.1547.57 Safari/537.36")
driver = webdriver.PhantomJS(
    desired_capabilities=dcap,
    service_args=['--ignore-ssl-errors=true', '--load-images=false'])
driver.get("http://www.myntra.com/Dresses/Casual-Collection/Casual-Collection-by-Debenhams-Purple-Floral-Print-Maxi-Dress/348207/buy")
driver.implicitly_wait(5)
try:
    # Open the size chart, then print the text of every cell.
    driver.find_element_by_class_name("size-buttons-show-size-chart").click()
    driver.implicitly_wait(10)
    div_s = driver.find_elements_by_class_name("size-chart-cell")
    # div_s = driver.find_elements_by_xpath("""//*[@id="mountRoot"]/div/div/div/div[3]/div/div[2]/div[1]/table/tbody/tr""")
    for s in div_s:
        print(s.text)
except NoSuchElementException:
    print("NoSuchElementException")

Modified output:

Size XS S M L XL XXL 3XL
Brand Size UK10 UK12 UK14 UK16 UK18 UK20 UK22
Hips (INCHES) 36 38 40 42.5 45.25 48 50.75
31 41.75 # most elements in this row are missing (could not be scraped)
Bust (INCHES) 34.25 36.25 38 40 43.75 46.5 49.25

The actual table is: Size Chart

Dinu Duke asked Mar 13 '26 19:03



1 Answer

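Interesting problem. Reading textContent instead of .text actually works in this case. Selenium's .text returns only the text the driver considers visible, while textContent returns the node's text regardless of visibility, which is presumably why the blanks disappear: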

for s in div_s:
    print(str(s.get_attribute("textContent")))
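Since textContent returns the raw text of the node, it can carry newlines and runs of spaces from the markup. As a minimal sketch, assuming that is unwanted here, a small helper can collapse the whitespace of each cell. The name cell_texts is hypothetical and not part of Selenium; it only relies on each element exposing get_attribute(), as Selenium web elements do.

```python
def cell_texts(elements):
    """Return the whitespace-normalized textContent of each element.

    `elements` is any iterable of objects with a get_attribute() method,
    for example the list returned by find_elements_by_class_name().
    """
    texts = []
    for element in elements:
        # get_attribute() may return None, so fall back to an empty string.
        raw = element.get_attribute("textContent") or ""
        # split() with no arguments drops leading/trailing whitespace and
        # collapses internal runs (spaces, tabs, newlines) into single gaps.
        texts.append(" ".join(raw.split()))
    return texts
```

With the list from the question, this would be used as: for text in cell_texts(div_s): print(text)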

Differences between .text, textContent and other properties are perfectly described here:

  • innerText vs innerHtml vs label vs text vs textContent vs outerText
  • Difference between text and textContent properties

Note that there is no point in calling implicitly_wait() multiple times. It does not act like time.sleep(): it does not pause for the given amount of time on the spot. Instead, it only instructs the driver to set the "implicit wait" to the specified number of seconds:

An implicit wait is to tell WebDriver to poll the DOM for a certain amount of time when trying to find an element or elements if they are not immediately available.

A better way to wait in this case would be to use Explicit Waits.

alecxe answered Mar 15 '26 09:03



