The code below is what I have so far, but it only pulls data for the first 25 items, i.e. the items that appear on the page before scrolling down for more:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
start_time = time.time()
s = requests.Session()
# Get the URL and extract the content
response = s.get('https://www.linkedin.com/jobs/search?keywords=It%20Business%20Analyst&location=Boston%2C%20Massachusetts%2C%20United%20States&geoId=102380872&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0')
soup = BeautifulSoup(response.text, 'html.parser')
# Locate the <ul> that holds the job cards and pull out each field
items = soup.find('ul', {'class': 'jobs-search__results-list'})
job_titles = [i.text.strip('\n ') for i in items.find_all('h3', {'class': 'base-search-card__title'})]
job_companies = [i.text.strip('\n ') for i in items.find_all('h4', {'class': 'base-search-card__subtitle'})]
job_locations = [i.text.strip('\n ') for i in items.find_all('span', {'class': 'job-search-card__location'})]
job_links = [i["href"].strip('\n ') for i in items.find_all('a', {'class': 'base-card__full-link'})]
# Count how often each title, company, and location appears
a = pd.DataFrame({'Job Titles': job_titles})
b = pd.DataFrame({'Job Companies': job_companies})
c = pd.DataFrame({'Job Locations': job_locations})
value_counts1 = a['Job Titles'].value_counts()
value_counts2 = b['Job Companies'].value_counts()
value_counts3 = c['Job Locations'].value_counts()
# Format each count as "value - count"
l1 = [f"{key} - {value_counts1[key]}" for key in value_counts1.keys()]
l2 = [f"{key} - {value_counts2[key]}" for key in value_counts2.keys()]
l3 = [f"{key} - {value_counts3[key]}" for key in value_counts3.keys()]
# Combine into a single DataFrame with one column per field
data = l1, l2, l3
df = pd.DataFrame(
    data, index=['Job Titles', 'Job Companies', 'Job Locations'])
df = df.T
print(df)
print("--- %s seconds ---" % (time.time() - start_time))
I would like to pull data for more than the first 25 items; is there an efficient way to do this?
Inspect the page to find the container that holds the desired data; then you can scrape the infinite-scroll page with the Selenium web driver, using window.scrollTo() to load more results.
For more details, see crawl site that has infinite scrolling using python or web-scraping-infinite-scrolling-with-selenium.
The best way is to create a function to scroll down:
import time

# Scroll function.
# This function takes two arguments: the driver that is being used and a timeout.
# The driver is used to scroll and the timeout is the pause that lets the page load.
def scroll(driver, timeout):
    scroll_pause_time = timeout
    # Get the initial scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait for the new results to load
        time.sleep(scroll_pause_time)
        # Calculate the new scroll height and compare it with the last one
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # The heights match, so no new content loaded; exit the loop
            break
        last_height = new_height
Then you can use the scroll function to scroll the desired page:
import time
from seleniumwire import webdriver

# Create a new instance of the Firefox driver
driver = webdriver.Firefox()
# Navigate to the URL you want to scrape
driver.get('your_url')
# Use the "scroll" function to scroll the page, pausing 5 seconds between scrolls
scroll(driver, 5)
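To tie this back to your original code: once the page has been fully scrolled, you can pass the rendered HTML to BeautifulSoup and reuse the same selectors. Here is a minimal sketch, assuming the LinkedIn URL and CSS classes from your question still apply (plain selenium works here too; seleniumwire is only needed if you want to inspect network traffic):
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://www.linkedin.com/jobs/search?keywords=It%20Business%20Analyst&location=Boston%2C%20Massachusetts%2C%20United%20States&geoId=102380872&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0')

# Scroll until no new results load, using the scroll() function defined above
scroll(driver, 5)

# Parse the fully rendered page with the same selectors as in the question
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

items = soup.find('ul', {'class': 'jobs-search__results-list'})
job_titles = [i.text.strip('\n ') for i in items.find_all('h3', {'class': 'base-search-card__title'})]
job_companies = [i.text.strip('\n ') for i in items.find_all('h4', {'class': 'base-search-card__subtitle'})]
job_locations = [i.text.strip('\n ') for i in items.find_all('span', {'class': 'job-search-card__location'})]

# All loaded jobs, not just the first 25
df = pd.DataFrame({'Job Titles': job_titles,
                   'Job Companies': job_companies,
                   'Job Locations': job_locations})
print(df)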