The code below is what I have so far, but it only pulls data for the first 25 items, i.e. the items that appear on the page before scrolling down for more:
import requests
from bs4 import BeautifulSoup
import time
import pandas as pd
start_time = time.time()
s = requests.Session()
# Get the URL and extract the content
response = s.get('https://www.linkedin.com/jobs/search?keywords=It%20Business%20Analyst&location=Boston%2C%20Massachusetts%2C%20United%20States&geoId=102380872&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0')
soup = BeautifulSoup(response.text, 'html.parser')
# Locate the <ul> that holds the job cards and pull out each field
items = soup.find('ul', {'class': 'jobs-search__results-list'})
job_titles = [i.text.strip('\n ') for i in items.find_all('h3', {'class': 'base-search-card__title'})]
job_companies = [i.text.strip('\n ') for i in items.find_all('h4', {'class': 'base-search-card__subtitle'})]
job_locations = [i.text.strip('\n ') for i in items.find_all('span', {'class': 'job-search-card__location'})]
job_links = [i["href"].strip('\n ') for i in items.find_all('a', {'class': 'base-card__full-link'})]
# Count how often each title, company, and location appears
a = pd.DataFrame({'Job Titles': job_titles})
b = pd.DataFrame({'Job Companies': job_companies})
c = pd.DataFrame({'Job Locations': job_locations})
value_counts1 = a['Job Titles'].value_counts()
value_counts2 = b['Job Companies'].value_counts()
value_counts3 = c['Job Locations'].value_counts()
# Format each count as "value - count"
l1 = [f"{key} - {value_counts1[key]}" for key in value_counts1.keys()]
l2 = [f"{key} - {value_counts2[key]}" for key in value_counts2.keys()]
l3 = [f"{key} - {value_counts3[key]}" for key in value_counts3.keys()]
# Combine into a single DataFrame with one column per field
data = l1, l2, l3
df = pd.DataFrame(
    data, index=['Job Titles', 'Job Companies', 'Job Locations'])
df = df.T
print(df)
print("--- %s seconds ---" % (time.time() - start_time))
I would like to pull data for more than the first 25 items; is there an efficient way to do this?
Inspect the page to find the container that holds the desired data; then you can scrape the infinite-scroll page with the Selenium web driver, using window.scrollTo() to load more results.
For more details, see crawl site that has infinite scrolling using python or web-scraping-infinite-scrolling-with-selenium.
The best way is to create a function to scroll down:
import time

# Scroll function.
# This function takes two arguments: the driver that is being used and a timeout.
# The driver is used to scroll and the timeout is the pause that lets the page load.
def scroll(driver, timeout):
    scroll_pause_time = timeout
    # Get the initial scroll height
    last_height = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to the bottom of the page
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait for the new results to load
        time.sleep(scroll_pause_time)
        # Calculate the new scroll height and compare it with the last one
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            # The heights match, so no new content loaded; exit the loop
            break
        last_height = new_height
Then you can use the scroll function to scroll the desired page:
import time
from seleniumwire import webdriver

# Create a new instance of the Firefox driver
driver = webdriver.Firefox()
# Navigate to the URL you want to scrape
driver.get('your_url')
# Use the "scroll" function to scroll the page, pausing 5 seconds between scrolls
scroll(driver, 5)
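To tie this back to your original code: once the page has been fully scrolled, you can pass the rendered HTML to BeautifulSoup and reuse the same selectors. Here is a minimal sketch, assuming the LinkedIn URL and CSS classes from your question still apply (plain selenium works here too; seleniumwire is only needed if you want to inspect network traffic):
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()
driver.get('https://www.linkedin.com/jobs/search?keywords=It%20Business%20Analyst&location=Boston%2C%20Massachusetts%2C%20United%20States&geoId=102380872&trk=public_jobs_jobs-search-bar_search-submit&position=1&pageNum=0')

# Scroll until no new results load, using the scroll() function defined above
scroll(driver, 5)

# Parse the fully rendered page with the same selectors as in the question
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

items = soup.find('ul', {'class': 'jobs-search__results-list'})
job_titles = [i.text.strip('\n ') for i in items.find_all('h3', {'class': 'base-search-card__title'})]
job_companies = [i.text.strip('\n ') for i in items.find_all('h4', {'class': 'base-search-card__subtitle'})]
job_locations = [i.text.strip('\n ') for i in items.find_all('span', {'class': 'job-search-card__location'})]

# All loaded jobs, not just the first 25
df = pd.DataFrame({'Job Titles': job_titles,
                   'Job Companies': job_companies,
                   'Job Locations': job_locations})
print(df)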