
Scraping multiple paginated links with BeautifulSoup and Requests

Python beginner here. I'm trying to scrape all products from one category on dabs.com. I've managed to scrape all products on a given page, but I'm having trouble iterating over all the paginated links.

Right now, I've tried to isolate all the pagination buttons with the span class="page-list", but even that isn't working. Ideally, I would like the crawler to keep clicking "Next" until it has scraped all products on all pages. How can I do this?

I'd really appreciate any input.

from bs4 import BeautifulSoup
import requests

base_url = "http://www.dabs.com"
page_array = []

def get_pages():
    html = requests.get(base_url)
    soup = BeautifulSoup(html.content, "html.parser")

    # "class" is a reserved word, so BeautifulSoup expects the class_ keyword argument
    page_list = soup.find_all('span', class_="page-list")
    pages = page_list[0].find_all('a')

    for page in pages:
        page_array.append(page.get('href'))

def scrape_page(page):
    # fetch the page that was passed in, not the base URL
    html = requests.get(page)
    soup = BeautifulSoup(html.content, "html.parser")
    product_table = soup.find_all("table")
    products = product_table[0].find_all("tr")

    # skip the header row if there is one
    if len(products) > 1:
        products = products[1:]

    for row in products:
        cells = row.find_all('td')
        data = {
            'description': cells[0].get_text(),
            'price': cells[1].get_text()
        }
        print(data)

get_pages()
for page in page_array:
    scrape_page(base_url + page)
1 Answer

Their next-page button has a title of "Next", so you could do something like:

import requests
from bs4 import BeautifulSoup as bs

# the URL needs a scheme, otherwise requests raises MissingSchema
url = 'http://www.dabs.com/category/computing/11001/'
base_url = 'http://www.dabs.com'

r = requests.get(url)

soup = bs(r.text, 'html.parser')
elm = soup.find('a', {'title': 'Next'})

next_page_link = base_url + elm['href']
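If it helps, here's a rough sketch of how you could combine that with the scraping logic from your question in a loop that keeps following the "Next" link until it runs out. It assumes the last page simply has no anchor with title="Next", and the starting category URL is just the one from the snippet above:

import requests
from bs4 import BeautifulSoup as bs

base_url = 'http://www.dabs.com'
# starting category URL; swap in whatever category you want to crawl
url = base_url + '/category/computing/11001/'

while url:
    r = requests.get(url)
    soup = bs(r.text, 'html.parser')

    # scrape the product rows on this page (same idea as scrape_page in the question)
    for row in soup.find_all('tr'):
        cells = row.find_all('td')
        if len(cells) >= 2:
            print({'description': cells[0].get_text(strip=True),
                   'price': cells[1].get_text(strip=True)})

    # follow the "Next" link; when it's missing we're on the last page, so stop
    elm = soup.find('a', {'title': 'Next'})
    url = base_url + elm['href'] if elm else None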

Hope that helps.
