find_next not capturing all <div> instances

Question

I am having an issue where not all instances are captured in a relatively simple BeautifulSoup scrape. This is what I am running:

from bs4 import BeautifulSoup as bsoup
import requests as reqs

home_test = "https://fbref.com/en/matches/033092ef/Northampton-Town-Lincoln-City-August-4-2018-League-Two"
away_test = "https://fbref.com/en/matches/ea736ad1/Carlisle-United-Northampton-Town-August-11-2018-League-Two"

page_to_parse = home_test

page = reqs.get(page_to_parse)
status_code = page.status_code
status_code = str(status_code)
parse_page = bsoup(page.content, 'html.parser')

find_stats = parse_page.find_all('div',id="team_stats_extra")
print(find_stats)
for stat in find_stats:
    add_stats = stat.find_next('div').get_text()
    print(add_stats)

The first print shows that the scrape captures the part of the page I'm after. The second print, however, contains only half of the instances from the first one. I haven't set any limit, so in theory it should pick up all of them.

I've already tested quite a few variants of find_next, find, and find_all, but the lookup inside the loop never returns all of them.

Results are always:

Northampton Lincoln City
12Fouls13
6Corners1
7Crosses2
89Touches80

whereas it should return the following instead:

Northampton Lincoln City
12Fouls13
6Corners1
7Crosses2
89Touches80

Northampton Lincoln City
2Offsides2
9Goal Kicks15
32Throw Ins24
18Long Balls23

Guy · Accepted Answer

parse_page.find_all returns a list with one item, the Tag with id="team_stats_extra". The loop needs to iterate over its child elements:

find_stats = parse_page.find_all('div', id="team_stats_extra")
all_stats = find_stats[0].find_all('div', recursive=False)
for stat in all_stats:
    print(stat.get_text())

If you have multiple tables, use two loops:

find_stats = parse_page.find_all('div', id="team_stats_extra")
for stats in find_stats:
    all_stats = stats.find_all('div', recursive=False)
    for stat in all_stats:
        print(stat.get_text())
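
As a minimal offline sketch of the same idea, the snippet below runs the recursive=False lookup against a small inline HTML string instead of the live page. The markup is a simplified stand-in and only an assumption about the real page structure. The names html, soup, container, blocks, and texts are illustrative.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page markup (an assumption, not the real HTML).
html = (
    '<div id="team_stats_extra">'
    '<div><div>12</div><div>Fouls</div><div>13</div></div>'
    '<div><div>2</div><div>Offsides</div><div>2</div></div>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", id="team_stats_extra")

# recursive=False limits the search to direct children of the container.
blocks = container.find_all("div", recursive=False)
texts = [block.get_text() for block in blocks]
print(texts)

# Without recursive=False, every nested div is matched as well.
print(len(container.find_all("div")))

# Optional: a separator keeps the numbers and labels apart.
print(blocks[0].get_text(" ", strip=True))
```

Passing a separator to get_text is optional. It only makes output such as 12Fouls13 easier to read.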

PRMoureu · Answer

find_stats = parse_page.find_all('div',id="team_stats_extra") actually returns only one block, so the next loop performs only one iteration.
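
To illustrate why only the first block shows up, here is a small sketch on inline HTML. The markup is a simplified stand-in, so treat it as an assumption about the page rather than its real structure. find_all matches a single container, and find_next('div') returns just the first div that follows it in document order. The names html, soup, containers, and first_following are illustrative.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page markup (an assumption, not the real HTML).
html = (
    '<div id="team_stats_extra">'
    '<div><div>12</div><div>Fouls</div><div>13</div></div>'
    '<div><div>2</div><div>Offsides</div><div>2</div></div>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Only one element carries the id, so a loop over this list runs once.
containers = soup.find_all("div", id="team_stats_extra")
print(len(containers))

# find_next returns a single element: the first matching div after the
# container in document order, which here is its first inner block.
first_following = containers[0].find_next("div")
print(first_following.get_text())
```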

You can change how the div blocks are selected with:

find_stats = parse_page.select('div#team_stats_extra > div')

print(len(find_stats))  # >>> returns 2

for stat in find_stats:
    add_stats = stat.get_text()
    print(add_stats)

The selector select('div#team_stats_extra > div') means:

find the div block with the id team_stats_extra
and select all direct children that are div
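
The difference between the child combinator (>) and a plain descendant selector can be checked with a short sketch on inline HTML. The markup is again a simplified stand-in and an assumption about the page. The names html, soup, direct, and nested are illustrative.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the page markup (an assumption, not the real HTML).
html = (
    '<div id="team_stats_extra">'
    '<div><div>12</div><div>Fouls</div><div>13</div></div>'
    '<div><div>2</div><div>Offsides</div><div>2</div></div>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# "A > B" matches B only when it is a direct child of A.
direct = soup.select("div#team_stats_extra > div")

# "A B" (with a space) matches B at any depth below A.
nested = soup.select("div#team_stats_extra div")

print(len(direct), len(nested))
print([block.get_text() for block in direct])
```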


Tags:

python

html

beautifulsoup

web-scraping

thamy
