
find_next not capturing all <div> instances

I am having an issue where not all instances are captured in a relatively simple BeautifulSoup scrape. This is what I am running:

from bs4 import BeautifulSoup as bsoup
import requests as reqs

home_test = "https://fbref.com/en/matches/033092ef/Northampton-Town-Lincoln-City-August-4-2018-League-Two"
away_test = "https://fbref.com/en/matches/ea736ad1/Carlisle-United-Northampton-Town-August-11-2018-League-Two"

page_to_parse = home_test

page = reqs.get(page_to_parse)
status_code = page.status_code
status_code = str(status_code)
parse_page = bsoup(page.content, 'html.parser')

find_stats = parse_page.find_all('div',id="team_stats_extra")
print(find_stats)
for stat in find_stats:
    add_stats = stat.find_next('div').get_text()
    print(add_stats)

The first print shows that the scrape captures the part of the page I'm after. The second print, however, contains only half of the instances from the first one. I haven't set any limit, so in theory it should pick up all of them.

I've already tested quite a few variants of find_next, find, and find_all, but the lookup inside the loop never returns all of them.

Results are always:

Northampton Lincoln City
12Fouls13
6Corners1
7Crosses2
89Touches80

Whereas it should return the following instead:

Northampton Lincoln City
12Fouls13
6Corners1
7Crosses2
89Touches80

Northampton Lincoln City
2Offsides2
9Goal Kicks15
32Throw Ins24
18Long Balls23
thamy asked Dec 20 '25 19:12



2 Answers

parse_page.find_all returns a list with one item, the Tag with id="team_stats_extra". The loop needs to iterate over its child elements:

find_stats = parse_page.find_all('div', id="team_stats_extra")
all_stats = find_stats[0].find_all('div', recursive=False)
for stat in all_stats:
    print(stat.get_text())

If you have multiple tables, use two loops:

find_stats = parse_page.find_all('div', id="team_stats_extra")
for stats in find_stats:
    all_stats = stats.find_all('div', recursive=False)
    for stat in all_stats:
        print(stat.get_text())
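
To see the difference without hitting the site, here is a minimal sketch on a small inline document. The HTML is a hypothetical, simplified stand-in for the page's markup (one container div holding two group divs), not the real fbref source, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page markup (not the real HTML).
html = """
<div id="team_stats_extra">
  <div><div>12</div><div>Fouls</div><div>13</div></div>
  <div><div>2</div><div>Offsides</div><div>2</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", id="team_stats_extra")

# find_next returns a single element: the first <div> that follows
# the container in document order, i.e. only the first group.
first_only = container.find_next("div").get_text()

# find_all(..., recursive=False) returns every direct child <div>,
# so both groups are collected.
groups = [child.get_text() for child in container.find_all("div", recursive=False)]

print(first_only)
print(groups)
```

In this sketch, first_only holds only the text of the first group, while groups has one entry per direct child, which matches the behaviour described in the question.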
Guy answered Dec 23 '25 08:12



find_stats = parse_page.find_all('div',id="team_stats_extra") actually returns only one block, so the loop performs only one iteration, and find_next('div') returns just the first div that follows that block.

You can change the way the div blocks are selected with:

find_stats = parse_page.select('div#team_stats_extra > div')

print(len(find_stats))  # >>> returns 2

for stat in find_stats:
    add_stats = stat.get_text()
    print(add_stats)

To explain the selector select('div#team_stats_extra > div'), it is the same as:

  • find the div block with the id team_stats_extra
  • and select all direct children that are div
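
As a rough illustration of those two steps, the sketch below compares the child combinator (>) with a plain descendant selector and with the find / find_all(recursive=False) approach. The inline HTML is a hypothetical, simplified stand-in for the page, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for the page markup (not the real HTML).
html = """
<div id="team_stats_extra">
  <div><div>7</div><div>Crosses</div><div>2</div></div>
  <div><div>9</div><div>Goal Kicks</div><div>15</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "A > B" matches only B elements that are direct children of A.
direct_children = soup.select("div#team_stats_extra > div")

# "A B" (descendant combinator) matches B at any depth below A,
# so it also picks up the inner cell divs.
all_descendants = soup.select("div#team_stats_extra div")

# The same direct children, obtained with find and find_all.
via_find = soup.find("div", id="team_stats_extra").find_all("div", recursive=False)

print(len(direct_children), len(all_descendants))
print([tag.get_text() for tag in direct_children])
```

Dropping the > would also return every nested cell div, so the child combinator is what keeps the result to one entry per group.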
PRMoureu answered Dec 23 '25 09:12
