I am having an issue where not all instances are captured by a relatively simple BeautifulSoup scrape. This is what I am running:
from bs4 import BeautifulSoup as bsoup
import requests as reqs
home_test = "https://fbref.com/en/matches/033092ef/Northampton-Town-Lincoln-City-August-4-2018-League-Two"
away_test = "https://fbref.com/en/matches/ea736ad1/Carlisle-United-Northampton-Town-August-11-2018-League-Two"
page_to_parse = home_test
page = reqs.get(page_to_parse)
status_code = page.status_code
status_code = str(status_code)
parse_page = bsoup(page.content, 'html.parser')
find_stats = parse_page.find_all('div',id="team_stats_extra")
print(find_stats)
for stat in find_stats:
    add_stats = stat.find_next('div').get_text()
    print(add_stats)
If you look at the first print, the scrape captures the part of the page that I'm after. In the second print, however, half of the entries from the first one are missing. I haven't set any limit, so in theory it should pick up all of them.
I've already tested quite a few variants using find_next, find, or find_all, but the loop never returns all of them.
Results are always:
Northampton Lincoln City
12Fouls13
6Corners1
7Crosses2
89Touches80
But it should print the following instead:
Northampton Lincoln City
12Fouls13
6Corners1
7Crosses2
89Touches80
Northampton Lincoln City
2Offsides2
9Goal Kicks15
32Throw Ins24
18Long Balls23
parse_page.find_all returns a list with one item, the Tag with id="team_stats_extra". The loop needs to run over its child elements:
find_stats = parse_page.find_all('div', id="team_stats_extra")
all_stats = find_stats[0].find_all('div', recursive=False)
for stat in all_stats:
    print(stat.get_text())
If you have multiple tables, use two loops:
find_stats = parse_page.find_all('div', id="team_stats_extra")
for stats in find_stats:
    all_stats = stats.find_all('div', recursive=False)
    for stat in all_stats:
        print(stat.get_text())
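To see the difference without hitting the site, here is a minimal sketch on a made-up HTML fragment. It assumes the page nests its stat blocks the way the printed output suggests: one outer div with the id and two direct child divs, each holding further divs. The markup and the helper names `first_block_text` and `child_block_texts` are hypothetical and not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed to mimic the nesting on the real page.
SAMPLE_HTML = """
<div id="team_stats_extra">
  <div><div>12</div><div>Fouls</div><div>13</div></div>
  <div><div>2</div><div>Offsides</div><div>2</div></div>
</div>
"""


def first_block_text(html):
    """Mirror the approach from the question: find_next returns one element."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="team_stats_extra")
    # find_next('div') stops at the first div after the container,
    # which is only the first child block.
    return container.find_next("div").get_text()


def child_block_texts(html):
    """Return the text of every direct child div of the container."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="team_stats_extra")
    # recursive=False restricts the search to direct children,
    # so the inner value/label divs are not returned separately.
    return [child.get_text() for child in container.find_all("div", recursive=False)]


if __name__ == "__main__":
    print(first_block_text(SAMPLE_HTML))
    print(child_block_texts(SAMPLE_HTML))
```

On this fragment the first helper only sees the "Fouls" block, while the second returns both blocks, which matches the behaviour described in the question.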
find_stats = parse_page.find_all('div',id="team_stats_extra") actually returns only one block, so the next loop performs only one iteration.
You can select the div blocks with a CSS selector instead:
find_stats = parse_page.select('div#team_stats_extra > div')
print(len(find_stats)) # >>> returns 2
for stat in find_stats:
    add_stats = stat.get_text()
    print(add_stats)
To explain the selector 'div#team_stats_extra > div': it selects every div that is a direct child (>) of the div block with the id team_stats_extra.
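As a rough illustration of what the > combinator changes, the sketch below counts matches for the child selector and for the plain descendant selector on a made-up fragment. The markup is an assumption about how the page is nested, and `count_matches` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, assumed to mimic the nesting on the real page.
SAMPLE_HTML = """
<div id="team_stats_extra">
  <div><div>12</div><div>Fouls</div><div>13</div></div>
  <div><div>2</div><div>Offsides</div><div>2</div></div>
</div>
"""


def count_matches(html, selector):
    """Return how many elements a CSS selector matches in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(selector))


# '>' keeps only divs that are direct children of the container.
direct = count_matches(SAMPLE_HTML, "div#team_stats_extra > div")
# A space matches every div nested at any depth inside the container.
nested = count_matches(SAMPLE_HTML, "div#team_stats_extra div")

if __name__ == "__main__":
    print(direct, nested)
```

With the child combinator only the two stat blocks match; without it, the inner value and label divs are matched as well, so each number and label would be printed on its own.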