Scrape URLs using BeautifulSoup in Python 3

I tried this code, but the list of URLs stays empty. No error message, nothing.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

req = Request('https://www.metacritic.com/browse/movies/genre/date?page=0', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()

soup = BeautifulSoup(html_page, features="xml")
links = []
for link in soup.findAll('a', attrs={'href': re.compile("^https://www.metacritic.com/movie/")}):
    links.append(link.get('href'))

print(links)

I want to scrape all the URLs that start with "https://www.metacritic.com/movie/" that are found in the given URL "https://www.metacritic.com/browse/movies/genre/date?page=0".

What am I doing wrong?

TAN-C-F-OK asked Jan 21 '26 10:01

2 Answers

First, you should use the standard library parser "html.parser" instead of "xml" to parse the page content. The page is HTML, not XML, and the HTML parser copes better with broken markup (see Beautiful Soup findAll doesn't find them all).

Then take a look at the source code of the page you are parsing. The elements you want to find look like this: <a href="/movie/woman-at-war">. The hrefs are relative paths, so a regex anchored on the full domain never matches anything.

So change your code like this:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re

req = Request('https://www.metacritic.com/browse/movies/genre/date?page=0', headers={'User-Agent': 'Mozilla/5.0'})
html_page = urlopen(req).read()

soup = BeautifulSoup(html_page, 'html.parser')  # parse as HTML, not XML
links = []
# the hrefs on the page are relative, so match on the path prefix only
for link in soup.findAll('a', attrs={'href': re.compile("^/movie/")}):
    links.append(link.get('href'))

print(links)
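
Since the scraped hrefs are relative paths, you may still want the absolute URLs the question asks for. A minimal sketch using urllib.parse.urljoin on the links list built above (the base URL is my assumption about the site root):

from urllib.parse import urljoin

# the scraped hrefs are relative ("/movie/..."), so join them with the site root
base = 'https://www.metacritic.com'
absolute_links = [urljoin(base, href) for href in links]
print(absolute_links)
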
leiropi answered Jan 23 '26 00:01


Your code itself is sound.

The list stays empty because no href on that page matches your pattern: the links are relative paths, not absolute URLs. Try re.compile("^/movie/") instead.
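
If you would rather avoid the regex entirely, the same prefix match can be written as a CSS attribute selector; a small sketch, assuming a BeautifulSoup version (4.7+) whose select() supports the ^= prefix operator:

# alternative without re: select anchors whose href starts with /movie/
links = [a.get('href') for a in soup.select('a[href^="/movie/"]')]
print(links)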

jorijnsmit answered Jan 22 '26 23:01


