Extracting CSS links using Beautiful Soup

I'm new to Beautiful Soup and I'd like to extract the CSS and JS links of a website using it. So far, I've succeeded but with a small flaw.

from bs4 import BeautifulSoup
import urllib.request

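url = "http://www.something.com"
page = urllib.request.urlopen(url)

# Name the parser explicitly so results don't depend on which parsers are installed
soup = BeautifulSoup(page.read(), "html.parser")
for link in soup.find_all('link'):      # Lists out CSS links
    print(link.get('href'))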

With the snippet above, I get all the links to the CSS files. However, I also get other links, such as the favicon. Is there any way to filter the results down to only the stylesheets?

Also, for extracting the JS, if I run a simple find_all on the 'script' tag, I get the JS links as well as any JS written directly within script tags, in a very untidy manner. If I run a loop similar to my CSS one,

for link in soup.find_all('script'):        # Lists out all JS links
    print(link.get('src'))

I get the links, but not the JS written directly within script tags. I'm pretty sure there's a better way to extract it; I'm just a little confused. I had a look at an existing question on href extraction, but it didn't help me much.

I'm trying to make the code generic for all or most websites I try it with. While this has worked for the sites I've used so far, some sites use 'link' for things other than CSS links. So if you have a more generic logic or method I could use to retrieve the CSS links, JS links, and JS code of a website, I'd greatly appreciate it!

Thanks!

asked Oct 30 '25 17:10 by Izy-


1 Answer

You can pass extra keyword arguments to find_all to filter the matched tags by their attributes.

Try:

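import re

soup.find_all('link', rel="stylesheet")        # only <link> tags whose rel is "stylesheet"
soup.find_all('script', src=re.compile(".*"))  # only <script> tags that have a src attribute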
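
To put the two filters together, here is a minimal sketch that also collects the inline JS separately, since the question asks for the code as well as the links. The sample HTML, the extract_assets helper, and the file paths in it are hypothetical and only there so the example runs without a network request. It uses src=True, which as far as I know behaves the same as the regex above (match any tag that has the attribute).

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, used so the sketch runs without network access.
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="/static/main.css">
<link rel="icon" href="/favicon.ico">
<script src="/static/app.js"></script>
<script>console.log("inline");</script>
</head><body></body></html>
"""


def extract_assets(html):
    """Return (stylesheet hrefs, external script srcs, inline script bodies)."""
    soup = BeautifulSoup(html, "html.parser")

    # Only <link> tags marked as stylesheets that actually carry an href.
    css_links = [
        link["href"]
        for link in soup.find_all("link", rel="stylesheet", href=True)
    ]

    # Only <script> tags that reference an external file.
    js_links = [script["src"] for script in soup.find_all("script", src=True)]

    # <script> tags without src hold inline code; skip empty ones.
    inline_js = [
        str(script.string)
        for script in soup.find_all("script")
        if not script.has_attr("src") and script.string
    ]
    return css_links, js_links, inline_js


css_links, js_links, inline_js = extract_assets(SAMPLE_HTML)
print(css_links)
print(js_links)
print(inline_js)
```

For a real site you would pass page.read() from the question's code to extract_assets instead of SAMPLE_HTML. Note that the returned hrefs and srcs are whatever the page contains, so they may be relative paths.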

answered Nov 01 '25 08:11 by 301_Moved_Permanently