Extracting css links using Beautiful Soup

Question

I'm new to Beautiful Soup and I'd like to extract the CSS and JS links of a website using it. So far, I've succeeded but with a small flaw.

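from bs4 import BeautifulSoup
import urllib.request

url = "http://www.something.com"
page = urllib.request.urlopen(url)

# Name the parser explicitly; without it bs4 picks one itself and emits a warning.
soup = BeautifulSoup(page.read(), "html.parser")
for link in soup.find_all('link'):      # Lists every <link> tag, not only stylesheets
    print(link.get('href'))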

With the snippet above, I get all the links to the CSS files. However, I also get other links, such as the favicon. Is there a way to filter the results down to only the stylesheets?

Also, for extracting the JS: if I run a simple find_all on the 'script' tag, I get the JS links as well as any JS written directly inside script tags, in a very untidy form. If I run a loop similar to my CSS one,

for link in soup.find_all('script'):        #Lists out all JS links
    print(link.get('src'))

I get the links without the inline JS written inside script tags. I'm pretty sure there's a better way to extract it; I'm just a little confused. I've had a look at the href extraction link here, but it didn't help me much.

I'm trying to make the code generic enough for all or most websites. It has worked for the sites I've tried so far, but some sites use 'link' for things other than CSS links. So if you have a more generic logic or method I could use to retrieve the CSS links, the JS links, and the inline code of a website, I'd greatly appreciate it!

Thanks!

301_Moved_Permanently · Accepted Answer

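You can pass attribute filters as keyword arguments to find_all to narrow your query.

Try:

import re  # needed only for the re.compile variant below

soup.find_all('link', rel="stylesheet")        # only <link> tags whose rel is "stylesheet"
soup.find_all('script', src=re.compile(".*"))  # only <script> tags with a src; src=True also works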
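
For anyone who wants to try these filters without fetching a live site, here is a minimal sketch. The HTML string and the variable names (HTML, css_links, js_links, inline_js) are made up for illustration and stand in for page.read() from the question. It also separates inline scripts from external ones, which the question asks about; that part uses a plain has_attr check rather than anything from the answer above.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page; stands in for page.read() from the question.
HTML = """
<html><head>
<link rel="stylesheet" href="/static/main.css">
<link rel="icon" href="/favicon.ico">
<script src="/static/app.js"></script>
<script>console.log("inline");</script>
</head><body></body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Stylesheets only: the favicon <link> has rel="icon" and is skipped.
css_links = [link["href"] for link in soup.find_all("link", rel="stylesheet", href=True)]

# External scripts only: src=True keeps tags that carry a src attribute.
js_links = [script["src"] for script in soup.find_all("script", src=True)]

# Inline scripts only: no src attribute, and some text inside the tag.
inline_js = [
    script.string
    for script in soup.find_all("script")
    if not script.has_attr("src") and script.string
]

print(css_links)
print(js_links)
print(inline_js)
```

The href and src values come back exactly as written in the page, so relative paths may still need to be resolved against the page URL before they can be fetched.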

Tags: python, css, beautifulsoup

Izy-

1 Answers

301_Moved_Permanently
