I'm new to Beautiful Soup and I'd like to extract the CSS and JS links of a website using it. So far, I've succeeded but with a small flaw.
from bs4 import BeautifulSoup
import urllib.request
url = "http://www.something.com"
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")  # name the parser explicitly
for link in soup.find_all('link'):  # Lists out css links
    print(link.get('href'))
On using the snippet above, I'm able to get all the links to the css files. However, I also get other links like favicon. I'm kind of new to BeautifulSoup and I'd like to know if there's any way I can filter it out to only the stylesheets.
Also, for extracting the JS, if I run a simple find_all on the 'script' tag, I get the JS links as well as any JS that's written directly within script tags, in a very untidy manner. If I run a similar loop as my CSS one,
for link in soup.find_all('script'):  # Lists out all JS links
    print(link.get('src'))
I get the links without the JS that is written directly within script tags. I'm pretty sure there's a better way to extract it, but I'm a little confused. I had a look at an existing question on href extraction, but it didn't help me much.
I'm trying to make the code generic for all or most websites that I try it with. While this has worked for the sites I've used so far, some sites use 'link' for things other than CSS links. So if you have a more generic logic or method I could use to retrieve the CSS links, JS links, and inline code of a website, I'd greatly appreciate it!
Thanks!
You can pass extra parameters to find_all to further filter your query.
Try:
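import re  # needed for re.compile below
soup.find_all('link', rel="stylesheet")  # only <link> tags whose rel includes "stylesheet"
soup.find_all('script', src=re.compile(".*"))  # only <script> tags that have a src attribute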
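To show how these filters could fit together, here is a minimal sketch rather than a definitive implementation. The function name `extract_assets` and the `base_url` parameter are hypothetical, and resolving relative links with `urljoin` is an addition that the original answer does not mention. It assumes the page HTML has already been downloaded into a string.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_assets(html, base_url):
    """Return (css_links, js_links, inline_js) found in an HTML string.

    extract_assets and base_url are illustrative names, not part of bs4.
    """
    soup = BeautifulSoup(html, "html.parser")

    # rel is a multi-valued attribute, so this keeps any <link> whose rel
    # list contains "stylesheet" and skips favicons and similar links.
    css_links = [
        urljoin(base_url, tag["href"])
        for tag in soup.find_all("link", rel="stylesheet", href=True)
    ]

    # src=True keeps only <script> tags that actually carry a src attribute.
    js_links = [
        urljoin(base_url, tag["src"])
        for tag in soup.find_all("script", src=True)
    ]

    # Inline scripts: <script> tags without src; keep their non-empty text.
    inline_js = [
        tag.string.strip()
        for tag in soup.find_all("script")
        if not tag.has_attr("src") and tag.string and tag.string.strip()
    ]

    return css_links, js_links, inline_js
```

Called as `extract_assets(page.read(), url)`, this should give three separate lists, so inline code no longer gets mixed in with the external links. Pages that load stylesheets through `@import` or inject scripts at runtime would not be covered by this approach.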