I need to create a program in Python that parses all the URLs from a .html file and prints out all the tags and links like so:
meta: https://someurl.com
a: https://someurl.com
link: css/bootstrap.min.css
script: https://somescript.js
Currently, what I have is
from html.parser import HTMLParser
import re
class HeadParser(HTMLParser):
def handle_starttag(self, tag, attrs):
#use re.findall to get all the links
links = re.findall("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\), ]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", website)
for url in links:
print("{0}: {1}".format(tag, url))
website = open("./head.html").read()
HeadParser().feed(website)
and it returns to me
head: https://scooptacular.net
head: https://scooptacular.net/img/uploaded/379d05029c0d84618c70ac037a25fd88.jpg
head: https://scooptacular.net/img/uploaded/4baaa58a1a37fd3da3e4e78caf366b7f.jpg
head: https://fonts.googleapis.com/css?family=Montserrat:400,700
head: https://fonts.googleapis.com/css?family=Kaushan+Script' rel='stylesheet' type='text/css'>
head: https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic,700italic' rel='stylesheet' type='text/css'>
head: https://fonts.googleapis.com/css?family=Roboto+Slab:400,100,300,700' rel='stylesheet' type='text/css'>
head: https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js
head: https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js
meta: https://scooptacular.net
meta: https://scooptacular.net/img/uploaded/379d05029c0d84618c70ac037a25fd88.jpg
meta: https://scooptacular.net/img/uploaded/4baaa58a1a37fd3da3e4e78caf366b7f.jpg
meta: https://fonts.googleapis.com/css?family=Montserrat:400,700
meta: https://fonts.googleapis.com/css?family=Kaushan+Script' rel='stylesheet' type='text/css'>
meta: https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic,700italic' rel='stylesheet' type='text/css'>
meta: https://fonts.googleapis.com/css?family=Roboto+Slab:400,100,300,700' rel='stylesheet' type='text/css'>
meta: https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js
meta: https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js
meta: https://scooptacular.net
meta: https://scooptacular.net/img/uploaded/379d05029c0d84618c70
As you can see, it returns me a link for every tag, even duplicates, and does not return any local file links. What is wrong with my code?
EDIT:
the html i'm using is:
<head>
<meta property="og:url" content="https://scooptacular.net" />
<meta property="og:image" content="https://scooptacular.net/img/uploaded/379d05029c0d84618c70ac037a25fd88.jpg" />
<meta property="og:image" content="https://scooptacular.net/img/uploaded/4baaa58a1a37fd3da3e4e78caf366b7f.jpg" />
<link href="css/bootstrap.min.css" rel="stylesheet">
<link href="css/agency.css" rel="stylesheet">
<link href="font-awesome-4.1.0/css/font-awesome.min.css" rel="stylesheet" type="text/css">
<link href="https://fonts.googleapis.com/css?family=Montserrat:400,700" rel="stylesheet" type="text/css">
<link href='https://fonts.googleapis.com/css?family=Kaushan+Script' rel='stylesheet' type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic,700italic' rel='stylesheet' type='text/css'>
<link href='https://fonts.googleapis.com/css?family=Roboto+Slab:400,100,300,700' rel='stylesheet' type='text/css'>
<link href="css/bootstrap-formhelpers.min.css" rel="stylesheet" media="screen">
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
</head>
The primary issue is that handle_starttag is called for every tag, and with each call you're searching the entire page for matches to your regex, not just the tag you're on (the second argument you're passing to re.findall is website).
I don't see why you need to use regular expressions at all here. Why not just rely on whether or not the tag has an href, src or content attribute:
from html.parser import HTMLParser
class HeadParser(HTMLParser):
def handle_starttag(self, tag, attrs):
for attr in attrs:
if attr[0] in ['href', 'src', 'content']:
print('{0}: {1}'.format(tag, attr[1]))
website = open("./head.html").read()
HeadParser().feed(website)
Output:
meta: https://scooptacular.net
meta: https://scooptacular.net/img/uploaded/379d05029c0d84618c70ac037a25fd88.jpg
meta: https://scooptacular.net/img/uploaded/4baaa58a1a37fd3da3e4e78caf366b7f.jpg
link: css/bootstrap.min.css
link: css/agency.css
link: font-awesome-4.1.0/css/font-awesome.min.css
link: https://fonts.googleapis.com/css?family=Montserrat:400,700
link: https://fonts.googleapis.com/css?family=Kaushan+Script
link: https://fonts.googleapis.com/css?family=Droid+Serif:400,700,400italic,700italic
link: https://fonts.googleapis.com/css?family=Roboto+Slab:400,100,300,700
link: css/bootstrap-formhelpers.min.css
script: https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js
script: https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With