I am trying to write a Python script that downloads an image from a webpage. On the webpage (I am using NASA's picture of the day page), a new picture is posted every day, each with a different file name.
So my solution was to parse the HTML using HTMLParser, look for "jpg", and write the path and file name of the image to an attribute of the HTML parser object (named "output", see the code below).
I am new to Python and OOP (this is my first real Python script ever), so I am not sure if this is how it is generally done. Any advice and pointers are welcome.
Here is my code:
import urllib2
from HTMLParser import HTMLParser

# Grab image url
response = urllib2.urlopen('http://apod.nasa.gov/apod/astropix.html')
html = response.read()

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # Only parse the 'anchor' tag.
        if tag == "a":
            # Check the list of defined attributes.
            for name, value in attrs:
                # If href is defined and ends with "jpg", store it.
                if name == "href":
                    if value[len(value)-3:len(value)]=="jpg":
                        #print value
                        self.output=value  # store the path + file name of the image

parser = MyHTMLParser()
parser.feed(html)
imgurl='http://apod.nasa.gov/apod/'+parser.output
To check whether a string ends with "jpg", you could use .endswith() instead of len() and slicing:
if name == "href" and value.endswith("jpg"):
    self.output = value
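As a rough illustration of the .endswith() suggestion, here is a minimal, self-contained sketch. It makes several assumptions that are not in the original post: it uses Python 3 (html.parser and urllib.parse instead of the Python 2 modules above), the class name JpgLinkParser and the sample_html fragment are hypothetical, and it collects every match in a list instead of a single output attribute so that a page with no matching link does not raise an AttributeError.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class JpgLinkParser(HTMLParser):
    """Collect the href of every <a> tag that points at a .jpg file."""

    def __init__(self):
        super().__init__()
        self.links = []  # stays empty if no link matches

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for an attribute written without a value
            if name == "href" and value and value.lower().endswith(".jpg"):
                self.links.append(value)


# Hypothetical page fragment standing in for the downloaded HTML.
sample_html = (
    '<a href="image/2401/example.jpg">'
    '<img src="image/2401/example_small.jpg"></a>'
    '<a href="archivepix.html">Archive</a>'
)

parser = JpgLinkParser()
parser.feed(sample_html)

# urljoin resolves each relative path against the page's base URL.
base_url = "http://apod.nasa.gov/apod/"
image_urls = [urljoin(base_url, link) for link in parser.links]
print(image_urls)
```

Using urljoin rather than string concatenation is a design choice here: it should also cope with hrefs that are already absolute URLs.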
If the search inside the web page is more complex, you could use lxml.html or BeautifulSoup instead of HTMLParser, e.g.:
from lxml import html

# download & parse web page
doc = html.parse('http://apod.nasa.gov/apod/astropix.html').getroot()

# find <a href that ends with ".jpg" and
# that has <img child that has src attribute that also ends with ".jpg"
for elem, attribute, link, _ in doc.iterlinks():
    if (attribute == 'href' and elem.tag == 'a' and link.endswith('.jpg') and
            len(elem) > 0 and elem[0].tag == 'img' and
            elem[0].get('src', '').endswith('.jpg')):
        print(link)