I am trying to make a web scraper that extracts the title, image src, description, and location from a page. All of these work except the location, which is inside an <em> tag.
This link shows the code that I am using: https://pastebin.com/BFZyyhxB
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('http://www.manchestereveningnews.co.uk/news/greater-manchester-news').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
title = soup.title
image = soup.image
strong = soup.strong
description = soup.description
location = soup.location
title = soup.find('h1', class_='publication-font', )
image = soup.find('img')
strong = soup.find('strong')
location = soup.find('a', 'href', 'em') #This is either done incorrectly or needs more added
description = soup.find('div', class_='description')
print(title.text)
print(image)
print(strong.text)
print(description.string)
print(location)
This shows the HTML structure that I am trying to scrape, including the <em> tag: https://pastebin.com/zHy7H220
<div class="teaser"><figure data-mod="image" data-init="true"><div class="spacer" style="padding-top:66.50%;"></div>
<a href="http://www.manchestereveningnews.co.uk/news/greater-manchester-news/mum-who-witnessed-fianc-michael-13374115">
<img srcset="http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s180/Mike-Grimshaw.jpg 180w, http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s390/Mike-Grimshaw.jpg 390w, http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s458/Mike-Grimshaw.jpg 458w" src="http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s615/Mike-Grimshaw.jpg">
</a>
</figure>
<div class="inner">
<em><a href="http://www.manchestereveningnews.co.uk/all-about/sale">Sale</a></em> <------------------ the text within the <em> tag is what I am trying to get.
<strong>
<a href="http://www.manchestereveningnews.co.uk/news/greater-manchester-news/mum-who-witnessed-fianc-michael-13374115">Mum who witnessed fiancé Michael Grimshaw being fatally stabbed 'cannot face returning home'</a></strong><div class="description">
<a href="http://www.manchestereveningnews.co.uk/news/greater-manchester-news/mum-who-witnessed-fianc-michael-13374115">A fundraising campaign has been set up to help Mr Grimshaw's family in the wake of his tragic death</a>
</div>
</div>
</div>
As you can see, it returns nothing, which means that my code is incorrect. However, I cannot work out how to fix this, despite countless attempts at finding a tutorial that covers it.
Any help would be much appreciated.
Okay, so the <em> tag encapsulates the anchor tag. If you want the href link inside that anchor, I believe you will need:
location = soup.find('em').find('a')['href']
If it's the text you want, that's done with
location = soup.find('em').find('a').string # or .text
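Note that soup.find('em') only returns the first <em> on the page. To get a location for every story, one option is to loop over each teaser block. The following is a minimal sketch, assuming the page markup matches the snippet in the question; the SAMPLE_HTML string (with placeholder "#" links) and the extract_locations helper are illustrative and not part of the original code. It uses the built-in html.parser so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Trimmed, illustrative copy of the teaser markup from the question.
# The "#" hrefs are placeholders, not real URLs.
SAMPLE_HTML = """
<div class="teaser">
  <div class="inner">
    <em><a href="http://www.manchestereveningnews.co.uk/all-about/sale">Sale</a></em>
    <strong><a href="#">Headline</a></strong>
    <div class="description"><a href="#">Summary</a></div>
  </div>
</div>
<div class="teaser">
  <div class="inner">
    <strong><a href="#">Story without a location</a></strong>
  </div>
</div>
"""


def extract_locations(html):
    """Return one (location_text, location_href) tuple per teaser.

    Teasers without an <em><a> pair yield (None, None) instead of raising.
    """
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for teaser in soup.find_all("div", class_="teaser"):
        em = teaser.find("em")
        link = em.find("a") if em is not None else None
        if link is None:
            results.append((None, None))
        else:
            results.append((link.get_text(strip=True), link.get("href")))
    return results


locations = extract_locations(SAMPLE_HTML)
print(locations)
```

Checking for None before chaining .find('a') matters here, because soup.find returns None when nothing matches and calling a method on None raises an AttributeError.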
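soup.find takes a tag name as its first argument, followed by an optional attrs argument (a dict of attribute filters, or a string that is matched against the CSS class). For CSS selectors, use soup.select or soup.select_one instead. In soup.find('a', 'href', 'em'), 'href' is therefore treated as a class name and 'em' as the recursive flag, so the call looks for <a class="href"> and returns None.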
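To make the difference concrete, here is a small sketch comparing how find interprets its arguments. The one-line html string is a made-up reduction of the question's markup, and the variable names are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical, reduced version of the markup from the question.
html = '<div class="inner"><em><a href="/all-about/sale">Sale</a></em></div>'
soup = BeautifulSoup(html, "html.parser")

# The call from the question: 'href' is matched against the class attribute
# and 'em' lands in the recursive parameter, so nothing matches.
original_call = soup.find("a", "href", "em")

# Filter on an attribute value with an attrs dict.
by_attrs = soup.find("a", attrs={"href": "/all-about/sale"})

# Match any <a> that has an href attribute at all.
any_href = soup.find("a", href=True)

# Use a CSS selector: an <a> that is a direct child of an <em>.
by_css = soup.select_one("em > a")

print(original_call)
print(by_attrs.get_text(), any_href["href"], by_css.get_text())
```

Of these, select_one("em > a") is probably the closest to what the original call was trying to express, since it encodes the parent/child relationship directly.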