
Get text result from <em> tag within a <div> tag

I am trying to make a web scraper that collects the title, image src, description, and location of each article. All of these work except the location, which is located within an <em> tag.

This is the code I am using: https://pastebin.com/BFZyyhxB

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('http://www.manchestereveningnews.co.uk/news/greater-manchester-news').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

title = soup.title
image = soup.image
strong = soup.strong
description = soup.description
location = soup.location


title = soup.find('h1', class_='publication-font', )
image = soup.find('img')
strong = soup.find('strong')
location = soup.find('a', 'href', 'em') #This is either done incorrectly or needs more added
description = soup.find('div', class_='description')

print(title.text)
print(image)
print(strong.text)
print(description.string)
print(location)

This is the HTML structure I am trying to scrape, including the <em> tag: https://pastebin.com/zHy7H220

<div class="teaser"><figure data-mod="image" data-init="true"><div class="spacer" style="padding-top:66.50%;"></div>


<a href="http://www.manchestereveningnews.co.uk/news/greater-manchester-news/mum-who-witnessed-fianc-michael-13374115">
<img srcset="http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s180/Mike-Grimshaw.jpg 180w, http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s390/Mike-Grimshaw.jpg 390w, http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s458/Mike-Grimshaw.jpg 458w" src="http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s615/Mike-Grimshaw.jpg">
</a>
</figure>
<div class="inner">
<em><a href="http://www.manchestereveningnews.co.uk/all-about/sale">Sale</a></em> <------------------ the text within the <em> tag is what I am trying to get.
<strong>
<a href="http://www.manchestereveningnews.co.uk/news/greater-manchester-news/mum-who-witnessed-fianc-michael-13374115">Mum who witnessed fiancé Michael Grimshaw being fatally stabbed 'cannot face returning home'</a></strong><div class="description">
<a href="http://www.manchestereveningnews.co.uk/news/greater-manchester-news/mum-who-witnessed-fianc-michael-13374115">A fundraising campaign has been set up to help Mr Grimshaw's family in the wake of his tragic death</a>
</div>
</div>
</div>

As you can see, it returns nothing, which means my code is incorrect. However, I cannot work out how to fix it, despite countless attempts and searching through tutorials.

Any help would be much appreciated.

Asked by Amir Shaw, Dec 09 '25 13:12



1 Answer

Okay, so the <em> tag encapsulates the anchor tag. If you want the href link inside that anchor, I believe you will need:

location = soup.find('em').find('a')['href']

If it's the text you want, that's done with

location = soup.find('em').find('a').string # or .text
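As a minimal sketch, here are both lookups run against a trimmed copy of the markup from the question. The `html` string is only a stand-in for the live page, and the built-in 'html.parser' is used instead of 'lxml' purely so the snippet has no extra dependency. The `None` checks are my addition: `find` returns `None` when nothing matches, so chaining `.find('a')` directly would raise an AttributeError on a page without an <em> tag.

```python
import bs4 as bs

# Trimmed copy of the teaser markup from the question (stand-in for the live page).
html = """
<div class="teaser">
  <div class="inner">
    <em><a href="http://www.manchestereveningnews.co.uk/all-about/sale">Sale</a></em>
  </div>
</div>
"""

soup = bs.BeautifulSoup(html, 'html.parser')

# find() returns the first match or None, so guard each step.
em = soup.find('em')
link = em.find('a') if em is not None else None

# The link target and the visible text of the anchor inside <em>.
location_href = link['href'] if link is not None else None
location_text = link.get_text(strip=True) if link is not None else None

print(location_href)
print(location_text)
```

On the real page this picks up the first <em> in the whole document, so if other <em> tags appear earlier you would probably want to start the search from the teaser div rather than from `soup`.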

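soup.find takes a single tag name as its first argument, optionally followed by a dict of attribute filters (actual CSS selectors go to select() or select_one() instead). The syntax you have used is incorrect: in soup.find('a', 'href', 'em'), the string 'href' is treated as a class name to match and 'em' ends up in the recursive argument, so nothing matches and you get None.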
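To make that concrete, here is a rough sketch comparing the original call with a few forms that do match. It assumes a one-line stand-in for the question's markup and uses 'html.parser' only to avoid the lxml dependency; the variable names are made up for illustration.

```python
import bs4 as bs

# One-line stand-in for the relevant part of the question's markup.
html = (
    '<div class="inner"><em>'
    '<a href="http://www.manchestereveningnews.co.uk/all-about/sale">Sale</a>'
    '</em></div>'
)
soup = bs.BeautifulSoup(html, 'html.parser')

# The call from the question: 'href' lands in the attrs slot, where a bare
# string is matched against the class attribute, and 'em' lands in the
# recursive slot. There is no <a class="href">, so this is None.
broken = soup.find('a', 'href', 'em')

# Attribute filter passed as a dict: match an exact href value.
by_dict = soup.find(
    'a', {'href': 'http://www.manchestereveningnews.co.uk/all-about/sale'}
)

# Keyword filter: any <a> that has an href attribute at all.
by_kwarg = soup.find('a', href=True)

# A real CSS selector goes through select_one() (or select() for all matches).
by_css = soup.select_one('div.inner em a')

print(broken)
print(by_dict.get_text(), by_kwarg.get_text(), by_css.get_text())
```

The selector form is probably the closest to what the original call was trying to express, since it states the nesting (an <a> inside an <em>) directly.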

Answered by cs95, Dec 11 '25 01:12



