Get text result from <em> tag within a <div> tag

Question

I am trying to make a web scraper that extracts the following data: title, image src, description, and location. All of these work except the location, which is located within an <em> tag.

This link shows the code that I am using: https://pastebin.com/BFZyyhxB

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('http://www.manchestereveningnews.co.uk/news/greater-manchester-news').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

title = soup.title
image = soup.image
strong = soup.strong
description = soup.description
location = soup.location


title = soup.find('h1', class_='publication-font', )
image = soup.find('img')
strong = soup.find('strong')
location = soup.find('a', 'href', 'em') #This is either done incorrectly or needs more added
description = soup.find('div', class_='description')

print(title.text)
print(image)
print(strong.text)
print(description.string)
print(location)

This shows the HTML structure that I am trying to scrape, including the <em> tag: https://pastebin.com/zHy7H220

<div class="teaser"><figure data-mod="image" data-init="true"><div class="spacer" style="padding-top:66.50%;"></div>


<a href="http://www.manchestereveningnews.co.uk/news/greater-manchester-news/mum-who-witnessed-fianc-michael-13374115">
<img srcset="http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s180/Mike-Grimshaw.jpg 180w, http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s390/Mike-Grimshaw.jpg 390w, http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s458/Mike-Grimshaw.jpg 458w" src="http://i1.manchestereveningnews.co.uk/incoming/article13366643.ece/ALTERNATES/s615/Mike-Grimshaw.jpg">
</a>
</figure>
<div class="inner">
<em><a href="http://www.manchestereveningnews.co.uk/all-about/sale">Sale</a></em> <!-- the text within this <em> tag is what I am trying to get -->
<strong>
<a href="http://www.manchestereveningnews.co.uk/news/greater-manchester-news/mum-who-witnessed-fianc-michael-13374115">Mum who witnessed fiancé Michael Grimshaw being fatally stabbed 'cannot face returning home'</a></strong><div class="description">
<a href="http://www.manchestereveningnews.co.uk/news/greater-manchester-news/mum-who-witnessed-fianc-michael-13374115">A fundraising campaign has been set up to help Mr Grimshaw's family in the wake of his tragic death</a>
</div>
</div>
</div>

As you can see, it returns nothing, which means that my code is incorrect. However, I cannot work out how to fix this, despite countless attempts at looking through tutorials.

Any help would be much appreciated.

cs95 · Accepted Answer

Okay, so the <em> tag encapsulates the anchor tag. If you want the href link inside that anchor, I believe you will need:

location = soup.find('em').find('a')['href']

If it's the text you want, that's done with

location = soup.find('em').find('a').string # or .text
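As a quick check, the sketch below runs both lookups against a trimmed copy of the HTML from the question. The inline html string and the use of the built-in 'html.parser' (instead of lxml) are assumptions made only so that the snippet is self-contained; against the live page the same calls would return whatever the first <em> on the page contains.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the teaser markup from the question (assumption: the live
# page still has this structure).
html = """
<div class="teaser">
  <div class="inner">
    <em><a href="http://www.manchestereveningnews.co.uk/all-about/sale">Sale</a></em>
  </div>
</div>
"""

# 'html.parser' ships with Python, so no extra parser install is needed here.
soup = BeautifulSoup(html, "html.parser")

# The <em> tag wraps the anchor, so find the <em> first and then the <a> inside it.
link = soup.find("em").find("a")

location_href = link["href"]               # value of the href attribute
location_text = link.get_text(strip=True)  # visible text of the anchor

print(location_href)
print(location_text)
```

Note that .string only works when the tag has a single child string; .text (or get_text()) is the more forgiving choice if the anchor ever contains nested markup.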

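soup.find takes a single tag name, along with an optional dict argument (attrs) specifying the attributes to filter on, such as the CSS class. The syntax you have used is incorrect: in soup.find('a', 'href', 'em'), the string 'href' is interpreted as a CSS class to match and 'em' is passed as the recursive argument, so the call searches for an <a> tag with class "href" and finds nothing.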
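To illustrate the point about find's arguments, here is a minimal sketch, again run on an inline fragment of the question's HTML rather than the live page (that fragment and the 'html.parser' choice are assumptions for the sake of a runnable example). It shows the call from the question returning None, the equivalent ways of filtering by class, and a guarded lookup that does not raise if the <em> is missing.

```python
from bs4 import BeautifulSoup

html = (
    '<div class="inner">'
    '<em><a href="http://www.manchestereveningnews.co.uk/all-about/sale">Sale</a></em>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# The call from the question: 'href' is read as a CSS class filter and 'em'
# ends up as the recursive flag, so this looks for <a class="href">.
broken = soup.find("a", "href", "em")

# Equivalent ways to filter a tag by its class attribute.
inner_kw = soup.find("div", class_="inner")
inner_attrs = soup.find("div", attrs={"class": "inner"})
inner_positional = soup.find("div", {"class": "inner"})

# Scope the search to the container, and guard against a missing tag,
# since find returns None when nothing matches.
em = inner_kw.find("em") if inner_kw else None
location = em.a.get_text(strip=True) if em and em.a else None

print(broken)
print(location)
```

Scoping the search to the div first is a design choice rather than a requirement; it simply makes it less likely that an unrelated <em> elsewhere on the page is picked up.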

Tags: python, html, beautifulsoup

Amir Shaw
