Scrapy: extracting data from an HTML tag that uses an "id" selector instead of a "class"

Tags:

web-scraping

scrapy

I am new to web scraping and Scrapy. I hope you can help me.

I am trying to extract data from a web page that uses <span> tags. Usually, if the span tag uses a class, for example:

<span class="class_A">Hello, World!</span>

I would use the following code to retrieve the text.

response.css('span.class_A::text').extract()

However, when the HTML uses an "id" instead of a "class", for example:

<span id="id_A">Hello, Universe!</span>

the code below does not work anymore.

response.css('span.id_A::text').extract()

Please help! What's the correct way of extracting data using an "id"?

Thank you for your help!

asked Jul 25 '17 20:07

RF_956

1 Answer

One way is to use an attribute selector that matches on the id:

>>> HTML = '''
... <span id="id_A">Hello, Earth</span>
... <span id="id_B">Hello, Universe</span>
... '''
>>> from scrapy.selector import Selector
>>> selector = Selector(text=HTML)
>>> selector.css('[id="id_A"]::text').extract()
['Hello, Earth']

Alternatively, use the # ID selector, which is the id counterpart of the . class selector:

>>> HTML = '''
... <span id="id_A">Hello, Earth</span>
... <span id="id_B">Hello, Universe</span>
... '''
>>> from scrapy.selector import Selector
>>> selector = Selector(text=HTML)
>>> selector.css('span#id_A::text').extract()
['Hello, Earth']

Scrapy uses cssselect, which follows the W3C Selectors Level 3 specification, so "." selects by class and "#" selects by id.

answered Jan 04 '23 01:01

Bill Bell
