I am new to web scraping and Scrapy. I hope you can help me.
I am trying to extract data from a web page that uses <span> tags. Usually, if the span tag has a class, for example:
<span class="class_A">Hello, World!</span>
I would use the following code to retrieve the text.
response.css('span.class_A::text').extract()
However, when the HTML uses an "id" instead of a "class", for example,
<span id="id_A">Hello, Universe!</span>
the code below does not work anymore.
response.css('span.id_A::text').extract()
Please help! What's the correct way of extracting data using an "id"?
Thank you for your help!
When you scrape web pages, you extract specific parts of the HTML source using selectors, which are written as either XPath or CSS expressions. Selectors are built on the lxml library, which processes XML and HTML in Python.
Scrapy comes with its own mechanism for extracting data, called selectors because they "select" the parts of the HTML document specified by XPath or CSS expressions.
When you pass text nodes to an XPath string function, use . (dot) instead of .//text(). The expression .//text() produces a collection of text nodes called a node-set, and when a node-set is converted to a string, only the text of its first node is used.
This is one way.
>>> HTML = '''
... <span id="id_A">Hello, Earth</span>
... <span id="id_B">Hello, Universe</span>
... '''
>>> from scrapy.selector import Selector
>>> selector = Selector(text=HTML)
>>> selector.css('[id="id_A"]::text').extract()
['Hello, Earth']
Alternatively,
>>> HTML = '''
... <span id="id_A">Hello, Earth</span>
... <span id="id_B">Hello, Universe</span>
... '''
>>> from scrapy.selector import Selector
>>> selector = Selector(text=HTML)
>>> selector.css('span#id_A::text').extract()
['Hello, Earth']
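Scrapy uses cssselect, which follows the W3C Selectors Level 3 specification. In CSS, a dot (.) selects by class and a hash (#) selects by id, which is why span.id_A matches nothing here.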