
Scrapy: extracting data from an html tag that uses an "id" Selector instead of a "class"

I am new to web scraping and Scrapy. I hope you can help me.

I am trying to extract data from a web page that uses a <span> tag. Usually, if the span tag uses a class, for example:

<span class="class_A">Hello, World!</span>

I would use the following code to retrieve the text.

request.css('span.class_A::text').extract()

However, when the HTML uses an "id" instead of a "class", for example,

<span id="id_A">Hello, Universe!</span>

the code below does not work anymore.

request.css('span.id_A::text').extract()

Please help! What's the correct way of extracting data using an "id"?

Thank you for your help!

RF_956 asked Jul 25 '17 20:07


1 Answer

One way is to match the id with an attribute selector.

>>> HTML = '''
... <span id="id_A">Hello, Earth</span>
... <span id="id_B">Hello, Universe</span>
... '''
>>> from scrapy.selector import Selector
>>> selector = Selector(text=HTML)
>>> selector.css('[id="id_A"]::text').extract()
['Hello, Earth']

Alternatively, use the # shorthand, which matches on id the way . matches on class:

>>> HTML = '''
... <span id="id_A">Hello, Earth</span>
... <span id="id_B">Hello, Universe</span>
... '''
>>> from scrapy.selector import Selector
>>> selector = Selector(text=HTML)
>>> selector.css('span#id_A::text').extract()
['Hello, Earth']

Scrapy uses cssselect, which follows the W3C Selectors Level 3 specification, so the standard CSS forms for ids (#id_A and [id="id_A"]) apply.

Bill Bell answered Jan 04 '23 01:01