
Is it possible to use ItemLoaders for parsing HTML nodes?

Tags: python, scrapy

Normally, the item loader extracts data automatically before passing the values to the input processor:

  1. Data from xpath1 is extracted, and passed through the input processor of the name field. (Scrapy docs)

Is it possible to change this behaviour for certain fields of an item loader, so that I can pass in a more complex structure (ideally the selector itself)?

I have an HTML document like this:

<a class="foo" href="http://example.com">example 1</a>
<a class="foo" href="http://example.org">example 2</a>

Now I'd like to fetch these link elements in a spider:

loader.add_css('links', '.foo')

and parse them in the item loader to get a list of values (after the output processor) like this:

[("http://example.com", "example 1"), ("http://example.org", "example 2")]

However, since item loaders automatically convert the input to unicode strings, this does not seem easy.

aufziehvogel asked Nov 24 '25 20:11



1 Answer

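You can use .add_value() and "manually" construct a list of (href, text) tuples, in the order shown in the question:

# Build one (href, text) tuple per matching <a> element.
links = [(item.css('::attr(href)').extract()[0],
          item.css('::text').extract()[0])
         for item in response.css('.foo')]
# add_value() accepts ready-made Python values, so the tuples are kept as-is.
loader.add_value('links', links)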
alecxe answered Nov 26 '25 10:11



