Is it possible to use ItemLoaders for parsing HTML nodes?

Question

Normally, the item loader extracts data automatically before passing the values to the input processor:

Data from xpath1 is extracted, and passed through the input processor of the name field. (Scrapy docs)

Is it possible to change this behaviour for certain fields of an item loader, so that I can pass in a more complex structure (ideally the selector itself)?

I have an HTML document like this:

<a class="foo" href="http://example.com">example 1</a>
<a class="foo" href="http://example.org">example 2</a>

Now I'd like to fetch these link elements in a spider:

loader.add_css('links', '.foo')

and parse them in the item loader to get a list of values (after the output processor) like this:

[("http://example.com", "example 1"), ("http://example.org", "example 2")]

However, since item loaders automatically convert the input to unicode, this does not seem easy.

alecxe · Accepted Answer

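You can use .add_value() and "manually" construct a list of (href, text) tuples:

# Build one (href, text) tuple per matching element, in the order the question asks for.
links = [(item.css('::attr(href)').extract()[0],
          item.css('::text').extract()[0])
         for item in response.css('.foo')]
# add_value() takes the values as given, without running a selector on them first.
loader.add_value('links', links)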

Tags:

python

scrapy

aufziehvogel

1 Answers

alecxe
