Why doesn't scrapy xpath function support the 'matches()' syntax?

Question

I'm running scrapy 0.20.2.

$ scrapy shell "http://newyork.craigslist.org/ata/"

I would like to list all links to advertisement pages, excluding the index.html pages.

$ sel.xpath('//a[contains(@href,"html")]')
... 
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/mnh/atq/4243973984.html">Wicke'>,
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/mnh/atd/4257230057.html" class'>,
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/mnh/atd/4257230057.html">Recla'>,
<Selector xpath='//a[contains(@href,"html")]' data=u'<a href="/ata/index100.html" class="butt'>]

I would like to use the XPath matches() function to select only the links whose href matches the regex [0-9]+.html.

$ sel.xpath('//a[matches(@href,"[0-9]+.html")]')
...
ValueError: Invalid XPath: //a[matches(@href,"[0-9]+.html")]

What's wrong?

Ian Roberts · Accepted Answer

matches() is an XPath 2.0 function, and Scrapy only supports XPath 1.0, which has no built-in regular expression support. You'll have to extract all the links with a Scrapy selector and then do the regex filtering in Python rather than within the XPath.
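A minimal sketch of the Python-level filtering described above. It assumes the href values have already been pulled out as plain strings (in the scrapy shell, something like sel.xpath('//a/@href').extract() should produce such a list); the sample values are taken from the output in the question, and the names AD_LINK and filter_ad_links are hypothetical.

```python
import re

# One or more digits as the last path segment, followed by a literal ".html".
# The dot is escaped here; an unescaped "." would match any character.
AD_LINK = re.compile(r"/[0-9]+\.html$")


def filter_ad_links(hrefs):
    """Keep only hrefs whose last path segment is digits plus '.html'."""
    return [href for href in hrefs if AD_LINK.search(href)]


# Stand-in for the list of hrefs extracted with the selector (assumption).
sample_hrefs = [
    "/mnh/atq/4243973984.html",
    "/mnh/atd/4257230057.html",
    "/ata/index100.html",
]

ad_links = filter_ad_links(sample_hrefs)
print(ad_links)
```

With the sample data, the two advertisement links are kept and /ata/index100.html is dropped, because its last path segment does not consist of digits only.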

Jens Erat · Answer

For this special use case, there is an XPath 1.0 workaround using translate(...):

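//a[
  translate(substring-before(@href, '.html'), '0123456789', '') = ''
  and substring-before(@href, '.html') != ''
  and substring-after(@href, '.html') = '']

The translate(...) call removes all digits from the part of the href before the .html extension; if nothing is left, that part consisted of digits only. The second condition makes sure something actually precedes .html. This excludes a bare .html as well as hrefs that do not contain .html at all, for which substring-before returns an empty string. The last condition makes sure .html really is the file extension, with nothing following it. Note that the check applies to the whole href value, so it fits hrefs that are bare file names such as 123.html; an href with a path prefix, such as /mnh/atq/4243973984.html, still contains non-digit characters after the translate(...) call and is not selected.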
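To see how the three checks interact, the XPath 1.0 string functions can be mirrored in plain Python. This is only an illustration of the logic, not how Scrapy evaluates XPath, and all helper names are hypothetical. The sketch requires a non-empty part before ".html", which also rejects hrefs that do not contain ".html" at all (XPath 1.0's substring-before returns an empty string in that case).

```python
DIGITS = "0123456789"


def substring_before(s, sep):
    """Mirror XPath 1.0 substring-before(): '' when sep does not occur."""
    head, found, _ = s.partition(sep)
    return head if found else ""


def substring_after(s, sep):
    """Mirror XPath 1.0 substring-after(): '' when sep does not occur."""
    _, found, tail = s.partition(sep)
    return tail if found else ""


def strip_digits(s):
    """Mirror translate(s, '0123456789', ''): delete every digit."""
    return "".join(ch for ch in s if ch not in DIGITS)


def is_numeric_html(href):
    """True if href is one or more digits followed by '.html' and nothing else."""
    name = substring_before(href, ".html")
    return (
        name != ""                                # something precedes ".html"
        and strip_digits(name) == ""              # and it is digits only
        and substring_after(href, ".html") == ""  # nothing follows ".html"
    )


for href in ["4243973984.html", "index100.html", ".html",
             "/mnh/atq/4243973984.html"]:
    print(href, is_numeric_html(href))
```

Only the first sample passes. The last one fails because the path prefix /mnh/atq/ survives the digit removal, which is why this approach is limited to hrefs that are bare file names.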

Tags: python, regex, web-scraping, xpath, scrapy

Michel Hua
