
Extract text inside anchor tag using xpath

I am trying to find out how many pages a search result on a site has, so that I can scrape the data from every page using lxml and XPath.

There is a pagination tab with the following structure: Page: 1 2 3 ... 7 next

The HTML for it looks something like this:

<ul class="ulclass">
 <li></li>
 <li>
      <span> You are on the first page</span>
      "1"
 </li>
 <li>
      <a href="link to second page">
        <span></span>
      "2"
      </a>
 </li>
  <li>
 </li>
      ...
  <li>
      <a href="link to last page">
        <span></span>
      "7"
      </a>
 </li>

My approach is to extract the page numbers (1, 2, 3, ..., 7) so that I can repeat the scrape once per page; otherwise it only scrapes the first page of results. I have written the following XPath, but it does not return the correct page numbers:

xpath('//ul[@class="ulclass"]/li/a/text()')

separ1 Avatar asked Dec 08 '25 11:12



1 Answer

If I expand your example to form this,

<ul class="ulclass">
<li><span>You are on the first page</span>"1"</li>
<li><a href="link to second page"><span></span>"2"</a></li>
<li><a href="link to third page"><span></span>"3"</a></li>
<li><a href="link to fourth page"><span></span>"4"</a></li>
<li><a href="link to fifth page"><span></span>"5"</a></li>
<li><a href="link to sixth page"><span></span>"6"</a></li>
<li><a href="link to last page"><span></span>"7"</a></li>
</ul>

then using scrapy in Python I can get this:

>>> from scrapy.selector import Selector
>>> selector = Selector(text=open('temp.htm').read())
>>> selector.xpath('..//ul[@class="ulclass"]/li/a/text()').extract()
['"2"', '"3"', '"4"', '"5"', '"6"', '"7"']
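With the indented markup from the question, `text()` also returns whitespace-only text nodes, and the current page ("1") is not inside an `<a>` at all, which may be why the original expression looked wrong. A minimal sketch with lxml, assuming the structure shown in the question (the `SAMPLE` string and the `page_numbers` helper are made up for illustration), could collect the text nodes of both the `<li>` and `<a>` elements, strip whitespace and the literal quotes, and keep only the numeric ones:

```python
from lxml import html

# Hypothetical sample, modelled on the markup in the question.
SAMPLE = """
<html><body>
<ul class="ulclass">
 <li></li>
 <li>
      <span> You are on the first page</span>
      "1"
 </li>
 <li>
      <a href="link to second page">
        <span></span>
      "2"
      </a>
 </li>
 <li>
      <a href="link to last page">
        <span></span>
      "7"
      </a>
 </li>
</ul>
</body></html>
"""


def page_numbers(markup):
    """Return the page numbers found in the pagination list as ints."""
    tree = html.fromstring(markup)
    # Text directly under <li> (current page) and under <li>/<a> (other pages).
    raw = tree.xpath(
        '//ul[@class="ulclass"]/li/text()'
        ' | //ul[@class="ulclass"]/li/a/text()'
    )
    # Drop surrounding whitespace and the literal double quotes,
    # then discard anything that is not purely numeric.
    cleaned = (text.strip().strip('"') for text in raw)
    return [int(text) for text in cleaned if text.isdigit()]


last_page = max(page_numbers(SAMPLE))
print(last_page)  # → 7
```

Taking the maximum gives the last page, so the scrape can loop over `range(1, last_page + 1)` even when the pager elides the middle pages with "...".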
Bill Bell Avatar answered Dec 11 '25 23:12
