
Scraping HTML table in Python lxml

The question may sound easy, but I am having difficulty solving it. I have a table like the following:

<table><tbody>
<tr>
<td>2003</td>
<td><span class="positive">1.19</span> </td>
<td><span class="negative">-0.48</span> </td>
</tr>

My code is as follows:

 from lxml import etree

 for elem in tree.xpath('//*[@id="printcontent"]/div[8]/div/table/tbody/tr'):
    for c in elem.xpath("//td"):
        if(c.getchildren()): # for the <span> thing
            text = c.xpath("//span/text()")
        else:
             text = c.text

But I am unable to iterate over the "td" elements. I have been trying all day, but to no avail! I want to get 2003, 1.19, and -0.48.

Kindly help!

user3001408 asked Dec 06 '25 17:12


1 Answer

It looks like you have HTML, not XML, so use lxml.html rather than lxml.etree to parse the data. If data.html looks like this:

<table><tbody>
<tr>
<td>2003</td>
<td><span class="positive">1.19</span> </td>
<td><span class="negative">-0.48</span> </td>
</tr>

then

import lxml.html as LH
tree = LH.parse('data.html')
print([td.text_content() for td in tree.xpath('//td')])

yields

['2003', '1.19 ', '-0.48 ']
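Note that the loop in the question uses "//td" and "//span/text()" inside the per-row loop. An XPath that starts with "//" searches from the document root, not from the current element, so every row would return every td in the document. If you want the cells grouped per row, with the trailing spaces removed, something along these lines should work. This is only a sketch: the inline HTML string is a closed copy of the sample table, and rows_from_html is a name made up for illustration.

```python
import lxml.html as LH

# Inline copy of the sample table from the question, with the closing
# </tbody></table> tags added so the snippet is self-contained.
HTML = """
<table><tbody>
<tr>
<td>2003</td>
<td><span class="positive">1.19</span> </td>
<td><span class="negative">-0.48</span> </td>
</tr>
</tbody></table>
"""

def rows_from_html(html):
    """Return one list of stripped cell strings per <tr> (hypothetical helper)."""
    root = LH.fromstring(html)
    rows = []
    for tr in root.xpath('//tr'):
        # './td' is relative to this row; '//td' would restart the search
        # from the document root and return every cell in the document.
        cells = [td.text_content().strip() for td in tr.xpath('./td')]
        rows.append(cells)
    return rows

print(rows_from_html(HTML))
```

text_content() collects the text of the td and of any nested span, so there is no need to special-case cells that have children.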

If

for elem in tree.xpath('//*[@id="printcontent"]/div[8]/div/table/tbody/tr'):

does not return any elements, then you need to show us enough of the HTML to debug why this XPath does not match.
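In the meantime, one way to narrow it down yourself is to test the XPath one step at a time and see where the match count drops to zero. Common culprits are a positional index such as div[8] that does not line up with the HTML you actually downloaded, or a tbody step copied from the browser's developer tools that is not present in the raw source. The sketch below is an assumption-laden illustration: match_counts is a made-up helper, the sample document is invented, and the naive split only handles simple paths with no "/" inside a predicate.

```python
import lxml.html as LH

def match_counts(root, xpath):
    """Return (prefix, match_count) for each growing prefix of a simple XPath.

    Hypothetical helper: assumes the path starts with '//' and that no
    predicate contains a '/' character.
    """
    if not xpath.startswith('//'):
        raise ValueError("expected an XPath starting with '//'")
    steps = xpath[2:].split('/')
    results = []
    for i in range(1, len(steps) + 1):
        prefix = '//' + '/'.join(steps[:i])
        results.append((prefix, len(root.xpath(prefix))))
    return results

# Invented sample document: only one <div> under #printcontent,
# so the div[8] step cannot match anything.
SAMPLE = (
    '<div id="printcontent"><div><div><table><tbody>'
    '<tr><td>2003</td></tr>'
    '</tbody></table></div></div></div>'
)

root = LH.fromstring(SAMPLE)
path = '//*[@id="printcontent"]/div[8]/div/table/tbody/tr'
for prefix, count in match_counts(root, path):
    print(count, prefix)
```

The first prefix whose count is 0 is the step to compare against the actual HTML.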

unutbu answered Dec 08 '25 08:12


