
Nokogiri next_element with filter

Let's say I've got an ill-formed HTML page:

<table>
 <thead>
  <th class="what_I_need">Super sweet text<th>
 </thead>
 <tr>
  <td>
    I also need this
  </td>
  <td>
    and this (all td's in this and subsequent tr's)
  </td>
 </tr>
 <tr>
   ...all td's here too
 </tr>
 <tr>
   ...all td's here too
 </tr>
</table>

In BeautifulSoup, we were able to get the <th> and then call findNext("td"). Nokogiri has the next_element call, but that might not return what I want (in this case, it would return the tr element).

Is there a way to filter the next_element call of Nokogiri? e.g. next_element("td")?

EDIT

For clarification, I'll be looking at many sites, most of them ill-formed in different ways.

For instance, the next site might be:

<table>
 <th class="what_I_need">Super sweet text<th>
 <tr>
  <td>
    I also need this
  </td>
  <td>
    and this (all td's in this and subsequent tr's)
  </td>
 </tr>
 <tr>
   ...all td's here too
 </tr>
 <tr>
   ...all td's here too
 </tr>
</table>

I can't assume any structure other than that there will be trs below the item that has the class what_I_need.

Tyler DeWitt asked Dec 05 '25 10:12



1 Answer

First, note that the closing th tag in your sample is malformed: it is written as <th> and should be </th>. Fixing that helps when you control the markup.

One way to reach the td is to find the th node first and then navigate from it with a relative XPath:

require 'nokogiri'

html = '
<table>
<thead>
  <th class="what_I_need">Super sweet text<th>
</thead>
<tr>
  <td>
    I also need this
  </td>
<tr>
</table>
'

doc = Nokogiri::HTML(html)

th = doc.at('th.what_I_need')
th.text # => "Super sweet text"
# Up two levels from the th (to the table), then down to the first tr/td.
td = th.at('../../tr/td')
td.text # => "\n    I also need this\n  "

This takes advantage of the fact that Nokogiri's at and search accept either CSS selectors or XPath expressions, and handle both pretty transparently.

Once you have the <th> node, you could also navigate using some of Node's methods:

th.parent.next_element.at('td').text # => "\n    I also need this\n  "

One more way to go about it is to start at the top of the table and look down:

table = doc.at('table')
th = table.at('th')
th.text # => "Super sweet text"
td = table.at('td')
td.text # => "\n    I also need this\n  "

If you need to access all <td> tags within a table, you can iterate over them easily:

table.search('td').each do |td|
  # do something with the td...
  puts td.text
end

If you want the contents of all <td> tags grouped by their containing <tr>, iterate over the rows and then over the cells:

table.search('tr').each do |tr|
  cells = tr.search('td').map(&:text)
  # do something with all the cells
end
the Tin Man answered Dec 08 '25 00:12
