Let's say I've got an ill formed html page:
<table>
<thead>
<th class="what_I_need">Super sweet text<th>
</thead>
<tr>
<td>
I also need this
</td>
<td>
and this (all td's in this and subsequent tr's)
</td>
</tr>
<tr>
...all td's here too
</tr>
<tr>
...all td's here too
</tr>
</table>
On BeautifulSoup, we were able to get the <th> and then call findNext("td"). Nokogiri has the next_element call, but that might not return what I want (in this case, it would return the tr element).
Is there a way to filter the next_element call of Nokogiri? e.g. next_element("td")?
EDIT
For clarification, I'll be looking at many sites, most of them ill formed in different ways.
For instance, the next site might be:
<table>
<th class="what_I_need">Super sweet text<th>
<tr>
<td>
I also need this
</td>
<td>
and this (all td's in this and subsequent tr's)
</td>
</tr>
<tr>
...all td's here too
</tr>
<tr>
...all td's here too
</tr>
</table>
I can't assume any structure other than there will be trs below the item that has the class what_I_need
First, note that your closing th tag is malformed: <th>. It should be </th>. Fixing that helps.
One way to do it is to use XPath to navigate to it once you've found the th node:
require 'nokogiri'
html = '
<table>
<thead>
<th class="what_I_need">Super sweet text<th>
</thead>
<tr>
<td>
I also need this
</td>
<tr>
</table>
'
doc = Nokogiri::HTML(html)
th = doc.at('th.what_I_need')
th.text # => "Super sweet text"
td = th.at('../../tr/td')
td.text # => "\n I also need this\n "
This is taking advantage of Nokogiri's ability to use either CSS accessors or XPath, and to do it pretty transparently.
Once you have the <th> node, you could also navigate using some of Node's methods:
th.parent.next_element.at('td').text # => "\n I also need this\n "
One more way to go about it, is to start at the top of the table and look down:
table = doc.at('table')
th = table.at('th')
th.text # => "Super sweet text"
td = table.at('td')
td.text # => "\n I also need this\n "
If you need to access all <td> tags within a table you can iterate over them easily:
table.search('td').each do |td|
# do something with the td...
puts td.text
end
If you want the contents of all <td> by their containing <tr> iterate over the rows then the cells:
table.search('tr').each do |tr|
cells = tr.search('td').map(&:text)
# do something with all the cells
end
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With