I'm new to Nokogiri, and Ruby in general.
I want to get the text of all the nodes in the document, starting from and inclusive of the first paragraph node.
I tried the following with XPath but I'm getting nowhere:
puts page.search("//p[0]/text()[next-sibling::node()]")
This doesn't work. What do I have to change?
You have to find the first <p/> node and return all text() nodes, both inside it and following it. Depending on what XPath capabilities Nokogiri has, use one of these queries:
//p[1]/(descendant::text() | following::text())
If that doesn't work, use this instead. It has to find the first paragraph twice, so it may be slightly slower, though probably not noticeably:
(//p[1]/descendant::text() | //p[1]/following::text())
A probably unsupported XPath 2.0 alternative would be:
//text()[//p[1] << .]
which means "all text nodes preceded by the first <p/> node in document".
This works with Nokogiri, which is built on top of libxml2 and supports XPath 1.0 expressions:
//p[1]//text() | //p[1]/following::text()
Proof:
require 'nokogiri'
html = '<body><h1>A</h1><p>B <b>C</b></p><p>D <b>E</b></p></body>'
doc = Nokogiri.HTML(html)
p doc.xpath('//p[1]//text() | //p[1]/following::text()').map(&:text)
#=> ["B ", "C", "D ", "E"]
Note that selecting the text nodes themselves returns a NodeSet of Nokogiri::XML::Text objects, so if you want only their text contents as strings you must map them with the .text (or .content) method.