What XPath can I use to get all text nodes after and including the first paragraph node?

Question

I'm new to Nokogiri, and Ruby in general.

I want to get the text of all the nodes in the document, starting from and inclusive of the first paragraph node.

I tried the following with XPath but I'm getting nowhere:

 puts page.search("//p[0]/text()[next-sibling::node()]")

This doesn't work. What do I have to change?

Jens Erat · Accepted Answer

You have to find the first <p/> node and return all text() nodes, both inside it and following it. Depending on which XPath features Nokogiri supports, use one of these queries:

//p[1]/(descendant::text() | following::text())

If that doesn't work, use this instead. It has to find the first paragraph twice, so it may be slightly slower, though probably not noticeably:

(//p[1]/descendant::text() | //p[1]/following::text())

A probably unsupported XPath 2.0 alternative would be:

//text()[//p[1] << .]

which means "all text nodes preceded by the first <p/> node in the document".

Phrogz · Answer

This works with Nokogiri (which is built on top of libxml2 and supports XPath 1.0 expressions):

//p[1]//text() | //p[1]/following::text()

Proof:

require 'nokogiri'

html = '<body><h1>A</h1><p>B <b>C</b></p><p>D <b>E</b></p></body>'
doc = Nokogiri.HTML(html)

p doc.xpath('//p[1]//text() | //p[1]/following::text()').map(&:text)
#=> ["B ", "C", "D ", "E"]

Note that selecting the text nodes returns a NodeSet of Nokogiri::XML::Text objects, so if you want only their text contents you must map over them with the .text (or .content) method.

Tags:

ruby

xpath

nokogiri

user1895623
