
How do I tell the line number for a node using the Nokogiri reader interface?

Tags: xml, ruby, nokogiri

I'm trying to write a Nokogiri script that greps XML for text nodes containing ASCII double quotes («"»). Since I want grep-like output, I need the line number and the contents of each matching line. However, I can't see how to get the line number at which a node starts. Here is my code:

require 'nokogiri'

ARGV.each do |filename|
    File.open(filename) do |xml_stream|
        reader = Nokogiri::XML::Reader(xml_stream)
        reader.each do |elem|
            # only text nodes are of interest
            next unless elem.node_type == Nokogiri::XML::Node::TEXT_NODE

            # keep trailing empty strings so the line offsets stay correct
            lines = elem.value.split(/\n/, -1)

            lines.each_with_index do |line, idx|
                if line =~ /"/
                    # elem.line is the call in question
                    STDOUT.printf "%s:%d:%s\n", filename, elem.line + idx, line
                end
            end
        end
    end
end

elem.line() does not work.

Shlomi Fish asked Sep 05 '25 17:09


1 Answer

XML and XML parsers don't really have a concept of line numbers. Line numbers belong to the physical layout of the file.

You can approximate them by walking the document and looking for text nodes that contain linefeeds and/or carriage returns, but that count can be thrown off because XML allows nested nodes.

require 'nokogiri'

xml = <<EOT_XML
<atag>
  <btag>
    <ctag 
      id="another_node">
      other text
    </ctag>
  </btag>
  <btag>
    <ctag id="another_node2">yet
                             another
                             text</ctag>
    </btag>
  <btag>
    <ctag id="this_node">this text</ctag>
  </btag>
</atag>
EOT_XML

doc = Nokogiri::XML(xml)

# find a particular node via CSS accessor
doc.at('ctag#this_node').text # => "this text"

# count the text nodes that contain a line break (a rough proxy for "lines")
doc.search('*/text()').select{ |t| t.text[/[\r\n]/] }.size # => 12

# walk the text nodes looking for a particular string,
# recording the line number Nokogiri reports for each match
content_at = []
doc.search('*/text()').each do |n|
  content_at << [n.line, n.text] if n.text.include?('this text')
end
content_at # => [[14, "this text"]]

This works because the parser figures out what is a text node and returns it cleanly, so you don't have to rely on regexes or raw text matches against the file to locate it.
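
For the grep-like output the question asks for, a text node that spans several lines still has to be split and numbered once a starting line is known. Below is a minimal sketch of that step in plain Ruby. `matching_lines` is a hypothetical helper, not part of Nokogiri, and it assumes `start_line` is the line on which the text begins; the sample string and the `example.xml` name are made up for the demo.

```ruby
# Hypothetical helper (not a Nokogiri API): pair each line of a text
# node's content with a line number and keep only the matching lines.
# Assumption: start_line is the document line where the text begins.
def matching_lines(text, start_line, pattern = /"/)
  # limit -1 keeps trailing empty strings, so offsets are not shifted
  lines = text.split(/\r?\n/, -1)
  numbered = lines.each_with_index.map { |line, idx| [start_line + idx, line] }
  numbered.select { |_line_no, line| line.match?(pattern) }
end

# A plain string standing in for the content of one text node.
sample = "no quotes here\nsays \"hi\"\nplain\nends with \""
hits = matching_lines(sample, 9)

# Print in grep style: file:line:content
hits.each do |line_no, line|
  printf("%s:%d:%s\n", 'example.xml', line_no, line)
end
```

With the approach above, `start_line` would come from `n.line` on each text node. Whether that value is the first or the last line of a multi-line text node is worth checking against your own Nokogiri and libxml2 versions, since the helper assumes the first.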


EDIT: I went through some old code, snooped around in Nokogiri's docs, and came up with the edited code above. It works correctly, including with some pathological cases. Nokogiri FTW!

the Tin Man answered Sep 07 '25 19:09