I am trying to get the contents of the title tag, but I can't get it to work. I have followed several Stack Overflow answers that are supposed to work, but they don't work for me.
This is what I am doing:
require "open-uri"
require "uri"

def browse startpage, depth, block
  if depth > 0
    begin
      open(startpage){ |f|
        block.call startpage, f
      }
    rescue
      return
    end
  end
end

browse("https://www.ruby-lang.org/es/", 2, lambda { |page_name, web|
  puts "Header information:"
  puts "Title: #{web.to_s.scan(/<title>(.*?)<\/title>/)}"
  puts "Base URI: #{web.base_uri}"
  puts "Content Type: #{web.content_type}"
  puts "Charset: #{web.charset}"
  puts "-----------------------------"
})
The title output is just [], why?
open returns a File object or passes it to the block (actually a Tempfile but that doesn't matter). Calling to_s just returns a string containing the object's class and its id:
open('https://www.ruby-lang.org/es/') do |f|
  f.to_s
end
#=> "#<File:0x007ff8e23bfb68>"
Scanning that string for a title is obviously useless, because it contains no HTML at all:
"#<File:0x007ff8e23bfb68>".scan(/<title>(.*?)<\/title>/)
#=> []
Instead, you have to read the file's content:
open('https://www.ruby-lang.org/es/') do |f|
  f.read
end
#=> "<!DOCTYPE html>\n<html>\n...</html>\n"
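The difference between to_s and read can be reproduced without a network connection. The sketch below uses a StringIO as a stand-in for the object that open-uri yields (open-uri hands back a StringIO for small responses and a Tempfile for larger ones). The HTML string is made up for illustration.

```ruby
require "stringio"

# Made-up HTML standing in for a downloaded page (an assumption of this sketch).
html = "<html><head><title>Example page</title></head><body></body></html>"
io = StringIO.new(html)

title_re = /<title>(.*?)<\/title>/

# to_s describes the object itself, so there is nothing to match.
from_to_s = io.to_s.scan(title_re)

# read returns the actual content, which does contain the tag.
from_read = io.read.scan(title_re)

p from_to_s
p from_read
```

The first result is empty and the second holds the captured title, which mirrors what happens in the question's callback.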
You can now scan the content for a <title> tag:
open('https://www.ruby-lang.org/es/') do |f|
  str = f.read
  str.scan(/<title>(.*?)<\/title>/)
end
#=> [["Lenguaje de Programaci\xC3\xB3n Ruby"]]
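The \xC3\xB3 in that result suggests the string is tagged as binary rather than UTF-8, and scan wraps the match in nested arrays. Here is a minimal sketch of cleaning up both, assuming the page really is UTF-8. The sample string is made up and deliberately marked binary with String#b.

```ruby
# Made-up page content, deliberately tagged as binary (ASCII-8BIT) to mimic
# the escaped bytes shown above. Assumes the real page is UTF-8 encoded.
raw = "<html><head><title>Lenguaje de Programación Ruby</title></head></html>".b

# Relabel the bytes as UTF-8 without changing them.
html = raw.force_encoding("UTF-8")

# String#[] with a regex and a group index returns the captured text directly
# (or nil if there is no match) instead of scan's array of arrays.
# The i and m flags tolerate <TITLE> and titles that span several lines.
title = html[/<title>(.*?)<\/title>/im, 1]

puts title
```

If the server reports a different charset, force_encoding("UTF-8") would be the wrong label, so check web.charset first.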
Or, using Nokogiri (because you can't parse [X]HTML with regex):
require 'nokogiri'
open('https://www.ruby-lang.org/es/') do |f|
  doc = Nokogiri::HTML(f)
  doc.at_css('title').text
end
#=> "Lenguaje de Programación Ruby"
If you insist on using only open-uri, this one-liner gets you the page title:
2.1.4 :008 > puts open('https://www.ruby-lang.org/es/').read.scan(/<title>(.*?)<\/title>/)
Lenguaje de Programación Ruby
=> nil
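On Ruby 3.0 and later, open-uri no longer patches Kernel#open for URLs, so the call above has to be written as URI.open. Here is a sketch of the same one-liner wrapped in a helper. page_title and its opener parameter are hypothetical names introduced here so the fetch can be swapped out; they are not part of open-uri.

```ruby
require "open-uri"
require "stringio"

# Hypothetical helper: fetches url and returns the text of its <title>,
# or nil when no title is found. opener is an assumption of this sketch;
# it defaults to URI.open and only exists so the fetch can be replaced.
def page_title(url, opener: URI.method(:open))
  html = opener.call(url).read
  html[/<title>(.*?)<\/title>/im, 1]
end

# Offline usage with a fake opener that returns canned HTML:
fake = ->(_url) { StringIO.new("<title>Canned title</title>") }
puts page_title("https://www.ruby-lang.org/es/", opener: fake)

# Real usage (needs network access):
#   puts page_title("https://www.ruby-lang.org/es/")
```

Returning nil for a missing title lets the caller decide how to handle pages without one.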
For anything more complicated than this, please use Nokogiri or Mechanize.