
Getting page title with Ruby

Tags:

ruby

I am trying to get the contents of a page's title tag, but I can't get it to work. I have followed several Stack Overflow answers that are supposed to work, but they don't work for me.

This is what I am doing:

require "open-uri"
require "uri"

def browse startpage, depth, block
    if depth > 0
        begin 
            open(startpage){ |f|
                block.call startpage, f
            }
        rescue
            return
        end
    end
end

browse("https://www.ruby-lang.org/es/", 2, lambda { |page_name, web|
    puts "Header information:"
    puts "Title: #{web.to_s.scan(/<title>(.*?)<\/title>/)}"
    puts "Base URI: #{web.base_uri}"
    puts "Content Type: #{web.content_type}"
    puts "Charset: #{web.charset}"
    puts "-----------------------------"
})

The title output is just [], why?

dabadaba asked Aug 31 '25 10:08



2 Answers

open returns a File object, or passes it to the block if one is given (it is actually a Tempfile, but that doesn't matter here). Calling to_s on it just returns a string containing the object's class and an encoded object id, not the page content:

open('https://www.ruby-lang.org/es/') do |f|
  f.to_s
end
#=> "#<File:0x007ff8e23bfb68>"

Scanning that string for a title cannot match anything, which is why you get an empty array:

"#<File:0x007ff8e23bfb68>".scan(/<title>(.*?)<\/title>/)
#=> []

Instead, you have to read the file's content:

open('https://www.ruby-lang.org/es/') do |f|
  f.read
end
#=> "<!DOCTYPE html>\n<html>\n...</html>\n"
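
The difference between to_s and read can be reproduced without touching the network. The following is a minimal sketch, assuming a StringIO as a stand-in for the object that open-uri yields; the HTML snippet and the variable names are made up for illustration.

```ruby
require "stringio"

# StringIO stands in for the object open-uri yields (an assumption,
# chosen so the sketch runs without network access).
io = StringIO.new("<html><head><title>Example</title></head></html>")

as_string = io.to_s   # something like "#<StringIO:0x000...>", not the content
content   = io.read   # the actual HTML

# Scanning the to_s result finds nothing; scanning the content finds the title.
from_to_s = as_string.scan(/<title>(.*?)<\/title>/)
from_read = content.scan(/<title>(.*?)<\/title>/)

p from_to_s  # []
p from_read  # [["Example"]]
```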

You can now scan the content for a <title> tag:

open('https://www.ruby-lang.org/es/') do |f|
  str = f.read
  str.scan(/<title>(.*?)<\/title>/)
end
#=> [["Lenguaje de Programaci\xC3\xB3n Ruby"]]
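
Two things stand out in that result: scan returns an array of arrays (one inner array per match, one element per capture group), and the accented character shows up as raw bytes, most likely because the string is tagged with a binary encoding. Here is a rough sketch of how both could be tidied up, assuming the bytes really are UTF-8; the literal string below is a hypothetical stand-in for the downloaded page.

```ruby
# A binary-tagged stand-in for the downloaded page (assumption: the real
# response body is UTF-8 bytes labelled as ASCII-8BIT).
html = "<html><head><title>Lenguaje de Programaci\xC3\xB3n Ruby</title></head></html>".b

# With one capture group, scan returns an array of one-element arrays.
matches = html.scan(/<title>(.*?)<\/title>/)

# Take the first captured string out of the nested result.
title = matches.flatten.first

# Relabel the bytes as UTF-8 so the accented character displays properly.
title_utf8 = title.dup.force_encoding("UTF-8")

puts title_utf8
```

If the page declares a different charset, force_encoding would need that charset instead.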

Or, using Nokogiri (because you can't parse [X]HTML with regex):

require 'nokogiri'

open('https://www.ruby-lang.org/es/') do |f|
  doc = Nokogiri::HTML(f)
  doc.at_css('title').text
end
#=> "Lenguaje de Programación Ruby"

Stefan answered Sep 03 '25 02:09



If you insist on using open-uri alone, this one-liner can get you the page title:

2.1.4 :008 > puts open('https://www.ruby-lang.org/es/').read.scan(/<title>(.*?)<\/title>/)
Lenguaje de Programación Ruby
 => nil

For anything more complicated than this, please use Nokogiri or Mechanize. Thanks.
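
On Ruby 3.0 and later, Kernel#open no longer accepts URLs, so the open-uri call has to be written as URI.open. The sketch below wraps the regex approach in two small helpers; title_of and fetch_title are hypothetical names, and the regex is only a quick approximation that tolerates attributes, mixed case and line breaks in the title tag.

```ruby
require "open-uri"

# Hypothetical helper: returns the text of the first <title> element in an
# HTML string, or nil if there is none.
def title_of(html)
  title = html[/<title[^>]*>(.*?)<\/title>/mi, 1]
  title && title.strip
end

# Hypothetical helper: downloads the page and extracts its title.
# URI.open is used because Kernel#open rejects URLs on Ruby 3.0+.
def fetch_title(url)
  URI.open(url) { |f| title_of(f.read) }
end

# Usage (needs network access):
#   puts fetch_title("https://www.ruby-lang.org/es/")
```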

CuriousMind answered Sep 03 '25 03:09
