 

How to extract the href from an <a> tag using a Ruby regex?

I have this link, which I declare like this:

link = "<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>"

How can I use a regex to extract only the href value?

Thanks!

Ryzal Yusoff asked Dec 04 '25 17:12



2 Answers

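If you want to parse HTML, you can use the Nokogiri gem instead of regular expressions. An HTML parser handles attribute order, quoting style, and whitespace for you, so it is easier to use and more robust than a regex.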

Example:

require "nokogiri"

link = "<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>"

# Parse the snippet; Nokogiri wraps it in a full HTML document.
link_data = Nokogiri::HTML(link)

# at_css returns the first matching element, or nil if there is none.
href_value = link_data.at_css("a")[:href]

puts href_value # => https://www.congress.gov/bill/93rd-congress/house-bill/11461

You should be able to use a regular expression like this:

href\s*=\s*"([^"]*)"

You can test the expression on Rubular.

The capture group will give you the URL, e.g.:

link = "<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>"
match = /href\s*=\s*"([^"]*)"/.match(link)
if match
  url = match[1]
end
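A shorter variant of the same idea, as a rough sketch: String#[] accepts a regex plus a capture-group index and returns nil when nothing matches, and String#scan collects every match when the string holds several links. The names HREF_RE, first_href and all_hrefs, and the two-link example string, are made up for illustration.

```ruby
# Same expression as above, stored in a constant (the name is illustrative).
HREF_RE = /href\s*=\s*"([^"]*)"/

# Returns the first href value in html, or nil when there is no match.
def first_href(html)
  html[HREF_RE, 1]
end

# Returns every href value in html as an array of strings.
# With one capture group, scan yields one-element arrays; flatten unwraps them.
def all_hrefs(html)
  html.scan(HREF_RE).flatten
end

link = "<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>"
puts first_href(link)

# Hypothetical input with two links.
html = "<a href=\"https://example.com/a\">A</a> <a href = \"https://example.com/b\">B</a>"
p all_hrefs(html)
```

Returning nil for a missing match keeps the caller's check as simple as the if match test in the snippet above.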

Explanation of the expression:

  • href matches the href attribute
  • \s* matches 0 or more whitespace characters (this is optional -- you only need it if the HTML might not be in canonical form).
  • = matches the equal sign
  • \s* again allows for optional whitespace
  • " matches the opening quote of the href URL
  • ( begins a capture group for extraction of whatever is matched within
  • [^"]* matches 0 or more non-quote characters. Since a double quote inside a double-quoted HTML attribute value must be escaped (as &quot;), this matches every character up to the end of the URL.
  • ) ends the capture group
  • " matches the closing quote of the href attribute's value
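The breakdown above can also be written into the pattern itself with Ruby's x flag, which ignores whitespace in the pattern and allows comments. The sketch below is a hypothetical variant, not the expression from this answer: it assumes you may also meet single-quoted or upper-case attributes, so it adds a backreference and the i flag. QUOTED_HREF and href_value are illustrative names.

```ruby
# Hypothetical variant of the expression above that accepts either quote style.
# The x flag permits whitespace and comments in the pattern; i ignores case.
QUOTED_HREF = /
  href        # the attribute name
  \s*=\s*     # equal sign, with optional whitespace on both sides
  (["'])      # group 1: the opening quote, single or double
  (.*?)       # group 2: the URL, matched lazily
  \1          # the same quote character that opened the value
/xi

# Returns the href value found in tag, or nil when there is none.
def href_value(tag)
  match = QUOTED_HREF.match(tag)
  match && match[2]
end

puts href_value("<a href=\"https://www.congress.gov/bill/93rd-congress/house-bill/11461\">H.R.11461</a>")
puts href_value("<A HREF = 'https://example.com/x'>x</A>")
```

Like the original expression, this does not check that the attribute belongs to an a tag, so it is only suitable for input whose shape you already know.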
neuronaut answered Dec 06 '25 09:12



