
Regex to capture words before and after a target in ruby

Tags:

regex

ruby

Assuming we have a text:

In software, a stack overflow occurs if the call stack pointer exceeds the stack bound. The call stack may consist of a limited amount of address space, often determined at the start of the program. The size of the call stack depends on many factors, including the programming language, machine architecture, multi-threading, and amount of available memory.

What I am trying to do is find the 2 words before and after a specific word (the target). For example, if the target is the word start, it should match 'at' 'the' (left) and 'of' 'the' (right). I am using the following method in Ruby, but it returns no matches. Any tips about what to fix in my regex? I have also tried "#{target}" instead of Regexp.escape.

    def checkWords(target, text, numLeft = 2, numRight = 2)

        regex = ""
        regex += " (\\S+) " * numLeft
        regex += Regexp.escape(target)
        regex += " (\\S+)" * numRight

        pattern = Regexp.new(regex, Regexp::IGNORECASE)
        matches = pattern.match(text)

        return true if matches
    end

Edit:

Regex printed:

(\S+)  (\S+) "£52" (\S+) (\S+)
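
The doubled space in the printed pattern appears to come from repeating " (\\S+) ", which carries a space on both sides. The following is a minimal sketch of that effect, assuming the target start and a shortened sample sentence; the names broken and fixed are illustrative only and do not appear in the original method:

```ruby
text = "often determined at the start of the program"

# Repeating " (\\S+) " puts a space on both sides of every group, so two
# adjacent groups end up separated by two literal spaces.
broken_source = (" (\\S+) " * 2) + Regexp.escape("start") + (" (\\S+)" * 2)
broken = Regexp.new(broken_source, Regexp::IGNORECASE)

# Keeping a single trailing space per left-hand group avoids the doubling.
fixed_source = ("(\\S+) " * 2) + Regexp.escape("start") + (" (\\S+)" * 2)
fixed = Regexp.new(fixed_source, Regexp::IGNORECASE)

broken_match = broken.match(text)            # nil: the text has no double spaces
fixed_captures = fixed.match(text).captures  # => ["at", "the", "of", "the"]
```

Printing broken_source with puts is a quick way to spot this kind of spacing problem before compiling the pattern.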

Edit based on Wiktor Stribiżew:

    def checkWords(target, text, numLeft = 2, numRight = 2)
        pattern = Regexp.new(/#{"(\\S+) "*numLeft}#{Regexp.escape(target)}#{" (\\S+)"*numRight}/i)
        matches = pattern.match(text)
    end
Vas asked Jan 20 '26 10:01


2 Answers

▶ input[/(\S+\s+){,2}start(\s+\S+){,2}/i]
#⇒ "at the start of the"

more generic:

▶ target = 'start'
▶ input[/(\S+\s+){,2}#{Regexp.escape target}(\s+\S+){,2}/i]
#⇒ "at the start of the"

To handle punctuation after the target:

▶ target = 'start'
▶ input[/(\S+\s+){,2}#{Regexp.escape target}\p{P}?(\s+\S+){,2}/i]
#⇒ "at the start of the"

Your function might look like:

def checkWords(target, text, numLeft = 2, numRight = 2)
  text =~ /(\S+\s+){,#{numLeft}}#{Regexp.escape target}\p{P}?(\s+\S+){,#{numRight}}/i
end
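
The function above returns only the match position (or nil). If the surrounding words themselves are needed, one possible approach is to wrap each repeated group in a named capture and split the captured text on whitespace. This is a sketch under assumptions: the helper name surrounding_words and the group names left and right are hypothetical, not part of the answer.

```ruby
# Hypothetical helper: returns the words around the first occurrence of
# target as a Hash, or nil when the pattern does not match.
def surrounding_words(target, text, num_left = 2, num_right = 2)
  # A repeated capture group keeps only its last repetition, so the whole
  # repetition is wrapped in a named group and the inner group is non-capturing.
  pattern = /(?<left>(?:\S+\s+){,#{num_left}})#{Regexp.escape(target)}\p{P}?(?<right>(?:\s+\S+){,#{num_right}})/i
  match = pattern.match(text)
  return nil unless match

  # String#split with no arguments splits on runs of whitespace.
  { before: match[:left].split, after: match[:right].split }
end

result = surrounding_words("start", "often determined at the start of the program")
# => {:before=>["at", "the"], :after=>["of", "the"]}
```

Because the quantifiers allow zero repetitions, a target at the very beginning or end of the text simply yields an empty array on that side.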
Aleksei Matiushkin answered Jan 22 '26 06:01


In the case you're looking at, I think you might be better served by splitting the text on non-word characters and then searching through the splits for your target word. Once you've found it, it's very easy to take the appropriate slices of the array of words in order to get the results you want.

For example:

def check_words(target, text, num_left = 2, num_right = 2)
  # Split the text on runs of non-word characters (/\W+/)
  words = text.split(/\W+/)
  # Enumerable#each_with_index also yields the index, so retrieving the
  # surrounding words is a snap
  words.each_with_index do |word, index|
    next unless word == target

    # Clamp the start so a target near the beginning of the text does not
    # produce a negative index, which would count from the end of the array
    first = [index - num_left, 0].max
    # Return a hash with two Symbol keys and small arrays of the desired words
    return {
      before: words[first...index],
      after: words.slice(index + 1, num_right)
    }
  end
  nil # the target does not occur in the text
end

This can then be called like so:

check_words('start', text)

It returns a Hash containing the num_left words before and the num_right words after the target, or nil if the target is not found:

{:before=>["at", "the"], :after=>["of", "the"]}

The {before: ...} syntax is shorthand, available since Ruby 1.9, for {:before => ...}; either syntax will work fine.
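
As a small illustration of that equivalence (the variable names here are only for the example), both literals below build the same Hash with a Symbol key:

```ruby
# Shorthand literal and "hash rocket" literal with the same Symbol key.
shorthand = { before: ["at", "the"] }
hash_rocket = { :before => ["at", "the"] }

same = (shorthand == hash_rocket) # => true
key_class = shorthand.keys.first.class # => Symbol
```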

Also, you may be interested in the Ruby documentation for Regexp, if you've not seen it already.

andyg0808 answered Jan 22 '26 05:01