
R regular expression: isolate parenthesized suffix

Tags:

regex

r

I'm using regular expressions in R. I am trying to pick out parenthesized content that is at the end of some strings in a character vector. I'm able to find parenthesized content when it exists, but I'm failing to exclude non-parenthesized content in inputs that don't have parens.

Example:

> x <- c("DECIMAL", "DECIMAL(14,5)", "RAND(1)")
> gsub("(.*?)(\\(.*\\))", "\\2", x)
[1] "DECIMAL" "(14,5)"  "(1)"

The last two elements of the output are correct; the first one is not. I want

c("", "(14,5)", "(1)")

The input can have anything, literally any word or number characters, before the parenthesized content.

pauljohn32 asked Mar 16 '26 23:03



1 Answer

You can use

sub("^.*?(\\(.*\\))?$", "\\1", x, perl=TRUE)

Details:

  • ^ - start of string
  • .*? - any zero or more chars other than line break chars, as few as possible (the dot excludes line breaks because perl=TRUE makes this a PCRE regex)
  • (\\(.*\\))? - an optional Group 1: a (, then any zero or more chars other than line break chars, as many as possible, and then a )
  • $ - end of string.
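
The breakdown above also explains why the pattern from the question left "DECIMAL" untouched: sub() and gsub() return an element unchanged when the pattern does not match it, and the original pattern requires parentheses. Making Group 1 optional and anchoring with ^ and $ lets the pattern match every string, so the whole input is replaced by the (possibly empty) group. A minimal sketch to check this, where old_pat, new_pat, matches_old and matches_new are illustrative names only and perl=TRUE is used for both patterns so the comparison runs on the same engine:

```r
x <- c("DECIMAL", "DECIMAL(14,5)", "RAND(1)")

old_pat <- "(.*?)(\\(.*\\))"     # pattern from the question: parentheses are required
new_pat <- "^.*?(\\(.*\\))?$"    # pattern from the answer: Group 1 is optional

# Which elements does each pattern match at all?
matches_old <- grepl(old_pat, x, perl = TRUE)
matches_new <- grepl(new_pat, x, perl = TRUE)

matches_old  # FALSE TRUE TRUE: "DECIMAL" does not match, so it is returned as-is
matches_new  # TRUE TRUE TRUE: every element matches, so every element is replaced
```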

See the R demo:

x <- c("DECIMAL", "DECIMAL(14,5)", "RAND(1)")
sub("^.*?(\\(.*\\))?$", "\\1", x, perl=TRUE)
## => [1] ""       "(14,5)" "(1)" 

NOTE: perl=TRUE is essential here because the pattern mixes a lazy quantifier (.*?) with a greedy one (.*), and R's default TRE engine does not handle quantifiers of different greediness in one pattern reliably.
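
If depending on the PCRE engine is undesirable, one alternative is to extract the suffix instead of replacing around it. This is a different technique from the sub() call above, not part of the original answer. A minimal sketch, assuming the parenthesized part always runs to the end of the string; the names m and out are illustrative only:

```r
x <- c("DECIMAL", "DECIMAL(14,5)", "RAND(1)")

# Locate a parenthesized suffix anchored at the end of each string.
# regexpr() returns -1 for elements that have no match.
m <- regexpr("\\(.*\\)$", x)

# Start from empty strings and fill in only the elements that matched;
# regmatches() returns values for the matching elements only.
out <- character(length(x))
out[m != -1] <- regmatches(x, m)
out
```

Because the pattern contains a single greedy quantifier, it does not rely on how the engine combines lazy and greedy matching.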

Wiktor Stribiżew answered Mar 18 '26 12:03
