
Ignore part of a string when splitting using regular expression in R

Tags: regex, r

I'm trying to split a string in R (using strsplit) at specific points (dashes, -), but not where the dash is inside a bracketed part of the string ([...]).

Example:

xx <- c("Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]","Total Internet-Time Spent Online-Past 7 Days")
xx
  [1] "Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]"
  [2] "Total Internet-Time Spent Online-Past 7 Days" 

should give me something like:

list(c("Radio Stations","Listened to Past Week","Toronto [FM-CFXJ-93.5 (93.5 The Move)]"), c("Total Internet","Time Spent Online","Past 7 Days"))
  [[1]]
  [1] "Radio Stations"                         "Listened to Past Week"                 
  [3] "Toronto [FM-CFXJ-93.5 (93.5 The Move)]"

  [[2]]
  [1] "Total Internet"    "Time Spent Online" "Past 7 Days"  

Is there a way to do this with a regular expression? The position and number of dashes vary between elements of the vector, and brackets are not always present. However, when there are brackets, they are always at the end.

I've tried different things, but none are working:

## Trying to match "-" before "[" in Perl
strsplit(xx, split = "-(?=\\[)", perl=T)
# does nothing

## Trying to first extract what follows "[", then split what precedes it
temp <- strsplit(xx, "[", fixed = T)
temp <- lapply(temp, function(yy) substr(head(yy, -1),"-"))
# doesn't work as there are some elements with no brackets...

Any help would be appreciated.

Bastien asked Nov 29 '25 20:11


2 Answers

Based on: Regex for matching a character, but not when it's enclosed in square bracket

You can use:

strsplit(xx, "-(?![^\\[]*\\])", perl = TRUE)
[[1]]
[1] "Radio Stations"                         "Listened to Past Week"                 
[3] "Toronto [FM-CFXJ-93.5 (93.5 The Move)]"

[[2]]
[1] "Total Internet"    "Time Spent Online" "Past 7 Days" 
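
The negative lookahead (?![^\\[]*\\]) rejects a dash whenever the text after it reaches a ] without passing a [ first, which is taken as a sign that the dash sits inside brackets. A minimal sketch of that behaviour follows; the inputs in yy are hypothetical and not from the question. The second one shows that a dash followed by a stray ] with no opening [ is also left unsplit.

```r
# Lookahead-based pattern from the answer above, escaped for an R string.
pat_lookahead <- "-(?![^\\[]*\\])"

# Hypothetical inputs: one bracketed tail, one stray "]" with no opening "[".
yy <- c("a-b [c-d]", "abc-def]")
res_lookahead <- strsplit(yy, pat_lookahead, perl = TRUE)

res_lookahead[[1]]  # "a" "b [c-d]"
res_lookahead[[2]]  # "abc-def]" (not split: the dash is followed by "def]")
```

Whether leaving "abc-def]" unsplit is acceptable depends on the data; if brackets are always balanced, as in the question, the distinction does not arise.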

talat answered Dec 01 '25 11:12


To match a - that is not inside [ and ], you must match any part of the string that is enclosed in [ and ] and skip it, then match - in all other contexts. In abc-def], the - is not between [ and ], so according to the specs it should be split on.

It is done with this regex:

\[[^][]*](*SKIP)(*FAIL)|-

Here,

  • \[ - matches a [
  • [^][]* - zero or more chars other than [ and ] (if you use [^]] it will match any char but ])
  • ] - a literal ]
  • (*SKIP)(*FAIL) - PCRE verbs that discard the current match and make the engine resume searching after the end of the discarded text
  • | - or
  • - - a hyphen in other contexts.
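
One way to check the breakdown above is to ask R which hyphens the pattern actually matches. The sketch below uses gregexpr on a hypothetical input string s (not from the question); only the hyphens outside the bracketed part should be reported.

```r
# (*SKIP)(*FAIL) pattern from above, escaped for an R string.
pat_skip <- "\\[[^][]*](*SKIP)(*FAIL)|-"

# Hypothetical input with hyphens both outside and inside brackets.
s <- "a-[b-c]-d"

# Character positions of the hyphens the pattern matches.
hits <- as.integer(gregexpr(pat_skip, s, perl = TRUE)[[1]])
hits  # 2 8: the hyphen at position 5, inside [b-c], is skipped
```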

Or, to match [...[...] like substrings:

\[[^]]*](*SKIP)(*FAIL)|-
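
The difference between [^][]* and [^]]* only shows up when an extra [ appears before the closing ]. A small sketch on a hypothetical string s2 (the variable names are illustrative only):

```r
# Strict variant: the bracketed text may contain neither "[" nor "]".
pat_strict <- "\\[[^][]*](*SKIP)(*FAIL)|-"
# Loose variant: the bracketed text may contain further "[" characters.
pat_loose <- "\\[[^]]*](*SKIP)(*FAIL)|-"

# Hypothetical input with a second, unclosed "[" inside the brackets.
s2 <- "a-[b-[c-d]-e"

res_strict <- strsplit(s2, pat_strict, perl = TRUE)[[1]]
res_loose <- strsplit(s2, pat_loose, perl = TRUE)[[1]]

res_strict  # "a" "[b" "[c-d]" "e"
res_loose   # "a" "[b-[c-d]" "e"
```

The strict variant protects only the innermost [c-d], so the hyphen after [b is still a split point; the loose variant protects everything from the first [ to the next ].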

Or, to account for nested square brackets:

(\[(?:[^][]++|(?1))*])(*SKIP)(*FAIL)|-

Here, (\[(?:[^][]++|(?1))*]) matches and captures a [, then zero or more repetitions of either 1+ chars other than [ and ] (with [^][]++) or (|) a recursion into the whole capturing group 1 pattern (with (?1), i.e. the whole part between (...)), and finally a closing ].
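
To see what the recursive group consumes on its own, it can be run without the (*SKIP)(*FAIL)|- part. This is a sketch on a hypothetical string s3 modelled on the nested example; regmatches and regexpr are base R functions.

```r
# Recursive capturing group only, escaped for an R string.
pat_group <- "(\\[(?:[^][]++|(?1))*])"

# Hypothetical input with nested, balanced square brackets.
s3 <- "Toronto [[F[M]]-CFXJ-93.5 (93.5 The Move)]"

# Extract the first balanced bracketed substring.
balanced <- regmatches(s3, regexpr(pat_group, s3, perl = TRUE))
balanced  # "[[F[M]]-CFXJ-93.5 (93.5 The Move)]"
```

Everything in that substring is what (*SKIP)(*FAIL) discards, which is why the hyphens inside it are never used as split points.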

See the R demo:

xx <- c("abc-def]", "Radio Stations-Listened to Past Week-Toronto [FM-CFXJ-93.5 (93.5 The Move)]","Total Internet-Time Spent Online-Past 7 Days")
pattern <- "\\[[^][]*](*SKIP)(*FAIL)|-"
strsplit(xx, pattern, perl=TRUE)
# [[1]]
# [1] "abc"  "def]"
# [[2]]
# [1] "Radio Stations"                        
# [2] "Listened to Past Week"                 
# [3] "Toronto [FM-CFXJ-93.5 (93.5 The Move)]"
# [[3]]
# [1] "Total Internet"    "Time Spent Online" "Past 7 Days"      

pattern_recursive <- "(\\[(?:[^][]++|(?1))*])(*SKIP)(*FAIL)|-"
xx2 <- c("Radio Stations-Listened to Past Week-Toronto [[F[M]]-CFXJ-93.5 (93.5 The Move)]","Total Internet-Time Spent Online-Past 7 Days")
strsplit(xx2, pattern_recursive, perl=TRUE)
# [[1]]
# [1] "Radio Stations"                            
# [2] "Listened to Past Week"                     
# [3] "Toronto [[F[M]]-CFXJ-93.5 (93.5 The Move)]"

# [[2]]
# [1] "Total Internet"    "Time Spent Online" "Past 7 Days"   

Wiktor Stribiżew answered Dec 01 '25 11:12