
Separating A String Into Characters

Tags: string, regex, r

I have some ordered test results encoded in a character string. The string can be of arbitrary length. Each digit in the string represents a test result. In the following, for example, there are four test results represented:

2069

I want to tidy these up in R by splitting the string into individual observations. That is no problem with strsplit or stringr::str_split, which return four values that will become my observations.

strsplit("2069" %>% as.character(), split = "") %>% unlist()
[1] "2" "0" "6" "9"

Now, however, I have realized that some results are values greater than 9. These two-digit values have been encoded with parentheses to make clear they are not individual results.

For example, in the following case I still have four values, but some have been enclosed in parentheses to group the values larger than 9.

2(10)1(12)

I'm struggling with a way to break these up so that I get

[1] "2" "10" "1" "12"

Appreciate any guidance. Thanks.

rdelrossi asked Feb 01 '26 08:02


1 Answer

Updated: the pattern below is based on the OP's new pattern shown in the comments. Here, we use str_extract_all to extract either one or more digits that follow an opening parenthesis (a regex lookbehind) or (|) any single character that is not a parenthesis ([^()]).

library(stringr)
str_extract_all(str1, "(?<=[(])\\d+|[^()]")
[[1]]
[1] "2"  "10" "1"  "12"

[[2]]
[1] "2" "0" "6" "9"

[[3]]
[1] "2"  "15"

[[4]]
[1] "2" "1" "3" "1"

Testing on the OP's extra pattern (str2):

str_extract_all(str2, "(?<=[(])\\d+|[^()]")
[[1]]
[1] "2"  "10" "1"  "12"

[[2]]
[1] "2" "0" "6" "9"

[[3]]
[1] "2"  "15"

[[4]]
[1] "2" "1" "3" "1"

[[5]]
[1] "10" "0"  "2"  "0"  "1" 
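
The same pattern can also be applied without stringr. The following is a minimal sketch, assuming base R's gregexpr and regmatches with perl = TRUE (the lookbehind needs the PCRE engine). The names pattern and res are introduced here for illustration only; the data is copied from the data section of this answer.

```r
# Sample data copied from the "data" section of this answer
str1 <- c("2(10)1(12)", "2069", "2(15)", "2131")
str2 <- c(str1, "(10)0201")

# Either digits preceded by "(" or any single character that is not a parenthesis
pattern <- "(?<=[(])\\d+|[^()]"

# perl = TRUE is required because lookbehind is a PCRE feature
res <- regmatches(str2, gregexpr(pattern, str2, perl = TRUE))
print(res)
```

This should return one character vector per input string, in the same shape as the str_extract_all result.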

Earlier solutions (based on the assumption that every number greater than 9 is wrapped in parentheses):

We may split on the parentheses in base R

unlist(strsplit(str1[1], "\\(|\\)"))
[1] "2"  "10" "1"  "12"
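
A small sketch of why this split alone does not cover the mixed string from the OP's comment: splitting only on parentheses leaves a run of single digits unsplit, and a match at the very start of the string produces a leading empty element. The names x and parts are illustrative.

```r
# Mixed string taken from str2 in the "data" section
x <- "(10)0201"

# Split on either parenthesis, as in the line above
parts <- strsplit(x, "\\(|\\)")[[1]]
print(parts)
```

Here "0201" stays in one piece, which is the case the updated regex at the top of the answer is meant to handle.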

If both kinds of string occur, one option is to find the elements that contain parentheses and split the two groups separately:

i1 <- grepl("\\(|\\)", str1)
lst1 <- vector('list', length(str1))
lst1[i1] <- strsplit(str1[i1], "\\(|\\)")
lst1[!i1] <- strsplit(str1[!i1], "")
unlist(lst1)
[1] "2"  "10" "1"  "12" "2"  "0"  "6"  "9"  "2"  "15" "2"  "1"  "3"  "1" 
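
Because unlist drops the link between each value and the string it came from, it may help to keep that link when tidying. The following is a rough sketch, assuming one row per test result is wanted; the data frame name tidy and the column names source, position and result are hypothetical and not part of the original question.

```r
str1 <- c("2(10)1(12)", "2069", "2(15)", "2131")

# Rebuild the per-string list exactly as in the code above
i1 <- grepl("\\(|\\)", str1)
lst1 <- vector("list", length(str1))
lst1[i1] <- strsplit(str1[i1], "\\(|\\)")
lst1[!i1] <- strsplit(str1[!i1], "")

# One row per result: originating string, position within it, integer value
tidy <- data.frame(
  source   = rep(str1, lengths(lst1)),
  position = sequence(lengths(lst1)),
  result   = as.integer(unlist(lst1))
)
print(tidy)
```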

Another option is to use ifelse with grepl to convert both cases to a single delimiter and then call strsplit:

lst1 <- strsplit(trimws(ifelse(grepl("\\(|\\)", str1), 
    gsub("\\(|\\)", ",", str1), gsub("(?<=.)(?=.)", ",", 
       str1, perl = TRUE)), whitespace = ","), ",")
lst1
[[1]]
[1] "2"  "10" "1"  "12"

[[2]]
[1] "2" "0" "6" "9"

[[3]]
[1] "2"  "15"

[[4]]
[1] "2" "1" "3" "1"

data

str1 <- c("2(10)1(12)", "2069", "2(15)", "2131")
str2 <- c(str1, "(10)0201")
akrun answered Feb 02 '26 23:02


