
How do I split a string by a character without ignoring trailing split-characters?

Tags: string, regex, r

I have a string similar to the following:

my_string <- "apple,banana,orange,"

And I want to split by , to produce the output:

list(c('apple', 'banana', 'orange', ""))

I thought strsplit would accomplish this, but it treats the trailing ',' as if it didn't exist:

my_string <- "apple,banana,orange,"

strsplit(my_string, split = ',')
#> [[1]]
#> [1] "apple"  "banana" "orange"

Created on 2023-11-15 by the reprex package (v2.0.1)

What is the simplest approach to achieve the desired output?

Some more test cases, with example strings and desired outputs:

string1 = "apple,banana,orange,"
output1 = list(c('apple', 'banana', 'orange', ''))

string2 =  "apple,banana,orange,pear"
output2 = list(c('apple', 'banana', 'orange', 'pear'))

string3 =  ",apple,banana,orange"
output3 = list(c('', 'apple', 'banana', 'orange'))

## Examples of non-comma separated strings
# '|' separator
string4 =  "|apple|banana|orange|"
output4 = list(c('', 'apple', 'banana', 'orange', ''))

# 'x' separator
string5 =  "xapplexbananaxorangex"
output5 = list(c('', 'apple', 'banana', 'orange', ''))

EDIT:

Ideally, the solution should generalize to any splitting character.

I would also prefer a base R solution, although please do still link any packages that supply this functionality, since their source code might be useful to look through.

Selk asked Dec 01 '25 20:12


2 Answers

Why strsplit Doesn't Give the Desired Output

The help page ?strsplit contains the following note:

Note that this means that if there is a match at the beginning of a (non-empty) string, the first element of the output is "", but if there is a match at the end of the string, the output is the same as with the match removed.

That is the reason you don't see the trailing "" when you use strsplit.
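
To make the quoted note concrete, here is a minimal sketch of the loop that ?strsplit describes, restricted to a literal separator. The name strsplit_sketch is hypothetical and this is not R's actual implementation; it only mirrors the documented steps to show where the trailing "" gets lost.

```r
# Sketch of the loop described in ?strsplit (literal separators only).
# strsplit_sketch is a hypothetical name used for illustration.
strsplit_sketch <- function(x, sep) {
  out <- character(0)
  repeat {
    if (nchar(x) == 0) break                 # empty remainder: stop, add nothing
    idx <- regexpr(sep, x, fixed = TRUE)     # position of the first literal match
    if (idx == -1) {                         # no match: the rest is the last piece
      out <- c(out, x)
      break
    }
    out <- c(out, substr(x, 1, idx - 1))     # text to the left of the match
    x <- substr(x, idx + nchar(sep), nchar(x))  # drop the match and everything before it
  }
  out
}

strsplit_sketch("apple,banana,orange,", ",")
```

Once the last separator has been consumed, the remainder is the empty string, so the loop stops before adding anything. A leading separator, by contrast, is found while the string is still non-empty, so the "" to its left is kept.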

Below are some demonstrations:

> strsplit("apple,banana,orange,", ",")
[[1]]
[1] "apple"  "banana" "orange"


> strsplit(",apple,banana,orange,", ",")
[[1]]
[1] ""       "apple"  "banana" "orange"


> strsplit(",apple,banana,orange", ",")
[[1]]
[1] ""       "apple"  "banana" "orange"


> strsplit("apple,banana,orange", ",")
[[1]]
[1] "apple"  "banana" "orange"

A Base R Workaround

As a coding exercise, one base R option is to define a custom recursive function like the one below:

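f <- function(x, sep = ",") {
  # sep is spliced into a regular expression, so a metacharacter
  # such as "|" has to be passed escaped, e.g. sep = "\\|"
  pat <- sprintf("^(.*?)%s.*", sep)
  s1 <- sub(pat, "\\1", x)                # text before the first separator
  s2 <- sub(paste0("^.*?", sep), "", x)   # text after the first separator
  if (s2 == x) {
    return(x)                             # no separator left: x is the last piece
  }
  c(s1, Recall(s2, sep))
}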

or a variant with substr + regexpr

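f <- function(x, sep = ",") {
  # fixed = TRUE matches sep literally, so "|" or "." also work as separators
  idx <- regexpr(sep, x, fixed = TRUE)
  if (idx == -1) {
    return(x)                             # no separator left: x is the last piece
  }
  s1 <- substr(x, 1, idx - 1)             # text before the first separator
  s2 <- substr(x, idx + nchar(sep), nchar(x))  # text after it
  c(s1, Recall(s2, sep))
}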

such that

> f("apple,banana,orange,")
[1] "apple"  "banana" "orange" ""

> f(",apple,banana,orange,")
[1] ""       "apple"  "banana" "orange" ""      

> f(",apple,banana,orange")
[1] ""       "apple"  "banana" "orange"

> f("apple,banana,orange")
[1] "apple"  "banana" "orange"
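
For long inputs, a non-recursive alternative may be preferable. The sketch below swaps in a different technique from the functions above: regmatches() with invert = TRUE, which returns the text between the matches. The name split_between is hypothetical, and fixed = TRUE is an assumption that the separator should be matched literally.

```r
# Hypothetical helper: split on a literal separator, keeping empty
# pieces at both ends. Returns a list, like strsplit().
split_between <- function(x, sep = ",") {
  # locate every literal occurrence of sep in each element of x
  m <- gregexpr(sep, x, fixed = TRUE)
  # invert = TRUE returns the pieces between the matches
  regmatches(x, m, invert = TRUE)
}

split_between("apple,banana,orange,")
split_between("|apple|banana|orange|", sep = "|")
```

Because it is vectorised over x, a character vector of several strings can be passed in one call.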

ThomasIsCoding answered Dec 04 '25 10:12


Pasting another separator onto the end lets strsplit return the trailing empty string.
Alternatively, you could fall back on the scan function, which underpins the read.csv/read.table functions (shown further down). With strsplit:

strsplit(paste0(string1, ","), ",")
##[[1]]
##[1] "apple"  "banana" "orange" ""

Generalising, with the regex escaping stripped from the separator before it is pasted on:

L <- list(string1, string2, string3, string4, string5)
mapply(
    function(x,s) strsplit(paste0(x, gsub("\\\\", "", s)), split=s),
    L,
    c(",", ",", ",", "\\|", "x")
)

##[[1]]
##[1] "apple"  "banana" "orange" ""      
##
##[[2]]
##[1] "apple"  "banana" "orange" "pear"  
##
##[[3]]
##[1] ""       "apple"  "banana" "orange"
##
##[[4]]
##[1] ""       "apple"  "banana" "orange" ""      
##
##[[5]]
##[1] ""       "apple"  "banana" "orange" "" 
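
If the separator is meant literally, one way to sidestep the escaping step is fixed = TRUE, which makes strsplit treat split as a plain string. A minimal sketch under that assumption (split_padded is a hypothetical helper name):

```r
# Hypothetical helper: pad with one extra separator, then split literally.
split_padded <- function(x, sep) {
  # the appended separator lets the trailing empty field survive;
  # fixed = TRUE means sep needs no regex escaping
  strsplit(paste0(x, sep), split = sep, fixed = TRUE)
}

L <- list("apple,banana,orange,", "apple,banana,orange,pear",
          ",apple,banana,orange", "|apple|banana|orange|",
          "xapplexbananaxorangex")
res <- mapply(split_padded, L, c(",", ",", ",", "|", "x"))
```

Note that the padding is unconditional, so an empty input "" yields a single "" rather than character(0).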

scan option:

scan(text=string1, sep=",", what="")
##Read 4 items
##[1] "apple"  "banana" "orange" ""

Generalising:

mapply(
    function(x,s) scan(text=x, sep=s, what=""),
    L,
    c(",", ",", ",", "|", "x")
)
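
scan() prints a "Read N items" message and, by default, treats quote characters specially. A hedged variant that silences the message and turns quoting off is sketched below; scan_split is a hypothetical name, and the sketch assumes a single-character separator and input without embedded newlines.

```r
# Hypothetical wrapper around scan() for splitting one string.
scan_split <- function(x, sep) {
  # quote = "" disables quote handling; quiet = TRUE suppresses "Read N items"
  scan(text = x, sep = sep, what = "", quote = "", quiet = TRUE)
}

scan_split("apple,banana,orange,", ",")
```

Unlike strsplit, this returns a plain character vector rather than a list.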

thelatemail answered Dec 04 '25 10:12


