
Split a string and keep just one specific part

In my data.table, I use tstrsplit with the keep= parameter to split the ValueId column. In this case, though, I do not know in advance which value to pass to keep; I would like to use the value from the Level column of each row.

All my attempts have failed. Is it possible? Maybe not in data.table?

Here is a reprex:

library(data.table)

foo <- data.table(Level = c(2,2,3,4,3),
                  ValueId = c("11983:1055521", "11983:1055521-5168:290668-198:100798", "11983:1055521-5168:290668-198:100798-92:91604-139:94569-135:94719-5161:290771-5162:290728-5166:290620",
                             "11983:1055521-5168:290668-198:100798-92:91604-139:94569-135:94719-5161:290771", " 11983:1055521-5168:290676-198:100794-92:91781-139:95090-135:95353"))

foo[, newvar := tstrsplit(ValueId, "-", fixed = TRUE, keep = 4)]

foo[, newvar := tstrsplit(ValueId, "-", fixed = TRUE, keep = Level)]

Thanks!

Discus23 asked Nov 24 '25 07:11


2 Answers

You can use mapply with [ to extract, from the pieces returned by strsplit, the one at the position given in foo$Level.

mapply(`[`, strsplit(foo$ValueId, "-", fixed = TRUE), foo$Level)
#[1] NA            "5168:290668" "198:100798"  "92:91604"    "198:100794" 
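
To store the result as a column, the same call can go inside the data.table's j expression. The following is a minimal sketch on a made-up table (dt and its values are illustrative, not the question's data):

```r
library(data.table)

# Toy data (hypothetical, shorter than the question's reprex)
dt <- data.table(Level   = c(2, 2, 3),
                 ValueId = c("a:1", "a:1-b:2-c:3", "a:1-b:2-c:3-d:4"))

# Split every string once, then pick the Level-th piece of each row;
# indexing past the end of a short split gives NA
dt[, newvar := mapply(`[`, strsplit(ValueId, "-", fixed = TRUE), Level)]
dt$newvar
```

The first row has only one piece, so asking for piece 2 gives NA rather than an error.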
GKi answered Nov 26 '25 23:11


There are a couple of problems. One of them is in the tstrsplit function itself, which is defined as:

function (x, ..., fill = NA, type.convert = FALSE, keep, names = FALSE) 
{
  if (!isTRUEorFALSE(names) && !is.character(names)) 
    stop("'names' must be TRUE/FALSE or a character vector.")
  ans = transpose(strsplit(as.character(x), ...), fill = fill, 
                  ignore.empty = FALSE)
  if (!missing(keep)) {
    keep = suppressWarnings(as.integer(keep))
    chk = min(keep) >= min(1L, length(ans)) & max(keep) <= 
      length(ans)
    if (!isTRUE(chk)) 
      stop("'keep' should contain integer values between ", 
           min(1L, length(ans)), " and ", length(ans), 
           ".")
    ans = ans[keep]
  }
  if (type.convert) 
    ans = lapply(ans, type.convert, as.is = TRUE)
  if (isFALSE(names)) 
    return(ans)
  else if (isTRUE(names)) 
    names = paste0("V", seq_along(ans))
  if (length(names) != length(ans)) {
    str = if (missing(keep)) 
      "ans"
    else "keep"
    stop("length(names) (= ", length(names), ") is not equal to length(", 
         str, ") (= ", length(ans), ").")
  }
  setattr(ans, "names", names)
  ans
}
<bytecode: 0x0000019bffd6da98>
  <environment: namespace:data.table>

The important thing to note is the if block that checks that your keep fits the number of pieces returned. In your example, the first row has only one piece, so it should return NA. Your hard-coded example works because strsplit is vectorized: the NA row is processed at the same time as the rows that work, so this if block doesn't get triggered. You can try this out by changing that 4 to 40, and you'll get the message Error in tstrsplit(ValueId, "-", fixed = TRUE, keep = 40) : 'keep' should contain integer values between 1 and 9. because in that case no row has enough pieces.

So what you need to do is redefine the tstrsplit function so that it returns NA instead of stopping:
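
A small sketch of that range check on a single string may help. It assumes only the behavior visible in the source above: a string with one piece passes with keep = 1 but stops with keep = 2, which is what would happen to the first row of foo if each row were processed on its own. The names one_piece, ok and res are illustrative.

```r
library(data.table)

# One piece only, so the transposed result has length 1
one_piece <- "11983:1055521"

# keep = 1 is within range and returns the single piece
ok <- tstrsplit(one_piece, "-", fixed = TRUE, keep = 1)

# keep = 2 exceeds length(ans), so the range check calls stop();
# tryCatch turns the error into a marker value for inspection
res <- tryCatch(
  tstrsplit(one_piece, "-", fixed = TRUE, keep = 2),
  error = function(e) "stopped"
)
res
```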

tstrsplitNA <- function (x, ..., fill = NA, type.convert = FALSE, keep) 
{
  # Split each string, then transpose so each list element is one piece position
  ans = transpose(strsplit(as.character(x), ...), fill = fill, 
                  ignore.empty = FALSE)
  if (!missing(keep)) {
    keep = suppressWarnings(as.integer(keep))
    chk = min(keep) >= min(1L, length(ans)) & max(keep) <= 
      length(ans)
    # Out-of-range keep: fall back to NA instead of calling stop()
    if (!isTRUE(chk)) 
      ans <- NA_character_
    ans = ans[keep]
  }
  if (type.convert) 
    ans = lapply(ans, type.convert, as.is = TRUE)
  ans
}

That isn't enough, though, because strsplit is vectorized. Doing foo[, newvar := tstrsplitNA(ValueId, split="-", fixed = TRUE, keep = Level)] doesn't run that function row by row; it feeds the entire ValueId column to strsplit and then transposes the result, which returns gibberish relative to what you want.

You can tell data.table to do the operation row by row by using the by argument with Level and ValueId:
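
To see what the transposition does, here is a minimal sketch on two made-up strings (x, cols and kept are illustrative names). The result has one element per piece position, not one per row, so keep selects positions across all rows at once.

```r
library(data.table)

x <- c("a-b", "c-d-e")   # illustrative strings, not the question's data

# tstrsplit returns one element per piece position, padded with NA
cols <- tstrsplit(x, "-", fixed = TRUE)
length(cols)   # number of piece positions, here 3
cols[[2]]      # second piece of every row

# keep = c(2, 3) therefore selects positions 2 and 3 for all rows,
# rather than piece 2 for row 1 and piece 3 for row 2
kept <- tstrsplit(x, "-", fixed = TRUE, keep = c(2, 3))
```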

foo[, newvar := tstrsplitNA(ValueId, split="-", fixed = TRUE, keep = Level), by=c('Level','ValueId')]

foo
  Level                                                                                               ValueId      newvar
1:     2                                                                                         11983:1055521          NA
2:     2                                                                  11983:1055521-5168:290668-198:100798 5168:290668
3:     3 11983:1055521-5168:290668-198:100798-92:91604-139:94569-135:94719-5161:290771-5162:290728-5166:290620  198:100798
4:     4                         11983:1055521-5168:290668-198:100798-92:91604-139:94569-135:94719-5161:290771    92:91604
5:     3                                     11983:1055521-5168:290676-198:100794-92:91781-139:95090-135:95353  198:100794
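
As a variation, and not part of the approach above, the same row-by-row idea can be sketched without a custom function by grouping on an explicit row index and indexing the split directly. This assumes a made-up table dt; indexing past the end of a character vector yields NA, so no range check is needed.

```r
library(data.table)

# Toy data (hypothetical, shorter than the question's reprex)
dt <- data.table(Level   = c(2, 2, 3),
                 ValueId = c("a:1", "a:1-b:2-c:3", "a:1-b:2-c:3-d:4"))

# One group per row: split that row's string and take its Level-th piece
dt[, newvar := strsplit(ValueId, "-", fixed = TRUE)[[1L]][Level],
   by = seq_len(nrow(dt))]
dt$newvar
```

Grouping by row index also keeps duplicate rows in separate groups, whereas grouping by Level and ValueId puts identical rows in the same group (they get the same value either way).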
Dean MacGregor answered Nov 27 '25 00:11


