
R data.table explode column using tstrsplit

I have the following data.table as an example

df = data.table(id = c(1, 2, 3), val=c("['hello', 'world']", "['hi']", "['so', 'there']"))

I want to split each list-like string into separate rows, with the id repeated. The expected data.table is the following

df2 = data.table(id = c(1, 1, 2, 3, 3), val=c("hello", "world", "hi", "so", "there"))

I tried the following

df[, c("test") := tstrsplit(val, ",", fixed=TRUE)]

However, I got the following error

Error in [.data.table(df, , :=(c("test"), tstrsplit(val, ",", fixed = TRUE))) : Supplied 2 items to be assigned to 3 items of column 'test'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.

Could someone point out what I'm doing wrong here? Thanks in advance.

broccoli asked Apr 25 '26 03:04


2 Answers

From the structure of your data, it looks like it came from Python. You could use reticulate for this (it needs a Python installation with pandas available):

library(reticulate)
ast <- import('ast')

df_python <- r_to_py(df)
df_python$assign(val = df_python$val$transform(ast$literal_eval))$explode('val')

    id    val
0  1.0  hello
0  1.0  world
1  2.0     hi
2  3.0     so
2  3.0  there

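Directly in data.table you could do the following (the pattern [^a-z,] drops every character except lowercase letters and commas, so it assumes the values are lowercase words):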

df[, .(val = tstrsplit(gsub('[^a-z,]', '',val), ',')), by = 'id']
   id   val
1:  1 hello
2:  1 world
3:  2    hi
4:  3    so
5:  3 there
KU99 answered Apr 26 '26 18:04


Here's one way,

df[, .(val = tstrsplit(gsub("[][']", "", val), ", ", fixed=TRUE)), by = id]
#       id    val
#    <num> <list>
# 1:     1  hello
# 2:     1  world
# 3:     2     hi
# 4:     3     so
# 5:     3  there

It removes all square brackets and single quotes, then tstrsplits what is left, as originally intended. The split is on ", " (comma plus space) so that no leading space is left on the second word. tstrsplit returns a list with one element per piece, and inside j each element of that list becomes its own row, which is why val comes back as a list column. The by=id ensures that we don't inadvertently combine different vals, and that id is preserved in the output.
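To see why the original := call failed, it may help to look at what tstrsplit returns for the whole column. The following is only a minimal sketch on the question's data; the name pieces is made up for illustration.

```r
library(data.table)

df <- data.table(id = c(1, 2, 3),
                 val = c("['hello', 'world']", "['hi']", "['so', 'there']"))

# tstrsplit transposes the split: element k of the result holds the
# k-th piece of every input row, padded with NA where a row has fewer pieces.
pieces <- tstrsplit(df$val, ",", fixed = TRUE)

length(pieces)    # one list element per piece position
lengths(pieces)   # each element has one entry per input row
pieces[[2]]       # second pieces; NA for the row that has only one piece
```

Here the result is a list of two vectors, so assigning it to the single column "test" hands := two items for a three-row column, which matches the error in the question. Grouping by id sidesteps this because each group's list of pieces is returned as rows of a new table instead of being assigned into an existing column.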

If you wanted to see the grouping, combining, and splitting as separate steps, you could do

df[, .(val = paste(gsub("[][']", "", val), collapse = ", ")), by = id]
#       id          val
#    <num>       <char>
# 1:     1 hello, world
# 2:     2           hi
# 3:     3    so, there
df[, .(val = paste(gsub("[][']", "", val), collapse = ", ")), by = id
  ][, .(val = tstrsplit(val, ", ", fixed = TRUE)), by = id]
#       id    val
#    <num> <list>
# 1:     1  hello
# 2:     1  world
# 3:     2     hi
# 4:     3     so
# 5:     3  there
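The val column in the output above is a list column. If a plain character column is wanted, as in the expected df2, one possible variant is to swap tstrsplit for unlist(strsplit()). This is a sketch that assumes the pieces are separated by a comma followed by a space; the name out is made up for illustration.

```r
library(data.table)

df <- data.table(id = c(1, 2, 3),
                 val = c("['hello', 'world']", "['hi']", "['so', 'there']"))

# Strip brackets and single quotes, split each string on ", ",
# and flatten the pieces of each group into a character vector.
out <- df[, .(val = unlist(strsplit(gsub("[][']", "", val), ", ", fixed = TRUE))),
          by = id]

out
str(out$val)   # a character vector, not a list
```

Because unlist() flattens every piece in the group, this variant also copes with an id that appears on more than one row.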

Note that the error's recommendation to use rep() is fine, except you'll need to do a little more work to know how many times to repeat each id; using id as a grouping variable relieves this need, at a small cost in execution time (since tstrsplit then runs once per group instead of once for the whole column).
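For completeness, the rep() route might look like the sketch below: split the whole column once, then use lengths() to work out how many times each id has to be repeated. The names parts and out are made up for illustration, and the ", " separator is an assumption based on the sample data.

```r
library(data.table)

df <- data.table(id = c(1, 2, 3),
                 val = c("['hello', 'world']", "['hi']", "['so', 'there']"))

# Split every string in one vectorised call; parts is a list with
# one character vector per input row.
parts <- strsplit(gsub("[][']", "", df$val), ", ", fixed = TRUE)

# lengths(parts) gives the number of pieces per row, which is exactly
# how many times each id must be repeated.
out <- data.table(id = rep(df$id, lengths(parts)),
                  val = unlist(parts))
out
```

This avoids the per-group overhead, at the price of building the result by hand instead of letting by = id carry the id along.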

r2evans answered Apr 26 '26 18:04