I have the following data.table as an example:
df = data.table(id = c(1, 2, 3), val=c("['hello', 'world']", "['hi']", "['so', 'there']"))
I want to split the list-like strings in val into separate rows, with the id repeated. The data.table I expect is the following:
df2 = data.table(id = c(1, 1, 2, 3, 3), val=c("hello", "world", "hi", "so", "there"))
I tried the following:
df[, c("test") := tstrsplit(val, ",", fixed=TRUE)]
However, I got the following error:
Error in `[.data.table`(df, , `:=`(c("test"), tstrsplit(val, ",", fixed = TRUE))) : Supplied 2 items to be assigned to 3 items of column 'test'. If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
Could someone point out what I'm doing wrong here? Thanks in advance.
The val strings look like Python list literals, so the data probably came from Python. You could use reticulate to parse them:
library(reticulate)
ast <- import('ast')
df_python <- r_to_py(df)
df_python$assign(val = df_python$val$transform(ast$literal_eval))$explode('val')
id val
0 1.0 hello
0 1.0 world
1 2.0 hi
2 3.0 so
2 3.0 there
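Directly in data.table you could do the following (the regex keeps only lowercase letters and commas, so it assumes the values contain nothing else):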
df[, .(val = tstrsplit(gsub('[^a-z,]', '',val), ',')), by = 'id']
id val
1: 1 hello
2: 1 world
3: 2 hi
4: 3 so
5: 3 there
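Here's one way:
df[, .(val = tstrsplit(gsub("[][' ]", "", val), ",", fixed=TRUE)), by = id]
# id val
# <num> <list>
# 1: 1 hello
# 2: 1 world
# 3: 2 hi
# 4: 3 so
# 5: 3 there
The gsub removes all square brackets, single quotes, and spaces (without the space in the character class, "world" would keep the blank that follows the comma), and tstrsplit then splits each string on commas, as originally intended. Conceptually this is the same as collapsing all val strings for an id into a single comma-separated string and splitting that; with one row per id, the collapse changes nothing. The by = id ensures that we don't inadvertently combine different vals, and that id is preserved in the output. Note that val comes back as a list column, as the <list> header shows.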
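Since tstrsplit returns a list, the val column in the output above is a list column (note the <list> header). If a plain character column is preferred, a minimal sketch is to use strsplit with unlist inside the same grouped call. This assumes the values themselves contain no spaces, brackets, or single quotes; the name out is only for illustration.

```r
library(data.table)

df <- data.table(id = c(1, 2, 3),
                 val = c("['hello', 'world']", "['hi']", "['so', 'there']"))

# Strip brackets, single quotes and spaces, split on commas, and flatten
# the pieces so that val is a character column rather than a list column.
out <- df[, .(val = unlist(strsplit(gsub("[][' ]", "", val), ",", fixed = TRUE))),
          by = id]
out
```

Grouping by id still takes care of repeating each id once per extracted element.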
If you want to see the grouping, combining, and splitting as separate steps, you could do the following (the space in the character class also strips the blank after each comma):
df[, .(val = paste(gsub("[][' ]", "", val), collapse = ",")), by = id]
# id val
# <num> <char>
# 1: 1 hello,world
# 2: 2 hi
# 3: 3 so,there
df[, .(val = paste(gsub("[][' ]", "", val), collapse = ",")), by = id
][, .(val = tstrsplit(val, ",", fixed = TRUE)), by = id]
# id val
# <num> <list>
# 1: 1 hello
# 2: 1 world
# 3: 2 hi
# 4: 3 so
# 5: 3 there
Note that the error's recommendation to use rep() (here, on id) is fine, but you need a little more work to know how many times to repeat each id. Using id as a grouping variable removes that need, at a small cost in execution time, since tstrsplit then runs once per group instead of once over the whole column.
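For completeness, the rep() route can be sketched as follows. This is only one possible way to get the repeat counts, not something the error message prescribes: split the whole column once with strsplit, then use lengths() to see how many pieces each row produced. The names parts and res are illustrative, and the regex assumes the values contain no spaces, brackets, or single quotes.

```r
library(data.table)

df <- data.table(id = c(1, 2, 3),
                 val = c("['hello', 'world']", "['hi']", "['so', 'there']"))

# Split every row in one vectorised call; parts is a list with one
# character vector per original row.
parts <- strsplit(gsub("[][' ]", "", df$val), ",", fixed = TRUE)

# lengths(parts) is the number of pieces per row, which is exactly how
# many times each id has to be repeated.
res <- data.table(id = rep(df$id, lengths(parts)), val = unlist(parts))
res
```

This avoids the per-group overhead, at the price of computing the repeat counts by hand.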