Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Split variable from comma into an ordered dataframe

Tags:

split

dataframe

r

I have a dataframe like this, where the values are separated by comma.

# Events
# A,B,C
# C,D
# B,A
# D,B,A,E
# A,E,B

I would like to have the next data frame

# Event1  Event2  Event3  Event4  Event5
# A       B       C       NA      NA
# NA      NA      C       NA      NA
# A       B       NA      NA      NA
# A       B       NA      D       E
# A       B       NA      NA      E

I have tried with cSplit but I don't have the desired df. Is possible?

NOTE: The values doesn't appear in the same possition as the variable Event in the second dataframe.

like image 728
Kevin Santos Avatar asked Jun 25 '26 10:06

Kevin Santos


1 Answers

1) Here is a base R solution. split each row giving list s and create cols which contains the possible values. Then iterate over s and convert that to a data frame.

Note that this does not hard code the column names and continues to work even if some column names are substrings of other column names.

s <- strsplit(DF$Events, ",")
cols <- unique(sort(unlist(s)))

data.frame(Event = t(sapply(s, function(x) ifelse(cols %in% x, cols, NA))))

giving:

  Event.1 Event.2 Event.3 Event.4 Event.5
1       A       B       C    <NA>    <NA>
2    <NA>    <NA>       C       D    <NA>
3       A       B    <NA>    <NA>    <NA>
4       A       B    <NA>       D       E
5       A       B    <NA>    <NA>       E

2) This base R solution uses strsplit as above and then names the components since stack requires a named list and then invokes stack. Then we expand that into a wide form using tapply and convert it to a data frame and fix up the names.

s <- strsplit(DF$Events, ",")
names(s) <- seq_along(s)
stk <- stack(s)

mat <- t(tapply(stk$values, stk, c))
colnames(mat) <- NULL
data.frame(Event = mat)

giving:

  Event.1 Event.2 Event.3 Event.4 Event.5
1       A       B       C    <NA>    <NA>
2    <NA>    <NA>       C       D    <NA>
3       A       B    <NA>    <NA>    <NA>
4       A       B    <NA>       D       E
5       A       B    <NA>    <NA>       E

This could also be represented as an R 4.2+ pipeline:

DF |>
  with(setNames(Events, seq_along(Events))) |>
  strsplit(",") |>
  stack() |>
  with(tapply(values, data.frame(ind, values), c)) |>
  `colnames<-`(NULL) |>
  data.frame(Event = _)

Note

The input in reproducible form:

Lines <- "Events
A,B,C
C,D
B,A
D,B,A,E
A,E,B"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE)
like image 132
G. Grothendieck Avatar answered Jun 27 '26 00:06

G. Grothendieck



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!