I'm sure there's a simple solution to this, but I can't figure it out!! Suppose I have a dataframe that has the following information:
aaa<-c("A,B","B,C","B,D,E")
vvv<-c("101","101,102","102,103,104")
data_h<-data.frame(aaa,vvv)
data_h
    aaa         vvv
1   A,B         101
2   B,C     101,102
3 B,D,E 102,103,104
Desired output is a frequency map of individual hits, for subsequent analysis in a heat map. So:
  101   102   103   104
A  1     0     0     0
B  2     2     1     1
C  1     1     0     0
D  0     1     1     1
E  0     1     1     1
How do I make this transformation? I've seen many similar examples, but none where the contents of the data-frame need to be parsed.
The goal is to ultimately use heatmap or something similar on the output table to visualize the correlation between "aaa" and "vvv".
Here is a base R solution in 4 lines of code. First we define a function, spl, which splits a comma-separated string into a vector of its fields. eg takes two such strings, applies spl to each of them, and then uses expand.grid to build every combination of the resulting pieces. Finally we apply eg to each row of data_h, rbind the results together and tabulate them with xtabs:
spl <- function(x) strsplit(as.character(x), ",")[[1]]
eg <- function(aaa, vvv) expand.grid(aaa = spl(aaa), vvv = spl(vvv))
dd <- do.call("rbind", Map(eg, data_h$aaa, data_h$vvv))
xtabs(data = dd)
The result is:
   vvv
aaa 101 102 103 104
  A   1   0   0   0
  B   2   2   1   1
  C   1   1   0   0
  D   0   1   1   1
  E   0   1   1   1
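Since the stated goal is to feed the result into heatmap, note that the xtabs output is a 2-D table and therefore already a numeric matrix, so it can be handed to heatmap() more or less directly. A minimal sketch, not part of the original answer (the Rowv/Colv/scale settings are just one reasonable choice):
tab <- xtabs(data = dd)
## unclass() drops the "xtabs"/"table" classes, leaving a plain count matrix;
## Rowv = NA and Colv = NA keep the original row/column order (no clustering),
## and scale = "none" plots the raw counts rather than row-scaled values.
heatmap(unclass(tab), Rowv = NA, Colv = NA, scale = "none")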
dcast. Alternatively, replace the xtabs line above with:
library(reshape2)
dcast(dd, aaa ~ vvv, fun = length, value.var = "vvv")
in which case the result is:
  aaa 101 102 103 104
1   A   1   0   0   0
2   B   2   2   1   1
3   C   1   1   0   0
4   D   0   1   1   1
5   E   0   1   1   1
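Note that dcast returns a data.frame with aaa as an ordinary column. If a numeric matrix is needed later (for heatmap, say), it can be converted along these lines (a sketch; tab_df and m are just throwaway names):
tab_df <- dcast(dd, aaa ~ vvv, fun = length, value.var = "vvv")
m <- as.matrix(tab_df[, -1])             # drop the "aaa" column, keep the counts
rownames(m) <- as.character(tab_df$aaa)  # carry the letters over as row names
m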
tapply. Another alternative would be tapply (however, it will fill in empty cells with NA rather than 0):
tapply(1:nrow(dd), dd, length)
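If zeros are wanted instead of NA (for example before plotting), one way, shown here as a sketch, is to fill them in afterwards:
tab2 <- tapply(1:nrow(dd), dd, length)
tab2[is.na(tab2)] <- 0   ## turn the empty cells into explicit zero counts
tab2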
ADDED: the dcast and tapply alternatives above, plus some improvements.
The shape of the data.frame suggests using the splitstackshape package. I don't know this package very well, so I just use it to reshape the data and then compute the frequencies by hand with table:
library(splitstackshape)
data_h_split <- concat.split.multiple(data_h,1:2)
# aaa_1 aaa_2 aaa_3 vvv_1 vvv_2 vvv_3
# 1     A     B  <NA>   101    NA    NA
# 2     B     C  <NA>   101   102    NA
# 3     B     D     E   102   103   104
Once the data are in this format (regular columns, no embedded commas), it is easy to compute frequencies using table (tapply or reshape would also work). Note, however, that unlisting the columns pairs aaa_i with vvv_i position by position, so the counts below differ from the desired cross-product table in the question (B, for example, is counted only once with 101 and once with 102); a sketch that recovers the full cross product from the split data follows the output.
table(cbind.data.frame(ff= unlist(data_h_split[1:3]),
                       xx= unlist(data_h_split[4:6])))
   xx
ff  101 102 103 104
  A   1   0   0   0
  B   1   1   0   0
  C   0   1   0   0
  D   0   0   1   0
  E   0   0   0   1
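To recover the full cross-product counts from the split data (so that, for example, B is also counted with 103 and 104), each row's letters have to be crossed with that row's codes rather than paired position by position. A sketch along the lines of the expand.grid idea above, not part of the original answer (dhs and pairs_list are just throwaway names):
dhs <- as.data.frame(data_h_split)   ## plain data.frame indexing, in case a data.table is returned
pairs_list <- lapply(seq_len(nrow(dhs)), function(i)
  expand.grid(ff = na.omit(unlist(dhs[i, 1:3])),
              xx = na.omit(unlist(dhs[i, 4:6]))))
table(do.call(rbind, pairs_list))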
Here's a multi-step approach using "splitstackshape" to get the desired result.
library(splitstackshape)
## Split the "vvv" column first, and reshape at the same time
x <- concat.split.multiple(data_h, split.cols="vvv", ",", "long")
## Add an ID column
x$id <- 1:nrow(x)
## Split the "aaa" column next, again reshaping as we do so
x <- concat.split.multiple(x[complete.cases(x), ], split.cols="aaa", ",", "long")
## Use `table` with `droplevels`
with(droplevels(x), table(aaa, vvv))
#    vvv
# aaa 101 102 103 104
#   A   1   0   0   0
#   B   2   2   1   1
#   C   1   1   0   0
#   D   0   1   1   1
#   E   0   1   1   1