I'm sure there's a simple solution to this problem, but I'm having trouble figuring it out. I have a data frame in the following format:
Number Category Type   Count
1      X        A      10
2      X        B      14
3      Y        B      3
4      Z        A      14
"Type" is a factor with two levels, A and B, and each level appears in at least one "Category". (For simplicity the categories are denoted X, Y, Z here; my actual dataset has too many to list.) I would like each Type to have the same number of rows within every Category:
Number Category Type   Count
1      X        A      10
2      X        B      14
3      Y        A      <NA>
4      Y        B      3
5      Z        A      14
6      Z        B      <NA>
For instance, if Type A is listed in four rows of some Category but Type B has no rows in that Category, then four new rows with that Category and Type B should be created (with Count=NA). Similarly, if Type A has four rows in a Category and Type B has two, then two new Type B rows should be created.
I was able to find numerous answers on how to do this for missing dates in time series data using seq(), expand.grid(), and merge(), but I can't quite see how to do it in this case. I hope this is clear... Grateful for any help!
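# stringsAsFactors = TRUE is needed in R >= 4.0, where read.table() no longer
# creates factors by default and levels() would otherwise return NULL
dat <- read.table(header = TRUE, stringsAsFactors = TRUE, text =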
                 "Number Category Type   Count
                  1      X        A      10
                  2      X        B      14
                  3      Y        B      3
                  4      Z        A      14")
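Before filling in the gaps, it may help to confirm which Category/Type pairs are missing. The following is a minimal sketch, not part of the original answer; the names `tab` and `missing` are only illustrative, and the data are repeated so the snippet runs on its own.

```r
# Same example data as above, read with factors enabled.
dat <- read.table(header = TRUE, stringsAsFactors = TRUE, text =
"Number Category Type Count
1 X A 10
2 X B 14
3 Y B 3
4 Z A 14")

# Cross-tabulate the number of rows per Category/Type pair.
# A cell equal to 0 marks a combination that is absent from the data.
tab <- table(Category = dat$Category, Type = dat$Type)
print(tab)

# Convert the table to long format and keep only the absent pairs.
missing <- subset(as.data.frame(tab, stringsAsFactors = FALSE), Freq == 0)
print(missing)
```

For the example data, `missing` should contain the pairs (Y, A) and (Z, B), which are exactly the rows the desired output adds.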
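Use expand.grid() to build a master list of every Type/Category combination, then merge() it with the data, keeping every row of the master list (all.y=TRUE) so that missing combinations appear with NA: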
merge(dat, expand.grid(lapply(dat[c("Type","Category")], levels)), all.y=TRUE)
#  Category Type Number Count
#1        X    A      1    10
#2        X    B      2    14
#3        Y    A     NA    NA
#4        Y    B      3     3
#5        Z    A      4    14
#6        Z    B     NA    NA
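The merge above covers combinations that are missing entirely. The question also asks about the case where one Type has more rows than the other within a Category. One possible way to handle that, sketched here under assumptions, is to number the rows inside each Category/Type group and include that number in the master list. The data frame `dat2` and the names `idx`, `n_max`, `types`, `master`, and `padded` are hypothetical and exist only for this illustration.

```r
# Hypothetical data: Type A has two rows in Category X, Type B has one.
dat2 <- data.frame(
  Category = c("X", "X", "X", "Y"),
  Type     = c("A", "A", "B", "B"),
  Count    = c(10, 11, 14, 3),
  stringsAsFactors = FALSE
)

# Position of each row within its Category/Type group (1, 2, ...).
dat2$idx <- ave(seq_len(nrow(dat2)), dat2$Category, dat2$Type, FUN = seq_along)

# Largest group size found in each Category.
n_max <- sapply(split(dat2$idx, dat2$Category), max)

# Master list: every Type gets slots 1..n_max within each Category.
types <- sort(unique(dat2$Type))
master <- do.call(rbind, lapply(names(n_max), function(ctg) {
  expand.grid(Category = ctg, Type = types, idx = seq_len(n_max[[ctg]]),
              stringsAsFactors = FALSE)
}))

# Keep every slot of the master list; unmatched slots get Count = NA.
padded <- merge(dat2, master, all.y = TRUE)
print(padded)
```

With this input, `padded` has six rows: two per Type in Category X and one per Type in Category Y, with Count set to NA in the two added rows. The helper column `idx` can be dropped afterwards with `padded$idx <- NULL`.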