Given a situation such as the following:
library(dplyr)
myData <- tbl_df(data.frame( var1 = rnorm(100), 
                             var2 = letters[1:3] %>%
                                    sample(100, replace = TRUE) %>%
                                    factor(), 
                             var3 = LETTERS[1:3] %>%
                                    sample(100, replace = TRUE) %>%
                                    factor(), 
                             var4 = month.abb[1:3] %>%
                                    sample(100, replace = TRUE) %>%
                                    factor()))
I would like to group `myData` and eventually compute summary statistics for every possible combination of var2, var3, and var4.
I can create a list of all possible combinations of the variable names, as character values, with:
groupNames <- names(myData)[2:4]
myGroups <- Map(combn, 
              list(groupNames), 
              seq_along(groupNames),
              simplify = FALSE) %>%
              unlist(recursive = FALSE)
My plan was to make a separate data set for each variable combination with a for() loop, something like:
### This Does Not Work
for (i in 1:length(myGroups)){
     assign( myGroups[i]%>%
             unlist() %>%
             paste0(collapse = "")%>%
             paste0("Data"), 
               myData %>% 
               group_by_(lapply(myGroups[[i]], as.symbol)) %>%
               summarise( n = length(var1), 
                             avgVar2 = var2 %>%
                                       mean()))
}
Admittedly I am not very good with lists, and looking up this issue was a bit challenging since dplyr updates have changed how grouping works.
If there is a better way to do this than separate data sets I would love to know.
I've gotten a loop similar to above working when I am only grouping by a single variable.
Any and all help is greatly appreciated! Thank you!
With group_by() from the dplyr package, we can group by multiple columns (two or more) and then summarise several columns within each group.
In the tidyverse, group_by() is the function that does this. For categorical variables, group_by() divides the data into subgroups based on each variable's distinct categories.
group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed "by group".
group_by() belongs to the dplyr package and groups the rows of a data frame. On its own, group_by() computes nothing and leaves the rows unchanged; it should be followed by summarise() with an appropriate aggregation.
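As a minimal sketch of the point above, the following groups a small data frame by two columns and then summarises it. The data frame `df` and its columns `g1`, `g2`, and `value` are made up for illustration and are not part of the original question; the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Small illustrative data set (hypothetical, not from the original post)
df <- data.frame(
  g1    = c("a", "a", "b", "b", "b"),
  g2    = c("X", "Y", "X", "X", "Y"),
  value = c(1, 2, 3, 5, 8)
)

# group_by() alone only records the grouping; the rows are unchanged
grouped <- df %>% group_by(g1, g2)

# summarise() then collapses each (g1, g2) group to a single row
result <- grouped %>%
  summarise(n = n(), avg = mean(value), .groups = "drop")

print(result)
```

Printing `grouped` shows the same five rows as `df` plus a note about the groups, whereas `result` has one row per distinct combination of `g1` and `g2`.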
I have created a function based on the answer of @Gregor and the comments that followed:
library(dplyr)  # provides tbl_df() and %>%; the plyr package must also be installed
myData <- tbl_df(data.frame( var1 = rnorm(100), 
                         var2 = letters[1:3] %>%
                                sample(100, replace = TRUE) %>%
                                factor(), 
                         var3 = LETTERS[1:3] %>%
                                sample(100, replace = TRUE) %>%
                                factor(), 
                         var4 = month.abb[1:3] %>%
                                sample(100, replace = TRUE) %>%
                                factor()))
The combSummarise function:
combSummarise <- function(data, variables, summarise){
  # Get all different combinations of selected variables (credit to @Michael)
    myGroups <- lapply(seq_along(variables), function(x) {
    combn(c(variables), x, simplify = FALSE)}) %>%
    unlist(recursive = FALSE)
  # Group by selected variables (credit to @konvas)
    df <- eval(parse(text=paste("lapply(myGroups, function(x){
               dplyr::group_by_(data, .dots=x) %>% 
               dplyr::summarize_( \"", paste(summarise, collapse="\",\""),"\")})"))) %>% 
          do.call(plyr::rbind.fill,.)
    groupNames <- c(myGroups[[length(myGroups)]])
    newNames <- names(df)[!(names(df) %in% groupNames)]
    df <- cbind(df[, groupNames], df[, newNames])
    names(df) <- c(groupNames, newNames)
    df
}
Example calls to combSummarise:
combSummarise (myData, var=c("var2", "var3", "var4"), 
               summarise=c("length(var1)", "mean(var1)", "max(var1)"))
or
combSummarise (myData, var=c("var2", "var4"), 
               summarise=c("length(var1)", "mean(var1)", "max(var1)"))
or
combSummarise (myData, var=c("var2", "var4"), 
           summarise=c("length(var1)"))
etc
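Note that group_by_() and summarize_() are deprecated in current dplyr. Below is a rough sketch of the same idea without eval(parse()), assuming dplyr 1.0.0 or later. The helper name `summarise_all_combos` and the demo data frame `d` are hypothetical, and the summaries are passed as ordinary expressions rather than as strings, so this is an alternative approach and not a drop-in replacement for the function above.

```r
library(dplyr)

# Hypothetical helper: summarise `data` once per combination of `vars`
summarise_all_combos <- function(data, vars, ...) {
  # All non-empty combinations of the grouping variable names
  combos <- unlist(
    lapply(seq_along(vars), function(k) combn(vars, k, simplify = FALSE)),
    recursive = FALSE
  )
  # One summary per combination; bind_rows() fills absent columns with NA
  lapply(combos, function(g) {
    data %>%
      group_by(across(all_of(g))) %>%
      summarise(..., .groups = "drop")
  }) %>%
    bind_rows()
}

# Demo data (made up for illustration)
d <- data.frame(
  a = c("x", "x", "y"),
  b = c("p", "q", "p"),
  v = c(1, 2, 3)
)

out <- summarise_all_combos(d, c("a", "b"), n = n(), avg = mean(v))
print(out)
```

Rows that come from a combination that omits a grouping variable carry NA in that variable's column, which mirrors what plyr::rbind.fill does in the function above.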
Inspired by the answers by Gregor and dimitris_ps, I wrote a dplyr style function that runs summarise for all combinations of group variables.
summarise_combo <- function(data, ...) {
  groupVars <- group_vars(data) %>% map(as.name)
  groupCombos <-  map( 0:length(groupVars), ~combn(groupVars, ., simplify=FALSE) ) %>%
    unlist(recursive = FALSE)
  results <- groupCombos %>% 
    map(function(x) {data %>% group_by(!!! x) %>% summarise(...)} ) %>%
    bind_rows()
  results %>% select(!!! groupVars, everything())
}
Example
library(tidyverse)
mtcars %>% group_by(cyl, vs) %>% summarise_combo(cyl_n = n(), mean(mpg))