
Between group difference in mean values for all numeric variables

I am trying to calculate the difference in mean values between two groups across multiple numeric variables. For instance, if I had the following data:

Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.1         3.5          1.4         0.2     setosa
2          4.9         3.0          1.4         0.2 versicolor
3          4.7         3.2          1.3         0.2 versicolor
4          4.6         3.1          1.5         0.2     setosa
5          5.0         3.6          1.4         0.2     setosa

I would like to subtract, for instance, the mean values of 'versicolor' from the mean values of 'setosa', and save this as a new dataframe.

Result looking something like this:

Sepal.Length Sepal.Width Petal.Length Petal.Width
1          0.1         0.3         0.08         0.0

I would really like to do this using dplyr, which I am currently learning. Ideally, the solution could also be applied to a much larger data frame (hundreds of variables) and would select only the numeric variables to apply the function to.

If you could break down the code somewhat line by line that would be really excellent.

Thanks a lot.

Lachlan asked Dec 19 '25 21:12


2 Answers

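There are many possible ways to do this and to structure the output the way you want it. One option is to compute the group means, reshape the result to long format and then back to wide so that each species becomes its own column, and then simply subtract the columns you want. Here iris1 is the five-row sample from the question, and pivot_longer() and pivot_wider() come from the tidyr package, i.e.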

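library(dplyr)
library(tidyr)  # provides pivot_longer() and pivot_wider()

iris1 %>%
 group_by(Species) %>%                                      # one group per species
 summarise_all(list(mean)) %>%                              # mean of every non-grouping column
 pivot_longer(cols = Sepal.Length:Petal.Width) %>%          # one row per species/variable pair
 pivot_wider(names_from = Species, values_from = value) %>% # one column per species
 mutate(versicolor_setosa = setosa - versicolor)            # setosa mean minus versicolor mean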

which gives,

# A tibble: 4 x 4
  name         setosa versicolor versicolor_setosa
  <chr>         <dbl>      <dbl>             <dbl>
1 Sepal.Length   4.90       4.8             0.1000
2 Sepal.Width    3.4        3.1             0.300 
3 Petal.Length   1.43       1.35            0.0833
4 Petal.Width    0.2        0.2             0     
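
The pipeline above names its columns explicitly (Sepal.Length:Petal.Width), which gets awkward with hundreds of variables. A minimal sketch of one way to pick the numeric columns automatically, assuming dplyr 1.0.0 or later (for across() and where()) and tidyr 1.0.0 or later. The data frame below rebuilds the question's sample under the name iris1, and the column name setosa_minus_versicolor is only an illustrative choice:

```r
library(dplyr)
library(tidyr)

# Five-row sample from the question (rebuilt here so the example is self-contained)
iris1 <- data.frame(
  Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5.0),
  Sepal.Width  = c(3.5, 3.0, 3.2, 3.1, 3.6),
  Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4),
  Petal.Width  = c(0.2, 0.2, 0.2, 0.2, 0.2),
  Species      = c("setosa", "versicolor", "versicolor", "setosa", "setosa")
)

long_diff <- iris1 %>%
  group_by(Species) %>%                                      # one group per species
  summarise(across(where(is.numeric), mean)) %>%             # mean of every numeric column
  pivot_longer(cols = -Species) %>%                          # everything except Species goes long
  pivot_wider(names_from = Species, values_from = value) %>% # one column per species
  mutate(setosa_minus_versicolor = setosa - versicolor)      # difference in group means

long_diff
```

Non-numeric columns other than the grouping variable are dropped by the summarise() step, so nothing has to be listed by name.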
Sotos answered Dec 21 '25 11:12


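Here's a way to do it with dplyr, shown on the full built-in iris data. Note that diff() subtracts the first row from the second, and the summarised rows are ordered setosa, versicolor, so the result is versicolor minus setosa; flip the signs to get setosa minus versicolor.

library(dplyr)

iris %>%
  filter(Species %in% c("versicolor", "setosa")) %>%  # keep only the two groups being compared
  group_by(Species) %>%                               # one group per species
  summarise_all(mean) %>%                             # mean of every non-grouping column
  summarise_at(-1, diff)                              # drop Species, then row 2 minus row 1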

# A tibble: 1 x 4
  Sepal.Length Sepal.Width Petal.Length Petal.Width
         <dbl>       <dbl>        <dbl>       <dbl>
1        0.930      -0.658         2.80        1.08
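
If the direction of the subtraction matters, it may be clearer to name the groups explicitly instead of relying on row order. A rough sketch, assuming dplyr 1.0.0 or later for across() and where(); the object names group_means and mean_diff are just illustrative:

```r
library(dplyr)

# Step 1: one row of means per species, numeric columns only
group_means <- iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), mean))

# Step 2: for each numeric column, subtract the versicolor mean
# from the setosa mean, leaving a one-row data frame
mean_diff <- group_means %>%
  summarise(across(
    where(is.numeric),
    ~ .x[Species == "setosa"] - .x[Species == "versicolor"]
  ))

mean_diff
```

This gives the same magnitudes as the output above with the signs reversed, since it computes setosa minus versicolor.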
Aron Strandberg answered Dec 21 '25 11:12