Between group difference in mean values for all numeric variables

Question

I am trying to calculate the difference in mean values between two groups across multiple numeric variables. For instance, if I had the following data:

  Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1          5.1         3.5          1.4         0.2     setosa
2          4.9         3.0          1.4         0.2 versicolor
3          4.7         3.2          1.3         0.2 versicolor
4          4.6         3.1          1.5         0.2     setosa
5          5.0         3.6          1.4         0.2     setosa
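
For anyone who wants to reproduce this, the five rows above can be rebuilt as a data frame. This is only a minimal sketch: the values are copied from the printout, and the name `iris1` is an assumption, chosen to match the object used in the accepted answer.

```r
# Rebuild the five-row sample shown in the question.
# The object name `iris1` is an assumption (it matches the accepted answer).
iris1 <- data.frame(
  Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5.0),
  Sepal.Width  = c(3.5, 3.0, 3.2, 3.1, 3.6),
  Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4),
  Petal.Width  = c(0.2, 0.2, 0.2, 0.2, 0.2),
  Species      = c("setosa", "versicolor", "versicolor", "setosa", "setosa")
)

# Quick look at the structure: four numeric columns and one grouping column.
str(iris1)
```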

I would like to subtract, for instance, the mean values of 'versicolor' from the mean values of 'setosa', and save this as a new dataframe.

Result looking something like this:

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          0.1         0.3         0.18         0.0

I would really like to do this using dplyr, which I am currently learning. Ideally, the solution would also scale to a much larger data frame (hundreds of variables) and apply the function only to the numeric variables.

If you could break down the code somewhat line by line that would be really excellent.

Thanks a lot.

Sotos · Accepted Answer

There are many ways to do this and to structure the output. One option is to compute the group means, reshape the result so that each species becomes its own column (pivot to long form, then back to wide), and then simply subtract the columns you want, i.e.

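library(dplyr)
library(tidyr)  # pivot_longer() and pivot_wider() come from tidyr

# iris1 is the five-row sample from the question
iris1 %>% 
 group_by(Species) %>%                                        # one group per species
 summarise_all(list(mean)) %>%                                # mean of every other column
 pivot_longer(cols = Sepal.Length:Petal.Width) %>%            # one row per species/variable pair
 pivot_wider(names_from = Species, values_from = value) %>%   # one column of means per species
 mutate(versicolor_setosa = setosa - versicolor)              # setosa minus versicolor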

which gives,

# A tibble: 4 x 4
  name         setosa versicolor versicolor_setosa
  <chr>         <dbl>      <dbl>             <dbl>
1 Sepal.Length   4.90       4.8             0.1000
2 Sepal.Width    3.4        3.1             0.300 
3 Petal.Length   1.43       1.35            0.0833
4 Petal.Width    0.2        0.2             0
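
The question also asks for numeric columns to be picked automatically and for a line-by-line breakdown. The following is a sketch of the same reshape idea under two assumptions: dplyr 1.0.0 or later is installed (for `across()` and `where()`), and the built-in `iris` data stands in for the larger data frame. The names `mean_diff`, `variable`, and `setosa_minus_versicolor` are illustrative only.

```r
library(dplyr)  # assumes dplyr >= 1.0.0 for across() and where()
library(tidyr)

mean_diff <- iris %>%
  # keep only the two groups being compared
  filter(Species %in% c("setosa", "versicolor")) %>%
  # every later step is computed once per species
  group_by(Species) %>%
  # mean of each numeric column; non-numeric columns are not summarised
  summarise(across(where(is.numeric), mean), .groups = "drop") %>%
  # long form: one row per (species, variable) pair
  pivot_longer(cols = -Species, names_to = "variable", values_to = "mean") %>%
  # wide form: one column of means per species
  pivot_wider(names_from = Species, values_from = mean) %>%
  # the direction of the subtraction is explicit in the column name
  mutate(setosa_minus_versicolor = setosa - versicolor)

mean_diff
```

Because the columns are selected with `where(is.numeric)` rather than by name, the same pipeline should carry over to a data frame with hundreds of variables, as long as the grouping column is the only one named explicitly.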

Aron Strandberg · Answer

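Here's a way to do it with dplyr, shown on the full iris data set. Note that diff() subtracts the first row from the second, so the result is versicolor minus setosa: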

iris %>%
  filter(Species %in% c("versicolor", "setosa")) %>%
  group_by(Species) %>%
  summarise_all(mean) %>%
  summarise_at(-1, diff)

# A tibble: 1 x 4
  Sepal.Length Sepal.Width Petal.Length Petal.Width
         <dbl>       <dbl>        <dbl>       <dbl>
1        0.930      -0.658         2.80        1.08
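
To get a one-row result in the direction the question asks for (setosa minus versicolor) while touching only numeric columns, the subtraction can be written out inside `across()`. This is a sketch, assuming dplyr 1.0.0 or later; the object name `setosa_minus_versicolor` is illustrative.

```r
library(dplyr)  # assumes dplyr >= 1.0.0 for across() and where()

setosa_minus_versicolor <- iris %>%
  # keep only the two groups being compared
  filter(Species %in% c("setosa", "versicolor")) %>%
  # for each numeric column, subtract the versicolor mean from the setosa mean
  summarise(across(
    where(is.numeric),
    ~ mean(.x[Species == "setosa"]) - mean(.x[Species == "versicolor"])
  ))

setosa_minus_versicolor
```

Naming both groups inside the function means the sign of the result does not depend on the row order or on the factor levels of `Species`.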

Tags: r, data-manipulation, dplyr, mean

Lachlan