I am trying to calculate the difference in mean values between two groups across multiple numeric variables. For instance, if I had the following data:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 versicolor
3 4.7 3.2 1.3 0.2 versicolor
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
I would like to subtract, for instance, the mean values of 'versicolor' from the mean values of 'setosa', and save this as a new dataframe.
The result would look something like this:
Sepal.Length Sepal.Width Petal.Length Petal.Width
1 0.1 0.3 0.08 0.0
I would really like to do this using dplyr, which I am currently learning. Ideally, the solution could also be applied to a much larger dataframe (hundreds of variables) and could select only the numeric variables to apply the function to.
If you could break down the code somewhat line by line that would be really excellent.
Thanks a lot.
There are several ways to do this and to shape the output the way you want. One option is to compute the group means, reshape the result so that each species becomes its own column (pivot to long format, then back to wide), and then simply subtract one column from the other, i.e.
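library(dplyr)
library(tidyr)  # pivot_longer() and pivot_wider() come from tidyr

# iris1 is the five-row example data from the question
iris1 %>%
  group_by(Species) %>%
  summarise_all(list(mean)) %>%                                # mean of each column per species
  pivot_longer(cols = Sepal.Length:Petal.Width) %>%            # long: one row per species/variable
  pivot_wider(names_from = Species, values_from = value) %>%   # wide: one column per species
  mutate(versicolor_setosa = setosa - versicolor)              # setosa mean minus versicolor mean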
which gives,
# A tibble: 4 x 4
name setosa versicolor versicolor_setosa
<chr> <dbl> <dbl> <dbl>
1 Sepal.Length 4.90 4.8 0.1000
2 Sepal.Width 3.4 3.1 0.300
3 Petal.Length 1.43 1.35 0.0833
4 Petal.Width 0.2 0.2 0
Here's a way to do it with dplyr alone, shown on the full built-in iris data set:
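library(dplyr)
iris %>%
  # keep only the two species being compared
  filter(Species %in% c("versicolor", "setosa")) %>%
  group_by(Species) %>%
  # mean of every non-grouping column: one row per species
  summarise_all(mean) %>%
  # drop column 1 (Species); diff() returns row 2 minus row 1,
  # i.e. versicolor minus setosa here
  summarise_at(-1, diff)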
# A tibble: 1 x 4
Sepal.Length Sepal.Width Petal.Length Petal.Width
<dbl> <dbl> <dbl> <dbl>
1 0.930 -0.658 2.80 1.08
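Since the question also asks for something that picks out only the numeric columns and scales to hundreds of variables, here is a minimal sketch of one way that might look. It is not taken from either answer above: it assumes dplyr 1.0.0 or later (for across() and where()), and the data frame iris1 and the result name mean_diff are stand-ins built here from the five rows in the question.

```r
library(dplyr)

# Stand-in for the question's data (an assumption: rebuilt by hand from the five rows shown)
iris1 <- data.frame(
  Sepal.Length = c(5.1, 4.9, 4.7, 4.6, 5.0),
  Sepal.Width  = c(3.5, 3.0, 3.2, 3.1, 3.6),
  Petal.Length = c(1.4, 1.4, 1.3, 1.5, 1.4),
  Petal.Width  = c(0.2, 0.2, 0.2, 0.2, 0.2),
  Species      = c("setosa", "versicolor", "versicolor", "setosa", "setosa")
)

mean_diff <- iris1 %>%
  # keep only the two groups being compared
  filter(Species %in% c("setosa", "versicolor")) %>%
  group_by(Species) %>%
  # mean of every numeric column, one row per species
  summarise(across(where(is.numeric), mean)) %>%
  # setosa mean minus versicolor mean, picked by name rather than by row order
  summarise(across(
    where(is.numeric),
    ~ .x[Species == "setosa"] - .x[Species == "versicolor"]
  ))

print(mean_diff)
```

Selecting the two rows by species name makes the direction of the subtraction explicit, so it does not depend on how the groups happen to be sorted. Non-numeric columns other than Species are dropped by where(is.numeric).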