I am currently learning the tidyverse using Grolemund and Wickham's brilliant book, R4DS. While playing with the code, I realised that I get the same output from slight variations in how I write the arguments to the filter() verb. I would like confirmation of which of the following holds: (a) the output is exactly the same for both variations, (b) the output differs in some way that I have not noticed, or (c) the output is the same but is arrived at differently.
The two variations are as follows:
library(nycflights13)
library(tidyverse)
# Variation 1
flights %>%
  filter(arr_delay >= 120) %>%
  filter(dest == "IAH" | dest == "HOU") %>%
  filter(carrier == "UA" | carrier == "AA" | carrier == "DL") %>%
  filter(month >= 7 & month <= 9)
# Variation 2
flights %>%
  filter(arr_delay >= 120,
         dest == "IAH" | dest == "HOU",
         carrier == "UA" | carrier == "AA" | carrier == "DL",
         month >= 7 & month <= 9)
For both, I get the same tibble output:
# A tibble: 47 x 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
<int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
1 2013 7 1 1310 1057 133 1551 1338 133 UA
2 2013 7 1 1707 1448 139 1943 1742 121 UA
3 2013 7 1 2058 1735 203 2355 2030 205 AA
4 2013 7 2 2001 1735 146 2335 2030 185 AA
5 2013 7 3 2215 1909 186 45 2200 165 UA
6 2013 7 9 1937 1735 122 2240 2030 130 AA
7 2013 7 10 40 1909 331 301 2200 301 UA
8 2013 7 10 1629 1520 69 2048 1754 174 UA
9 2013 7 10 1913 1721 112 2214 2001 133 UA
10 2013 7 17 1657 1446 131 2007 1745 142 UA
# ... with 37 more rows, and 9 more variables: flight <int>, tailnum <chr>, origin <chr>,
# dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Apologies if there is some mistake in the question format. This is my first time posting here.
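The two versions give exactly the same results: filter() combines comma-separated conditions with a logical AND, which is also the effect of applying the filters one after another. We can test this by storing the result of each and comparing them with identical():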
test1 <- flights %>%
  filter(arr_delay >= 120) %>%
  filter(dest == "IAH" | dest == "HOU") %>%
  filter(carrier == "UA" | carrier == "AA" | carrier == "DL") %>%
  filter(month >= 7 & month <= 9)
test2 <- flights %>%
  filter(arr_delay >= 120,
         dest == "IAH" | dest == "HOU",
         carrier == "UA" | carrier == "AA" | carrier == "DL",
         month >= 7 & month <= 9)
identical(test1, test2)
#> [1] TRUE
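As a sketch of why the two forms agree, the following compares chained calls, comma-separated conditions, and a single explicit & expression on a small tibble. The toy data and the names toy, chained, commas and ampersand are made up for illustration and are not part of the flights data; only dplyr is assumed to be installed.

```r
library(dplyr)

# Toy data, invented for illustration. An NA is included because filter()
# drops rows where a condition evaluates to NA.
toy <- tibble(
  delay = c(130, 90, 200, NA, 150),
  dest  = c("IAH", "HOU", "JFK", "IAH", "HOU"),
  month = c(7, 8, 9, 7, 12)
)

# Form 1: one filter() call per condition
chained <- toy %>%
  filter(delay >= 120) %>%
  filter(dest == "IAH" | dest == "HOU") %>%
  filter(month >= 7 & month <= 9)

# Form 2: all conditions in one call, separated by commas
commas <- toy %>%
  filter(delay >= 120,
         dest == "IAH" | dest == "HOU",
         month >= 7 & month <= 9)

# Form 3: the same conditions joined explicitly with &
ampersand <- toy %>%
  filter(delay >= 120 &
           (dest == "IAH" | dest == "HOU") &
           (month >= 7 & month <= 9))

identical(chained, commas)
identical(commas, ampersand)
```

All three forms keep a row only when every condition is TRUE for it, so they should select the same rows in the same order.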
They both benchmark similarly too:
library(microbenchmark)
microbenchmark(
  multi_filter = {
    flights %>%
      filter(arr_delay >= 120) %>%
      filter(dest == "IAH" | dest == "HOU") %>%
      filter(carrier == "UA" | carrier == "AA" | carrier == "DL") %>%
      filter(month >= 7 & month <= 9)
  },
  single_filter = {
    flights %>%
      filter(arr_delay >= 120,
             dest == "IAH" | dest == "HOU",
             carrier == "UA" | carrier == "AA" | carrier == "DL",
             month >= 7 & month <= 9)
  })
#> Unit: milliseconds
#>           expr     min       lq     mean   median       uq     max neval cld
#>   multi_filter 26.7920 27.33855 29.40675 27.70585 28.74525 40.6825   100  a
#>  single_filter 32.0836 32.77430 34.41295 33.26740 33.67100 55.9700   100   b
Calling filter() several times is actually a little faster in this benchmark (a median of about 27.7 ms versus 33.3 ms), possibly because each later filter() only has to evaluate its condition on the rows that survived the earlier ones. However, the difference isn't massive. The flights data frame is large, with over 300,000 rows, so a gap of a few milliseconds on a data frame this big is unlikely to matter in most real-life applications.
In this case, I think it largely comes down to individual preference.
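If preference is the deciding factor, a third spelling may be worth knowing. This is not from the question or the benchmark above, just a hedged sketch: the repeated == comparisons can be written with %in%, and the month range with dplyr's between(), which is inclusive at both ends. The toy tibble and the names verbose and compact are made up for illustration.

```r
library(dplyr)

# Toy data, invented for illustration (not taken from flights).
toy <- tibble(
  dest    = c("IAH", "HOU", "JFK", "IAH"),
  carrier = c("UA", "B6", "AA", "DL"),
  month   = c(7, 8, 9, 10)
)

# Conditions written as in the question
verbose <- toy %>%
  filter(dest == "IAH" | dest == "HOU",
         carrier == "UA" | carrier == "AA" | carrier == "DL",
         month >= 7 & month <= 9)

# Same conditions using %in% and between()
compact <- toy %>%
  filter(dest %in% c("IAH", "HOU"),
         carrier %in% c("UA", "AA", "DL"),
         between(month, 7, 9))  # between() includes both 7 and 9

identical(verbose, compact)
```

On this toy data both versions keep the same rows; whether the shorter form reads better is again a matter of taste.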