I have read many posts on the site about randomly subsetting a large dataset for observations based on date -- for the first, last, or a specific date. However, I have a different challenge that requires me to subsample a large dataset by site AND date. I want to keep all sites in the subsetted dataset, but only include 1 date observation per site.
More specifically, I have a large community-ecology dataset of insect community observations (n=2000) across 4 years. The observations come from ~900 sites, and each site has between 1 and 6 date observations within a year, with no sites repeated between years (which is why previous posts about subsetting a specific date range do not apply here). Subsetting in this particular way is critical because of the type of statistical analysis I am using: including spatial autocorrelation terms in the analysis means that I can only include one observation per site.
So the full dataset looks something like:
Site Date Ladybug
Baumgarten 6/24/2014 2
Baumgarten 8/6/2014 0
Baumgarten 8/20/2014 3
Baumgarten 7/8/2014 0
Baumgarten 7/22/2014 1
Berkevich 7/9/2014 0
Berkevich 7/23/2014 4
Berkevich 8/8/2014 0
Berkevich 8/22/2014 0
Boehm 6/24/2014 2
# dput(data)
dd <- structure(list(Site = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L), .Label = c("Baumgarten", "Berkevich", "Boehm"), class = "factor"), Date = structure(c(1L, 8L, 6L, 4L, 2L, 5L, 3L, 9L, 7L, 1L), .Label = c("6/24/2014", "7/22/2014", "7/23/2014", "7/8/2014", "7/9/2014", "8/20/2014", "8/22/2014", "8/6/2014", "8/8/2014" ), class = "factor"), Ladybug = c(2L, 0L, 3L, 0L, 1L, 0L, 4L, 0L, 0L, 2L)), .Names = c("Site", "Date", "Ladybug"), class = "data.frame", row.names = c(NA, -10L))
And my desired subsetted dataset would look something like:
Site Date Ladybug
Baumgarten 8/20/2014 3
Berkevich 7/9/2014 0
Boehm 6/24/2014 2
I have dates entered in both MM/DD/YYYY and DOY format (since sites don't repeat between years, DOY x site subsetting will still work to ensure no repeating sites), so code that uses either could work.
Any advice would be much appreciated. Thanks.
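Assuming your data is a data.frame named df (the dput example in the question calls it dd), you could use dplyr to group by site and draw one random row per group:
library(dplyr)
df %>%
  group_by(Site) %>%  # one group per site
  sample_n(1)         # randomly keep a single row from each group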
# Source: local data frame [3 x 3]
# Groups: Site [3]
#
# Site Date Ladybug
# (fctr) (fctr) (int)
# 1 Baumgarten 8/20/2014 3
# 2 Berkevich 8/22/2014 0
# 3 Boehm 6/24/2014 2
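In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(). Below is a minimal, reproducible sketch of the same idea, assuming the toy data from the question is rebuilt as a data.frame named dd. The seed value and the object name one_per_site are arbitrary choices for illustration, not part of the original answer.

```r
library(dplyr)

# Toy data mirroring the example in the question
dd <- data.frame(
  Site = c(rep("Baumgarten", 5), rep("Berkevich", 4), "Boehm"),
  Date = c("6/24/2014", "8/6/2014", "8/20/2014", "7/8/2014", "7/22/2014",
           "7/9/2014", "7/23/2014", "8/8/2014", "8/22/2014", "6/24/2014"),
  Ladybug = c(2L, 0L, 3L, 0L, 1L, 0L, 4L, 0L, 0L, 2L)
)

set.seed(42)  # arbitrary seed, only so the random draw can be repeated

one_per_site <- dd %>%
  group_by(Site) %>%        # one group per site
  slice_sample(n = 1) %>%   # draw one random row within each group
  ungroup()                 # drop the grouping before further analysis

one_per_site
```

Setting a seed matters here if the subsample feeds into a published analysis, since otherwise each run keeps a different date per site.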
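Using data.table (assuming your data is in an object named DT), you can use:
library(data.table)
setDT(DT)  # convert DT to a data.table by reference
DT[, .SD[sample(.N, 1)], by = Site]  # pick one random row index per site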
This gives, for example (the rows are drawn at random, so your result will differ):
Site Date Ladybug
1: Baumgarten 8/20/2014 3
2: Berkevich 7/23/2014 4
3: Boehm 6/24/2014 2
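If you would rather avoid extra packages, the same one-row-per-site draw can be sketched in base R. This is an alternative to the answers above rather than anything they proposed; it assumes the question's toy data in a data.frame named dd, and the names idx and sub as well as the seed are arbitrary. It samples a position with sample.int() instead of calling sample() on the row numbers directly, because sample(x, 1) treats a single number x as the range 1:x, which would go wrong for sites with only one observation.

```r
# Toy data mirroring the example in the question
dd <- data.frame(
  Site = c(rep("Baumgarten", 5), rep("Berkevich", 4), "Boehm"),
  Date = c("6/24/2014", "8/6/2014", "8/20/2014", "7/8/2014", "7/22/2014",
           "7/9/2014", "7/23/2014", "8/8/2014", "8/22/2014", "6/24/2014"),
  Ladybug = c(2L, 0L, 3L, 0L, 1L, 0L, 4L, 0L, 0L, 2L)
)

set.seed(1)  # arbitrary seed, only so the random draw can be repeated

# Row numbers grouped by site; drop = TRUE skips unused factor levels
rows_by_site <- split(seq_len(nrow(dd)), dd$Site, drop = TRUE)

# For each site, pick one of its row numbers at random
idx <- vapply(rows_by_site,
              function(i) i[sample.int(length(i), 1)],
              integer(1))

sub <- dd[idx, ]

# Sanity check: every site appears exactly once
stopifnot(all(table(sub$Site) == 1))
sub
```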