Randomly subsetting 1 observation per site and date

Question

I have read many posts on the site about randomly subsetting a large dataset for observations based on date -- for the first, last, or a specific date. However, I have a different challenge that requires me to subsample a large dataset by site AND date. I want to keep all sites in the subsetted dataset, but only include 1 date observation per site.

More specifically, I have a large community ecology dataset of insect community observations (n=2000) across 4 years. They were observed at ~900 sites, but each site has between 1 and 6 date observations within a year, and no site is repeated between years (this is why previous posts about subsetting a specific date range do not apply here). Subsetting in this particular way is critical because of the type of statistical analysis I am using: including spatial autocorrelation terms in the analysis means that I can only include one observation per site.

So the full dataset looks something like:

Site        Date        Ladybug
Baumgarten  6/24/2014   2
Baumgarten  8/6/2014    0
Baumgarten  8/20/2014   3
Baumgarten  7/8/2014    0
Baumgarten  7/22/2014   1
Berkevich   7/9/2014    0
Berkevich   7/23/2014   4
Berkevich   8/8/2014    0
Berkevich   8/22/2014   0
Boehm       6/24/2014   2

# dput(data)
dd <- structure(list(Site = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L), .Label = c("Baumgarten", "Berkevich", "Boehm"), class = "factor"),  Date = structure(c(1L, 8L, 6L, 4L, 2L, 5L, 3L, 9L, 7L, 1L), .Label = c("6/24/2014", "7/22/2014", "7/23/2014", "7/8/2014",  "7/9/2014", "8/20/2014", "8/22/2014", "8/6/2014", "8/8/2014" ), class = "factor"), Ladybug = c(2L, 0L, 3L, 0L, 1L, 0L,  4L, 0L, 0L, 2L)), .Names = c("Site", "Date", "Ladybug"), class = "data.frame", row.names = c(NA,  -10L))

And my desired subsetted dataset would look something like:

Site        Date        Ladybug
Baumgarten  8/20/2014   3
Berkevich   7/9/2014    0
Boehm       6/24/2014   2

I have dates entered in both MM/DD/YYYY and DOY format (since sites don't repeat between years, DOY x site subsetting will still work to ensure no repeating sites), so code that uses either could work.

Any advice would be much appreciated. Thanks.

JasonAizkalns · Accepted Answer

Assuming your data is a data.frame named df (the dput() example in the question calls it dd), you could use dplyr and do the following:

library(dplyr)

df %>%
  group_by(Site) %>%
  sample_n(1)

# Source: local data frame [3 x 3]
# Groups: Site [3]
#  
#         Site      Date Ladybug
#       (fctr)    (fctr)   (int)
# 1 Baumgarten 8/20/2014       3
# 2  Berkevich 8/22/2014       0
# 3      Boehm 6/24/2014       2
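As a side note, sample_n() has been superseded by slice_sample() since dplyr 1.0.0, and the draw changes on every run unless a seed is set. Below is a minimal sketch of the same idea, assuming dplyr >= 1.0.0. The data frame is rebuilt from the question's example with plain character columns instead of factors, and the seed value and the name one_per_site are arbitrary choices for illustration.

```r
library(dplyr)

# Example data from the question, rebuilt by hand (character columns, not factors)
dd <- data.frame(
  Site = c(rep("Baumgarten", 5), rep("Berkevich", 4), "Boehm"),
  Date = c("6/24/2014", "8/6/2014", "8/20/2014", "7/8/2014", "7/22/2014",
           "7/9/2014", "7/23/2014", "8/8/2014", "8/22/2014", "6/24/2014"),
  Ladybug = c(2L, 0L, 3L, 0L, 1L, 0L, 4L, 0L, 0L, 2L),
  stringsAsFactors = FALSE
)

set.seed(42)  # arbitrary seed, only so the same rows are drawn on every run

one_per_site <- dd %>%
  group_by(Site) %>%        # one group per site
  slice_sample(n = 1) %>%   # draw one random row within each group
  ungroup()                 # drop the grouping before further analysis

one_per_site
```

Every site is kept, including sites such as Boehm that have a single observation, because the draw happens within each group.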

Rentrop · Answer

With data.table you can use:

library(data.table)
DT <- as.data.table(dd)  # dd is the data.frame from the question's dput()
DT[, .SD[sample(.N, 1)], by = Site]  # one random row per Site

This gives you, for example (the rows drawn vary between runs):

         Site      Date Ladybug
1: Baumgarten 8/20/2014       3
2:  Berkevich 7/23/2014       4
3:      Boehm 6/24/2014       2
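With ~900 sites, a commonly used variant is to sample row numbers with .I and subset once, which is often reported to be faster than subsetting .SD per group. This is a sketch rather than a benchmark: the seed value and the names idx and one_per_site are arbitrary, and the data are rebuilt by hand from the question's example.

```r
library(data.table)

# Example data from the question, rebuilt by hand
DT <- data.table(
  Site = c(rep("Baumgarten", 5), rep("Berkevich", 4), "Boehm"),
  Date = c("6/24/2014", "8/6/2014", "8/20/2014", "7/8/2014", "7/22/2014",
           "7/9/2014", "7/23/2014", "8/8/2014", "8/22/2014", "6/24/2014"),
  Ladybug = c(2L, 0L, 3L, 0L, 1L, 0L, 4L, 0L, 0L, 2L)
)

set.seed(42)  # arbitrary seed, only so the same rows are drawn on every run

# .I holds the row numbers of the current group within DT.
# sample(.N, 1) picks a position inside the group, and .I[...] maps it
# back to a row number of the full table.
idx <- DT[, .I[sample(.N, 1)], by = Site]$V1

one_per_site <- DT[idx]
one_per_site
```

Note the use of .I[sample(.N, 1)] rather than sample(.I, 1): when a site has only one row (Boehm is row 10 here), sample(10, 1) would draw from 1:10 instead of returning row 10.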
