This has got to be a simple answer. I want to subset my data for testing purposes. I have a data frame where I want to keep all columns of information but reduce the number of observations per individual. I have a unique identifier and about 50 individuals. I want to select only 2 individuals, and only 500 data points from each of those 2.
My data frame is called wloc08. There are 50 unique IDs. I am only taking 2 of those individuals but of those 2, I'd like only 500 data points from each.
subwloc08=subset(wloc08, subset = ID %in% c("F07001","F07005"))
Can I use [ somewhere in this statement?
 reduced= subwloc08$ID[1:500,]
Doesn't work.
If you're only dealing with 2 individuals, you could get away with subsetting each separately and then rbinding each subset:
wloc08F07001 <- wloc08[which(wloc08$ID == "F07001")[1:500], ]
wloc08F07005 <- wloc08[which(wloc08$ID == "F07005")[1:500], ]
reduced <- rbind(wloc08F07001, wloc08F07005)
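One caveat with [1:500]: if an individual has fewer than 500 rows, the index vector is padded with NA and the result picks up all-NA rows. A minimal sketch that uses head() instead, run on a made-up toy data frame (the column x, the row counts, and the cap of 3 rather than 500 are assumptions for illustration only):

```r
# Toy stand-in for wloc08: F07001 has 5 rows, F07005 has only 2
toy <- data.frame(ID = c(rep("F07001", 5), rep("F07005", 2), rep("F07009", 4)),
                  x  = 1:11,
                  stringsAsFactors = FALSE)

n <- 3  # rows to keep per individual (500 in the question)

# head() stops at the end of the vector, so no NA indices are produced
sub1 <- toy[head(which(toy$ID == "F07001"), n), ]
sub2 <- toy[head(which(toy$ID == "F07005"), n), ]
reduced_toy <- rbind(sub1, sub2)
```

Here F07005 contributes its 2 available rows instead of 2 real rows plus an NA row.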
To make this more generalizable, especially if you are dealing with large amounts of data, you might consider looking at the data.table package. Here is an example:
library(data.table)
wloc08DT<-as.data.table(wloc08)  # Create data.table
setkey(wloc08DT, "ID")           # Set a key to subset on
# EDIT: A comment from Matthew Dowle pointed out that by = "ID" isn't necessary
# reduced <- wloc08DT[c("F07001", "F07005"), .SD[1:500], by = "ID"]
reduced <- wloc08DT[c("F07001", "F07005"), .SD[1:500]]
To break down the syntax of the last step:
1. c("F07001", "F07005"): This will subset your data by finding all rows where the key is equal to F07001 or F07005. It will also instigate "by without by" (see ?data.table for details).
2. .SD[1:500]: This will subset the .SD object (the subsetted data.table) by selecting rows 1:500.
3. EDIT: This part was removed thanks to a correction by Matthew Dowle. The "by without by" is initiated by step 1. Formerly: (by = "ID": This tells [.data.table to perform the operation in step 2 for each ID individually, in this case only the IDs that you indicated in step 1.)
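A note for readers on newer data.table releases: as far as I know, from version 1.9.4 onward the implicit "by without by" grouping has to be requested explicitly with by = .EACHI, so the keyed call above without it would take the first 500 rows of the combined subset rather than 500 per ID. A minimal sketch of two forms that group explicitly, on made-up toy data (the column x and the cap of 3 rather than 500 are assumptions for illustration):

```r
library(data.table)

# Toy stand-in for wloc08 (made-up values)
toyDT <- data.table(ID = c(rep("F07001", 5), rep("F07005", 2), rep("F07009", 4)),
                    x  = 1:11)

n <- 3  # 500 in the question

# Filter to the two IDs, then take the first n rows within each ID
reduced_dt <- toyDT[ID %in% c("F07001", "F07005"), head(.SD, n), by = ID]

# Keyed-join form; by = .EACHI evaluates j once per ID listed in i
setkey(toyDT, ID)
reduced_eachi <- toyDT[c("F07001", "F07005"), head(.SD, n), by = .EACHI]
```

Using head(.SD, n) rather than .SD[1:n] also avoids NA rows when an ID has fewer than n observations.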
You could use lapply:
do.call("rbind",
        lapply(c("F07001", "F07005"),
               function(x) wloc08[which(wloc08$ID == x)[1:500], ]))
Your command reduced = subwloc08$ID[1:500,] didn't work because subwloc08$ID is a vector, which has only one dimension. reduced = subwloc08$ID[1:500] would have worked, but it would have returned only the first 500 values of subwloc08$ID, not the whole rows of subwloc08.
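The difference between indexing the ID vector and indexing the data frame can be seen on a small made-up example (the toy data frame and its column x are assumptions, not the real wloc08):

```r
toy <- data.frame(ID = rep(c("F07001", "F07005"), each = 4),
                  x  = 1:8,
                  stringsAsFactors = FALSE)

ids <- toy$ID          # a plain vector: one dimension only
first_ids <- ids[1:3]  # fine: the first three ID values

# The two-index form on a vector raises an error; capture its message
err <- tryCatch(ids[1:3, ], error = function(e) conditionMessage(e))

# Indexing the data frame itself keeps every column
first_rows <- toy[1:3, ]
```

So to keep whole rows, the row index has to go on the data frame (toy[1:3, ]), not on the extracted column.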
If you want to run this command for the first 30 subjects, you could use unique(wloc08$ID)[1:30] instead of c("F07001", "F07005"):
do.call("rbind",
        lapply(unique(wloc08$ID)[1:30],
               function(x) wloc08[which(wloc08$ID == x)[1:500], ]))
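If you want the cap applied to every individual at once, a different base R route (not the lapply approach above) is to number the rows within each ID using ave() and filter on that counter. A minimal sketch on made-up toy data, with a cap of 3 standing in for 500:

```r
toy <- data.frame(ID = c(rep("F07001", 5), rep("F07005", 2), rep("F07009", 4)),
                  x  = 1:11,
                  stringsAsFactors = FALSE)

n <- 3  # rows to keep per individual (500 in the question)

# Running row counter within each ID: 1, 2, 3, ... restarting per individual
row_in_id <- ave(seq_along(toy$ID), toy$ID, FUN = seq_along)

# Keep rows whose within-ID position is at most n
reduced_all <- toy[row_in_id <= n, ]
```

This keeps the original row order and never produces NA rows for individuals with fewer than n observations. To restrict it to particular individuals, combine the condition with toy$ID %in% some vector of IDs.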