I have a data frame containing data about student lateness to various classes. Each row contains data about a late student and his class: date and time of the class, name of the class, class size, number of minutes late, and the gender of the student. In order to get the total percentage of late students for all classes, I need to count the number of rows (late students) and compare that with the total number of students that attended class.
I can't simply sum the class sizes for all of the rows; that would count the students of a given class several times, once for each late student in the class. Instead, I need to count each class size only once for each meeting of the class.
Key: date, class name, students in attendance, gender of tardy student, minutes late.
11/12/10 Stats 30 M 1
11/12/10 Stats 30 M 1
11/12/10 Stats 30 M 1
11/15/10 Stats 40 F 3
11/15/10 Stats 40 F 3
11/15/10 Stats 40 F 3
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
11/16/10 Radar 22 M 2
In this case, there are three different class meetings and 11 late students. How could I make sure each class meeting's class size is only counted once?
If I understand what you want correctly, this is easier to do with the plyr package than with tapply or by, because plyr understands what amounts to a multivariate grouping. For instance:
ddply(df, .(DATE, CLASS), transform, PERCENT_LATE = length(MINUTES.LATE) / CLASS.SIZE)
The argument to length here can be any of the column names. ddply splits your data frame for each combination of DATE and CLASS factor levels. The number of rows in each mini data frame then corresponds to the number of late students in that meeting, since there is one entry per late student. That is where length(any variable) comes in. Dividing it by the class size column gives the fraction of students who were late; multiply by 100 if you want a percentage.
To follow on @Gavin's comment re: the redundant output, using summarise:
df.out <- ddply(df, .(DATE, CLASS), summarise,
  NLATE  = length(DATE),                            # one row per late student
  SIZE   = unique(CLASS.SIZE),                      # class size, once per meeting
  PCLATE = 100 * length(DATE) / unique(CLASS.SIZE)  # percentage late
)
> df.out
DATE CLASS NLATE SIZE PCLATE
1 11/12/10 Stats 3 30 10.00
2 11/15/10 Stats 3 40 7.50
3 11/16/10 Radar 5 22 22.73
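For the overall percentage across all class meetings that the question asks about, a minimal base R sketch follows. It rebuilds the sample data from the question, reuses the column names from the answers above (GENDER is an assumed name, since neither answer shows that column), and assumes that DATE plus CLASS is enough to identify a single meeting. If a class can meet more than once on the same date, the time of the class would need to be part of the key as well.

```r
# Sample data from the question. GENDER is a hypothetical column name;
# the other names follow the answers above.
n <- c(3, 3, 5)  # late students per class meeting
df <- data.frame(
  DATE         = rep(c("11/12/10", "11/15/10", "11/16/10"), times = n),
  CLASS        = rep(c("Stats", "Stats", "Radar"), times = n),
  CLASS.SIZE   = rep(c(30, 40, 22), times = n),
  GENDER       = rep(c("M", "F", "M"), times = n),
  MINUTES.LATE = rep(c(1, 3, 2), times = n)
)

# Keep one row per class meeting so each class size is counted only once.
meetings <- unique(df[, c("DATE", "CLASS", "CLASS.SIZE")])

total_late     <- nrow(df)                  # 11 late students
total_attended <- sum(meetings$CLASS.SIZE)  # 30 + 40 + 22 = 92
overall_pct    <- 100 * total_late / total_attended

round(overall_pct, 2)  # 11.96
```

The same total could presumably be taken from the summarised output above as 100 * sum(df.out$NLATE) / sum(df.out$SIZE), since that data frame already holds one row per meeting.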