Sorry for the semi-convoluted title. Hopefully I can demonstrate what it is I'm getting at here. I have the following data frames (simplified down from my real data):
DF1
ID Year Grade Season Score
A 2016 6 Fall 259
B 2017 7 Fall 264
C 2016 6 Fall 263
D 2018 6 Winter 269
E 2016 8 Spring 277
DF2
Grade Season Score Pctl
6 Fall 261 1
6 Fall 264 2
7 Fall 264 9
8 Spring 267 5
6 Winter 269 6.5
8 Spring 277 3
And I want to get:
DF3
ID Year Grade Season Score Pctl
A 2016 6 Fall 259 1
B 2017 7 Fall 264 9
C 2016 6 Fall 263 1
D 2018 6 Winter 269 6.5
E 2016 8 Spring 277 3
Simply put, given the Season, Grade, and Score in DF1, I want to append the corresponding Pctl from DF2. I understand that in most cases this is as simple as using merge() and providing specified "by" arguments.
However, while the Season and Grade merge seamlessly, as they have exact values to correspond to, this is not always the case for Scores.
First, take for example ID "A". Their score is 259 which falls below the Grade:6, Season:Fall, and Score:261. In this case, I want the pctl to be assigned as 1. Similarly, when the score is above the 99 pctl score (not demonstrated here), it should be assigned as 99.
A second complication is demonstrated by ID "C" where their Score:263, falls between Pctl 1 and 2. That is, it is not a 1:1 mapping. When this occurs, I simply want to round (up or down as long as it is consistent) to the nearest score and provide that Pctl. For the purpose of this demonstration I rounded down and appended Pctl 1.
From my research, is seems as though this might be able to be done through conditional join or with the GenomicRanges Package but the solutions were unclear to me.
A data.table approach:
library(data.table)
setkey(setDT(DF1), Grade, Season, Score)
setkey(setDT(DF2), Grade, Season, Score)
DF3 <- DF2[DF1, roll = "nearest"]
In this case, the Score of 263 will actually get a 2, since this is nearer to 264 than 261. This will be consistent through different scenarios you apply.
The output:
Grade Season Score Pctl ID Year
1: 6 Fall 259 1.0 A 2016
2: 6 Fall 263 2.0 C 2016
3: 6 Winter 269 6.5 D 2018
4: 7 Fall 264 9.0 B 2017
5: 8 Spring 277 3.0 E 2016
However, you also mention the 99th percentile.
Another thing is that there may indeed be scores that are exactly half-way between the two values, like for instance a Score of 428 below where the other table contains only 427 and 429. In this case, the percentile will consistently be matched with the lower one.
Data:
DF1
ID Year Grade Season Score
1: A 2016 6 Fall 259
2: B 2017 7 Fall 264
3: C 2016 6 Fall 263
4: D 2018 6 Winter 269
5: E 2016 8 Spring 277
6: F 2017 7 Spring 550
7: G 2015 6 Fall 428
DF2
Grade Season Score Pctl
1: 6 Fall 261 1.0
2: 6 Fall 264 2.0
3: 7 Fall 264 9.0
4: 8 Spring 267 5.0
5: 6 Winter 269 6.5
6: 8 Spring 277 3.0
7: 7 Spring 550 99.6
8: 6 Fall 427 84.0
9: 6 Fall 429 88.0
Adding the line for the 99th percentile:
library(data.table)
setkey(setDT(DF1), Grade, Season, Score)
setkey(setDT(DF2), Grade, Season, Score)
DF3 <- DF2[DF1, roll = "nearest"][Pctl > 99, Pctl := 99]
As you can see, the Score of 428 has been matched with the Score of 427 in the joining table, and 99.6 has been replaced with 99:
DF3
Grade Season Score Pctl ID Year
1: 6 Fall 259 1.0 A 2016
2: 6 Fall 263 2.0 C 2016
3: 6 Fall 428 84.0 G 2015
4: 6 Winter 269 6.5 D 2018
5: 7 Fall 264 9.0 B 2017
6: 7 Spring 550 99.0 F 2017
7: 8 Spring 277 3.0 E 2016
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With