Merge by 3 Values (Two specified, One in a range)

Question

Sorry for the semi-convoluted title. Hopefully I can demonstrate what it is I'm getting at here. I have the following data frames (simplified down from my real data):

DF1 
ID  Year    Grade   Season  Score
A   2016      6     Fall     259
B   2017      7     Fall     264
C   2016      6     Fall     263
D   2018      6     Winter   269
E   2016      8     Spring   277

DF2
Grade   Season  Score   Pctl
 6       Fall    261    1
 6       Fall    264    2
 7       Fall    264    9
 8      Spring   267    5
 6      Winter   269    6.5
 8      Spring   277    3

And I want to get:

DF3
ID  Year    Grade   Season  Score   Pctl
A   2016    6        Fall    259     1
B   2017    7        Fall    264     9
C   2016    6        Fall    263     1
D   2018    6        Winter  269     6.5
E   2016    8        Spring  277     3

Simply put, given the Season, Grade, and Score in DF1, I want to append the corresponding Pctl from DF2. I understand that in most cases this is as simple as using merge() and providing specified "by" arguments.

However, while the Season and Grade merge seamlessly, as they have exact values to correspond to, this is not always the case for Scores.

First, take for example ID "A". Their score is 259 which falls below the Grade:6, Season:Fall, and Score:261. In this case, I want the pctl to be assigned as 1. Similarly, when the score is above the 99 pctl score (not demonstrated here), it should be assigned as 99.

A second complication is demonstrated by ID "C" where their Score:263, falls between Pctl 1 and 2. That is, it is not a 1:1 mapping. When this occurs, I simply want to round (up or down as long as it is consistent) to the nearest score and provide that Pctl. For the purpose of this demonstration I rounded down and appended Pctl 1.

From my research, is seems as though this might be able to be done through conditional join or with the GenomicRanges Package but the solutions were unclear to me.

arg0naut91 · Accepted Answer

A data.table approach:

library(data.table)

setkey(setDT(DF1), Grade, Season, Score)
setkey(setDT(DF2), Grade, Season, Score)

DF3 <- DF2[DF1, roll = "nearest"]

In this case, the Score of 263 will actually get a 2, since this is nearer to 264 than 261. This will be consistent through different scenarios you apply.

The output:

   Grade Season Score Pctl ID Year
1:     6   Fall   259  1.0  A 2016
2:     6   Fall   263  2.0  C 2016
3:     6 Winter   269  6.5  D 2018
4:     7   Fall   264  9.0  B 2017
5:     8 Spring   277  3.0  E 2016

However, you also mention the 99th percentile.

Another thing is that there may indeed be scores that are exactly half-way between the two values, like for instance a Score of 428 below where the other table contains only 427 and 429. In this case, the percentile will consistently be matched with the lower one.

Data:

DF1

   ID Year Grade Season Score
1:  A 2016     6   Fall   259
2:  B 2017     7   Fall   264
3:  C 2016     6   Fall   263
4:  D 2018     6 Winter   269
5:  E 2016     8 Spring   277
6:  F 2017     7 Spring   550
7:  G 2015     6   Fall   428

DF2

   Grade Season Score Pctl
1:     6   Fall   261  1.0
2:     6   Fall   264  2.0
3:     7   Fall   264  9.0
4:     8 Spring   267  5.0
5:     6 Winter   269  6.5
6:     8 Spring   277  3.0
7:     7 Spring   550 99.6
8:     6   Fall   427 84.0
9:     6   Fall   429 88.0

Adding the line for the 99th percentile:

library(data.table)

setkey(setDT(DF1), Grade, Season, Score)
setkey(setDT(DF2), Grade, Season, Score)

DF3 <- DF2[DF1, roll = "nearest"][Pctl > 99, Pctl := 99]

As you can see, the Score of 428 has been matched with the Score of 427 in the joining table, and 99.6 has been replaced with 99:

DF3

   Grade Season Score Pctl ID Year
1:     6   Fall   259  1.0  A 2016
2:     6   Fall   263  2.0  C 2016
3:     6   Fall   428 84.0  G 2015
4:     6 Winter   269  6.5  D 2018
5:     7   Fall   264  9.0  B 2017
6:     7 Spring   550 99.0  F 2017
7:     8 Spring   277  3.0  E 2016

Merge by 3 Values (Two specified, One in a range)

Tags:

merge

r

Eli Groves

1 Answers

arg0naut91

Recent Activity

Donate For Us

Merge by 3 Values (Two specified, One in a range)

Tags:

merge

r

Eli Groves

1 Answers

arg0naut91

Related questions

Recent Activity

Donate For Us