I have a large data frame called df with some IDs. I have another data frame (id_list) with a set of matching IDs and the associated features for each ID. The IDs are not sequentially ordered in either data frame.
Effectively, I would like to look up each ID from the larger data frame df in id_list and add two columns, namely Display and Type, to the current data frame df.
There are numerous confusing examples. What would be the most effective way of doing this? I tried using match() and %in% and failed miserably.
Here is a reproducible example.
df <- data.frame(Feats = matrix(rnorm(20), nrow = 20, ncol = 5), ID = sample.int(10, 10))
id_list <- data.frame(ID = sample.int(10,10),
                      Display = sample(c('clear', 'blur'), 20, replace = TRUE),
                      Type = sample(c('red', 'green', 'blue', 'indigo', 'yellow'), 20, replace = TRUE))
       Feats.1     Feats.2     Feats.3     Feats.4     Feats.5 ID
1   3.14944573 -0.52285062  3.14944573 -0.52285062  3.14944573  2
2  -0.41096007  0.38256691 -0.41096007  0.38256691 -0.41096007  1
3   0.03629351 -0.02514005  0.03629351 -0.02514005  0.03629351  7
4   0.91257290  1.35590761  0.91257290  1.35590761  0.91257290  5
5  -0.26927311 -2.10213773 -0.26927311 -2.10213773 -0.26927311  3
6   3.14944573 -0.52285062  3.14944573 -0.52285062  3.14944573  4
7  -0.41096007  0.38256691 -0.41096007  0.38256691 -0.41096007 10
8   0.03629351 -0.02514005  0.03629351 -0.02514005  0.03629351  6
9   0.91257290  1.35590761  0.91257290  1.35590761  0.91257290  8
10 -0.26927311 -2.10213773 -0.26927311 -2.10213773 -0.26927311  9
   ID Display   Type
1   6   clear indigo
2   1    blur   blue
3   7   clear    red
4   4   clear    red
5   3    blur    red
6  10   clear yellow
7   2   clear   blue
8   8    blur  green
9   5   clear   blue
10  9   clear  green
The resulting end df should be of size [20 x 8].
You can use merge from base R or left_join from dplyr to do this pretty easily. (data.table also provides a merge method for data.table objects, which maybe someone else can give an answer with.) You probably want a join that doesn't lose any data when an entry in your data frame has no corresponding ID in the lookup table. If you don't need that, you can set all.x = FALSE in merge (or leave it out, since FALSE is the default), or switch from left_join to inner_join. To illustrate, I added a dummy row to the data with an ID that doesn't exist in the lookup table.
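# 10 random values are recycled to fill a 5 x 5 matrix; data.frame() then
# repeats its rows to match the 10 IDs
df <- data.frame(Feats = matrix(rnorm(10), nrow = 5, ncol = 5), ID = sample.int(10, 10))
# Dummy row whose ID (12) does not exist in the lookup table
dummy <- df[1, ]
dummy$ID <- 12
df <- rbind(dummy, df)
# Lookup table: one row per ID with its Display and Type
id_list <- data.frame(ID = sample.int(10,10),
                      Display = sample(c('clear', 'blur'), 10, replace = TRUE),
                      Type = sample(c('red', 'green', 'blue', 'indigo', 'yellow'), 10, replace = TRUE))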
With merge, you either set by to the name of the column that both data frames share, or set by.x and by.y if the key columns have different names. all.x = TRUE keeps every observation in the first data frame even if it doesn't match an observation in the second data frame.
merged1 <- merge(df, id_list, by = "ID", sort = FALSE, all.x = TRUE)
merged1
#>    ID     Feats.1    Feats.2     Feats.3    Feats.4     Feats.5 Display
#> 1  10 -1.44053344  1.0086988 -1.44053344  1.0086988 -1.44053344   clear
#> 2   5  0.99220217 -0.3125813  0.99220217 -0.3125813  0.99220217   clear
#> 3   2  1.03881289  1.1277627  1.03881289  1.1277627  1.03881289   clear
#> 4   7 -0.01678186 -0.1519029 -0.01678186 -0.1519029 -0.01678186   clear
#> 5   4  0.07130125  1.1715833  0.07130125  1.1715833  0.07130125   clear
#> 6   6 -1.44053344  1.0086988 -1.44053344  1.0086988 -1.44053344   clear
#> 7   8  0.99220217 -0.3125813  0.99220217 -0.3125813  0.99220217    blur
#> 8   3  1.03881289  1.1277627  1.03881289  1.1277627  1.03881289   clear
#> 9   1 -0.01678186 -0.1519029 -0.01678186 -0.1519029 -0.01678186   clear
#> 10  9  0.07130125  1.1715833  0.07130125  1.1715833  0.07130125   clear
#> 11 12 -1.44053344  1.0086988 -1.44053344  1.0086988 -1.44053344    <NA>
#>      Type
#> 1  indigo
#> 2  yellow
#> 3    blue
#> 4  indigo
#> 5  yellow
#> 6  indigo
#> 7   green
#> 8     red
#> 9     red
#> 10   blue
#> 11   <NA>
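To make the effect of all.x and of by.x / by.y easier to see, here is a minimal base R sketch on a tiny hand-made example. The names records, lookup, lookup2 and Key are made up for illustration and are not part of the original data.

```r
# Hypothetical tables: 'records' plays the role of df, 'lookup' of id_list.
records <- data.frame(ID = c(3, 1, 12, 2), value = c(0.5, -1.2, 0.3, 2.1))
lookup  <- data.frame(ID = c(1, 2, 3),
                      Display = c("clear", "blur", "clear"),
                      Type = c("red", "green", "blue"))

# Left join: ID 12 has no match, so it is kept with NA in Display and Type.
left <- merge(records, lookup, by = "ID", all.x = TRUE)

# Inner join (all.x = FALSE is the default): the ID 12 row is dropped.
inner <- merge(records, lookup, by = "ID")

nrow(left)   # 4
nrow(inner)  # 3

# If the key columns are named differently, use by.x and by.y instead of by.
lookup2 <- lookup
names(lookup2)[names(lookup2) == "ID"] <- "Key"  # hypothetical column name
left2 <- merge(records, lookup2, by.x = "ID", by.y = "Key", all.x = TRUE)
names(left2)  # the key column keeps its name from the first data frame
```

Comparing nrow() before and after the merge is a cheap way to notice rows that were dropped, or duplicated because the lookup table has repeated IDs.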
dplyr::left_join keeps all observations from the first data frame and merges in any matching ones from the second.
joined <- dplyr::left_join(df, id_list, by = "ID")
head(joined)
#>       Feats.1    Feats.2     Feats.3    Feats.4     Feats.5 ID Display
#> 1 -1.44053344  1.0086988 -1.44053344  1.0086988 -1.44053344 12    <NA>
#> 2 -1.44053344  1.0086988 -1.44053344  1.0086988 -1.44053344 10   clear
#> 3  0.99220217 -0.3125813  0.99220217 -0.3125813  0.99220217  5   clear
#> 4  1.03881289  1.1277627  1.03881289  1.1277627  1.03881289  2   clear
#> 5 -0.01678186 -0.1519029 -0.01678186 -0.1519029 -0.01678186  7   clear
#> 6  0.07130125  1.1715833  0.07130125  1.1715833  0.07130125  4   clear
#>     Type
#> 1   <NA>
#> 2 indigo
#> 3 yellow
#> 4   blue
#> 5 indigo
#> 6 yellow
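If you would rather drop the unmatched rows, or first find out which IDs are missing from the lookup table, the following is a small sketch with dplyr's inner_join and anti_join. It is separate from the output above and uses made-up tables (records and lookup are illustrative names, not the original data).

```r
library(dplyr)

# Hypothetical tables: 'records' plays the role of df, 'lookup' of id_list.
records <- data.frame(ID = c(3, 1, 12, 2), value = c(0.5, -1.2, 0.3, 2.1))
lookup  <- data.frame(ID = c(1, 2, 3),
                      Display = c("clear", "blur", "clear"),
                      Type = c("red", "green", "blue"))

# Rows of 'records' whose ID does not appear in 'lookup' (here: ID 12).
unmatched <- anti_join(records, lookup, by = "ID")

# inner_join keeps only the rows that have a match in 'lookup'.
matched <- inner_join(records, lookup, by = "ID")

unmatched$ID   # 12
nrow(matched)  # 3
```

Running anti_join first shows exactly which IDs a switch from left_join to inner_join would discard.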
Created on 2018-07-13 by the reprex package (v0.2.0).
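Since the question mentions match(), here is a minimal sketch of how the same lookup could be done with it in base R. This is an alternative to the joins above, not the method used in this answer, and records and lookup are made-up illustrative tables. match() returns, for each ID in the first vector, its position in the second (or NA), so the row order and row count of the original data frame are left untouched.

```r
# Hypothetical tables: 'records' plays the role of df, 'lookup' of id_list.
records <- data.frame(ID = c(3, 1, 12, 2), value = c(0.5, -1.2, 0.3, 2.1))
lookup  <- data.frame(ID = c(1, 2, 3),
                      Display = c("clear", "blur", "clear"),
                      Type = c("red", "green", "blue"))

# match() only uses the first occurrence of each ID, so check that the
# lookup table has no repeated IDs before relying on it.
stopifnot(anyDuplicated(lookup$ID) == 0)

# Position of every records$ID in lookup$ID; NA where there is no match.
idx <- match(records$ID, lookup$ID)

# Index the lookup columns by position; unmatched rows become NA.
records$Display <- lookup$Display[idx]
records$Type    <- lookup$Type[idx]

records
```

%in% on its own only tells you whether an ID is present (TRUE or FALSE), which is why it is useful for filtering but not for pulling the Display and Type values across.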