I am working with R and have a data frame that includes names and corresponding office addresses. But the addresses are not perfectly matched with names and I have to replace the wrong addresses based on the most frequent value. Here is an example:
ID Clients ADDRESS FIRST_NAME LAST_NAME
1 A Cobben Cui Z
2 B Beerstaat Cui Z
3 C River Cui Z
4 D Cobben Cui Z
5 E Cobben Cui Z
6 F Cobben Cui Z
6 F Hart Mike G
7 H Hart Mike G
8 I Sparren Mike G
And I would like a data frame that looks like this:
ID Clients ADDRESS FIRST_NAME LAST_NAME
1 A Cobben Cui Z
2 B Cobben Cui Z
3 C Cobben Cui Z
4 D Cobben Cui Z
5 E Cobben Cui Z
6 F Cobben Cui Z
6 F Hart Mike G
7 H Hart Mike G
8 I Hart Mike G
I think it can be done by creating a frequency table and I was wondering whether there is a less complicated method.
Thanks a lot!
A dplyr approach using add_count and which.max may look like so:
library(dplyr)
d %>%
add_count(LAST_NAME, ADDRESS) %>%
group_by(LAST_NAME) %>%
mutate(ADDRESS = ADDRESS[which.max(n)]) %>%
ungroup() %>%
select(-n)
#> # A tibble: 9 × 5
#> ID Clients ADDRESS FIRST_NAME LAST_NAME
#> <int> <chr> <chr> <chr> <chr>
#> 1 1 A Cobben Cui Z
#> 2 2 B Cobben Cui Z
#> 3 3 C Cobben Cui Z
#> 4 4 D Cobben Cui Z
#> 5 5 E Cobben Cui Z
#> 6 6 F Cobben Cui Z
#> 7 6 F Hart Mike G
#> 8 7 H Hart Mike G
#> 9 8 I Hart Mike G
DATA
d <- structure(list(ID = c(1L, 2L, 3L, 4L, 5L, 6L, 6L, 7L, 8L), Clients = c(
"A",
"B", "C", "D", "E", "F", "F", "H", "I"
), ADDRESS = c(
"Cobben",
"Beerstaat", "River", "Cobben", "Cobben", "Cobben", "Hart", "Hart",
"Sparren"
), FIRST_NAME = c(
"Cui", "Cui", "Cui", "Cui", "Cui",
"Cui", "Mike", "Mike", "Mike"
), LAST_NAME = c(
"Z", "Z", "Z",
"Z", "Z", "Z", "G", "G", "G"
)), class = "data.frame", row.names = c(
NA,
-9L
))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With