
Tidyverse: Replacing entire strings based on partial matches

I'm looking to replace entire string entries within data based on partial matches using functions in the stringr package.

The only method I've tried has been replacing exact matches using str_replace_all() but this becomes tedious and unwieldy when there are dozens of variations to correct for. I'm looking to replace based on partial matches. In my reprex below, I replace variants of "Spaniard" and "Colombian" by direct specification. However, I would love to perform those replacements based on something like meeting the condition that "Spa" or "Col" exists in the words.

library(tidyverse)
library(stringr)

data <- c(
  "Spanish",
  "SPANIARD",
  "Spainiard",
  "Colombian",
  "Columbian",
  "Ecuador",
  "Equador",
  "Ecuadorian",
  "VENEZUELAN"
)

str_replace_all(data,
                c(
                  "Spanish" = "Spaniard",
                  "SPANIARD" = "Spaniard",
                  "Spainiard" = "Spaniard",
                  "Columbian" = "Colombian"
                ))
#> [1] "Spaniard"   "Spaniard"   "Spaniard"   "Colombian"  "Colombian" 
#> [6] "Ecuador"    "Equador"    "Ecuadorian" "VENEZUELAN"

Created on 2019-05-21 by the reprex package (v0.2.1)

So str_replace_all() works as advertised, but I'm looking for a way to streamline this process in the tidyverse. Any help is much appreciated.

Chris A. asked Oct 23 '25 06:10


2 Answers

I prefer to use a string-distance measure (e.g., Jaro-Winkler distance), but such measures have their drawbacks. Be wary of what partial matching might change: before applying it, check which values are actually present in the data. That said, you can do what you outlined in the tidyverse using case_when() with startsWith() or grepl():

tibble(data = data) %>%
  mutate(
    v1 = tolower(data),
    new_name = case_when(
      startsWith(v1, "spa") ~ "Spaniard",
      startsWith(v1, "col") ~ "Colombian",
      startsWith(v1, "eq") | startsWith(v1, "ec") ~ "Ecuadorian",
      startsWith(v1, "ven") ~ "Venezuelan",
      TRUE ~ as.character(data)))

# A tibble: 9 x 3
  data       v1         new_name  
  <chr>      <chr>      <chr>     
1 Spanish    spanish    Spaniard  
2 SPANIARD   spaniard   Spaniard  
3 Spainiard  spainiard  Spaniard  
4 Colombian  colombian  Colombian 
5 Columbian  columbian  Colombian 
6 Ecuador    ecuador    Ecuadorian
7 Equador    equador    Ecuadorian
8 Ecuadorian ecuadorian Ecuadorian
9 VENEZUELAN venezuelan Venezuelan
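
The grepl() route mentioned above is not shown in the answer. A minimal sketch of the same idea, using stringr's str_detect() with case-insensitive patterns so that no lowercased helper column is needed, might look like this. The prefixes and the replacement labels are illustrative assumptions, not part of the original answer.

```r
library(dplyr)
library(stringr)

data <- c("Spanish", "SPANIARD", "Spainiard", "Colombian", "Columbian",
          "Ecuador", "Equador", "Ecuadorian", "VENEZUELAN")

# "^" anchors each pattern at the start of the string;
# regex(ignore_case = TRUE) makes the match case-insensitive.
cleaned <- case_when(
  str_detect(data, regex("^spa", ignore_case = TRUE)) ~ "Spaniard",
  str_detect(data, regex("^col", ignore_case = TRUE)) ~ "Colombian",
  str_detect(data, regex("^e[cq]u", ignore_case = TRUE)) ~ "Ecuadorian",
  str_detect(data, regex("^ven", ignore_case = TRUE)) ~ "Venezuelan",
  TRUE ~ data  # anything unmatched is kept as-is
)
cleaned
```

Dropping the "^" anchor would match the fragment anywhere in the string, which is broader and more likely to catch unintended values.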

To see which distinct values are present before choosing patterns, you can count them (this is one of several ways to do it):

tibble(data = data) %>%
  arrange(data) %>%
  count(tolower(data)) 
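
Another way to check what a prefix rule would sweep up is to group the values by their first few letters. The following is only a sketch: the three-letter prefix length and the names prefix_groups, prefix, and variants are assumptions made for illustration.

```r
library(dplyr)
library(stringr)

data <- c("Spanish", "SPANIARD", "Spainiard", "Colombian", "Columbian",
          "Ecuador", "Equador", "Ecuadorian", "VENEZUELAN")

# For each lowercased three-letter prefix, list the original variants
# that a startsWith()-style rule on that prefix would capture.
prefix_groups <- tibble(data = data) %>%
  mutate(prefix = str_sub(str_to_lower(data), 1, 3)) %>%
  group_by(prefix) %>%
  summarise(
    n = n(),
    variants = str_c(sort(unique(data)), collapse = ", ")
  )
prefix_groups
```

With this data, "Ecuador" and "Equador" land in different groups ("ecu" and "equ"), which is why the case_when() rule for that country needs two prefixes.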
Andrew answered Oct 24 '25 19:10


One option is to use a string-distance method for partial matching:

vals <- c("Spaniard", "Ecuador", "Colombian", "Venezuelan")
library(stringdist)
vals[amatch(tolower(data), tolower(vals), maxDist = 5)]
#[1] "Spaniard"   "Spaniard"   "Spaniard"   "Colombian"  "Colombian"  
#[6] "Ecuador"    "Ecuador"    "Ecuador"    "Venezuelan"
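
Before settling on a distance method and a maxDist threshold, it can help to look at the distances themselves. The sketch below uses the Jaro-Winkler method mentioned in the other answer (method = "jw" in stringdist, where distances fall between 0 and 1). The canonical values, the prefix weight p = 0.1, and the names d and closest are assumptions for illustration.

```r
library(stringdist)

data <- c("Spanish", "SPANIARD", "Spainiard", "Colombian", "Columbian",
          "Ecuador", "Equador", "Ecuadorian", "VENEZUELAN")
vals <- c("Spaniard", "Ecuador", "Colombian", "Venezuelan")

# Pairwise Jaro-Winkler distances: rows are the raw values,
# columns are the canonical values.
d <- stringdistmatrix(tolower(data), tolower(vals), method = "jw", p = 0.1)
dimnames(d) <- list(data, vals)
round(d, 2)

# Canonical value with the smallest distance for each raw value.
closest <- vals[apply(d, 1, which.min)]
closest
```

Inspecting the matrix shows how far the nearest candidate really is, which is a safer basis for picking maxDist than trial and error.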

The amatch() call can also be used in a tidyverse workflow:

library(tidyverse)
tibble(v1 = data) %>%
    mutate(v1 = vals[amatch(tolower(v1), tolower(vals), maxDist = 5)])
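
One thing to keep in mind is that amatch() returns NA when no candidate lies within maxDist, so indexing vals with its result turns unmatched entries into NA. A minimal sketch that keeps the original value in that case could look like this. The stricter maxDist = 2 and the canonical values are assumptions chosen to show the fallback.

```r
library(stringdist)

data <- c("Spanish", "SPANIARD", "Spainiard", "Colombian", "Columbian",
          "Ecuador", "Equador", "Ecuadorian", "VENEZUELAN")
vals <- c("Spaniard", "Ecuador", "Colombian", "Venezuelan")

# Index of the closest canonical value, or NA if none is within maxDist
# (default method is "osa", an edit-style distance).
idx <- amatch(tolower(data), tolower(vals), maxDist = 2)

# Fall back to the original value when there is no match.
cleaned <- ifelse(is.na(idx), data, vals[idx])
cleaned
```

With this threshold, "Spanish" is three edits away from "spaniard" and is therefore left unchanged, while "Spainiard" and "Columbian" are still corrected. A large maxDist such as 5 matches more variants but also raises the risk of mapping an unrelated value onto a canonical one.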
akrun answered Oct 24 '25 21:10