
Create column based on matched characters from another column and set all remaining unmatched characters to specified value

I have a dataframe with the following structure:

dat <- tribble(
  ~"fragment", ~"sentence",  
  "Hello", "Hello world", 
  "test",  "This is a test",
)
> dat
  fragment sentence       
  <chr>    <chr>          
1 Hello    Hello world    
2 test     This is a test

I would like to generate a new column (output) that is the same as sentence, but the characters not in fragment are set to a specified character (e.g., "x").

The desired output looks like this:

  fragment sentence            output        
  <chr>    <chr>               <chr>         
1 Hello    Hello world         Hello xxxxx   
2 test     This test is a test xxxx test xx x xxxx

Importantly, the fragment should be matched exactly and only once. For example, the fragment "all" in the sentence "all ball" would result in "all xxxx". In this same scenario, the fragment "al" would not match anything in "all ball".

EDIT (forgot to add my attempt)

I believe a combination of mutate and str_detect or str_replace_all could get me there, but I've had no luck so far.

dat %>% 
  mutate(output = str_replace_all(sentence, fragment, "x"))
babylinguist asked Dec 14 '25 13:12


2 Answers

With a little bit of regex, you can match everything except the word in fragment and spaces, and replace those characters with "x".

dat %>% 
  mutate(output = mapply(\(x, y) gsub(paste0("(", x, "| )(*SKIP)(*FAIL)|."), "x", y, perl = TRUE),
                         fragment, sentence))

  fragment sentence              output                    
1 Hello    Hello world           Hello xxxxx          
2 test     This is a test        xxxx xx x test       
3 another  This is a second test xxxx xx x xxxxxx xxxx
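As a rough illustration of how the pattern works: when the first alternative (the fragment or a space) matches, (*SKIP)(*FAIL) discards that match and restarts the search after it, so only the remaining characters reach the final "." and get replaced. The sketch below wraps this in a hypothetical helper, mask_outside (not part of the answer), and additionally assumes the fragment should be taken literally, which it does by quoting it with \Q...\E. That quoting would break for a fragment that itself contains "\E".

```r
# Hypothetical helper: keep `frag` and spaces, replace every other
# character with "x". \\Q...\\E makes PCRE read the fragment literally
# (assumption: the fragment does not contain "\\E").
mask_outside <- function(sentence, frag) {
  pattern <- paste0("(\\Q", frag, "\\E| )(*SKIP)(*FAIL)|.")
  gsub(pattern, "x", sentence, perl = TRUE)
}

mask_outside("Hello world", "Hello")
#> [1] "Hello xxxxx"

# Without the quoting, "a+b" would be read as "one or more a, then b"
# and would not match the literal text "a+b".
mask_outside("a+b equals c", "a+b")
#> [1] "a+b xxxxxx x"
```

Inside mutate(), the helper could be applied row by row with mapply(mask_outside, sentence, fragment), in the same way as the anonymous function above.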

If a match on any value of dat$fragment should be kept, whichever row it appears in, collapse all fragments into one alternation pattern:

pattern = paste0(dat$fragment, collapse = "|")
dat %>% 
  mutate(output = gsub(paste0("(", pattern, "| )(*SKIP)(*FAIL)|."), "x", sentence, perl = TRUE))

  fragment sentence              output                      
1 Hello    Hello world           Hello xxxxx          
2 test     This is a test        xxxx xx x test       
3 another  This is a second test xxxx xx x xxxxxx test

Data (with an additional row):

dat <- tribble(
  ~"fragment", ~"sentence",  
  "Hello", "Hello world", 
  "test",  "This is a test",
  "another", "This is a second test",
)

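Edit: to match the fragment only as a whole word (using \b) and keep just its first occurrence (see output2):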

dat %>% 
  mutate(output = mapply(\(x, y) gsub(paste0("(\\b", x, "\\b| )(*SKIP)(*FAIL)|."), "x", y, perl = TRUE),
                         fragment, sentence), 
         output2 = str_locate_all(output, fragment) %>% 
           mapply(FUN = \(x, y, z){
             if(nrow(x) > 1){
               str_sub(y, x[-1, 1], x[-1, 2]) <- strrep("x", str_length(z))
               y
             } else 
               y
           }, y = output, z = fragment))

# A tibble: 5 × 4
  fragment sentence              output                output2              
  <chr>    <chr>                 <chr>                 <chr>                
1 Hello    Hello world           Hello xxxxx           Hello xxxxx          
2 test     This is a test        xxxx xx x test        xxxx xx x test       
3 another  This is a second test xxxx xx x xxxxxx xxxx xxxx xx x xxxxxx xxxx
4 test     test test             test test             test xxxx            
5 all      all ball              all xxxx              all xxxx             

Data for edit:

dat <- tribble(
  ~"fragment", ~"sentence",  
  "Hello", "Hello world", 
  "test",  "This is a test",
  "another", "This is a second test",
  "test", "test test",
  "all", "all ball",
)
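A possible base R alternative for the "whole word, first occurrence only" requirement, offered as a sketch and not as part of the answer above: mask every non-space character first, then copy the first whole-word match back in. The helper name mask_except_first is hypothetical, and the sketch assumes a single-character mask and a fragment that starts and ends with a word character (otherwise \b will not behave as intended).

```r
# Hypothetical helper: mask everything except spaces, then restore the
# first whole-word occurrence of `fragment` (if there is one).
mask_except_first <- function(sentence, fragment, mask = "x") {
  # Replace every non-space character with the mask character
  out <- gsub("[^ ]", mask, sentence)
  # Locate the first literal, whole-word match; regexpr() returns -1 if none
  hit <- regexpr(paste0("\\b\\Q", fragment, "\\E\\b"), sentence, perl = TRUE)
  start <- as.integer(hit)
  if (start > 0) {
    stop_at <- start + attr(hit, "match.length") - 1L
    substr(out, start, stop_at) <- fragment
  }
  out
}

sentences <- c("Hello world", "test test test", "all ball", "all ball")
fragments <- c("Hello", "test", "all", "al")
mapply(mask_except_first, sentences, fragments, USE.NAMES = FALSE)
#> [1] "Hello xxxxx"    "test xxxx xxxx" "all xxxx"       "xxx xxxx"
```

Because regexpr() only ever reports the first match, this handles any number of repeated fragments without a second pass over the output.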
Maël answered Dec 16 '25 04:12


Here is one approach using dplyr::rowwise(): in each row, a custom function is applied to the sentence and masks every word that is not equal to the fragment. It might not be the most performant approach, since it nests two loops (one from rowwise() and another from map() over the words), so there may be better alternatives.

library(stringr)
library(dplyr)
library(purrr)

dat <- tribble(
  ~"fragment", ~"sentence",  
  "Hello", "Hello world", 
  "test",  "This is a test",
)

xxx_string <- function(sent, frag) {
  
  map(unlist(strsplit(sent, " ")),
    ~ {
      if(.x != frag) {
        sub(".*", paste(rep("x", nchar(.x)), collapse = ""), .x)
      } else .x
    }) %>% 
  paste(., collapse = " ")
  
}

dat %>% 
  rowwise() %>% 
  mutate(output = xxx_string(sentence, fragment))

#> # A tibble: 2 x 3
#> # Rowwise: 
#>   fragment sentence       output        
#>   <chr>    <chr>          <chr>         
#> 1 Hello    Hello world    Hello xxxxx   
#> 2 test     This is a test xxxx xx x test

Created on 2023-02-14 by the reprex package (v2.0.1)
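One way the inner loop over words might be avoided, sketched here in base R under the same assumption that words are separated by single spaces: strrep() and nchar() are vectorised, so all non-matching words can be masked in one step, and mapply() walks the two columns in parallel in place of rowwise(). The helper name mask_words is hypothetical. Like the function above, it keeps every word equal to the fragment, not only the first.

```r
# Hypothetical helper: mask every word that is not exactly equal to
# `frag`, using vectorised base R calls (no explicit loop over words).
mask_words <- function(sent, frag) {
  words <- strsplit(sent, " ", fixed = TRUE)[[1]]
  other <- words != frag
  words[other] <- strrep("x", nchar(words[other]))
  paste(words, collapse = " ")
}

dat <- data.frame(
  fragment = c("Hello", "test"),
  sentence = c("Hello world", "This is a test"),
  stringsAsFactors = FALSE
)

# mapply() pairs each sentence with its fragment, so rowwise() is not needed
dat$output <- mapply(mask_words, dat$sentence, dat$fragment, USE.NAMES = FALSE)
dat$output
#> [1] "Hello xxxxx"    "xxxx xx x test"
```

In a dplyr pipeline the same call could sit inside mutate(), for example mutate(output = mapply(mask_words, sentence, fragment, USE.NAMES = FALSE)).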

TimTeaFan answered Dec 16 '25 05:12


