I have a dataframe with the following structure:
dat <- tribble(
~"fragment", ~"sentence",
"Hello", "Hello world",
"test", "This is a test",
)
> dat
fragment sentence
<chr> <chr>
1 Hello Hello world
2 test This is a test
I would like to generate a new column (output) that is the same as sentence, but the characters not in fragment are set to a specified character (e.g., "x").
The desired output looks like this:
fragment sentence output
<chr> <chr> <chr>
1 Hello Hello world Hello xxxxx
2 test This is a test xxxx xx x test
Importantly, the fragment should be matched exactly and only once. For example, the fragment "all" in the sentence "all ball" would result in "all xxxx". In this same scenario, the fragment "al" would not match anything in "all ball".
EDIT (forgot to add my attempt)
I believe a combination of mutate and str_detect or str_replace_all could get me there, but I've had no luck so far.
dat %>%
mutate(output = str_replace_all(sentence, fragment, "x"))
With a little bit of regex, you can match everything except the word in fragment and spaces, and replace those characters with "x".
dat %>%
mutate(output = mapply(\(x, y) gsub(paste0("(", x, "| )(*SKIP)(*FAIL)|."), "x", y, perl = TRUE),
fragment, sentence))
fragment sentence output
1 Hello Hello world Hello xxxxx
2 test This is a test xxxx xx x test
3 another This is a second test xxxx xx x xxxxxx xxxx
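To see what the pattern does on a single string, here is a minimal base R sketch. The helper name mask_except is hypothetical, and it assumes the fragment contains no regex metacharacters, since it is pasted into the pattern unchanged.

```r
# Hypothetical helper: keep `frag` and spaces, replace every other character with "x".
# Assumes `frag` contains no regex metacharacters (it is pasted into the pattern as-is).
mask_except <- function(sentence, frag) {
  # "(frag| )(*SKIP)(*FAIL)" consumes the fragment or a space and then discards
  # that match, so only the remaining characters reach the "." alternative.
  pattern <- paste0("(", frag, "| )(*SKIP)(*FAIL)|.")
  gsub(pattern, "x", sentence, perl = TRUE)
}

mask_except("Hello world", "Hello")    # "Hello xxxxx"
mask_except("This is a test", "test")  # "xxxx xx x test"
mask_except("all ball", "all")         # "all xall": the fragment is also kept inside "ball"
```

The last call shows why this version alone does not meet the "matched exactly and only once" requirement: the fragment is protected wherever it occurs, including inside longer words.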
If every value in dat$fragment should be kept wherever it appears, in any sentence, collapse them into a single pattern:
pattern = paste0(dat$fragment, collapse = "|")
dat %>%
mutate(output = gsub(paste0("(", pattern, "| )(*SKIP)(*FAIL)|."), "x", sentence, perl = TRUE))
fragment sentence output
1 Hello Hello world Hello xxxxx
2 test This is a test xxxx xx x test
3 another This is a second test xxxx xx x xxxxxx test
Data (with an additional row):
dat <- tribble(
~"fragment", ~"sentence",
"Hello", "Hello world",
"test", "This is a test",
"another", "This is a second test",
"test", "test test",
"all", "all ball",
)
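Edit: to match the fragment only as a whole word (output) and then keep only its first occurrence (output2):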
dat %>%
mutate(output = mapply(\(x, y) gsub(paste0("(\\b", x, "\\b| )(*SKIP)(*FAIL)|."), "x", y, perl = TRUE),
fragment, sentence),
output2 = str_locate_all(output, fragment) %>%
mapply(FUN = \(x, y, z){
if(nrow(x) > 1){
str_sub(y, x[-1, 1], x[-1, 2]) <- strrep("x", str_length(z))
y
} else
y
}, y = output, z = fragment))
# A tibble: 5 × 4
fragment sentence output output2
<chr> <chr> <chr> <chr>
1 Hello Hello world Hello xxxxx Hello xxxxx
2 test This is a test xxxx xx x test xxxx xx x test
3 another This is a second test xxxx xx x xxxxxx xxxx xxxx xx x xxxxxx xxxx
4 test test test test test test xxxx
5 all all ball all xxxx all xxxx
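The same result can possibly be reached in one pass by masking everything first and then copying the first whole-word match back in. This is only a sketch in base R: the helper name mask_except_first is hypothetical, and it assumes the fragment consists of word characters with no regex metacharacters.

```r
# Hypothetical helper: keep only the first whole-word occurrence of `frag`.
# Assumes `frag` is made of word characters with no regex metacharacters.
mask_except_first <- function(sentence, frag) {
  # Mask every non-space character first.
  masked <- gsub("[^ ]", "x", sentence)
  # Locate the first whole-word match of the fragment in the original sentence.
  hit <- regexpr(paste0("\\b", frag, "\\b"), sentence, perl = TRUE)
  start <- as.integer(hit)
  len <- attr(hit, "match.length")
  if (start > 0) {
    # Copy the fragment back over its original position.
    substr(masked, start, start + len - 1) <- frag
  }
  masked
}

mask_except_first("test test", "test")  # "test xxxx"
mask_except_first("all ball", "all")    # "all xxxx"
mask_except_first("all ball", "al")     # "xxx xxxx" (no whole-word match)
```

Because the helper handles one string at a time, it would need something like mapply(mask_except_first, sentence, fragment, USE.NAMES = FALSE) inside mutate() to run over a whole column.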
Data for edit:
dat <- tribble(
~"fragment", ~"sentence",
"Hello", "Hello world",
"test", "This is a test",
"another", "This is a second test",
"test", "test test",
"all", "all ball",
)
Here is one approach using dplyr::rowwise(), applying a custom function to the string in each row. It might not be the most performant approach, since it loops twice (once over the rows via rowwise() and once over the words via map()), so there might be better alternatives.
library(stringr)
library(dplyr)
library(purrr)
dat <- tribble(
~"fragment", ~"sentence",
"Hello", "Hello world",
"test", "This is a test",
)
xxx_string <- function(sent, frag) {
map(unlist(strsplit(sent, " ")),
~ {
if(.x != frag) {
sub(".*", paste(rep("x", nchar(.x)), collapse = ""), .x)
} else .x
}) %>%
paste(., collapse = " ")
}
dat %>%
rowwise() %>%
mutate(output = xxx_string(sentence, fragment))
#> # A tibble: 2 x 3
#> # Rowwise:
#> fragment sentence output
#> <chr> <chr> <chr>
#> 1 Hello Hello world Hello xxxxx
#> 2 test This is a test xxxx xx x test
Created on 2023-02-14 by the reprex package (v2.0.1)
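As a possible alternative without rowwise(), the word-by-word idea can be written as one vectorised helper in base R. This is a sketch, not a benchmarked replacement: the name mask_words is hypothetical, and it assumes words are separated by single spaces as in the example data. Unlike xxx_string() above, it keeps only the first word equal to the fragment.

```r
# Hypothetical helper: word-by-word masking without rowwise().
# Assumes words are separated by single spaces, as in the example data.
mask_words <- function(sentences, fragments) {
  mapply(function(sent, frag) {
    words <- strsplit(sent, " ", fixed = TRUE)[[1]]
    # Keep only the first word that equals the fragment exactly.
    keep <- seq_along(words) == match(frag, words, nomatch = 0L)
    # Replace every other word with "x" repeated to the same length.
    words[!keep] <- strrep("x", nchar(words[!keep]))
    paste(words, collapse = " ")
  }, sentences, fragments, USE.NAMES = FALSE)
}

mask_words(c("This is a test", "test test", "all ball"),
           c("test", "test", "all"))
# "xxxx xx x test" "test xxxx" "all xxxx"
```

Since it takes whole columns, it can be called directly as mutate(output = mask_words(sentence, fragment)).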