I have a tibble:
df <- tibble(x = c('a', 'ab', 'abc', 'abcd', 'abd', 'efg'))
I want to remove rows that are substrings of other rows, resulting in:
result <- tibble(x = c('abcd', 'abd', 'efg'))
The solution must be quite efficient as there are ~1M rows of text.
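str_extract(df$x, "foo") == "foo" tests, element by element, whether "foo" occurs in df$x; summing the result counts how many elements contain it. For a value x taken from the column itself, that count is always at least 1, because x is a substring of itself. If the count is higher than 1, x is also a substring of another element, so we drop those rows using filter(!). Note that str_extract() treats its pattern as a regular expression, so values containing characters such as . or ( are not matched literally.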
library(tidyverse)
df <- tibble(x = c('a', 'ab', 'abc', 'abcd', 'abd', 'efg'))
df %>% filter(! (x %>% map_lgl(~ sum(str_extract(df$x, .x) == .x, na.rm = TRUE) > 1)))
#> # A tibble: 3 x 1
#> x
#> <chr>
#> 1 abcd
#> 2 abd
#> 3 efg
Created on 2022-02-18 by the reprex package (v2.0.0)
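The same counting idea can be sketched with str_detect() and fixed(), which matches each value literally instead of as a regular expression. This is only a variant under stated assumptions, not the original answer's code: the helper name is_substring_of_other is hypothetical, and the values are assumed to be non-empty and free of duplicates (a duplicated value would count itself twice and be dropped).

```r
library(tidyverse)

df <- tibble(x = c('a', 'ab', 'abc', 'abcd', 'abd', 'efg'))

# Hypothetical helper: TRUE when `s` occurs in more than one element of `pool`.
# Every value matches itself once, so a count above 1 means it is contained
# in some other element. fixed() turns off regex interpretation of `s`.
is_substring_of_other <- function(s, pool) {
  sum(str_detect(pool, fixed(s)), na.rm = TRUE) > 1
}

# Keep only the rows whose value is not contained in any other row.
result <- df %>%
  filter(!map_lgl(x, is_substring_of_other, pool = df$x))

result$x
```

Like the original, this compares every value against every other value, so the work grows quadratically with the number of rows; whether that is acceptable for ~1M rows would need to be measured on the real data.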