
Remove substring rows from tibble

I have a tibble:

df <- tibble(x = c('a', 'ab', 'abc', 'abcd', 'abd', 'efg'))

I want to remove rows that are substrings of other rows, resulting in:

result <- tibble(x = c('abcd', 'abd', 'efg'))

The solution must be quite efficient as there are ~1M rows of text.

Dean Power asked Mar 24 '26 10:03



1 Answer

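The test str_detect(df$x, fixed("foo")) checks, for each element of df$x, whether it contains "foo" as a literal substring. Summing the result counts the matching elements. That count is always at least 1, because every string contains itself. If it is greater than 1, the string also occurs inside another element, so filter(!...) removes that row. Note that this compares every row against every other row, so the run time grows quadratically with the number of rows.

library(tidyverse)

df <- tibble(x = c('a', 'ab', 'abc', 'abcd', 'abd', 'efg'))

# Keep a row only if its string is contained in no element other than itself.
# fixed() treats each string as literal text rather than as a regular expression.
df %>% filter(!map_lgl(x, ~ sum(str_detect(df$x, fixed(.x)), na.rm = TRUE) > 1))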
#> # A tibble: 3 x 1
#>   x    
#>   <chr>
#> 1 abcd 
#> 2 abd  
#> 3 efg

Created on 2022-02-18 by the reprex package (v2.0.0)
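
Since the question mentions ~1M rows, it may help to cut down the number of comparisons. The following is only a sketch in base R, not part of the original answer: drop_substrings is a hypothetical helper name, and it assumes there are no NA values. It sorts the unique strings from longest to shortest and tests each one only against the strings already kept, which is enough because anything contained in a dropped string is also contained in the kept string that covers it. The worst case is still quadratic when most strings survive, so it should be benchmarked on real data. Unlike the one-liner above, it keeps duplicated strings instead of removing every copy.

```r
# Hypothetical helper (an assumption, not from the original answer):
# return the elements of x that are not contained in a different, longer string.
drop_substrings <- function(x) {
  u <- unique(x)
  # Longest first: a string can only be contained in a strictly longer one.
  u <- u[order(nchar(u), decreasing = TRUE)]
  keep <- logical(length(u))
  for (i in seq_along(u)) {
    # Compare only against strings kept so far (all at least as long as u[i]).
    # fixed = TRUE matches literal text, not a regular expression.
    keep[i] <- !any(grepl(u[i], u[keep], fixed = TRUE))
  }
  # Preserve the original order of x.
  x[x %in% u[keep]]
}

drop_substrings(c('a', 'ab', 'abc', 'abcd', 'abd', 'efg'))
```

With the tibble from the question, this could be used as filter(df, x %in% drop_substrings(x)).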

danlooo answered Mar 26 '26 00:03



