Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast string counting in dplyr

Tags:

regex

r

dplyr

purrr

I need to count instances of a pattern (fixed) within a string that's held in a column in a tibble. mutate(x = str_count(col, pattern)) does exactly what I want but it's not fast enough to process the volume of strings I have to evaluate.

Simplified tibble

test = tibble(
  id=c(1,2,3),
  seq=c("ATCG","ATTT","CGCG")
)

Works but is too slow in microbenchmark testing

test %>% mutate(CpG = str_count(seq, "CG"))

Theoretically faster on single sequences but not working on my tibble column

This just gives the single value of the first row's count

test %>% mutate(CpG = sum(gregexpr("CG", seq, fixed=TRUE)[[1]] > 0))

I tried to get purrr::map to work, but I'm failing at it...here are my attempts that won't run:

test %>% mutate(CpG = map(~sum(gregexpr("CG", seq, fixed=TRUE)[[1]] > 0)))
test %>% mutate(CpG = map(seq, ~sum(gregexpr("CG", .x, fixed=TRUE)[[1]] > 0)))

Edit: Time Benchmarks

Looks like stringi::stri_count_fixed is the optimal choice on the test set.

microbenchmark(
  mutate(test, CpG = stringr::str_count(seq, "CG")),
  mutate(test, CpG = purrr::map_int(seq, ~sum(gregexpr("CG", .x, fixed=TRUE)[[1]] > 0))),
  mutate(test, CpG = Biostrings::vcountPattern("CG", DNAStringSet(seq))),
  mutate(test, CpG = lengths(regmatches(seq, gregexpr("CG", seq)))),
  mutate(test, CpG = stringi::stri_count_fixed(seq,"CG"))
)  
Unit: microseconds
                                                                                          expr     min       lq     mean   median       uq      max neval
                                             mutate(test, CpG = stringr::str_count(seq, "CG")) 125.502 133.9995 159.0844 147.9045 164.2925  441.242   100
 mutate(test, CpG = purrr::map_int(seq, ~sum(gregexpr("CG", .x,      fixed = TRUE)[[1]] > 0))) 167.210 183.4695 215.7746 202.8890 230.5275  450.177   100
                        mutate(test, CpG = Biostrings::vcountPattern("CG", DNAStringSet(seq))) 789.889 827.3810 955.8919 887.3400 989.7415 1845.212   100
                             mutate(test, CpG = lengths(regmatches(seq, gregexpr("CG", seq)))) 150.315 159.4750 178.9039 169.1840 185.1345  316.479   100
                                      mutate(test, CpG = stringi::stri_count_fixed(seq, "CG")) 112.015 120.6615 131.4281 125.0470 135.4330  213.184   100

And, here are the results with one portion of my real data (~20,000 instances of 1500 nt sequences) run on a remote server with ~3X the available RAM and cores of my initial run. I continue to be impressed with stringi's implementation, which somehow read through everything in blazing 20-something milliseconds.

Unit: milliseconds
                                                                                          expr        min         lq       mean     median        uq        max neval
                                             mutate(test, CpG = stringr::str_count(seq, "CG"))  402.55513  408.04101  419.82213  414.31085  425.4113  481.08128   100
 mutate(test, CpG = purrr::map_int(seq, ~sum(gregexpr("CG", .x,      fixed = TRUE)[[1]] > 0)))  429.84148  436.02402  474.08864  438.84198  447.3058 1054.48253   100
                        mutate(test, CpG = Biostrings::vcountPattern("CG", DNAStringSet(seq)))  301.26062  309.76674  310.97071  310.08229  310.8837  336.12834   100
                             mutate(test, CpG = lengths(regmatches(seq, gregexpr("CG", seq)))) 1981.78309 1990.84206 2050.41928 1999.07389 2020.3675 2980.62566   100
                                      mutate(test, CpG = stringi::stri_count_fixed(seq, "CG"))   23.33313   23.55215   23.90881   23.66268   23.8916   27.40499   100

Since I could run this on the native DNAStringSet since my data starts out as a GRanges object, I microbenchmarked the vector of sequences, too, and got ranges in the 240-250 ms ballpark.

I have another use-case coming up with 1,000-100,000 nt long sequences with longer search patterns and will plan to update with those results when I move back to that project.

like image 324
GenesRus Avatar asked Sep 08 '25 11:09

GenesRus


1 Answers

You can use Biostrings::vcountPattern

library(Biostrings)
test %>% mutate(CpG = vcountPattern("CG", DNAStringSet(seq)))
#  # A tibble: 3 x 3
#     id seq     CpG
#  <dbl> <chr> <int>
#1     1 ATCG      1
#2     2 ATTT      0
#3     3 CGCG      2

Biostrings::vcountPattern is considerably faster than the str_count solution for a vector of longer (DNA) strings; here are some microbenchmark results

# Sample data
df <- tibble(
    id = 1:100,
    seq = replicate(
        100,
        paste0(sample(c("A", "C", "G", "T"), 100000, replace = T), collapse = "")))


library(microbenchmark)
res <- microbenchmark(
    str_count = df %>% mutate(CpG = str_count(seq, "CG")),
    vmatchPattern = df %>% mutate(CpG = vcountPattern("CG", DNAStringSet(seq))))
#Unit: milliseconds
#          expr       min       lq     mean   median       uq      max neval
#     str_count 156.37642 162.1629 167.2005 164.4958 169.2456 251.3426   100
# vmatchPattern  95.68998 102.4176 108.2428 105.2501 108.5511 322.3143   100

library(ggplot2)
autoplot(res)

enter image description here

like image 112
Maurits Evers Avatar answered Sep 11 '25 02:09

Maurits Evers