I need to count instances of a pattern (fixed) within a string that's held in a column in a tibble. mutate(x = str_count(col, pattern))
does exactly what I want but it's not fast enough to process the volume of strings I have to evaluate.
test = tibble(
id=c(1,2,3),
seq=c("ATCG","ATTT","CGCG")
)
test %>% mutate(CpG = str_count(seq, "CG"))
This just gives the single value of the first row's count
test %>% mutate(CpG = sum(gregexpr("CG", seq, fixed=TRUE)[[1]] > 0))
I tried to get purrr::map to work, but I'm failing at it...here are my attempts that won't run:
test %>% mutate(CpG = map(~sum(gregexpr("CG", seq, fixed=TRUE)[[1]] > 0)))
test %>% mutate(CpG = map(seq, ~sum(gregexpr("CG", .x, fixed=TRUE)[[1]] > 0)))
Looks like stringi::stri_count_fixed
is the optimal choice on the test set.
microbenchmark(
mutate(test, CpG = stringr::str_count(seq, "CG")),
mutate(test, CpG = purrr::map_int(seq, ~sum(gregexpr("CG", .x, fixed=TRUE)[[1]] > 0))),
mutate(test, CpG = Biostrings::vcountPattern("CG", DNAStringSet(seq))),
mutate(test, CpG = lengths(regmatches(seq, gregexpr("CG", seq)))),
mutate(test, CpG = stringi::stri_count_fixed(seq,"CG"))
)
Unit: microseconds
expr min lq mean median uq max neval
mutate(test, CpG = stringr::str_count(seq, "CG")) 125.502 133.9995 159.0844 147.9045 164.2925 441.242 100
mutate(test, CpG = purrr::map_int(seq, ~sum(gregexpr("CG", .x, fixed = TRUE)[[1]] > 0))) 167.210 183.4695 215.7746 202.8890 230.5275 450.177 100
mutate(test, CpG = Biostrings::vcountPattern("CG", DNAStringSet(seq))) 789.889 827.3810 955.8919 887.3400 989.7415 1845.212 100
mutate(test, CpG = lengths(regmatches(seq, gregexpr("CG", seq)))) 150.315 159.4750 178.9039 169.1840 185.1345 316.479 100
mutate(test, CpG = stringi::stri_count_fixed(seq, "CG")) 112.015 120.6615 131.4281 125.0470 135.4330 213.184 100
And, here are the results with one portion of my real data (~20,000 instances of 1500 nt sequences) run on a remote server with ~3X the available RAM and cores of my initial run. I continue to be impressed with stringi
's implementation, which somehow read through everything in blazing 20-something milliseconds.
Unit: milliseconds
expr min lq mean median uq max neval
mutate(test, CpG = stringr::str_count(seq, "CG")) 402.55513 408.04101 419.82213 414.31085 425.4113 481.08128 100
mutate(test, CpG = purrr::map_int(seq, ~sum(gregexpr("CG", .x, fixed = TRUE)[[1]] > 0))) 429.84148 436.02402 474.08864 438.84198 447.3058 1054.48253 100
mutate(test, CpG = Biostrings::vcountPattern("CG", DNAStringSet(seq))) 301.26062 309.76674 310.97071 310.08229 310.8837 336.12834 100
mutate(test, CpG = lengths(regmatches(seq, gregexpr("CG", seq)))) 1981.78309 1990.84206 2050.41928 1999.07389 2020.3675 2980.62566 100
mutate(test, CpG = stringi::stri_count_fixed(seq, "CG")) 23.33313 23.55215 23.90881 23.66268 23.8916 27.40499 100
Since I could run this on the native DNAStringSet since my data starts out as a GRanges object, I microbenchmarked the vector of sequences, too, and got ranges in the 240-250 ms ballpark.
I have another use-case coming up with 1,000-100,000 nt long sequences with longer search patterns and will plan to update with those results when I move back to that project.
You can use Biostrings::vcountPattern
library(Biostrings)
test %>% mutate(CpG = vcountPattern("CG", DNAStringSet(seq)))
# # A tibble: 3 x 3
# id seq CpG
# <dbl> <chr> <int>
#1 1 ATCG 1
#2 2 ATTT 0
#3 3 CGCG 2
Biostrings::vcountPattern
is considerably faster than the str_count
solution for a vector of longer (DNA) strings; here are some microbenchmark
results
# Sample data
df <- tibble(
id = 1:100,
seq = replicate(
100,
paste0(sample(c("A", "C", "G", "T"), 100000, replace = T), collapse = "")))
library(microbenchmark)
res <- microbenchmark(
str_count = df %>% mutate(CpG = str_count(seq, "CG")),
vmatchPattern = df %>% mutate(CpG = vcountPattern("CG", DNAStringSet(seq))))
#Unit: milliseconds
# expr min lq mean median uq max neval
# str_count 156.37642 162.1629 167.2005 164.4958 169.2456 251.3426 100
# vmatchPattern 95.68998 102.4176 108.2428 105.2501 108.5511 322.3143 100
library(ggplot2)
autoplot(res)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With