I have to shuffle elements of a string. I wrote a code:
sequ <- "GCTTCG"
set.seed(2017)
i <- sample(1:nchar(sequ))
separate.seq.letters <- unlist(strsplit(sequ, ""))
paste(separate.seq.letters[i], collapse = "")
[1] "GTCGTC"
This code shuffles elements one time. The main question would be is there a better (more effective) way to do that? For very long sequences and huge amount of shuffles strsplit, paste commands takes some extra time.
Making use of the Rcpp package to handle in C is probably fastest.
Below I've done some benchmarking of a handful of approaches suggested so far, including:
Except for the stringi function, here are the others wrapped into functions for testing:
f_question <- function(s) {
i <- sample(1:nchar(s))
separate.seq.letters <- unlist(strsplit(s, ""))
paste(separate.seq.letters[i], collapse = "")
}
f_comment <- function(s) {
s1 <- unlist(strsplit(s, ""))
paste(s1[sample(nchar(s))], collapse="")
}
library(Biostrings)
f_biostring <- function(s) {
probes <- DNAStringSet(s)
lapply(probes, sample)
}
Rcpp::cppFunction(
'std::string shuffleString(std::string s) {
int x = s.length();
for (int y = x; y > 0; y--) {
int pos = rand()%x;
char tmp = s[y-1];
s[y-1] = s[pos];
s[pos] = tmp;
}
return s;
}'
)
For testing, load libraries and write function to generate sequences of length n:
library(microbenchmark)
library(tidyr)
library(ggplot2)
generate_string <- function(n) {
paste(sample(c("A", "C", "G", "T"), n, replace = TRUE), collapse = "")
}
sequ <- generate_string(10)
# Test example....
sequ
#> [1] "TTATCAAGGC"
f_question(sequ)
#> [1] "CATGGTACAT"
f_comment(sequ)
#> [1] "GATTATAGCC"
f_biostring(sequ)
#> [[1]]
#> 10-letter "DNAString" instance
#> seq: TAGATCGCAT
shuffleString(sequ)
#> [1] "GATTAATCGC"
stringi::stri_rand_shuffle(sequ)
#> [1] "GAAGTCCTTA"
Testing all functions with small n (10 - 100):
ns <- seq(10, 100, by = 10)
times <- sapply(ns, function(n) {
string <- generate_string(n)
op <- microbenchmark(
QUESTION = f_question(string),
COMMENT = f_comment(string),
BIOSTRING = f_biostring(string),
RCPP = shuffleString(string),
STRINGI = stringi::stri_rand_shuffle(string)
)
by(op$time, op$expr, function(t) mean(t) / 1000)
})
times <- t(times)
times <- as.data.frame(cbind(times, n = ns))
times <- gather(times, -n, key = "fun", value = "time")
pd <- position_dodge(width = 0.2)
ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
geom_point(position = pd) +
geom_line(position = pd) +
theme_bw()

Biostrings approach is pretty slow.
Dropping this and moving up to 100 - 1000 (code stays same except ns):

The R-based functions (from the question and comment) are comparable, but falling behind.
Dropping these and moving up to 1000 - 10000:

Looks like the custom Rcpp function is the winner, particularly as the string length grows. However, if choosing between these, consider that the stringi function, stri_rand_shuffle, will be more robust (e.g., better tested and designed to handle corner cases).
You could have a look at stri_rand_shuffle(), from the stringi package. It is written entirely in C and should be very efficient. According to the documentation, it
Generates a (pseudo)random permutation of code points in each string.
Let's try it out:
replicate(5, stringi::stri_rand_shuffle("GCTTCG"))
# [1] "GTTCCG" "CCGTTG" "CTCTGG" "CCGGTT" "GTCGCT"
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With