Better way to shuffle elements of a string in R

Question

I have to shuffle elements of a string. I wrote a code:

sequ <- "GCTTCG"
set.seed(2017)
i <- sample(1:nchar(sequ))
separate.seq.letters <- unlist(strsplit(sequ, ""))
paste(separate.seq.letters[i], collapse = "")
[1] "GTCGTC"

This code shuffles elements one time. The main question would be is there a better (more effective) way to do that? For very long sequences and huge amount of shuffles strsplit, paste commands takes some extra time.

Simon Jackson · Accepted Answer

Making use of the Rcpp package to handle in C is probably fastest.

Below I've done some benchmarking of a handful of approaches suggested so far, including:

Approach in the QUESTION
Approach in COMMENT by @akrun
Approach using BIOSTRINGS package, suggested by @knb
Approach using the STRINGI package, suggested by @Rich
A custom RCPP function, based on this post.

Except for the stringi function, here are the others wrapped into functions for testing:

f_question <- function(s) {
  i <- sample(1:nchar(s))
  separate.seq.letters <- unlist(strsplit(s, ""))
  paste(separate.seq.letters[i], collapse = "")
}

f_comment <- function(s) {
  s1 <- unlist(strsplit(s, ""))
  paste(s1[sample(nchar(s))], collapse="")
}

library(Biostrings)
f_biostring <- function(s) {
  probes <- DNAStringSet(s)
  lapply(probes, sample)
}

Rcpp::cppFunction(
  'std::string shuffleString(std::string s) {
    int x = s.length();
    for (int y = x; y > 0; y--) { 
      int pos = rand()%x;
      char tmp = s[y-1];
      s[y-1] = s[pos];
      s[pos] = tmp;
    }
    return s;
  }'
)

For testing, load libraries and write function to generate sequences of length n:

library(microbenchmark)
library(tidyr)
library(ggplot2)

generate_string <- function(n) {
  paste(sample(c("A", "C", "G", "T"), n, replace = TRUE), collapse = "")
}

sequ <- generate_string(10)

# Test example....

sequ
#> [1] "TTATCAAGGC"

f_question(sequ)
#> [1] "CATGGTACAT"
f_comment(sequ)
#> [1] "GATTATAGCC"
f_biostring(sequ)
#> [[1]]
#>   10-letter "DNAString" instance
#> seq: TAGATCGCAT
shuffleString(sequ)
#> [1] "GATTAATCGC"
stringi::stri_rand_shuffle(sequ)
#> [1] "GAAGTCCTTA"

Testing all functions with small n (10 - 100):

ns <- seq(10, 100, by = 10)
times <- sapply(ns, function(n) {
  string <- generate_string(n)

  op <- microbenchmark(
    QUESTION     = f_question(string),
    COMMENT      = f_comment(string),
    BIOSTRING    = f_biostring(string),
    RCPP         = shuffleString(string),
    STRINGI      = stringi::stri_rand_shuffle(string)
  )
  by(op$time, op$expr, function(t) mean(t) / 1000)
})
times <- t(times)
times <- as.data.frame(cbind(times, n = ns))

times <- gather(times, -n, key = "fun", value = "time")
pd <- position_dodge(width = 0.2)
ggplot(times, aes(x = n, y = time, group = fun, color = fun)) +
  geom_point(position = pd) +
  geom_line(position = pd) +
  theme_bw()

enter image description here

Biostrings approach is pretty slow.

Dropping this and moving up to 100 - 1000 (code stays same except ns):

enter image description here

The R-based functions (from the question and comment) are comparable, but falling behind.

Dropping these and moving up to 1000 - 10000:

enter image description here

Looks like the custom Rcpp function is the winner, particularly as the string length grows. However, if choosing between these, consider that the stringi function, stri_rand_shuffle, will be more robust (e.g., better tested and designed to handle corner cases).

Rich Scriven · Answer

You could have a look at stri_rand_shuffle(), from the stringi package. It is written entirely in C and should be very efficient. According to the documentation, it

Generates a (pseudo)random permutation of code points in each string.

Let's try it out:

replicate(5, stringi::stri_rand_shuffle("GCTTCG"))
# [1] "GTTCCG" "CCGTTG" "CTCTGG" "CCGGTT" "GTCGCT"

Better way to shuffle elements of a string in R

Tags:

string

r

shuffle

neringab

2 Answers

Simon Jackson

Rich Scriven

Recent Activity

Donate For Us

Better way to shuffle elements of a string in R

Tags:

string

r

shuffle

neringab

2 Answers

Simon Jackson

Rich Scriven

Related questions

Recent Activity

Donate For Us