Efficiently randomly sampling List while maintaining order

Question

I would like to take random samples from very large lists while maintaining the order. I wrote the script below, but it requires .map(idx => ls(idx)) which is very wasteful. I can see a way of making this more efficient with a helper function and tail recursion, but I feel that there must be a simpler solution that I'm missing.

Is there a clean and more efficient way of doing this?

import scala.util.Random

def sampledList[T](ls: List[T], sampleSize: Int) = {
  Random
    .shuffle(ls.indices.toList)
    .take(sampleSize)
    .sorted
    .map(idx => ls(idx))
}

val sampleList = List("t","h","e"," ","q","u","i","c","k"," ","b","r","o","w","n")
// imagine the list is much longer though

sampledList(sampleList, 5) // List(e, u, i, r, n)

EDIT: It appears I was unclear: I am referring to maintaining the order of the values, not the original List collection.

kosii · Accepted Answer

If by

maintaining the order of the values

you understand to keeping the elements in the sample in the same order as in the ls list, then with a small modification to your original solution the performances can be greatly improved:

import scala.util.Random

def sampledList[T](ls: List[T], sampleSize: Int) = {
  Random.shuffle(ls.zipWithIndex).take(sampleSize).sortBy(_._2).map(_._1)
}

This solution has a complexity of O(n + k*log(k)), where n is the list's size, and k is the sample size, while your solution is O(n + k * log(k) + n*k).

Régis Jean-Gilles · Answer

Here is an (more complex) alternative that has O(n) complexity. You can't get any better in terms of complexity (though you could get better performance by using another collection, in particular a collection that has a constant time size implementation). I did a quick benchmark which indicated that the speedup is very substantial.

import scala.util.Random
import scala.annotation.tailrec

def sampledList[T](ls: List[T], sampleSize: Int) = {
  @tailrec
  def rec(list: List[T], listSize: Int, sample: List[T], sampleSize: Int): List[T] = {
    require(listSize >= sampleSize, 
      s"listSize must be >= sampleSize, but got listSize=$listSize and sampleSize=$sampleSize"
    )
    list match {
      case hd :: tl => 
        if (Random.nextInt(listSize) < sampleSize)
          rec(tl, listSize-1, hd :: sample, sampleSize-1)
        else rec(tl, listSize-1, sample, sampleSize)
      case Nil =>
        require(sampleSize == 0, // Should never happen
          s"sampleSize must be zero at the end of processing, but got $sampleSize"
        )
        sample
    }
  }
  rec(ls, ls.size, Nil, sampleSize).reverse
}

The above implementation simply iterates over the list and keeps (or not) the current element according to a probability which is designed to give the same chance to each element. My logic may have a flow, but at first blush it seems sound to me.

Efficiently randomly sampling List while maintaining order

Tags:

performance

list

random

scala

TimY

2 Answers

kosii

Régis Jean-Gilles

Recent Activity

Donate For Us

Efficiently randomly sampling List while maintaining order

Tags:

performance

list

random

scala

TimY

2 Answers

kosii

Régis Jean-Gilles

Related questions

Recent Activity

Donate For Us