Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficiently randomly sampling List while maintaining order

I would like to take random samples from very large lists while maintaining the order. I wrote the script below, but it requires .map(idx => ls(idx)) which is very wasteful. I can see a way of making this more efficient with a helper function and tail recursion, but I feel that there must be a simpler solution that I'm missing.

Is there a clean and more efficient way of doing this?

import scala.util.Random

def sampledList[T](ls: List[T], sampleSize: Int) = {
  Random
    .shuffle(ls.indices.toList)
    .take(sampleSize)
    .sorted
    .map(idx => ls(idx))
}

val sampleList = List("t","h","e"," ","q","u","i","c","k"," ","b","r","o","w","n")
// imagine the list is much longer though

sampledList(sampleList, 5) // List(e, u, i, r, n)

EDIT: It appears I was unclear: I am referring to maintaining the order of the values, not the original List collection.

like image 897
TimY Avatar asked Dec 01 '25 23:12

TimY


2 Answers

If by

maintaining the order of the values

you understand to keeping the elements in the sample in the same order as in the ls list, then with a small modification to your original solution the performances can be greatly improved:

import scala.util.Random

def sampledList[T](ls: List[T], sampleSize: Int) = {
  Random.shuffle(ls.zipWithIndex).take(sampleSize).sortBy(_._2).map(_._1)
}

This solution has a complexity of O(n + k*log(k)), where n is the list's size, and k is the sample size, while your solution is O(n + k * log(k) + n*k).

like image 165
kosii Avatar answered Dec 05 '25 00:12

kosii


Here is an (more complex) alternative that has O(n) complexity. You can't get any better in terms of complexity (though you could get better performance by using another collection, in particular a collection that has a constant time size implementation). I did a quick benchmark which indicated that the speedup is very substantial.

import scala.util.Random
import scala.annotation.tailrec

def sampledList[T](ls: List[T], sampleSize: Int) = {
  @tailrec
  def rec(list: List[T], listSize: Int, sample: List[T], sampleSize: Int): List[T] = {
    require(listSize >= sampleSize, 
      s"listSize must be >= sampleSize, but got listSize=$listSize and sampleSize=$sampleSize"
    )
    list match {
      case hd :: tl => 
        if (Random.nextInt(listSize) < sampleSize)
          rec(tl, listSize-1, hd :: sample, sampleSize-1)
        else rec(tl, listSize-1, sample, sampleSize)
      case Nil =>
        require(sampleSize == 0, // Should never happen
          s"sampleSize must be zero at the end of processing, but got $sampleSize"
        )
        sample
    }
  }
  rec(ls, ls.size, Nil, sampleSize).reverse
}

The above implementation simply iterates over the list and keeps (or not) the current element according to a probability which is designed to give the same chance to each element. My logic may have a flow, but at first blush it seems sound to me.

like image 34
Régis Jean-Gilles Avatar answered Dec 04 '25 22:12

Régis Jean-Gilles