I would like to take random samples from very large lists while maintaining the order. I wrote the script below, but it requires .map(idx => ls(idx)) which is very wasteful. I can see a way of making this more efficient with a helper function and tail recursion, but I feel that there must be a simpler solution that I'm missing.
Is there a clean and more efficient way of doing this?
import scala.util.Random
def sampledList[T](ls: List[T], sampleSize: Int) = {
Random
.shuffle(ls.indices.toList)
.take(sampleSize)
.sorted
.map(idx => ls(idx))
}
val sampleList = List("t","h","e"," ","q","u","i","c","k"," ","b","r","o","w","n")
// imagine the list is much longer though
sampledList(sampleList, 5) // List(e, u, i, r, n)
EDIT:
It appears I was unclear: I am referring to maintaining the order of the values, not the original List collection.
If by
maintaining the order of the values
you understand to keeping the elements in the sample in the same order as in the ls list, then with a small modification to your original solution the performances can be greatly improved:
import scala.util.Random
def sampledList[T](ls: List[T], sampleSize: Int) = {
Random.shuffle(ls.zipWithIndex).take(sampleSize).sortBy(_._2).map(_._1)
}
This solution has a complexity of O(n + k*log(k)), where n is the list's size, and k is the sample size, while your solution is O(n + k * log(k) + n*k).
Here is an (more complex) alternative that has O(n) complexity. You can't get any better in terms of complexity (though you could get better performance by using another collection, in particular a collection that has a constant time size implementation). I did a quick benchmark which indicated that the speedup is very substantial.
import scala.util.Random
import scala.annotation.tailrec
def sampledList[T](ls: List[T], sampleSize: Int) = {
@tailrec
def rec(list: List[T], listSize: Int, sample: List[T], sampleSize: Int): List[T] = {
require(listSize >= sampleSize,
s"listSize must be >= sampleSize, but got listSize=$listSize and sampleSize=$sampleSize"
)
list match {
case hd :: tl =>
if (Random.nextInt(listSize) < sampleSize)
rec(tl, listSize-1, hd :: sample, sampleSize-1)
else rec(tl, listSize-1, sample, sampleSize)
case Nil =>
require(sampleSize == 0, // Should never happen
s"sampleSize must be zero at the end of processing, but got $sampleSize"
)
sample
}
}
rec(ls, ls.size, Nil, sampleSize).reverse
}
The above implementation simply iterates over the list and keeps (or not) the current element according to a probability which is designed to give the same chance to each element. My logic may have a flow, but at first blush it seems sound to me.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With