Is sample() function in rand crate biased toward beginning of range?


#1

I guess this is more of a statistics question.
Given the implementation of the sample() function, it seems to me that it is more likely to pick items that are toward the beginning of the range.
My reasoning is that as i increases in the enumeration, chances of landing a k within bound of reservoir (whose length is amount) reduces:

let k = rng.gen_range(0, i + 1 + amount);

Thus line 970 is more likely to return None and therefore reducing the chance of choosing elem to be placed in the spot within reservoir:

if let Some(spot) = reservoir.get_mut(k) {
    *spot = elem;
}

Am I missing something here? Does this implementation really randomly pick elements from the pool and is not biased?


#2

The implementation really randomly picks elements from the input iterator and is not biased. As suggested by the variable name, this is an implementation of reservoir sampling.

As you suspect, elements that are toward the beginning of the range are more likely to be put into the reservoir. However those elements are also more likely to be taken out of the reservoir by a later element that takes its place. Altogether this balances out so that each element in the iterator has an equal probability of being in the final reservoir once the end of the iterator is reached. The expression “i + 1 + amount” is carefully chosen to make this balance work.

You can find lots of different explanations of reservoir sampling online that may make a different amount of sense depending on your math and statistics background.


#3

According to my experiment, it’s probably usable :slight_smile:

use rand::{thread_rng, sample};

fn main() {
    let mut rng = thread_rng();
    let mut histogram = vec![0, 0, 0];
    for _i in 0..100_000_000 {
        let sample = sample(&mut rng, 0..histogram.len(), 1);
        histogram[sample[0]] += 1;
    }
    println!("{:?}", histogram);
    println!("{:?}", histogram.iter().map(|&i| i as f64 / histogram.iter().sum::<i32>() as f64).collect::<Vec<_>>());
}

v

[33331955, 33333154, 33334891]
[0.33331955, 0.33333154, 0.33334891]

#4

In @nodakai’s experiment, the length of the iterator is 3 and the size of the reservoir is 1. The sample function starts by filling the reservoir with the first element in the iterator, so the probability of each element being in the reservoir is:

before i=0: [100%, 0%, 0%]

The first element is definitely in the reservoir.

Then the sample function begins replacing elements in the reservoir, starting with i=0. For i=0 it generates a uniformly distributed random integer in the range [0, i+1+amount) which is [0, 2), so either 0 or 1 with equal probability. If the random integer is 0, it replaces the element in the reservoir with the second element in the iterator:

after i=0: [50%, 50%, 0%]

The first element in the iterator has a 50% probability of having been removed from the reservoir, and the second element in the iterator has a 50% probability of having been added to the reservoir.

Moving on to i=1, the sample function generates a uniformly distributed random integer in the range [0, i+1+amount) which is [0, 3), so 0 or 1 or 2 with equal probability. If the random integer is 0, it replaces the element in the reservoir with the third element in the iterator:

after i=1: [33.3%, 33.3%, 33.3%]

The element currently in the reservoir before i=1 has a 1/3 probability of being removed when i=1, so it has a 2/3 probability of not being removed. The probability for each of the first 2 elements in the iterator to be in the reservoir after i=1 is the probability that they were in the reservoir before i=1, multiplied by the probability that they were not removed from the reservoir at i=1, so 50% * 2/3 which is 33.3%.

The result is that all 3 elements in the iterator have a 1/3 chance of being in the final reservoir of size 1. In general for an iterator of length N and a reservoir of size M, each element has a M/N probability of being in the reservoir.

As a second example, for N=6 and M=3 the probabilities work like this:

before i=0: [100%, 100%, 100%, 0%, 0%, 0%]
after i=0: [75%, 75%, 75%, 75%, 0%, 0%]
after i=1: [60%, 60%, 60%, 60%, 60%, 0%]
after i=2: [50%, 50%, 50%, 50%, 50%, 50%]


#5

thank you guys, these clarify it for me.