[Solved] Strange Performance Regression


#1

Hi,
for my random choice algorithm I offer two variants:

  • one variant returns a new vec with random samples
  • one variant writes the samples in place

The in place variant should be faster, because it doesn’t need to allocate a vec. And indeed, in my old code the in place variant was faster. But then I changed on both variants

while accumulated_weights < current_spoke {

to

while i < weights.len() && accumulated_weights < current_spoke {

And it regressed from

test benches::bench_random_choice_1000_it_f64          ... bench:       3,787 ns/iter (+/- 198)
test benches::bench_random_choice_in_place_1000_it_f64 ... bench:       3,515 ns/iter (+/- 185)

to

test benches::bench_random_choice_1000_it_f64          ... bench:       4,113 ns/iter (+/- 113)
test benches::bench_random_choice_in_place_1000_it_f64 ... bench:       4,427 ns/iter (+/- 103)

Both variants regressed, but the in place variant regressed a lot more, than the return a vec variant.

Anyone any idea how this could happen?
My code can be found here.


#2

For the in-place variant, shouldn’t it be j instead of i?

while j < weights.len() && accumulated_weights < current_spoke {

i is always smaller or equal j, that could explain the effect.

Probably you should change the variable names (i <-> j) to be consistent with the “normal” variant.


#3

Yes, that’s correct.

Why is that? i and j are equivalent. Anyway, I workarounded this problem in version 0.2.0. :slight_smile:


#4

I was looking at a too new version, so never mind…


#5

Btw: If someone wants to examine anyway this problem (@troplin ?) , this commit has still this issue. I workarounded the problem, but this could be a compiler (llvm) bug.

Here are the code snippets

 pub fn random_choice_f64<'a, T>(samples: &'a [T], weights: &[f64], n: usize) -> Vec<&'a T> {
        if weights.len() == 0 || n == 0 {
            return Vec::new();
        }

        let sum: f64 = weights.iter().fold(0.0, |acc, &i| acc + i);
        let spoke_gap: f64 = sum / n as f64;

        // next_f64() ∈ [0.0, 1.0)
        let spin = thread_rng().next_f64() * spoke_gap;

        let mut i: usize = 0;
        let mut accumulated_weights = weights[0];
        let mut choices: Vec<&T> = Vec::with_capacity(n);
        let mut current_spoke: f64 = spin;

        for _ in 0..n {
            // add condition i < weights.len(), because float leads to inaccurate
            // calculations which can lead to i >= weights.len() 
            while i < weights.len() && accumulated_weights < current_spoke {  // <-- Here is the problem!
                i += 1;
                accumulated_weights += weights[i];
            }
            choices.push(&samples[i]);
            current_spoke += spoke_gap;
        }

        choices
    }

    pub fn random_choice_in_place_f64<T: Clone>(samples: &mut [T], weights: &[f64]) {
        if weights.len() < 2 {
            return;
        }

        let sum: f64 = weights.iter().fold(0.0, |acc, &i| acc + i);
        let n: usize = weights.len();
        let spoke_gap: f64 = sum / n as f64;

        // next_f64() ∈ [0.0, 1.0)
        let spin = thread_rng().next_f64() * spoke_gap;

        let mut j: usize = 0;
        let mut accumulated_weights = weights[0];
        let mut current_spoke: f64 = spin;

        for i in 0..n {
            // add condition j < weights.len(), because float leads to inaccurate
            // calculations which can lead to j >= weights.len() 
            while j < weights.len() && accumulated_weights < current_spoke { // <-- Here is the problem!
                j += 1;
                accumulated_weights += weights[j];
            }
            samples[i] = samples[j].clone();
            current_spoke += spoke_gap;
        }
    }

#6

I don’t see it the reason.

But I think the in-place algorithm is flawed, because i can become larger than j and thus it will overwrite items that are yet to be selected.

I think the algorithm is flawed (not random) anyway because each item i is guaranteed to be selected at least floor(weight[i] / spoke_gap) times and only the remaining items are distributed somewhat randomly.

An extreme case is your benchmark case where all weights are equal and n == samples.len(). In that case, the input will always be the same as the output because all weights are equal to spoke_gap and that means that accumulated_weights and spoke_gap will always grow with the same rate.
Not exactly random if you ask me.


#7

How does the random number generator affect the algorithmic complexity of the various loops? I’m having a hard time following whether the ThreadRng and its reseeding characteristics could change either the number of iterations from run to run, or the overall cost.

Since there’s not a SeedableRng used in the benchmarks, could that be related, making it difficult to distinguish between a regression and a difference in rng behavior?


#8

@troplin You are right with the in place variant. This is severe and I can’t find a working solution for this. Maybe I have to drop the in place variant.

The rest is not completly true. Look at the pseudo code of the stochastic universal sampling algorithm. Well yes, the input can be the same as the output, if everything is equally distributed, but this is in fact something wanted and for a benchmark test it doesn’t matter how the samples are distributed.


#9

I will add a test tomorrow, but this was always reproducable.


#10

I couldn’t reproduce this with the same seed…


#11

I believe there is no solution for this.

It is true. If the weights are equally distributed and n == samples.len() then output is guaranteed to be the same as the input, there’s absolutely no randomness. And all samples with weight >= spoke_gap are guaranteed to be selected at least once.
Sure, the algorithm exists, it has a name and probably also valid use cases but I wouldn’t call it random choice.


#12

In my opinion it is still random. When the weights are not equally distributed, you can only tell the lower and the upper bound how often a sample is getting selected.
Well yes, you have a guarantee when they are equally distributed, but this is still a damn good approximation on reality.


#13

But the bounds are very tight, the number is always between floor(weight[i] / spoke_gap) and ceil(weight[i] / spoke_gap) which is only a difference by 1 in the general case and 0 if the weight is a multiple of spoke_gap.

With a real random selection, every combination is possible if the weights are nonzero, with this algorithm only a very small subset of all combinations are possible at all.


#14

Look at the other case:
You create a random number between 0 and sum and chose the respective sample and you do this n times.
Not only the runtime would be greater, but you would tend to choose the same sample n times, if the probabilities are equally distributed. This is what the roulette wheel algorithm does.
I don’t know any better “random” algorithm than stochastic universal sampling. The stochastic universal sampling algorithm is not perfect, but has a much more realistic approximation with a great runtime.


#15

Why would that be true? That would mean that the random number generator is flawed.
If n is big enough, the selection will be distributed according to the weights, this is called Law of large numbers.

What you call “Roulette wheel algorithm” is real random selection (if the random number generator delivers true randomness). It doesn’t get more “realistic”. What you want is fairness, but chance is not fair!


#16

Let me precise my statement. You are of couse right, when n is big enough, the distribution gets preserved, but there are algorithms like fast slam which also work with 15 particles (see page 32). And there are situations where particles are equally distributed and if you have less particles, your interest is that the distributions gets preserved.

Well, yes, that’s true, but you have to look at the context of the applications. Distributions have to get preseved even for smaller samples.