Non-deterministic results with float 32 addition

Hello all,
First time posting here, let me know if there is a better place for this kind of question :slight_smile:

I am seeing what looks like non-deterministic behavior when summing f32 values, and am looking for best practice/advice on how to deal with this. I am working on some machine learning algorithms in Rust as side projects. I am using f32 values, as this is fairly common in these algorithms for speed. However, depending on how I sum a vector of f32 values, I get different results.
For example...

let records = 300000;
let vec = vec![0.23500371; records];
println!("Sum Result: {}", vec.iter().sum::<f32>());
// Sum Result: 70366.76

Now the same value, using multiplication to simulate summing.

println!("Multiplication Results {}", vec[0] * (records as f32));
// Multiplication Results 70501.11

Finally, summing chunks of the vector, and then summing those partial sums. This mimics a case where I sometimes need to sum a vector chunk by chunk, and I would expect the sub-sums summed together to equal the sum of the entire vector, but again I get different results...

let chunks = vec.chunks_exact(10);
let remainder = chunks.remainder().to_vec();
let vec_c = chunks.map(|v| v.iter().sum::<f32>()).collect::<Vec<f32>>();
println!("Batch sum Result: {}", vec_c.iter().sum::<f32>());
// Batch sum Result: 70521.31

Here is the link to this in the playground.

I am OK with the fact that there are going to be differences between these operations, for instance between an f32 and an f64, just because of precision, but it seems like summing chunks of a vector and then summing the same vector in its entirety should produce the same results? This is probably just my lack of knowledge around how floating point numbers work in Rust.
Is there anything I can do, or pre-processing of the numbers to ensure these different operations will produce the same results?

This is a fundamental fact of how floating-point arithmetic works. The rounding error in the result of each addition operation necessarily depends on the relative magnitude of the inputs, since that determines how much the smaller number has to be rounded to fit in the mantissa, given the exponent of the sum.

I'm not an expert in the area and can't recommend a specific solution for your problem, but for example, Kahan summation is a famous algorithm to reduce error when summing many floating point numbers.
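A minimal sketch of Kahan summation, applied to the numbers from the original post (not production code, just the textbook algorithm):

```rust
// Kahan (compensated) summation: carry a running compensation term `c`
// that recovers the low-order bits lost in each addition and feeds
// them back into the next one.
fn kahan_sum(xs: &[f32]) -> f32 {
    let mut sum = 0.0f32;
    let mut c = 0.0f32; // compensation for lost low-order bits
    for &x in xs {
        let y = x - c;
        let t = sum + y;
        c = (t - sum) - y; // algebraically zero; numerically, the rounding error of `sum + y`
        sum = t;
    }
    sum
}

fn main() {
    let vec = vec![0.23500371f32; 300_000];
    println!("Kahan sum: {}", kahan_sum(&vec));
}
```

With these inputs the result lands very close to the "true" value of about 70501.11 (the multiplication result above), rather than the drifting 70366.76 the naive left-to-right sum produces.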

In situations where you're summing a few numbers with statically known relationships, improvement can be gained simply by arranging to sum numbers of similar magnitude first, since that has the least rounding error.


No, because floating point addition is not associative. The losses from rounding can be different in different places.

Here's a very simple example:

[src/] (-2.0 + 2.0) + f64::EPSILON = 2.220446049250313e-16
[src/] -2.0 + (2.0 + f64::EPSILON) = 0.0

Specifically, the problem is catastrophic cancellation (see the Wikipedia article of that name).

Obligatory link to What every computer scientist should know about floating-point arithmetic.

If you want a quick way to improve the error behaviour for summations, you can replace .sum::&lt;f32&gt;() with .tree_fold1(f32::add).unwrap_or(0.0), using Itertools::tree_fold1 from the itertools crate (note that f32::add requires std::ops::Add to be in scope).
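tree_fold1 comes from the external itertools crate; if you'd rather stay dependency-free, the same idea, pairwise (tree) summation, can be sketched recursively. Its rounding error grows roughly with log2(n) rather than n:

```rust
// Pairwise summation: split the slice in half, sum each half
// recursively, and add the two partial sums. Partial sums stay
// comparable in magnitude, which keeps rounding losses small.
fn pairwise_sum(xs: &[f32]) -> f32 {
    match xs.len() {
        0 => 0.0,
        1 => xs[0],
        n => {
            let (lo, hi) = xs.split_at(n / 2);
            pairwise_sum(lo) + pairwise_sum(hi)
        }
    }
}

fn main() {
    let vec = vec![0.23500371f32; 300_000];
    println!("pairwise sum: {}", pairwise_sum(&vec));
}
```

For the 300,000-element vector above this lands near 70501.11, close to the exact product, unlike the naive running sum. (A production version would switch to a simple loop below some chunk size to avoid recursion overhead.)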


To add to the other answers: the results are deterministic. For example, here is your same code in Java:

int records = 300000;
float d = 0.23500371f;

float sum = 0;
for (int i = 0; i < records; i++) sum += d;

float prod = d * records;

float chunked = 0;
for (int i = 0; i < records; i += 10) {
    float chunk = 0;
    for (int j = 0; j < 10; j++) chunk += d;
    chunked += chunk;
}

System.out.println(sum);     // 70366.76
System.out.println(prod);    // 70501.11
System.out.println(chunked); // 70521.31

Which gives the same answers as your Rust code.


Thanks All, this makes a lot of sense. I understand now, that this is just a floating point property I need to consider when designing applications.
Also, thanks for the great resources!

I would assume a good takeaway from the idea of Kahan summation might be to simply use an f64 accumulator if you sum up a lot of f32s. You can convert the final result back to f32 if you like. It seems like a simpler, more performant, and more accurate approach than trying to encode a higher-precision accumulator using two f32s. If I'm not mistaken, summing up f32s this way should give quite accurate results (almost up to the full accuracy that an f32 can represent) for a maximum of about 1 billion items. (In my head, I'm roughly calculating: ~7 significant decimals for f32, and ~16 significant decimals for f64, then 16 - 7 = 9 and 1 billion is 10^9.)
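A one-line sketch of that f64-accumulator approach, wrapped in a hypothetical helper for clarity:

```rust
// Widen each f32 to f64 (exact conversion), accumulate in f64, and
// round to f32 only once at the end. For 300_000 values of this size
// the f64 accumulation error is far below one f32 ulp of the result.
fn sum_via_f64(xs: &[f32]) -> f32 {
    xs.iter().map(|&x| f64::from(x)).sum::<f64>() as f32
}

fn main() {
    let vec = vec![0.23500371f32; 300_000];
    println!("f64-accumulated sum: {}", sum_via_f64(&vec));
}
```

For the vector from the original post, this agrees with the multiplication result (about 70501.11), since effectively only the final cast rounds at f32 precision.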


Like the others said, but in three easy to understand concepts...

1. Floating point gives a big range at the cost of a loss of accuracy.
2. Every floating point operation has the potential to introduce a small amount of error.
3. If you use floating point, you are only guaranteed close results.
