How does multi-threading break floating-point additions?

This might sound super weird at first, but I encountered this problem while I tried to implement the kmeans algorithm as benchmark for a parallelizing DSL I help developing.

To keep it simple: The program groups data points into clusters and iteratively refines this grouping until some abort condition is met. This grouping step happens in parallel (using the tokio runtime) and afterwards the cluster center is calculated from the groups sequentially. And this is where I discovered my problem:

The center is calculated by adding the f32 coordinates of each group member and then dividing them by the number of elements in the group (the code for this is below).
Given the exact same input data, this yields a deterministic result for each run. Until you change the number of threads to be used in the tokio runtime.
Then, although this change has nothing in common with aforementioned summation, the resulting floating-point numbers all vary, just enough so that it adds up to large differences over time. Initially, both numbers are only about 0.003 off, but over more than hundred iterations these things add up.

Is this intended behavior or is there some sort of bug hidden either in my code or the compiler? The thing is that I can reproduce this behavior on two different machines with different operating systems and different nightly versions. But I just do not see the reason for this different behavior. Both the input data (that was to be grouped) as well as the initial centers were exactly identical in both runs, just the number of threads used in tokio (somewhere completely else) changed.

Sidenote for the technically interested folks, this is the code in question:

pub struct Value {
    /// the contents that form the observation
    pub values: Vec<f32>,
    /// cluster the value belongs to
    pub associated_cluster: usize,

pub struct Centroid {
    pub coordinates: Vec<f32>,

impl Centroid {
    pub fn from_assignments(values: &Vec<Value>, num_centroids: usize) -> Vec<Self> {
        let mut sums = vec![vec![0f32; values[0].values.len()]; num_centroids];
        let mut elements_in_cluster = vec![0; num_centroids];

        // form the sums for all centroids
        for val in values {
            for idx in 0..val.values.len() {
                sums[val.associated_cluster][idx] += val.values[idx];
            elements_in_cluster[val.associated_cluster] += 1;

        for centroid_no in 0..num_centroids {
            for sum in sums[centroid_no].iter_mut() {
                *sum /= elements_in_cluster[centroid_no] as f32;

            .map(|coords| Self {
                coordinates: coords,

This is the part that I would investigate first. Does the parallel step output the groups in a deterministic order? If you end up processing the exact same groups but in a different order, you could end up with subtly different results. This is because floating point math does not obey normal associative and commutative properties. For example, a + b + c may produce different output than a + c + b, due to differences in rounding.

Precisely how did you check that these data were identical? Were all the numbers bit-for-bit identical, or could they be off by a small amount that is not visible when logging them? Is the order also identical from run to run?


This was the solution. The parallel step did indeed output the results deterministically, but due to the way I distributed the elements into tasks, the elements were ordered differently which in turn led to different results. After I altered the distribution function to produce the same ordering the issue went away. Thank you very much, I would never have remembered that a + b + c ≠ a + c + b when considering floating point operations.

I wonder if you're losing precision from the repeated addition step too, since there's a risk that it will end up adding very small numbers to very large numbers when it gets near the final total.

The "Adding Numbers Of Very Different Magnitudes" section could apply due to repeated addition of small numbers to create a large one.

Yes. Which raises the questions:

How do know if your result is good to start with?

Which of the versions of your code are more accurate anyway, the threaded or unthreaded?

To quote a project manager of mine, from a time before FP hardware was common place and our programming language supported fixed point maths:

If you think you need floating point to solve your problem then you don't understand your problem. If you really do need floating point to solve your problem then you have a problem you won't understand.

But this might help: "What Every Computer Scientist Should Know About Floating-Point Arithmetic" : What Every Computer Scientist Should Know About Floating-Point Arithmetic.

You and @Douglas are both right. This is probably not very precise and with increasing amount of numbers to sum over, the effects from adding small numbers to large numbers becomes an issue, but this is the definition of kmeans, and when the input data is presented as float there is not much I can do.

Although in my case I do not really care about the results of the computation as long as the results for different thread numbers match. I only implemented it as a benchmark to test two parallelism frameworks.

The accuracy is the same, as the addition happens outside the parallel portion of the code.

Have you tried using 64 bit floats to see if the accuracy improves? Or looking for a fixed point way to represent the numbers with ints?

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.