Structuring an algorithm for multiple threads (Solved)

Hi all, I am learning Rust by implementing an algorithm that I have already implemented in Kotlin.

I am now trying to work out how I can structure the algorithm to fit most neatly within Rust's ownership model. (I am an experienced coder, but without formal CompSci training.)

The algorithm is a clustering algorithm for partitioning an image. The object of the exercise is to partition the image into ‘clusters’, each of which represents a group of pixels that are close to each other. Here ‘close’ means close within a five-dimensional space (x, y, r, g, b).

  • Each Pixel is associated with a Cluster.
  • There are far fewer Clusters than Pixels, so the values in a Cluster are affected by multiple Pixels.
  • Each Cluster has a running total of red, green, blue, x, y, and a count of pixels.
  • There is no need for a Cluster to know its Pixels, only to be able to calculate the average of each of the five values (x, y, r, g, b)

The algorithm for each iteration:

for each pixel
    find nearest cluster
    if this is not the current cluster 
        subtract x,y,r,g,b from current cluster (count--)
        set current cluster to new cluster
        add x,y,r,g,b to new cluster (count++)

for each cluster
    recalculate average x, y, r, g, b for use in distance calculations
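The pseudocode above could be sketched serially in Rust along these lines. All the names here (Pixel, Cluster, iterate, demo) are illustrative assumptions of mine, not code from the thread, and for brevity the cluster mean is derived from the live totals rather than cached between iterations as the post describes:

```rust
// Serial sketch of one clustering iteration; names are illustrative.
#[derive(Clone, Copy)]
struct Pixel { x: i64, y: i64, r: i64, g: i64, b: i64, cluster: usize }

#[derive(Default, Clone)]
struct Cluster { x: i64, y: i64, r: i64, g: i64, b: i64, count: i64 }

impl Cluster {
    fn add(&mut self, p: &Pixel) {
        self.x += p.x; self.y += p.y; self.r += p.r;
        self.g += p.g; self.b += p.b; self.count += 1;
    }
    fn remove(&mut self, p: &Pixel) {
        self.x -= p.x; self.y -= p.y; self.r -= p.r;
        self.g -= p.g; self.b -= p.b; self.count -= 1;
    }
    // Taxi-cab distance from the cluster mean to a pixel.
    fn distance(&self, p: &Pixel) -> i64 {
        let n = self.count.max(1);
        (self.x / n - p.x).abs() + (self.y / n - p.y).abs() + (self.r / n - p.r).abs()
            + (self.g / n - p.g).abs() + (self.b / n - p.b).abs()
    }
}

fn iterate(pixels: &mut [Pixel], clusters: &mut [Cluster]) {
    for p in pixels.iter_mut() {
        // find nearest cluster
        let nearest = (0..clusters.len())
            .min_by_key(|&i| clusters[i].distance(p))
            .expect("at least one cluster");
        // if this is not the current cluster, move the pixel across
        if nearest != p.cluster {
            let old = p.cluster;
            clusters[old].remove(p);
            p.cluster = nearest;
            clusters[nearest].add(p);
        }
    }
}

fn demo() -> usize {
    let mut clusters = vec![Cluster::default(); 2];
    // Seed one cluster near the origin and one near (100, 100, 100, 100, 100).
    clusters[0].add(&Pixel { x: 0, y: 0, r: 0, g: 0, b: 0, cluster: 0 });
    clusters[1].add(&Pixel { x: 100, y: 100, r: 100, g: 100, b: 100, cluster: 1 });
    // One pixel, initially assigned to cluster 0, that belongs with cluster 1.
    let mut pixels = vec![Pixel { x: 90, y: 90, r: 90, g: 90, b: 90, cluster: 0 }];
    clusters[0].add(&pixels[0]);
    iterate(&mut pixels, &mut clusters);
    pixels[0].cluster
}

fn main() { println!("pixel ends in cluster {}", demo()); }
```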

It seems to me that iterating over the pixels is the best place to introduce threading. Rayon works well.
Problem 1: how do I associate a Cluster with a Pixel? I could imagine using a HashMap, but that would need to be updated by multiple threads.

Problem 2: how do I update the Cluster returned from the HashMap? The Cluster will be accessed by multiple threads.

The simplest way may be to create a ‘gate’ (mutex?) at the point where a Pixel needs to reference and update the Clusters. This retains the parallelism for the distance calculations, but loses it for the updates, and the bottleneck may slow the algorithm down.

Alternatively, the updates could be sent down a channel and handled by multiple threads. There could, however, be an update for each pixel (certainly during the earlier iterations). Would this amount of allocation/de-allocation kill the performance?

As soon as I add a mutex or channel I need per-thread storage of the cloned values, so do I lose the ability to use Rayon?

I would appreciate the views of experienced Rustaceans on which approach to take.

Thanks

[Edit]

Thanks to the comments in this thread, and some experimentation I have now written the clustering algorithm. I am pleased to say that on large images (eg 5472 x 3648 pixels) it is orders of magnitude faster than the Java or Kotlin equivalent. There may be opportunities to improve the speed, but I cannot spot them at the moment.
I have also learnt a bit about Rust.

[/Edit]

A very common pattern in Rust (in linked lists, for example) is to store your items in a Vec and only keep track of their index in that Vec. That way, lookups are O(1) and each item can be gated (RwLock, Mutex, or just an atomic), not the entire Vec.
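That arena-of-indices idea might look something like this minimal sketch (the Cluster layout and field names are my assumptions, not the poster's code):

```rust
use std::sync::Mutex;

// Clusters live in one Vec; pixels refer to them by index, not by reference.
// A tiny (sum, count) pair stands in for the real five running totals.
struct Cluster { total: Mutex<(u64, u64)> }

fn cluster_index_demo() -> u64 {
    let clusters: Vec<Cluster> =
        (0..4).map(|_| Cluster { total: Mutex::new((0, 0)) }).collect();

    // A pixel only stores a usize, so the lookup is O(1), and only the
    // one cluster it points at is locked -- never the whole Vec.
    let pixel_cluster: usize = 2;
    let mut guard = clusters[pixel_cluster].total.lock().unwrap();
    guard.0 += 10; // add the pixel's value to the running sum
    guard.1 += 1;  // and bump the count
    guard.0
}

fn main() { println!("sum in cluster 2: {}", cluster_index_demo()); }
```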

This reminds me a lot of the Wild linker talk by David Lattimore (2:34:00)

Problem 2: how do I update the Cluster returned from the HashMap? The Cluster will be accessed by multiple threads.

Do you realistically expect multiple threads to attempt access at the same time, and (especially) to the same values? If so, they will spend a lot of time waiting for that mutex. If you expect them to access different values, put an RwLock on the cluster and keep each value under an atomic (if possible). That way multiple 'readers' can modify the cluster.
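One way to read this suggestion as code (a sketch with an assumed field layout, two fields standing in for all six): threads holding the cheap read lock can still mutate the atomics concurrently, while the write lock is reserved for structural changes.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::RwLock;

#[derive(Default)]
struct Cluster {
    red: AtomicU32,   // running total of red
    count: AtomicU32, // number of pixels in the cluster
}

fn atomic_update_demo() -> u32 {
    let cluster = RwLock::new(Cluster::default());

    // Many threads can hold the read lock at once, yet each can still
    // update the totals, because the fields themselves are atomic.
    let guard = cluster.read().unwrap();
    guard.red.fetch_add(200, Ordering::Relaxed);
    guard.count.fetch_add(1, Ordering::Relaxed);
    guard.red.load(Ordering::Relaxed)
}

fn main() { println!("red total: {}", atomic_update_demo()); }
```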

Great points, thanks. I need to think through the pros and cons…..

Good logical thinking. I need to record the original and final cluster for each pixel, as I need to decrement the counts when a pixel leaves a cluster.

‘Find nearest cluster’ is one bottleneck, though this is reduced in a couple of ways:

  • each cluster has a list of neighbours and only these are searched for closest.
  • distance is ‘taxi cab’ distance, so only involves addition and comparison, not squares and square roots.
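The taxi-cab (L1) distance in five dimensions is just a sum of absolute differences, for example:

```rust
// Taxi-cab (L1) distance in (x, y, r, g, b) space: only subtraction,
// abs() and addition -- no squares or square roots.
fn taxi_cab(a: (i32, i32, i32, i32, i32), b: (i32, i32, i32, i32, i32)) -> i32 {
    (a.0 - b.0).abs() + (a.1 - b.1).abs() + (a.2 - b.2).abs()
        + (a.3 - b.3).abs() + (a.4 - b.4).abs()
}

fn main() {
    println!("{}", taxi_cab((0, 0, 0, 0, 0), (1, 2, 3, 4, 5)));
}
```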

The other bottleneck will be scanning lots of pixels to update the clusters (some of these images are 40 million pixels).

Each write to the cluster will either increment or decrement all the values, so individual atomics may not be the way to go.

Fair. But the point was that atomics are much faster than a mutex.

For sure, and this sounds like a good idea, but where do I store the association between a pixel and a cluster?

In Kotlin I used an array the same size as the image in pixels, so I could use the pixel offset (x, y) to calculate the array offset. In Rust I may create an array of tuples (or tuple structs) with one element per pixel. Each element could store x, y, r, g, b, current cluster, new cluster. Are arrays allocated as one memory block? That would be preferable to the millions of allocs that I would have had in Kotlin with this approach.

I need to look at what you mean by an RwLock. Is this a way of enforcing a single thread through a block? From your comments it would seem not. I need to do some more reading…

That's because the arena (the Vec) is only half of a linked-list application. You need an additional mapping (e.g., PixelId -> ClusterId, via HashMap) to connect pixels to entries in the arena.
To be clear, this is not the only way to implement a linked list (and you are not building one anyway). But the arena + map feels appropriate here.

By RwLock, I mean std::sync::RwLock from the standard library.

Edit: A Rust Vec is a single allocation. We often use Vec::with_capacity or Vec::resize_with (to actually fill the Vec, as with_capacity leaves the len at 0) to avoid re-allocs if we know the size beforehand.
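For example, a small sketch of the difference between the two:

```rust
fn vec_alloc_demo() -> (usize, usize) {
    // with_capacity reserves the whole block up front, but the Vec is
    // still empty: len() is 0.
    let reserved: Vec<u32> = Vec::with_capacity(1_000);
    let reserved_len = reserved.len();

    // resize_with actually fills the Vec, still within that single
    // allocation -- no re-allocs as elements are added.
    let mut filled: Vec<u32> = Vec::with_capacity(1_000);
    filled.resize_with(1_000, || 0);
    (reserved_len, filled.len())
}

fn main() {
    let (reserved_len, filled_len) = vec_alloc_demo();
    println!("reserved len: {}, filled len: {}", reserved_len, filled_len);
}
```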

Thanks, that looks useful. If I put an RwLock effectively around the whole cluster then I could update all the cluster values within it(?). This sounds similar in effect to a ‘synchronized’ method in Java/Kotlin.

It's basically a mutex that allows multiple readers but only one writer at a time. Instead of a single .lock() you have .read() and .write().

In my previous comment I was talking about RwLock with atomics, but if you have all your inner fields as atomics, that makes the RwLock (or mutex, for that matter) redundant. It's more helpful if you have many reads and few writes. You could probably also just get it working with a mutex and little complexity as a starting benchmark, and see where things go from there. Premature optimization and all that.

For sure, but this is as much a learning exercise as anything else - I learn best by doing. Rust’s approach to memory management (pay in development time rather than in execution time) lends itself well to this kind of solution….

But the Vec is a container for other things. If I want a vector containing 1000 ‘instances’ of a struct, I think that I would still need to create these and add them to the vector. Lots of allocations like this killed the performance of that approach in Kotlin. It would be nice to be able to do one allocation of all the instances in one go and treat them as a collection. I can’t make out from the docs whether this is what Rust’s arrays do.

No, the Vec can own anything you put in it. The only rule is that all the elements need to be the same concrete type. You create the struct, then move it into the Vec.

As I thought. So if I create a struct for each pixel I could be allocating 20 million of them, which is not going to be fast…

Maybe my statement was misleading. Adding a million structs to a vec doesn't mean you are making a million individual heap allocations then moving them.

It can be optimized a lot, especially in your case, because a pixel struct can likely implement the Copy, Clone and Default traits, which help a lot. That is still a lot of work, but there's only so much you can do. A quick way to zeroize the struct (Pixel::default()) lets you do vec![Pixel::default(); 1_000_000]; pretty efficiently.
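A sketch of that pattern (the Pixel field layout here is an assumption of mine):

```rust
// Deriving Clone, Copy and Default lets the whole buffer be created in
// one contiguous allocation and filled with zeroed elements efficiently,
// rather than via a million individual heap allocations.
#[derive(Clone, Copy, Default, PartialEq, Debug)]
struct Pixel { x: u16, y: u16, r: u8, g: u8, b: u8 }

fn pixel_buffer_demo() -> usize {
    let pixels = vec![Pixel::default(); 1_000_000];
    assert_eq!(pixels[999_999], Pixel::default()); // every element is zeroed
    pixels.len()
}

fn main() { println!("allocated {} pixels", pixel_buffer_demo()); }
```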

Ok, thanks, I will look at this. I was rather hoping I could allocate a contiguous block of memory and treat it as 1_000_000 instances of a struct that only contains fixed-size primitives. That would have been really neat. The way you suggest implies a million initialisations(?).

I appreciate the time that you are spending getting me up to speed on these issues.

This is what I get, using a Mutex to guard the labile data as a separate struct. RwLock is not needed as the reads are all of the cached values in the outer layer.

use std::sync::Mutex;

struct Cluster {
    totals: Mutex<ClusterTotals>,
    x: u32,
    y: u32,
    red: u32,
    green: u32,
    blue: u32
}

struct ClusterTotals {
    x: u32,
    y: u32,
    red: u32,
    green: u32,
    blue: u32,
    count: u32
}

impl ClusterTotals {
    pub fn new() -> ClusterTotals {
        ClusterTotals {
            x: 0,
            y: 0,
            red: 0,
            green: 0,
            blue: 0,
            count: 0
        }
    }
    pub fn add(&mut self, dx: u32, dy: u32, dr: u32, dg: u32, db: u32) {
        self.x += dx;
        self.y += dy;
        self.red += dr;
        self.green += dg;
        self.blue += db;
        self.count += 1;
    }
    pub fn remove(&mut self, dx: u32, dy: u32, dr: u32, dg: u32, db: u32) {
        self.x -= dx;
        self.y -= dy;
        self.red -= dr;
        self.green -= dg;
        self.blue -= db;
        self.count -= 1;
    }
    pub fn x(&self) -> u32 {
        self.x / self.count
    }
    pub fn y(&self) -> u32 {
        self.y / self.count
    }
    pub fn red(&self) -> u32 {
        self.red / self.count
    }
    pub fn green(&self) -> u32 {
        self.green / self.count
    }
    pub fn blue(&self) -> u32 {
        self.blue / self.count
    }
}
impl Cluster {
    pub fn new() -> Cluster {
        Cluster {
            totals: Mutex::new(ClusterTotals::new()),
            x: 0,
            y: 0,
            red: 0,
            green: 0,
            blue: 0,
        }
    }
    
    // The Mutex provides interior mutability, so &self is enough here - and
    // &self is what allows multiple threads to call add/remove on a shared Cluster.
    pub fn add(&self, dx: u32, dy: u32, dr: u32, dg: u32, db: u32) {
        self.totals.lock().expect("Poisoned").add(dx, dy, dr, dg, db);
    }
    pub fn remove(&self, dx: u32, dy: u32, dr: u32, dg: u32, db: u32) {
        self.totals.lock().expect("Poisoned").remove(dx, dy, dr, dg, db);
    }
    pub fn recalculate(&mut self) {
        // cache values of the division - both for speed and logic (x, y, r, g, b should
        // not appear to change during an iteration, only between iterations).
        let total = self.totals.lock().expect("Poisoned");
        if total.count < 1 {
            // Rust has no exceptions; returning a Result (or panicking) would be the
            // idiomatic ways to signal an empty cluster. Here the cached values are kept.
        } else {
            self.x = total.x();
            self.y = total.y();
            self.red = total.red();
            self.green = total.green();
            self.blue = total.blue();
        }
    }
    pub fn x(&self) -> u32 {
        self.x
    }
    pub fn y(&self) -> u32 {
        self.y
    }
    pub fn red(&self) -> u32 {
        self.red
    }
    pub fn green(&self) -> u32 {
        self.green
    }
    pub fn blue(&self) -> u32 {
        self.blue
    }
}
  • Give the clusters a good, old bitmap, which indicates associated pixels.
  • The bitmap is made out of AtomicU64
  • In the distribution loop, don't update the sums, only the bitmap of the clusters.
  • In the average loop, use the bitmap of each cluster to calculate the sums and averages

You can also split the bitmaps according to the strides you assign to each CPU core to get rid of the atomic types.
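Setting and testing a single pixel's bit in such a bitmap might look like this (my sketch of the suggestion, not the commenter's code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// One bit per pixel; 64 pixels per AtomicU64 word. fetch_or sets the bit
// atomically, so multiple threads can mark pixels without a lock.
fn set_bit(bitmap: &[AtomicU64], pixel: usize) {
    bitmap[pixel / 64].fetch_or(1u64 << (pixel % 64), Ordering::Relaxed);
}

fn test_bit(bitmap: &[AtomicU64], pixel: usize) -> bool {
    bitmap[pixel / 64].load(Ordering::Relaxed) & (1u64 << (pixel % 64)) != 0
}

fn bitmap_demo() -> (bool, bool) {
    // Enough words for 1024 pixels.
    let bitmap: Vec<AtomicU64> = (0..16).map(|_| AtomicU64::new(0)).collect();
    set_bit(&bitmap, 100);
    (test_bit(&bitmap, 100), test_bit(&bitmap, 101))
}

fn main() { println!("{:?}", bitmap_demo()); }
```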

Hey @mroth, I like the way you think. That is a very good suggestion. You say that the bitmap uses AtomicU64. Does that imply that accessing each bit is atomic? I need to read about this.

The clusters may number in the hundreds or thousands, so the memory burden may be high. Still, memory is cheap and easily recycled.

The only downside is that the average calculation will be a constant cost. In the original algorithm the running totals are only modified when a pixel changes cluster, and as the algorithm progresses the number of changes drops dramatically. To be fair, that is likely to be a small problem. I am not yet proficient in Rust, so the thought of writing competing algorithms to test seems a bit daunting.