How does a crossbeam channel work?

I have been wondering how crossbeam channels are implemented, and have been looking at the source here:

It seems like it is using "stamps" to check whether the shared buffer has been written. The stamp is part of the "slot", together with the user data.

/// A slot in a channel.
struct Slot<T> {
    /// The current stamp.
    stamp: AtomicUsize,

    /// The message in this slot.
    msg: UnsafeCell<MaybeUninit<T>>,
}

If my interpretation of what is going on is correct, is there an assumption that the msg is written to shared memory at the same time as the stamp? I was just thinking for a large msg, could there be a risk that the msg has not yet been completely written to shared memory (when it is read)? Or are there some concurrency primitives in here which I haven't noticed/understood which mean this cannot happen?

The concurrency primitive is the AtomicUsize. If you write to an atomic integer with the atomic ordering Release, and a later Acquire-read of that same atomic integer observes the value that was written, then it is guaranteed that all writes that happened before the Release-write are visible to the thread performing the Acquire-read.
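As a minimal sketch of that guarantee, using std's atomics directly (the names DATA and READY are mine, not crossbeam's; the "message" here is just a Relaxed-written integer standing in for the slot contents):

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::thread;

// The "message": written with only Relaxed ordering.
static DATA: AtomicU64 = AtomicU64::new(0);
// The "stamp": published with Release, observed with Acquire.
static READY: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let t = thread::spawn(|| {
        DATA.store(42, Ordering::Relaxed);  // happens before the Release-write below
        READY.store(1, Ordering::Release);  // publish: makes the write above visible
    });
    // Spin until the Acquire-load observes the Release-store.
    while READY.load(Ordering::Acquire) == 0 {
        std::hint::spin_loop();
    }
    // Because the Acquire-load saw the Release-store, the Relaxed write
    // to DATA is guaranteed to be visible here.
    assert_eq!(DATA.load(Ordering::Relaxed), 42);
    t.join().unwrap();
    println!("saw {}", DATA.load(Ordering::Relaxed));
}
```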


By all writes, does that mean writes to any memory location, any variable? My understanding ( which may well be wrong ) was that it is only the variable in question ( say the target of a compare/exchange primitive, or in this case "stamp" ) which is protected.

It depends on the atomic ordering you use, but yes, with the orderings I described, "all writes" really means every write to any memory location in the process that happened before the Release-write. It also constrains compiler optimizations: the optimizer cannot move an earlier write forwards past a Release-store, and similarly it cannot move a later read backwards past an Acquire-load.

If you don't care about other data, you can use the Relaxed atomic ordering.
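For example, a plain event counter that is not used to publish any other data can use Relaxed throughout (a hypothetical sketch, not from crossbeam):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

static COUNT: AtomicUsize = AtomicUsize::new(0);

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|_| {
            thread::spawn(|| {
                for _ in 0..1000 {
                    // Relaxed is enough here: we only care about the counter
                    // itself, not about ordering relative to other memory.
                    COUNT.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    // fetch_add is still atomic, so no increments are lost.
    assert_eq!(COUNT.load(Ordering::Relaxed), 4000);
    println!("{}", COUNT.load(Ordering::Relaxed));
}
```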


Ok, thanks for explaining that. What I am now wondering about is the expense of having the stamp variable in the array of slots. I guess there is a good reason for this, to somehow enable the channel to run faster, with less locking, but I cannot yet quite see the idea. If anyone can explain it, I would be interested!

You mean, compared to putting the stamps in another vector next to the messages? Or something else?

I mean compared to not having the stamps at all.

I am not familiar with how the Slot type fits in with the rest of crossbeam, but you're going to need somewhere to store whether the slot contains a message, and you are going to need an atomic integer somewhere to properly synchronize access to the data.

Right. But if the channel can buffer 1,000 messages (say), is it necessary to have 1,000 atomic integers, rather than say one or two. Apparently, yes, perhaps for efficiency, but I don't understand why yet.

Edit: after more thought, I guess it relates to "cache line ping pong", as described for example here:

I plan to look more carefully at the code and what is happening to try and understand it better.

As a partial explanation of my understanding of why the slots have the extra stamp field: when threads are communicating, you want to avoid ( if possible ) execution stalling because memory needs to be synchronised, that is, core memory caches being flushed or reloaded. In other words, you do not want a single memory location that is a bottleneck.

For a channel, you want the writer ( or writers ), to be writing to one place, and the reader ( or readers ) to be reading "somewhere else", rather than having single memory locations ( say "head" and "tail" ) which are a bottleneck.
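To illustrate the per-slot stamp idea, here is a rough, simplified sketch of a bounded queue in the style of a Vyukov-type bounded MPMC queue, which, as far as I know, is what crossbeam's array channel is based on. All names are mine, the synchronisation is deliberately simplified (single producer, single consumer, plain stores on head/tail instead of CAS), and it is not crossbeam's actual code. The point it shows: each slot's stamp tells a writer "this slot is free" and a reader "this slot holds a message", so readers and writers mostly touch different slots rather than contending on one shared location:

```rust
use std::cell::UnsafeCell;
use std::mem::MaybeUninit;
use std::sync::atomic::{AtomicUsize, Ordering};

// Simplified sketch; field names borrowed from the crossbeam excerpt above.
struct Slot<T> {
    stamp: AtomicUsize,
    msg: UnsafeCell<MaybeUninit<T>>,
}

struct Queue<T> {
    slots: Box<[Slot<T>]>,
    head: AtomicUsize, // next slot to read
    tail: AtomicUsize, // next slot to write
}

impl<T> Queue<T> {
    // Note: this simplified stamp scheme needs cap >= 2.
    fn new(cap: usize) -> Self {
        let slots = (0..cap)
            .map(|i| Slot {
                // Slot i starts with stamp i, meaning "empty, awaiting write i".
                stamp: AtomicUsize::new(i),
                msg: UnsafeCell::new(MaybeUninit::uninit()),
            })
            .collect();
        Queue { slots, head: AtomicUsize::new(0), tail: AtomicUsize::new(0) }
    }

    fn push(&self, value: T) -> Result<(), T> {
        let tail = self.tail.load(Ordering::Relaxed);
        let slot = &self.slots[tail % self.slots.len()];
        // The slot is free for this write only if its stamp equals the tail index.
        if slot.stamp.load(Ordering::Acquire) != tail {
            return Err(value); // slot still occupied: queue full
        }
        unsafe { (*slot.msg.get()).write(value) };
        // Release-store the stamp, so a reader that sees it also sees the msg.
        slot.stamp.store(tail + 1, Ordering::Release);
        self.tail.store(tail + 1, Ordering::Relaxed);
        Ok(())
    }

    fn pop(&self) -> Option<T> {
        let head = self.head.load(Ordering::Relaxed);
        let slot = &self.slots[head % self.slots.len()];
        // The slot holds a message only if its stamp is one past the head index.
        if slot.stamp.load(Ordering::Acquire) != head + 1 {
            return None; // not yet written: queue empty
        }
        let value = unsafe { (*slot.msg.get()).assume_init_read() };
        // Mark the slot free for the writer's next lap around the buffer.
        slot.stamp.store(head + self.slots.len(), Ordering::Release);
        self.head.store(head + 1, Ordering::Relaxed);
        Some(value)
    }
}

fn main() {
    let q = Queue::new(2);
    assert!(q.push(1).is_ok());
    assert!(q.push(2).is_ok());
    assert!(q.push(3).is_err()); // full: slot 0's stamp says "still occupied"
    assert_eq!(q.pop(), Some(1));
    assert!(q.push(3).is_ok()); // slot 0 freed, next lap continues
    assert_eq!(q.pop(), Some(2));
    assert_eq!(q.pop(), Some(3));
    assert_eq!(q.pop(), None);
}
```

Note how push and pop each synchronize on the stamp of the particular slot they are using; head and tail are only read and written by their own side, so there is no single hot location that every operation must fight over.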

Incidentally, it's important when using a bounded channel ( that is a channel with a finite, pre-allocated number of slots ) to make sure it is large enough to allow your threads to run in parallel. I became aware of this simply by playing around with the size of my bounded crossbeam channel, and then starting to wonder why it had such a significant effect on processing speed.

Of course, once you realise it, it becomes obvious why the threads cannot proceed in parallel if the channel is not big enough, but it's easy ( for me at any rate ) to miss such considerations at first.

It's also a neat way of finding out how much you are gaining from the parallel processing: simply set the channel size to a small number, and ( for most algorithms ) you find out how fast it would run on a single core.
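For instance, with std's bounded sync_channel standing in for a crossbeam bounded channel ( the blocking behaviour is the same in this respect ), a capacity of 1 forces the producer to block on almost every send, so the two threads mostly run in lock-step:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

fn main() {
    // Capacity 1: the sender blocks whenever the single buffer slot is full,
    // so producer and consumer cannot get usefully ahead of each other.
    // Try raising the capacity to see the threads overlap more.
    let (tx, rx) = sync_channel::<u64>(1);
    let producer = thread::spawn(move || {
        for i in 0..1000 {
            tx.send(i).unwrap(); // blocks while the slot is occupied
        }
        // tx dropped here, which ends the receiver's iterator
    });
    let sum: u64 = rx.iter().sum();
    producer.join().unwrap();
    assert_eq!(sum, (0..1000u64).sum::<u64>());
    println!("{}", sum);
}
```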

