Fastest way of communication between threads of the same program?

Assume:

  1. one machine, possibly multiple CPUs
  2. x86_64 Linux + Rust
  3. all messages are 32 bytes
  4. we are willing to 'waste RAM' on buffers

Taking into account cache flushing behaviour, what is the fastest way for two Rust threads of the same process to communicate?

EDIT: using unsafe is fine

Kinda broad - what kind of access?

Regular memory, of course, but you probably need something to lock/spin on to protect it from being corrupted. More detail is necessary.

The broadest answer is probably one of the channels or MPMC queues from the crossbeam crate. You can sculpt something faster if you know what you are doing and what you are doing it to.

3 Likes

There's also the question of whether you want low latency or high throughput.

11 Likes

I'm no expert, but from what I've read: for what you're asking, you probably want each thread to spin on a circular buffer controlled by atomic read and write pointers. Each buffer should only ever connect one thread to one other thread, to prevent contention on the pointers (and the pointers should be on separate cache lines, but I don't have much guidance there).
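
A minimal sketch of that shape, assuming exactly one producer thread and one consumer thread (the SpscRing and Msg names, the CAPACITY, and the cache-line padding are all illustrative, not from any particular crate):

use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicUsize, Ordering};

const CAPACITY: usize = 1024; // must be a power of two for the wrapping math below

// One 32-byte message, as in the original question.
#[derive(Clone, Copy)]
struct Msg([u8; 32]);

// Keep each index on its own cache line so producer and consumer
// don't false-share.
#[repr(align(64))]
struct PaddedIndex(AtomicUsize);

struct SpscRing {
    buf: Vec<UnsafeCell<Msg>>,
    head: PaddedIndex, // advanced only by the consumer
    tail: PaddedIndex, // advanced only by the producer
}

// Sound only under the single-producer/single-consumer discipline.
unsafe impl Sync for SpscRing {}

impl SpscRing {
    fn new() -> Self {
        SpscRing {
            buf: (0..CAPACITY).map(|_| UnsafeCell::new(Msg([0; 32]))).collect(),
            head: PaddedIndex(AtomicUsize::new(0)),
            tail: PaddedIndex(AtomicUsize::new(0)),
        }
    }

    // Producer side: returns false if the ring is full.
    fn try_push(&self, msg: Msg) -> bool {
        let tail = self.tail.0.load(Ordering::Relaxed);
        let head = self.head.0.load(Ordering::Acquire);
        if tail.wrapping_sub(head) == CAPACITY {
            return false; // full
        }
        unsafe { *self.buf[tail % CAPACITY].get() = msg };
        self.tail.0.store(tail.wrapping_add(1), Ordering::Release);
        true
    }

    // Consumer side: returns None if the ring is empty.
    fn try_pop(&self) -> Option<Msg> {
        let head = self.head.0.load(Ordering::Relaxed);
        let tail = self.tail.0.load(Ordering::Acquire);
        if head == tail {
            return None; // empty
        }
        let msg = unsafe { *self.buf[head % CAPACITY].get() };
        self.head.0.store(head.wrapping_add(1), Ordering::Release);
        Some(msg)
    }
}

As far as I know, the rtrb crate is a polished version of this same shape, if you'd rather not hand-roll it.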

This gets you potentially sub-microsecond latency, for whatever that's worth on a non-real-time OS.

I will say though: it's not that likely to be useful to be this aggressive, but sure, give it a try.

+1 on spinning, AKA busy waiting (maybe with a timeout to fall back to blocking, so as to not burn CPU time when there's no data left, assuming the latency hit for the thread to become runnable again is acceptable. The Erlang VM has a flag, +sbwt, to configure the max busy-wait time for its scheduler threads for this reason.)
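
A sketch of that spin-then-fall-back receive loop over a plain std::sync::mpsc channel (the iteration count and timeout are arbitrary, not tuned):

use std::sync::mpsc;
use std::time::Duration;

// Spin on try_recv for a bounded number of iterations, then fall back
// to a blocking recv_timeout so an idle consumer doesn't burn a core.
fn recv_spin_then_block<T>(rx: &mpsc::Receiver<T>) -> Option<T> {
    for _ in 0..10_000 {
        if let Ok(msg) = rx.try_recv() {
            return Some(msg);
        }
        std::hint::spin_loop(); // tells the CPU we're in a spin loop
    }
    rx.recv_timeout(Duration::from_millis(10)).ok()
}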

I might try to keep the two communicating threads on the same core by pinning them, to avoid IPIs, and pick which thread handles a task by hashing a user ID or some other input, so that for each user you always choose threads on the same core. Assuming you do that, there might be some way you could avoid the LOCK instruction prefix altogether, though it probably doesn't matter if there's no actual contention.
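
The pinning part might look like this, assuming the core_affinity crate (the choice of cores[0] is arbitrary):

// Pin two communicating threads to the same core.
fn main() {
    let cores = core_affinity::get_core_ids().expect("couldn't enumerate cores");
    let core = cores[0];

    let producer = std::thread::spawn(move || {
        core_affinity::set_for_current(core);
        // ... produce messages ...
    });
    let consumer = std::thread::spawn(move || {
        core_affinity::set_for_current(core);
        // ... consume messages ...
    });

    producer.join().unwrap();
    consumer.join().unwrap();
}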

1 Like

I don't understand this question. Can you list the options?

Great question. Let us say, for every k in {10 milliseconds, 1 millisecond, 100 microseconds}: how do we achieve the highest throughput if we want 99% of msgs to have latency <= k?

This sounds like the direction I am interested in. Do you have pointers to Rust crates / further reading?

I am trying to use all the cores/threads on a multi-CPU machine. Some threads are on the same core, but some threads will be on different cores. If we have, say, 4 cores and pick a random pair of threads, they'll be on the same core with only 25% probability.

This is what I mean by hashing some key of the input to pick two threads on the same core if you can. Let's say, for example, the communication between the two threads is always between tasks associated with the same user ID. If you hash the user ID to pick the thread or core number, then the communication won't leave the core.
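
Concretely, something like this (the worker_for name and the modulo scheme are just for illustration):

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Map a user ID onto one of num_workers worker threads, so all traffic
// for that user stays on the same worker (and, if the workers are
// pinned, on the same core).
fn worker_for(user_id: u64, num_workers: usize) -> usize {
    let mut h = DefaultHasher::new();
    user_id.hash(&mut h);
    (h.finish() as usize) % num_workers
}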

Similarly, some network interfaces are capable of hashing (source IP, source port) pairs to pick the CPU to interrupt so that all packets for a particular source land in a queue on the same CPU to obviate extra synchronization.

Can you even do an unlocked cmpxchg on Intel? At most it would cut down on some MESI traffic, but you would be giving up parallel execution, since you'd have to be on the same hyperthread, I think (the same physical core can still have issues, since even atomic ops are not really atomic at the uop level if you look at the entire pipeline). So yes, you don't need concurrency primitives if you are running single threaded. And MESI traffic is probably the least of the worries, because you just traded it for a lot more instructions and maybe a cache miss.

It really really absolutely positively depends on what he is doing.

I don't understand this question. Can you list the options?

Are you writing out single values from one thread to be read by another (like ints)? Are they structured values? How many producers? How many consumers? Are these values written once or changed (e.g. sharing a tree or matrix with changing values)? How often do you write out a value? Do readers ever get behind and need to fast-forward?

I took a quick look on crates.io to see if there was anything particularly aggressive in this way and didn't find anything - in general you are going to see Condvars/Mutexes that put the thread to sleep when the buffer is empty, as otherwise you are always using 100% CPU, which is a high price to pay for near (but not guaranteed) real-time latency. Pretty much only audio really needs that kind of consistently low latency.

If what you are doing is a batch work pool or some sort of server, wake-up latency on the order of a millisecond is generally fine, since you are likely to have more requests to process by the time you finish the first one after the wakeup. If you are in this normal case, then just using the higher-level crossbeam / rayon etc. libraries is near certainly going to be fine; they don't have to leave much, if anything, on the table in Rust.
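
For reference, the crossbeam version of the whole exercise is just a bounded channel (bounded/send/recv are the crate's actual API; the capacity here is arbitrary):

use crossbeam_channel::bounded;

fn main() {
    // A bounded channel of 32-byte messages; crossbeam handles the
    // spinning/parking internally.
    let (tx, rx) = bounded::<[u8; 32]>(1024);

    let consumer = std::thread::spawn(move || {
        while let Ok(msg) = rx.recv() {
            let _ = msg; // ... handle the 32-byte message ...
        }
    });

    for i in 0..100u8 {
        tx.send([i; 32]).unwrap();
    }
    drop(tx); // disconnecting ends the consumer's recv loop

    consumer.join().unwrap();
}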

If you do want to give the spinlock code a spin (ha.), I threw together a pretty garbage impl here: Rust Playground

This has a bunch of problems like leaking items in progress if both sides drop, but it worked the first time it compiled, so I'm counting it as a win.

Some crossbeam impls do spin, I thought. If he just wants to publish small structs and only the most recent one matters, he can just keep setting a pointer and it will work fine on Intel - you don't even need atomics. If he needs to consume a list, he can get away with just CASing a linked list. And it goes up from there, I guess.
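
The "CASing a linked list" option is roughly a Treiber stack. A push-only sketch (popping safely is the hard part - it needs a memory-reclamation scheme such as the one in crossbeam-epoch - so it's omitted here):

use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    value: T,
    next: *mut Node<T>,
}

struct Stack<T> {
    head: AtomicPtr<Node<T>>,
}

impl<T> Stack<T> {
    fn new() -> Self {
        Stack { head: AtomicPtr::new(ptr::null_mut()) }
    }

    fn push(&self, value: T) {
        let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
        loop {
            let head = self.head.load(Ordering::Relaxed);
            unsafe { (*node).next = head };
            // CAS the new node in as the head; retry if another thread won.
            if self
                .head
                .compare_exchange(head, node, Ordering::Release, Ordering::Relaxed)
                .is_ok()
            {
                break;
            }
        }
    }
}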

As far as I know, Rust doesn’t delegate any of its VM specification to the underlying architecture: you can’t assume that storing into a pointer-sized value in Rust acts exactly the same as a MOV assembly instruction. There may be compiler optimizations, for instance, that rely on knowing all potential synchronization points between threads.

If you want to rely on architecture-specific behavior like this, I think you’ll need to do it via inline assembly (asm!). That flags to the compiler that you care about the underlying semantics in a way that doesn’t necessarily transfer to higher-level Rust code.

On the other hand, Rust’s atomics take advantage of this particular Intel behavior; they generally compile down to simple moves in the appropriate cases.

4 Likes

If Rust is going to break the underlying model of the machine, it should be very explicit about that. Making assignment no longer atomic would be an enormous issue for performance work. Rust should really make up its mind what it aims for and where its priorities are. Does hypothetically running on architectures I'll never touch really trump real performance gains? Put it in an unsafe block, fine, but I'm sure asm blocks have their own optimization issues too.

Rust is quite explicit about it. It is the first item on the list of behavior considered undefined. It is also how C/C++ works.

You can use relaxed atomic operations on Intel to get the behavior I assume you are after. Relaxed loads/stores compile down to simple mov instructions.
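
For example (std atomics; on x86_64 both functions compile to a plain mov, but unlike a raw pointer write the compiler now knows the location is shared and won't optimize the accesses away):

use std::sync::atomic::{AtomicUsize, Ordering};

static SHARED: AtomicUsize = AtomicUsize::new(0);

fn publish(v: usize) {
    SHARED.store(v, Ordering::Relaxed); // a single mov on x86_64
}

fn read_latest() -> usize {
    SHARED.load(Ordering::Relaxed) // also a single mov
}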

6 Likes

Assignment of a pointer on Intel isn't a data race. It would only be one if Rust defined it that way, and at that point you can call anything whatever you want, because you're just defining yourself correct.

Like I said, if you are going to break the machine guarantees, be really up front about it. Rust is just a means to an end. And you should tell users why you are breaking the arch. A blanket "we might do something" isn't really meaningful for real-world decision making. Give users the chance to understand the issue and the reasoning behind it.

There's no UB in assembly, but C/C++/Rust have it. For example, consider this trivial C code.

int send_and_recv(int *shared, int req) {
    *shared = req;
    for (int i = 0; i < 10; i++); // wait a bit
    return *shared;
}

int which_compiles_into(int *shared, int req) {
    *shared = req;
    return req;
}

The send_and_recv function tries to send a value to another thread by writing it through the shared pointer, waiting a few cycles, and then returning the received value by reading the shared pointer back. But a data race is UB in C, which allows the compiler to compile the two functions into identical machine code.

2 Likes

Where's the race/UB? Those are the same.

Both functions compile to just returning req, which could be wrong if another thread has modified *shared in the meantime.

That's just not understanding your computer and compiler. It's a strawman argument - there's no reason to assume those could be or wouldn't be the same. Interleaved instructions can clearly interfere with that. This is completely different - it's going the other way. That is adding guarantees not supported by the computer, and I'm talking about taking them away.

Also, it's a little weird to drop into a discussion where UB was fine to tell everybody not to do UB because we might decide to take it away from you someday. That's a little too paternalistic.

A data race is when more than one thread accesses the same memory location and at least one of them is modifying it. This code is wrong if another thread modifies the shared pointer as expected. But since it's a data race, the compiler is allowed to be wrong in such a case. By default the compiler assumes every modification to memory is not observable by other threads. You can tell the compiler that "this operation is intended to be observable by other threads / to observe modifications from other threads" using the language's atomic constructs.
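
The Rust analogue with the race fixed (Relaxed is enough here; it still compiles to plain movs on x86_64, but it tells the compiler the location is shared, so the load can no longer be folded into return req):

use std::sync::atomic::{AtomicI32, Ordering};

fn send_and_recv(shared: &AtomicI32, req: i32) -> i32 {
    shared.store(req, Ordering::Relaxed);
    std::thread::yield_now(); // wait a bit
    shared.load(Ordering::Relaxed) // may legitimately differ from req
}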

1 Like