There's been an ongoing debate in the Go world for a while about supporting "shardable" values. The fundamental observation is that some data structures can benefit enormously from having core-local state. One example is a better RwLock that keeps one reader lock per core: a reader takes the lock for the core it's running on (which should basically never be contended), while a writer takes all the locks. This significantly increases reader scalability at some cost to write-lock acquisition time. Distributed counters are another example where significant multi-core performance gains can be achieved.
The original petition to the Go authors was for an easy way to tell what core a thread is running on. This information can be retrieved through the CPUID (e.g., with raw-cpuid or cpuid) or RDTSCP instructions, but some kind of indirection that allows getting it on all targets would be nice. The Go developers did not want to expose this kind of information through their runtime (see first post in the linked proposal), but instead want to expose a notion of a sharded value. I won't go into the details of the design here (see the link), but I think Rust could benefit from having a similar kind of primitive; probably in the form of thread::core_id() (with a corresponding thread::max_core_id()). We could then build an external crate on top of that that provides a sharded value similar to the proposed Go one.
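To make that a bit more concrete, here's a rough sketch of what such a crate might look like. core_id() and max_core_id() below are hypothetical stand-ins for the proposed std functions (today you'd have to back them with CPUID/RDTSCP or an OS-specific call), and the Mutex-per-shard layout is just the simplest thing that shows the local/global split, not a real design:

```rust
use std::sync::Mutex;

// Hypothetical stand-ins for the proposed std APIs; today these would have to
// be backed by CPUID/RDTSCP or an OS-specific call.
fn core_id() -> usize { 0 /* placeholder: always claim core 0 */ }
fn max_core_id() -> usize { 7 /* placeholder: pretend we have 8 cores */ }

/// One slot per core. Normal operations only touch the slot for the current
/// core; a "global" operation visits every slot.
pub struct Sharded<T> {
    shards: Vec<Mutex<T>>, // cache-line padding omitted for brevity
}

impl<T: Default> Sharded<T> {
    pub fn new() -> Self {
        Sharded {
            shards: (0..=max_core_id()).map(|_| Mutex::new(T::default())).collect(),
        }
    }

    /// Fast path: operate on the shard of the core we are (probably) running
    /// on. A stale core ID is harmless here; it only costs some contention.
    pub fn with_local<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        let mut shard = self.shards[core_id()].lock().unwrap();
        f(&mut *shard)
    }

    /// Slow path: visit every shard, e.g. to sum counters or to take a
    /// "write lock" across all of them.
    pub fn for_each(&self, mut f: impl FnMut(&mut T)) {
        for shard in &self.shards {
            f(&mut *shard.lock().unwrap());
        }
    }
}
```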
I'd be curious to hear whether such a primitive would be useful to others with performance-critical applications?
One problem with core-local state, as far as I see it, is that neither Rust nor Go are actually able to control on which CPU core an OS thread will be executing. Unless some pinning trickery is used, the OS scheduler can decide to move a thread from one CPU core to another whenever it feels like it. In practice, this means that the result of RDTSCP/CPUID can be invalidated right after being read, so even the "naive" approach to core-local state that is described above seems vulnerable to a race condition.
From this perspective, a thread-local approach would seem more robust, if harder to make efficient (especially in Rust, where threads are spawned and controlled by the user, not by the runtime like in Go). It should be noted that it is much easier to make thread-local state efficient when operating with a thread pool, where the number of threads is constant and potentially known in advance; perhaps that specific approach should be explored further?
The observation from past work on this is that core-local state is still advantageous even if you sometimes get the wrong core ID. Think of the counter example: incrementing another core's counter atomically is possible; it's just more expensive. What matters is that most operations operate on an uncontended value. It also helps that the OS generally tries to maintain affinity between threads and cores (e.g., for cache effects), so you actually observe less switching than you might anticipate.
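As a concrete illustration, here's a minimal sketch of a distributed counter where the shard hint is just that, a hint: a stale core ID only costs a bounced cache line, never correctness. The names and layout are mine, not an existing API:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Give each slot its own cache line so two cores never share one (64 bytes is
// an assumption; it's the common line size on x86).
#[repr(align(64))]
struct Slot(AtomicU64);

pub struct DistributedCounter {
    slots: Box<[Slot]>,
}

impl DistributedCounter {
    pub fn new(n_slots: usize) -> Self {
        DistributedCounter {
            slots: (0..n_slots).map(|_| Slot(AtomicU64::new(0))).collect(),
        }
    }

    /// `hint` is whatever core/shard ID we managed to obtain. If it is stale,
    /// we merely increment a slot that some other core prefers, which is
    /// slower (the cache line bounces) but still correct: every slot is atomic.
    pub fn add(&self, hint: usize, n: u64) {
        self.slots[hint % self.slots.len()].0.fetch_add(n, Ordering::Relaxed);
    }

    /// The rare, expensive path: sum every slot.
    pub fn sum(&self) -> u64 {
        self.slots.iter().map(|s| s.0.load(Ordering::Relaxed)).sum()
    }
}
```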
You're totally right though that thread-local state is often the way to go. Unfortunately, that doesn't work so well for cases where you occasionally need to access all the state (e.g., with a distributed RwLock), because you have no easy way of getting at the thread-local state. I suppose you could make one lock per thread and wrap them all in an Arc so they can also be stored somewhere central, especially in a thread pool context, but you also want to keep that number of locks as small as you can (ideally equal to the number of cores).
In this case, you need every operation on the sharded state to be atomic, otherwise a context switch in the wrong place will lead a thread to read garbled, half-written state. The problem is that atomic operations can be very expensive: we're talking about 50–100 CPU cycles per uncontended read-modify-write instruction.
The advantage of using thread-local state is that data race free access to the "local" state is guaranteed by construction, and expensive synchronization is only needed when you need to access another thread's "remote" state. In the local case, you can be almost as efficient as a sequential program would be.
In practice, a well-designed multi-threaded application generally uses ~1 thread per CPU core or hardware thread, so the memory usage advantage of core-local vs thread-local is not that big.
I agree that thread-local makes cross-thread synchronization more tricky and expensive, but the point of sharding values is to make this synchronization as rare as possible. In this case, making access to the "local" values as cheap as in the sequential case is a clear win, even if it makes "remote" accesses a bit more expensive.
That's a fair point. I guess this is more an argument for having a primitive for sharding values than for having it specifically be per-core. Manually setting up a value that's sharded across all threads in a thread pool, including the machinery for accessing all the shards jointly, is a bit of a pain.
I think there's also an argument for having core-local data (or at least for having a mechanism for getting at a core id) in that not all applications use a thread pool in the first place. It's true that this is a common application pattern, but it's not the only one (not even the only reasonable one) depending on the use-case. But I concede that thread-local is probably often the way to go.
This is not the same thing as what you're proposing in this thread, but the seastar C++ framework has a notion of a sharded service. The shards are connected (and communicate, as infrequently as possible) via lock-free queues, and seastar itself pins threads to cpus so the whole setup is orchestrated. The reduced communication and memory isolation between the sharded services is key: cache coherence costs need to be kept to a minimum, which becomes increasingly important as core counts grow. So even if you had per-cpu state that's mostly uncontended, a writer that comes along (in the rwlock scenario) and locks across all cpus will necessarily invalidate the caches on all remote cpus and will require RFO traffic on the interconnects. That's beside the fact that you don't want to lock at all if you're after high performance, especially if you're considering per-cpu structures.
Thought you might be interested in its design (if you're not already aware of it) while we're on a related topic.
A generic thread-local sharding primitive optimized for write-intensive scenarios with a constant number of threads could perhaps look like this:
There are N objects of some type, where N is user-selected. These objects are aligned on cache line boundaries to avoid false sharing.
There is one RWLock-ish synchronization primitive per object, which is used by other threads to request synchronization. It could be implemented as an atomic bitfield combining a counter and a flag: other "remote" threads increment the counter to notify their intention to read the data, and the "local" thread sets the flag to acknowledge that the data has been committed to RAM and is ready to be read. Once remote threads are done, they decrement the counter.
Each thread gets a numerical identifier, which maps to one of the object + synchronization primitive pairs.
A thread can operate on its local data however it wants, exactly like it would in sequential mode, but should periodically check the counter to see if other threads want to read it. If so, it should commit any pending writes to the local object, then set the "sharing" flag in the bitfield. As long as that flag is set, modifying the data is forbidden. Once the reader count has dropped to zero, the local thread can unset the sharing flag (with a CAS to avoid a race with incoming readers) and resume writing.
That's just a 5-minute synchronization protocol idea though, and finding a nice and user-friendly interface around it would be an interesting project. Also, an interesting limitation of this protocol is that it relies on the local thread writing frequently in order to detect read requests; otherwise readers will stall forever. It's basically a rather extreme reader-writer lock, fully and solely optimized for write performance.
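For what it's worth, here is a rough sketch of what the bitfield handshake alone could look like; the shard's actual data and any blocking/parking are left out, and the names and memory orderings are my first guess rather than a vetted design:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Bit 0: "sharing" flag, set by the owning thread once its writes are published.
// Bits 1..: count of remote readers that have requested access.
const SHARING: u32 = 1;
const READER: u32 = 2;

pub struct ShardSync {
    state: AtomicU32,
}

impl ShardSync {
    pub fn new() -> Self {
        ShardSync { state: AtomicU32::new(0) }
    }

    // --- owning ("local") thread ------------------------------------------

    /// Called periodically between writes to the shard. If readers are
    /// waiting, publish the shard's current state and block further writes
    /// until they are done; returns once it is safe to write again.
    pub fn poll(&self) {
        if (self.state.load(Ordering::Relaxed) >> 1) == 0 {
            return; // nobody is waiting, keep writing
        }
        // Release pairs with the readers' Acquire load of the flag, making
        // all writes to the shard done so far visible to them.
        self.state.fetch_or(SHARING, Ordering::Release);
        loop {
            // Wait until no readers remain (spinning for brevity; a real
            // implementation would park or use a futex here).
            while self.state.load(Ordering::Acquire) != SHARING {
                std::hint::spin_loop();
            }
            // Clear the flag with a CAS so an incoming reader can't be missed.
            if self
                .state
                .compare_exchange(SHARING, 0, Ordering::AcqRel, Ordering::Relaxed)
                .is_ok()
            {
                return; // safe to resume writing to the shard
            }
        }
    }

    // --- remote reader ------------------------------------------------------

    /// Announce interest, wait for the owner to publish, run `read`, leave.
    /// `read` is whatever actually inspects the shard's data.
    pub fn read_remote<R>(&self, read: impl FnOnce() -> R) -> R {
        self.state.fetch_add(READER, Ordering::Relaxed);
        // Acquire pairs with the owner's Release in `poll`.
        while (self.state.load(Ordering::Acquire) & SHARING) == 0 {
            std::hint::spin_loop();
        }
        let result = read();
        self.state.fetch_sub(READER, Ordering::Release);
        result
    }
}
```

The owning thread would call poll() between batches of writes, and a remote thread would wrap its reads of that shard in read_remote().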
Yeah, that's a really interesting idea! I think an eventual implementation should probably provide one sharding mechanism that's optimized for write-intensive scenarios (like the one you describe), and one that's optimized for read-intensive workloads (e.g., a distributed RwLock where reads are very common).
There's another interesting synchronization protocol implementation idea, which leverages IPIs via the sys_membarrier syscall. Seastar also tries to use it.
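For anyone who hasn't run into it, here is a minimal Linux-only sketch of what calling sys_membarrier looks like from Rust (assuming the libc crate; the constants are copied from <linux/membarrier.h>, and the choice of command here is mine). The idea is that the fast path on each thread only needs a compiler fence, while the slow path calls the syscall to make every running thread execute a full memory barrier before it proceeds:

```rust
// Linux-only sketch, assuming the libc crate. MEMBARRIER_CMD_GLOBAL was
// formerly called MEMBARRIER_CMD_SHARED; there is no universal libc wrapper,
// so we go through syscall(2) directly.
#[cfg(target_os = "linux")]
mod membarrier {
    use std::sync::atomic::{compiler_fence, Ordering};

    const MEMBARRIER_CMD_QUERY: libc::c_int = 0;
    const MEMBARRIER_CMD_GLOBAL: libc::c_int = 1;

    fn sys_membarrier(cmd: libc::c_int) -> libc::c_long {
        unsafe { libc::syscall(libc::SYS_membarrier, cmd, 0 as libc::c_uint) }
    }

    /// Check once at startup that the kernel (4.3+) supports the command;
    /// QUERY returns a bitmask of supported commands.
    pub fn supported() -> bool {
        let mask = sys_membarrier(MEMBARRIER_CMD_QUERY);
        mask >= 0 && (mask & MEMBARRIER_CMD_GLOBAL as libc::c_long) != 0
    }

    /// Slow path: make every running thread (system-wide, for this command)
    /// observe a full memory barrier before this call returns.
    pub fn heavy() {
        // Real code must handle failure, e.g. by switching the fast path back
        // to ordinary fences whenever `supported()` is false.
        let _ = sys_membarrier(MEMBARRIER_CMD_GLOBAL);
    }

    /// Fast path: only prevents compiler reordering; the CPU-level ordering
    /// is supplied by some other thread calling `heavy()`.
    pub fn light() {
        compiler_fence(Ordering::SeqCst);
    }
}
```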
Hmmm... That is quite a different synchronization method with different trade-offs, though:
The protocol above is intended for write-intensive workloads, whereas sys_membarrier is aimed at read-intensive workloads.
The protocol above provides local (pairwise) synchronization, sys_membarrier only allows for global synchronization (disturbing all CPUs in the system).
The protocol above can be fully implemented in user-space (only using syscalls when blocking is required), whereas sys_membarrier requires a round-trip through the kernel on every write transaction.
The protocol above is portable to any OS and CPU configuration that supports acquire/release synchronization, whereas sys_membarrier is Linux-specific (possibly x86-oriented too).
Can you clarify what kind of read-intensive workloads you are thinking about? I cannot spontaneously think of a situation where sharding would be a useful optimization on data which is mostly read...
Ah, I wasn’t suggesting it’s a replacement for your specific protocol; it was just a protocol using an interesting (IMO) technique that some may not have heard about.
It looks like the main use-case for such a primitive (system-wide, optimized for synchronizing read-mostly data) would be to speed up user-space RCU, which itself might be one possible answer to @jonhoo's read-intensive scenarios (depending on their precise characteristics)?
The most immediate example I can think of is when you want the threads to cache some piece of state, and only occasionally update that cached state from the master copy. Eventually consistent key-value stores can get fairly large speed-ups from these kinds of tricks. As you observe though, it's unclear that sharding is the right primitive here. I think the use-case is much stronger for something that lets you implement e.g., a scalable read-write lock.
From my understanding, the read caching layer that you describe here does not need any cross-thread communication: the data flow only seems to go in one direction, where each thread fetches from the master copy into its private cache from time to time, and threads never look at each other's cache. From this perspective, this looks like a good fit for un-synchronized thread-local data, and I am not sure a special sharding primitive would be useful there.
Where things would get more funky, however, is if threads were allowed to submit updates through the caching layer (as opposed to directly updating the master copy via some synchronization primitive). This would be necessary for write scalability (if that is a requirement), but would tremendously complicate the architecture, as conflicting writes would now have to be resolved at the time when they are committed from the local caches to the master copy. I'm not sure that write conflict resolution part can be implemented in a generic way.
Do you mean one that scales well to a large number of readers, with infrequent updates? As soon as updates become more frequent, I think that locks (even reader-writer ones), no matter how well-implemented, become somewhat antithetical to scalability.
Yeah, you're right, bad example. I also can't think of a better one, so maybe your initial observation is the right one; that we would only need such a sharding primitive for when there are writes. That said, of course, sometimes reads also do writes (a sharded reader-writer lock is one example where the readers have to take "their" lock).
For read-mostly workloads, there are also ways to do without reader locks altogether, if you can afford the overhead of an occasional copy/fresh write (or can make copies cheap, as in linked data structures).
The general idea is to have a publicly visible pointer to the current version of the data. A writer proceeds by building a modified copy of the data, then CASing it in place of the previous version (a rough sketch follows the list below). At this point, there are two Hard Problems to take care of:
Make sure that CAS is a reliable indicator that the data was not modified. If there are multiple writers and storage can be reused, this is not guaranteed. Various specialized memory management schemes exist to resolve this issue.
Figure out a way to tell when readers are done with the old data. There are multiple schemes for that, which can be broadly split in two categories: those that count readers looking at the data (easy but can get expensive) and those that track when readers go idle (much harder to get right, but may scale more).
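Here is a deliberately simplified sketch of the pointer-publication part, using only std; to stay short it dodges both Hard Problems rather than solving them: old versions are intentionally leaked, which also removes the ABA risk, since no allocation is ever reused while a reader might still hold it. Names are made up for illustration:

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

/// A read-mostly value published through a single pointer: readers just load
/// the pointer, writers build a new version and CAS it in.
pub struct Published<T> {
    current: AtomicPtr<T>,
}

impl<T> Published<T> {
    pub fn new(value: T) -> Self {
        Published {
            current: AtomicPtr::new(Box::into_raw(Box::new(value))),
        }
    }

    /// Wait-free read of the current version.
    pub fn read<R>(&self, f: impl FnOnce(&T) -> R) -> R {
        let p = self.current.load(Ordering::Acquire);
        // Sound only because this sketch never frees old versions.
        f(unsafe { &*p })
    }

    /// Copy, modify, then try to swing the pointer to the new version.
    pub fn update(&self, f: impl Fn(&T) -> T) {
        loop {
            let old = self.current.load(Ordering::Acquire);
            let new = Box::into_raw(Box::new(f(unsafe { &*old })));
            match self
                .current
                .compare_exchange(old, new, Ordering::AcqRel, Ordering::Acquire)
            {
                // `old` is leaked here; real code needs hazard pointers,
                // epochs, RCU, etc. to know when it can be freed.
                Ok(_) => return,
                // Another writer won the race: discard our copy and retry.
                Err(_) => drop(unsafe { Box::from_raw(new) }),
            }
        }
    }
}
```

The closure-based read is only there to keep the borrow local; a real implementation would instead hand out a guard tied to whatever reclamation scheme solves the second problem.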
For the cases where a copy is unaffordably expensive, even accounting for the relative rarity of writes, and where that cost cannot be brought down, I agree that a distributed reader-writer lock can be the right choice.
It comes with its own expenses though (more atomic operations and memory barriers on the reader's happy path, and a very expensive writer path), and it is a relatively narrow use case, so I am not sure the pattern is general enough to warrant a generic implementation. What do you think?