Per-cpu atomic variables?

I'm wondering whether it's possible to use per-cpu variables in user-space Rust, like the Linux kernel uses internally?

I have an application that uses a single AtomicU64 as a counter, and it's becoming a bottleneck, presumably due to cache bouncing between the many cores in production. The counter is updated frequently (on every allocation), but only seldom do we need to find the total, and it doesn't need to be all that accurate. So what I'd love to do would be to have an array of AtomicU64 with one variable per core, spaced out by cache lines. Then each update could go to the variable for the current core (note: not thread) so we would have no cache bouncing. Finding the total would require summing over the array (and the associated cache penalty), and would give an answer that is somewhat unpredictable.
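Concretely, here's a sketch of the layout I have in mind, assuming a 64-byte cache line; the `core` index would come from whatever "which core am I on?" query I'm asking about:

use std::sync::atomic::{AtomicU64, Ordering};

// Pad each counter onto its own cache line (64 bytes assumed).
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

struct PerCoreCounter {
    shards: Vec<PaddedCounter>,
}

impl PerCoreCounter {
    fn new(num_cores: usize) -> Self {
        PerCoreCounter {
            shards: (0..num_cores).map(|_| PaddedCounter(AtomicU64::new(0))).collect(),
        }
    }

    // A stale `core` answer only costs a cache bounce, never correctness,
    // since every slot is an atomic.
    fn increment(&self, core: usize) {
        self.shards[core % self.shards.len()].0.fetch_add(1, Ordering::Relaxed);
    }

    // Pays the cross-core cache penalty once, and may be slightly stale.
    fn total(&self) -> u64 {
        self.shards.iter().map(|c| c.0.load(Ordering::Relaxed)).sum()
    }
}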

I think I could implement all of this if I knew an efficient way to find the index of the currently running core, but I don't know any way to do this. It seems like it must be a common pattern, and I imagine there might be a crate for it, but I don't know the name of the pattern (other than per-cpu variables, which turns up nothing). Any suggestions for how to maintain such a counter cheaply? Thanks!

2 Likes

The OS can reschedule any thread at any point in time to a different cpu core.

Actually, it can't reschedule the thread during an atomic write, can it? Otherwise the atomic write would no longer be atomic, which is why I'd be using an array of AtomicU64 rather than trusting the OS not to reschedule me. The point of checking the current core number would be to avoid cache bounces, not to ensure correctness. And the odds of a context switch happening between checking which cpu is running and accessing the variable are slim enough not to matter for performance.

The trouble is that maybe there isn't a fast way to find which cpu we're running on.

crossbeam::ShardedLock comes to mind as prior art: it aims to avoid slowdown from concurrent access to the same cache line in how it structures its locking information. Here's the source code; it just assigns each thread a number, counting up, and then uses that number modulo the number of shards as an index. Its approach of always using 8 shards might be suboptimal for your purposes, but as long as the number is high enough, you can either rule out cache-line conflicts altogether (if you have a small number of threads) or at least make them much less frequent on average.
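A minimal sketch of that scheme (my own illustration, not crossbeam's actual code): each thread grabs the next number from a global counter on first use and keeps it modulo the shard count:

use std::sync::atomic::{AtomicUsize, Ordering};

const NUM_SHARDS: usize = 8; // crossbeam's choice; tune for your core count

static NEXT_THREAD_NUMBER: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // Each thread's shard index, assigned once on first use.
    static SHARD_INDEX: usize =
        NEXT_THREAD_NUMBER.fetch_add(1, Ordering::Relaxed) % NUM_SHARDS;
}

fn my_shard() -> usize {
    SHARD_INDEX.with(|&i| i)
}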

1 Like

While that is certainly true, it is also unlikely. When possible, a thread is given the chance to run until the end of its time slice or until it blocks on I/O. In other words, both Linux and Windows try to do the least expensive thing whenever possible.

@droundy, for Windows, GetCurrentProcessorNumber looks useful. The last time I used it, it returned zero through N-1, which makes a perfect index. I believe a similar function is available for Linux.
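For illustration, a minimal sketch of calling it, declaring the kernel32 binding directly rather than assuming a particular wrapper crate:

// GetCurrentProcessorNumber takes no arguments and returns the number
// of the processor the calling thread is currently running on.
#[link(name = "kernel32")]
extern "system" {
    fn GetCurrentProcessorNumber() -> u32;
}

fn current_core_index() -> usize {
    unsafe { GetCurrentProcessorNumber() as usize }
}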

You're still faced with where to place the per-processor data so it's cache friendly.

I suspect thread locals would help and be less work. To get totals you would query each thread for its total.

1 Like

Something useful about thread locals is that they are destroyed as usual when the thread exits, so if you create some sort of Counter type that's updated on each allocation, you can wire up its Drop implementation to update your global counter(s).

1 Like

The problem with thread-locals is precisely that they are freed when the thread exits, so I would lose the information. Also I'm not sure how I would sum to find the total over all threads.

...unless you...

Ask them. Or are you not keeping track of your threads?

I'm writing a library.

Thanks, that's helpful. It turns out Linux does have something similar. I'm not sure whether either of those is fast enough to beat a couple of atomic increments even with the cache problems, but I'll give it a shot if I can find the time.

Now if only I could find the equivalent for Darwin! :)
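If I'm reading the man pages right, the Linux equivalent is sched_getcpu(3); here's the shape of what I'd try (a sketch, assuming the libc crate):

// sched_getcpu returns the core the calling thread is currently running
// on, or -1 on error; a stale answer only costs a cache bounce.
fn current_core_index() -> usize {
    let cpu = unsafe { libc::sched_getcpu() };
    if cpu < 0 { 0 } else { cpu as usize }
}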

I think the problem with Drop is that your threads will only "report" allocations to the global counter when they exit. That means if you have several long-running threads making lots of allocations, you won't know the full count until your program exits.

There is probably some way to have a global registry of atomic counters, where each thread is allocated one on startup and folds its count back in when it exits. This sketch still glosses over some edge cases, but maybe it'll give you some inspiration.

use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, LazyLock, Mutex};
use std::thread::ThreadId;

static COUNTERS: LazyLock<Counters> = LazyLock::new(Counters::default);

#[derive(Default)]
struct Counters {
    // The map needs a lock since threads register and deregister in it;
    // the hot path never touches it, only the Arc'd per-thread atomic.
    running_threads: Mutex<HashMap<ThreadId, Arc<AtomicUsize>>>,
    completed: AtomicUsize,
}

impl Counters {
    fn create(&self, id: ThreadId) -> Arc<AtomicUsize> {
        let count = Arc::new(AtomicUsize::new(0));
        self.running_threads.lock().unwrap().insert(id, Arc::clone(&count));
        count
    }

    fn done(&self, id: ThreadId) {
        // Fold the exiting thread's count into the `completed` tally.
        if let Some(counter) = self.running_threads.lock().unwrap().remove(&id) {
            let count = counter.load(Ordering::SeqCst);
            self.completed.fetch_add(count, Ordering::SeqCst);
        }
    }

    fn total_allocations(&self) -> usize {
        let running: usize = self
            .running_threads
            .lock()
            .unwrap()
            .values()
            .map(|c| c.load(Ordering::SeqCst))
            .sum();
        running + self.completed.load(Ordering::SeqCst)
    }
}

struct Counter(Arc<AtomicUsize>);

impl Counter {
    fn new() -> Self {
        let id = std::thread::current().id();
        Counter(COUNTERS.create(id))
    }

    fn add(&self, count: usize) {
        self.0.fetch_add(count, Ordering::SeqCst);
    }
}

impl Drop for Counter {
    fn drop(&mut self) {
        let id = std::thread::current().id();
        COUNTERS.done(id);
    }
}
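
Usage would then look something like this (hypothetical wiring):

thread_local! {
    // Lazily registers a Counter per thread; its Drop impl folds the
    // thread's count into COUNTERS.completed when the thread exits.
    static LOCAL_COUNTER: Counter = Counter::new();
}

fn record_allocation() {
    LOCAL_COUNTER.with(|c| c.add(1));
}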
1 Like

When kernel code uses per-CPU variables, it does so in "atomic context", where preemption and thereby task migration are disabled, AFAIK. Until Linux kernel 4.18, user-space code could not rely on similar guarantees, so this behaviour could only be approximated via thread-local storage or sharding.

However, there is now support for so-called restartable sequences available to user space, which allows detecting and handling preemption even though it does not prevent it, cf. Restartable sequences [LWN.net]. Note that application-level code might need to coordinate with glibc when accessing the kernel API, cf. Restartable sequences in glibc [LWN.net]. Finally, it might be interesting to note that one of the goals for glibc's own usage of restartable sequences appears to be efficiently implementing getcpu(2), cf. Extending restartable sequences with virtual CPU IDs [LWN.net].

EDIT: There is also a more tutorial-like article on using librseq available at The 5-year journey to bring restartable sequences to Linux - EfficiOS but I am not sure how current it is w.r.t. the glibc integration.

2 Likes

I've just implemented per-thread atomic counters stored in thread locals, with a shared data structure for iterating over all of them, for a profiler I'm working on. I'd be interested to hear about @droundy's use case here; it sounds like a profiler too?

Here is a sketch inspired by that code; unlike the previous example, it's based on leaking a new atomic for every new thread, on the theory that real programs won't spawn an unbounded number of threads. Something with Drop would probably work too and would give you removal logic, though the thread-local docs have enough caveats about destructors that I'm not relying on that; in my use case I can rely on some other logic to know when to clean up.


use std::cell::Cell;
use std::sync::atomic::AtomicUsize;

use parking_lot::{const_mutex, Mutex};

// Or use a hashmap with thread id as key; I'm using Linux's gettid.
static ALL_COUNTERS: Mutex<Vec<&'static AtomicUsize>> = const_mutex(Vec::new()); // parking_lot mutex

fn create_counter() -> &'static AtomicUsize {
    // Maybe want to make this as big as a cache line, too.
    let counter: &'static AtomicUsize = Box::leak(Box::new(AtomicUsize::new(0)));
    ALL_COUNTERS.lock().push(counter);
    counter
}

thread_local! {
    /// This thread's own counter; create_counter registers it in ALL_COUNTERS.
    static COUNTER: Cell<&'static AtomicUsize> = Cell::new(create_counter());
}

Now each thread can write to its own counter, and it's still possible to iterate over all the counters.
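For example, getting the total is just a walk over the registry:

fn total() -> usize {
    ALL_COUNTERS
        .lock()
        .iter()
        .map(|counter| counter.load(std::sync::atomic::Ordering::Relaxed))
        .sum()
}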

1 Like

My solution uses thread affinity to do the thing that you want to do. It basically tells the scheduler to put a thread on a particular CPU (core) among the available set of cores.

@steffahn's post seems to be the best "direct" solution (based on my basic understanding of how crossbeam implements it).

My solution would be as follows:

  1. First of all, you need to separate each CPU-local variable into its own cache line. The crossbeam crate seems to have done this well: pad the structure to a typical cache-line size.
  2. Next, you need CPU-local variables. There is no way to declare a CPU-local variable directly (as far as I know), BUT there's a "trick":
    • At runtime, you can get the number of logical cores and create a global variable (the cache-line-padded array of your data) with one entry per logical core.
    • Then, for each thread you create, you can set its thread affinity, which, if you're not aware of it, is a way to tell the scheduler (at least on Linux) to run a thread on a particular CPU.
    • Note that you'll need the information about available logical cores (CPUs) in order to assign the right core number to each spawned thread.
  3. On each thread spawned (with affinity for a particular core), you can then access the array entry corresponding to that core number.

Thus, you have all threads whose affinity is set to a particular core using the same entry in the array; a sketch follows below. There seem to be multiple crates implementing this in Rust, and they seem mature enough for production (given that they haven't needed updates recently).
You can also refer to the glibc/Linux man pages regarding this (there are more man pages on similar/related functions).

PS: Note that manually setting affinity has drawbacks, such as when one of the CPUs is heavily utilized by the OS for something else, in which case the threads pinned to that CPU will perform poorly.
I hope I didn't state anything incorrectly, and that this helps you.
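A minimal sketch of this approach, assuming the core_affinity crate (my pick for illustration; any affinity crate would do): spawn one thread per core, pin it, and give it the matching padded slot.

use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

#[repr(align(64))]
struct PaddedCounter(AtomicU64);

fn main() {
    // One padded slot per logical core.
    let core_ids = core_affinity::get_core_ids().expect("could not enumerate cores");
    let counters: Vec<PaddedCounter> =
        core_ids.iter().map(|_| PaddedCounter(AtomicU64::new(0))).collect();
    let counters = Arc::new(counters);

    let handles: Vec<_> = core_ids
        .into_iter()
        .enumerate()
        .map(|(i, id)| {
            let counters = Arc::clone(&counters);
            std::thread::spawn(move || {
                // Pin this thread to core `id`; it now shares slot `i`
                // only with other threads pinned to the same core.
                core_affinity::set_for_current(id);
                counters[i].0.fetch_add(1, Ordering::Relaxed);
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }

    let total: u64 = counters.iter().map(|c| c.0.load(Ordering::Relaxed)).sum();
    println!("total = {total}");
}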

It sounds like you could use an array of atomic counters (each wrapped in an align(64) newtype so it gets its own cache line) of some reasonable length (I'm guessing double the CPU count), and key on anything simple like thread id or job number. So long as the keys aren't strongly correlated with core scheduling, your contention can be made arbitrarily low, and it sounds much simpler than trying to precisely target CPU IDs?

Perhaps something to use as a fallback.
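For example, a sketch that keys on a hash of the thread id, computed once per thread:

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const NUM_SHARDS: usize = 32; // e.g. roughly double the core count

thread_local! {
    // Hash this thread's ThreadId once to pick a shard.
    static MY_SHARD: usize = {
        let mut hasher = DefaultHasher::new();
        std::thread::current().id().hash(&mut hasher);
        (hasher.finish() as usize) % NUM_SHARDS
    };
}

fn shard_index() -> usize {
    MY_SHARD.with(|&i| i)
}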

1 Like
