Can anyone explain the virtues of thread_local!

Maybe there is a good blog post or some tutorial somewhere?

But I suspect this is the cause behind some very low CPU utilization in multithreaded code I have. It performs blazingly fast on small data sizes, but once the data is even a few MB in length, htop begins to look poor, with rare bursts of multithreaded CPU activity scattered between long periods of activity that looks like this:

(Where the grey bars in htop mean I/O.)

My whole computer also mostly comes to a standstill when this happens.

I think it's because (in this case) I have a 30 MB PDF that I am processing in rayon, split over the pages (there are some 1.3k pages). If I understand correctly, thread_local! keeps a copy in every thread, at the cost of a greater memory footprint, but it means the algorithms can get to work faster because rayon doesn't have to shuffle this 30 MB around everywhere? Or maybe there is contention on it, or something?

But I'm kind of guessing this. I don't know how or when to use it.

Could any of you wise folk enlighten me? Thanks!!!

Are you asking us to explain the behavior of some code you have written? If so, that's obviously not going to be possible without the code itself.

No thanks :slight_smile: really just an explanation of why thread_local exists and when to use it. Does it have analogues in other programming languages? Is it essentially just a wrapper over an OS API, or is it a Rust trick to overcome the nature of how multiple CPU cores and shared caches work, etc.? I'm completely guessing, and this is why I just don't use it.

It's similar to BufRead: I just don't have a good, low-enough-level understanding of what is going on, so I may be missing out on obvious tricks that would make my code much faster.

With your help, I'll then deduce whether I can use it to solve my current "problem" :smiley:

Thread-locals have nothing to do with BufRead. They are two different concepts and solve completely different problems.

The problem that thread-locals solve is that sometimes, you want some sort of global state which is not thread-safe. In this case, it can't be truly global, because globals are accessible from all threads. If putting the state behind a mutex and locking it upon every access would be too tedious, but duplicating it for each thread is OK, then a thread-local can be used for creating and caching a separate, independent instance for each thread. It can then be used freely, as if it were a global, being assured that only one thread will ever access it.
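Here is a minimal sketch of that pattern (the names and the per-thread scratch buffer are made up for illustration): each thread lazily gets its own independent instance, accessed through `with`, so no locking is needed.

```rust
use std::cell::RefCell;

thread_local! {
    // One independent Vec per thread, created lazily on first access.
    // RefCell is fine here because only the owning thread can ever reach it.
    static SCRATCH: RefCell<Vec<u8>> = RefCell::new(Vec::new());
}

fn process_chunk(chunk: &[u8]) -> usize {
    SCRATCH.with(|buf| {
        let mut buf = buf.borrow_mut();
        buf.clear();
        buf.extend_from_slice(chunk);
        // ... use the per-thread buffer as if it were a global ...
        buf.len()
    })
}
```

Each worker thread calling `process_chunk` reuses its own buffer across calls instead of allocating fresh or contending on a shared lock.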

I don't know how all of this relates to your specific problem, though.


I basically have something like this going on

```rust
fn parse_pdf(pdf: &[u8]) -> FooBar {
    let pages = generate_page_iter(pdf);
    let (foo, bars) = rayon::join(
        // Runs over the whole document while the pages are processed in parallel.
        || do_foo_on_entire_pdf(pdf),
        || {
            pages
                .par_iter()
                .panic_fuse()
                // No trailing semicolon, so collect() actually gathers the per-page results.
                .map(|page| do_bar_on_individual_page(pdf, page))
                .collect::<Vec<_>>()
        },
    );
    FooBar { foo, bars }
}
```

I think it's here where I'm getting the htop problem.

I was wondering whether pdf: &[u8] should maybe be thread_local, or perhaps owned, cloned, and moved into rayon, so that each thread is not jumping around to read the slice. But maybe it's not necessary?

Pinging @cuviper (I really hope it's not out of line to ping!), the god of multithreading.

> It's similar to BufRead

I meant that my lack of understanding, and the obvious optimizations I'm therefore missing, show up in the same way: there isn't a single BufRead anywhere in my code. I wonder how many other mistakes I'm making.

Edit: you know what, do_bar_on_individual_page calls std::fs::write for every single page of the PDF, which is obviously no bueno, and perhaps that is the cause!

Edit 2: Yup, that was it! Sheesh. I'm sorry, but I hope @H2CO3's explanation of thread_local will benefit more readers in the future!!

Sharing a &[u8] between threads is no problem. Adding a buffer for those writes should help reduce the number of syscalls, though, by doing more in bulk.
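For example, a minimal sketch of that idea (the write_pages function, the output path, and writing everything to a single file are assumptions for illustration, not the code from this thread): a BufWriter coalesces many small writes into fewer, larger syscalls.

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

// Hypothetical: instead of one std::fs::write (open + write + close) per page,
// open the output once and push everything through a buffered writer.
fn write_pages(pages: &[Vec<u8>]) -> std::io::Result<()> {
    let file = File::create("pages.out")?;
    let mut out = BufWriter::new(file);
    for page in pages {
        out.write_all(page)?;
    }
    out.flush()?; // flush the remaining buffered bytes before dropping
    Ok(())
}
```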
