When join! is slower than await'ing on multiple Tokio tasks. How so?

In my decades-long quest to find ever slower but more interesting ways to calculate Fibonacci numbers, I came up with this solution: a recursive Fibonacci algorithm that spawns a Tokio task for each recursion, sending parameters and results over channels rather than making direct recursive calls.

use async_recursion::async_recursion;
use futures::join;
use tokio::sync::mpsc;

type Sender = mpsc::UnboundedSender<u64>;
type Receiver = mpsc::UnboundedReceiver<u64>;

pub struct FiboTokio {
    par_rx: Receiver,
    res_tx: Sender,
}

impl FiboTokio {
    pub fn new(par_rx: Receiver, res_tx: Sender) -> Self {
        Self { par_rx, res_tx }
    }

    pub async fn fibo(n: u64) -> u64 {
        // Create parameter and result passing channels.
        let (par_tx, par_rx) = mpsc::unbounded_channel::<u64>();
        let (res_tx, mut res_rx) = mpsc::unbounded_channel::<u64>();

        // Create new fibo calculator object.
        let f1 = FiboTokio::new(par_rx, res_tx);

        // Run the fibo calculator on a task
        tokio::spawn(async move {
            f1.calc().await;
        });

        // Send the parameter to the fibo calculator.
        par_tx.send(n).unwrap();

        // Collect the results from the return channel.
        res_rx.recv().await.unwrap()
    }

    #[async_recursion]
    async fn calc(mut self) {
        // Get the number we want the fibo of from the parameter channel.
        let n = self.par_rx.recv().await.unwrap();
        let f = match n {
            0 => 0u64,
            1 => 1u64,
            _ => {
                let f1 = Self::fibo(n - 1);
                let f2 = Self::fibo(n - 2);
                let (f1, f2) = join!(f1, f2);
                f1 + f2
            }
        };
        // Return the resulting fibonacci number on the result channel.
        self.res_tx.send(f).unwrap();
    }
}
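
For reference, the driver that calls this is roughly the following (a minimal sketch, not the exact harness; the timing was done with the shell's time):

#[tokio::main]
async fn main() {
    let n = 29;
    let result = FiboTokio::fibo(n).await;
    println!("fibo({n}) = {result}");
}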

Running this on my MacBook Pro M1 for fibo(29) takes ~6.6 seconds and shows ~400% CPU. Timed crudely with time.

Now, here is the thing: when I replace that join! with separate .awaits like so:

                let f1 = Self::fibo(n - 1).await;
                let f2 = Self::fibo(n - 2).await;
                f1 + f2

it runs much faster. It completes in only 0.5 seconds and uses just 100% CPU.

This is somewhat surprising. We get concurrency, we get tasks spread over cores, and yet it runs significantly more slowly!
The Asynchronous Programming in Rust book (join! - Asynchronous Programming in Rust) specifically says not to do the latter: "To correctly run the two futures concurrently, use futures::join!".

In this case, weird as it is, that advice is not good!

What do you think about this?

Edit: Oops, I forgot the --release. It didn't make such a huge difference though. Timings above edited.

Interesting. Could you post a flamegraph?

The main difference appears to be the number of tasks that exist at each moment in time. With one approach, it is extremely large (in the millions). With the other approach, you never have more than one or two tasks. It's clear that the join! version will spend a lot more time on managing the collection of tasks, though the fact that it still lost with 4 CPUs is interesting.

Perhaps the await version is able to be faster by completely avoiding inter-core synchronization overhead? I don't know if that's plausible given the Tokio executor.

The unbounded channels could be replaced with oneshot channels, but that's not really relevant to the difference.
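
For illustration only, the result channel as a oneshot would look roughly like this (a sketch, not the code from the original post):

use tokio::sync::oneshot;

// Sketch: the result channel reworked as a oneshot.
async fn fibo_oneshot(n: u64) -> u64 {
    let (res_tx, res_rx) = oneshot::channel::<u64>();
    tokio::spawn(async move {
        // ... the recursive calculation would go here ...
        let _ = res_tx.send(n); // a oneshot sender is consumed by its single send
    });
    res_rx.await.unwrap()
}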

Sure. If you can give me quick instructions. I'm getting too tired to start digging into that.

There are probably several things going on here, most of them related to how much you must hate your CPU. :stuck_out_tongue:

  • The moment you join!, you throw the branch predictor out of the window. The branch-prediction miss can be tolerated if the task is large enough. Here you have channel setup, a branch with 3 resulting code paths, a function call, and a simple addition of integers in the likely case. I assume from your benchmark that trying to parallelize this creates more work than simply executing it in order.
  • In addition to that, you transfer data to a different CPU core and therefore guarantee a L1 cache miss, which would be unlikely to happen if all work stayed on the same core.
  • I don't know what CPU your MacBook Pro M1 uses, but I'd also guess there is some dynamic overclocking going on, if only a single core is doing the heavy lifting and the other ones are idle. For example, if I look at my Intel® Core™ i5-12400F processor, the max turbo frequency is 4.40 GHz compared to the base frequency of 2.50 GHz, i.e. there is a potential speed-up of 76%. That's almost like running the task on 2 cores.

Nooo... I love my CPUs. All of them. From my 8-bit AVRs to my old x86-64 to the MacBook (arm64) to the Nvidia Jetsons (arm64). Or at least I did until they started to be so unpredictable...

Now... I'm not sure how all this async and Tokio hangs together. Normally I would expect to spawn a task or many tasks and then join their join handles, either by doing .await on the join handle (as per the docs: Spawning | Tokio - An asynchronous Rust runtime) or by using join! on a bunch of them (as per: join! - Asynchronous Programming in Rust). Indeed the compiler warns that not doing so can result in spawned code not being run.

However, in my code above I don't do either of those things on the join handles of the tasks. I just drop them into space. Rather, I do .await or join! on the futures returned by the receive ends of the channels those tasks post their results to. Somehow this satisfies the compiler and gets the tasks to run.

It is a mystery to me. How!

Anyway. Today I have a new version of async fibo that does allow for joining on the task join handles and the results are very weird. More later...

Exactly what warning are you getting? A spawned task does not require awaiting. That's pretty much the definition of spawning: the executor takes over polling the future that is the task. The compiler doesn't know anything about that, though.
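
As a minimal demonstration of that, assuming a #[tokio::main] runtime, a spawned task runs even though its JoinHandle is dropped on the spot:

#[tokio::main]
async fn main() {
    // The JoinHandle is discarded immediately, yet the task still runs:
    // the executor polls it regardless of anyone awaiting the handle.
    tokio::spawn(async {
        println!("spawned task ran");
    });

    // Sleep briefly so the detached task gets a chance to run
    // before main returns and the runtime shuts down.
    tokio::time::sleep(std::time::Duration::from_millis(50)).await;
}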

So I started to have a theory about what is going on...

The basic idea of this adventure is to take the standard recursive fibo(), like so:

pub fn fibo(n: u64) -> u64 {
    match n {
        0 => 0,
        1 => 1,
        _ => {
            let f1 = fibo(n - 1);
            let f2 = fibo(n - 2);
            f1 + f2
        }
    }
}

and to arrange for those two recursive fibo calls to happen in parallel async tasks (or sync threads).
Why? Just because. Well, the idea is to spawn thousands of tasks/threads that are in flight at the same time and see how they fly.

The theory was that when I use .await on things, one after the other, that effectively serialises the tasks. The docs are right: one needs to join! multiple tasks in one go to ensure they fly at the same time.

Well, I might expect that doing things serially here would remove the overheads of scheduling tasks onto threads and all those mechanics. Ergo the version that does not use join! would be faster.

But. Today I have run this code on an Nvidia Jetson AGX Orin machine. This cute little guy has 12 ARM cores and 64GB of RAM! The results are wildly different.

This is what happens on my MacBook Pro M1:

fibo::fibo(29)                      = 514229, in 2.269042ms
fibo::FiboAsync::<false>::fibo(29)  = 514229, in 1.545821167s (Without join!)
fibo::FiboAsync::<true>::fibo(29)   = 514229, in 72.904108042s (With join!)

This is what happens on the Jetson AGX Orin:

fibo::fibo(29)                      = 514229, in 2.038704ms
fibo::FiboAsync::<false>::fibo(29)  = 514229, in 9.210667971s (Without join!)
fibo::FiboAsync::<true>::fibo(29)   = 514229, in 1.956851583s (With join!)

That first fibo::fibo is the normal recursive fibo. We see the MacBook and Jetson are about the same speed there. I have used a const generic bool to select between using .awaits and using join!, as described in my OP.

These async fibo results are weird. Firstly, what was slow became fast and what was fast became slow. And as an extra surprise, the Jetson's worst case is about 7 times faster than the MacBook's worst case.

This is all nuts. Any ideas?

Ah ha, you are right. I was getting confused with some other error at some point, or something.

So we are good to go re. spawning all the required tasks. The oddity is what happens when join!-ing or .await-ing those channel receiver endpoints. The difference is huge and very dependent on the machine I run it on.

Something to notice about the big picture here: you're creating channels and spawning tasks just to run one match and one + per task. That means that, no matter what, you're primarily measuring the overhead, not the computation: allocations (of tasks and channels), dispatch, scheduling, synchronization.

Your machines have different CPUs (affects cost of synchronization operations), different kernels (affects scheduling of threads), and different system libraries (affects cost of allocation). So, it's not greatly surprising that the results are very different — different system designs make different tradeoffs about the efficiency of various operations.

Yes, yes. I'm quite aware of that. If I wanted performance in fibo() I would not be doing this. Maybe something more like my million-digit Fibonacci number calculation: GitHub - ZiCog/fibo_4784969: The million digit Fibonacci number challenge. Calculate fibo(4784969) in various languages.

Indeed, that is the point: I wanted to compare the overheads of async tasks vs regular threads. And the recursive fibo() came to mind as an interesting way to spawn 100,000s of threads.

I've always maintained that regular threads are what you need for compute performance through parallelism when you have multiple cores, while async is what you need for lots of tasks that are waiting on events. "Sync for work, async for waiting", as it were. In this fibo() case we don't care about the actual work performance, but we do create thousands of tasks that have very little compute and spend most of their time waiting. A good test, no?

You are right, I have complicated my life by trying this on different machines with different OSes, with vastly different amounts of memory, and so on.

I'm going to play with it some more...

One way to partially mitigate one variable would be to use jemallocator (or any userland allocator) rather than relying exclusively on the system allocator. I suspect this ends up primarily reducing to a (multithreaded) allocator stress test more than anything else.
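
Switching the global allocator is only a couple of lines; for example, with the tikv-jemallocator crate (a sketch, assuming that crate is added as a dependency):

// Cargo.toml: tikv-jemallocator = "0.5"
use tikv_jemallocator::Jemalloc;

// Route all heap allocations in this binary through jemalloc.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;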

At least the perf flipflop illustrates that it isn't just LLVM seeing through your tricks and optimizing the sequential awaits.

The crux of the issue is that there is no inherent opportunity for parallelism in the problem being solved. So even though we can get tasks spread over cores, each task can only run after the two tasks which provide its input have run.

So Tokio starts spawning future tasks without really having the information to know that it needs to run them in a very particular order to avoid spawning too far out ahead of itself. So the real work this is doing is running down the chain of serially dependent tasks with poll, mostly getting Pending returned, then, when getting lucky and finding a task that is Ready, going through the expensive Waker mechanism to identify the next task to be polled, etc.

In other words, we've taken the knowledge we had at compile time regarding the serially dependent nature of tasks to be executed and thrown it out the window, and instead ask the Tokio scheduler at runtime to figure it all out. :slight_smile:

(Oh, and for the non join! case, where you just serially .await instead, try flipping the order of the two .awaits... i.e. .await the result of the n-2 value before the n-1 value. Does that slow down even further?)

Yes, yes, I know. I addressed this above: When join! is slower than await'ing on multiple Tokio tasks. How so? - #11 by ZiCog

The point is not to gain performance for fibo() by parallelising, the point is to use recursive fibo() as an interesting means to spin up millions of Tokio tasks and the channels communicating between them, to see how that flies.

Thing is, when doing this we end up with a huge number of tasks, each of which has very little computation to do but spends a lot of time waiting for the results of the requests it made. Which, if I understand correctly, is exactly what async is built to do efficiently, as opposed to spinning up thousands of sync threads. So I thought this would make an interesting test.

Hmmm... jemallocator, another rabbit hole to jump into. OK, let's go:

The machines are a MacBook Pro with 16GB RAM and an Nvidia Jetson AGX Orin with 12 cores and 64GB RAM running Ubuntu. Both are ARM64.

Included in the results is the traditional recursive Fibonacci function, just for reference, and a version using standard threads for each recursion. There are two slightly different versions using Tokio tasks for each recursion. For each of those, we either join!() the result channel receive handles or .await each in turn, selected by a const generic bool.

Using standard allocator:

Mac Book Pro. n = 32:

fibo(32)                      = 2178309, in 13.56ms
FiboAsync::<false>::fibo(32)  = 2178309, in 6.29s    (Without join!)
FiboAsync::<true>::fibo(32)   = 2178309, in 1156.42s (With join!)
FiboAsync2::<false>::fibo(32) = 2178309, in 1625.00s (Without join!)
FiboAsync2::<true>::fibo(32)  = 2178309, in 2432.58s (With join!)
FiboThreads::fibo(32)         = 2178309, in 155.50s

Nvidia Jetson AGX Orin. n = 32:

fibo(32)                      = 2178309, in 6.89ms
FiboAsync::<false>::fibo(32)  = 2178309, in 53.22s (Without join!)
FiboAsync::<true>::fibo(32)   = 2178309, in 9.54s  (With join!)
FiboAsync2::<false>::fibo(32) = 2178309, in 15.09s (Without join!)
FiboAsync2::<true>::fibo(32)  = 2178309, in 12.03s (With join!)
FiboThreads::fibo(32)         = 2178309, in 327.21s

This is already amazing:

  1. The standard fibo() is faster on the Jetson, which is unexpected.
  2. The Mac is faster for the standard threads version than the Jetson.
  3. The Mac is very slow for all the async versions except the first one, which does not use join!(). Perhaps there is a lot of memory swapping happening here.
  4. Conversely, the first async version, not using join!(), is a lot slower than the others on the Jetson.
  5. The sync version is a lot slower than all the async versions on the Jetson and way slower than the fastest async version on the Mac. Good news for async and Tokio.

Now using Jemalloc:

Mac Book Pro. n = 32:

fibo(32)                      = 2178309, in 18.30ms
FiboAsync::<false>::fibo(32)  = 2178309, in 4.71s (Without join!)
FiboAsync::<true>::fibo(32)   = 2178309, in 5.42s (With join!)
FiboAsync2::<false>::fibo(32) = 2178309, in 5.57s (Without join!)
FiboAsync2::<true>::fibo(32)  = 2178309, in 5.56s (With join!)
FiboThreads::fibo(32)         = 2178309, in 814.16s

Nvidia Jetson AGX Orin. n = 32:

fibo(32)                      = 2178309, in 6.91ms
FiboAsync::<false>::fibo(32)  = 2178309, in 24.48s (Without join!)
FiboAsync::<true>::fibo(32)   = 2178309, in 6.46s  (With join!)
FiboAsync2::<false>::fibo(32) = 2178309, in 6.00s  (Without join!)
FiboAsync2::<true>::fibo(32)  = 2178309, in 6.07s  (With join!)
FiboThreads::fibo(32)         = 2178309, in 550.41s

Wow! That leveled the playing field dramatically. The Mac and Jetson are almost neck and neck for the async versions.
But here, on the Jetson, not using join!() on those channel receivers (.awaiting them one after the other instead) is nearly four times slower than join!-ing them.

It seems to me that it should not make any difference whether I join!() two channel receivers or .await them one after the other. After all, the tasks doing the work are spawned and running, and I need the results of both before proceeding anyway.

What goes on?

I tried flipping the order of the two .awaits. Made no difference.

Ah, sorry... missed the point of the exercise.

I'd try to get a profile, which can typically be done pretty easily with:

cargo install flamegraph
cargo flamegraph -b <your binary crate> -- <your args>

It should produce a .svg file locally you can easily view in any browser. (Link for more information.)

(I also benchmark/profile on a MacBook M2... if you haven't already noticed, make sure it's plugged in and not running solely on battery, as it will otherwise be slower.)

Tried the flamegraph. Nice diagram. Looks like a battleship. No idea how to interpret it. Most of the names in the boxes are unreadable.

Sure I'm plugged into my Mac charger.

No, not quite.

This does all of the work of Self::fibo(n - 1) (spawn, recurse, await) and then all of the work of Self::fibo(n - 2) without any concurrency between the two calls. Given the difference doesn't show up with FiboAsync2, I'll guess that it's written instead more like

                let f1 = spawn(Self::fibo(n - 1));
                let f2 = spawn(Self::fibo(n - 2));
                f1.await.unwrap() + f2.await.unwrap()

instead, which will run the two calls in parallel and make the double await behave essentially identically to join!(f1, f2) (or actually negligibly better). It's important to remember that calling an async fn does nothing until it's awaited, unlike in other languages where it immediately runs to the first suspend point and/or implicitly spawns a task to be progressed even before it's awaited. But that doesn't even matter to how you've written the first example, since you await one subtask before even building the other.
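
To spell out the distinction (a sketch, with a stand-in do_work in place of Self::fibo):

// do_work stands in for Self::fibo here.
async fn do_work(n: u64) -> u64 { n }

async fn sequential() -> u64 {
    // The second future isn't even created until the first has completed.
    let a = do_work(1).await;
    let b = do_work(2).await;
    a + b
}

async fn joined() -> u64 {
    // Both futures exist up front and are polled concurrently within this one task.
    let (a, b) = tokio::join!(do_work(1), do_work(2));
    a + b
}

async fn spawned() -> u64 {
    // Both run as independent tasks; awaiting the handles afterwards
    // behaves essentially like join!-ing them.
    let a = tokio::spawn(do_work(1));
    let b = tokio::spawn(do_work(2));
    a.await.unwrap() + b.await.unwrap()
}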

Cool to see proof that my intuition about this mostly being an allocation benchmark was accurate, though. The remaining extra overhead on the AGX I would guess is context-switching overhead in the spawn-a-task-per-in-order-frame approach, which gets a return on that investment once you actually allow it to go wide.

What you describe re: multiple .awaits vs joining multiple JoinHandles is exactly what I would expect if I were doing that when spawning tasks, as described in the Tokio docs.

But that is not what I do. Here I spawn the task and discard the JoinHandle. There is no await or join on the task itself. Then I await or join on the receiver of a channel that gets the result from the task.

At which point I get confused. Does my task start to run as soon as I spawn it? Or does it not get scheduled until I await or join that receiver? If the latter, how on Earth does that work?

Anyway, let's ignore my second async attempt for now. The first one looks like this now.

use async_recursion::async_recursion;
use futures::join;
use tokio::sync::mpsc;

type Sender = mpsc::UnboundedSender<u64>;
type Receiver = mpsc::UnboundedReceiver<u64>;

pub const USE_JOIN: bool = true;
pub const NOT_USE_JOIN: bool = !USE_JOIN;

pub struct FiboAsync<const USE_JOIN: bool> {
    par_rx: Receiver,
    res_tx: Sender,
}

impl<const USE_JOIN: bool> FiboAsync<USE_JOIN> {
    pub fn new(par_rx: Receiver, res_tx: Sender) -> Self {
        Self { par_rx, res_tx }
    }

    #[async_recursion]
    pub async fn fibo(n: u64) -> u64 {
        let (par_tx, par_rx) = tokio::sync::mpsc::unbounded_channel::<u64>();
        let (res_tx, mut res_rx) = tokio::sync::mpsc::unbounded_channel::<u64>();
        let f1 = FiboAsync::<USE_JOIN>::new(par_rx, res_tx);
        tokio::spawn(f1.calc());
        par_tx.send(n).unwrap();
        res_rx.recv().await.unwrap()
    }

    async fn calc(mut self) {
        let n = self.par_rx.recv().await.unwrap();
        let f = match n {
            0 => 0,
            1 => 1u64,
            _ => {
                if USE_JOIN {
                    let f1 = Self::fibo(n - 1);
                    let f2 = Self::fibo(n - 2);
                    let (f1, f2) = join!(f1, f2);
                    f1 + f2
                } else {
                    let f1 = Self::fibo(n - 1).await;
                    let f2 = Self::fibo(n - 2).await;
                    f1 + f2
                }
            }
        };
        self.res_tx.send(f).unwrap();
    }
}
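
For completeness, the variants above get selected and run with something like this (a sketch, not the exact timing harness):

// true selects the join! path, false the sequential .await path.
#[tokio::main]
async fn main() {
    let n = 29;
    let with_join = FiboAsync::<true>::fibo(n).await;
    let without_join = FiboAsync::<false>::fibo(n).await;
    println!("with join!: {with_join}, without join!: {without_join}");
}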