Why is this Rust program taking longer than its Go counterpart?

Hello Rust Community!

I have some experience writing Go code and am now learning Rust. I know there are lots of comparisons everywhere, but to benchmark both languages under very high concurrency I wrote a small program in both Rust (using tokio) and Go. It does nothing but spawn a bunch of goroutines/tokio tasks and exit after waiting 10 seconds. I set both programs to use an equal number of threads. Here is my code:
Go:

package main

import (
	"fmt"
	"sync"
	"time"
)

var wg sync.WaitGroup

func main() {
	for i := 0; i < 1_000_000; i++ {
		wg.Add(1)
		go worker(i)
	}
	wg.Wait()
	fmt.Println("done")
}

func worker(i int) {
	time.Sleep(10 * time.Second)
	fmt.Println(i)
	wg.Done()
}

Rust:

use std::time::Duration;
use tokio::task::JoinSet;

async fn process_one(i: i32) {
    tokio::time::sleep(Duration::new(10, 0)).await;
    println!("{}", i);
}

#[tokio::main(flavor = "multi_thread", worker_threads = 16)]
async fn main() {
    let mut set = JoinSet::new();

    for i in 0..1_000_000 {
        set.spawn(async move {
            process_one(i).await;
        });
    }
    set.join_all().await;
    println!("done")
}

I compiled the Rust code with cargo build --release.

I executed both of them:
env GOMAXPROCS=16 /usr/bin/time -v ./routinetestgo

And for Rust:
/usr/bin/time -v ./routinetestrust

Following is the output:
Go:

        Command being timed: "./routinetestgo"
        User time (seconds): 17.64
        System time (seconds): 7.24
        Percent of CPU this job got: 166%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:14.96
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2758548
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 694710
        Voluntary context switches: 358359
        Involuntary context switches: 3836
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Rust:

        Command being timed: "./routinetestrust"
        User time (seconds): 6.54
        System time (seconds): 11.62
        Percent of CPU this job got: 112%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:16.13
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 377856
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 93967
        Voluntary context switches: 1292955
        Involuntary context switches: 1101
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

I carried out this execution multiple times, but the Rust code always stays behind (considering elapsed wall-clock time). Is it normal to have such a difference? Rust seems to take around 7% more time. Or do I need to optimize the Rust code further to get comparable results? Can someone please explain the cause of this?

1 Like

What if you remove the printing?

I suspect what you might be benchmarking is either your console or different choices about locking for output between languages, more than the actual language.

11 Likes

The main function runs in a different thread from the Tokio worker threads. Try this:

#[tokio::main(flavor = "multi_thread", worker_threads = 16)]
async fn main() {
    tokio::spawn(async move {
        let mut set = JoinSet::new();

        for i in 0..1_000_000 {
            set.spawn(async move {
                process_one(i).await;
            });
        }
        set.join_all().await;
    }).await.unwrap();
    println!("done");
}

As another thing, try changing the runtime flavor to the current_thread runtime. Printing takes a lock, so having 16 threads competing for the same lock is pretty inefficient, and you'll be better off with a single-threaded runtime.
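For reference, a minimal sketch of that single-threaded variant (same workload as OP's program, only the flavor changed):

#[tokio::main(flavor = "current_thread")]
async fn main() {
    let mut set = tokio::task::JoinSet::new();

    for i in 0..1_000_000 {
        set.spawn(async move {
            tokio::time::sleep(std::time::Duration::from_secs(10)).await;
            // All prints now happen on one thread, so the stdout lock
            // is never contended.
            println!("{}", i);
        });
    }
    set.join_all().await;
    println!("done");
}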

5 Likes

With the multithreaded runtime, would it also make sense to reduce the number of threads to the number of cores? (If so, this should be done for both languages to compare fairly.)

Well, if the primary workload is printing stuff to the console, you would be better off with 1 thread than one per core. But under normal circumstances yes. There's a reason that #[tokio::main] does this by default.
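For context, the attribute is roughly sugar for building the runtime by hand, something like the sketch below, where the worker count defaults to the number of CPU cores unless worker_threads is set explicitly:

fn main() {
    // Roughly what #[tokio::main] expands to: a multi_thread runtime
    // sized to the machine's core count by default.
    tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .unwrap()
        .block_on(async {
            // async main body goes here
        });
}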

5 Likes

Sorry, a bit off topic. But in this case, would funneling the output into a channel and then having a single thread do the printing be better?

The observable guarantees of this program are so limited that the "best" - most time-efficient, since that's what OP is investigating - implementation is probably a single-threaded program with hard-coded output and a ten-second wait. If that hypothetical program took more than 11 seconds to run on normal systems, vs. 14 and 16 seconds wall time in OP's examples, I'd be absolutely floored.

OP isn't actually exploring that problem, so that's probably not the answer I would want to give them. They're making what seems like a reasonable attempt to understand how tokio's scheduler differs from Go's. In that lens… I think the answer to "would it be better to funnel output into a channel" is maybe. Measure first.

Taking care with how IO gets scheduled can make a huge difference to some concurrent programs, and barely a difference at all to others. The one OP has presented probably will benefit from consolidating the IO into a single task, but it's so wildly oversimplified that it'd be difficult to draw any generalized conclusions from that even if it does turn out to be true under measurement.
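To make the channel idea concrete, here is a sketch (untested against OP's setup, and worth measuring before drawing conclusions) that funnels every number through a tokio mpsc channel to a single printing task:

use tokio::sync::mpsc;
use tokio::task::JoinSet;

#[tokio::main(flavor = "multi_thread", worker_threads = 16)]
async fn main() {
    let (tx, mut rx) = mpsc::unbounded_channel::<i32>();

    // The only task that ever touches stdout.
    let printer = tokio::spawn(async move {
        while let Some(i) = rx.recv().await {
            println!("{}", i);
        }
    });

    let mut set = JoinSet::new();
    for i in 0..1_000_000 {
        let tx = tx.clone();
        set.spawn(async move {
            tokio::time::sleep(std::time::Duration::from_secs(10)).await;
            let _ = tx.send(i);
        });
    }
    set.join_all().await;

    drop(tx); // close the channel so the printer can observe the end
    printer.await.unwrap();
    println!("done");
}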


For grins, I implemented OP's program as a shell one-liner: seq 1000000 && sleep 10. On my system, that program completes in just under 11 seconds, most of which is spent writing those million numbers to my terminal. If I redirect the output to a file, it completes in 10.09 seconds.

7 Likes

I'm curious about the timing differences. I don't have a good handle on what User time and System time are exactly.

Go:
User time (seconds): 17.64
System time (seconds): 7.24

Rust:
User time (seconds): 6.54
System time (seconds): 11.62

Whatever Go does to get a shorter wall-clock time appears (to me) to be related to minimising System time, even at the expense of greater User time.

If this is to be believed: unix - What do 'real', 'user' and 'sys' mean in the output of time(1)? - Stack Overflow

Rust took less CPU time "in process", but still took longer, so it seems that it must have been blocked or waiting for a lock much longer than Go was.

It also seems that rust took less User CPU time, i.e. the non-kernel stuff was faster. But the syscalls took longer (i.e. the printing?).

Then the minor page faults and voluntary and involuntary context switches also look interesting!

Rust had fewer page faults (hardly surprising, since it used less memory) and fewer involuntary context switches. But it had over a million voluntary context switches! Almost 4 times what Go had.

That might be the most interesting difference resulting from how tasks are getting scheduled.

1 Like

As others already mentioned, that's probably related to the locking mechanisms for standard output rather than to how tasks are scheduled (which generally doesn't require context switches; that's an essential part of async!). In particular, Rust has a (reentrant) Mutex around its stdout to ensure atomicity of writes. However, when a lot of threads are trying to print at the same time, this results in contention, which leads to some of them waiting for the Mutex to be released, and hence to voluntary context switches. I'm not sure what Go does, but it likely uses a different strategy that doesn't require locking and/or requires fewer syscalls.

4 Likes

So many of these "why is X faster than Rust on this simple test?" posts seem to boil down to stdout.

Is there any discussion of the println!/stdout design out there on the internet? I assume it's a fairly old part of Rust!

1 Like

Hello Everyone!

Thanks for so many inputs.
As @scottmcm suggested, I commented out the printing part and then measured the timings.
It indeed makes a lot of difference.
For Go:

        Command being timed: "./routinetestgo"
        User time (seconds): 7.76
        System time (seconds): 3.19
        Percent of CPU this job got: 84%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:12.91
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 2585524
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 648802
        Voluntary context switches: 145182
        Involuntary context switches: 226
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

For Rust:

        Command being timed: "./routinetestrust"
        User time (seconds): 5.38
        System time (seconds): 5.23
        Percent of CPU this job got: 95%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:11.16
        Average shared text size (kbytes): 0
        Average unshared data size (kbytes): 0
        Average stack size (kbytes): 0
        Average total size (kbytes): 0
        Maximum resident set size (kbytes): 377728
        Average resident set size (kbytes): 0
        Major (requiring I/O) page faults: 0
        Minor (reclaiming a frame) page faults: 93962
        Voluntary context switches: 400077
        Involuntary context switches: 1698
        Swaps: 0
        File system inputs: 0
        File system outputs: 0
        Socket messages sent: 0
        Socket messages received: 0
        Signals delivered: 0
        Page size (bytes): 4096
        Exit status: 0

Now Rust completes faster. I also tried moving the spawn loop into a separate tokio task as suggested by alice. It makes a slight improvement some of the time, but not much, and it is inconsistent across runs.

@derspiny my goal was also to measure resource usage. Just look at the resident memory: there's a huge difference. And that for doing almost nothing.

2 Likes

No, because the channel has a mutex too.

Channels can be implemented with mutexes, but are not necessarily. Tokio’s MPSC implementation uses atomic operations to pick a free slot in storage to write each value into, so there is no mutual exclusion.

2 Likes

Maybe just a bigger I/O buffer.

I think that's why you should probably never use println in a production app and use a high-performance logging implementation instead.
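For instance, a minimal sketch assuming the tracing and tracing-subscriber crates (any structured, buffered logger would make the same point):

fn main() {
    // Installs the default formatting subscriber, which writes to stdout.
    tracing_subscriber::fmt::init();
    // Structured event with a field, instead of an ad-hoc println!.
    tracing::info!(task = 42, "task finished");
}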

It's not just Rust - though Rust does absolutely add its own overhead here. Standard output is complicated because files, terminals, and shells are complicated.

Writing to standard output isn't any faster or slower than writing to any other file handle in principle. That's all standard output is - a file handle that, by convention, you can expect to already be open for writing when your program starts. There are very few constraints as to what that handle can be connected to, and the performance of writes to stdout does depend - heavily - on what's on the other end.

A lot of peoples' experience with standard output being slow is really the experience of writing to a terminal emulator being slow. Standard output for most interactive use is, by default, connected to the same (pseudo) TTY as the terminal the program was launched from. Any write to the file handle has to be delivered to the TTY, and thence to the terminal emulator. Most terminal emulators do complex work with the output they receive, implementing things like colours, terminal positioning commands, fonts, scrolling, split buffers, and myriad other things, all of which takes time. Scrolling often takes the most time, especially if the terminal has a large scroll buffer.

And all of this is opaque to the poor program that wrote the output.

I mentioned in my reply that I got a meaningful performance difference by routing my scratch program's standard output to a file. That wasn't done out of pure random choice; files behave differently from terminals, in a lot of ways, and I've seen those differences often enough that they form part of my intuition about the behaviour of unix programs. File output is buffered in a very different way from terminal output, and file handles generally don't involve any of the complex work that goes into putting text on the screen.

There are a bunch of other things stdout could be connected to, which entail their own performance tradeoffs. systemd socket activation (and, if you're a bit older, xinetd or inetd services) can connect a program's standard file handles to a network or local domain socket, for example, where the performance of writes may be dictated by the network those writes travel over and the behaviour of your OS's network stack. Redirecting output to /dev/null is pretty well known, and file handles connected to the null device are normally very fast (since they don't actually do anything). I could go on.
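As an aside, a program can at least detect the TTY case and adapt; a small sketch using std's IsTerminal trait:

use std::io::{stdout, IsTerminal};

fn main() {
    // Branch on what stdout is connected to, e.g. to disable colours or
    // change buffering strategy when output is redirected.
    if stdout().is_terminal() {
        println!("stdout is a (pseudo) TTY");
    } else {
        println!("stdout is redirected: a file, pipe, socket, /dev/null, ...");
    }
}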

The bottom line, at least as I see it, is that writing to standard output is fine when you don't care about where that output goes or how long it takes to get there. If you need writes to fit within a specific performance envelope, though, then you need to do a bit more work. You can write to stdout very quickly, even in the terminal-output case, but it takes more work than scattering println! macros at appropriate sites. If that's not sufficient, then your program may need to take more control over where its output goes, with all the user-facing tradeoffs that can come with that.

8 Likes

std Rust wraps standard output in a reentrant Mutex, and println! acquires and drops the Stdout lock on every invocation. At least years ago when I had reason to care, explicitly getting the lock and write!ing all output to that was significantly faster than some loop using println!, say.

So[1] std Rust's design is actually a factor.


  1. at least historically, didn't re-check ↩︎
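A minimal sketch of that lock-once pattern, with a BufWriter added on top so a million small writes also collapse into far fewer syscalls:

use std::io::{self, Write};

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    // Take the lock once for the whole run instead of once per println!.
    let mut out = io::BufWriter::new(stdout.lock());
    for i in 0..1_000_000 {
        writeln!(out, "{}", i)?;
    }
    out.flush()
}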

2 Likes

TBH, println! is arguably not something that should ever be used in "real" code. It's important that it exists for demos and quick scripts and stuff, but if the output actually matters, you should be using writeln! in a way that you can #[test] it by writing to a string and not the console, and then the binary just passes in stdout (probably already-locked if that's the only stdout use expected).

Last I looked there was no println! in ripgrep, for example.
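A minimal sketch of that pattern: the logic writes to any io::Write, tests capture the output in a Vec<u8>, and the binary passes in a locked stdout:

use std::io::{self, Write};

fn run(mut out: impl Write) -> io::Result<()> {
    for i in 0..3 {
        writeln!(out, "{}", i)?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // The binary passes in stdout, locked once up front.
    run(io::stdout().lock())
}

#[test]
fn output_matches() {
    // Tests write to a Vec<u8> instead of the console.
    let mut buf = Vec::new();
    run(&mut buf).unwrap();
    assert_eq!(buf, b"0\n1\n2\n".to_vec());
}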

4 Likes

Yeah, I agree. ("[Y]ears ago when I had reason to care" involved online code challenges, not "real" code.)

There's not a blog post about it AFAIK, but I've seen BurntSushi express the same sentiment somewhere in the sigpipe related discussions.

2 Likes

That's right. See: writeln!(io::stdout()) vs println · Issue #17 · BurntSushi/advent-of-code · GitHub

8 Likes