File reading: async / sync performance differences (hyper, tokio)

Hello everybody!

Excuse me for this long post - it's the culmination point of some time of fiddling. I try to break it down into a minimal example, and a longer explanation.

When reading a file, tokio::fs::File seems less efficient (in CPU time) than its sync version. Do you think this will still improve, or is there an inherent performance disadvantage of file async when there is no or not much concurrency?

In this example, I count the bytes of a large file (an old CentOS image of 636 MB). It takes around 80 to 100 ms on my system (Rust 1.39, release build, after some runs so that the file is entirely in
Linux' memory cache).

use std::fs::File;
use std::io::Read;

static FILEPATH: &str = "/home/vh/tmp/CentOS-7-x86_64-Minimal-1503-01.iso";

fn main() {
    let mut counter: usize = 0;
    let mut file = File::open(FILEPATH).unwrap();
    let mut buf = [0u8; 64*1024];
    loop {
        match file.read(&mut buf) {
            Ok(n) => {
                if n == 0 {
                    // eof
                    break;
                } else {
                    counter += n;
                }
            },
            Err(_) => panic!("Error while reading file")
        }
    }
    println!("Read {} bytes.", counter);
}

However, its async counterpart takes between 535 and 620 ms, that is 6.5 times as much.

extern crate futures;
extern crate tokio; // 0.2.0-alpha.6

use futures::executor::block_on;   
use tokio::fs::File;  
use tokio::io::AsyncReadExt;

static FILEPATH: &str = "/home/vh/tmp/CentOS-7-x86_64-Minimal-1503-01.iso";

async fn async_main() {
    let mut file = File::open(FILEPATH).await.unwrap();
    let mut counter: usize = 0;
    let mut buf = [0u8; 64*1024];
    loop {
        match file.read(&mut buf).await {
            Ok(n) => {
                if n == 0 {
                    // eof
                    break;
                } else {
                    counter += n;
                }
            },
            Err(_) => panic!("Error while reading file")
        }
    }
    println!("Read {} bytes.", counter);
}

fn main() {
    block_on(async_main());
}

I know that file operations are rather sychronous on the kernel level... Maybe the synchronous syscalls are so optimised that this result was to be expected?

Background

Of course, my goal is to handle more concurrency than in the example above :slight_smile:

For a free audiobook web site, I want to create on-the-fly ZIP archives (store-only, without compression) of MP3 files. This is currently being done by a Node app that leaks so much memory that it gets restarted
twice a day. So for a long time I've been looking at Rust and its growing async system to replace it! Several
clients download large files with a slow speed. Many connections, no computation, only I/O-bound. Sounds like a nice use case for async!

Before going into the details of zipping, I played around with three examples of serving a file with Hyper, reading it chunk-by-chunk, more or less based on Hyper examples. I measured the time spent by 10 concurrent clients, each reading the file ten times with a five-second wait between: siege -c 10 -d 5 -r 10 $URL.

The first version spawns one thread per client for a loop with synchronous reading and transmitting the read bytes through a Channel. It has excellent performance and memory usage.

The second version is fully async, using Tokio's codec::FramedRead::new(file, codec::BytesCodec::new()) for reading the file. The third version (branch "fileserve_hyper-async-stream" in the repo from the other links; sorry, as a new user I'm only allowed to post two links) does the same, but with a custom implementation of Stream. With pre-1.39 nightly versions, this was more performant, but not with 1.39.

Version Peak RSS usage RSS usage afterwards Siege walltime CPU time
Sync in Channel 8.6 MB 5.4 MB 1:16 min 1:05 min
Async, codec::FramedRead 26.2 MB 7.1 MB 3:40 min 3:28 min
Async, own Stream 24.6 MB 7.5 MB 4:13 min 4:01 min

Of course, I don't have much practice yet in Rust and there may be many differences and possible optimisations between these versions. That's why I made the minimal example above.

I'd be inclined to continue with the threaded version - it seems to perform very well in this situation of limited concurrency. Do you think the async versions should eventually match and outperform the threaded version?

Thank you for reading!
Viktor.

5 Likes

Sorry I don't have any insight into this, but I am very interested in answers. Please post here as you learn more.

By monopolizing a whole thread for your task, synchronous code can perform that task faster than asynchronous code would be able to, however asynchronous code scales much better as the number of concurrent tasks increases, as it can perform a very large amount of tasks concurrently using only a few threads.

3 Likes

Hence, not every place you need to asynchronous code, you have to see which works best in the situation.

As for improving the speed, maybe, but that needs more digging into where the bottleneck is generating from.

In theory, yes, and imo async version should be more performant.

Do test of 1, 10, 100 users and see how it scales.

1 Like

Thank you for your answers!

To better understand the setup costs of async and sync "one-shot" code, I added calls to std::time::SystemTime (for the sync and the async minimal examples).

In the synchronous case, the main() function takes up the entire "real" time:

$ time target/release/rust-async-stream-test
Read 666894336 bytes in 100143 µs

real    0m0.102s
user    0m0.004s
sys     0m0.093s

In the asynchronous case, the insides of the async_main() function also take up nearly the entire time. So it is the async calls, not any reactor setup, that is slower.

$ time target/release/rust-async-stream-test
Read 666894336 bytes in 628086 µs.

real    0m0.637s
user    0m0.283s
sys     0m0.400s

Returning to the web serving examples with Hyper, I tried to increase the concurrency. Maybe 10 concurrent requests are just not enough to show any async benefit. siege is a bad fit for testing this, as it will fully load into memory each response, and my test file has 636 MB. So I switch to ApacheBench:
ab -n $REQUESTS -c $REQUESTS $URL

20 simultaneous requests Peak RSS usage RSS usage afterwards Max. request time CPU time
Sync in channel 17 MB 6.7 MB 0:09 min 0:13 min
Async, codec::FramedRead 24 MB 8.6 MB 0:25 min 0:32 min
Async, own Stream 27 MB 7.3 MB 0:30 min 0:43 min

With 20 concurrent requests, the sync version still wins for both time and memory usage.
But let's go further:

100 simultaneous requests Peak RSS usage RSS usage afterwards Max. request time CPU time
Sync in channel 71 MB 10 MB 0:51 min 1:04 min
Async, codec::FramedRead 27 MB 10 MB 1:52 min 2:49 min
Async, own Stream 28 MB 10 MB 1:56 min 3:27 min
250 simultaneous requests Peak RSS usage RSS usage afterwards Max. request time CPU time
Sync in channel 162 MB 13 MB 2:16 min 2:59 min
Async, codec::FramedRead 74 MB 17 MB 4:05 min 6:51 min
Async, own Stream 29 MB 14 MB 4:24 min 8:28 min

A number of conclusions:

  • Between 20 and 100 users, peak memory usage becomes much better for the async versions, at the cost of CPU time.

  • All runs used more than one core (Hyper doing this all by itself), hence the difference betweent the maximum request time and the CPU time.

  • In all runs, the minimum and the maximum request time were very close to each other; basically, all requests started and ended at the same time. More clients means slower requests, as the bandwidth of "localhost" stays the same. This means Hyper does very good round-robin serving in all cases.

  • RSS usage moved quite a bit for the sync version, I noted the maximum usage; in all async runs, memory consumption went immediately to the noted maximum, stayed there, and came down only at the end of all requests.

  • The 250 requests-run showed a big surprise for me: in my "custom stream" implementation, peak memory usage practically does not move at all, only the requests get slower. The "tokio::codec" solution does take more peak memory.

  • While memory usage goes down after the runs, it stays larger than its initial 1 MB (this was to be expected).

  • As all the runs show, Rust/Hyper gives me excellent backpressure management for free, which is quite outstanding. In each request I'm serving 636 MB here (even if it comes from the file cache and goes to /dev/null).

  • In another test, I did a single, limited-rate download, and canceled it manually. All my Rust versions managed not to leave large file chunks in resident memory, which I'm also very happy about.

2 Likes

What is the total time to finish in all 3 cases? Async versions should finish faster for 100 and 250.

The total time is the maximum request time. All requests start and end at the same time. In my case, async was slower, but used less memory (which for me was always the point about being "scalable").

If maximum request time is the time it takes to finish all requests, then async code should finish faster.

Sync code memory usage maybe high because it may not be dropping memory faster than it is loading, just a guess.

If you can try, then try with 5,000 users, in that case async code is supposed to finish faster.

In this case the tasks are always very large, dwarfing the cost of spawning a thread for every task. This means that async code may not become faster at any number of tasks until you run into the limit on the number of threads.

The advantage of async code becomes much more pronounced if the tasks become smaller. For example serving individual web pages and their associated resources such as scripts, your tasks will be much smaller, and this is where avoiding the rather large cost of creating a thread can win a lot.

Note that this is the job of the executor, not hyper. In this case your executor is probably tokio, and that is where the round-robin serving happens.

1 Like

Yes. And yet, as I'm expecting downloads from slow clients, larger files also means more concurrent requests... for which async looked like the better match. Though these tests show that it probably does not matter much, as long as my site does not get slashdotted.

Thanks to you, and to tokio then! :slight_smile:

Sorry, I don't have the time for 5000 requests today, but here are 1000 (after increasing the nofile limit in /etc/security/limits.conf):

1000 simultaneous requests Peak RSS usage RSS usage afterwards Max. request time CPU time
Sync in channel 573 MB 24 MB 10:29 min 13:00 min
Async, codec::FramedRead 62 MB 46 MB 17:03 min 32:55 min
Async, own Stream 49 MB 38 MB 16:18 min 35:00 min

While the gap seems to be narrowing for the total request time, this is not the case for CPU time.
This is explained by the sync version only going to about 120% CPUs, whereas the async versions
use two full cores and more.

In my view, the "scalable" promise of async was always about having less than linear memory consumption: being able to serve more clients with the same amount of memory. CPU time is a different thing. Whether async uses less CPU time, that may depend about how the OS / the libraries actually implement threads, and how fast the two types of context switching are.

2 Likes

I have a theory that when using a thread per task, the kernel will spend a bunch of time on context switching which might not be counted towards your process, because context switching isn't necessarily part of the cost for any particular process.

Also: A bit pedantic but the memory usage is definitely linear, it just has a much lower constant considering e.g. that threads have a 2 MB stack each.

1 Like

Good point, and very possible that my measured "CPU time" does not include those switches and should be higher. But in all my tests, the total wall time it took for the requests was always lower for the sync version, so this hidden cost does not reverse the picture. Maybe it will for 5000 requests.

You're right, that's how I should have worded it. Is it actually? Some part inside it certainly is linear, but when I look at the total numbers for the third version - 27 MB for 20 requests, 28 MB for 100 requests, 29 MB for 250 request, and 49 MB for 1000 requests - it looks pleasantly non-linear to me.

Can you test 1000 users for small sized data like 2KB? See if Sync version is still better?

I mean, every task you spawn on the executor uses up at least one byte of memory, so the memory usage has a linear function as a lower bound.

A quick look at the async version (of the code you pasted inline) in perf reveals that tokio::fs::File reads into an intermediate 16k buffer, and then copies the data into your buffer (and perf shows most of the time spent doing memcpy); as opposed to the synchronous version which read()s directly into your buffer. Since the file is buffer-cached like you said, we are not actually I/O bound, but CPU or RAM-latency bound. However, when you extend this over a real network, you're likely to be network-latency-bound. I'd suggest benchmarking after inducing a latency with tc qdisc.

Edit: also, try the synchronous version with a 16k buffer, as that's what you're really getting with tokio::fs::File..

4 Likes

It took some time to test for 5000 simultaneous requests, I don't think there are any new conclusions with regards to the run with 1000, except that peak RSS usage does not drop off by much afterwards (it does drop off more after a very long time!), and by now, my system comes to its limits, as some errors start showing up in the ApacheBench summary (and sometimes on stdout), even with the nofile limit lifted. But the total number of bytes transmitted did not seem far off.

5000 simultaneous requests Peak RSS usage RSS usage afterwards Max. request time CPU time
Sync in channel 1661 MB 201 MB 47:39 min 61:56 min
Async, codec::FramedRead 250 MB 199 MB 94:48 min 196:47 min
Async, own Stream 163 MB 162 MB 86:02 min 196:01 min

Finding a good smaller file size is not easy. The suggested 2 KB are too small to test - the threaded version served 1000 requests within 200 milliseconds! Even 1000 requests for a 2 MB file finished within 1.7 seconds. Maybe I'd need to switch back to Sieve, where each "client" can repeat requests. Interestingly, 5000 simultaneous requests with ApacheBench do not give any errors with the first two versions, and only a few with the third one, when it's about a 2 MB file:

5000 simultaneous requests, 2 MB file Peak RSS usage RSS usage afterwards Max. request time CPU time
Sync in channel 174 MB 37 MB 0:15 min 0:12 min
Async, codec::FramedRead 180 MB 136 MB 0:19 min 0:37 min
Async, own Stream 153 MB 150 MB 0:18 min 0:38 min

It's interesting to compare the two 5000er runs. The memory usage from the async versions is roughly comparable - clearly, it's the number of concurrent connections that is responsible for the memory usage, not the file size. In this scenario, it does not take up less memory than the sync version! Whereas the sync version is more dependent on the file size (or the download duration). CPU time spent, however, is vastly lower, by more than the factor 300 between the two file sizes.

Here I'm back to the "minimal" example, where I'm just reading files and counting bytes. Changing the buffer size to 16k made the sync version marginally slower (95-120 ms instead of 80-100 ms) but the distance to its async counterpart stays in the same order.

Memories come up! It must have been 2004/2005 when I played last with that (in order to ensure fair treatment of a shared wi-fi router behind a Linux box in a student residence!). I admit I don't have the time to delve into that now - what I did do previously was a single curl --limit-rate and observing memory consumption, in order to test backpressure management, which worked marvelously for all cases.

Of course you're right that in a real-world scenario I will be network-latency-bound (and the 5000 clients also won't arrive in the very same millisecond). But with my tests above, I'm optimistic that backpressure management works, and thus whatever network-latency gets involved, it won't put the system into a worse state (concerning CPU and memory) than the numbers above.

My general conclusion is that, as expected, async shows great advantages with regards to memory consumption in high-concurrency situations, but as long as concurrency is low and is supposed to stay low, multi-threading can be more efficient. Definitions of "high" and "low" depend on the use case, of course. While probably no one will advocate for a multi-threaded WebSocket listener, for my modest needs, especially if I need a library that is currently synchronous (in my case, a ZIP archive creator), it looks like a good plan to use it as-is and keep things "sync" instead of starting immediately to rewrite the library for async.

3 Likes

The 400ms spent in the kernel is what I find fishy.. I think that in itself warrants further investigation. There's no reason the async version should spend more kernel time unless 1) there's simply something different about the way time is accounted for, or 2) there's some inefficiency in the way the kernel is being used. Possibility #2 is why I suggested benching the sync version with the smallest buffer to see if that was the issue, but as you found, it seems not to be the case.

Also, since you're using the alpha overhauled tokio and scheduler (though I doubt the reactor changed much?), I wouldn't be surprised if this was a bug, and it might be worth reporting this to tokio so they can explain or track down the issue.

2 Likes

I did open an issue: File reading: Performance regression on 0.2.0-alpha.6 compared to 0.1.22 · Issue #1844 · tokio-rs/tokio · GitHub

I tried to cobble together a version using the old Tokio, 0.1.22 and it turns out to be exactly as efficient as the synchronous version! (4-16 ms in userland and 85-120 ms in kernel space)

Independently on how this will turn out, I'll be curious to re-run these examples in one or two years time. I already tested the "sync" file serving example with Hyper in January 2018 and noted down the performance then. With current Rust and crates, the same code has doubled in speed since then :slight_smile:

2 Likes

That's because the version your wrote is synchronous. Using try_ready is not equivalent to await, instead it's a version of ? that returns the function when the value is NotReady.

Are you sure the version you posted even reads the entire file? As I read it, it should exit immediately when poll_read is not ready.

1 Like

Very well possible that my example is flawed. On the positive side, the Tokio devs are aware of a temporary regression in the filesystem I/O. I confirm that the file is read completely, on each run, as the correct file size gets printed on stdout.

I tried to move the try_ready in an explicit Future object, in a dedicated poll() method. The performance stays the same. Maybe this still is not fully asynchronous, but at least now my "BytesCounter" future should be fully equivalent to the "Display10" future in the Tokio docs which also uses try_ready:

To use the stream, a future must be built that consumes it. The following future will take a stream and display 10 items from it.

I admit that I'm piecing this together from old docs and tutorials without deeply understanding it. (I reserve my hope for better understanding to the new async/await syntax and the new library versions :slight_smile: ) If someone has a suggestion for a better "asynchronous" test with Tokio 0.1, I'm all ears.