Blocking Permit

Alternative title: "I block your reactor threads and `Sink` your battleship."

http://gravitext.com/2020/01/13/blocking-permit.html

This blog post is about a couple of benchmarks along with some new crate releases, but it inevitably draws conclusions about the nature of blocking operations in asynchronous runtimes.

It's sort of a follow-up to Futures 0.3, async♯await experience snapshot

To quote the new post:

…the massive async♯await upgrade has been completed, including a reasonably pleasant experience of re-writing all its tests and benchmarks in generic, async♯await style (here with a total of 34 uses of await and test LoC changes: 1636 insertions, 622 deletions). I've also released (besides blocking-permit) a new set of body-image-* 2.0.0 crates using the latest tokio, hyper, and futures.

What do you think? Anyone interested in helping to benchmark async-std in the same way as tokio, as used in these benchmarks?


Some related questions:

  • Does hyper-tls or native-tls or tokio-tls offload packet decryption from the reactor thread? If any of these do, I haven’t been able to find where it happens. Please point it out for me.
  • Does hyper’s HTTP 1 chunked transfer encoding implementation attempt to offload Base64 encode/decode from the reactor thread? Should it?

Does anyone know the answers here? I suspect that filing question-type issues on these crates' GitHub repositories is unlikely to be the best way to request comment. Any alternative suggestions?

@Nemo157 is pretty responsive (in past experience) so I've commented here:

To be honest, I didn't think about it at all when starting the project. I just followed the design passed down from on high, given that alexcrichton/{flate2, brotli2} already contained similar adapters for the Tokio 0.1 IO traits.

Since then, I've seen a similar sort of question asked a couple of times. My initial thought is that running the compression on the executor doesn't matter for the original use cases (most times I mention compression I mean either compression or decompression). Given that those use cases are web servers/clients, my expectation is that the combination of low network bandwidth and low compression ratio would result in the compression bandwidth vastly exceeding the network bandwidth. Once the compressor gets far enough ahead of the network, a Pending result would be returned, allowing something else to run. As you mention, this would likely result in compressing something like an 8K block at a time, depending on the underlying stream.

Some very basic benchmarking of flate2's Gzip implementation shows it takes around 500us to encode an 8KiB block, giving ~130Mb/s bandwidth. That is a short enough blocking time and a high enough bandwidth that I think my expectation holds for the internet use cases mentioned above.

The more interesting question might be something like writing to an SSD using async file IO, given that SSDs can have way more bandwidth. For that sort of use case, if you are producing data fast enough and are trying to saturate the disk IO, I think it would make more sense to use blocking IO. But if necessary you could probably have an AsyncBufRead adapter underneath the encoder that yields every few blocks to allow other tasks a chance to run.


Thanks very much for the detailed answer to a question from out of the blue!

I see 512KiB blocks through hyper under certain conditions (I'm not exactly certain of the cause). If I modified your basic benchmark correctly, that raised it to 43ms (edit: milliseconds) per block. Is that enough to be concerned about? Should your and my Streams and Sinks be doing partial block processing when blocks get large?
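As a sanity check on these numbers, here is a back-of-the-envelope sketch of the arithmetic (the function names are made up for illustration). 8KiB in 500us works out to roughly 131Mb/s, and scaling that linearly to a 512KiB block predicts about 32ms of blocking per poll, somewhat below the 43ms measured above.

```rust
use std::time::Duration;

/// Throughput in megabits per second (10^6 bits/s) when one block of
/// `block_bytes` takes `elapsed` to process.
fn throughput_mbit_per_s(block_bytes: usize, elapsed: Duration) -> f64 {
    let bits = block_bytes as f64 * 8.0;
    bits / elapsed.as_secs_f64() / 1_000_000.0
}

/// Time a single poll would spend on one block of `block_bytes`, assuming
/// the cost scales linearly with size at the given throughput.
fn blocking_time(block_bytes: usize, mbit_per_s: f64) -> Duration {
    let secs = block_bytes as f64 * 8.0 / (mbit_per_s * 1_000_000.0);
    Duration::from_secs_f64(secs)
}
```

Calling `throughput_mbit_per_s(8 * 1024, Duration::from_micros(500))` gives about 131.07, and feeding that into `blocking_time(512 * 1024, ...)` gives about 32ms. The linear-scaling assumption is only an approximation; the measured figure suggests larger blocks cost a bit more per byte.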

It seems like futures? blocking-permit? or some other crate should have testing Stream/Sink wrappers that collect summary statistics (mean, max, 95th percentile, 99th percentile, etc.) of poll time so we can figure out what an acceptable amount of blocking (wall) time really is?
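To make the idea concrete, here is a minimal sketch of such a wrapper using only the standard library. It wraps a plain Future rather than a Stream or Sink (those traits live in the futures crate), and `PollStats` and `Timed` are hypothetical names, not types from futures or blocking-permit. The same pattern should carry over to `poll_next` and `poll_ready`/`start_send`.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};
use std::time::{Duration, Instant};

/// Wall-clock durations of individual poll calls (hypothetical type).
#[derive(Debug, Default)]
struct PollStats {
    samples: Vec<Duration>,
}

impl PollStats {
    fn record(&mut self, d: Duration) {
        self.samples.push(d);
    }

    fn max(&self) -> Option<Duration> {
        self.samples.iter().copied().max()
    }

    fn mean(&self) -> Option<Duration> {
        if self.samples.is_empty() {
            return None;
        }
        let total: Duration = self.samples.iter().sum();
        Some(total / self.samples.len() as u32)
    }

    /// Nearest-rank percentile for `p` in (0, 100].
    fn percentile(&self, p: f64) -> Option<Duration> {
        if self.samples.is_empty() {
            return None;
        }
        let mut sorted = self.samples.clone();
        sorted.sort();
        let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
        Some(sorted[rank.clamp(1, sorted.len()) - 1])
    }
}

/// Wraps a future and times every call to its `poll` (hypothetical type).
struct Timed<F> {
    inner: Pin<Box<F>>,
    stats: PollStats,
}

impl<F: Future> Timed<F> {
    fn new(inner: F) -> Self {
        Timed { inner: Box::pin(inner), stats: PollStats::default() }
    }
}

impl<F: Future> Future for Timed<F> {
    /// The inner output together with the collected poll timings.
    type Output = (F::Output, PollStats);

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        // `Timed` is Unpin because the inner future is boxed.
        let this = self.get_mut();
        let start = Instant::now();
        let res = this.inner.as_mut().poll(cx);
        this.stats.record(start.elapsed());
        match res {
            Poll::Ready(v) => Poll::Ready((v, std::mem::take(&mut this.stats))),
            Poll::Pending => Poll::Pending,
        }
    }
}
```

Boxing the inner future avoids pin projection at the cost of one allocation; a real testing wrapper would more likely use the pin-project crate and a histogram instead of storing every sample.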

My evolving thesis is that if "blocking I/O" to NVMe SSD is below some threshold, then the direct strategy is fully justified and legitimate, just like your approach to compression.

Some updates on the above highlighted open questions:

Does hyper-tls or native-tls or tokio-tls offload packet decryption from the reactor thread?

Nope. According to @sfackler (on #tokio-users discord, thanks!) there is no provision for offloading TLS en-/decrypt from the reactor threads (at least with the default TLS stack used here). However TLS specifies a maximum record size of 16KiB, so "you won't be doing more encrypt/decrypt work than that per call" and "hardware-accelerated AES runs at multiple gigabytes per second iirc". So this is a simpler case and (in my testing) unlikely to benefit from any attempt to move off a reactor thread.

However (and thanks again @Nemo157) with additional testing I do find that I'm receiving Bytes items from hyper 0.13.1's response body Stream at upwards of 512KiB in size, by default. There is an easy change for comparative benchmarking:

        let client = hyper::client::Client::builder()
+           .http1_max_buf_size(64 * 1024)
            .build(connector);

And this has measurable effects on the client benchmarks described in my original blog post:

  • On my laptop (less important) with in-process server, it actually makes fs and mmap (tmpfs) benches significantly faster.

  • On the ec2 virtual hosts (more important!), it makes things considerably slower (particularly the ram benches).

For approximately size-linear operations like decompression, I find that significant. Beyond the hyper client config tweak, my Sink implementations are in position to potentially do partial (sub-chunked?) processing. Hopefully I'll be able to compose that with another Sink wrapper type over my 6 others, splitting the original Bytes or (1) MemMapBuf into many via something like Bytes::split_to and more Arcs.
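A rough sketch of that sub-chunking idea follows. To stay dependency-free it uses a hand-rolled `SharedBuf` (a hypothetical stand-in, not the bytes crate's type) whose `split_to` mimics the documented behavior of `Bytes::split_to`: the returned value covers the first `at` bytes, `self` keeps the rest, and both share one allocation through an Arc.

```rust
use std::sync::Arc;

/// Cheaply cloneable view into a shared byte buffer (hypothetical stand-in
/// for `bytes::Bytes`).
#[derive(Clone)]
struct SharedBuf {
    data: Arc<[u8]>,
    start: usize,
    end: usize,
}

impl SharedBuf {
    fn new(v: Vec<u8>) -> Self {
        let end = v.len();
        SharedBuf { data: Arc::from(v), start: 0, end }
    }

    fn len(&self) -> usize {
        self.end - self.start
    }

    fn is_empty(&self) -> bool {
        self.len() == 0
    }

    fn as_slice(&self) -> &[u8] {
        &self.data[self.start..self.end]
    }

    /// Split off the first `at` bytes; `self` keeps the remainder.
    /// No bytes are copied, only another Arc reference is taken.
    fn split_to(&mut self, at: usize) -> SharedBuf {
        assert!(at <= self.len(), "split index out of range");
        let head = SharedBuf {
            data: Arc::clone(&self.data),
            start: self.start,
            end: self.start + at,
        };
        self.start += at;
        head
    }
}

/// Break one large buffer into pieces of at most `max_len` bytes, so each
/// piece can be processed in its own, shorter poll.
fn sub_chunks(mut buf: SharedBuf, max_len: usize) -> Vec<SharedBuf> {
    assert!(max_len > 0, "max_len must be non-zero");
    let mut out = Vec::new();
    while !buf.is_empty() {
        let n = max_len.min(buf.len());
        out.push(buf.split_to(n));
    }
    out
}
```

For example, a 150KiB buffer split with `max_len` of 64KiB yields three pieces of 64KiB, 64KiB, and 22KiB. A Sink wrapper would feed these to the inner Sink one at a time instead of collecting them into a Vec.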

As for my thesis, I'll be moving forward with using synchronous (blocking) tmpfs I/O on reactor threads in production, with the library retaining the option to offload (async) these, which thus far is really just a complicating solution in search of a problem.

Just an answer on the TLS question: I can't provide the references, but people have benchmarked it before. Moving encryption/decryption away from the network thread is typically counterproductive. The actual data encryption is very fast, thanks in part to hardware acceleration, and all kinds of thread changes are expensive on their own. The TLS handshake is another thing. It's a lot more computationally expensive, so there it can make sense to move it off to another thread.

Also keep in mind that in servers already all threads will be busy doing request handling. There are typically no spare threads just for doing compression/decryption. If you introduced them, you would have fewer CPU resources available for the remaining things, plus the overhead of moving work across threads. Having external computation threads is more helpful for decreasing latency, and for making sure that one CPU-intensive task doesn't block other processing too much.

For work where the blocking time is rather small it doesn't pay off. If it's too much the work can potentially be split via some async yield expressions, so that other tasks have a chance to work in between. That can certainly be an approach for a compression crate.
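Splitting work with yield points might look roughly like the sketch below. `YieldNow` here is a hand-rolled future written for illustration (an assumption about the general shape of such a primitive, not code taken from any runtime), and `checksum_in_blocks` is a hypothetical stand-in for per-block compression work.

```rust
use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

/// Returns Pending exactly once, after asking to be polled again, so the
/// executor gets a chance to run other tasks in between.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            return Poll::Ready(());
        }
        self.yielded = true;
        // Wake immediately: we are not waiting on anything, just yielding.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

fn yield_now() -> YieldNow {
    YieldNow { yielded: false }
}

/// CPU-bound work processed one block at a time, yielding after each block
/// so that no single poll runs longer than one block's worth of work.
async fn checksum_in_blocks(data: &[u8], block_len: usize) -> u64 {
    let mut sum = 0u64;
    for block in data.chunks(block_len) {
        sum += block.iter().map(|&b| u64::from(b)).sum::<u64>();
        yield_now().await;
    }
    sum
}
```

With 10 bytes of input and a block length of 4, the future returns Pending three times (once per block) before completing. How soon the task is polled again after each yield depends on the executor's scheduling.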


Thanks @Matthias247! Regarding TLS, that certainly matches my testing results.

in servers already all threads will be busy doing request handling…no spare threads

While I don't disagree with your assessment, please note the apparent conflict with Tokio's conventional wisdom. Tokio 0.2's async FS stuff uses a dedicated separate blocking thread pool. Its AsyncRead/AsyncWrite trait implementations are hard-wired to move buffers and work to that pool. My testing finds that it's the least efficient approach of any available! What surprises me even more in my testing results is that even block_in_place is less efficient (regardless of Semaphore implementation) than direct blocking I/O.
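To illustrate where the extra cost of offloading comes from, here is a simplified std-only sketch (all names hypothetical, and a single worker thread standing in for a blocking pool; this is not how Tokio implements it). The direct path is one function call. The offloaded path has to move the buffer through a channel, wake another thread, and move the buffer and result back.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for a short blocking operation (hypothetical); a real one would
/// be a read or write against tmpfs or an SSD.
fn blocking_op(buf: &[u8]) -> u64 {
    buf.iter().map(|&b| u64::from(b)).sum()
}

/// Direct strategy: run the operation on the calling thread.
fn direct(buf: Vec<u8>) -> (Vec<u8>, u64) {
    let sum = blocking_op(&buf);
    (buf, sum)
}

type Reply = (Vec<u8>, u64);
type Job = (Vec<u8>, mpsc::Sender<Reply>);

/// Start one long-lived worker thread; it exits when the returned sender
/// is dropped.
fn spawn_worker() -> mpsc::Sender<Job> {
    let (tx, rx) = mpsc::channel::<Job>();
    thread::spawn(move || {
        for (buf, reply) in rx {
            let sum = blocking_op(&buf);
            // Ignore the error if the caller stopped waiting.
            let _ = reply.send((buf, sum));
        }
    });
    tx
}

/// Offloaded strategy: hand the buffer to the worker and wait for it to
/// come back. An async runtime would await here instead of blocking.
fn offloaded(worker: &mpsc::Sender<Job>, buf: Vec<u8>) -> Reply {
    let (reply_tx, reply_rx) = mpsc::channel();
    worker.send((buf, reply_tx)).expect("worker thread is alive");
    reply_rx.recv().expect("worker thread replied")
}
```

Both paths return the same result; timing `direct` against `offloaded` for small buffers is one way to see how much of the total is hand-off overhead rather than the operation itself.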

Perhaps all this conventional wisdom echoes from yesterday's single-threaded reactors and blocking I/O with spinning media? Last I saw, v8/node-js didn't have "fearless concurrency" or threads?

For work where the blocking time is rather small it doesn't pay off.

Right, so the next step for me is to actually measure the latency distribution for the blocking I/O. What exactly is "rather small" enough for my application?

If it's too much the work can potentially be split via some async yield expressions,

Are you referring to tokio::task::yield_now? The rustdoc for that is rather unhelpful. What does it do, and what is the best definition of a task in relation to my Streams and Sinks? Apparently I'm not the only one who finds the task concept a bit elusive, see "What is a task?" here: