Blocking Permit

dekellum · January 13, 2020, 4:03pm

Alternative title: "I block your reactor threads and `Sink` your battleship."

http://gravitext.com/2020/01/13/blocking-permit.html

This blog post is about a couple of benchmarks along with some new crate releases, but it inevitably draws conclusions about the nature of blocking operations in asynchronous runtimes.

It's sort of a follow-up to "Futures 0.3, async/await experience snapshot".

To quote the new post:

…the massive async/await upgrade has been completed, including a reasonably pleasant experience of re-writing all its tests and benchmarks in generic, async/await style (here with a total of 34 uses of await and test LoC changes: 1636 insertions, 622 deletions). I've also released (besides blocking-permit) a new set of body-image-* 2.0.0 crates using the latest tokio, hyper, and futures.

What do you think? Anyone interested in helping to benchmark async-std in the same way as tokio, as used in these benchmarks?

dekellum · January 13, 2020, 10:24pm

Some related questions:

Does hyper-tls or native-tls or tokio-tls offload packet decryption from the reactor thread? If any of these do, I haven’t been able to find where it happens. Please point it out for me.

Does hyper’s HTTP 1 chunked transfer encoding implementation attempt to offload Base64 encode/decode from the reactor thread? Should it?

Anyone know the answers here? I suspect filing question-type issues on these crates' GitHub repositories is unlikely to be the best way to request comment. Any alternative suggestions?

Does the newer async-compression worry about blocking a reactor thread? Should it?

@Nemo157 is pretty responsive (in past experience) so I've commented here:

Nemo157 · January 14, 2020, 10:13am

To be honest, I didn't think about it at all when starting the project. I just followed the design passed down from on high, given that alexcrichton/{flate2, brotli2} already contained similar adapters for the Tokio 0.1 IO traits.

Since then, I've seen a similar sort of question asked a couple of times. My initial thought is that running the compression on the executor doesn't matter for the original use cases (most times I mention compression I mean either compression or decompression). Given that those use cases are web servers/clients, my expectation is that the combination of low network bandwidth and low compression ratio would result in the compression bandwidth vastly exceeding the network bandwidth. Once the compressor gets far enough ahead of the network, a Pending result would be returned, allowing something else to run. As you mention, this would likely result in compressing something like an 8K block at a time, depending on the underlying stream.

Some very basic benchmarking of flate2's Gzip implementation shows it takes around 500µs to encode an 8KiB block, giving ~130Mb/s bandwidth. That is a short enough blocking time and a high enough bandwidth that I think my expectation holds for the internet use cases mentioned above.
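As a quick check of that arithmetic, here is a minimal sketch (the function name is made up for illustration) that converts a block size and a per-block encode time into throughput. It assumes "Mb/s" means megabits per second, which is the reading under which 8KiB in 500µs lands near the quoted ~130.

```rust
use std::time::Duration;

/// Throughput in megabits per second (10^6 bits/s) when one block of
/// `block_bytes` bytes takes `elapsed` wall time to process.
/// Hypothetical helper, not part of flate2 or async-compression.
fn throughput_mbit_per_s(block_bytes: u64, elapsed: Duration) -> f64 {
    let bits = (block_bytes * 8) as f64;
    bits / elapsed.as_secs_f64() / 1e6
}

/// The same measurement expressed in megabytes per second (10^6 bytes/s).
fn throughput_mbyte_per_s(block_bytes: u64, elapsed: Duration) -> f64 {
    block_bytes as f64 / elapsed.as_secs_f64() / 1e6
}
```

For an 8KiB block (8192 bytes) and 500µs, `throughput_mbit_per_s(8 * 1024, Duration::from_micros(500))` comes out at roughly 131 Mbit/s, or about 16.4 MB/s. The per-block time is also the longest stretch the executor thread is held for a single block.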

The more interesting question might be something like writing to an SSD using async file IO, given that SSDs can have way more bandwidth. For that sort of use case, if you are producing data fast enough and are trying to saturate the disk IO, I think it would make more sense to use blocking IO. But if necessary you could probably have an AsyncBufRead adapter underneath the encoder that yields every few blocks to allow other tasks a chance to run.

dekellum · January 14, 2020, 3:15pm

Thanks very much for the detailed answer to a question from out of the blue!

I see 512KiB blocks through hyper under certain conditions (I'm not exactly certain of the cause). If I modified your basic benchmark correctly, that raised it to 43ms (edit: milliseconds) per block. Is that enough to be concerned about? Should your and my Streams and Sinks be doing partial block processing when blocks get large?

It seems like futures, blocking-permit, or some other crate should have testing Stream/Sink wrappers that collect summary statistics (mean, max, 95th percentile, 99th percentile, etc.) of poll time, so we can figure out what an acceptable amount of blocking (wall) time really is.
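A rough, std-only sketch of what such a wrapper might look like. To stay dependency-free it wraps a plain `Future` rather than a `Stream` or `Sink`, on the assumption that the same timing could be applied inside `poll_next` or `poll_ready`. All names here (`TimedPoll`, `PollStats`, `summarize`, `block_on`) are hypothetical and not from any existing crate, and the percentiles use the simple nearest-rank method.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::time::{Duration, Instant};

/// Hypothetical wrapper: records the wall time of every `poll` of `inner`.
struct TimedPoll<F> {
    inner: Pin<Box<F>>,
    samples: Vec<Duration>,
}

impl<F: Future> TimedPoll<F> {
    fn new(inner: F) -> Self {
        TimedPoll { inner: Box::pin(inner), samples: Vec::new() }
    }
}

impl<F: Future> Future for TimedPoll<F> {
    // The inner output plus one duration per `poll` call.
    type Output = (F::Output, Vec<Duration>);

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
        let start = Instant::now();
        let result = self.inner.as_mut().poll(cx);
        self.samples.push(start.elapsed());
        match result {
            Poll::Ready(out) => Poll::Ready((out, std::mem::take(&mut self.samples))),
            Poll::Pending => Poll::Pending,
        }
    }
}

#[derive(Debug, Clone, PartialEq)]
struct PollStats {
    count: usize,
    mean: Duration,
    max: Duration,
    p95: Duration,
    p99: Duration,
}

/// Summarizes poll durations; returns `None` when there are no samples.
fn summarize(samples: &[Duration]) -> Option<PollStats> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort();
    let n = sorted.len();
    let total: Duration = sorted.iter().sum();
    // Nearest rank: the smallest sample with at least p% of samples at or below it.
    let rank = |p: usize| sorted[((p * n + 99) / 100).max(1) - 1];
    Some(PollStats {
        count: n,
        mean: total / n as u32,
        max: sorted[n - 1],
        p95: rank(95),
        p99: rank(99),
    })
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor, only good enough to drive the sketch.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

Usage would be along the lines of `let (out, polls) = block_on(TimedPoll::new(some_future));` followed by `summarize(&polls)`. In a real test one would run it on the actual runtime rather than this toy executor.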

My evolving thesis is that if "blocking I/O" to NVMe SSD is below some threshold, then the direct strategy is fully justified and legitimate, just like your approach to compression.

dekellum · January 16, 2020, 10:19pm

Some updates on the above highlighted open questions:

Does hyper-tls or native-tls or tokio-tls offload packet decryption from the reactor thread?

Nope. According to @sfackler (on #tokio-users discord, thanks!) there is no provision for offloading TLS en-/decrypt from the reactor threads (at least with the default TLS stack used here). However TLS specifies a maximum record size of 16KiB, so "you won't be doing more encrypt/decrypt work than that per call" and "hardware-accelerated AES runs at multiple gigabytes per second iirc". So this is a simpler case and (in my testing) unlikely to benefit from any attempt to move off a reactor thread.

However (and thanks again @Nemo157) with additional testing I do find that I'm receiving Bytes items from hyper 0.13.1's response body Stream at upwards of 512KiB in size, by default. There is an easy change for comparative benchmarking:

        let client = hyper::client::Client::builder()
+           .http1_max_buf_size(64 * 1024)
            .build(connector);

And this has measurable effects on the client benchmarks described in my original blog post:

On my laptop (less important) with in-process server, it actually makes fs and mmap (tmpfs) benches significantly faster.
On the EC2 virtual hosts (more important!), it makes things considerably slower (particularly the ram benches).

For approximately size-linear operations like decompression, I find that significant. Beyond the hyper client config tweak, my Sink implementations are in a position to potentially do partial (sub-chunked?) processing. Hopefully I'll be able to compose that with another Sink wrapper type over my 6 others. That would mean splitting the original Bytes or (single) MemMapBuf into many, via something like Bytes::split_to and more Arcs.
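To make the sub-chunking idea concrete, here is a minimal std-only sketch using plain slices. It is not the body-image implementation: the function name is hypothetical, and with `bytes::Bytes` the equivalent would presumably be repeated `split_to` calls, each handing out a cheap handle onto the same underlying buffer instead of a borrowed slice.

```rust
/// Feeds `item` to `process` in pieces of at most `max_len` bytes and
/// returns how many pieces were produced. Hypothetical helper.
fn process_in_sub_chunks<F>(item: &[u8], max_len: usize, mut process: F) -> usize
where
    F: FnMut(&[u8]),
{
    // `chunks` panics on a zero length, so fail with a clearer message first.
    assert!(max_len > 0, "max_len must be non-zero");
    let mut pieces = 0;
    for piece in item.chunks(max_len) {
        // Each call now does a bounded amount of work, for example
        // decompressing or writing at most `max_len` bytes.
        process(piece);
        pieces += 1;
    }
    pieces
}
```

With a 512KiB item and a 64KiB limit this makes 8 calls of 64KiB each, so the time spent in any single call is bounded by the limit rather than by whatever size hyper happens to deliver.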

As for my thesis, I'll be moving forward with using synchronous (blocking) tmpfs I/O on reactor threads in production, with the library retaining the option to offload these (async), which thus far is really just a complicating solution in search of a problem.

Matthias247 · January 17, 2020, 5:44am

Just an answer on the TLS question: I can't provide the references, but people have benchmarked it before. Moving encryption/decryption away from the network thread is typically counterproductive. The actual data encryption is very fast, thanks in part to hardware acceleration, and all kinds of thread changes are expensive on their own. The TLS handshake is another thing: it's a lot more computationally expensive, so there it can make sense to move it off to another thread.

Also keep in mind that in servers already all threads will be busy doing request handling. There are typically no spare threads just for doing compression/decryption. If you introduced them, you would have fewer CPU resources available for the remaining things, plus the overhead of moving work across threads. Having external computation threads is more helpful for decreasing latency, and for making sure that one CPU-intensive task doesn't block other processing too much.

For work where the blocking time is rather small it doesn't pay off. If it's too much the work can potentially be split via some async yield expressions, so that other tasks have a chance to work in between. That can certainly be an approach for a compression crate.
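A minimal sketch of that splitting idea, using only the standard library. `YieldNow` is a hand-rolled stand-in for whatever yield primitive the runtime provides, `checksum` is a hypothetical placeholder for the real per-block work (such as compressing one block), and `block_on_counting` is a toy executor that exists only to show how often control returns to the scheduler.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Returns `Pending` once (after requesting a wake-up), then `Ready`.
struct YieldNow {
    yielded: bool,
}

impl Future for YieldNow {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            return Poll::Ready(());
        }
        self.yielded = true;
        // Ask to be polled again, then hand control back to the executor.
        cx.waker().wake_by_ref();
        Poll::Pending
    }
}

fn yield_now() -> YieldNow {
    YieldNow { yielded: false }
}

/// Hypothetical CPU-bound step standing in for compressing one block.
fn checksum(block: &[u8]) -> u64 {
    block.iter().map(|&b| b as u64).sum()
}

/// Processes `data` block by block, yielding after every `blocks_per_yield`
/// blocks so other tasks on the same thread get a chance to run.
async fn process_with_yields(data: &[u8], block_len: usize, blocks_per_yield: usize) -> u64 {
    assert!(block_len > 0 && blocks_per_yield > 0);
    let mut total = 0;
    for (i, block) in data.chunks(block_len).enumerate() {
        total += checksum(block);
        if (i + 1) % blocks_per_yield == 0 {
            yield_now().await;
        }
    }
    total
}

struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy busy-polling executor; returns the output and the number of polls.
fn block_on_counting<F: Future>(fut: F) -> (F::Output, usize) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return (out, polls);
        }
    }
}
```

Each yield costs one extra poll, so the yield interval is a trade-off between scheduling overhead and the longest uninterrupted stretch of work. On a real multi-task runtime, the gap between polls is where other tasks would run.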

dekellum · January 17, 2020, 5:56pm

Thanks @Matthias247! Regarding TLS, that certainly matches my testing results.

in servers already all threads will be busy doing request handling…no spare threads

While I don't disagree with your assessment, please note the apparent conflict with Tokio's conventional wisdom. Tokio 0.2's async FS stuff uses a dedicated, separate blocking thread pool. Its AsyncRead/AsyncWrite trait implementations are hard-wired to move buffers and work to that pool. My testing finds that it's the least efficient approach of any available! What surprises me even more in my testing results is that even block_in_place is less efficient (regardless of Semaphore implementation) than direct blocking I/O.

Perhaps all this conventional wisdom echoes from yesterday's single threaded reactors and blocking I/O with spinning media? Last I saw v8/node-js didn't have "fearless concurrency" or threads?

For work where the blocking time is rather small it doesn't pay off.

Right, so the next step for me is to actually measure the latency distribution for the blocking I/O. What exactly is "rather small" enough for my application?

If it's too much the work can potentially be split via some async yield expressions,

Are you referring to tokio::task::yield_now? The rustdoc for that is rather unhelpful. What does it do, and what is the best definition of a task in relation to my Streams and Sinks? Apparently I'm not the only one who finds the task concept a bit elusive; see "What is a task?" here:

github.com/tokio-rs/tokio

task: Introduce a new pattern for task-local storage

tokio-rs:master ← tokio-rs:lucio/task-local

opened 07:20PM - 16 Jan 20 UTC

LucioFranco

+274 -0

# Introduce a new pattern for task-local storage

This PR introduces a new pattern for task-local storage. It allows for storage and retrieval of data in an asynchronous context. It does so using a new pattern based on past experience.

This API is similar to the one from `std` except for changes that were needed to be made due to the differences required by async code.

A quick example:

```rust
tokio::task_local! {
    static FOO: u32;
}

FOO.scope(1, async move {
    some_async_fn().await;
    assert_eq!(FOO.get(), 1);
}).await;
```

## Background of task-local storage

The goal for task-local storage is to be able to provide some ambient context in an asynchronous context. One primary use case is for distributed tracing style systems where a request identifier is made available during the context of a request / response exchange. In a synchronous context, thread-local storage would be used for this. However, with asynchronous Rust, logic is run in a "task", which is decoupled from an underlying thread. A task may run on many threads and many tasks may be multiplexed on a single thread. This hints at the need for task-local storage.

### Early attempt

Futures 0.1 included a [task-local storage][01] strategy. This was based around using the "runtime task" (more on this later) as the scope. When a task was spawned with `tokio::spawn`, a task-local map would be created and assigned with that task. Any task-local value that was stored would be stored in this map. Whenever the runtime polled the task, it would set the task context enabling access to find the value.

There are two main problems with this strategy which ultimately led to the removal of runtime task-local storage:

1) In asynchronous Rust, a "task" is not a clear-cut thing.
2) The implementation did not leverage the significant optimizations that the compiler provides for thread-local storage.

### What is a "task"?

With synchronous Rust, a "thread" is a clear concept: the construct you get with `thread::spawn`. With asynchronous Rust, there is no strict definition of a "task". A task is most commonly the construct you get when calling `tokio::spawn`. The construct obtained with `tokio::spawn` will be referred to as the "runtime task". However, it is also possible to multiplex asynchronous logic within the context of a runtime task. APIs such as [`task::LocalSet`][local-set], [`FuturesUnordered`][futures-unordered], [`select!`][select], and [`join!`][join] provide the ability to embed a mini scheduler within a single runtime task.

Revisiting the primary use case, setting a request identifier for the duration of a request response exchange, here is a scenario in which using the "runtime task" as the scope for task-local storage would fail:

```rust
task_local!(static REQUEST_ID: Cell<u64> = Cell::new(0));

let request1 = get_request().await;
let request2 = get_request().await;

let (response1, response2) = join!{
    async {
        REQUEST_ID.with(|cell| cell.set(request1.identifier()));
        process(request1)
    },
    async {
        REQUEST_ID.with(|cell| cell.set(request2.identifier()));
        process(request2)
    },
};
```

`join!` multiplexes the execution of both branches on the same runtime task. Given this, if `REQUEST_ID` is scoped by the runtime task, the request ID would leak across the request / response exchange processing.

This is not a theoretical problem, but was hit repeatedly in practice. For example, Hyper's HTTP/2.0 implementation multiplexes many request / response exchanges on the same runtime task.

### Compiler thread-local optimizations

A second smaller problem with the original task-local storage strategy is that it required re-implementing "thread-local storage" like constructs but without being able to get the compiler to help optimize. A discussion of how the compiler optimizes thread-local storage is out of scope for this PR description, but suffice to say a task-local storage implementation should be able to leverage thread-locals as much as possible.

## A new task-local strategy

Introduced in this PR is a new strategy for dealing with task-local storage. Instead of using the runtime task as the thread-local scope, the proposed task-local API allows the user to define any arbitrary scope. This solves the problem of binding task-locals to the runtime task:

```rust
tokio::task_local!(static FOO: u32);

FOO.scope(1, async move {
    some_async_fn().await;
    assert_eq!(FOO.get(), 1);
}).await;
```

The `scope` function establishes a task-local scope for the `FOO` variable. It takes a value to initialize `FOO` with and an async block. The `FOO` task-local is then available for the duration of the provided block. `scope` returns a new future that must then be awaited on.

`tokio::task_local` will define a new thread-local. The future returned from `scope` will set this thread-local at the start of `poll` and unset it at the end of `poll`. `FOO.get` is a simple thread-local access with no special logic.

This strategy solves both problems. Task-locals can be scoped at any level and can leverage thread-local compiler optimizations.

Going back to the previous example:

```rust
task_local! {
    static REQUEST_ID: u64;
}

let request1 = get_request().await;
let request2 = get_request().await;

let (response1, response2) = join!{
    async {
        let identifier = request1.identifier();
        REQUEST_ID.scope(identifier, async {
            process(request1).await
        }).await
    },
    async {
        let identifier = request2.identifier();
        REQUEST_ID.scope(identifier, async {
            process(request2).await
        }).await
    },
};
```

There is no longer a problem with request identifiers leaking.

## Disadvantages

The primary disadvantage of this strategy is that the "set and forget" pattern with thread-locals is not possible.

```rust
thread_local! {
    static FOO: Cell<usize> = Cell::new(0);
}

thread::spawn(|| {
    FOO.with(|cell| cell.set(123));
    do_work();
});
```

In this example, `FOO` is set at the start of the thread and automatically cleared when the thread terminates. While this is nice in some cases, it only really logically makes sense because the scope of a "thread" is clear (the thread).

A similar pattern can be done with the proposed strategy but would require an explicit setting of the scope at the root of `tokio::spawn`. Additionally, one should only do this if the runtime task is the appropriate scope for the specific task-local variable.

Another disadvantage is that this new method does not support lazy initialization but requires an explicit `LocalKey::scope` call to set the task-local value. In this case, since task-locals are different from thread-locals, it is fine.

Overall, I think this is a much better improvement over what we originally had in futures 0.1. This version is also much easier to work with and reason about!

[01]: https://docs.rs/futures/0.1.29/futures/task/struct.LocalKey.html
[local-set]: #
[futures-unordered]: https://docs.rs/futures/0.3.1/futures/stream/struct.FuturesUnordered.html
[select]: https://docs.rs/futures/0.3.1/futures/macro.select.html
[join]: https://docs.rs/futures/0.3.1/futures/macro.join.html

system · April 16, 2020, 5:56pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.
