Discussion on Synchronous Crate Concurrency Refactor using Stackful Coroutines Model in Rust

Hello Rustaceans!

Following several months of internal discussion and the release of bincode-next v3.0.0-rc.7, bincode-next experimentally supports asynchronous decoding using the stackful coroutines model (more precisely, an async fiber framework) behind the optional async-fiber feature gate. With this model we achieved concurrency support without rewriting the core of the crate or the serde ecosystem, while preserving high concurrency performance (about 1.2M ops/s on my old Latitude 5400; more rigorous benchmarks are under way and will be published shortly). The catch is that this approach relies heavily on low-level assembly and unsafe operations, and although we try our best to ensure it is safe, it has some limitations under the current compiler tooling. Being stackful, it also carries a performance cost (though a very limited one, thanks to our stack pool implementation and the virtual memory facilities of modern operating systems; it even has the unexpected benefit of much smoother performance under high coroutine counts) and is not compatible with no-std environments.

Pros and Cons of our Approach

Although these are well-known basics of asynchronous engineering, we repeat the pros and cons here to give the reader a clearer picture.

Pros

  1. Using the async fiber model, we partially solve the function coloring problem of the stackless asynchronous model, bridging the synchronous and asynchronous ecosystems (especially the serde problem).
  2. Performance stays mostly smooth under high concurrency (near linear) and is friendlier to the CPU branch predictor (the huge finite state machines of the stackless asynchronous model often drag performance down under high concurrency).
  3. An easy-to-use API, just like a normal async function.
  4. ...
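To make the stackful idea in pro 1 concrete, here is a minimal, illustrative stand-in built on an OS thread plus channels (real fiber crates swap stacks with assembly instead, and this is not the bincode-next implementation; `run_coroutine` and all names below are invented for illustration). The point is that the synchronous body suspends by blocking and is resumed by a driver, without ever being rewritten into an async state machine:

```rust
use std::sync::mpsc;
use std::thread;

// A thread-backed stand-in for a stackful coroutine: the synchronous
// "decoding" loop below yields by blocking on a channel and is resumed
// by the driver feeding it one byte at a time.
fn run_coroutine(inputs: &[u8]) -> u32 {
    let (ready_tx, ready_rx) = mpsc::channel::<()>(); // coroutine -> driver
    let (byte_tx, byte_rx) = mpsc::channel::<u8>();   // driver -> coroutine
    let n = inputs.len();

    // The "coroutine": plain synchronous code, no state machine.
    let handle = thread::spawn(move || {
        let mut sum = 0u32;
        for _ in 0..n {
            ready_tx.send(()).unwrap();            // "yield": ask for input
            sum += byte_rx.recv().unwrap() as u32; // "resume" with one byte
        }
        sum
    });

    // The "executor": feeds a byte each time the coroutine suspends.
    for &b in inputs {
        ready_rx.recv().unwrap();
        byte_tx.send(b).unwrap();
    }
    handle.join().unwrap()
}

fn main() {
    assert_eq!(run_coroutine(&[1, 2, 3]), 6);
    println!("ok");
}
```

An assembly-based fiber does the same handshake by swapping stack pointers rather than parking an OS thread, which is where the performance (and the unsafety) comes from.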

Cons

  1. A safety requirement that users must not perform thread-local-dependent actions in manually defined Decode trait implementations (the TLS issue).
  2. Tools like Miri, LLDB, AddressSanitizer, Tokio tracing, and Kani stop working correctly (which forces us to rely on Loom, Proptest, and fuzzing for testing).
  3. Because we use low-level assembly instructions, cross-platform testing is a problem.
  4. The philosophy issue (see the following section for details).
  5. ...
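The TLS issue in con 1 boils down to one rule: never hold a reference into thread-local storage across a suspension point, because the fiber may resume on a different OS thread with a different TLS instance. A small sketch of the safe pattern, using only ordinary std code (no fiber runtime involved; the yield point is hypothetical):

```rust
use std::cell::Cell;

thread_local! {
    // Per-OS-thread counter. A fiber that migrates between threads would
    // observe a *different* instance after resuming, hence the rule below.
    static COUNTER: Cell<u32> = Cell::new(0);
}

// Safe pattern: copy what you need out of TLS and end the borrow before
// any (hypothetical) fiber yield point, so no TLS reference is live
// across a suspension.
fn bump() -> u32 {
    COUNTER.with(|c| {
        let v = c.get() + 1;
        c.set(v);
        v
    })
    // <- a fiber yield here would be fine: the TLS borrow ended above
}

fn main() {
    assert_eq!(bump(), 1);
    assert_eq!(bump(), 2);
}
```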

The Philosophy issue

As we all know, Rust at its core has a philosophy of zero cost, safety by default, and so on. Our approach uses low-level assembly and is stackful, which contradicts the Rust way in some sense (we think everyone knows this, and the repliers to this post certainly know it better...). But in another sense, maybe, just maybe, everything in Rust ultimately serves engineering needs. So could we define the Rust philosophy as providing flexible, high-performance tools for engineering practice, instead of treating pure safe Rust as its only symbol? Maybe yes, maybe no (I myself work in theoretical physics, so if you find what I said a little confusing, feel free to ignore it). As this is more of a philosophical question, there is probably no definitive answer. So let's keep ourselves open-minded (as our community's CoC asks) and discuss.

Links

Code Repository: GitHub - Apich-Organization/bincode (Bincode-next: the next official Rust implementation of bincode)
Discord Server: Apich Organization
Contact E-Mail: info@apich.org


Note: We know that, if asked, the std team, compiler team, lang team, and many other Rust and ecosystem developers would certainly oppose this idea of using the stackful coroutines model, but it is just an engineering approach, though not an idiomatic one. We hope the replies to this post will stay reasonable and rational rather than emotional.

Project Updates: v3.0.0-rc.10 Release

bincode-next v3.0.0-rc.10 was released today and contains many minor fixes for the async fiber module.

1 Like

I am not sure why this initiative is entangled with bincode, but I am happy to see it nevertheless!

I have long argued for this approach and even worked on it as part of a proprietary project, so I would like to give some feedback based on my experience.

Strictly speaking, that's not true. You can allocate stacks on the heap or statically while targeting bare metal (and some RTOSes do exactly that), but you have to be conservative with stack sizes because you have limited memory and no guard pages to protect against stack overflows. You could use stack canaries, but they are far from perfect, so an incorrect stack size selection can lead to various fun UB scenarios.

Fortunately, tools like cargo-call-stack can help verify the maximum stack usage of your functions, thus proving that your stack size selections are correct for your compiled binary.

I believe that ideally we should have separate "asynchronous" targets with appropriate std implementation and built-in executor (the language could provide ways for plugging custom executors, but it would be similar to plugging a custom allocator). Without it we have high risks of mixing sync and async code (e.g. by pulling third-party logging framework). It would resolve the coloring problem by introducing a global target-based switch between sync and async modes of compilation.

You forgot to mention one of the main advantages: no need for mirroring various IO traits with async counterparts.

With separate async targets it could be resolved by transparently replacing "thread"-locals with "task"-locals. We would still (somewhat confusingly) use thread_local! for backward compatibility, but under the hood it would switch to task locals for async targets. Same story with std::thread::spawn, on async targets it would spawn tasks, not "true" threads.

In some cases true thread locals could still be useful even on async targets (we could call them "worker locals"); e.g. rand::ThreadRng should be worker-local, but how best to expose it is a separate design question.

If "yield to executor" was part of the language, adding its support to Miri should be relatively easy.

I don't think it contradicts the Rust spirit; if anything, I find the stackful approach a much better fit for Rust. For example, it does not have the problem of task migration across workers for tasks which keep an Rc (or any other non-Send type) across yield points, and it does not need the mess of hacks like Pin.

1 Like

One major problem with the stackful approach is the efficient implementation of join/select-like functionality. Ideally, we do not want to allocate full stacks for each sub-task, since they are usually quite tiny (e.g. one read op). But allocating a sub-task's stack on its parent's stack is quite problematic: in the current version of Rust/LLVM it is impossible to estimate maximum stack usage at compile time, and in the presence of dynamic linking (e.g. if we use libc::read) it is fundamentally impossible (at least with the existing shared library formats and linkers).

So we either have to pay the cost of a full stack allocation (small, but noticeable), or use unsafe task spawning with guesses about how much stack a spawned sub-task will need. Stack canaries could help a bit, but they are a runtime check, and one which fires after the fact to boot.

I believe that it should be possible to develop a new "stackless" ABI with two stack pointers: one for a "persistent" stack holding data that crosses yield points (i.e. the memory which belongs to the Future in the current Rust model) and one for an ephemeral stack holding data which does not cross yield points (i.e. worker stacks). The compiler would be able to track maximum stack usage for the latter stack, while the former would be used for shared library calls, thus resolving the libc::read issue. But it would involve a fair bit of work in Rust and LLVM, so I don't have high hopes of seeing it implemented in the coming decade.

1 Like

If that's an issue, I assume the tasks are being scheduled on different OS threads. In that case you have another issue, because you're implicitly assuming everything stored on the stack is Send, and some types are not (e.g. MutexGuard when it was implemented with OS mutexes, because those need to be released on the same OS thread that acquired them).

1 Like

I am not sure how thread/task-locals are relevant to the rest of your comment. If a task uses task-local storage, then migration across workers is not a problem for it, since the memory is... well, local to the task. If a task wants to use a true thread/worker local storage (e.g. for ThreadRng), then the code which accesses it must not contain any yield points (and ideally it should be enforced by the type/effect system).

No, I am not assuming that; I explicitly mentioned tasks using non-Send data across yield points, and that the stackful approach works with such code perfectly fine, without any of the headaches you encounter in equivalent futures-based code.

You've been (wrongly) conditioned by the flawed Rust async model. Think about it: why is it not a problem for a thread to migrate across CPU cores despite potentially storing Rcs on its stack, but a problem to do the same for a Future (migration across workers is equivalent to migration across CPU cores)? It's nothing more than a leaky abstraction in the Rust async system, which models the persistent part of task stacks as "just a type". I wrote about this previously here.

1 Like

If thread (not task) locals are a safety issue, then it must be because the current task could be moved to a different OS thread while holding them. Or am I missing something here?

It will work fine for most things, but not everything.

Just to go back to thread-locals: any C code (or FFI code in general) using thread locals will be broken by this approach, because it won't use task locals.

I know how async works, and I know that a lot of !Send can actually work fine in a task.

The issue is a conflation of the concepts of sending a type to a different task vs. a different thread. Send represents both, and many types that are !Send do not satisfy the first concept but still satisfy the second (which makes them safe to store on the stack of an async function even though the compiler rejects the code). However, some types do not satisfy the second concept, such as those using actual thread locals or other OS resources bound to an OS thread, and those break when used in stackful coroutines that migrate between OS threads.
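This conflation can be sketched with two hypothetical marker traits. Both trait names below are invented for illustration; neither exists in Rust, and real Send is a single auto trait covering both notions:

```rust
use std::rc::Rc;

// Hypothetical split of Rust's single `Send` into the two notions it
// conflates (names invented for this sketch):
// - TaskSend:   safe to hand to another *task*, even one that may later
//               run elsewhere together with all its state.
// - ThreadSend: safe to move to another *OS thread* on its own.
trait TaskSend {}
trait ThreadSend {}

// Rc is fine to keep inside a migrating task: the whole task (and hence
// every clone of the Rc) moves together, so the non-atomic refcount is
// never touched from two threads at once.
impl<T> TaskSend for Rc<T> {}
// But there is deliberately no `impl<T> ThreadSend for Rc<T>`: sharing
// an Rc between two OS threads would race on the refcount.

fn keep_in_task<T: TaskSend>(_v: T) {}

fn main() {
    let data = Rc::new(5);
    keep_in_task(data.clone()); // accepted under the finer-grained model
    assert_eq!(*data, 5);
}
```

Under today's single Send trait, both cases are rejected together, which is exactly why `Rc` across an `.await` fails to compile even when the task never actually shares it between threads.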

1 Like

I am saying that on "async targets" thread_local! should be actually task-local, thus migration between workers would not be a safety issue. If you want to use a "true" thread-local on such targets, then you must guarantee that you don't have any yield points while keeping references to such memory. In the absence of proper language tools to prove it, it would mean that "true" thread locals would be unsafe on async targets, similarly to how core-locals are unsafe for thread-based code.

Nope. Such code is guaranteed not to contain any yield points into our async runtime, so its usage of thread locals should not be broken (assuming it does not return TLS references to the caller, which should not be done either way). And if a hypothetical library does so for some reason (e.g. if we want a shared library written in Rust which uses the async runtime), then it should follow the associated rules.

If you work with such resources then such code should simply not compile for async targets. Similarly to how if your crate uses some Linux-specific black magic, it should not compile for Windows or Mac.

I think it's better to keep things simple and define Send as a marker for types which can be sent across threads of execution, regardless whether those threads are "true" threads or "green" ones. Otherwise we would get proliferation of various markers, e.g. "can be used only on the main thread".

UPD: I forgot to mention that it is also possible to temporarily forbid migration of stackful tasks across workers and allow it back at runtime. For example, this capability is useful for join/select-like functionality where we want all sub-tasks to be executed on the same worker (e.g. if they share some non-Send data). But whether such a capability could be useful for dealing with the cases mentioned above is an open question.

1 Like

Sorry for the delay! I am in China Standard Time, which makes it a little difficult for me to catch up in time.

Ah, it's a sad story; you could refer to What's going on with bincode? and Releasing bincode-next v3.0.0-rc.1 for some background on that. It was something of a firefighting move...

Yes, this is certainly true. But as far as I understand, cargo-call-stack is designed to analyze the maximum path length through the call graph, and for most recursion-based serialization crates like bincode-next or serde it cannot find a maximum and will simply conclude that it is infinite. So this is one problem.
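The recursion problem can be seen in miniature: for a decoder over a nesting format like the hypothetical one below (the byte encoding and the `depth` function are invented for illustration), stack depth is a function of the input bytes, so no static call-graph analysis can bound it.

```rust
// A toy recursive decoder: 0x01 opens a nested value, anything else ends
// the nesting. Each nesting level costs one more stack frame, so the
// maximum stack usage depends on the *input*, not on the code alone.
fn depth(input: &[u8]) -> (usize, &[u8]) {
    match input.split_first() {
        Some((0x01, rest)) => {
            let (d, rest) = depth(rest); // recursion: stack grows with input
            (d + 1, rest)
        }
        _ => (0, input),
    }
}

fn main() {
    // Three 0x01 bytes -> three nested levels -> three extra stack frames.
    assert_eq!(depth(&[0x01, 0x01, 0x01, 0x00]).0, 3);
    assert_eq!(depth(&[0x00]).0, 0);
}
```

A hostile input of a million 0x01 bytes would drive the recursion (and the stack) a million frames deep, which is exactly the case a static analyzer must report as unbounded.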

Secondly, we think that, maybe, just maybe, in most scenarios a no-std environment does not need high-performance async decoding designed for high concurrency. Trying to support it would require a lot of design and maintenance work, and might result in an almost endless stream of issues and security vulnerabilities (as no-std environments are so non-standard and diverse). So we simply made it an optional feature, just like the alloc and std features.

Language-layer support is almost the best solution to this problem, yet it faces significant challenges:

  1. Introducing separate async and sync modes makes the Rust promise of consistency a little more complex, as different compilation modes for the same target triple would behave differently. And trying to mix-compile the two is yet another layer of complexity.
  2. Miri would require an almost total rewrite of its logic (it is currently based on the tree borrows model, although other models are also in development) and would become much slower; otherwise the Miri WG would need to land an endless stream of fiber-specific patches.
  3. As for the LLVM problem: anyone who has ever been in touch with the LLVM community can imagine the friction. Introducing a new ABI would require rewriting the function prologue and epilogue generation logic on every platform. And given the LLVM community's zero tolerance for performance regressions, if this new ABI introduced any regression for sync code, the patch would simply be thrown away...
  4. Not to mention the political pressure within the Rust community itself, and our LFM and FCP policy.

So the prospect is great, but we can only wait for a huge commercial company to find it useful and decide to push for the changes...

Mathematically, if I remember correctly, the only perfect and mathematically sound solution to the thread safety problem is the FSM-based stackless async model currently used by Rust, but, as we all agree, it is far from perfect. So for the stackful model, personally I think a potential solution is a language-layer abstraction with proper discovery of potentially unsafe operations, forcing users who do such things, like using true TLS, to mark them unsafe and take responsibility for the safety themselves.

This is a wonderful idea, but how the compiler would decide whether a task is migration-disabled or normal, and the performance impact of doing so, remain unknown. Maybe a future PoC and benchmarks will tell, but it is a long way off, and we will probably need to wait for decades.

But as far as I can tell, the usage of true TLS will necessarily be limited to FFI interfaces, so it would be easier to tell apart than it seems. One more thing: different workers use different sets of FSMs, so different workers are not quite equivalent to different CPU cores, to some extent.