I’m a fan of Go’s go-routines / CSP model for handling concurrency. Does Rust/wasm32 support anything like coroutines ? There is this detailed and long discussion at Coroutines and Rust … but I am not sure what the conclusion of that is.
Unfortunately, Rust cannot exactly replicate the design of goroutines without sacrificing other design goals such as good interoperability with C/++ and zero-cost abstractions (aka “you don’t pay run-time performance for features that you don’t use”).
Click me if you want more details...
Goroutines, as idiomatically used in Go to build scalable servers, essentially require growable task stacks. You can’t achieve scalability to huge number of tasks if it is not possible to keep task stacks very small with careful programming, and you can’t achieve good ergonomics if it’s not possible for task stacks to grow larger as needed. Add to this that it’s impossible to predict program stack size in advance in a fully general way, and the need for stack growth becomes clear.
There are two classic ways to implement growable stacks. One is to allow stack segmentation (i.e. make the stack a linked list of memory blocks instead of a single big memory block) and one is to move the old contents of the stack into a new, larger memory block, when the stack grows.
The segmented stack way incurs overhead when the stack pointer moves around (since there’s a need to check if one needs to move to the previous stack block or allocate a new one), and degrades interoperability with C because e.g. you can’t easily pass a pointer to an array on the stack to a C function since it may not be contiguous. The “move everything around” way also costs performance on stack growth, and breaks many low-level use cases because it means that pointers to the stack of Rust threads cannot be relied upon for extended periods of time as they may be invalidated by things moving around.
You can write Rust programs in CSP style by using OS threads communicating via
mpsc channels, but it will be more heavyweight than Go’s approach because OS threads are a more heavyweight abstraction than goroutines (more expensive to create on some OSes, eat up more RAM, etc).
As a more efficient alternative, you can also explore library-level implementations of the Actor paradigm, which is closely related to CSP. A well-loved library in this area is Actix.
I’m not sure if any of these libraries has been ported to wasm yet, though.
@HadrienG : Not the answer I was hoping, but I can’t argue with the logic. Thanks for explaining the “segmented stack / moving stack” approaches and how neither can work given current tradeoffs.
As a “fun” aside, it’s possible to implement very CPU-efficient and 100% C-compatible growable thread stacks at the OS level via paging tricks, but the granularity is limited (min 4kB per thread on x86, which means that 1M threads doing nothing take up 4GB of RAM) and you cannot portably assume that the average mainstream OS will implement that functionality…
Ah, reading this topic reminded me of something important that I forgot to mention: future prospects.
Because it requires the kind of stack management outlined above to be memory-efficient, Rust is unlikely to ever provide first-class support for Go-style implicit coroutines, where any function can decide to suspend the active task without higher-level caller functions needing to know about it.
However, there is work towards providing Python- and C#-style explicit coroutines, where you can define so-called “generators” and “async functions” with explicit suspension points (yield, async/await…) and these get compiled down to little state machines.
How do the two compare, you may ask?
From a Go user’s point of view, explicit coroutines can feel less nice to deal with than implicit coroutines, because every caller of an explicit coroutine must be aware of the fact that it is a coroutine and possibly become an coroutine itself. This leads to what Bob Nystrom calls the colored function problem, where you have a “coroutine world” and a “non-coroutine world” coexisting in your program and interactions between the two worlds require care.
On the other hand, explicit coroutines make it easier to reason about code behaviour, compile down to slightly more efficient constructs, and most importantly for Rust they do not cost anything to people who don’t use them and preserve 100% interoperability between the non-coroutine world and other programming languages. So ultimately, it’s a trade-off, and there are arguments to be made in favor of both approaches.
It just happens that the explicit model seems to be a better match for Rust’s design goals.
The “async function” part of this work, which is what you want for building servers, is close to completion and will hopefully land in the language at some point during this year. Then you will only need to wait for libraries that support it, which shouldn’t take too long because there’s plenty of library experimentation already going on with the unstable prototype of this language feature.
I’ve been working on a sort of next generation actix all this month. It’s called thespis and is on github (don’t pay attention to the state of the readme files right now). It’s not yet released on crates, and I think it will take another few months at least, since I will have a bit less time next months to work on it.
I also have not had the time yet to test it on wasm, but I will use it on wasm over websockets and with the remote actors from thespis for which I just finished integration tests. I probably need to make a small executor to defer to wasm-bindgen-futures for spawning futures and it should work straight away on wasm. I don’t think it’s got any problematic dependencies.
It’s not officially released, so don’t count on support yet, but of course it’s possible to file issues and pull requests. The repositories
thespis_impl_remote have examples and integration tests which show that and how it works.
I also before gave it a try to see if actix could work in WASM, and it can. But it would need work on actix_runtime, and actix does not have remote actors, so you can’t communicate as easily with the server using actor messages. You need to wrap them in some serialized format, manage the connection and make a dispatcher on the other end to send them to the right actor. Thespis takes care of those things for you.
I just saw a news.yc article https://news.ycombinator.com/item?id=19907229 linking to
Any insights on (1) how these systems work and (2) whether then can be ported to Rust?
I’m sorry, but I won’t have more time to look into it today, so I don’t know how they work, but you can do anything you can do in C in Rust, so yes they can be ported, if nothing else by creating a wrapper around the C API.
I’m sorry if it felt like I called you out – I was trying to use
@ as a ‘ping’
I studied http://dunkels.com/adam/pt/expansion.html a b it – it looks like there is no stack manipulating black magic going on with protothreads – and everything is done through macro => compile to select.
From my superficial understanding (not sure if this is correct), this has the flaw that all “waiting calls” of the protothread must appear in the same function body (since it’s macro expansion, it can’t “reach” across function calls.)
From a quick look…
Protothread seems close to the state machine based design that is targeted for Rust coroutines. Although obviously, owing to being built on top of C, its implementation is a bit more of a hack that what Rust can aim for with first-class compiler integration.
Libmill provides stackful coroutines, but crucially without stack growth. Which means that you have a performance vs ergonomics tradeoff depending on how big you make your coroutines’ stacks, unlike in Go where you can have it both ways. Similar libraries exist in Rust, may being one of the most popular examples. But without first-class compiler support, these libraries are unlikely to achieve large-scale use, because they break Rust’s safety guarantees:
- Stack overflow is memory-unsafe Undefined Behaviour instead of leading to a predictable crash.
- By allowing coroutines to migrate across threads between suspension points, may can also trigger extra Undefined Behaviour in areas of the rust standard library that rely on thread-local storage.
This is a bit of an advertisement, but corona doesn’t have these problems, because:
- It places a guard page at the end of the stack. If you overflow into that, you get a controlled SEGFAULT, not UB. This is AFAIK the same technique Rust itself uses on its own threads (I don’t know if the compiler makes sure never to generate code that would „jump over“ the whole page, though).
- It doesn’t migrate the coroutines between threads. The downside is you don’t get work stealing (or multi-threaded execution unless you take care of that yourself), so it might not be what you look for.
Dumb question here on my part: why is ‘coroutine stack overflow’ UB ?
It seems that doing an additional check “would allocating this function call stack frame cause overflow?” would only cost an additional + a comparison check – which is trivial compared to the overhead of a function call.
Is this due to (1) people taking “zero cost abstraction” literally and (2) in Rust there’s no way to define a function to say: if called from regular thread, do X, if called from co-routine, do Y ?
There are a few problems that prevent the kind of stack overflow checks that you have in mind in library-based implicit coroutines:
- Checking on function start may not always be enough, because the stack usage of some functions is not predictable in advance. As a silly example, consider calling via FFI into a C function that does randomly sized allocas.
- Checking on every function call can be too expensive from a performance PoV. Which may sound like excessive optimization, until one realizes that the very point of using coroutines for concurrency is performance/memory optimization.
- And, most importantly, in all the programming languages that I know of, a normal library cannot hook on the start of every function called by a coroutine. This is something that only a programming language compiler can do.
As an aside, I wouldn’t be surprised if these same problems also made Go code calling into C via FFI not 100% memory-safe in the face of stack overflow, since the Go compiler cannot check what the C code is doing with the stack. But calling into C is considered unsafe anyway so I guess it’s “fine”
@vorner’s aforementioned corona library uses a paging technique called guard pages to make the program automatically segfault on stack overflow, which is a close relative of the OS-based stack growth approach that I mentioned above. But I expect it to have the same granularity limitations as that technique: you cannot make your coroutines smaller than 8kB* (1 normal page + 1 guard page) on x86.
(*) Though that’s virtual size, which arguably doesn’t matter on 64-bit CPUs. Physical size is 4kB since the guard page does not actually map into allocated RAM. Still won’t scale to 100M coroutines though.
Thanks for your detailed response. I’m fairly convinced now of full blown co-routines with < 4KB stack size can’t be added as a library.
Yes, that’s true. Though usually you need a larger stack than 4k anyway in practice.
I think scaling up to 100M coroutines with any library might be a bit challenging. Let’s say 500 bytes of coroutine state (both overhead of the library for the coroutine itself + some state on the coroutines stack) doesn’t sound very large. Even if the library decided to shrink the stack to its current size, not its max when suspending the coroutine, that would be 50GB of RAM.
But certainly, the corona’s „comfort“ of full-stack coroutines is not zero cost and you have to decide if you can afford it. If you really need 100M connections or so, then corona is not appropriate tool.
@vorner Oh, I did not mean to say that going below 4k is the norm. But since one can also set the stack size of OS threads down to a page, it’s good to know whether coroutine libraries can scale below that or not when pondering the validity of the “threads use a lot of RAM” argument.
It’s not that actually: the array itself will be contiguous. The real problem was that since C does not know about segmented stacks, you need to grow stack by a large segment before every FFI call, then shrink it back afterwards. This is also why Go shuns libc and prefers to use kernel syscalls directly.
Not having pointers to data on the stack is pretty much impossible. In order to be able to relocate stack, you need to know where all those pointers are. Since Go already has garbage collector infrastructure in their runtime, this is pretty easy for them to do. (Or, perhaps, this is why Go uses GC).
These are fascinating implementation details. Although the original question starts with goroutines/CSP – what actually motivated the question was Erlang/OTP. I asked about go instead as go seems closer to Rust than Erlang is to Rust.
A few followup questions:
I know that Erlang has a “BEAM” VM – are there any other tricks they use to get such inexpensive “erlang processes” (which are like coroutines with the additional advantage that they can crash independently.)
The problem that prompted the original question – was “async error handling.” For sync error handling, Option/Result works great.
However, for async error handling (we ask a networked device for some data, the server might be down, the network may be slow, the data may be corrupt, maybe version mismatch, etc …) it’s not clear to me how to deal with such problems in Rust without co-routines.
To make a really bad analogy, imagine your car breaks down, and to fix it, you need to order some parts, run some diagnostics, try some stuff.
One day to handle this is to “intersperse” the “car work” into your 9-5 work day (won’t work very well). An alternative approach is to fork off a “mechanic thread” which handles all the car problems, and you occasionally communicate with the mechanic.
I see async error handling a bit like this – there’s some work that needs to get done, things might go wrong, and we want a separate “process/thread/coroutine” to “manage” the situation, and we want the main thread to occasionally communicate with the separate “process/thread/coroutine”
Erlang/OTP is great at handling this. Goroutines/CSP is pretty good too.
In Rust/wasm32, what is the best practical way for handling this?
[ Note: this question is easier than “what is best way to do coroutines”, as we don’t really care about the 100M co-routine situation, and “performance” is important only to the extent we can handle all the AJAX requests.]
@vadimcn Thanks for the corrections!
@zeroexcuses Erlang and Rust have very different philosophies with respect to error handling.
Erlang is famous for its “let it crash, then restart it” philosophy, which is somewhat unique to this language family and has a profound impact on the way libraries will be designed (need to build for crash recovery a lot more than usual, but to care about local error handling a lot less). I believe that at the implementation level, all this requires is an exception mechanism with a handler at the “top” of every coroutine that has a decent default, but maybe an Erlang regular can comment further.
Rust, on the other hand, is closer to the mainstream error handling strategy of trying to catch errors are close to the point where they occured as possible. It helps programs that use this strategy by doing two things:
- Making a difference between predictable errors (e.g. I/O issues) and program bugs (e.g. out-of-bound array accesses).
- Packing the former kind of errors together with the result (on success) of a function as a single handy enum object, so that clients of the function…
- Cannot mistakenly ignore the fact that a function can error out, because they need to unpack the result + error combo in order to read the result.
- Can easily carry around the (result + error) combo all the way to the point where it will be processed, even if that implies storing it in a container, transferring it over the network, etc.
In this context, Rust’s standard answer to the async error problem would be to make you either handle the error inside of the async coroutine or transfer the result + error combo through whatever communication channel you use to communicate with the async I/O task, and decide what to do about errors there (and, if the communication channel is unreliable, also having a strategy for handling communication channel errors).
If you also want to recover from program bugs in async I/O tasks, you can to some degree by making sure your code is built with
panic = "unwind" (which is the default) and using the
std::panic::catch_unwind() function as an “exception handler”. But recovering from bugs is a much more difficult task which much fewer applications need to engage in, so Rust did not polish/stress this aspect as much as the handling of “expected” errors.
What about may? It provides stackful coroutines.