Goroutines / csp in Rust/Wasm32?

This is a bit of an advertisement, but corona doesn’t have these problems, because:

  • It places a guard page at the end of the stack. If you overflow into that, you get a controlled SEGFAULT, not UB. This is AFAIK the same technique Rust itself uses on its own threads (I don't know if the compiler makes sure never to generate code that would "jump over" the whole page, though).
  • It doesn't migrate the coroutines between threads. The downside is you don't get work stealing (or multi-threaded execution unless you take care of that yourself), so it might not be what you're looking for.
2 Likes

Dumb question here on my part: why is ‘coroutine stack overflow’ UB ?

It seems that an additional check, "would allocating this function's stack frame cause overflow?", would only cost an addition plus a comparison – trivial compared to the overhead of a function call.

Is this due to (1) people taking "zero cost abstraction" literally, and (2) Rust having no way to define a function that says: if called from a regular thread, do X; if called from a coroutine, do Y?

1 Like

There are a few problems that prevent the kind of stack overflow checks that you have in mind in library-based implicit coroutines:

  1. Checking on function start may not always be enough, because the stack usage of some functions is not predictable in advance. As a silly example, consider calling via FFI into a C function that does randomly sized allocas.
  2. Checking on every function call can be too expensive from a performance PoV. Which may sound like excessive optimization, until one realizes that the very point of using coroutines for concurrency is performance/memory optimization.
  3. And, most importantly, in all the programming languages that I know of, a normal library cannot hook on the start of every function called by a coroutine. This is something that only a programming language compiler can do.

As an aside, I wouldn’t be surprised if these same problems also made Go code calling into C via FFI not 100% memory-safe in the face of stack overflow, since the Go compiler cannot check what the C code is doing with the stack. But calling into C is considered unsafe anyway so I guess it’s “fine” :slight_smile:

@vorner’s aforementioned corona library uses a paging technique called guard pages to make the program automatically segfault on stack overflow, which is a close relative of the OS-based stack growth approach that I mentioned above. But I expect it to have the same granularity limitations as that technique: you cannot make your coroutines smaller than 8kB* (1 normal page + 1 guard page) on x86.


(*) Though that’s virtual size, which arguably doesn’t matter on 64-bit CPUs. Physical size is 4kB since the guard page does not actually map into allocated RAM. Still won’t scale to 100M coroutines though.

2 Likes

Thanks for your detailed response. I’m fairly convinced now that full-blown co-routines with < 4KB stack size can’t be added as a library. :slight_smile:

1 Like

Yes, that’s true. Though usually you need a larger stack than 4k anyway in practice.

I think scaling up to 100M coroutines with any library might be a bit challenging. Let’s say 500 bytes of coroutine state (the library’s overhead for the coroutine itself + some state on the coroutine’s stack), which doesn’t sound very large. Even if the library shrank the stack to its current size, not its maximum, when suspending the coroutine, that would be 50GB of RAM.

But certainly, corona’s "comfort" of full-stack coroutines is not zero cost and you have to decide if you can afford it. If you really need 100M connections or so, then corona is not an appropriate tool.

1 Like

@vorner Oh, I did not mean to say that going below 4k is the norm. But since one can also set the stack size of OS threads down to a page, it’s good to know whether coroutine libraries can scale below that or not when pondering the validity of the “threads use a lot of RAM” argument.

It’s not that, actually: the array itself will be contiguous. The real problem was that, since C does not know about segmented stacks, you need to grow the stack by a large segment before every FFI call, then shrink it back afterwards. This is also why Go shuns libc and prefers to use kernel syscalls directly.

Not having pointers to data on the stack is pretty much impossible. In order to be able to relocate the stack, you need to know where all those pointers are. Since Go already has garbage collector infrastructure in its runtime, this is pretty easy for them to do. (Or, perhaps, this is why Go uses a GC.)

1 Like

These are fascinating implementation details. Although the original question starts with goroutines/CSP, what actually motivated it was Erlang/OTP. I asked about Go instead, as Go seems closer to Rust than Erlang is.

A few followup questions:

  1. I know that Erlang has a “BEAM” VM – are there any other tricks they use to get such inexpensive “erlang processes” (which are like coroutines with the additional advantage that they can crash independently.)

  2. The problem that prompted the original question – was “async error handling.” For sync error handling, Option/Result works great.

However, for async error handling (we ask a networked device for some data, the server might be down, the network may be slow, the data may be corrupt, maybe version mismatch, etc …) it’s not clear to me how to deal with such problems in Rust without co-routines.

To make a really bad analogy, imagine your car breaks down, and to fix it, you need to order some parts, run some diagnostics, try some stuff.

One way to handle this is to “intersperse” the “car work” into your 9-5 work day (which won’t work very well). An alternative approach is to fork off a “mechanic thread” which handles all the car problems, while you occasionally communicate with the mechanic.

I see async error handling a bit like this: there’s some work that needs to get done, things might go wrong, and we want a separate “process/thread/coroutine” to “manage” the situation, with the main thread occasionally communicating with it.

Erlang/OTP is great at handling this. Goroutines/CSP is pretty good too.

In Rust/wasm32, what is the best practical way for handling this?

[Note: this question is easier than “what is the best way to do coroutines”, as we don’t really care about the 100M-coroutine situation, and “performance” matters only to the extent that we can handle all the AJAX requests.]

1 Like

@vadimcn Thanks for the corrections!

@zeroexcuses Erlang and Rust have very different philosophies with respect to error handling.

Erlang is famous for its “let it crash, then restart it” philosophy, which is somewhat unique to this language family and has a profound impact on the way libraries will be designed (need to build for crash recovery a lot more than usual, but to care about local error handling a lot less). I believe that at the implementation level, all this requires is an exception mechanism with a handler at the “top” of every coroutine that has a decent default, but maybe an Erlang regular can comment further.

Rust, on the other hand, is closer to the mainstream error handling strategy of trying to catch errors as close to the point where they occurred as possible. It helps programs that use this strategy by doing two things:

  1. Distinguishing between predictable errors (e.g. I/O issues) and program bugs (e.g. out-of-bounds array accesses).
  2. Packing the former kind of error together with the function’s result (on success) into a single handy enum object, so that clients of the function…
    • Cannot mistakenly ignore the fact that a function can error out, because they need to unpack the result + error combo in order to read the result.
    • Can easily carry around the (result + error) combo all the way to the point where it will be processed, even if that implies storing it in a container, transferring it over the network, etc.

In this context, Rust’s standard answer to the async error problem would be to make you either handle the error inside of the async coroutine or transfer the result + error combo through whatever communication channel you use to communicate with the async I/O task, and decide what to do about errors there (and, if the communication channel is unreliable, also having a strategy for handling communication channel errors).

If you also want to recover from program bugs in async I/O tasks, you can to some degree by making sure your code is built with panic = "unwind" (which is the default) and using the std::panic::catch_unwind() function as an “exception handler”. But recovering from bugs is a much more difficult task which much fewer applications need to engage in, so Rust did not polish/stress this aspect as much as the handling of “expected” errors.

1 Like

What about may? It provides stackful coroutines.

I briefly mentioned it above. IMO, it is an interesting experiment, but is unlikely to achieve large-scale recognition as long as it breaks Rust’s safety guarantees…

I’ve always wondered about that.

In general, when allocating say 1MB of memory, the OS will reserve 1MB of virtual address space, but will only map pages to real RAM as they are first touched. And similarly, a guard page will remain unmapped, since it cannot be used.

As a result, I would expect to be able to create a stackful coroutine with a 1 MB stack segment, whose last 4kB page is a guard page (to detect stack overflow), and only actually commit 4kB of RAM (the first page), part of which will be coroutine metadata rather than actual stack space.

I seem to remember Go uses 2kB as its initial stack size, so 4kB seems pretty good in comparison. On a typical Linux, you have 47 bits of user address space, so you could have 46 bits worth of stacks, or enough space for 2^26 (64M) stacks of 1MB each.

So on a beefy server, I could see serving 64M connections (256 GB of RAM minimum JUST for stacks).

Is there something wrong in my accounting?

1 Like

@HadrienG How does it break safety? Do you have an example or a reference?

@bjouhier As may’s README explains, stack overflow is UB and automatic task migration between threads can cause TLS-related UB in some circumstances.

@matthieum As I pointed out earlier, I find the “stackful coroutines are lighter-weight than threads” argument to border on misleading, given that (1) using less RAM than a 4kB thread is difficult, and (2) threads can be recycled, so the “creating and destroying threads is expensive” argument is also a bit weak.

What is true is that stackful coroutines usually go together with nonblocking IO, which has proven benefits over threads + blocking IO (e.g. fewer syscalls).

There are a lot of details in this part of the BEAM book on how processes are laid out.

FWIW a new process takes up 2704 bytes on my macOS laptop and 2688 on a linux server I had handy. (The Erlang Efficiency Guide shows how to measure this.)

I can’t help there, I’m afraid. I think a lot of the strength of how Erlang does things is that, as well as handling errors locally, it’s possible to link/monitor another process and handle errors “remotely” by restarting in a known good state. AKA “let it crash”, AKA “have you tried turning it off and on again”. This is handled by the runtime, so something like that would have to be re-implemented to get supervisors in wasm.

Joe Armstrong’s thesis covers the Erlang approach in great detail. I don’t know that reading it would have an immediate practical benefit for your problem, but I think it’s worth adding to anyone’s tech reading list, because it gives you another mental model for thinking about software.

1 Like

If you create 100k threads and put them to sleep on some IO, your OS scheduler will not love you. That’s not so much about the RAM usage, the OS is just not built for that.

While the benchmark is not scientific in any way, I believe this factor is nicely shown there: https://vorner.github.io/async-bench.html.

Furthermore, if you have growing stacks, you might also have shrinking stacks. I don’t know if e.g. Go does that (when it puts a goroutine to sleep, dropping the no-longer-needed parts), but I’m pretty sure it is not done for C-style stacks. So if you ever go 100kB deep into your C-style stack, that memory stays until you deallocate the whole stack, while a sleeping goroutine could get rid of it if it only needs 2kB while sleeping.

1 Like

Yes, this is what I meant when I said that nonblocking IO was the key point, not threads vs stackful coroutines.

Re: growing/shrinking stacks, I think that is only relevant when your thread’s stack usage varies greatly over time.

@HadrienG Thanks for the pointers. I saw that may has changed coroutine creation to be unsafe, because of these issues.

1 Like

The problem with threads is not that they cannot be recycled; it’s that, last I checked, OSes will struggle with 10,000 threads, whereas 1,000,000 stackful coroutines are perfectly achievable.

It doesn’t make stackful coroutines better at everything; it just gives them a very compelling advantage for servers handling upward of hundreds/thousands of connections.

1 Like

This (and @vorner’s previous post) seems to suggest that fixed-size stackful coroutines mostly exist as a workaround for the incompetence of OS schedulers, and don’t really have a fundamental advantage that OS threads couldn’t gain through sufficient implementation effort. If correct, that’s very sad.