Goroutines / CSP in Rust/Wasm32?

I briefly mentioned it above. IMO, it is an interesting experiment, but is unlikely to achieve large-scale recognition as long as it breaks Rust's safety guarantees...

I've always wondered about that.

In general, when allocating, say, 1MB of memory, the OS will reserve 1MB of virtual address space but will only map pages to physical RAM as they are touched, so initially only the first page is backed. Similarly, a guard page will never be backed by RAM, since it cannot be used.

As a result, I would expect to be able to create a stackful coroutine with a 1MB stack segment, whose last 4kB page is a guard page (to detect stack overflow), and only actually commit 4kB of RAM (the first page), a portion of which will be coroutine metadata rather than actual stack space.

I seem to remember Go uses 2kB as its initial stack size, so 4kB seems pretty good in comparison. On a typical Linux, you have 47 bits of user address space, so you could have 46 bits worth of stacks, or enough space for 2^26 (64M) stacks of 1MB each.

So on a beefy server, I could see serving 64M connections (256 GB of RAM minimum JUST for stacks).
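The accounting above can be checked with a few lines of Rust. This is only a sketch of the arithmetic: the function names are made up for illustration, and it ignores page tables and other kernel bookkeeping, which would add to the real cost.

```rust
/// Number of fixed-size stacks that fit in `addr_bits` bits of address space.
/// (Hypothetical helper, not part of any library.)
fn max_stacks(addr_bits: u32, stack_bytes: u64) -> u64 {
    (1u64 << addr_bits) / stack_bytes
}

/// RAM committed if every stack only ever touches `committed_bytes`
/// of its reservation (here: a single 4kB page).
fn committed_ram(stacks: u64, committed_bytes: u64) -> u64 {
    stacks * committed_bytes
}

fn main() {
    const MIB: u64 = 1 << 20;
    const GIB: u64 = 1 << 30;

    // 46 bits of address space set aside for 1MB stacks.
    let stacks = max_stacks(46, MIB);
    // One 4kB page committed per stack.
    let ram = committed_ram(stacks, 4096);

    // prints "67108864 stacks, 256 GiB committed"
    println!("{} stacks, {} GiB committed", stacks, ram / GIB);
}
```

So the 2^26 stacks and 256 GB figures are consistent with each other, under the assumption that each coroutine never grows past its first page.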

Is there something wrong in my accounting?

@HadrienG How does it break safety? Do you have an example or a reference?

@bjouhier As may's README explains, stack overflow is UB and automatic task migration between threads can cause TLS-related UB in some circumstances.

@matthieum As I pointed out earlier, I find the "stackful coroutines are lighter-weight than threads" argument to border on misleading, given that (1) using less RAM than a thread with 4kB of committed stack is difficult, and (2) threads can be recycled, so the "creating and destroying threads is expensive" argument is also a bit weak.

What is true is that stackful coroutines usually go together with nonblocking IO, which has proven benefits over threads + blocking IO (e.g. fewer syscalls).
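To make the "threads can be recycled" point concrete, here is a minimal sketch using only the standard library: one long-lived worker thread with an explicitly small stack serves many jobs over a channel, so the spawn cost is paid once. The function name, the 64kB stack size, and the "double the input" job are all placeholders chosen for illustration; note that `stack_size` is a request, and the platform may round it up to its own minimum.

```rust
use std::sync::mpsc;
use std::thread;

/// Runs every job on a single long-lived worker thread and returns the
/// results in submission order. (Hypothetical helper for illustration.)
fn run_on_recycled_thread(jobs: Vec<u64>) -> Vec<u64> {
    let (job_tx, job_rx) = mpsc::channel::<u64>();
    let (res_tx, res_rx) = mpsc::channel::<u64>();

    // Spawn once, with a small requested stack (assumed value: 64kB).
    let worker = thread::Builder::new()
        .stack_size(64 * 1024)
        .spawn(move || {
            // The same thread is reused for every job until the sender is dropped.
            for n in job_rx {
                res_tx.send(n * 2).unwrap();
            }
        })
        .expect("failed to spawn worker thread");

    for n in &jobs {
        job_tx.send(*n).unwrap();
    }
    // Closing the job channel lets the worker loop terminate.
    drop(job_tx);

    // The result channel closes when the worker exits and drops `res_tx`.
    let results: Vec<u64> = res_rx.iter().collect();
    worker.join().unwrap();
    results
}

fn main() {
    println!("{:?}", run_on_recycled_thread(vec![1, 2, 3]));
}
```

This only addresses creation and destruction cost; it says nothing about how well the OS scheduler copes with very large numbers of such threads.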

There are a lot of details in this part of the BEAM book on how processes are laid out.

FWIW a new process takes up 2704 bytes on my macOS laptop and 2688 on a Linux server I had handy. (The Erlang Efficiency Guide shows how to measure this.)

I can't help there, I'm afraid. I think a lot of the strength of how Erlang does things comes from the fact that, as well as handling errors locally, it's possible to link/monitor another process and handle errors "remotely" by restarting in a known good state. AKA let it crash, AKA have you tried turning it off and on again. This is handled by the runtime, so something like that would have to be re-implemented to get supervisors in wasm.

Joe Armstrong's thesis covers the Erlang approach in great detail. I don't know that reading it would have an immediate practical benefit for your problem, but I think it's worth adding to anyone's tech reading list because it gives you another mental model for thinking about software.

If you create 100k threads and put them to sleep on some IO, your OS scheduler will not love you. That's not so much about the RAM usage; the OS is just not built for that.

While the benchmark is not scientific in any way, I believe this factor is nicely shown there: Benchmark of different Async approaches in Rust | Vorner’s random stuff.

Furthermore, if you have growing stacks, you might also have shrinking stacks. I don't know if e.g. Go does that (dropping the no longer needed parts when it puts a goroutine to sleep), but I'm pretty sure it is not done for C-style stacks. So if you ever go 100kB deep into your C-style stack, that memory stays committed until you deallocate the whole stack, while a sleeping goroutine could just as well get rid of it if it only needs 2kB while sleeping.

Yes, this is what I meant when I said that nonblocking IO was the key point, not threads vs stackful coroutines.

Re: growing/shrinking stacks, I think that is only relevant when your thread's stack usage varies greatly over time.

@HadrienG Thanks for the pointers. I saw that may has changed coroutine creation to be unsafe, because of these issues.

The problem with threads is not that they cannot be recycled; it's that, last I checked, OSes will struggle with 10,000 threads, whereas 1,000,000 stackful coroutines are perfectly achievable.

It doesn't make stackful coroutines better at everything; it just gives them a very compelling advantage for servers handling upward of hundreds/thousands of connections.

This (and @vorner's previous post) seems to suggest that fixed-size stackful coroutines mostly exist as a workaround for the incompetence of OS schedulers, and don't really have a fundamental advantage that OS threads couldn't gain through sufficient implementation effort. If correct, that's very sad.
