Async vs sync, # of threads

Consider a program with the following constraints:

  1. all IO is network; there is no file IO
  2. the machine has some fixed # of cores, say 64
  3. we are guaranteed that the machine has < 200 open TCP streams (so # of streams / core is <= 4)

Under such assumptions, does async present a serious advantage over sync? Given the bound on the # of TCP streams, we can just have each core run 5 threads: one for doing work, 4 as 'sacrificial' threads to handle TCP.

In such a world, would async be expected to give any serious (say > 25%) performance increase?

EDIT: After @RedDocMD 's comment, I realized I forgot to state: assume x86_64 linux.
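For concreteness, here is a minimal sketch of the sync model I mean, with the "connections" faked as plain numbers (a real version would run a blocking read/write loop on a TcpStream; all names here are made up for illustration):

```rust
use std::sync::mpsc;
use std::thread;

// Handle one "connection": simulated here as a byte count to echo back.
// In the real program this would be a blocking loop on a TcpStream.
fn handle_connection(bytes: usize) -> usize {
    bytes // pretend we served `bytes` bytes
}

// One blocking 'sacrificial' thread per connection, as proposed above:
// with < 200 streams on 64 cores that is at most ~4 I/O threads per core,
// well within what the OS scheduler handles comfortably.
fn serve(connections: Vec<usize>) -> usize {
    let (tx, rx) = mpsc::channel();
    let mut handles = Vec::new();
    for conn in connections {
        let tx = tx.clone();
        handles.push(thread::spawn(move || {
            tx.send(handle_connection(conn)).unwrap();
        }));
    }
    drop(tx); // close our sender so the receiver iterator can finish
    for h in handles {
        h.join().unwrap();
    }
    rx.iter().sum() // total bytes "served" across all connections
}

fn main() {
    println!("{}", serve(vec![10, 20, 30]));
}
```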

It all depends on how fast the "threads" are from an OS perspective: switching to a new thread requires a context switch on most monolithic OSes. Async doesn't suffer from this problem.
That said, I don't have quantitative figures.

With that many cores, I suspect threads will be more than fine. I'd have said 100 threads is likely fine even with only 4 cores.

Sorry to be pedantic (my understanding of async tradeoffs is primitive), but is what matters here the # of cores, or the ratio of (# threads / # cores)? It's the ratio that matters, right?

Ah yes, now that you mention it: as @scottmcm says, even 100 threads aren't a problem, even with far fewer cores.
The main advantage of async is programmer ergonomics: you don't need to write a thread::spawn when doing blocking IO; it is automatically taken care of by Futures, the async keyword, and correct .awaits.
The point of async is to make a co-operative multitasking "system" with "plain"-looking code.

I agree.

I don't think this is the crux of the issue. You still have to write async_std::task::spawn, but what it does save you is the mess of inverting logic flow into state machines.
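To make "inverting logic flow into state machines" concrete, here is the kind of hand-rolled state machine you would otherwise write around a readiness loop; the protocol (one length byte, then that many body bytes) and all names are invented for illustration:

```rust
// "Read a length, then read that many bytes", written as the explicit
// state machine you would hand-roll without async. An async fn with two
// `.await`s compiles to essentially this shape, generated for you.
enum ReadFrame {
    AwaitingLen,
    AwaitingBody { len: usize, got: Vec<u8> },
    Done(Vec<u8>),
}

impl ReadFrame {
    // Feed one chunk of input (assumed non-empty in the AwaitingLen state)
    // and advance to the next state.
    fn advance(self, input: &[u8]) -> ReadFrame {
        match self {
            ReadFrame::AwaitingLen => ReadFrame::AwaitingBody {
                len: input[0] as usize,
                got: Vec::new(),
            },
            ReadFrame::AwaitingBody { len, mut got } => {
                got.extend_from_slice(input);
                if got.len() >= len {
                    got.truncate(len);
                    ReadFrame::Done(got)
                } else {
                    ReadFrame::AwaitingBody { len, got }
                }
            }
            done => done, // already finished; ignore further input
        }
    }
}

fn main() {
    let st = ReadFrame::AwaitingLen.advance(&[3]).advance(b"abc");
    match st {
        ReadFrame::Done(body) => println!("{}", String::from_utf8(body).unwrap()),
        _ => println!("incomplete"),
    }
}
```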


Mostly, yes, though at some point just the resource usage of all the threads can also get excessive. Like 10000 threads is never a good idea -- threads have their own stack memory, so that tends to be ≈10 GB of ram just for the stack usage. (And sure, if you're trying to run that many threads you probably have at least 64GB of RAM, so it's not the end of the world, but it's still not a good use of resources.)
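If stack reservation is the concern, the standard library does let you shrink it per thread; a minimal sketch (the 64KiB figure is an arbitrary illustrative value, and defaults vary: Rust spawns threads with 2MiB stacks unless told otherwise):

```rust
use std::thread;

// Run a small computation on a thread with an explicitly reduced stack.
// 10,000 default-sized Rust threads reserve ~20GB of address space (2MiB
// each); requesting 64KiB per thread cuts that reservation dramatically.
fn sum_on_small_stack() -> u64 {
    thread::Builder::new()
        .stack_size(64 * 1024) // hypothetical size; pick per workload
        .spawn(|| (0..100u64).sum::<u64>())
        .unwrap()
        .join()
        .unwrap()
}

fn main() {
    println!("{}", sum_on_small_stack());
}
```

Whether this is worth doing depends on the point made below: reserved-but-untouched stack pages mostly cost address space, not RAM.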

Except if you are Google; then it's OK, too.

Not even. All modern OSes overcommit, so you are not using that much memory, just that much address space (and if you are actually using all that memory, then the async version would use it, too). 10GB of address space is not a big deal.

In that case you would only need about 80MB of RAM for the kernel data structures (8K per thread × 10,000 threads).

What really hurts you is scheduling. And that only starts to matter if there are 10000 threads per CPU core, not 10000 threads total.

At this point you really need fibers or async. Fibers can be done on top of kernel threads, as I have shown already.


Note that those are fibers, not threads:

"Google Fibers" is a userspace scheduling framework used widely and successfully at Google to improve in-process workload isolation and response latencies.

(Google Fibers use cooperative scheduling features only)

Since they're cooperative and scheduled by the user (not the kernel), I consider them fundamentally different from normal OS threads.


What's the difference?

But Google Fibers are scheduled by the kernel. They start as normal kernel threads; in addition, they may relinquish control voluntarily. Then the kernel again picks another fiber (from the appropriate thread group… note that thread groups are not a Google Fiber concept, they are part of the existing kernel API) to run.

Of course the kernel can easily stop all that machinery if it needs a CPU core to run some other app.

So… when exactly do threads stop being threads and turn into something else? When they relinquish control? What if they stop doing that and work like normal threads for some time? Are they still fibers or not?

Or are they fibers because they use the “Google Fibers” library, and if they switched to “Microsoft Muppets” they would become threads again?

Just curious, do you have a source for these numbers?

Performance-wise, probably not. You could easily spawn a thread per connection and expect similar performance with blocking I/O. Async, however, does make it easier to do I/O multiplexing (join!), so if you are doing a lot of I/O per connection that can have a significant impact. In your case, if you prefer the simplicity of the synchronous code model, I would say it's probably not worth it and you should stick with threads.
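What join! buys can be approximated in sync code with scoped threads, at the cost of one OS thread per concurrent operation; a sketch with the two "reads" faked as pure functions (all names hypothetical):

```rust
use std::thread;

// Stand-ins for two blocking I/O calls (e.g. reads on two sockets).
fn fetch_a() -> u32 { 1 }
fn fetch_b() -> u32 { 2 }

// The sync equivalent of `join!(fetch_a(), fetch_b())`: run both
// concurrently and wait for both results. With async this is one macro
// call and no extra OS threads; here each branch costs a thread.
fn fetch_both() -> (u32, u32) {
    thread::scope(|s| {
        let a = s.spawn(fetch_a);
        let b = s.spawn(fetch_b);
        (a.join().unwrap(), b.join().unwrap())
    })
}

fn main() {
    let (a, b) = fetch_both();
    println!("{a} {b}");
}
```

With only a handful of concurrent operations per connection the extra threads are cheap; the async version mainly wins when the fan-out per connection gets large.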

It's not the code complexity that is bothering me; it is the increase in compile time. One thing I have been thinking a lot about lately is the set of problems where:

  1. You run GoLang to handle the outward-facing TCP / UDP sockets; GoLang then 'merges' these into a small # of streams.

  2. RustLang keeps using threads, being fed by these GoLang 'streams'.

I acknowledge up front that not all Rust async problems can be solved this way, but I am trying to figure out whether an interesting subset of them can be. This has the nice benefit that RustLang avoids the compile-time costs of async, and GoLang always compiles fast anyway.

Back-of-the-envelope calculation: modern OSes reschedule things every millisecond (the default HZ in the Linux kernel is 1000 nowadays), so with, say, 10,000 busy threads on 128 cores, every thread would be scheduled once per 70-80ms, which is usually considered acceptable (a response below 0.2s is perceived as instantaneous by humans). If not all threads are busy then latency would be even less.

Of course you may have special needs that require you to respond not in 70-80ms but in 1ms or even 1ns, but if you were dealing with such workloads then that requirement would be stated explicitly.

P.S. Important: this is on the assumption that you have enough RAM. If your system goes to swap it may become laggy with a much smaller number of threads. In pathological cases it may become laggy with a single thread.
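The arithmetic behind those numbers, as a checkable sketch (the 1ms timeslice and the thread/core counts are the assumptions stated above, not measurements):

```rust
// Worst-case time between two timeslices of a given thread, assuming every
// thread is runnable, a fixed timeslice (1ms at HZ=1000), and fair
// round-robin scheduling across all cores.
fn worst_case_latency_ms(threads: u64, cores: u64, timeslice_ms: u64) -> u64 {
    (threads / cores) * timeslice_ms
}

fn main() {
    // 10,000 busy threads on 128 cores: each thread runs again after
    // roughly 78ms, under the ~0.2s humans perceive as instantaneous.
    println!("{}", worst_case_latency_ms(10_000, 128, 1));
}
```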

How is async different from threads in regards to scheduling? Like, a .read(buf) tells the OS to switch to another thread, and a .read(buf).await yields back to the async scheduler. What allows async to handle more tasks? I always thought the difference was the memory overhead and the cost of context switching.

What allows async to handle more tasks?

Uhm.

a .read(buf).await yields back to the async scheduler

Precisely that. Every time you use .await the executor gets a chance to switch to another task. You can also .await without calling read(buf).

Kernel-based threads only relinquish their time slot when they are blocked, or when a timer interrupt happens (once every millisecond). Blocking doesn't happen every time you call read, which leads to a paradoxical situation: the better your SSD/HDD/network caches perform, the worse your latency becomes.

I always thought the difference was the memory overhead and the cost of context switching.

It's true for some OSes (macOS, Windows), less true for others (Linux).

On Linux each thread only uses 8K of unswappable kernel memory, which is comparable to many async functions.

Add the ability to voluntarily relinquish control (like Google Fibers do) and you get performance pretty similar to what you may get from the async model.

Of course at this point you would need some kind of executor for these “fibers” which makes it not that different from async.

But you can use non-atomic Rc, thread-local variables and other such things. Which is convenient in some cases.
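A tiny illustration of that convenience: within a single thread (as with tasks pinned to a single-threaded executor), plain Rc can be shared freely, and cloning it is a non-atomic counter increment rather than an atomic one. Only std types are used here:

```rust
use std::rc::Rc;

// Rc is not Send, so it cannot cross threads; but code confined to one
// thread can share it cheaply. Returns (reference count, combined length)
// after two clones of the same shared Vec.
fn share() -> (usize, usize) {
    let shared = Rc::new(vec![1, 2, 3]);
    let a = Rc::clone(&shared); // non-atomic refcount bump
    let b = Rc::clone(&shared);
    (Rc::strong_count(&shared), a.len() + b.len())
}

fn main() {
    let (count, total) = share();
    println!("{count} {total}");
}
```

Moving the same value into a thread::spawn closure would fail to compile, which is exactly the safety net that makes the single-threaded shortcut viable.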


So the difference is that async programs have more yield points, specifically when you do I/O that is immediately ready? Wouldn't sticking thread::yield_now() at the beginning of such functions solve that issue? Also, I know async_io calls read() before checking for readiness (on a non-blocking source), so wouldn't that have the same issue?

Also, if I have 7GB of RAM taken up by overcommitted thread stacks on a system with 8GB of RAM, does that mean anything beyond 1 more gig goes to swap? Because then even if that memory isn't being used, it has the same issue, right?

The way I am thinking about async (whether it's right or not I will leave you to judge):

  1. A task has some state when it's waiting for IO or something else to happen.

  2. This state can be quite small, and it can be calculated in advance. Maybe it's just 50 bytes of memory.

  3. To allocate an entire system thread to hold this state is not parsimonious. Rust offers you ways to use as little memory as possible for solving a problem, and this in the long run will tend to help performance.

  4. In practice, our computers are huge and powerful, so whatever you do it will probably work almost all of the time. Until, I don't know, several million customers want to access your website in a space of a few minutes.

  5. So mostly it doesn't matter. But I think it's natural to use Async for any kind of server IO. There are crates like hyper and axum that make it easy, almost trivial.

Absolutely not. That's not how overcommit works. It creates one single, empty zero page and maps it over and over again. And even the page-table structures required to do that are reused.

Ultimately, if you are allocating 8GB of stacks (8MB each) but only using 4KB from each stack, then you are losing about 50MB of memory.

Because then even if that memory isn't being used it has the same issue right?

No, no, no. Absolutely not. TSAN allocates 64TiB of address space (that's terabytes, not gigabytes). If it worked the way you expect, no one would have been able to use it.

If you allocate 64TiB in small chunks rather than as one big piece, then the overhead would be higher, but no, it wouldn't be even remotely problematic until you actually use that memory (and if you actually use it, then async would use it, too).

Also, I know async_io calls read() before checking for readiness (on a non-blocking source), so wouldn't that have the same issue?

It might. But since it's part of your program and not part of the OS kernel, it can easily be fixed if the problem ever becomes real.

In practice, our computers are huge and powerful, so whatever you do it will probably work almost all of the time. Until, I don't know, several million customers want to access your website in a space of a few minutes.

Yeah, but if you have several million customers then usually you can afford more than one server.

So mostly it doesn't matter. But I think it's natural to use Async for any kind of server IO. There are crates like hyper and axum that make it easy, almost trivial.

Async matters a lot, but at the opposite end of the spectrum: not when you have lots of threads, but when you have two, four, or eight. Making sure you are distributing work nicely between two or four or eight threads (that's what typical consumer CPUs have) and using all the resources you have… that's quite a challenging task.

But if you split your program into lots and lots of small tasks which can be executed in parallel… you may get very nice speedup.

Think of rustc, which is so slow that everyone complains. With async used judiciously it would have been able to use all the cores in your CPU.

Of course there are other, more pressing issues with rustc, and one or two async functions wouldn't help; you would need, basically, your whole program composed of them. But that matters much more than thread overhead.

Which makes async, surprisingly enough, extremely important for client programs, not server programs. Servers always had an “embarrassingly parallel” option: just run the requests from one customer on one thread and bam, all your cores are useful (and if they are not useful because you have too few customers, then you have a completely different problem). A client doesn't have that option. It needs async.

And yes, at the opposite end of the spectrum you may achieve things that require millions of threads without async, but such things are very rare.