Causes for thread 'tokio-runtime-worker' has overflowed its stack

I am getting this error message:

thread 'tokio-runtime-worker' has overflowed its stack

I have tracked it down to invoking a method in a large code-generated source file of about 54K lines.

What the code gen does is build a static map of fn pointers; each fn is a small block of code that accesses data in structs (sometimes nested, with Option nodes when a node is not required). The static map has around 2,400 elements, so it is not huge. Then there are some simple methods to read from the map and invoke the selected fn.
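For concreteness, here is a minimal sketch of the pattern described above (all names are invented and the real table has ~2,400 generated entries; this is just to illustrate the shape, not our actual code):

```rust
// Hypothetical record type; the real structs are code-generated.
struct Record {
    id: u64,
    name: Option<String>,
}

// Each entry is a small accessor fn; non-capturing closures coerce to fn pointers.
type Accessor = fn(&Record) -> Option<String>;

// The real table has ~2,400 entries; two shown here for illustration.
static ACCESSORS: &[(&str, Accessor)] = &[
    ("id", |r: &Record| Some(r.id.to_string())),
    ("name", |r: &Record| r.name.clone()),
];

// Look up the accessor by key and invoke it.
fn read_field(rec: &Record, key: &str) -> Option<String> {
    ACCESSORS
        .iter()
        .find(|(k, _)| *k == key)
        .and_then(|(_, f)| f(rec))
}

fn main() {
    let r = Record { id: 7, name: None };
    assert_eq!(read_field(&r, "id"), Some("7".to_string()));
    assert_eq!(read_field(&r, "name"), None);
}
```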

One odd thing is that the same code runs fine on my Mac, this only happens on Linux.

Since there is no recursion, and this code is sync (not a future), I cannot see anything obvious. I even tried getting the type sizes with a nightly build; the only things that looked sizable came from low-level crates used by our code or by our dependencies.

Could it just be the size of the source file? If so, and I broke it into, say, 20 source files, all accessible as modules from a single source file, would that help? I would still need to load each fn pointer into a single static map; I do not see a way around that.

Any help appreciated.

Nope, this should have no effect.

I would try lowering the stack size until it happens predictably, then try to debug it. One thing to look for is large arrays.

Once you figure it out, you can decide if it's fixable, or if you just need to increase the stack size.
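The same experiment can be run outside tokio with plain std threads, whose stack size can also be set explicitly; a large local array like the one below is the kind of thing to hunt for (the 128 KiB array and the sizes are purely illustrative):

```rust
use std::thread;

// Illustrative: a function with a large local array, the kind of thing
// that eats stack space without any recursion.
fn uses_big_array() -> u8 {
    let buf = [0u8; 128 * 1024]; // 128 KiB on the stack
    buf[buf.len() - 1]
}

fn main() {
    // std::thread::Builder::stack_size controls the spawned thread's stack;
    // lower it step by step until the overflow reproduces deterministically.
    let handle = thread::Builder::new()
        .stack_size(1024 * 1024) // 1 MiB: plenty for the 128 KiB array
        .spawn(uses_big_array)
        .expect("spawn failed");
    assert_eq!(handle.join().expect("thread panicked"), 0);
}
```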

Clippy has a lint for large futures. If you're not using Box::pin(async { ... }).await, you may end up with the entire state of your whole application as one Future on the stack.

Thanks, I will try both of these ideas.

I tried various combinations and while we have it working with a bigger stack size, the results are confusing.

Here is what we tried:

  • Change from `#[tokio::main]` to creating our own runtime
  • Wrap the initial future in Box::pin
  • On Mac this works fine (it worked before as well)

We also set the thread_stack_size manually to smaller and smaller sizes until an overflow happens on the Mac; it does not happen until about 450K. When the overflow does happen, it is much earlier in the flow, so it doesn't really help us narrow it down.

Various stack sizes:

  • 2MB (the default): Mac works, Linux crashes with an overflow
  • 4MB: same thing
  • 8MB: both Mac and Linux work

It is odd that the two platforms behave differently (stopping at different places), which makes this hard to debug. I would have thought Box::pin would solve it; I read an excellent blog post that explains why. If it moves the main Future to the heap, shouldn't the issue resolve?

We can live with it for now just by using a larger stack size (8MB is not unreasonable), but I would like to understand how to find the underlying issue. And I suppose the difference between Linux and Mac can be attributed to differences in the compilers?

Any light that can be shed on this will be much appreciated.

Can you share this file and the method name?

I have tracked it down to invoking a method in a large code generated source file. The source file is about 54K lines.

And did the clippy lint flag anything?

This is a commercial application, an entire server application, so it is many files and it does a lot. What I can share, if helpful, is this part of main:

    let rt = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        // Explicitly set the thread stack size; 8MB seems to be required on Linux
        .thread_stack_size(1024 * 8000)
        .build()
        .expect("Could not start tokio Runtime");

    rt.block_on(Box::pin(inner_start_server(addr)))

The inner_start_server isolates all the async fns.

We have yet to try clippy for large_futures; we will do that later today, once we learn how to use it :).
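For the clippy step: the lint is allow-by-default, so it has to be enabled explicitly, e.g. with `cargo clippy -- -W clippy::large_futures`. If I recall the config key correctly (verify against your Clippy version's documentation), the size threshold can also be tuned in clippy.toml:

```toml
# clippy.toml — bytes above which a future is flagged as "large"
# (key name as I remember it; check the Clippy lint docs for your version)
future-size-threshold = 16384
```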

Try Box::pin inside your server code, in multiple places, for smaller call sub-trees. Rust is not very smart with Box and sometimes builds one by copying data from the stack into it, so by that point it's too late to move a too-large stack object onto the heap. Box it while it's still small.
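A sketch of that suggestion, with invented names: box the sub-future at the point where it is created, so its large state machine never becomes part of the enclosing future:

```rust
// All names here are invented; the point is where the Box::pin goes.
async fn big_subtask() {
    let scratch = [0u8; 32 * 1024]; // held across the await: part of this future's state
    std::future::ready(()).await;
    let _ = scratch[0];
}

async fn handle_request() {
    // A plain `big_subtask().await` would inline all 32 KiB of state into
    // handle_request's own future; boxing here keeps only a pointer.
    Box::pin(big_subtask()).await;
}

fn main() {
    // The enclosing future stays small because the big state is on the heap.
    assert!(std::mem::size_of_val(&handle_request()) < 64);
    assert!(std::mem::size_of_val(&big_subtask()) >= 32 * 1024);
}
```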

That makes very good sense, we can do it on known smaller Futures. We'll try that and let you know.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.