LTO causes stack overflow

Hello,

I have a problem when I use the following in Cargo.toml:

[profile.release]
opt-level = 3
lto = true

Running the release build gives me:

thread '<unknown>' has overflowed its stack
fatal runtime error: stack overflow
Aborted (core dumped)

If I comment out the lto line, everything works fine.

What is happening? Thanks.

It could be an LLVM bug. But without any source to reproduce it, or a stack trace from gdb or lldb,
nobody can make progress on this.


Well, my code uses rayon, specifically par_iter_mut. When I replace all the parallel code with serial code, it works fine. Is the problem with LTO caused by rayon?

If it’s that same Floyd algorithm code you were working on before, you were using large stack allocated arrays which makes it very easy to overflow the stack if you’re not careful. You might want to move away from those for the sake of robustness.
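As a rough illustration of moving away from stack-allocated arrays, here is a minimal sketch. The size N and the helper new_graph are hypothetical, since the original code's types and sizes are not shown in this thread. The idea is that a Vec keeps its elements on the heap, so only a small handle lives in the stack frame.

```rust
// Hypothetical size, for illustration only; the real sizes are not shown in the thread.
const N: usize = 1024;

/// Build an N x N matrix on the heap. Each row is a Vec, so only the Vec
/// handles (pointer, length, capacity) ever live in a stack frame.
fn new_graph(fill: u32) -> Vec<Vec<u32>> {
    vec![vec![fill; N]; N]
}

fn main() {
    // A local such as `let g = [[0u32; N]; N];` would need N * N * 4 bytes
    // (4 MiB here) of stack, which is more than a 2 MiB thread stack can hold.
    let mut graph = new_graph(u32::MAX);
    graph[0][0] = 0;
    println!("rows = {}, cols = {}", graph.len(), graph[0].len());
}
```

A single flat Vec of length N * N indexed as row * N + col would be another option if one contiguous allocation is preferred.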


Yes, the problem is here:

    graph
        .par_iter_mut()
        .zip(&mut path[..])
        .zip(&column_k[..])
        .enumerate()
        .for_each(|(id, ((rows, rows_path), ik))| {
            if id != k {
                rows.par_iter_mut()
                    .zip(&mut rows_path[..])
                    .zip(&row_k[..])
                    .enumerate()
                    .for_each(|(id, ((ij, ij_path), kj))| {
                        if id != k {
                            floyd_serial(ij, *ik, *kj, ij_path, coord);
                        }
                    });
            }
        });

If I remove the inner par_iter_mut, everything works fine, but I lose a lot of performance. For example, with 8192: using this code it takes 1.05 s. Removing the inner par_iter_mut and using lto = "fat" it takes 1.6 s.

I don't really understand why this happens.

Assuming that *ik and *kj are large arrays, as per @drewkett's comment,
those dereferences copy the arrays by value, and such copies are very likely responsible for the excessive use of stack memory.

I don't know how floyd_serial needs to work. If it doesn't need to mutate the buffers, try passing the ik and kj references directly. If it does need its own copies in order to mutate them, try working with Vec instead (ik.to_vec() and kj.to_vec()) and see whether that helps.

Note that the default stack size for spawned threads is only 2 MiB, whereas the main thread on Linux will grow its stack dynamically (up to the ulimit -s limit, commonly 8 MiB). So if you're creating large arrays on the stack, this could be why Rayon overflows when a single-threaded solution is fine. I don't know why LTO would matter for this, except maybe if inlining made some large stack frame stick around longer, instead of a quick call/return.
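To make the stack-size point concrete, here is a minimal sketch using only the standard library. The buffer size LEN and the helpers sum_big_stack_array and run_with_stack are made up for illustration. It runs a function with a large stack frame on a thread whose stack size is requested explicitly through std::thread::Builder::stack_size.

```rust
use std::thread;

// Hypothetical workload size: a 4 MiB array kept in the stack frame.
const LEN: usize = 4 * 1024 * 1024;

/// Needs roughly LEN bytes of stack, more than the 2 MiB default for spawned threads.
fn sum_big_stack_array() -> u64 {
    let buf = [1u8; LEN];
    buf.iter().map(|&b| u64::from(b)).sum()
}

/// Run the workload on a new thread with the requested stack size in bytes.
fn run_with_stack(bytes: usize) -> u64 {
    thread::Builder::new()
        .stack_size(bytes)
        .spawn(sum_big_stack_array)
        .expect("failed to spawn thread")
        .join()
        .expect("worker thread panicked")
}

fn main() {
    // 16 MiB leaves generous headroom over the 4 MiB buffer.
    let total = run_with_stack(16 * 1024 * 1024);
    println!("sum = {total}");
}
```

Rayon's ThreadPoolBuilder offers a comparable stack_size setting for its worker threads. Raising the limit only hides the large frames, though, so heap allocation is usually the more robust fix.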


Thanks to all of you. In the end, if I do everything inside the for_each closure instead of calling the function floyd_serial, it works fine.

Thank you!

Another way to do that is to change floyd_serial to take all of its arguments by reference (&Matrix). This would require some changes at the call sites.
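A minimal sketch of what a by-reference signature could look like. The Matrix definition, SIDE, and relax below are hypothetical stand-ins: the real Matrix type and the body of floyd_serial are not shown in this thread, and the element-wise update is only a guess at the shape of the work.

```rust
// Hypothetical stand-in for the Matrix type; the real definition is not shown.
const SIDE: usize = 128;

struct Matrix {
    cells: [[u32; SIDE]; SIDE], // 64 KiB stored inline in the struct
}

/// Takes every argument by reference, so a call passes three pointers
/// instead of copying three 64 KiB arrays into the callee's stack frame.
fn relax(ij: &mut Matrix, ik: &Matrix, kj: &Matrix) {
    for r in 0..SIDE {
        for c in 0..SIDE {
            // saturating_add avoids overflow when u32::MAX is used as "infinity".
            let via_k = ik.cells[r][c].saturating_add(kj.cells[r][c]);
            if via_k < ij.cells[r][c] {
                ij.cells[r][c] = via_k;
            }
        }
    }
}

fn main() {
    let mut ij = Matrix { cells: [[10; SIDE]; SIDE] };
    let ik = Matrix { cells: [[3; SIDE]; SIDE] };
    let kj = Matrix { cells: [[4; SIDE]; SIDE] };
    // The call site passes borrows instead of dereferenced copies.
    relax(&mut ij, &ik, &kj);
    println!("ij[0][0] = {}", ij.cells[0][0]);
}
```

With a signature like this, the call inside the closure would pass ik and kj as they are instead of *ik and *kj.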