SIMD (glam) slower than scalar version

I'm writing a path tracer and wanted to replace my own (scalar) math functionality with glam, because it uses SIMD. In dev mode (with opt-level = 3), the glam version is significantly faster than my own. However, once I switch to my release profile (lto = "fat", codegen-units = 1), the glam version suddenly becomes slower than my math implementation.
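
For reference, the profile settings in question look roughly like this in Cargo.toml (a sketch showing only the options mentioned above, not my full manifest):

```toml
# Dev profile with full optimizations enabled.
[profile.dev]
opt-level = 3

# Release profile with the settings that seem to trigger the slowdown.
[profile.release]
lto = "fat"
codegen-units = 1
```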

The following times were recorded for the cornell_box example on a Linux desktop, but the issue also occurs on a Windows desktop:

|         | main    | glam    |
|---------|---------|---------|
| dev     | 94.05 s | 74.74 s |
| release | 53.85 s | 67.05 s |

My code can be found here. The main branch contains my math implementation, while the glam branch uses glam.

Does anybody know what might cause this or could point me in the right direction for debugging/profiling this?

It looks like only a selection of the f32 types use SIMD, while you seem to be using the f64 types. It's also possible that your old code already got optimized in a comparable way. If you aren't doing anything too special, you can probably switch to the f32-based types without much (or any) visual difference. That may also improve your memory bandwidth, since more data fits in the cache.
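
Just to illustrate the distinction (a minimal sketch; DVec3 and Vec3A are glam's type names, everything else here is made up):

```rust
use glam::{DVec3, Vec3A};

fn main() {
    // glam's f64 vectors (DVec3 etc.) are plain scalar structs.
    let a = DVec3::new(1.0, 2.0, 3.0);
    let b = DVec3::new(4.0, 5.0, 6.0);
    println!("f64 dot: {}", a.dot(b));

    // Vec3A is the 16-byte-aligned f32 type that can be SIMD-backed.
    let c = Vec3A::new(1.0, 2.0, 3.0);
    let d = Vec3A::new(4.0, 5.0, 6.0);
    println!("f32 dot: {}", c.dot(d));
}
```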

Thanks! I didn't know that glam only supports SIMD for f32. I still find it weird that it is noticeably slower than my simple scalar implementation. But I won't dig into that for now.

I replaced every f64 with f32 (for both branches) and used Vec3A instead of Vec3. glam is still about 10 s slower.

That's a bit odd, yeah. Like I mentioned, it's possible that your old code was easy for the compiler to optimize; it can be surprisingly effective sometimes. You may have to go into details like comparing the generated assembly, and that's not exactly my wheelhouse. My personal experience is mostly in looking for hot spots with tools like flame graphs or perf.

I made a benchmark for my ray-sphere intersection routine and found out that glam is slower if I enable lto = "fat". I opened a discussion on the glam repo. Maybe they know what's going on.
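
For context, the benchmark is roughly of this shape (a minimal sketch using criterion and glam's Vec3A; the scalar function stands in for my own vector type, and all names here are illustrative rather than the exact code from the repo):

```rust
use std::hint::black_box;

use criterion::{criterion_group, criterion_main, Criterion};
use glam::Vec3A;

/// Hand-rolled scalar ray-sphere intersection against a unit sphere at the origin.
fn scalar_hit(ox: f32, oy: f32, oz: f32, dx: f32, dy: f32, dz: f32) -> Option<f32> {
    // With a normalized direction, solve t^2 + 2*b*t + c = 0.
    let b = ox * dx + oy * dy + oz * dz;
    let c = ox * ox + oy * oy + oz * oz - 1.0;
    let disc = b * b - c;
    if disc < 0.0 { None } else { Some(-b - disc.sqrt()) }
}

/// The same intersection expressed with glam.
fn glam_hit(origin: Vec3A, dir: Vec3A) -> Option<f32> {
    let b = origin.dot(dir);
    let c = origin.length_squared() - 1.0;
    let disc = b * b - c;
    if disc < 0.0 { None } else { Some(-b - disc.sqrt()) }
}

fn bench_hit(c: &mut Criterion) {
    c.bench_function("scalar", |b| {
        b.iter(|| {
            scalar_hit(
                black_box(0.0), black_box(0.0), black_box(-3.0),
                black_box(0.0), black_box(0.0), black_box(1.0),
            )
        })
    });
    c.bench_function("glam", |b| {
        b.iter(|| glam_hit(black_box(Vec3A::new(0.0, 0.0, -3.0)), black_box(Vec3A::Z)))
    });
}

criterion_group!(benches, bench_hit);
criterion_main!(benches);
```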
