To my surprise, generics were slower than using boxed dynamic objects.
Can somebody explain this behaviour, which seems to contradict everything I've read about boxing and allocating objects on the heap?
Note: I modified Cargo.toml so that the default (dev) profile uses opt-level = 3, and the release profile additionally uses lto = "fat" and codegen-units = 1.
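For reference, a Cargo.toml profile section matching that setup might look like this (assuming the standard dev/release profile names):

```toml
# Optimize the default (dev) profile too, so plain `cargo run` isn't
# a fully unoptimized build.
[profile.dev]
opt-level = 3

[profile.release]
opt-level = 3
lto = "fat"        # whole-program link-time optimization
codegen-units = 1  # single codegen unit for maximum cross-function inlining
```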
64 samples, 4 bounces

|               | Box  | `<M>` |
|---------------|------|-------|
| run           | 43s  | 62s   |
| run --release | 39s  | 39s   |

256 samples, 32 bounces

|               | Box  | `<M>` |
|---------------|------|-------|
| run           | 256s | 366s  |
| run --release | 216s | 237s  |
The project can be found here.
The commit SHA1s are 06ff90c4 for the boxed dynamic objects and 270083ae for generics.
It may be because storing the material inline made Sphere larger. The true equivalent of the boxed version would be material: Box<M>, not a bare M.
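A minimal sketch of that size difference, with made-up field layouts (the real Sphere and material types in the project will differ):

```rust
use std::mem::size_of;

// Hypothetical 64-byte material, standing in for a real one.
struct BigMaterial {
    data: [f64; 8],
}

// Material stored inline: the sphere grows with the material.
struct SphereInline<M> {
    center: [f64; 3],
    radius: f64,
    material: M,
}

// Material behind a Box: the sphere only stores a pointer.
struct SphereBoxed<M> {
    center: [f64; 3],
    radius: f64,
    material: Box<M>,
}

fn main() {
    println!("{}", size_of::<SphereInline<BigMaterial>>()); // 96 on 64-bit
    println!("{}", size_of::<SphereBoxed<BigMaterial>>()); // 40 on 64-bit
}
```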
It may be due to code bloat. If you have many materials and many functions with M parameters, they will all be duplicated (monomorphized). Apart from bloating the instruction cache, that may reduce the effectiveness of branch prediction, since each generic copy has its own branches that don't learn across materials.
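A toy illustration of the distinction (names are made up): the generic function is compiled once per concrete material, branches and all, while the dyn version is compiled once and dispatches through a vtable:

```rust
trait Material {
    fn albedo(&self) -> f64;
}

struct Lambertian;
struct Metal;

impl Material for Lambertian {
    fn albedo(&self) -> f64 { 0.5 }
}
impl Material for Metal {
    fn albedo(&self) -> f64 { 0.25 }
}

// Monomorphized: the compiler emits one copy of this function
// (with its own branches) for every concrete M it's called with.
fn shade_generic<M: Material>(m: &M) -> f64 {
    m.albedo() * 2.0
}

// Dynamic dispatch: one copy total, calls go through the vtable.
fn shade_dyn(m: &dyn Material) -> f64 {
    m.albedo() * 2.0
}

fn main() {
    println!("{}", shade_generic(&Lambertian)); // 1
    println!("{}", shade_dyn(&Metal)); // 0.5
}
```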
BTW, Box<dyn Material> can be coerced to &dyn Material, so &'a Box<dyn Material> is an unnecessarily wasteful double indirection.
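A minimal sketch of that coercion (the trait and names here are made up):

```rust
trait Material {
    fn name(&self) -> &'static str;
}

struct Metal;
impl Material for Metal {
    fn name(&self) -> &'static str { "metal" }
}

// Takes a plain trait-object reference, not &Box<dyn Material>.
fn describe(m: &dyn Material) -> &'static str {
    m.name()
}

fn main() {
    let boxed: Box<dyn Material> = Box::new(Metal);
    // `&*boxed` (or `boxed.as_ref()`) coerces Box<dyn Material> to
    // &dyn Material, avoiding the double indirection of &Box<dyn Material>.
    println!("{}", describe(&*boxed)); // metal
}
```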
Thanks for the quick response.
Using Box<M> made pretty much no difference.
64 samples, 4 bounces

|               | Box  | `<M>` | `Box<M>` |
|---------------|------|-------|----------|
| run           | 43s  | 62s   | 63s      |
| run --release | 39s  | 39s   | 39s      |

256 samples, 32 bounces

|               | Box  | `<M>` | `Box<M>` |
|---------------|------|-------|----------|
| run           | 256s | 366s  | 367s     |
| run --release | 216s | 237s  | 231s     |
The scene used for testing is the Cornell Box.
There are 2 different object primitives (triangles and spheres) and a total of 28 objects with 4 different materials (1 texture type). That should make 8 different combinations.
The objects are immutable and persist for the whole runtime.
Maybe there shouldn't be a significant time difference then, but the numbers above are fairly reproducible:
This is what I get when using Ubuntu 20.04 with a 11th Gen Intel i7-1165G7 (8 Threads) @ 4.7 GHz.
256 samples, 32 bounces

|               | Box  | `<M>` |
|---------------|------|-------|
| run           | 281s | 397s  |
| run --release | 244s | 227s  |
The previous numbers were all gathered on a machine running Arch with an Intel i7-7700 (8 Threads) @ 4.2 GHz.
The only method that uses self.material is intersects (~30 lines of code), which returns a Hit referencing the Material.
Profiling the --release builds yields very similar results.
There's one difference, though: The generic version doesn't inline the transform_ray method which is used in the intersects method.
When profiling the normal builds, there is much more of a difference. It seems like the generic version inlines much less than the boxed version.
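One thing worth trying is an explicit #[inline] hint on transform_ray. A hypothetical minimal version, just to show where the attribute goes (the real signature in the project will differ):

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
struct Ray {
    origin: [f64; 3],
    dir: [f64; 3],
}

// #[inline] makes the function body available for inlining even across
// crates and codegen units; whether it helps here is for profiling to say.
#[inline]
fn transform_ray(r: Ray, offset: [f64; 3]) -> Ray {
    Ray {
        origin: [
            r.origin[0] - offset[0],
            r.origin[1] - offset[1],
            r.origin[2] - offset[2],
        ],
        dir: r.dir,
    }
}

fn main() {
    let r = Ray { origin: [1.0, 2.0, 3.0], dir: [0.0, 0.0, 1.0] };
    let t = transform_ray(r, [1.0, 1.0, 1.0]);
    println!("{:?}", t.origin); // [0.0, 1.0, 2.0]
}
```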
When using generics, I already use &'a dyn Material over &'a Box<dyn Material> for Hit.
I should probably mention that the Sphere objects are stored on the heap as Box<dyn Shape> and that the code is multi-threaded using rayon.
Single-threaded, generics are actually faster, as I'd expect. However, for the boxed version, run is actually much faster than run --release. I also tried disabling hyper-threading, which yielded similar results:
256 samples, 32 bounces

|               | multi-threaded Box | multi-threaded `<M>` | not hyper-threaded Box | not hyper-threaded `<M>` | single-threaded Box | single-threaded `<M>` |
|---------------|--------------------|----------------------|------------------------|--------------------------|---------------------|------------------------|
| run           | 256.64s            | 366.20s              | 326.03s                | 521.52s                  | 1147.52s            | 1897.15s               |
| run --release | 216.93s            | 237.55s              | 417.90s                | 321.57s                  | 1541.12s            | 1122.57s               |
Compared to the single-threaded version, the speed-up on 8 threads (my thread count) is about 4x to 7x.
Maybe multi-threaded code bloats the cache because each thread uses its own stack?
Are you using a high opt-level in your debug mode? The numbers in debug and release mode being so close is very suspicious.
Rust generates exceptionally terrible code when optimizations are disabled. If your debug mode isn't 10x slower than release, you're either optimizing both, or you have a terrible bottleneck somewhere that makes code speed irrelevant.
I mean in the case when not using generics: does your time increase? The &dyn reference uses more bytes and pushes the size of Hit up from 80 to 88 bytes.
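The extra bytes come from &dyn Material being a fat pointer (data pointer plus vtable pointer). A quick check, assuming a 64-bit target:

```rust
use std::mem::size_of;

trait Material {}
struct Metal;
impl Material for Metal {}

fn main() {
    // Thin reference: just a data pointer (8 bytes on 64-bit).
    println!("{}", size_of::<&Metal>());
    // Trait-object reference: data pointer + vtable pointer (16 bytes).
    println!("{}", size_of::<&dyn Material>());
}
```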
Debug is usually not something you should be benchmarking against.
For me, the generic one runs faster (using 256 samples, 32 bounces).
I changed trace_ray into a loop; it took ~33% longer. After adding #[inline(never)], the time went back to about the same as the recursive version. (It may be different on your CPU.)
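For context, a stripped-down sketch of the two shapes I mean (the real trace_ray accumulates color along bounces; this toy just attenuates a scalar):

```rust
// Recursive form: one call per bounce.
fn trace_recursive(energy: f64, bounces: u32) -> f64 {
    if bounces == 0 {
        energy
    } else {
        trace_recursive(energy * 0.5, bounces - 1)
    }
}

// Iterative form; #[inline(never)] keeps it a separate function in the
// binary instead of letting the optimizer merge it into callers.
#[inline(never)]
fn trace_loop(mut energy: f64, bounces: u32) -> f64 {
    for _ in 0..bounces {
        energy *= 0.5;
    }
    energy
}

fn main() {
    println!("{}", trace_recursive(1.0, 4)); // 0.0625
    println!("{}", trace_loop(1.0, 4)); // 0.0625
}
```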
PGO might be something you want to try.