Boxed Traits faster than Generics?

I'm currently working on a path tracer library.
Until now, I have used boxed dynamic objects for the materials of shapes, e.g.

pub struct Sphere {
    radius: f64,
    center: Vec3,
    material: Box<dyn Material>,
}

[...]

pub struct Hit<'a> {
    position: Vec3,
    normal: Vec3,
    distance: f64,
    material: &'a Box<dyn Material>,
    uv: (f64, f64),
}

which requires the user to box every new material.
To make this easier for the user, I started to use generics instead:

pub struct Sphere<M: Material> {
    radius: f64,
    center: Vec3,
    material: M,
}

[...]

pub struct Hit<'a> {
    position: Vec3,
    normal: Vec3,
    distance: f64,
    material: &'a dyn Material,
    uv: (f64, f64),
}

To my surprise, generics were slower than using boxed dynamic objects.
Can somebody explain this behaviour, which seems to contradict everything I've read about boxing and allocating objects on the heap?

Note: I modified the Cargo.toml to use opt-level = 3 by default and compile --release with lto = "fat" and codegen-units = 1.

| 64 samples, 4 bounces | Box | <M> |
|---|---|---|
| run | 43s | 62s |
| run --release | 39s | 39s |

| 256 samples, 32 bounces | Box | <M> |
|---|---|---|
| run | 256s | 366s |
| run --release | 216s | 237s |

The project can be found here.
The commit SHA1s are 06ff90c4 for the boxed dynamic objects and 270083ae for generics.

It may be because material stored inline in Sphere made it larger. The true equivalent would be material: Box<M>, not bare M.
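To make the size argument concrete, here is a minimal sketch comparing a sphere that stores its material inline with one that stores it behind a Box. The types Vec3, BigMaterial, InlineSphere and BoxedSphere are hypothetical stand-ins, not the project's actual definitions, and the 64-byte material is an arbitrary assumption.

```rust
use std::mem::size_of;

// Hypothetical stand-in for the library's vector type: three f64 components.
#[allow(dead_code)]
pub struct Vec3 {
    x: f64,
    y: f64,
    z: f64,
}

// Made-up material carrying 64 bytes of inline state (an assumption).
#[allow(dead_code)]
pub struct BigMaterial {
    params: [f64; 8],
}

// Material stored inline: the sphere grows with the material.
#[allow(dead_code)]
pub struct InlineSphere<M> {
    radius: f64,
    center: Vec3,
    material: M,
}

// Material stored behind a Box: the sphere only holds one pointer.
#[allow(dead_code)]
pub struct BoxedSphere<M> {
    radius: f64,
    center: Vec3,
    material: Box<M>,
}

/// Returns (size of the inline variant, size of the boxed variant) in bytes.
pub fn sphere_sizes() -> (usize, usize) {
    (
        size_of::<InlineSphere<BigMaterial>>(),
        size_of::<BoxedSphere<BigMaterial>>(),
    )
}
```

Calling sphere_sizes() shows that the inline variant is larger by the size of the material minus one pointer, so fewer spheres fit in a cache line when the material is stored inline.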

It may be due to code bloat. If you have many materials and many functions with M parameters, they will all be duplicated. Apart from bloating instruction caches, it may reduce effectiveness of branch prediction, since each generic copy will have its own branches that don't learn across materials.

BTW, Box<dyn Material> can be coerced to be &dyn Material, so &'a Box<dyn Material> is an unnecessary wasteful double indirection.
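A minimal sketch of that coercion, assuming a simplified Material trait with a single hypothetical albedo method and a made-up Diffuse material (neither is taken from the project):

```rust
// Simplified stand-in for the library's trait (the method is an assumption).
pub trait Material {
    fn albedo(&self) -> f64;
}

// Hypothetical material used only for this illustration.
pub struct Diffuse {
    pub albedo: f64,
}

impl Material for Diffuse {
    fn albedo(&self) -> f64 {
        self.albedo
    }
}

pub struct Sphere {
    pub material: Box<dyn Material>,
}

// The hit borrows the trait object directly instead of borrowing the Box.
pub struct Hit<'a> {
    pub material: &'a dyn Material,
}

impl Sphere {
    pub fn hit(&self) -> Hit<'_> {
        // `&*self.material` dereferences the Box and reborrows its contents,
        // giving `&dyn Material` with a single level of indirection.
        Hit { material: &*self.material }
    }
}
```

With this shape, calling a method through hit.material follows one pointer to the material, whereas &Box<dyn Material> first has to load the Box and then follow it.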

Thanks for the quick response.
Using Box<M> made pretty much no difference.

| 64 samples, 4 bounces | Box | <M> | Box<M> |
|---|---|---|---|
| run | 43s | 62s | 63s |
| run --release | 39s | 39s | 39s |

| 256 samples, 32 bounces | Box | <M> | Box<M> |
|---|---|---|---|
| run | 256s | 366s | 367s |
| run --release | 216s | 237s | 231s |

The scene used for testing is the Cornell Box.
There are 2 different object primitives (triangles and spheres) and a total of 28 objects with 4 different materials (1 texture type). That should make 8 different combinations.

The objects are immutable and persist for the whole runtime.
Maybe there shouldn't be a significant time difference then, but the provided numbers are sort of reproducible:

This is what I get when using Ubuntu 20.04 with an 11th Gen Intel i7-1165G7 (8 Threads) @ 4.7 GHz.

| 256 samples, 32 bounces | Box | <M> |
|---|---|---|
| run | 281s | 397s |
| run --release | 244s | 227s |

The previous numbers were all gathered on a machine running Arch with an Intel i7-7700 (8 Threads) @ 4.2 GHz.

Have you profiled the code?

What is the size of the generics methods? Anyway, you should profile.

What is the size of the generics methods?

The only method that uses self.material is intersects (~30 lines of code), which returns a Hit referencing the Material.

Profiling the --release builds yields very similar results.
There's one difference, though: The generic version doesn't inline the transform_ray method which is used in the intersects method.
When profiling the normal builds, there is much more of a difference. It seems like the generic version inlines much less than the boxed version.

Images of the method lists can be found here:

Maybe try the coerced reference (&dyn Material) in the dynamic version, as suggested in kornel's last point.

The increased size of Hit could be enough for the compiler to generate worse-performing code.

When using generics, I already use &'a dyn Material over &'a Box<dyn Material> for Hit.

I should probably mention that the Sphere objects are stored on the heap as Box<dyn Shape> and that the code is multi-threaded using rayon.
Single-threaded, generics are actually faster, as I'd expect. However, for the boxed version, run is way faster than run --release. I also tried disabling hyper-threading, which yielded similar results:

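| 256 samples, 32 bounces | multi-threaded, Box | multi-threaded, <M> | not hyper-threaded, Box | not hyper-threaded, <M> | single-threaded, Box | single-threaded, <M> |
|---|---|---|---|---|---|---|
| run | 256.64s | 366.20s | 326.03s | 521.52s | 1147.52s | 1897.15s |
| run --release | 216.93s | 237.55s | 417.90s | 321.57s | 1541.12s | 1122.57s |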

Compared to the single-threaded version, the speed-up on 8 threads (my thread count) is about 4x to 7x.
Maybe multi-threaded code bloats the cache because each thread uses its own stack?

Are you using high opt-level in your debug mode? The numbers in debug and release mode being so close are very suspicious.

Rust generates exceptionally terrible code when optimizations are disabled. If your debug mode isn't 10x slower than release, you either optimize both, or you have a terrible bottleneck somewhere that makes code speed irrelevant.

Yes, debug mode uses opt-level = 3. Release mode additionally has lto = "fat" and codegen-units = 1.
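For readers following along, those settings would correspond to Cargo profile sections roughly like the following. This is a sketch reconstructed from the description in this thread, not a copy of the project's actual Cargo.toml.

```toml
# Debug builds (`cargo run`): optimized, but without LTO.
[profile.dev]
opt-level = 3

# Release builds (`cargo run --release`): opt-level = 3 is already the default here.
[profile.release]
lto = "fat"
codegen-units = 1
```

With both profiles at opt-level = 3, "run" and "run --release" differ mainly in LTO, codegen units, debug assertions and overflow checks, which is why the two timings are so close.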

I meant the case without generics: does your time increase there? A &dyn Material is a fat pointer (data pointer plus vtable pointer), so it uses more bytes than a &Box<dyn Material> and pushes the size of Hit up from 80 to 88 bytes.
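The 80 versus 88 bytes can be checked with a small sketch. It assumes Vec3 is three f64 values and a 64-bit target; HitThin and HitFat are hypothetical names for the two variants of Hit shown in the question.

```rust
use std::mem::size_of;

// Empty stand-in for the library's trait; only the pointer layout matters here.
pub trait Material {}

// Assumed layout of the vector type: three f64 components (24 bytes).
#[allow(dead_code)]
pub struct Vec3 {
    x: f64,
    y: f64,
    z: f64,
}

// Variant holding a thin reference to the Box (one pointer).
#[allow(dead_code)]
pub struct HitThin<'a> {
    position: Vec3,
    normal: Vec3,
    distance: f64,
    material: &'a Box<dyn Material>,
    uv: (f64, f64),
}

// Variant holding a fat reference to the trait object (data + vtable pointer).
#[allow(dead_code)]
pub struct HitFat<'a> {
    position: Vec3,
    normal: Vec3,
    distance: f64,
    material: &'a dyn Material,
    uv: (f64, f64),
}

/// Returns (size of HitThin, size of HitFat) in bytes.
pub fn hit_sizes() -> (usize, usize) {
    (size_of::<HitThin<'static>>(), size_of::<HitFat<'static>>())
}
```

On a 64-bit target the other fields add up to 72 bytes, so the thin variant is 80 bytes and the fat variant is 88 bytes; the difference is exactly one pointer.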

Debug is usually not something you should be benchmarking against.

For me the generic one runs faster (using 256 samples, 32 bounces).

I changed trace_tray into a loop, which took ~33% longer. After adding a #[inline(never)], the time went back to about the same as the recursive version. (It may be different on your CPU.)
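As an illustration of that kind of rewrite, here is a toy sketch of a recursive bounce computation and an equivalent loop. The functions trace_recursive and trace_loop and the constants are hypothetical and far simpler than a real tracer, and the post does not say where the #[inline(never)] was placed; putting it on the loop function is an assumption.

```rust
// Hypothetical constants standing in for a real scene.
const EMITTED: f64 = 1.0; // light picked up at every bounce
const ALBEDO: f64 = 0.5; // fraction of light surviving each bounce

/// Recursive form: light at this bounce plus attenuated light from the rest.
pub fn trace_recursive(bounces: u32) -> f64 {
    if bounces == 0 {
        return 0.0;
    }
    EMITTED + ALBEDO * trace_recursive(bounces - 1)
}

/// Loop form: carries the accumulated attenuation ("throughput") forward.
/// `#[inline(never)]` asks the compiler not to inline this function into callers.
#[inline(never)]
pub fn trace_loop(bounces: u32) -> f64 {
    let mut radiance = 0.0;
    let mut throughput = 1.0;
    for _ in 0..bounces {
        radiance += throughput * EMITTED;
        throughput *= ALBEDO;
    }
    radiance
}
```

Both functions compute the same sum; which one is faster in a real tracer depends on how the optimizer inlines and allocates registers around them, so it needs to be measured.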

Maybe PGO could be something you might want to try.
