Benchmarking the bytecount crate

Hi all,

I am trying to benchmark the bytecount crate (for comparison with C). I want to make sure I'm using it correctly so that all the optimizations are enabled.

I have the following in my Cargo.toml file:

...
[dependencies.bytecount]
features = ["runtime-dispatch-simd"]
version = "0.6"

[profile.release]
lto = true

And I'm running with RUSTFLAGS set to -C target-cpu=native.

Is this the correct setup to get bytecount working as well as possible?

For maximum speed, I would also suggest trying panic=abort and codegen-units=1. The former turns panics into immediate aborts, so the optimizer no longer has to account for unwinding. The latter slows down builds, because the LLVM part of the compilation is no longer split across parallel codegen units, but it tends to result in better inlining decisions and therefore faster binaries.
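As a rough sketch, these two settings could sit next to the existing lto line in the release profile of Cargo.toml. The lto line is carried over from the original post; the other two keys are only the suggestions from this reply, not settings confirmed to help this particular benchmark:

```toml
[profile.release]
lto = true
# Abort on panic instead of unwinding.
panic = "abort"
# Compile each crate as a single codegen unit (slower builds).
codegen-units = 1
```

Note that Cargo ignores the panic setting for builds made with cargo test or cargo bench, so it would only take effect if the benchmark is built as a regular binary, for example with cargo run --release.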

You may want to run the benchmarks both with and without -C target-cpu=native. The reason is that LLVM's cost model for AVX vectorization seems a little too optimistic, so the backend tends to overuse this form of vectorization. That can lead to a net slowdown, because use of AVX causes downclocking on some CPUs. Without target-cpu=native, LLVM is not allowed to use AVX in the code it generates (the default x86-64 baseline only includes SSE2), so that problem does not arise.
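One way to switch between the two configurations is to keep the flag in a project-local .cargo/config.toml instead of the RUSTFLAGS environment variable. The file is an assumption made for illustration; the original post sets the environment variable:

```toml
# .cargo/config.toml
[build]
# Comment this line out to build for the default baseline target CPU.
rustflags = ["-C", "target-cpu=native"]
```

The RUSTFLAGS environment variable takes precedence over this setting, so it would need to be unset for the file to have any effect.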
