I have an application that is taking tens of minutes to run (processing hundreds of gigabytes of data).
Without changing or reimplementing the algorithm, or increasing parallelism, are there any flags that can be added to “cargo run --release” to make it more optimized (analogous to “gcc” vs “gcc -O3”)?
The first thing I’d suggest is
[target.x86_64-unknown-linux-gnu]
rustflags = ["-Ctarget-cpu=native"]
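Put this in ~/.cargo/config (adjust the target triple for your platform as appropriate). The default target CPU is quite conservative, so this gives the compiler much more scope for optimisation and lets it use the faster, newer instructions available on the machine you build on. Note that the resulting binary may not run on older CPUs that lack those instructions.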
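To check whether the flag is actually taking effect, a small sketch like the following can report which CPU features the compiler was allowed to assume. The function name compiled_features and the particular feature list are illustrative choices, not part of the advice above; the features named are x86 ones, so on other architectures the list will simply come back empty.

```rust
/// Returns the subset of a few x86 SIMD features that were enabled
/// at compile time (i.e. that the compiler may use unconditionally).
/// The feature list here is only an illustrative selection.
fn compiled_features() -> Vec<&'static str> {
    let mut enabled = Vec::new();
    // cfg!(target_feature = "...") is evaluated at compile time and
    // reflects -Ctarget-cpu / -Ctarget-feature settings.
    if cfg!(target_feature = "sse2") {
        enabled.push("sse2");
    }
    if cfg!(target_feature = "avx") {
        enabled.push("avx");
    }
    if cfg!(target_feature = "avx2") {
        enabled.push("avx2");
    }
    if cfg!(target_feature = "fma") {
        enabled.push("fma");
    }
    enabled
}

fn main() {
    println!("compile-time target features: {:?}", compiled_features());
}
```

Building this once with and once without the rustflags entry and comparing the printed lists should show the difference on a reasonably recent x86_64 machine. Running rustc --print cfg -C target-cpu=native gives similar information without writing any code.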
There are a few options you can set in Cargo.toml profiles which may help. Release builds already default to opt-level = 3, but you could try codegen-units = 1 (monolithic codegen sometimes helps inlining) and/or lto = true. There are also a number of stable codegen options you can read about in rustc -C help; -Ctarget-cpu=native, for example, might help. You can put rustc options in .cargo/config or just the
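For reference, the profile settings mentioned in this reply could be written in Cargo.toml roughly as follows. This is a sketch to experiment with rather than a recommended configuration; whether each setting helps depends on the program, and both usually lengthen compile times, so it is worth measuring each one separately.

```toml
# Cargo.toml
[profile.release]
opt-level = 3       # already the default for release builds
codegen-units = 1   # one codegen unit per crate; may improve inlining
lto = true          # link-time optimisation across the dependency graph
```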