Sha2's sha256 is very inefficient when building with opt-level "s" (while ring's implementation is unaffected)

I'm writing a program to hash all package-managed files (and compare them to what the distro package manager thinks they should be). So far I have implemented support for Arch Linux, which records sha256 hashes for all installed files (though I plan to support more distros down the line).

With the sha2 crate I wrote this code (slightly cut down):

use sha2::Digest;
use std::fs::File;
use std::io::{ErrorKind, Read};

let mut reader = File::open(path)?;
let mut buffer = [0; 128 * 1024];

let mut hasher = sha2::Sha256::new();
loop {
    match reader.read(&mut buffer) {
        Ok(0) => break,
        Ok(n) => {
            hasher.update(&buffer[..n]);
        }
        Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
        Err(e) => Err(e)?,
    }
}
let mut actual = Default::default();
hasher.finalize_into(&mut actual);

if actual[..] != expected[..] {
    issues.push(IssueKind::ChecksumIncorrect);
}

With ring I wrote this:

use std::fs::File;
use std::io::{ErrorKind, Read};

let mut reader = File::open(path)?;
let mut buffer = [0; 128 * 1024];
let mut hasher = ring::digest::Context::new(&ring::digest::SHA256);
loop {
    match reader.read(&mut buffer) {
        Ok(0) => break,
        Ok(n) => {
            hasher.update(&buffer[..n]);
        }
        Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
        Err(e) => Err(e)?,
    }
}
let actual = hasher.finish();

if actual.as_ref() != expected {
    issues.push(IssueKind::ChecksumIncorrect);
}
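
Both snippets share the same read loop, so the only part that differs is the hasher. A minimal sketch of factoring the loop into a helper is shown here; `feed_chunks` is a hypothetical name (not part of sha2 or ring), and it uses a heap-allocated buffer rather than the stack array from the snippets above:

```rust
use std::io::{self, ErrorKind, Read};

/// Hypothetical helper: reads `reader` to the end in 128 KiB chunks and
/// passes every chunk to `update`. Returns the total number of bytes read.
fn feed_chunks<R: Read>(mut reader: R, mut update: impl FnMut(&[u8])) -> io::Result<u64> {
    // Heap buffer (an assumption of this sketch; the original uses a stack array).
    let mut buffer = vec![0u8; 128 * 1024];
    let mut total = 0u64;
    loop {
        match reader.read(&mut buffer) {
            // A zero-length read signals end of input.
            Ok(0) => return Ok(total),
            Ok(n) => {
                update(&buffer[..n]);
                total += n as u64;
            }
            // Retry reads that were interrupted by a signal.
            Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
}
```

With either crate, the call site would then look roughly like `feed_chunks(File::open(path)?, |chunk| hasher.update(chunk))?`, which makes it easier to swap hashers when comparing builds.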

Both of these are executed in parallel by rayon over an iterator of all loaded file metadata (path, expected hash and a few other things such as mode, owner, group, ...).

I'm building with --release of course. I also had this:

[profile.release]
lto = "fat"
opt-level = "s"

I'm NOT building with -Ctarget-cpu=native since I want to target generic x86-64 (the same target Arch packages are built for). If I do target my native CPU (Ryzen 5600X), the time does go down a bit, though not to the level of ring.

I tried both with and without the asm feature of sha2, as well as the pre-release sha2-0.11.0-pre.3 (where the asm feature is removed?). None of that made any statistically significant difference.

  • With ring I get the following statistic from zsh for all files on my system: user=18,13s system=21,83s cpu=386% total=10,336. This is repeatable to within +/- 1 second for total time. More repeatable for user/system.
  • With sha2 I'm getting the following statistic: user=100,23s system=21,86s cpu=647% total=18,850. Again very repeatable.

If I change opt-level from s to 2, ring and sha2 become comparable (ring is largely unaffected by this change). So I decided to compare multiple builds with hyperfine:

Key to program name: mybin_<ring/sha2>_<opt_level>

Benchmark 1: test_bins/mybin_ring_2
  Time (mean ± σ):     10.985 s ±  0.793 s    [User: 17.782 s, System: 22.896 s]
  Range (min … max):    9.997 s … 12.373 s    10 runs
 
Benchmark 2: test_bins/mybin_ring_debug
  Time (mean ± σ):     14.673 s ±  0.523 s    [User: 22.884 s, System: 22.843 s]
  Range (min … max):   14.049 s … 15.557 s    10 runs
 
Benchmark 3: test_bins/mybin_ring_s
  Time (mean ± σ):     10.155 s ±  0.883 s    [User: 17.877 s, System: 22.335 s]
  Range (min … max):    8.976 s … 11.477 s    10 runs
 
Benchmark 4: test_bins/mybin_sha2_2
  Time (mean ± σ):     10.083 s ±  0.858 s    [User: 17.634 s, System: 23.141 s]
  Range (min … max):    8.794 s … 11.332 s    10 runs
 
Benchmark 5: test_bins/mybin_sha2_s
  Time (mean ± σ):     20.408 s ±  2.039 s    [User: 100.318 s, System: 22.028 s]
  Range (min … max):   17.783 s … 24.368 s    10 runs
 
Summary
  test_bins/mybin_sha2_2 ran
    1.01 ± 0.12 times faster than test_bins/mybin_ring_s
    1.09 ± 0.12 times faster than test_bins/mybin_ring_2
    1.46 ± 0.13 times faster than test_bins/mybin_ring_debug
    2.02 ± 0.27 times faster than test_bins/mybin_sha2_s

The sha2 debug build was unusably slow: it didn't complete a single run in multiple minutes.

So this leads to two questions:

  1. I'm very confused as to what is going on here. Why is ring largely unaffected while the
    performance of sha2 varies wildly?
  2. I normally build my binaries with opt-level "s", as in my previous experience it has
    been just as fast as 2 but with slightly smaller binaries. Clearly that is not the case here!
    I believe [profile.<name>.package.<name>] exists and could let me override this,
    but it can only be applied at the workspace root level:
    warning: profiles for the non root package will be ignored, specify profiles at the workspace root
    How will this work when I publish to crates.io? I can't have the profile block in the bin crate
    without warnings locally, but from previous experience in other projects it seemed that
    workspace settings were ignored once I published the crate. What should I do to make both workflows work?
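
For reference, a sketch of what such a per-package override could look like in the workspace-root Cargo.toml, reusing the release profile shown earlier. Choosing opt-level 2 for sha2 mirrors the benchmark above. Whether the override alone recovers the full speed is an assumption not verified in this thread, since generic code is optimized with the settings of the crate that instantiates it, and this fragment does not address the crates.io publishing question.

```toml
# Workspace-root Cargo.toml (profiles in member crates are ignored).
[profile.release]
lto = "fat"
opt-level = "s"

# Override only the sha2 dependency; everything else stays at "s".
[profile.release.package.sha2]
opt-level = 2
```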

I cannot simply choose ring in general as some package managers I need to support down the line only store md5sums (looking at you dpkg/apt). And ring does not support md5 (for very good reasons for its primary use case). So I need to figure out how to make the rustcrypto crates performant (I have not yet implemented support for dpkg, so I haven't tried the md5 performance yet).

ring implements most of its algorithms in assembly, so the optimization level does not affect the core of the algorithm (the block compression function). Meanwhile, sha2 has a pure Rust backend, which, like any other pure Rust code, can be really inefficient when optimizations are disabled.

Also, if a small binary is important to you, the current version of sha2 can be suboptimal because of its manually unrolled round loops. See this issue for more information.


Aha, that explains it. But then what does the "asm" feature of sha2 even do? Unfortunately the crate features aren't documented in this case (this is quite often the case and always regrettable).

No, this is more of a nice-to-have, so I'm perfectly happy with making just the sha2 crate build with opt-level 2 instead of s, provided I can make it work both when publishing to crates.io and in a workspace.

It enables experimental support for an asm implementation based on the sha2-asm crate from this repository. Enabling it should make sha2 faster with optimizations disabled, similarly to ring, but note that we plan to remove this feature in a future breaking release. We may later bring back assembly backends for x86 and ARM targets, but they would be implemented using asm!, similarly to the LoongArch asm backend.

Note that the s level can result in a much slower binary, since it inhibits a lot of important optimizations such as inlining. Use it only when binary size is an absolute priority, e.g. for restricted embedded targets.