Is sha256 hashing in rust slower than go?

Hi. I ran a benchmark using criterion for Rust and compared the result with a Go benchmark. The scenario: do a buffered read of a 200MB file, 1024 bytes per read, until EOF, and hash each chunk with SHA-256. The code doesn't use the return value right now, since I only want to benchmark the hashing operation.

I've configured the benchmark to run 10 and 100 iterations, in both Rust and Go.

Using Rust I got about 1.2 s, while Go took only 0.7 s.

Having only 2 weeks of Rust experience, I suspect my Rust code is not optimal. Is that the problem, or is it something else?

Rust

Here's the rust code.

const CHUNK_SIZE: usize = 1024;

pub fn calculate_chunk_hash(filepath: &str) {
    chunk_hash::perform(filepath);
}

mod chunk_hash {
    use sha2::{digest::generic_array::GenericArray, Digest, Sha256};
    use std::{
        fs::File,
        io::{BufReader, Read},
    };
    use crate::CHUNK_SIZE;

    pub fn perform(path: &str) {
        let file = File::open(path).unwrap();
        let mut reader = BufReader::new(file);

        let mut buffer = [0u8; CHUNK_SIZE];

        loop {
            let read_res = reader.read_exact(&mut buffer);
            match read_res {
                Ok(_) => hash_sha256(&buffer),
                Err(err) => {
                    match err.kind() {
                        std::io::ErrorKind::UnexpectedEof => {
                            // eprintln!("EOF");
                            break;
                        }
                        _ => {
                            panic!("err {}", err)
                        }
                    }
                }
            }
        }
    }

    fn hash_sha256(msg: &[u8]) {
        let hasher = Sha256::new();
        let _x = hash(msg, hasher);
    }

    fn hash<D: Digest>(msg: &[u8], mut hasher: D) -> GenericArray<u8, D::OutputSize> {
        hasher.update(msg);
        hasher.finalize()
    }
}

The benchmark using criterion:

use criterion::{criterion_group, criterion_main, Criterion};
use filereader_rust::calculate_chunk_hash;

fn criterion_benchmark(c: &mut Criterion) {
    let mut group = c.benchmark_group("vid-example");
    group.significance_level(0.1).sample_size(10);
    group.bench_function("calculate_chunk_hash", |b| {
        b.iter(|| calculate_chunk_hash("file_test.MP4"))
    });
    group.finish()
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Golang

As for golang code, see below:

package chunker

import (
	"bufio"
	"crypto/sha256"
	"errors"
	"fmt"
	"io"
	"os"
)

const (
	ChunkSize = 1024
)

func CalculateChunkHash(path string) {
	data, err := os.Open(path)
	if err != nil {
		panic(err)
	}
	defer data.Close()

	reader := bufio.NewReader(data)
	part := make([]byte, ChunkSize)

	for {
		n, err := reader.Read(part)
		if n > 0 {
			// Hash only the bytes actually read; a short read would
			// otherwise leave stale bytes in the tail of `part`.
			_ = hash_sha256(part[:n])
		}
		if err != nil {
			// This `err` shadows the one from os.Open, so it must be
			// checked inside the loop, not after it.
			if !errors.Is(err, io.EOF) {
				panic(fmt.Sprintf("err read %s: %v", path, err))
			}
			break
		}
	}
}

func hash_sha256(chunk []byte) [32]byte {
	hash := sha256.Sum256(chunk)
	return hash
}


Make sure you test release builds of both. Cargo needs the --release option while building or running.

You're benchmarking two very different operations together: buffered file reads and hashing. That makes optimization and reliable comparison quite challenging.

Since you throw away the hash result, there is a possibility that one or both compilers optimize it away, depending on the implementation of sha256.


If you want to compare the two SHA-256 implementations, you should read the entire 200MB file into memory and generate a hash of the whole thing.

You'll also want to make sure the result is used somehow. In the snippet you shared, I would expect the entire hash_sha256() function to be optimised out of existence when running the code with --release because it has no observable side-effects (i.e. we don't return anything or modify global state like writing to stdout), so you are probably just measuring the time it takes to read a file.


Make sure to enable the asm feature of the sha2 crate.

You can also try using rustflags = "-Ctarget-cpu=native".
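Putting both suggestions together might look like this (the version number is illustrative):

```toml
# Cargo.toml — enable the asm backend of the sha2 crate
[dependencies]
sha2 = { version = "0.10", features = ["asm"] }
```

and, to target the local CPU, a `.cargo/config.toml` with `rustflags = ["-C", "target-cpu=native"]` under the `[build]` table.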

I'm not sure what Go's strategy for file buffering is. Try without BufReader (because read_exact will read the whole chunk in one go, without extra copying), or try BufReader::with_capacity(some_big_number, file).
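A sketch of both variants (the buffer sizes are arbitrary):

```rust
use std::fs::File;
use std::io::{BufReader, Read};

// Variant 1: a BufReader with a much larger internal buffer.
fn open_big_buffered(path: &str) -> std::io::Result<BufReader<File>> {
    let file = File::open(path)?;
    Ok(BufReader::with_capacity(1 << 20, file)) // 1 MiB instead of the 8 KiB default
}

// Variant 2: skip BufReader entirely and read fixed chunks from the File.
// Returns the total byte count just to have an observable result.
fn read_chunks_unbuffered(path: &str) -> std::io::Result<u64> {
    let mut file = File::open(path)?;
    let mut buf = [0u8; 1024];
    let mut total = 0u64;
    loop {
        match file.read(&mut buf)? {
            0 => break, // EOF
            n => total += n as u64,
        }
    }
    Ok(total)
}
```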


OK, so I enabled the asm feature for the sha2 crate. With the same benchmark strategy, performance improves from 1.2 s to 900 ms. However, this is still slower than the Go implementation.

because read_exact will read whole chunk in one go, without extra copying

Could you elaborate on this statement? Isn't "without extra copying" a good thing?

I changed the code a bit so that it loads the entire file into memory and the result is used. The timing is not improving; it's still around 1.18 to 1.2 s.

for context, here's the new code

use sha2::{Digest, Sha256};
use std::io;

pub fn hash_sha256(filepath: &str) -> Result<String, ()> {
    let mut file = std::fs::File::open(filepath).unwrap();
    let mut hasher = Sha256::new();
    let res = io::copy(&mut file, &mut hasher);
    match res {
        Ok(_) => {
            let x = hasher.finalize();
            return Ok(format!("hash {:?}", x));
        }
        Err(err) => {
            panic!("err {}", err);
        }
    }
}

Thanks for replying.

I'm aware of the --release flag for building or running, but I didn't find a similar compiler optimization option for benchmarking. I probably missed it in the documentation, though.

You're benchmarking 2 very different operations together: buffered file reads and hashing. That makes optimization and reliable comparison quite challenging.

Agreed, in the sense of benchmark isolation. However, for a language marketed as "low-level, fast, etc.", I would expect Rust to be superior at a common I/O-related task like this, or at least to show a minimal difference (no more than 100 ms) compared to Go.

Indeed. I did some really coarse testing on a 257MB file: reading the file took 64ms on my laptop, whereas hashing the contents using sha2::Sha256 took 1.6s. Enabling the asm feature of sha2 brought the hashing time down to 1.4s.

For reference, using sha256sum from the command line only took 0.8s in total.

Interesting.

I changed the SHA-256 implementation crate from sha2 to ring, and the benchmark result is now 650ms. The benchmark scenario is still the same: buffered reads (1024 bytes per buffer) of a 200MB file, hashing each buffer with SHA-256.

There's probably something going on with sha2. I'll revisit the sha2 implementation once I get a better grip on Rust.

The sha256 crate is just as slow as sha2. I can confirm ring is much faster, on par with OpenSSL's sha256sum.

Though, to be fair


The sha2 crate without the asm feature is always a pure-Rust implementation using only the safe Rust subset. With asm, it uses SIMD intrinsics on x86-64 and ARM (these may still not be optimal if you don't specify -Ctarget-cpu=native or some new-ish CPU model).

Golang uses hand-written assembly for sha2 by default on most platforms.

So here you're not really comparing Go and Rust, but two different (assembly) implementations of SHA-256.

BufReader has an internal buffer (8KB by default). read_exact copies data from BufReader's buffer into the buffer you provide. So you call the OS 8 times less frequently, but copy all the data twice. If you used read_exact directly on the File, you'd call the OS on every read, but avoid the extra copy. I suspect that a much larger buffer passed to read_exact, used directly on the File, would be faster.


cargo bench, specifically, does not need a flag because it defaults to the bench profile (which also enables optimizations) automatically.

If you did want to change the profile (there's no reason this should be necessary unless you have been customizing the profiles specifically) you would write --profile=release.

This is documented in the main section of the documentation for cargo bench.


No, sha2 can use a backend based on SHA-NI intrinsics even without the asm feature enabled; e.g. on my Ryzen 7 2700X it results in 2080 MB/s of throughput for SHA-256. The asm feature only replaces the software backend with the implementation from the sha2-asm crate, which on ARM may use instructions from the crypto extension (the relevant intrinsics are not yet stable).

Also, you can get good results without using -Ctarget-cpu=native: sha2 by default uses target-feature autodetection based on the cpufeatures crate. Autodetection introduces a tiny bit of overhead, but it's effectively unnoticeable in practice.

The measured difference in performance is probably because we currently do not have SSE and AVX2 based backends for SHA-256, see this issue for more information.


Barely. I read the entire file into memory, then hashed it all in one go. Nearly all the time is spent calculating the hash, not in I/O.

It's a known issue then. Thanks for the link.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.