Async sha1 file

I am writing a simple web server using Rust + tide 0.10 + async_std 1.6, with a route that takes a file path and returns the sha1 of the file. For very small files it is very quick, but for larger files (300 MB) it takes over 3 seconds in release mode. My workstation is fast; for reference, running sha1sum on the same file completes in under 250 ms. The relevant code is below. I have tried both going through an async_std::io::BufReader and reading from the file directly, with about the same results. Does anyone have recommendations?

const READ_SIZE: usize = 1024 * 8;
/// sha1_from_file example taken from https://rust-lang-nursery.github.io/rust-cookbook/cryptography/hashing.html
async fn sha1_from_file(path: &std::path::Path) -> Result<String, std::io::Error> {
    use async_std::prelude::*; // brings the async `read` method into scope
    let file = async_std::fs::File::open(path).await?;
    let mut reader = async_std::io::BufReader::with_capacity(READ_SIZE, file);
    let mut context = ring::digest::Context::new(&ring::digest::SHA1_FOR_LEGACY_USE_ONLY);
    let mut buffer = [0; READ_SIZE];
    loop {
        let count = reader.read(&mut buffer).await?;
        if count == 0 {
            break;
        }
        context.update(&buffer[..count]);
    }
    let digest = context.finish();
    let sha1_str = data_encoding::HEXUPPER.encode(digest.as_ref());
    Ok(sha1_str)
}

How long does it take to compute the sha1 if you include the file's contents in the binary (via the include_bytes! macro)?

If that is fast, you know the problem is not computing the sha1 itself.
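
For example (a sketch of that experiment; the file path here is a placeholder, not from the original post):

/// Hash bytes compiled into the binary, taking file IO out of the picture.
/// The path is a placeholder for whatever test file you use.
static DATA: &[u8] = include_bytes!("../testdata/large.bin");

fn sha1_of_embedded_bytes() -> String {
    let mut context = ring::digest::Context::new(&ring::digest::SHA1_FOR_LEGACY_USE_ONLY);
    context.update(DATA);
    data_encoding::HEXUPPER.encode(context.finish().as_ref())
}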

File IO cannot be made efficient in the async world, so you probably want to wrap that entire thing in a spawn_blocking call and use std's file IO.

Using the async File type boils down to every read call becoming its own spawn_blocking call, which is quite expensive because it requires a lot of communication between threads.
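
For example (a sketch, assuming a synchronous sha1_from_file like the one in the follow-up below; note that spawn_blocking is behind async_std's "unstable" feature flag):

/// Run the blocking hash on the blocking thread pool: one thread hop per
/// request instead of one per read call.
async fn sha1_from_file_async(path: std::path::PathBuf) -> Result<String, std::io::Error> {
    async_std::task::spawn_blocking(move || sha1_from_file(&path)).await
}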


Converting the function to plain synchronous code and wrapping it in a spawn_blocking call helped a lot. It brought the execution time down to around 750 ms, which is still far slower than sha1sum. Is there anything else I can do on the Rust side? I could also just shell out to sha1sum and let it do the calculation, since the application will only run on Linux (a sketch of that option follows the code below). Any thoughts?
The other question I have: I tried using the buffer as both a bytes::BytesMut and a Vec to experiment with different read sizes, but in both cases the computed SHA1 digest was incorrect. I don't understand why those types would behave differently from a plain array.

fn sha1_from_file(path: &std::path::Path) -> Result<String, std::io::Error> {
    use std::io::Read; // brings the `read` method into scope
    let file = std::fs::File::open(path)?;
    let mut reader = std::io::BufReader::with_capacity(READ_SIZE, file);
    let mut context = ring::digest::Context::new(&ring::digest::SHA1_FOR_LEGACY_USE_ONLY);

    let mut buffer = [0; READ_SIZE];
    //let mut buffer = bytes::BytesMut::with_capacity(READ_SIZE);
    //let mut buffer = Vec::with_capacity(READ_SIZE);
    loop {
        let count = reader.read(&mut buffer)?;
        if count == 0 {
            break;
        }
        context.update(&buffer[..count]);
    }
    let digest = context.finish();
    let sha1_str = data_encoding::HEXUPPER.encode(digest.as_ref());
    Ok(sha1_str)
}
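
As for the shell-out option mentioned above, a minimal sketch (it assumes sha1sum is on PATH and prints the usual "<hex>  <path>" line; Command::output blocks, so like the file IO it would belong inside spawn_blocking):

fn sha1_via_sha1sum(path: &std::path::Path) -> Result<String, std::io::Error> {
    // Run sha1sum as a child process and capture its stdout.
    let output = std::process::Command::new("sha1sum").arg(path).output()?;
    if !output.status.success() {
        return Err(std::io::Error::new(std::io::ErrorKind::Other, "sha1sum failed"));
    }
    // sha1sum prints "<hex>  <path>"; take the first whitespace-separated field.
    let stdout = String::from_utf8_lossy(&output.stdout);
    let hex = stdout.split_whitespace().next().unwrap_or_default().to_uppercase();
    Ok(hex)
}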

Well for one, you don't need a BufReader in this situation; it only introduces an extra memcpy. A BufReader is useful when you want to combine many small reads into fewer large ones, or when you want some kind of look-ahead like the read_line method does. When you just want to plow through the file in chunks as large as you can get, it is not useful.
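
Something like this, for example (a sketch; the 64 KiB chunk size is an arbitrary choice, not a tuned value):

fn sha1_from_file(path: &std::path::Path) -> Result<String, std::io::Error> {
    use std::io::Read;
    // Read straight from the File: no BufReader, no intermediate copy.
    let mut file = std::fs::File::open(path)?;
    let mut context = ring::digest::Context::new(&ring::digest::SHA1_FOR_LEGACY_USE_ONLY);
    // Heap-allocated buffer; 64 KiB is an assumption, not a benchmarked value.
    let mut buffer = vec![0u8; 64 * 1024];
    loop {
        let count = file.read(&mut buffer)?;
        if count == 0 {
            break;
        }
        context.update(&buffer[..count]);
    }
    Ok(data_encoding::HEXUPPER.encode(context.finish().as_ref()))
}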

A stack array for the buffer should be fine, and Vec or BytesMut should behave the same way. I'm guessing the problem is that you were handing the read call an empty slice into the vector: Vec::with_capacity returns a zero-length vector, so &mut buffer derefs to an empty slice, read returns 0 immediately, and the loop exits having hashed nothing.
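
To illustrate (a minimal runnable example of the difference):

const READ_SIZE: usize = 1024 * 8;

fn main() {
    // with_capacity only reserves memory; the vector's length stays 0, so a
    // `read` into it sees an empty slice and returns 0 immediately.
    let buffer: Vec<u8> = Vec::with_capacity(READ_SIZE);
    assert_eq!(buffer.len(), 0);

    // vec! initializes the elements, so the slice has room for `read` to fill.
    let buffer = vec![0u8; READ_SIZE];
    assert_eq!(buffer.len(), READ_SIZE);
}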

Aside

Is there no reasonable way to write an async wrapper around something like select or epoll? As far as I know, they're designed to work with disk I/O in addition to network sockets.

select and epoll on regular files can block, so they are not that useful in a truly async world.
But async fs access is efficient on Windows, and can now be efficient on Linux with the io_uring APIs. Not sure about the BSD equivalent (kqueue).


The epoll API doesn't really do anything useful for a regular file: it reports the file as always ready, and then the read blocks anyway.

