My requirement is a large file upload, using `curl -T <file>`. The server side is an HTTP server in Rust (the implementation can be hyper, or another implementation if more suitable), using HTTP/1.1. The server can only read the file upload data stream once, and I want to calculate the file's checksum (e.g. BLAKE3) without reading the file data stream into memory, and then upload it to S3. If there is a solution for parallel calculation, that would be even better, but it may be more complex; the key is still to reuse the stream.
It is best not to use a channel to implement this: if the stream flows through multiple channels to calculate the checksum, performance may be lower than reusing the stream and calculating at the same time.
This needs to be based on the Rust 2021 edition and actix-web 4.x (the latest is 4.3.0), and if tokio is used it must be at least 1.0.0 (I am currently on 1.25.0). This is because library version compatibility in Rust is poor, and there may be API incompatibilities between minor versions.
May I ask whether my requirement can be implemented in Rust? In Go, there seems to be no better solution than `io.Pipe`, but the performance of `io.Pipe` does not look good.
Thank you very much for your advice and patience!
If you need to calculate the hash before you upload the file, your only options are
- Read the file into memory once, hash it from memory and then upload it from memory
- Read the file in chunks twice. Once for calculating the hash, and once for the actual upload.
It sounds like Option 1 isn't feasible for you, which means you're almost certainly going to have to read the file twice to avoid storing the whole thing in memory.
If you actually don't need the hash before the upload is complete, you have more options though. I implemented a simple `tokio::io::AsyncRead` wrapper for hashing a stream of data in response to another question a while back. Some details might need to change for actix; I haven't used that before.
Thank you very much for your response. Yes, I cannot read the file into memory all at once, as this would certainly cause an OOM. And I need to decide whether to upload the file to S3 after the checksum calculation is completed, in order to avoid redundant storage (of course, another idea is to calculate and upload at the same time, and if a duplicate is found after uploading, delete it asynchronously).
The goal has two main points: first, avoid reading or caching the whole file in memory, which would result in an OOM; and second, reuse the stream, since it is needed for multiple calculations.
What do you mean by "reuse the stream"? If you're using something with an interface similar to `std::fs::File`, there's probably a seek method (tokio has an `AsyncSeek` trait with a `rewind` method, for example) that will let you rewind the stream to the beginning. But most of the work is actually reading the file, so while reusing the object doing the loading is a good idea, it probably won't make a huge difference.
A couple of thoughts from when I've done similar things (in not-Rust):
- Have you looked at how often the uploads are a duplicate? If that's reasonably common, you can calculate the hash first, then make a call to check for its existence, and save ever uploading it, including the second read.
- Can the service insist that the client also send the hash, so it can just double-check it against the recalculated-while-streaming result. (Oh, if the client is curl then maybe not.)
- Does S3 have an efficient way to copy data from one name to another? You could upload to a temporary name, and only move it to the correct hash-based location after the on-the-fly hash calculation has finished.
Thank you very much for your insights. The core goal of this requirement is deduplication of storage.
The frequency of duplicates, as you mentioned, is certainly a very important indicator, similar to cache hit rate. If the duplication rate is too low, the value of this requirement naturally decreases.
The second stage can be combined with a custom CLI tool to locally check the file Checksum and then decide whether to upload.
As for S3's Copy or Rename interface, this depends on the S3 implementation; some are quite fast, some are slow. I have also considered the logic of uploading first and then moving or deleting.
In any case, thank you very much for your suggestions and sharing, I will consider it further!
The specific situation is uploading a file using `curl -T <FILE>`, and the server is an HTTP server. We need to calculate three checksums (MD5, SHA-1, SHA-256) of the PUT request body, and then upload the body to S3 if necessary. Could the seek method lead to OOM?
No, seeking moves the file descriptor back to the beginning of the file data. If you're using buffered IO, seeking will generally clear the buffer.
You should be able to calculate all three hashes simultaneously at least.