How to compute a byte-for-byte ETag using Tokio?

Hi everyone, I'm stuck trying to create a strong ETag for a Hyper body using Tokio.
The code is shown below.
The problem is that the code is obviously inefficient: it copies the entire body into memory and then loops over all of its bytes to compute the hash.

use futures_util::stream::StreamExt;
use headers::HeaderValue;
use http::header::ETAG;
use hyper::{Body, Response};
use std::hash::Hasher;
use tokio_util::codec::{BytesCodec, FramedRead};
use twox_hash::XxHash64;

use crate::Result;

pub async fn etag(resp: Response<Body>) -> Result<Response<Body>> {
    let (mut head, body) = resp.into_parts();

    // AFAICT this line copies the entire body into memory,
    // which is inefficient for large bodies...
    let bytes = hyper::body::to_bytes(body).await?;

    let mut stream = FramedRead::new(&bytes[..], BytesCodec::new());

    let mut hasher = XxHash64::default();

    // Here we consume the stream, so it is empty once the loop completes.
    // Maybe there is a way to borrow a stream rather than consume it? I'm already using `stream.by_ref()`.
    while let Some(chunk) = stream.by_ref().next().await {
        let data = chunk?;
        hasher.write(&data);
    }
    // An ETag value must be a quoted string, so wrap the hex digest in double quotes.
    let hash = format!("\"{:x}\"", hasher.finish());

    // Rebuilding the body with Body::from(bytes) also looks wasteful.
    head.headers
        .insert(ETAG, HeaderValue::from_str(&hash).unwrap());
    // from_parts keeps the original status and version, not just the headers.
    let resp = Response::from_parts(head, Body::from(bytes));

    Ok(resp)
}

Keep in mind that the ETag header has to be set before the response is sent, and the hash has to be calculated over the entire body stream.

I also tried implementing the futures_util::Stream trait to compute the hash in poll_next. But once I have the hash, I don't know how to set it back on the response headers, since headers have to be set before the response body is sent.

I don't know how to do it better so any help will be appreciated.
Thanks in advance!

I don’t know how to do it with hyper, but HTTP has the option of including the ETag as a trailer instead of a header.
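
To illustrate why a trailer fits here, below is a minimal sketch, not hyper code: a plain iterator of chunks stands in for the body stream, a closure stands in for the socket, and a hand-written FNV-1a replaces the XxHash64 from the question so that the example needs no external crates. The names `Fnv1a` and `forward_and_hash` are made up for this sketch. The point is that the digest only becomes available after the last chunk has already been forwarded, which is too late for a header but fine for a trailer.

```rust
/// 64-bit FNV-1a, written by hand so the sketch has no dependencies.
/// (Assumption: swapped in for XxHash64 purely for illustration.)
struct Fnv1a(u64);

impl Fnv1a {
    fn new() -> Self {
        Fnv1a(0xcbf29ce484222325) // FNV offset basis
    }

    fn update(&mut self, data: &[u8]) {
        for &byte in data {
            self.0 ^= u64::from(byte);
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// Forwards every chunk to `sink` (standing in for the socket) while hashing it.
/// Only one chunk is held at a time; the quoted ETag value is returned
/// after the last chunk has been forwarded.
fn forward_and_hash<I, F>(chunks: I, mut sink: F) -> String
where
    I: IntoIterator<Item = Vec<u8>>,
    F: FnMut(&[u8]),
{
    let mut hasher = Fnv1a::new();
    for chunk in chunks {
        hasher.update(&chunk);
        sink(&chunk);
    }
    format!("\"{:x}\"", hasher.finish())
}

fn main() {
    let chunks = vec![b"hello ".to_vec(), b"world".to_vec()];
    let mut sent: Vec<u8> = Vec::new();
    let etag = forward_and_hash(chunks, |chunk| sent.extend_from_slice(chunk));
    // By now the whole body has been "sent", so the tag can only go in a trailer.
    println!("sent {} bytes, trailer ETag: {}", sent.len(), etag);
}
```

Because the hash is updated byte by byte, the result does not depend on how the body happens to be split into chunks.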

Alternatively, HTTP allows you to use any algorithm you want for the ETag as long as different document versions have different values. So, you could derive it from the data that is used to generate the stream instead of the stream itself.


Thanks for the hint. Trailer looks interesting; I will have a look.

But what do you mean by deriving it from the data used to generate the stream? Could you please elaborate a bit more?

If you have something like Fn(T) -> Response<Body> and T: Hash, it would be completely legitimate to send the hash of T as the ETag before calling the closure to actually generate the response (assuming the results of things like DB lookups have already been incorporated into T).
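
A rough sketch of that idea using only the standard library follows. The `PageInputs` struct, its fields, and the `render` function are hypothetical stand-ins for T and the closure; nothing here is hyper-specific. It uses `DefaultHasher`, whose algorithm is unspecified and may change between Rust releases, so tags may change after a toolchain upgrade; the likely cost of that is some extra cache misses, but a hasher with a fixed, documented algorithm may be the safer choice in practice.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical description of everything the response depends on.
#[derive(Hash)]
struct PageInputs {
    user_id: u64,
    last_modified_unix: u64,
    template_version: u32,
}

/// Builds a quoted ETag value from any hashable input,
/// without generating the body at all.
fn etag_for<T: Hash>(inputs: &T) -> String {
    let mut hasher = DefaultHasher::new();
    inputs.hash(&mut hasher);
    format!("\"{:x}\"", hasher.finish())
}

/// Stand-in for the expensive `Fn(T) -> Response<Body>` step.
fn render(inputs: &PageInputs) -> String {
    format!(
        "page for user {} (modified {}, template v{})",
        inputs.user_id, inputs.last_modified_unix, inputs.template_version
    )
}

fn main() {
    let inputs = PageInputs {
        user_id: 7,
        last_modified_unix: 1_600_000_000,
        template_version: 3,
    };
    // The tag is known before the body exists, so it can go in a normal header.
    let etag = etag_for(&inputs);
    let body = render(&inputs);
    println!("ETag: {}\n\n{}", etag, body);
}
```

This only works as a validator if the body is fully determined by the hashed inputs; anything that influences the output but is missing from T would let two different bodies share a tag.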


OK, let me check that I got it right: I could compute a hash per byte chunk and use Trailer to send each chunk's ETag along with it, repeatedly, before generating the final response. Is that right?

But when it is time to send the final response, do I need to send the complete ETag hash value again? Would that be redundant or not?


Because sending an ETag for the whole body stream looks tricky, that probably explains why several implementations go with the weak (simpler) version based on file metadata, a UUID, etc., doesn't it?

It also allows you to skip work when implementing the If-None-Match header mechanism. If you can calculate the current ETag quickly, you can avoid generating the big response and immediately send back a 304 Not Modified response. If you have to generate the whole response anyway, you're doing work you didn't need to do for the request.
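
As a small, framework-free sketch of that shortcut: the `Reply` enum and the `respond` function below are invented for illustration, and header parsing is simplified (it splits on commas, which would mis-handle tags that themselves contain a comma). If-None-Match uses weak comparison, so a `W/` prefix is ignored when comparing.

```rust
/// What the handler decided to send (hypothetical stand-in for a real response).
#[derive(Debug, PartialEq)]
enum Reply {
    /// 304 Not Modified, no body.
    NotModified,
    /// 200 OK with a freshly generated body.
    Full(String),
}

/// Strips surrounding whitespace and an optional weak prefix.
fn opaque(tag: &str) -> &str {
    let tag = tag.trim();
    tag.strip_prefix("W/").unwrap_or(tag)
}

/// True if the If-None-Match value is `*` or lists the current tag.
/// Simplification: splits on ',' without honoring quoted commas.
fn if_none_match_hits(header: &str, current: &str) -> bool {
    header.trim() == "*"
        || header
            .split(',')
            .any(|candidate| opaque(candidate) == opaque(current))
}

/// Calls `generate` only when the client's cached copy is stale or absent.
fn respond<F>(if_none_match: Option<&str>, current_etag: &str, generate: F) -> Reply
where
    F: FnOnce() -> String,
{
    match if_none_match {
        Some(header) if if_none_match_hits(header, current_etag) => Reply::NotModified,
        _ => Reply::Full(generate()),
    }
}

fn main() {
    let current = "\"5f2c\"";
    let expensive = || "big response body".to_string();

    // Client already has the current version: no body is generated.
    println!("{:?}", respond(Some("\"aaaa\", W/\"5f2c\""), current, expensive));
    // Client has an outdated version: the body is generated.
    println!("{:?}", respond(Some("\"aaaa\""), current, expensive));
}
```

This only pays off when the current tag is cheap to obtain, for example from metadata or from the inputs of the response, rather than from hashing the generated body.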

