Tokio/hyper: how to generate a response with a shared body?

Hello,

I'm building an HTTP server (based on the Tokio/hyper stack) with a special feature: for performance, it only serves files that are preloaded and compressed in a cache.

However, I've noticed that a lot of performance is lost because hyper seems to require cloning the body before passing it into an HTTP response.

Isn't there a way to pass it by reference?

Here's the code in simplified form (the real code is a bit more complex):


I have a HashMap with the file name as the key and a vector holding the file data as the value.

static mut GLOBAL_CACHE: Lazy<HashMap<String, FileData>> = Lazy::new(|| HashMap::new());

struct FileData {
    path: String,
    f_type: String,
    data: Vec<u8>,
    size: usize,
}

I used the "unsafe" to be able to modify the static data, since I know that the modification is made only once at server startup.

    let body: Body;
    let mime: String;
    unsafe {
        if let Some(cache) = GLOBAL_CACHE.get(filename) {
            body = Body::from(cache.data.clone()); // <-------- this CLONE
            mime = cache.f_type.clone();
        } else {
            return Ok(not_found());
        }
    };

    let response = Response::builder()
        .header("Content-Encoding", "gzip")
        .header("Content-Type", mime.as_str())
        .header("Cache-Control", "public, max-age=31536000")
        .status(StatusCode::OK)
        .body(body)
        .unwrap();

    return Ok(response);

With this code I already exceed nginx's performance by ~5%, but I'm sure the margin would be even bigger if I could avoid the clone.

For example, at 1000 req/s for the same resource we make 1000 clones of data that is immutable in memory, even though only read-only access is needed.

If you have any suggestions on how to avoid the clone, I'd be very grateful.

Thanks in advance.

As an aside:

Absolutely do not do that. Apparently you do not understand why static mut needs unsafe. The reason for that is that it can be accessed (and mutated) by several threads at once. Therefore it must be synchronized. You are completely defeating the purpose of Lazy by using unsafe. Lazy would have performed the necessary synchronization for you had you used it correctly.
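For illustration, a minimal sketch of what correct usage could look like (keeping the FileData struct from your post; build_cache() is a hypothetical helper that does the loading and compressing):

use once_cell::sync::Lazy;
use std::collections::HashMap;

// The static is not `mut`: Lazy runs the initializer exactly once, with the
// required synchronization, and afterwards all threads only read the map.
static GLOBAL_CACHE: Lazy<HashMap<String, FileData>> = Lazy::new(build_cache);

fn build_cache() -> HashMap<String, FileData> {
    // hypothetical: load the files, compress them, and fill the map here
    HashMap::new()
}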


Anyway,

How did you "notice" that? It sounds like you are only speculating. Did you measure how much time is spent with cloning the data? Are you running in release mode? It's likely insignificant compared to the entire round-trip time of the request. Yet,

There is a From<&'static [u8]> impl for Body. You could just use that.

The following Playground demonstrates a complete main.rs that I used successfully to run a server serving all-'static data.
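In sketch form, the relevant part looks something like this (a hyper 0.14-style Body as in the snippet above; INDEX_GZ is a hypothetical static standing in for the cached, pre-compressed bytes):

// Body implements From<&'static [u8]>, so bytes living in a static can be
// handed to the response without copying the buffer.
static INDEX_GZ: &[u8] = b"...pre-compressed bytes...";

fn ok_response() -> hyper::Response<hyper::Body> {
    hyper::Response::builder()
        .header("Content-Encoding", "gzip")
        .header("Content-Type", "text/html")
        .status(hyper::StatusCode::OK)
        .body(hyper::Body::from(INDEX_GZ)) // no clone of the byte buffer
        .unwrap()
}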


Another option would be making the data field of FileData store a Bytes instead of a Vec<u8>. That way .clone() would be way cheaper, since it would only increment a reference counter.
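A rough sketch of that change, keeping the rest of the struct from the original post:

use bytes::Bytes;

struct FileData {
    path: String,
    f_type: String,
    data: Bytes, // was Vec<u8>; cloning a Bytes only bumps a ref-count
    size: usize,
}

// per request, roughly:
// let body = Body::from(cache.data.clone()); // shares the buffer, no copy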


If you only need to change it once at startup then you can change its type to OnceLock<HashMap<String, FileData>> and call .set() on it at startup with the correct map. Then you don't need to mark the static as mut and hence you don't need the unsafe.
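In sketch form (with a hypothetical build_cache() doing the loading at startup):

use std::collections::HashMap;
use std::sync::OnceLock;

static GLOBAL_CACHE: OnceLock<HashMap<String, FileData>> = OnceLock::new();

fn build_cache() -> HashMap<String, FileData> {
    HashMap::new() // hypothetical: load and compress the files here
}

fn main() {
    // set() succeeds only once; afterwards reads need no unsafe at all
    if GLOBAL_CACHE.set(build_cache()).is_err() {
        panic!("GLOBAL_CACHE was already initialized");
    }
    // ... start the server ...
}

fn lookup(filename: &str) -> Option<&'static FileData> {
    GLOBAL_CACHE.get()?.get(filename)
}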

Btw your code snippets don't match up:

Here GLOBAL_CACHE is supposed to have type Lazy<HashMap<String, FileData>>

But here you access its field files, which is defined on neither Lazy nor HashMap.


First of all, thank you for taking the time to reply.

I know it's not a clean way of doing things. I used unsafe knowing full well what I was doing: the modification is made only once, in the main function, and the threads only ever read the static data, never modify it.

And I never said I was an expert in Rust; I'm new to the language and I'm trying to get unstuck with my modest knowledge.

I don't think we need tests to prove this. If we have two ways of doing it:

  1. copy the data for each request before sending it
  2. send the data without copying

then it's logical that option 2 will be faster :face_with_raised_eyebrow:

I'm really impressed that you took your precious time to give me a working example. Thank you for the proposal, I will test it.


Thanks for the suggestion, I'll test it too, combined with the solution from @H2CO3.

Sorry, I hadn't noticed that. I simplified the real code to post it on the forum and forgot to update that part; the complete code is much longer and it wouldn't be practical to post all of it here. (I've corrected the post.)

I'm going to apply all your advice, benchmark the old version against the new one, and post the stats here. :pray:

thanks again.

Yes, but by what margin? If the network round-trip is 1000x slower than the copying part (which is a completely realistic estimate), then the non-copying version may be 0.1% faster, at which point the difference is smaller than the noise (i.e. statistically insignificant). You always need to measure the real, actual effect of this sort of claim – making the code more complicated for no substantial benefit is not good.


Test results (@H2CO3, @SkiFire13):

Old version (v1):

wrk -c1000 -d30s -t12 http://127.0.0.1:1337/index.html
Running 30s test @ http://127.0.0.1:1337/index.html
  12 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.22ms    4.02ms 208.77ms   98.78%
    Req/Sec    70.74k     7.67k   86.81k    85.47%
  25368002 requests in 30.07s, 7.96GB read
Requests/sec: 843558.96
Transfer/sec:    271.11MB

++++
CPU usage: ~51%
Memory usage:
at startup: 589 KB
maximum use during the load test: 19.6 MB
residual memory after the test ended: 13.8 MB


New version (v2), applying your advice:

wrk -c1000 -d30s -t12 http://127.0.0.1:1337/index.html
Running 30s test @ http://127.0.0.1:1337/index.html
  12 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.18ms    2.86ms 209.27ms   98.02%
    Req/Sec    71.38k     5.30k   83.71k    76.00%
  25598805 requests in 30.08s, 8.03GB read
Requests/sec: 850957.65
Transfer/sec:    273.49MB

++++
CPU usage: ~51%
Memory usage:
at startup: 393 KB
maximum use during the load test: 19.7 MB
residual memory after the test ended: 14.1 MB

There's a slight improvement in performance, but the memory usage puzzles me; I was expecting it to be lower.

Does Body::from still create an internal copy of the data?
And as for the residual memory, is it a memory leak, or does Tokio just not empty its thread pool?

Important note: to convert the data to &'static, I used the leak() method. Is that a good idea?

FileData {
    path: string_to_static_str(path.to_str().unwrap().to_string()),
    f_type: content_type,
    data: cdata.leak(),
    // size: data_size,
}

fn string_to_static_str(s: String) -> &'static str {
    s.leak()
}

I'll see if it's better with Pin<Box<[u8]>>.

Test results with a large file (@H2CO3, @SkiFire13):

Old version (v1):

wrk -c1000 -d30s -t12 http://127.0.0.1:1337/image.jpg
Running 30s test @ http://127.0.0.1:1337/image.jpg
  12 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.16s   446.86ms   2.00s    63.99%
    Req/Sec    57.55     33.47   259.00     73.60%
  19688 requests in 30.47s, 133.99GB read
  Socket errors: connect 0, read 0, write 0, timeout 2588
Requests/sec:    646.19
Transfer/sec:      4.40GB

++++
CPU usage: ~65%
Memory usage:
at startup: 8 MB
maximum use during the load test: 4.3 GB
residual memory after the test ended: 3.0 GB


New version (v2), applying your advice:

wrk -c1000 -d30s -t12 http://127.0.0.1:1337/image.jpg
Running 30s test @ http://127.0.0.1:1337/image.jpg
  12 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   661.82ms  366.91ms   1.99s    60.04%
    Req/Sec   108.88     56.37   359.00     68.43%
  37309 requests in 30.09s, 252.51GB read
  Socket errors: connect 0, read 0, write 0, timeout 15
Requests/sec:   1240.08
Transfer/sec:      8.39GB


++++
CPU usage: ~52%
Memory usage:
at startup: 393.2 KB
maximum use during the load test: 21.8 MB
residual memory after the test ended: 21.8 MB

With a large file the difference is huge, roughly 2x, with very low memory consumption.
The JPG image is 7 MB, whereas v2 with the cache starts at 393 KB. So is the leak() method dangerous? Does it free the memory, leaving us pointing at memory that could be modified?

No, the purpose of leak is that it prevents the memory from ever being freed. It is a safe function.


No, there's zero need for explicit leaking. You are already putting the data in a global/static, so you can just get a &'static reference out of the cache directly without any sort of additional trickery. Demo.
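In sketch form, the point is roughly this (building on the OnceLock-style cache suggested earlier in the thread, with data kept as a Vec<u8> inside the map):

// The map lives in a static, so a lookup already yields 'static references;
// no Vec::leak or Box::leak is needed.
fn body_for(filename: &str) -> Option<hyper::Body> {
    let cached: &'static FileData = GLOBAL_CACHE.get()?.get(filename)?;
    let bytes: &'static [u8] = &cached.data; // borrow of the cached Vec<u8>
    Some(hyper::Body::from(bytes)) // uses From<&'static [u8]>, no copy
}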

I've tried everything, but nothing works except using leak().

FileData {
    path: string_to_static_str(path.to_str().unwrap().to_string()),
    f_type: content_type,
    data: &*cdata,
    // size: data_size,
}
`cdata` does not live long enough
borrowed value does not live long enough
main.rs(127, 1): `cdata` dropped here while still borrowed
main.rs(113, 9): binding `cdata` declared here
main.rs(124, 15): this usage requires that `cdata` is borrowed for `'static`

The problem with leak() is that it hides the real memory usage: at the OS level you can't see what the application is actually consuming. Is there another way to turn a Vec<u8> into a &'static [u8]?

I don't want to use Arc/Mutex/locks, for performance reasons.

Would Pin or Cow be a good idea?

I found this discussion:
https://www.reddit.com/r/rust/comments/176tfy8/when_to_use_boxleak_in_rust/
and there's an interesting suggestion in it: the ArcSwap crate.

Nah. You should put the data in the cache by-value. Don't try to make the cache store &'static references. References are not for holding on to data. Store the data by value, and if the cache is a static, then you'll be able to get &'static references from it.

Again, using an Arc would be unnoticeable. Cloning an Arc is literally a single atomic increment.
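As a tiny sketch of that cost:

use std::sync::Arc;

fn main() {
    // ~7 MB payload, comparable to the JPG used in the benchmarks above
    let payload: Arc<Vec<u8>> = Arc::new(vec![0u8; 7 * 1024 * 1024]);
    let per_request = Arc::clone(&payload); // copies a pointer, bumps an atomic counter
    assert_eq!(per_request.as_ptr(), payload.as_ptr()); // same buffer, not a duplicate
}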

This has nothing to do with pinning. Accessing a pinned pointer is not inherently faster than an unpinned one. Pinning is not an "optimization", it's for a completely different purpose (for ensuring memory safety around types that cannot be moved).

I have no idea what you want to do with Cow. Cow is for dynamically choosing between owned and borrowed data. It's not magic. It's not faster to access than a regular borrow, either.


Which one? Did you try using Bytes in the end?


Yes, I use bytes::Bytes, since there have been major changes in hyper 1.0.
Thanks for your suggestion.
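For completeness, a minimal hyper 1.x-style sketch of that response shape (assuming the bytes and http-body-util crates; cached is a hypothetical Bytes value cloned out of the cache, which only bumps a ref-count):

use bytes::Bytes;
use http_body_util::Full;
use hyper::{Response, StatusCode};

fn gzip_response(cached: Bytes, mime: &str) -> Response<Full<Bytes>> {
    Response::builder()
        .header("Content-Encoding", "gzip")
        .header("Content-Type", mime)
        .header("Cache-Control", "public, max-age=31536000")
        .status(StatusCode::OK)
        .body(Full::new(cached)) // Full<Bytes> shares the buffer, no copy
        .unwrap()
}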