MPSC channels: high memory usage

Hello!
I am experiencing unexpectedly high memory usage using Rust's MPSC channels.
To reproduce it, I made an example (playground --- sorry, it does not run there due to a stack overflow):

use std::{thread, time};

fn main() {
    use std::sync::mpsc::channel;
    //use flume::unbounded as channel;
    //use crossbeam_channel::unbounded as channel;
    let (sender, receiver) = channel();

    let consume = thread::spawn(move || {
        for x in receiver.into_iter() {
            // comment out the next line to consume less memory
            thread::sleep(time::Duration::from_millis(10));
            //println!("receiving");
        }
        println!("done processing");
        thread::sleep(time::Duration::from_millis(5000));
    });

    for i in 0..1024 {
        let v = [0; 1024 * 1024];
        sender.send(Vec::from(v)).unwrap();
    }

    println!("done sending");

    // signalise that we are done sending
    drop(sender);

    // wait for all sent things to be consumed
    consume.join().unwrap();

    println!("done receiving");
    thread::sleep(time::Duration::from_millis(5000));
}

In this example, we send a number of objects over a channel and process them in a different consume thread. When the thread consumes each received object "fast enough" (that is, without the thread::sleep()), then the total memory usage remains low. However, if the thread takes a bit of time to process objects (simulated with thread::sleep()), then the memory usage attains a certain level from which it decreases only when the consumer thread finishes.

memory

In this graph, the actual memory usage drops when "done receiving" is printed, that is, when the thread exits.
Instead, I would have expected that as the consumer thread receives objects, the memory usage would go down gradually.

This behaviour occurs with several channel implementations, namely crossbeam-channel, flume, and std::sync::mpsc.
(For flume, the memory usage drops already at "done processing", but still, that's later than expected.)

This is a real problem in my application, because it increases RAM usage from 4GB to nearly 24GB!

Do you have any idea how to make RAM usage decrease when receiving items, that is, to get to the "expected" memory usage in the graph?

Have you tried using a bounded queue ? They add backpressure and prevent unbounded (...) memory usage.

Yes, I have tried a bounded channel, and the problem goes away when I choose a bound small enough. However, I cannot always easily determine a bound beforehand. And I believe that even with unbounded channels, the memory consumption should decrease gradually as we receive items.

P.S.: To clarify, when I use a channel with a high bound, then the problem still appears.

How are you measuring RAM usage?

If this graph comes from the OS (e.g. Windows task manager) then it makes total sense and everything is working as expected.

When you allocate memory (e.g. with vec![...]), the Vec type asks the global allocator for a chunk of memory. If the global allocator doesn't have enough free memory then it'll ask the OS for more (see man sbrk or man mmap on Linux). This sort of thing is relatively expensive, so in order to reduce the number of trips to the OS it'll be conservative and ask for more than it immediately needs.

Later on, when your Vec goes out of scope and is dropped, it'll tell the global allocator to free the memory. Internally, the global allocator will update some bookkeeping to mark the memory as freed and available to be reused for other allocations. It won't release this memory back to the OS because then you'd have a situation where you're constantly asking the OS to add memory to your address space only to remove it again a couple milliseconds later. That's why you don't see memory consumption gradually decreasing over time when looking at the process from the outside.

If you instrumented the global allocator with something like the stats_alloc crate, you would see that the amount of memory being actively used does reduce over time as Vecs are removed from the queue and consumed.

With this behaviour in mind, the best way to reduce your memory usage is to make sure you don't have lots of large buffers alive at any one time. Using a bounded channel is one way to do this (and what I'd recommend) because it means you can only have at least n big buffers inside the channel before the sender will be told to go to sleep until there's more space. This name for this mechanism is "backpressure".

Knowing what number to use for your bounds is a bit of an art and depends on the runtime characteristics of your program (e.g. is data produced at a constant rate, or do values arrive in bursts), but the general idea is that having some buffering will let your code smooth over short bursts before telling upstream to slow down, while a large bound will increase memory usage and mask mismatches between consumption and production rates.

4 Likes

I measured RAM usage with htop, with /usr/bin/time -f %M, and with gnome-system-monitor. These tools all agree on the values I measured.

In that case the results are entirely unsurprising. The memory allocator often wont release memory back to the OS even if it isn't being used anymore.

4 Likes

Thanks for your extensive answer.

I agree with that.

I tried to verify what you said by writing a small program:

fn main() {
    let v: Vec<_> = std::iter::repeat(Vec::from([0; 1024*1024])).take(1024).collect();
    println!("collected");
    //std::thread::sleep(std::time::Duration::from_millis(2000));
    for i in v.into_iter() {
        std::thread::sleep(std::time::Duration::from_millis(3));
        //println!("i");
    }
}

This program just allocates a Vec with many Vecs inside, then processes each element of the Vec by waiting 5 milliseconds. The memory consumption graph is below:

vec

The graph shows that the memory consumption drops already while we are still into_iterating over the Vec (the slightly wiggly line going down). This seems to contradict your statement, at least in the Vec case:

It won't release this memory back to the OS because then you'd have a situation where you're constantly asking the OS to add memory to your address space only to remove it again a couple milliseconds later.

Perhaps we have a different interpretation of what it means to "release memory back to the OS". But that's OK. Let's just say that I would like to have a memory consumption as my Vec example program demonstrates. The point is that I see in practice very different memory consumption behaviour when into_iterating over a channel Receiver or over a Vec. And, what is even stranger, depending on how long it takes to process items, the Receiver either keeps memory for all elements received or nothing during iteration. That is disturbing me, and that is what I cannot understand.

I have very few elements that are very large, and lots of elements that are very small. For bounded channels to have a proper effect, I would need to set the bound relatively large. In which case I am again getting the problems I observed in this thread.

Independent of whether bounded channels are a good idea in my case, let's say that I am simply curious to see why unbounded channels behave the way they do.

I'm very surprised about your statement. Isn't that the definition of a space leak?
(At least if memory is systematically not released back to the OS.)

I’m having a hard time understanding the relevant implementation of the queue, in particular the usage of this field: spsc_queue.rs - source [1] seems weird. I can only see it mentioned in pop; and also it doesn’t seem to get modified..? (Ctrl+F for its name on the page to see all its usages.) Maybe there’s a bug in the implementations and nodes never get deallocated (before the whole channel is dropped)? (FYI, as far as I can tell, the caching limit that’s used for cache_bound on the call-site seems to be 128, so you’re way above that with your 1024 items.)


Rust’s current mpsc implementation has some other long-standing bugs anyway: Try if using crossbeam instead solves your problem / improves memory usage behavior for you.


  1. this link isn’t stable; for future readers, I’m linking to the cached_nodes field of Consumer, or see here ↩︎

The allocator will sometimes release memory back to the OS, but it's quite unpredictable. I have seen examples of benchmarks in the past were the measured memory remained constant even over a longer duration. I'm sure there are other benchmarks where it does get released back.

Any memory benchmark based on numbers from the OS is fundamentally unreliable. Create a wrapper around the system allocator that counts the memory usage instead. That would give us a number/graph that can reliably tell whether the channel is releasing the memory or not.

See this thread for a prior example of this issue. It also includes an example of implementing a custom allocator.

5 Likes

Oh god, you're right, @alice. It's really the memory allocator.

I just tried my original test program (first post) with jemalloc, using the flags that somebody linked to in the thread you referenced:

JEMALLOC_SYS_WITH_MALLOC_CONF="background_thread:true,narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0,abort_conf:true"

And what memory consumption do we get?

flume

Exactly what I hoped for. Thank you for clearing that up for me.

Unfortunately, I cannot use jemalloc for my whole program, because mimalloc is much faster.
Therefore, I will probably just use bounded channels as suggested by @Michael-F-Bryan and hope to find a proper channel size. :slight_smile:

Thanks again for all your comments!

2 Likes

Just for reference, I tried crossbeam_channel as well as flume, and they exhibit precisely the same behaviour as std::sync::mpsc.

Note that a channel bound of 1 is correct, just not necessarily performant. Assuming that you have a consumer thread running in parallel to the producers, you should set the bound to keep the memory consumption where you want it, and accept that you're trading off wall clock time (where the producer is idle waiting for the consumer to catch up) for memory usage.

Thank you for your comment.

I've actually noticed that when I consume my objects fast enough (which is the case when the consumer runs with several threads), then the total memory consumption remains quite low even when using an unbounded channel. For now, that floats my boat. If I ever notice that memory becomes an issue again, then I will play with channel bounds.

Just to really hammer this point home: if you're building a program for yourself or whatever, then I don't see the harm. But if you're standing up a production service where the inputs to your program are possibly linked to user behavior, then unless there was some other mitigation in place, I probably wouldn't let an unbounded channel pass code review. It doesn't provide any kind of back-pressure at all, which can result in all sorts of bad things happening exactly when you least expect it.

10 Likes

The allocator will reuse the memory it's holding onto for other allocations. So the maximum memory usage is pretty much the same whether it returns memory to the OS or not. It's very likely faster if it keeps the memory, since getting and releasing memory from the OS is relatively expensive.

1 Like

Good point. My application runs only locally on a few sets of known inputs, so that's fine. I would never expose something like that to the internet, for example.