Why do multiple threads use so much memory when holding a Mutex?

Why does the code below use ~150 MB with a single thread but several GB with 100 threads?

My intuition is that while holding the MutexGuard, a thread has exclusive access, so the new Foo should be allocated and the old value dropped at that point. It doesn't make any sense to me that this much memory is used when running on multiple threads.

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let f = Arc::new(Mutex::new(Foo::new("hello")));

    let mut threads = vec![];
    for i in 0..100 {
        let f = f.clone();
        let t = thread::spawn(move || loop {
            let mut locked = f.lock().unwrap();
            *locked = Foo::new("hello");
            drop(locked);
            println!("{} reloaded", i);
            thread::yield_now();
        });
        threads.push(t);
    }

    threads.into_iter().for_each(|h| h.join().unwrap());
}

pub struct Foo {
    _data: Vec<String>,
}

impl Foo {
    fn new(s: &str) -> Foo {
        Foo {
            _data: vec![s.to_owned(); 1024 * 1024],
        }
    }
}

Depending on your OS, you will have a different system allocator. Allocators use different techniques to allocate memory quickly, and some of them split memory into several pools (arenas); with several threads, your allocations likely end up in different pools.

Is there a way to force the system to allocate from the same pool? I also tried jemallocator, and memory usage is a bit better than with the system allocator, but still not as good as my Java example in the SO question.
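
(For reference, a minimal sketch of how jemallocator is typically wired in, assuming the jemallocator crate is added as a dependency; this is not necessarily the OP's exact setup:)

use jemallocator::Jemalloc;

// Route every heap allocation in the program through jemalloc instead of
// the platform's default system allocator.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // All allocations, on any thread, now go through jemalloc.
    let v = vec![String::from("hello"); 1024];
    println!("{}", v.len());
}

Note that jemalloc also keeps per-thread caches and multiple arenas, which is likely why it helps here but does not eliminate the effect entirely.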

How are you measuring memory use? You're probably measuring virtual memory (address space) usage instead of resident memory.

FWIW this came up recently in another thread:

I recommend reusing the memory if you run into this kind of memory usage.

I don't know how to do that. The new data comes from serde_json deserialization.

How do I check resident memory?

In that case it would probably be easiest to read the data you want to deserialize into one large buffer, and then deserialize into types that borrow slices instead of owned strings, so you don't make a separate allocation for every value.

You can reuse the large buffer by clearing it and reading into it again.
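
A rough sketch of that borrowing approach, assuming serde with the derive feature plus serde_json, and with made-up field names:

use serde::Deserialize;

// The &str fields borrow directly from the JSON input buffer, so
// deserializing does not allocate an owned String per value.
#[derive(Deserialize)]
struct Record<'a> {
    name: &'a str,
    value: &'a str,
}

fn main() -> Result<(), serde_json::Error> {
    // One large, reusable input buffer (in practice filled from a file or socket).
    let mut buf = String::new();
    buf.push_str(r#"{"name":"hello","value":"world"}"#);

    // `record` borrows from `buf`; no per-field String allocations.
    let record: Record<'_> = serde_json::from_str(&buf)?;
    println!("{} = {}", record.name, record.value);

    // Reuse the buffer for the next payload instead of allocating a new one.
    buf.clear();
    Ok(())
}

One caveat: serde_json cannot borrow a &str when the JSON string contains escape sequences; Cow<'_, str> with #[serde(borrow)] is the more forgiving variant.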

The easier way, I think, is to use the jemalloc allocator together with the jemalloc_ctl crate to read its internal stats. See how this is done in rust-analyzer:
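
Roughly, the measuring part looks like this (a sketch assuming the jemallocator and jemalloc_ctl crates, not the exact rust-analyzer code):

use jemalloc_ctl::{epoch, stats};
use jemallocator::Jemalloc;

// jemalloc_ctl only reports on jemalloc, so it has to be the global allocator.
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    let _big = vec![0u8; 64 * 1024 * 1024];

    // Many of jemalloc's statistics are cached; advancing the epoch refreshes them.
    epoch::advance().unwrap();

    // Bytes handed out to the application vs. physical pages jemalloc holds on to.
    let allocated = stats::allocated::read().unwrap();
    let resident = stats::resident::read().unwrap();
    println!("allocated: {} bytes, resident: {} bytes", allocated, resident);
}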

Whoops, let me write a response to the actual question. 🙂

The easiest way to measure would be the /usr/bin/time -v utility.

λ /run/current-system/sw/bin/time -v ls
...
        Command being timed: "ls"
...
        Maximum resident set size (kbytes): 2804
        Average resident set size (kbytes): 0
...

RSS is roughly how many physical pages of memory your app occupies, from the OS's point of view. However, not all of those pages are actually filled with data: your app's allocator manages those pages, and some of them might be free (or only partially full). The links above show how to ask the allocator for both the resident memory (allocated from the POV of the OS) and the allocated memory (from the POV of your application).
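
If you just want the OS-side number from inside the process, on Linux you can also read it from procfs; a rough sketch (Linux-only, with deliberately naive parsing):

use std::fs;

// Read the process's resident set size (in kB) from /proc/self/status.
// Returns None if the file or the VmRSS line is missing.
fn resident_kb() -> Option<u64> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    let line = status.lines().find(|l| l.starts_with("VmRSS:"))?;
    // The line looks like: "VmRSS:     123456 kB"
    line.split_whitespace().nth(1)?.parse().ok()
}

fn main() {
    let _big = vec![1u8; 64 * 1024 * 1024];
    println!("VmRSS: {:?} kB", resident_kb());
}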

I find the easiest way to monitor memory usage is to run top. The RES column shows resident memory and VIRT shows virtual memory, at least in the default configuration on my system.

I am running your Rust code on Linux now and resident memory is hovering around 1.7 GB. If I run it with MALLOC_ARENA_MAX=2 as fghj suggested on SO, that drops to a few hundred MB.