Benchmark result for "Local" allocation

After all the effort of getting “Local” allocation to work, I wanted to know whether it was actually faster, so I tried a simple benchmark:

#[divan::bench(args = [1, 5, 10, 100, 1000, 10000])]
fn allocate_stdbox(bencher: divan::Bencher, size: usize) {
    // `size` only feeds the throughput counter; each iteration
    // pushes a fixed 200 boxes.
    bencher.counter(size).bench_local(|| {
        let mut v = Vec::new();
        for _i in 0..200 {
            v.push(Box::new(99));
        }
    })
}

#[divan::bench(args = [1, 5, 10, 100, 1000, 10000])]
fn allocate_lbox(bencher: divan::Bencher, size: usize) {
    use rustdb::alloc::{lbox, lvec, Local};

    Local::enable_bump();
    bencher.counter(size).bench_local(|| {
        let mut v = lvec();
        for _i in 0..200 {
            v.push(lbox(99));
        }
    })
}

fn main() {
    // Run registered benchmarks.
    divan::main();
}

Results:

Timer precision: 50 ns
example             fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ allocate_lbox                  │               │               │               │         │
│  ├─ 1             3.089 µs      │ 9.034 µs      │ 3.119 µs      │ 3.31 µs       │ 100     │ 100
│  │                323.7 Kitem/s │ 110.6 Kitem/s │ 320.5 Kitem/s │ 302 Kitem/s   │         │
│  ├─ 5             3.086 µs      │ 6.233 µs      │ 3.396 µs      │ 3.383 µs      │ 100     │ 200
│  │                1.62 Mitem/s  │ 802.1 Kitem/s │ 1.471 Mitem/s │ 1.477 Mitem/s │         │
│  ├─ 10            3.31 µs       │ 9.734 µs      │ 3.426 µs      │ 3.533 µs      │ 100     │ 200
│  │                3.02 Mitem/s  │ 1.027 Mitem/s │ 2.918 Mitem/s │ 2.829 Mitem/s │         │
│  ├─ 100           3.388 µs      │ 7.259 µs      │ 3.509 µs      │ 3.573 µs      │ 100     │ 200
│  │                29.5 Mitem/s  │ 13.77 Mitem/s │ 28.49 Mitem/s │ 27.98 Mitem/s │         │
│  ├─ 1000          3.38 µs       │ 9.669 µs      │ 3.512 µs      │ 3.576 µs      │ 100     │ 200
│  │                295.7 Mitem/s │ 103.4 Mitem/s │ 284.6 Mitem/s │ 279.6 Mitem/s │         │
│  ╰─ 10000         3.393 µs      │ 5.566 µs      │ 3.628 µs      │ 3.608 µs      │ 100     │ 200
│                   2.946 Gitem/s │ 1.796 Gitem/s │ 2.756 Gitem/s │ 2.771 Gitem/s │         │
╰─ allocate_stdbox                │               │               │               │         │
   ├─ 1             3.961 µs      │ 32.57 µs      │ 3.984 µs      │ 5.503 µs      │ 100     │ 100
   │                252.4 Kitem/s │ 30.7 Kitem/s  │ 250.9 Kitem/s │ 181.7 Kitem/s │         │
   ├─ 5             4.088 µs      │ 11.04 µs      │ 4.301 µs      │ 5.936 µs      │ 100     │ 200
   │                1.223 Mitem/s │ 452.4 Kitem/s │ 1.162 Mitem/s │ 842.3 Kitem/s │         │
   ├─ 10            4.092 µs      │ 15.92 µs      │ 7.909 µs      │ 6.443 µs      │ 100     │ 100
   │                2.443 Mitem/s │ 628.1 Kitem/s │ 1.264 Mitem/s │ 1.551 Mitem/s │         │
   ├─ 100           4.216 µs      │ 23.12 µs      │ 7.323 µs      │ 6.46 µs       │ 100     │ 100
   │                23.71 Mitem/s │ 4.323 Mitem/s │ 13.65 Mitem/s │ 15.47 Mitem/s │         │
   ├─ 1000          4.214 µs      │ 9.987 µs      │ 7.262 µs      │ 6.241 µs      │ 100     │ 100
   │                237.2 Mitem/s │ 100.1 Mitem/s │ 137.6 Mitem/s │ 160.2 Mitem/s │         │
   ╰─ 10000         4.212 µs      │ 10.73 µs      │ 8.05 µs       │ 6.431 µs      │ 100     │ 100
                    2.374 Gitem/s │ 931.5 Mitem/s │ 1.242 Gitem/s │ 1.554 Gitem/s │         │

So on this allocation-intensive test it was nearly twice as fast on average. I have to say I am quite dubious about whether this is really worthwhile, but it has still been an interesting exercise.


I updated my benchmark to measure Perm. New results below.

Perm needs a Mutex, so (as expected) it is slower, although not all that much slower than std.

I also added an “info” method to Perm, which allows the current state to be printed, like this:

use pstd::localalloc::Perm;
println!( "Perm::info = {:?}", Perm::info() );

It is quite interesting trying to guess what is allocating from Global in a fairly complex app which is using tokio and various other crates. I was just thinking, is there a way in Rust to print “What is calling me”?

Timer precision: 38 ns
example             fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ allocate_lbox                  │               │               │               │         │
│  ├─ 10            12.05 µs      │ 93.66 µs      │ 12.11 µs      │ 13.05 µs      │ 100     │ 100
│  │                829.4 Kitem/s │ 106.7 Kitem/s │ 825.5 Kitem/s │ 765.7 Kitem/s │         │
│  ╰─ 1000          12.7 µs       │ 30.6 µs       │ 13.69 µs      │ 14.58 µs      │ 100     │ 100
│                   78.7 Mitem/s  │ 32.67 Mitem/s │ 73.02 Mitem/s │ 68.55 Mitem/s │         │
├─ allocate_pbox                  │               │               │               │         │
│  ├─ 10            38.18 µs      │ 120.7 µs      │ 39.03 µs      │ 40.59 µs      │ 100     │ 100
│  │                261.8 Kitem/s │ 82.81 Kitem/s │ 256.1 Kitem/s │ 246.3 Kitem/s │         │
│  ╰─ 1000          37.81 µs      │ 66.54 µs      │ 40.97 µs      │ 41.7 µs       │ 100     │ 100
│                   26.44 Mitem/s │ 15.02 Mitem/s │ 24.4 Mitem/s  │ 23.97 Mitem/s │         │
├─ allocate_stdbox                │               │               │               │         │
│  ├─ 10            30.45 µs      │ 86.29 µs      │ 30.7 µs       │ 33.21 µs      │ 100     │ 100
│  │                328.3 Kitem/s │ 115.8 Kitem/s │ 325.6 Kitem/s │ 301 Kitem/s   │         │
│  ╰─ 1000          29.92 µs      │ 63.54 µs      │ 30.02 µs      │ 31.97 µs      │ 100     │ 100
│                   33.41 Mitem/s │ 15.73 Mitem/s │ 33.31 Mitem/s │ 31.27 Mitem/s │         │
╰─ allocate_tbox                  │               │               │               │         │
   ├─ 10            13.81 µs      │ 58.89 µs      │ 13.92 µs      │ 14.79 µs      │ 100     │ 100
   │                724 Kitem/s   │ 169.7 Kitem/s │ 718.2 Kitem/s │ 676 Kitem/s   │         │
   ╰─ 1000          13.8 µs       │ 24.09 µs      │ 13.88 µs      │ 14.32 µs      │ 100     │ 100
                    72.46 Mitem/s │ 41.49 Mitem/s │ 72.03 Mitem/s │ 69.82 Mitem/s │         │

It sounds like you want #[track_caller]; note that std::panic::Location::caller() is stable and lets a #[track_caller] function capture its caller's location outside of a panic, though only one frame deep, not a full backtrace.

Instead you probably want


It can be tricky capturing info in a global allocator, because you cannot allocate or you get into a recursive loop. I did try capturing the immediate caller in a System-allocated BTreeSet, but that wasn’t useful (it is always the same place). I gave up on that and decided I don’t really need to know; it was just curiosity really.

I found an interesting reddit post here with allegations of various global allocators “leaking” memory (maybe, the truth seems a little murky).

At least with mine I can see exactly what is going on.

I think that's basically why you end up with external tools like Valgrind, yeah. At least, I assume so; I'm fortunate enough not to have needed it myself.

I just had a look at the mimalloc repository, and there seem to be various quite recent fixes.

They say it is “small” in terms of lines of code, and maybe for a highly optimised allocator this is true, but it is clearly quite complex.

At one point it says:

"eager page purging: when a 'page' becomes empty (with increased chance due to free list sharding) the memory is marked to the OS as unused"

Relying on luck and chance to have memory be released back to the operating system seems to me to indicate a fundamental problem with the approach. They also say:

"there will be thousands of separate free lists"

This isn’t light-weight at all. Just my thoughts. With my approach there are just 13 free-lists per thread, and 13 global free-lists (one for each size class, 16..64K).

Then send it into a pipe.

I have done some experimenting with backtrace. The problem is that it seems to allocate (in some cases), and to handle that I need a re-entrant Mutex.

I can see there is one in the standard library, but it isn't stable yet. I could implement one myself I suppose but it seems quite a tricky business. Maybe there is a crate?

parking_lot has this: ReentrantMutex in parking_lot - Rust


Unfortunately:

"ReentrantMutexGuard does not give mutable references to the locked data."

so I would need a RefCell as well. Still, I may do it, or maybe have a distinct allocator that provides backtrace via a feature. Even a feature only available on nightly might be ok.

I don't really want to add a dependency if I can help it.

Maybe a distinct crate entirely with a tracing global allocator would be better.
(I did find a couple of crates, but they are both a bit complicated to use.)

[ Another thing I am thinking about is a general-purpose allocator with "buffering": it buffers (say) 10 allocations at a time in a given size class, to reduce the number of Mutex locks, and does the same for deallocation. This is quick, but you can still have fragmentation, which Local, Temp and now "GTemp" avoid. And actually I don't think speed is the important thing; avoiding fragmentation is what is important. ]
