So on this allocation-intensive test, it was nearly twice as fast on average. I have to say I am quite dubious about whether this is really worthwhile, but it has still been an interesting exercise.
I updated my benchmark to measure Perm. New results below.
Perm needs a Mutex, so (as expected) it is slower, although not all that much slower than std.
I also added an “info” method to Perm, which allows the current state to be printed, like this:
use pstd::localalloc::Perm;
println!( "Perm::info = {:?}", Perm::info() );
It is quite interesting trying to guess what is allocating from Global in a fairly complex app that uses tokio and various other crates. I was just thinking: is there a way in Rust to print "what is calling me"?
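The closest thing I know of is std::backtrace (stable since Rust 1.65). A minimal sketch of what I mean:

use std::backtrace::Backtrace;

fn who_is_calling_me() {
    // force_capture() always captures; plain capture() only does so
    // when RUST_BACKTRACE or RUST_LIB_BACKTRACE is set.
    let bt = Backtrace::force_capture();
    eprintln!("called from:\n{bt}");
}

fn main() {
    who_is_calling_me();
}

Calling this from inside a global allocator is another matter though, as below.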
It can be tricky capturing info in a global allocator, because you cannot allocate from inside it or you get into a recursive loop. I did try capturing the immediate caller in a System-allocated BTreeSet, but that wasn't useful (the immediate caller is always the same place). I gave up on that and decided I don't really need to know; it was just curiosity really.
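One general way to break the recursion (a sketch of the technique, not the actual pstd code) is a thread-local re-entrancy flag, so any allocation triggered by the bookkeeping falls straight through:

use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;

thread_local! {
    // True while we are inside our own bookkeeping, so an allocation
    // it triggers falls straight through instead of recursing.
    static IN_HOOK: Cell<bool> = const { Cell::new(false) };
}

struct Tracing;

unsafe impl GlobalAlloc for Tracing {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        IN_HOOK.with(|flag| {
            if !flag.get() {
                flag.set(true);
                // Recording that may itself allocate would go here.
                flag.set(false);
            }
        });
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Used as:
// #[global_allocator]
// static A: Tracing = Tracing;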
I found an interesting Reddit post here with allegations of various global allocators "leaking" memory (maybe; the truth seems a little murky).
At least with mine I can see exactly what is going on.
I think that's basically why you end up with external tools like Valgrind, yeah. At least, I assume so; I'm fortunate enough not to have needed it myself.
I just had a look at the mimalloc repository, and there seem to be various quite recent fixes.
They say it is “small” in terms of lines of code, and maybe for a highly optimised allocator this is true, but it is clearly quite complex.
At one point it says:
“eager page purging: when a "page" becomes empty (with increased chance due to free list sharding) the memory is marked to the OS as unused”
Relying on luck and chance for memory to be released back to the operating system seems to me to indicate a fundamental problem with the approach. They also say:
“there will be thousands of separate free lists”
This isn't lightweight at all. Just my thoughts. With my approach there are just 13 free lists per thread, and 13 global free lists (one for each size class, 16..64K).
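For the curious: assuming the classes double from 16 bytes up to 64K (which is what 13 classes suggests), mapping a size to its class is just a power-of-two round-up. An illustrative sketch, not the actual pstd code:

/// Map an allocation size to one of 13 size classes,
/// 16 bytes (2^4) up to 64K (2^16). Larger requests
/// would bypass the free lists entirely.
fn size_class(size: usize) -> Option<usize> {
    const MIN: u32 = 4; // 2^4 = 16
    const MAX: u32 = 16; // 2^16 = 65536
    let bits = size.max(16).next_power_of_two().trailing_zeros();
    (bits <= MAX).then(|| (bits - MIN) as usize) // 0..=12
}

fn main() {
    assert_eq!(size_class(1), Some(0)); // rounds up to 16
    assert_eq!(size_class(17), Some(1)); // rounds up to 32
    assert_eq!(size_class(65536), Some(12)); // largest class
    assert_eq!(size_class(65537), None); // too big for a class
}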
I have done some experimenting with backtrace. The problem is that it seems to allocate (in some cases), and to handle that I need a re-entrant Mutex.
I can see there is one in the standard library, but it isn't stable yet. I could implement one myself, I suppose, but it seems quite a tricky business. Maybe there is a crate?
"ReentrantMutexGuard does not give mutable references to the locked data."
so I would need a RefCell as well. Still, I may do it, or maybe have a distinct allocator with backtrace enabled via a feature. But maybe a feature only available on nightly would be ok.
I don't really want to add a dependency if I can help it.
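For reference, a minimal sketch of that pattern on nightly, using std's unstable ReentrantLock (the current name on nightly) so no dependency is needed; the API may still change before stabilisation:

#![feature(reentrant_lock)]

use std::cell::RefCell;
use std::sync::ReentrantLock;

static STATS: ReentrantLock<RefCell<Vec<usize>>> =
    ReentrantLock::new(RefCell::new(Vec::new()));

fn record(size: usize) {
    // Locking again on the same thread does not deadlock.
    let guard = STATS.lock();
    // If we re-entered while the RefCell was already borrowed
    // (say the push below allocated), just skip recording.
    if let Ok(mut v) = guard.try_borrow_mut() {
        v.push(size);
    }
}

fn main() {
    record(32);
    println!("{:?}", STATS.lock().borrow());
}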
Maybe a distinct crate entirely with a tracing global allocator would be better.
(I did find a couple of crates, but they are both a bit complicated to use.)
[Another thing I am thinking about is a general-purpose allocator with "buffering": it buffers (say) 10 allocations at a time (in a given size class) to reduce the number of Mutex locks, and the same for deallocation. This is quick, but you can still have fragmentation, which Local, Temp, and now also "GTemp" avoid. And actually I don't think speed is the important thing; avoiding fragmentation is what is important.]
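A quick sketch of that buffering idea for one size class (block addresses stored as usize purely to keep the sketch simple, since raw pointers are not Send; names and the batch size are illustrative, not pstd's actual code):

use std::sync::Mutex;

const BATCH: usize = 10; // blocks transferred per Mutex lock

// Global free list for one size class.
static GLOBAL_FREE: Mutex<Vec<usize>> = Mutex::new(Vec::new());

// Per-thread cache: the Mutex is taken roughly once per BATCH
// allocations or deallocations instead of on every call.
struct LocalCache {
    free: Vec<usize>,
}

impl LocalCache {
    fn alloc(&mut self) -> Option<usize> {
        if self.free.is_empty() {
            // One lock refills up to BATCH blocks.
            let mut global = GLOBAL_FREE.lock().unwrap();
            let take = global.len().min(BATCH);
            let at = global.len() - take;
            self.free.extend(global.drain(at..));
        }
        self.free.pop() // None means: allocate fresh memory instead
    }

    fn dealloc(&mut self, addr: usize) {
        self.free.push(addr);
        if self.free.len() >= 2 * BATCH {
            // One lock returns BATCH blocks to the global list.
            let mut global = GLOBAL_FREE.lock().unwrap();
            global.extend(self.free.drain(..BATCH));
        }
    }
}

fn main() {
    let mut cache = LocalCache { free: Vec::new() };
    cache.dealloc(0x1000);
    assert_eq!(cache.alloc(), Some(0x1000));
}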