I would like to share this tutorial and video on memory latency, locality, and their impact on performance, and how it all aligns with the end of Moore's Law. It is a distillation of the work I've been doing for a few months rewriting the r3bl_tui crate to be more performant. The goal was to speed up Markdown parsing & rendering for large and complex documents in the editor component.
Hopefully this is useful to others who are doing related things.
That looks quite interesting. I have already collected a few links in the Appendix; I think I will add this one.
Please check the following sentence; I am not a native speaker, and I had trouble reading it:
"While there’s no official end date, the it effectively ended around 2015–2020."
For Moore's Law, I still have some hope -- photonic computing seems to be making some progress, and I recently saw a video about these graphene-based chips, see https://www.youtube.com/watch?v=yJSrX1uOjxs
For your mention of smallvec, do you agree with the Reddit thread on it? I think they say that smallvec is mostly useful when we have a Vec containing Vecs as elements.
And perhaps you could explain a bit the difference between the data cache and the instruction cache? Are they typically completely distinct entities? And list typical sizes?
It is "clone-on-write" in Rust. Copy-on-write and clone-on-write are essentially the same concept in Rust, and the Cow enum in the standard library implements a copy-on-write strategy. Cow is a smart pointer that lazily creates an owned copy of the data only when it is modified; it's designed to avoid unnecessary cloning and to handle shared data efficiently.
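For readers who haven't used it, here is a minimal sketch of how std::borrow::Cow defers the clone until a modification is actually needed (the function and inputs are mine, just for illustration):

```rust
use std::borrow::Cow;

/// Returns the input unchanged (borrowed) unless it needs fixing,
/// in which case an owned String is allocated.
fn ensure_trailing_newline(s: &str) -> Cow<'_, str> {
    if s.ends_with('\n') {
        Cow::Borrowed(s) // no allocation, no clone
    } else {
        Cow::Owned(format!("{s}\n")) // clone only when we must modify
    }
}

fn main() {
    let untouched = ensure_trailing_newline("already fine\n");
    let fixed = ensure_trailing_newline("needs a newline");
    assert!(matches!(untouched, Cow::Borrowed(_)));
    assert!(matches!(fixed, Cow::Owned(_)));
}
```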
I haven't read the reddit post, but I can share how I use it. In my TUI engine, I have a hot render loop in which collections of closely related structs are created and dropped. I have some heuristics around the range of sizes these collections can take. In my case, it makes sense to stack-allocate these many small structs, which are dropped very quickly after they are created and used. This saves me from having to follow two pointers (double indirection) to reach structs that I will need shortly after creating them.
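As a rough sketch of that pattern (the struct and the inline capacity of 16 are made up for illustration; this assumes the smallvec crate as a dependency):

```rust
use smallvec::SmallVec;

/// A small struct created and dropped on every pass of a render loop.
#[allow(dead_code)]
struct Span {
    start: u16,
    end: u16,
    style_id: u8,
}

/// Builds the spans for one line. Up to 16 spans live inline on the
/// stack; only unusually long lines spill to the heap, so the common
/// case involves no heap allocation and no pointer chase to reach
/// the elements.
fn render_line(widths: &[u16]) -> SmallVec<[Span; 16]> {
    let mut spans: SmallVec<[Span; 16]> = SmallVec::new();
    let mut cursor = 0u16;
    for (i, w) in widths.iter().enumerate() {
        spans.push(Span {
            start: cursor,
            end: cursor + w,
            style_id: (i % 4) as u8,
        });
        cursor += w;
    }
    spans
}

fn main() {
    let spans = render_line(&[3, 5, 2]);
    assert_eq!(spans.len(), 3);
    assert!(!spans.spilled()); // still entirely on the stack
}
```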
You can find more information about CPU caches, including the data cache and instruction cache, on Wikipedia. The CPU cache article gives a detailed explanation of cache architecture, including the differences between the data cache and the instruction cache, as well as the various levels of the cache hierarchy. Wikipedia is a great resource for learning about computer architecture and CPU design, and that article should provide a good starting point for your research.
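If you're on Linux, you can also inspect your own machine's caches directly. A minimal sketch reading the kernel's sysfs entries (Linux-only; on typical desktop CPUs this shows a split L1 with a data and an instruction cache of 32-64 KiB each, and unified L2/L3 caches):

```rust
use std::fs;

fn main() {
    // Each index* directory under this path describes one cache of cpu0,
    // with its level (1, 2, 3), type (Data, Instruction, Unified) and size.
    let base = "/sys/devices/system/cpu/cpu0/cache";
    let Ok(entries) = fs::read_dir(base) else {
        eprintln!("no cache info under {base} (not Linux?)");
        return;
    };
    for entry in entries.flatten() {
        let dir = entry.path();
        let is_index = dir
            .file_name()
            .map_or(false, |n| n.to_string_lossy().starts_with("index"));
        if !is_index {
            continue;
        }
        let read = |file: &str| {
            fs::read_to_string(dir.join(file))
                .map(|s| s.trim().to_string())
                .unwrap_or_default()
        };
        println!("L{} {:<12} {}", read("level"), read("type"), read("size"));
    }
}
```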
Very cool. I like the current trend of unified memory architecture and focusing on memory bandwidth and latency. And CXL is interesting as well. And Intel is doing some really interesting things with optical compute interconnect.
Well-written article, mostly (I have not looked at the video).
However, it isn't clear what the sources of your numerical claims are. Are they things you found elsewhere (in which case you should cite sources) or things you measured yourself (in which case you should describe your methodology and also include standard deviations and statistical significance figures)?
The numbers for how long allocation and deallocation takes might be right, or not. But right now it is just a claim by a random blog on the internet.
> On a 14th gen Intel CPU (which is a 64-bit x86_64 architecture)
So, the specific CPU has nothing to do with alignment; only the target triple matters. There are cases where alignment differs based on the OS on the same architecture: see the "repr(C) AIX Struct Alignment" thread on Rust Internals for an example of that. For repr(Rust) it is going to be determined by the CPU architecture alone, as far as I know.
jemalloc is a replacement for the default global allocator.
Bad timing: as of a few weeks ago, jemalloc is unmaintained. See jemalloc Postmortem.
Consider mimalloc, tcmalloc or snmalloc instead. But I wouldn't be surprised if someone forks jemalloc soon.
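Swapping in a replacement is a one-liner at the crate root; for example with the mimalloc crate (assuming it has been added to Cargo.toml):

```rust
use mimalloc::MiMalloc;

// Replace Rust's default global allocator for the whole binary.
#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;

fn main() {
    // Every heap allocation below now goes through mimalloc.
    let v: Vec<u64> = (0..1_000).collect();
    println!("allocated {} elements via mimalloc", v.len());
}
```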
Has Moore's first law ended? I've heard similar claims regularly over the last 30 years, even from Moore himself: it was about to end because of the difficulty of dissipating heat; it was about to end because the lithography process was at its limit; it was about to end because we reached particle dimensions, etc.
In practice, it looks like it's still on, though, so I don't think it "effectively ended".
But it's true that, for a while now, the gain in performance has more often been due to architectural improvements than to simply shrinking the process node[1], so even if Moore's law were indeed to end one day, it wouldn't automatically mean that CPUs would stop improving their execution time.
PS: A minor detail: to the cost of allocating on the heap, I'd add the need to perform a system call, which requires the CPU to switch to kernel mode. But I'm not entirely sure of this, since a few syscalls benefit from vDSO; I don't think memory allocation is among them, though.
Which isn't directly what Moore's law is about, by the way: his observation was about the number of transistors per chip doubling every 2 years. The performance increase, doubling every 18 months, was a quote from one of his colleagues, who thought that to be a consequence. Also, our processes have changed a lot since he gave that observation 50 years ago, very early in the history of chip design (SoC, MCM, ...). ↩︎
Memory allocation from the Linux kernel happens via mmap (or sometimes sbrk, but not all allocators use that at all). For larger allocations, only mmap is typically used. Neither benefits from vDSO. However, the user-space malloc implementation allocates larger chunks of memory and then does sub-allocations from them, so the syscall overhead is amortised over many allocations. Memory will also not be returned to the kernel (unless there is a lot of free memory), but will be reused for future allocations.
The exact scheme for all of this will vary between glibc, musl, jemalloc, etc. And really large allocations (multiple pages in size) may get their own mmap allocations directly from the kernel.
One scheme I know of is to have pools for allocations binned by powers of two: 4 bytes, 8 bytes, 16 bytes, 32 bytes, etc. Then any allocation is rounded up to the next available size.
Any general-purpose allocator will always be balancing tradeoffs, and you can almost always beat it with a special-purpose allocator for your use case. E.g., if you have lots of 132-byte allocations, creating your own pool instead of using 256 bytes for each might be really good in the previously mentioned scheme, as sketched below.
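To make the binning concrete, here is a minimal sketch of that rounding scheme (the function is mine, not taken from any particular allocator):

```rust
/// Round a requested size up to the next power-of-two bin,
/// with 4 bytes as the smallest pool in the scheme above.
fn bin_size(requested: usize) -> usize {
    requested.next_power_of_two().max(4)
}

fn main() {
    // A 132-byte allocation lands in the 256-byte bin, wasting
    // 124 bytes -- exactly the case where a special-purpose pool
    // can beat the general-purpose allocator.
    assert_eq!(bin_size(132), 256);
    assert_eq!(bin_size(4), 4);
    assert_eq!(bin_size(33), 64);
}
```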
Some allocators create per-thread pools and optimise for the assumption that freeing on a different thread than the one that allocated is rare, etc. So YMMV.
For anyone else following along, you might run into the same problems I did:
- I had to run cargo add serial_test to fulfill a dependency,
- I had to add #![feature(vec_into_raw_parts)] to the top of my main.rs,
- I updated my Rust toolchain to nightly-x86_64-unknown-linux-gnu rustc 1.89.0-nightly (f3db63916 2025-06-17),
- I ran the code with cargo +nightly test -- --nocapture.
Only then did it compile and let me see the output from, say, the Memory Layout Example tests.
Also, contrary to the article, I get the size of the Demo struct to be 8 bytes, on my x86_64 CPU:
```
Size of Demo: 8
Alignment of Demo: 4
```
I think this is what the code expected (due to the assertions), but the text of the article says that it should be 12 bytes:
> Rust will reorder and pad as needed, but in this case, the minimum size to fit all fields with correct alignment is 8 bytes:
> ...
> Total size: 12 bytes (not 7), due to alignment and padding
I think there's some kind of inconsistency in the writing here. It would be 8 bytes with Rust field ordering, and 12 bytes with C field ordering. I think the article meant to describe how Rust would reorder the fields to fit into 8 bytes, but mistakenly shows the method for 12 bytes instead.
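The article doesn't show the exact field types here, but any struct whose fields sum to 7 bytes with a largest alignment of 4 reproduces those numbers. A sketch (the field types are my assumption; repr(Rust) layout is formally unspecified, but current rustc reorders as described):

```rust
use std::mem::{align_of, size_of};

// Hypothetical fields matching the numbers above: 1 + 4 + 2 = 7 bytes
// of data, largest alignment 4. Rust may reorder (e.g. b, c, a), so
// only 1 byte of tail padding is needed: total 8 bytes.
#[allow(dead_code)]
struct DemoRust {
    a: u8,
    b: u32,
    c: u16,
}

// Same fields, but repr(C) makes the declaration order binding.
#[allow(dead_code)]
#[repr(C)]
struct DemoC {
    a: u8,  // offset 0, then 3 bytes of padding
    b: u32, // offset 4
    c: u16, // offset 8, then 2 bytes of tail padding
}

fn main() {
    println!("Size of DemoRust: {}", size_of::<DemoRust>()); // 8
    println!("Alignment of DemoRust: {}", align_of::<DemoRust>()); // 4
    println!("Size of DemoC: {}", size_of::<DemoC>()); // 12
    println!("Alignment of DemoC: {}", align_of::<DemoC>()); // 4
}
```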