Trying to solve memory latency

A call flow for my software takes an average of 200 microseconds at the beginning of the run. And after running for an hour, or after running a large game and finishing, it took an average of 400 microseconds.

I traced the entire flow and found that the largest feature of this latency appeared in the .get() method of a hashmap<String, vec>. Early in the run, .get() would always take less than 5 microseconds (even when CPU resources were very tight). And after a delay, it has a 1/3 chance of taking 30-100 microseconds. Of course, the map itself doesn't change during this time, and it has a length of around 20,000.

I've spent weeks trying to solve this problem. I changed the type of the map to hashmap<u64, [u16; 8]>, disabled mimalloc, the person modified the mimalloc parameter, rewrote the code extensively...
None of them solved the latency. This should be memory related time consuming, lots of memory fragmentation, or being cached to the hard disk.
My final solution, if .get() takes more than 30 microseconds three times, reset all global variables and the runtime efficiency will immediately return to what it was at the beginning of the run. e.g.:

let mut m = list_active.write().unwrap();
let c = m.clone();
*m = c;

I can't say for sure if this is the best solution. But so far there have been no negative effects.

Interesting, so effectively writing m = m.clone() makes a difference? Is list_active the HashMap you referred to? If yes, then does the part you said the map doesn't change means literally readonly access?

I think I never noticed such behavior before. Can you provide a reproducing example?

This problem is hard to reproduce, and when you don't focus on the tiny time-consuming costs, you won't even notice it. To reproduce it, I need to use this tool type of software for a long time or play a big game. I can't even be sure if this is a software, or hardware problem

The full definition of "list_active" is as follows, it's not read-only, it will have a small probability of appending new content (very small).

static list_active: Lazy<RwLock<HashMap<u64, [i16; 8]>>> = Lazy::new(|| RwLock::new(HashMap::with_capacity(30000)));

I probably reset 100 of these variables.

I suspect it might be something related to this comment in the documentation of HashMap

Performance
In the current implementation, iterating over map takes O(capacity) time instead of O(len) because it internally visits empty buckets too.

Just a theory, I don't have the context of the full code. Does that get happen to be inside of an iter() call?

3 Likes

What operating system are you using? Could this be related to memory compression? I suspect some os interference with seldomly used sparse pages. Did you try reading the hashmap items into hint::black_box?

I think you're on to something with the HashMap algorithm.

The std HashMap uses open addressing, meaning that items with the same hash value are placed in the next empty slot, and a get operation has to walk over multiple slots to find the key, comparing the key in each slot as it goes. When the map's size approaches its capacity, this happens more frequently and might result in a measurable performance difference.

This can be tested by increasing the initial capacity of the HashMap to see if this impacts the problem. Specify an amount that is double the expected maximum size of the map, just as a test.


Update: I am probably wrong about this. Since cloning the map solves the problem, I was assuming that clone may increase the capacity if the map is almost full, or rehash. But when I look at the hashbrown::HashMap sources, upon which the std HashMap is based, it looks like it just duplicates the buckets. (I can't find any doc about what clone does.)

I'm not clear on one thing. Is the map immutable only after the delay, or during the entire run (including when get only takes 5us)?

And is there no change during the entire run in the access pattern for map entries?

30-100mcs for a simple HashMap::get call sounds unbelievably high. Are you sure the problem is indeed with HashMap::get? How did you determine it? If that's the output of some standard profiler, could it be that the issue is really in the calling functions and you're just observing some botched debug symbols?

Having to deal with empty tombstones is certainly something that comes to mind, but if you're using hashmap in an append-only way it shouldn't be much of an issue. And still, 100mcs is ridiculously high. You could scan your entire 20000-element hashmap in that time, and why would that happen? (Unless you're using some erroneous full-scan quadratic algorithm, but it looks like an unlikely error when you already have a hashmap)

Does the issue persist if you change HashMap into a BTreeMap? They are API-compatible, so it should be easy to test. Are you using the default hash function? How large are the items, exactly? In you other post you suggest they are within 32 bytes, is this true for the actual code? If not, have you tried boxing the items before insertion to reduce their size and pressure on the memory?

EDIT: Actually, even your starting 5mcs per HashMap::get looks crazy slow. Maaaaaybe it's believable if your entire hashmap is always out of cache and is read from memory, but even then it's straining it. Could it be that issue is that at some lower level? E.g. maybe your OS pages out the memory holding your hashmap, so every access incurs a page fault, possibly a disk read? That's more in the ballpark of your numbers. 5mcs for a page fault is more believable than for a simple memory access, and 100mcs for swapped out memory is even more believable, particularly if you use an HDD.

Re-cloning the entire hashmap would touch all of its pages, and could cause the speedup that you observe. You said slowdown happens after an hour, and your app possibly runs as a background process, which would make it prone to swapping out.

What are the specs of your hardware? Could you be running on something non-standard in 2024?

EDIT2: "after running for an hour, or after running a large game" --- yeah, that looks exactly like a swapping issue to me. A large game is extremely likely to kick passive processes from memory into swap. There is little you can do to deal with it. In general, the idea is that if your app runs often enough and touches its memory often enough that its performance matters, then it will not be swapped out, or at least that will happen rarely. If you run only once an hour, then probably a 100mcs delay for accessing paged out memory shouldn't matter either.

Of course, maybe your app is for whatever reason latency-critical, and must execute asap. In general, it's impossible to enforce, unless you control the OS. If you do, you can look up you OS documentation on real-time priority and keeping pages in memory.

From an app, you can try to mlock pages to ensure they are never swapped (on Linux; Windows and Mac would have a different but similar API). But the OS may ignore your request due to lack of privilege or some other config issue. Locking a few pages should generally work, but if you try to keep mega/gigabytes resident, you may very well be denied.

5 Likes

that would be my guess too. this problem reminds me of a fanscinating story I recently read: the chrome installer would use 100% CPU time in Windows kernel's page fualt handler, because the installer set its thread priority class as background, thus the kernel took the liberty to reduce its working set size.

I cannot find the original post I read, but I think this issue contains most of the information, just in case anyone is interested to read:

https://issues.chromium.org/issues/40927803

1 Like

In fact, all "static" variables (long-lived variables) have this problem. I tried changing the type of this variable to [[i16; 8]; 30000], reading it like list[12232] , and the delay still occurs.

There are multiple random occurrences of static variable delays in my code, such as vec.clone(), hashmap["key"]...

The map delay is not inside the iter() method. map's entire lifecycle is mutable.
I tried a lot of hashmap libraries and finally I chose gxhash which has better performance. btreeMap can't solve the latency.
Even if there is a delay, .get() mostly takes 0μs, it just has a 1/3 chance of using 30-100μs.
My PC: AMD 5600G; RAM 16GB; WIN-10 64

That suggests to me that it is not the map that is slowing down, if it is not changing how can it, and that it is everything going on in your program over time that is slowing things down. Likely by consuming more memory and pushing the map out of cache.

1 Like

Like other people already said it seems a problem with the operating system swapping out memory pages. If you're running on Linux it is pretty easy to check if this is the problem. When your program starts to slow down use ps to obtain its PID and then check /proc/PID/status: look for the VmSwap value and compare it with the total memory used by the process: it will tell you a rough estimate of how much your process has been swapped out.