How to stop an app behaving like Row hammer?

Hi folks, this is not quite a Rust question, but hopefully someone knows the answer.

I accidentally wrote an app that seems to behave like Row hammer (Wikipedia).

The issue is that the app must behave as a memory-hard function, so it is kind of by design. What I'm wondering is whether there are some software tricks that can decrease the probability of this happening.

What I see specifically is that under heavy memory-bound load, some in-memory data ends up with unexpected values, causing unexplainable panics. I'm fairly certain at this point that the bug is not in the algorithm: after a restart with the same inputs I do not get the crash anymore.

My system is a 13900K processor with Crucial Ballistix DDR4 RAM running at its XMP profile (3600 MHz CL16). I used the same RAM with a 5900X processor for a few years before and never observed any memory-related issues until I started hammering it with this app. The OS/CPU do not detect any memory-related errors either.

Just to make sure it is not the memory, I ran memtester for 2 hours and then 3 full passes of memtest86+ (12.5 hours), and neither found any issues.

While I can catch panics around this code and handle them gracefully in the app, there is no guarantee which memory will be corrupted. I had my computer freeze once because of this thing :disappointed:

Any ideas are welcome.

How do you know it's row hammer and not just unsound code or something like that?

2 Likes

I don't know for sure, but there are a few reasons that lead me to this conclusion:

  1. The only unsafe in the code is transmuting fixed-size arrays from/to #[repr(transparent)] newtypes over u32 in two places, which I believe is a correct usage of transmute
  2. The only external dependencies involved are chacha20, rayon and sha2, which I currently assume are designed properly and expose a correct safe API
  3. It mostly happens under heavy load (two instances of the app hammering memory concurrently typically result in panics within 2 hours)
  4. I captured a few example inputs that panicked and failed to reproduce the issue with them, so it doesn't depend on a specific value; it's more like a combination of things related to the environment

Row hammer requires an astronomical number of cache misses. This implies that, if you are in fact hammering DRAM rows, you have a performance incentive to change the algorithm to avoid cache misses.

And any time you are unsure of unsafe usage, you should have it reviewed.

4 Likes

The algorithm is memory-bound by design; there is an unusual number of cache misses that are not avoidable, so yes, it does indeed hit memory, and hit it hard:

perf stat (default)
     5.005038579          10 386,52 msec task-clock:u                     #   10,387 CPUs utilized             
     5.005038579                  0      context-switches:u               #    0,000 /sec                      
     5.005038579                  0      cpu-migrations:u                 #    0,000 /sec                      
     5.005038579            283 406      page-faults:u                    #   27,286 K/sec                     
     5.005038579     51 311 639 029      cpu_core/cycles/u                #    4,940 G/sec                       (47,83%)
     5.005038579     40 694 129 651      cpu_atom/cycles/u                #    3,918 G/sec                       (53,94%)
     5.005038579     93 335 293 643      cpu_core/instructions/u          #    8,986 G/sec                       (47,83%)
     5.005038579     76 695 230 792      cpu_atom/instructions/u          #    7,384 G/sec                       (53,94%)
     5.005038579     17 302 006 773      cpu_core/branches/u              #    1,666 G/sec                       (47,83%)
     5.005038579     14 220 902 310      cpu_atom/branches/u              #    1,369 G/sec                       (53,94%)
     5.005038579        174 393 857      cpu_core/branch-misses/u         #   16,790 M/sec                       (47,83%)
     5.005038579        137 459 123      cpu_atom/branch-misses/u         #   13,234 M/sec                       (53,94%)
     5.005038579    205 385 386 501      cpu_core/slots:u/                #   19,774 G/sec                       (47,83%)
     5.005038579     82 838 531 207      cpu_core/topdown-retiring/u      #     40,1% Retiring                   (47,83%)
     5.005038579     23 974 794 304      cpu_core/topdown-bad-spec/u      #     11,6% Bad Speculation            (47,83%)
     5.005038579     21 051 463 309      cpu_core/topdown-fe-bound/u      #     10,2% Frontend Bound             (47,83%)
     5.005038579     78 714 817 375      cpu_core/topdown-be-bound/u      #     38,1% Backend Bound              (47,83%)
     5.005038579      2 321 632 747      cpu_core/topdown-heavy-ops/u     #      1,1% Heavy Operations          #     39,0% Light Operations           (47,83%)
     5.005038579     22 983 402 116      cpu_core/topdown-br-mispredict/u #     11,1% Branch Mispredict         #      0,5% Machine Clears             (47,83%)
     5.005038579     14 320 266 210      cpu_core/topdown-fetch-lat/u     #      6,9% Fetch Latency             #      3,3% Fetch Bandwidth            (47,83%)
     5.005038579     48 344 834 022      cpu_core/topdown-mem-bound/u     #     23,4% Memory Bound              #     14,7% Core Bound                 (47,83%)
perf stat (cache misses)
     2.001951245         53 503 592      cpu_core/cache-misses:u/                                                (49,16%)
     2.001951245         66 459 026      cpu_atom/cache-misses:u/                                                (51,58%)

With two instances, cache misses increase by a few more percent.

I don't think a 50% cache miss rate is astronomical, but it is pretty terrible indeed.

Auditors are looking at the unsafe code and no issues have been flagged so far. I avoid unsafe unless I am certain it is correct and necessary.

Depends on the details. See if you can get it to work without unsafe of your own using the bytemuck crate.

3 Likes

There is really nothing fancy; the usage is equivalent to this:

use std::mem;

#[repr(transparent)]
struct X(u32);

fn foo(xs: [X; 8]) -> [u32; 8] {
    // SAFETY: X is #[repr(transparent)] over u32, so the two array types have identical layout.
    unsafe { mem::transmute::<_, [u32; 8]>(xs) }
}

And there are identical transmutes the other way around.

1 Like

Yeah that's fine. (Though I'd personally still use bytemuck.)
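
Something like this minimal sketch should do the same conversion without any unsafe of your own (assuming bytemuck with its derive feature enabled; untested):

use bytemuck::{Pod, Zeroable};

// The derives need Copy and a #[repr(transparent)] or #[repr(C)] layout,
// both of which the newtype already has.
#[derive(Clone, Copy, Pod, Zeroable)]
#[repr(transparent)]
struct X(u32);

fn foo(xs: [X; 8]) -> [u32; 8] {
    // Same conversion as the transmute above, but a size mismatch
    // would panic instead of being undefined behaviour.
    bytemuck::cast(xs)
}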

1 Like

Any chance you could run on a system with ECC RAM? Cloud providers, for example, usually use systems with ECC memory, which would help detect whether bit flips are occurring.

2 Likes

Disable any and all overclocking in your motherboard settings and try again. (This includes settings automatically applied by your motherboard, like XMP, etc.)

3 Likes

I do actually have an Epyc 7002 system I can experiment with; it runs ECC memory, but I have also read that not all such RAM attacks can be detected and corrected by ECC.

I thought about it, but nothing indicates there is any system instability except when I run this app. Will probably try that next though.

I also had one particular app once that caused instability. After disabling overclocking, everything worked. Then I replaced the RAM and my shitty power supply, re-enabled overclocking, and that also fixed the problem.

I agree with the sentiment. Percentages are not a great indicator, though. Do you know what the memory bandwidth looks like? Or more importantly, the frequency of cache misses?

Are you sure you are not able to adjust the algorithm or memory layout? It sounds like the application is not memory capacity bound, though it might be memory bandwidth bound. If you are fine wasting some memory, just pad it out enough so that adjacent DRAM rows are never used.

Worst case, you can always sleep the thread for a few milliseconds just to allow time for the DRAM rows to refresh correctly.
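
For illustration only, here is a rough sketch of both ideas combined; the Entry/process names and the 8 KiB row size are assumptions, since the real address-to-row mapping depends on the DIMMs and the memory controller:

use std::{thread, time::Duration};

// Assumed DRAM row size; 8 KiB per bank is typical for DDR4, but the
// actual mapping from addresses to rows is hardware-specific.
const ROW_SIZE: usize = 8 * 1024;

// Pad each entry out to a full (assumed) row so that adjacent indices
// never share a DRAM row; this trades memory for a safety margin.
#[repr(C)]
struct Entry {
    value: u32,
    _pad: [u8; ROW_SIZE - 4],
}

fn process(entries: &mut [Entry], indices: &[usize]) {
    for (i, &idx) in indices.iter().enumerate() {
        entries[idx].value = entries[idx].value.wrapping_add(1);
        // Worst case: periodically yield so the DRAM refresh can catch up.
        if (i + 1) % 1_000_000 == 0 {
            thread::sleep(Duration::from_millis(2));
        }
    }
}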

I don't believe you can "accidentally" make an app which acts as a rowhammer attack. Rowhammer requires a very specific access pattern which repeatedly reads and writes memory in specific memory rows. Moreover, normal ways to hammer a row will be thwarted by the processor cache, which would absorb the entire hammering of a ~64K memory row without touching main memory. The original paper used the clflush processor instruction specifically to flush the cache and remove that layer of defense.

The simple answer to your question is thus "the thing you say is impossible, and if you somehow managed to do it, just change your access pattern". Without any more specific details about your app (ideally, the source) I'll stand by this answer. It's entirely possible that the issue is far more mundane, like faulty hardware or some overheating issue.
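
To make that concrete, the published proof-of-concepts boil down to a loop roughly like this sketch (illustration only; hammer is a made-up name, and a real PoC also has to pick two addresses that map to different rows of the same bank):

// Repeatedly read two addresses and flush them from the cache so every
// access goes all the way to DRAM; without the explicit clflush the
// cache serves the reads and the rows are never actually hammered.
#[cfg(target_arch = "x86_64")]
unsafe fn hammer(a: *const u8, b: *const u8, iterations: usize) {
    use core::arch::x86_64::_mm_clflush;
    for _ in 0..iterations {
        let _ = core::ptr::read_volatile(a);
        let _ = core::ptr::read_volatile(b);
        _mm_clflush(a);
        _mm_clflush(b);
    }
}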

3 Likes

Yes, it is not capacity bound, but rather latency bound. The algorithm is Chia's proof of space (although used for a slightly different purpose). It is specifically designed to be like that and requires a lot of random access to create a series of tables with pointers between them.

The challenge is that the place where this algorithm is used is time-sensitive, so I spent quite a bit of time optimizing it. It already trades some amount of memory for performance, but since the corruption seems to happen within the data set it works with, I'm not sure what amount of padding would be helpful, and any increase in compute cost is extremely undesirable.

I have seen enough strange things as an engineer over the years to not go with "impossible" quickly :slightly_smiling_face: Also, the access pattern is not possible to change; it is supposed to be like that by design. Overheating is unlikely with 3x560mm radiators and a large custom water-cooling loop; temperatures are under control according to visible sensors.

That sounds unlikely. You have a memory-hard function. You are supposed to be utilizing as much of the total memory on each cycle as possible, not hammering away at specific locations (and note that rowhammer requires read-write cycles counted in the millions). Sure, in principle everything is possible. In practice, you are extremely unlikely to hit this bug, particularly given your problem description.

Chia uses k32 for proof of space; my use case involves k20, which means the data set is much smaller and there are more iterations working with that smaller data set. I think that, depending on how the allocator selects memory to work with, it will end up hammering that narrower region of memory. It is a few hundred megabytes in total that the application works with for this purpose.

I still consider it impossible. Chia PoS is specifically designed to access memory in a maximally random way. It's just not realistic that it ends up hammering on adjacent physical memory rows. Add virtual memory into the mix, and you don't even have the guarantee that adjacent addresses correspond in any way to adjacent physical memory.

Whatever the cause of your troubles, rowhammer can't be one of them. That doesn't mean your memory doesn't glitch under high load, but in that case it should be detectable with other applications.

I should add to my previous post: I have had faulty (unstable) hardware that did not report anything wrong in memtest86 over multiple hours of runs.

Even without knowing how Chia works, I wouldn't call this rowhammer, since that implies predictable behavior across many hardware configurations. For now, this is undetermined memory corruption.

You said your RAM was fine when using a 5900X, but that's a whole different motherboard and CPU. You need the whole CPU-motherboard-RAM chain to be stable to have stable RAM.

The first thing you should do is turn off XMP and try it again. That's the easiest thing to do, and it will give you more information. If the corruption stops, then it's likely your RAM isn't stable; if it persists, then it's more likely to be a software bug. There are plenty of bugs that only happen in multithreaded or performance-limited scenarios. You can also try it on different OSes and machines.

Some more things you can do:

  • Remove one stick of RAM
  • Reseat the RAM and CPU
  • Find the minimum amount of code that causes the issue
  • Try running a row hammer proof-of-concept and see if it flips bits faster than normal