This is not a Rust-specific problem but a low-level hardware/OS problem. However, I ran into it while writing a Rust crate, so I am asking for help here.
Background
I am working on my Rust crate static-keys, which re-implements the Linux kernel's static keys for userspace Rust applications. In short, static keys let users dynamically patch nop/jmp instructions in place of conditional jumps, reducing the overhead of checking rarely changed features. I do believe this is a useful crate, suitable for many situations such as the log crate. Moreover, this feature has never been implemented in userspace in any other language.
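For comparison, the kind of check a static key is meant to replace might look like the following. This is a minimal sketch and not the static-keys API; the names `LOGGING_ENABLED` and `process` are hypothetical and exist only for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical feature flag: changes rarely, but is checked on every call.
static LOGGING_ENABLED: AtomicBool = AtomicBool::new(false);

// Conventional check: one atomic load plus one conditional branch per call.
// A static key would instead patch this site into a `nop` or a `jmp`.
fn process(x: u32) -> (u32, bool) {
    let logged = LOGGING_ENABLED.load(Ordering::Relaxed);
    if logged {
        // Rarely taken path, e.g. emit a log record.
    }
    (x.wrapping_mul(2), logged)
}

fn main() {
    println!("{:?}", process(21));
    LOGGING_ENABLED.store(true, Ordering::Relaxed);
    println!("{:?}", process(21));
}
```

The load and branch are cheap individually, but they sit on the hot path of every call, which is the overhead static keys aim to remove.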
However, if the application is multi-threaded, we may need to dynamically modify an instruction that another thread may be executing; Intel calls this "cross-modifying code" (XMC). This feature is requested in this post (and some initial discussion happened there), but it is a very complicated low-level problem. I am not sure what the correct way to implement it in userspace is, and it is also very hard to write a unit test to make sure I am doing the right thing.
Any theoretical suggestions or practical code examples are welcome. Thanks in advance to anyone who can help. If this question does not belong here, I would appreciate it if someone could tell me which forum I should post it in.
The problem itself
As described above, the problem is how to dynamically modify an instruction that another thread may be executing. From the Intel manual, Volume 3, section 9.1.3, Handling Self- and Cross-Modifying Code:
To write cross-modifying code and ensure that it is compliant with current and future versions of the IA-32 architecture, the following processor synchronization algorithm must be implemented:
(* Action of Modifying Processor *)
Memory_Flag := 0; (* Set Memory_Flag to value other than 1 *)
Store modified code (as data) into code segment;
Memory_Flag := 1;

(* Action of Executing Processor *)
WHILE (Memory_Flag ≠ 1)
    Wait for code to update;
ELIHW;
Execute serializing instruction; (* For example, CPUID instruction *)
Begin executing modified code;
It is worth noting that the above approach is the only right way to cross-modify code on Intel architectures. Other approaches, such as relying on cache lines and alignment, are not standard and should not be adopted.
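The ordering in the manual's pseudocode can be modelled in Rust as below. This is only a sketch of the handshake on ordinary data: the "code" is a plain byte buffer rather than executable memory, and `PatchSite`, `modify`, `wait_then_read`, and `serialize` are hypothetical names. On x86_64 the serializing step uses the `CPUID` instruction via `std::arch::x86_64::__cpuid`; on other targets the sketch does nothing there, which is an assumption made for portability and not a claim about what those architectures require.

```rust
use std::sync::atomic::{AtomicBool, AtomicU8, Ordering};
use std::thread;

const NOP5: [u8; 5] = [0x0F, 0x1F, 0x44, 0x00, 0x00]; // 5-byte nop
const JMP5: [u8; 5] = [0xE9, 0x10, 0x00, 0x00, 0x00]; // jmp rel32

// Stand-in for a patch site; a real one would live in executable memory.
struct PatchSite {
    bytes: [AtomicU8; 5],
    memory_flag: AtomicBool,
}

impl PatchSite {
    fn new(init: [u8; 5]) -> Self {
        PatchSite { bytes: init.map(AtomicU8::new), memory_flag: AtomicBool::new(false) }
    }

    // Action of the modifying processor.
    fn modify(&self, new: [u8; 5]) {
        self.memory_flag.store(false, Ordering::Relaxed);
        for (slot, b) in self.bytes.iter().zip(new) {
            slot.store(b, Ordering::Relaxed);
        }
        // Release: the byte stores above become visible before the flag does.
        self.memory_flag.store(true, Ordering::Release);
    }

    // Action of the executing processor: wait, serialize, then "execute"
    // (here: just read back the bytes).
    fn wait_then_read(&self) -> [u8; 5] {
        while !self.memory_flag.load(Ordering::Acquire) {
            std::hint::spin_loop();
        }
        serialize();
        let mut out = [0u8; 5];
        for (o, slot) in out.iter_mut().zip(&self.bytes) {
            *o = slot.load(Ordering::Relaxed);
        }
        out
    }
}

// Serializing instruction: CPUID on x86_64.
#[cfg(target_arch = "x86_64")]
#[allow(unused_unsafe)]
fn serialize() {
    let _ = unsafe { std::arch::x86_64::__cpuid(0) };
}

// Placeholder on other targets (assumption: nothing is modelled here).
#[cfg(not(target_arch = "x86_64"))]
fn serialize() {}

fn main() {
    let site = PatchSite::new(NOP5);
    let seen = thread::scope(|s| {
        let exec = s.spawn(|| site.wait_then_read());
        site.modify(JMP5);
        exec.join().unwrap()
    });
    assert_eq!(seen, JMP5);
}
```

The sketch makes the cost visible: the executing side has to poll `memory_flag` before it may proceed, which is exactly the per-execution check that static keys are supposed to avoid.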
It is obviously unacceptable for the executing processor to check the memory flag every time, since the whole point of static keys is to reduce the overhead of conditional jumps. The Linux kernel handles this problem with text_poke_bp (it has well-written comments):
- Replace the first byte with 0xCC (the int 3 instruction)
- Modify the remaining bytes
- Replace the first byte with the first byte of the new instruction
This approach does follow the rule given by Intel: when the executing thread encounters the int 3, it stops and enters the signal handler, which can be treated as "waiting for code to update".
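The order of the three steps can be sketched on a plain byte buffer as follows. This is not the kernel's code and does not touch executable memory; `poke_bp`, `snapshot`, and the `sync` hook are hypothetical names. The hook stands in for whatever cross-core synchronization a real implementation performs between steps, and here it only records the intermediate states.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const INT3: u8 = 0xCC;
const NOP5: [u8; 5] = [0x0F, 0x1F, 0x44, 0x00, 0x00]; // 5-byte nop
const JMP5: [u8; 5] = [0xE9, 0x10, 0x00, 0x00, 0x00]; // jmp rel32

// Copy the current bytes of the simulated patch site.
fn snapshot(site: &[AtomicU8]) -> Vec<u8> {
    site.iter().map(|b| b.load(Ordering::SeqCst)).collect()
}

// Apply `new` to `site` in breakpoint-first order.
// `sync` is a hypothetical hook called after each step.
fn poke_bp(site: &[AtomicU8], new: &[u8], mut sync: impl FnMut(&[AtomicU8])) {
    assert_eq!(site.len(), new.len());
    assert!(!new.is_empty());

    // Step 1: replace the first byte with int 3.
    site[0].store(INT3, Ordering::SeqCst);
    sync(site);

    // Step 2: write every byte except the first.
    for (slot, &b) in site[1..].iter().zip(&new[1..]) {
        slot.store(b, Ordering::SeqCst);
    }
    sync(site);

    // Step 3: replace int 3 with the first byte of the new instruction.
    site[0].store(new[0], Ordering::SeqCst);
    sync(site);
}

fn main() {
    let site: Vec<AtomicU8> = NOP5.iter().map(|&b| AtomicU8::new(b)).collect();
    let mut states = Vec::new();
    poke_bp(&site, &JMP5, |s| states.push(snapshot(s)));
    for st in &states {
        println!("{:02X?}", st);
    }
}
```

At every intermediate state the first byte is either the old opcode, 0xCC, or the new opcode, so a thread that fetches the site never starts decoding a half-written instruction from its first byte.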
However, when implementing this in userspace, you now have to deal with multi-threaded signal handling:
- Which thread will handle the SIGTRAP signal raised by int 3?
- What should the signal handler do?
For the first question, I got lost among many conflicting descriptions (see this SO for detail), and I really don't know the right answer.
For the second question, the Linux kernel's way is to emulate the instruction (poke_int3_handler). I think a simpler approach is to just wait in the signal handler until the modification is done (provided we can make sure the signal handler does not run on the thread that does the modification).
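The "wait in the handler" idea could look roughly like this. It is a sketch of the handler's core logic only: installing the handler with sigaction and writing the resume address back into the saved context are not shown, and `PATCH_IN_PROGRESS` and `resume_address` are hypothetical names. One detail it does model: on x86, int 3 is a trap, so the instruction pointer reported to the handler points just past the 0xCC byte and has to be rewound by one byte before the patched instruction can be re-executed.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;
use std::time::Duration;

// Hypothetical flag set by the patching thread for the duration of a patch.
static PATCH_IN_PROGRESS: AtomicBool = AtomicBool::new(false);

// Length of the int 3 instruction (the single byte 0xCC).
const INT3_LEN: usize = 1;

// Core of a hypothetical SIGTRAP handler: wait until the patch is finished,
// then return the address at which the trapped thread should resume.
// Only atomics are used, so nothing here takes a lock or allocates.
fn resume_address(trap_ip: usize) -> usize {
    while PATCH_IN_PROGRESS.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    // Rewind to the start of the patch site so the new instruction runs.
    trap_ip - INT3_LEN
}

fn main() {
    PATCH_IN_PROGRESS.store(true, Ordering::Release);
    let patcher = thread::spawn(|| {
        // Pretend to rewrite the instruction bytes.
        thread::sleep(Duration::from_millis(10));
        PATCH_IN_PROGRESS.store(false, Ordering::Release);
    });
    let resume = resume_address(0x1001);
    patcher.join().unwrap();
    println!("resume at {:#x}", resume);
}
```

If the patching thread itself ever hit the breakpoint, this loop would spin forever, which is why the approach depends on the handler never running on the thread that performs the modification. A real handler would presumably also need to check that the trap address belongs to a site being patched, so that unrelated breakpoints are not swallowed.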
More problems
To properly implement cross-modifying code in userspace, the above approach is not enough. The following problems should be resolved step-by-step:
- Write a good unit test to test this
- Implement it on Linux x86_64 Intel (the most common platform on which static keys will be used)
- Implement it on both Intel and AMD
- Implement it on more OSes (Windows, macOS, etc.)
- Implement it on more architectures (AArch64, etc.)
- Does this work well on multi-socket systems?
- Does this work well for arbitrary instruction substitution instead of nop/jmp?
The above problems exceed my knowledge, and what's worse, it is extremely hard to verify whether a given solution is right (as a matter of fact, I once tested multi-threaded static key modification without any synchronization, and the race condition never occurred).
I hope someone can help with even one of the problems listed above, and I really hope my crate will be useful in the Rust ecosystem.