The Linux kernel has a low-level mechanism called static keys, which uses code patching to turn an if-check into either a nop or an unconditional jmp.
A usage example:
define_static_key_false!(FLAG_STATIC_KEY);
if static_branch_likely!(FLAG_STATIC_KEY) {
    do_a();
} else {
    do_b();
}
// Somewhere else, to modify the static branch
FLAG_STATIC_KEY.enable();
The generated assembly does not involve any value test or conditional jump for the if-check. Instead, it is just a nop or an unconditional jmp. When we enable or disable the static key, the instruction is patched from nop to jmp or vice versa.
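To make the nop/jmp swap concrete, here is a minimal sketch of the two 5-byte x86-64 encodings that such a patch site could hold. It assumes the 5-byte form of both instructions; the crate may choose other lengths. The names `NOP5` and `encode_jmp_rel32` are made up for illustration and are not part of the crate's API.

```rust
/// A 5-byte NOP on x86-64: `nop dword ptr [rax + rax*1 + 0]`.
pub const NOP5: [u8; 5] = [0x0F, 0x1F, 0x44, 0x00, 0x00];

/// Encodes `jmp rel32` (opcode 0xE9) located at `site` and landing on `target`.
/// The displacement is relative to the end of the 5-byte instruction.
/// Returns `None` if the target does not fit in a signed 32-bit displacement.
pub fn encode_jmp_rel32(site: u64, target: u64) -> Option<[u8; 5]> {
    let next = site.wrapping_add(5);
    let delta = target.wrapping_sub(next) as i64;
    let rel = i32::try_from(delta).ok()?;
    let d = rel.to_le_bytes();
    Some([0xE9, d[0], d[1], d[2], d[3]])
}
```

Enabling or disabling a key would then amount to overwriting the five bytes at the patch site with one of these two byte sequences, after making the page writable.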
Previously, static keys were only available inside the Linux kernel. Now we have reimplemented them in Rust, and they can be used on Windows, macOS and Linux, on x86-64, x86 and aarch64!
For now, nightly Rust is required for the asm_goto and asm_const features.
Hm, I had time to look closer at this. The fact that you can't modify static keys in a multi-threaded program is a bit limiting. This is not an area I'm very familiar with, but couldn't you use atomic operations to update the instruction?
At least x86-64 has up to a 16-byte compare-and-exchange, which should be enough for the length of one of the involved instructions, I imagine (without having gone and counted the bytes).
And since x86-64 has a strong memory model, I would expect even normal reads to synchronize with writes with at least acq/rel strength (but even relaxed would be enough for this, I believe). As long as the accesses are aligned, of course (but isn't there an align directive in assembly, so that shouldn't be a problem?).
Maybe the fact that this involves the icache throws a wrench in my reasoning, maybe I'm just completely wrong. And I don't know how this would work on weakly ordered architectures anyway (and you probably want to be portable).
An alternative formulation would be to always have jumps with immediate offsets and just modify the address part to switch between two labels. This is slightly less compact than the nop/jmp approach, but it would mean fewer bytes to modify and a higher probability that the change fits into a smaller atomic.
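A rough sketch of that alternative, assuming a 5-byte `jmp rel32` whose opcode byte never changes: switching labels rewrites only the four displacement bytes. The helper names are hypothetical. This only shows the byte arithmetic on an ordinary buffer; it says nothing about whether instruction fetch on another core would observe such a store atomically.

```rust
/// Displacement of a `jmp rel32` located at `site` that lands on `target`.
/// Assumes the target is within +/-2 GiB of the jump.
fn rel32(site: u64, target: u64) -> i32 {
    target.wrapping_sub(site.wrapping_add(5)) as i32
}

/// Rewrites only the 4 displacement bytes; the 0xE9 opcode byte is untouched.
fn retarget(insn: &mut [u8; 5], site: u64, target: u64) {
    insn[1..].copy_from_slice(&rel32(site, target).to_le_bytes());
}

/// True if the displacement (which starts one byte after the opcode) is
/// naturally aligned, so that one aligned 32-bit store could cover it.
fn disp_is_aligned(site: u64) -> bool {
    (site + 1) % 4 == 0
}
```

Note that the alignment requirement moves to `site + 1`, so the jump itself would have to be placed at an address that is 3 modulo 4.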
(Also, I don't believe the Rust abstract machine memory model enters into this: since we are doing inline assembly, we need to use the hardware memory model.)
One issue I can see is that multiple uses of the same static key don't get updated at the same time, which would mean the value as observed by different parts of the code is out of sync. However, this is no different from what would be observed if someone rapidly toggled the value back and forth, so I don't believe it is an actual issue.
As for synchronising between writers, it seems it would be fine to just use a Mutex there even. It is supposed to be rare.
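Something like the following is what I have in mind for the writer side. It is only a sketch: `PATCH_LOCK`, `FLAG`, `set_flag` and `flag_enabled` are hypothetical names, and an `AtomicBool` stands in for the patched instruction so that the example runs with the standard library alone.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Mutex;

/// Serializes all writers. Readers never take this lock.
static PATCH_LOCK: Mutex<()> = Mutex::new(());

/// Stand-in for the patched instruction; a real key would rewrite code here.
static FLAG: AtomicBool = AtomicBool::new(false);

/// Hypothetical writer entry point. Writes are rare, so a global lock is cheap.
fn set_flag(enabled: bool) {
    let _guard = PATCH_LOCK.lock().unwrap();
    // A real implementation would, while holding the lock:
    // make the page writable, patch the instruction, restore the protection.
    FLAG.store(enabled, Ordering::Release);
}

/// Reader side: no lock involved.
fn flag_enabled() -> bool {
    FLAG.load(Ordering::Acquire)
}
```

The lock only orders writers against each other; it does nothing for readers that are executing the instruction while it is being patched.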
Thank you for your great advice. However, I wonder what the use case is for modifying static keys from multiple threads. As far as I can imagine, in ideal use cases the feature flags are given at program initialization, such as RUST_LOG for env_logger, or flags passed on the command line via clap::Parser.
Moreover, as described in my FAQ, to modify static keys from multiple threads we also need to hold a lock to make sure the page protection is not modified concurrently. I wonder whether the performance gained from static keys can make up for the overhead of such locks.
Anyway, I think your solution should work at least for Linux on x86 and x86-64.
I'm thinking about some of the use cases that the Linux kernel has for this. The most famous one is probably trace points that are zero overhead until you enable them.
Another use case is a long running server that can reload the configuration on request (traditionally on receiving SIGHUP).
But there is yet another motivation too, and that is library code. If I want to make a Rust library that uses static keys, the init function for it needs to be unsafe, all the way up the chain to the application. Since I can't know whether the application already started some threads before my library was initialized, anything else would be unsound. It would be nice to not need unsafe here.
Since static keys are modified extremely rarely, having a global lock to set them is most likely OK (you need to benchmark this, obviously).
However, I just realised the program needs to promise to use only one library that modifies the memory protection. If multiple libraries are in use, they still need to be coordinated. So you can't get away from unsafe in the API completely, but at least the safety obligation becomes much easier to fulfill.
Oh, yeah I didn't even think about non-Linux. Can't really help you there I'm afraid.
And while I know a fair bit about the data memory model on x86-64 (and x86) and a little bit about it on AArch64, I don't know whether it works the same for self-modifying code. I have this nagging feeling that I have heard that you may need to manually flush the L1 instruction caches, and that the memory model may be even weaker for instructions. Can't remember which arch that applies to, though.
After writing some implementations of a thread-safe version of static keys, I realized that it may not be possible to atomically update the instruction, even on x86-64. AtomicU128 requires its data to be 16-byte aligned, and so does the underlying hardware instruction. However, since x86-64 is a variable-length ISA, instructions are not necessarily aligned, so the 5-byte instruction may straddle two 16-byte slots, in which case the update is not atomic at all.
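The straddling problem can be checked with a little address arithmetic. The sketch below uses a hypothetical helper, `fits_in_one_slot`, which is not part of the crate, to test whether an instruction lies entirely within one naturally aligned block.

```rust
/// True if the `len`-byte instruction starting at `addr` lies entirely inside
/// one naturally aligned `slot`-byte block (`slot` must be a power of two).
fn fits_in_one_slot(addr: u64, len: u64, slot: u64) -> bool {
    debug_assert!(slot.is_power_of_two() && len >= 1 && len <= slot);
    // First and last byte must fall into the same aligned block.
    addr / slot == (addr + len - 1) / slot
}

/// Counts how many of the `slot` possible start offsets make a `len`-byte
/// instruction cross a block boundary.
fn straddling_offsets(len: u64, slot: u64) -> usize {
    (0..slot)
        .filter(|&off| !fits_in_one_slot(off, len, slot))
        .count()
}
```

For a 5-byte instruction and 16-byte slots, 4 of the 16 possible start offsets (12 through 15) cross a boundary, so without forcing the alignment of every patch site, a 16-byte atomic cannot be relied on.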
So I think that, at least for x86-64 Linux, it is not clear how to correctly implement a static key that can be modified while multiple threads are running.
Anyway, for others who may have concerns: you are free to use static keys (via static_branch_likely! and static_branch_unlikely!) from multiple threads. We are just not able to modify them from multiple threads.
It's feasible. However, as stated in the link about the text_poke syscall, even if we could modify it atomically, it still seems impractical to modify an instruction that another thread may be executing in Linux userland.
Ah I see, I didn't have time to read the linked material before, so I just responded to the bit in the post (sorry about that). If you need to do IPIs (inter-processor interrupts) to sync the icache between cores on x86-64, that is indeed quite nasty. Can that even be done from user space at all? How do JIT engines handle this?
They have sentinel pages that they flip to PROT_NONE, which triggers a segfault handler when an instruction (something cheap, like test with a memory operand) tries to read from them. The handler then does whatever fixup is supposed to happen to get onto the cold path.
There was a new feature added to Linux and glibc recently: restartable sequences. Maybe that could help here?
I have little knowledge of Linux kernel internals, but this is an absolutely useful crate. However, I think this may not work on iOS, because Apple does not allow modifying code at runtime?