I found many threads discussing the fact that file-backed mmap is potentially unsafe, but I couldn't find many resources about shared memory with MAP_ANON. Here's my setup:
Setup details:
I use io_uring and a custom event loop (not Rust's async feature)
Buffers are allocated with mmap using MAP_ANON | MAP_SHARED | MAP_POPULATE | MAP_HUGE_1GB
Buffers are organized as a matrix: I have several rows identified by buffer_group_id, each with several buffers identified by buffer_id. I do not reuse a buffer group until all pending operations on the group have completed.
Each buffer group has only one process writing and at least one reader process
Buffers in the same buffer group have the same size (512 bytes for network and 4096 bytes for storage)
I take care to use the right memory alignment for the buffers
I perform direct IO with the NVMe API, along with zero copy operations, so no filesystem or kernel buffers are involved
Each thread is pinned to a CPU of which it has exclusive use.
All processes exist on the same chiplet (for strong UMA)
In the real architecture I have multiple network and storage processes, each owning one shard of the buffer, plus one disk in the case of storage processes
All of this exists only on linux, only on recent kernels (6.8+)
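For concreteness, here's a minimal sketch of how the buffer matrix described above could be indexed (the struct, field names, and group size are my assumptions, not from the original post):

```rust
// Hypothetical layout helper: each buffer group is a contiguous row of
// fixed-size buffers inside one large mmap'd region, addressed by
// (buffer_group_id, buffer_id).
struct BufferMatrix {
    buffer_size: usize,      // e.g. 512 for network rows, 4096 for storage rows
    buffers_per_group: usize, // assumed group width
}

impl BufferMatrix {
    /// Byte offset of (group_id, buffer_id) from the start of the region.
    fn offset(&self, group_id: usize, buffer_id: usize) -> usize {
        assert!(buffer_id < self.buffers_per_group);
        (group_id * self.buffers_per_group + buffer_id) * self.buffer_size
    }
}

fn main() {
    let net = BufferMatrix { buffer_size: 512, buffers_per_group: 8 };
    assert_eq!(net.offset(0, 1), 512);  // second buffer of the first group
    assert_eq!(net.offset(1, 0), 4096); // first buffer of the second group
}
```

Because the buffer size is a multiple of the alignment, every computed offset stays correctly aligned as long as the region base is.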
IPC schema:
Network process (NP) mmaps a large buffer (~20 GiB?) and allocates the first 4 GiB for network buffers
Storage process (SP) gets the pointer to the mmap region and allocates the trailing 16 GiB as disk buffers
NP receives a read request, and notifies storage via prep_msg_ring (man page) that a buffer at a certain location is ready for consumption
SP parses the network buffer, and issues the relevant read to the disk
When the read has completed, SP messages NP via prep_msg_ring that a buffer at a certain location is ready to send
NP sends the disk buffer over the network and, once the send completes, signals SP that the buffer is ready for reuse
Questions:
Is this IPC schema safe?
Should I be worried about UB?
Is prep_msg_ring enough of a synchronization primitive?
(Cross-posted to Reddit, but I'm not using my Reddit account anymore, so I won't put a link here.)
Who sanitizes requests: NP or SP? It should be NP, because if you have unsafe code you might encounter UB, and attackers might use UB to trick SP into reading data they shouldn't have access to.
I hope that NP has only network access (and no NVMe access) while SP has only NVMe access, right?
Yes, I prefer to use local PCIe NVMe devices for SEVERAL reasons (cost, performance, durability of the data, amount of tricks you can do to improve performance, ...)
Data to NP is provided by the compute layer, which is separate, and is responsible for validation, authentication, and authorization. eBPF is used as an additional safety measure in case the compute layer fails
Yes, there is a separation of concerns, mostly for performance reasons
I think the suggestion was to replace your custom protocol implemented in NP+SP with directly exposing NVMe over TCP to the client that would otherwise talk to NP over the network.
In Rust, &[u8] is always strictly immutable, no exceptions. &mut [u8] may only be mutated by the owner of the reference, and by nothing external: not another process, and not even the kernel unless it's via a syscall made by the thread owning the reference.
This means that byte slices can't be used with anything that isn't vanilla boring process-private memory or guaranteed to be completely immutable for as long as any reference to such memory exists.
You can use mmap magic, but it needs *mut [u8] or &[UnsafeCell<u8>] or other types like that which disable Rust's assumptions.
This is a very good question! You are right that I'm basically implementing NVMe-over-TCP (or SCSI-over-TCP) by myself, but I also add rich semantics to the protocol that makes a huge performance difference for my use case.
Also, this approach allows me to use cheap commodity hardware, where NVMe-over-TCP might not be available and where local bandwidth is at a premium (my current nodes have only 2×25 Gbit NICs, and the NVMe disks greatly exceed that bandwidth)
Thanks Kornel! I will dig further, make experiments and eventually get back to you!
Some additional things I found while digging in the meanwhile:
Writes and reads never overlap
Between the write from one thread and the read of another there are multiple synchronizations and memory barriers, like io_uring_smp_store_release and io_uring_smp_load_acquire from here. Does this change anything?
That doesn't sound right to me. How can the Rust compiler distinguish between opaque code in the kernel that fills your &mut [u8] buffer with new data and opaque code in assembler that fills it with new data by means of shared memory?
I think passing a reference to some external function (or maybe even an empty asm statement) should be enough to convince the compiler that some “invisible” changes are happening… otherwise it's not clear how I/O could work in Rust at all…
As I understand it the linux kernel supports exposing an NVMe driver over TCP without requiring any hardware support through the nvmet-tcp kernel module: https://blogs.oracle.com/linux/nvme-over-tcp
If you pass a mutable reference to the kernel, it is allowed to write through said mutable reference, but as soon as you touch it yourself again, this mutable reference the kernel got is invalidated and the kernel is no longer allowed to touch the data until you pass a mutable reference to it again. The same applies for inline asm.
First: Much of the "immediate UB" from mmap is exclusive to file mapping and does not apply to anonymous mappings. Stuff like truncating the file while it's mapped.
Second: you should build an abstraction over your *mut [u8] that re-checks the invariants of complex types like bool, char, and any enum every time one is read from the mapped area. You need to make sure any corruption from one process is detected and contained before it spreads to the other.
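For example, a bool must be exactly 0 or 1, so a validating read rejects anything else instead of transmuting it. A minimal sketch of what such an accessor could look like (the function name is mine):

```rust
use std::ptr;

/// Re-validate the `bool` invariant on every read from the shared region:
/// any byte other than 0 or 1 is corruption from the other process, and is
/// surfaced as an error rather than becoming an invalid `bool` (UB).
fn read_bool(raw: *const u8) -> Result<bool, u8> {
    let b = unsafe { ptr::read_volatile(raw) };
    match b {
        0 => Ok(false),
        1 => Ok(true),
        other => Err(other), // caught here, before it spreads
    }
}

fn main() {
    let bytes = [1u8, 0, 7]; // stand-in for bytes in the mapped area
    assert_eq!(read_bool(&bytes[0]), Ok(true));
    assert_eq!(read_bool(&bytes[1]), Ok(false));
    assert_eq!(read_bool(&bytes[2]), Err(7)); // invalid bit pattern rejected
}
```

The same pattern extends to char (reject surrogates and out-of-range values) and to enums (match on the discriminant instead of transmuting).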
It doesn't always, which is the annoying part about UB working until it doesn't.
In general Rust (and LLVM) is allowed to assume that memory behind a reference won't change if the optimizer doesn't see any code that could change it (opaque function calls are often assumed to be potentially mutating). This means that the optimizer can reorder or cache memory accesses, or assume one read has the same value as another read from the same location. When the memory unexpectedly changes, these assumptions are broken and can lead to invalid optimizations.