I was wondering how memory mapping a file and Rust's aliasing rules for references work together, i.e. if I create a read-only memory mapping of some file and then go ahead use the resulting slice to parse that file, can something bad happen if another process changes this file on disk and hence the &[u8]
does not actually guarantee that the underlying memory is not mutated while I am accessing it?
Yes. For example, if you make a string slice from (part of) the &[u8]
, it is only validated as UTF-8 at creation time. Afterwards, it is expected that the data remains valid.
I believe this is one of the reasons that Mmap::map()
from the memmap
crate is marked as unsafe
. Just because your program is treating the file's contents as immutable doesn't stop another process from doing whatever it wants to it.
Thank you for the quick answer, this is more or less what I understood as well.
But what I still do not understand is how this technique can then be used at all, e.g. how can Ripgrep use mmap and apply regular expressions on the resulting slice without worrying that their implementation goes off the rails, i.e. how is that unsafe
block justified?
I wrote a little bit about memory maps here: https://www.reddit.com/r/rust/comments/906u4k/memorymapped_files_in_rust/e2rac2e/
As far as I can tell, I don't know how to safely encapsulate the use of a memory map inside a library that is both portable and doesn't negate some or all of the benefits of using memory maps in the first place. To me, that means if you write a Rust library that uses memory maps, then that use of unsafe
needs to be propagated outwards into your public API in one form or another. This is a specific problem I faced in the course of libraryizing ripgrep.
In terms of actually justifying it... I don't know how to make that argument. In ripgrep's case, it never creates a &str
from a memory mapped region of &[u8]
, but there's nothing stopping safe code from doing that elsewhere in ripgrep. Pretty much all ripgrep does with the memory mapped region of &[u8]
is search it and print it. If the &[u8]
gets truncated, then I've seen that result in a SIGBUS and thus process termination, which is good and avoids memory safety issues there (at least on Linux). If the &[u8]
gets mutated out-of-process though, then I don't know what that means. Is it UB? If so, then we are in some serious hot water.
I think the Rust ecosystem would benefit from some clear rules or guidelines regarding memory map usage and memory safety.
cc @RalfJung You might be interested in this conversation. Or at least, I know I would be interested in your opinion.
Thank you for the in-depth reply, I will read up on the reddit thread you linked.
Pretty much all ripgrep does with the memory mapped region of
&[u8]
is search it and print it.
After reading on the complications of LLVM's internal aliasing rules and how Rust enables compiler optimizations by applying them to its references, I was even wondering whether just searching resp. parsing the resulting memory might yield UB if the compiler is making inappropriate assumptions.
I guess a mutable memory mapping that is mutated by some other process is even worse as the compiler can make even more assumption about &mut [u8]
w.r.t. aliasing?
(Personally, I was experimenting with porting parts of Poppler to Rust and tried to apply twoway
and nom
to the resulting slice and I hence am not even completely on top of what those external crates do with the memory I pass on to them.)
Yes, I wonder the same thing. I don't have the compiler chops to answer. I'm hoping there's an argument that starts and ends with something like "... since the mutation is out-of-core, it is not actually UB ...".
If you're handing a &[u8]
to another crate, then you can be pretty sure that all its going to do is read from it.
Just out of curiosity, how would this sort of problem be approached in a C program? It's quite common to use mmap for performance and UB due to aliasing or mutation from another process would still present itself as buggy behaviour. Would there be a pragmatic way to approach memory mapped IO which, although unsafe
in the Rust sense, is still sound?
But as established by the example of using str
, just reading could definitely be enough for UB. I think I can generally not expect that an external library is not allowed to try to turn the byte slice into a string and then use that string to e.g. parse it into an integer.
Actually, lazy as I am I even did that myself
named!(pub parse_u32<&[u8], u32>,
map_res!(
digit1,
|digits| unsafe { from_utf8_unchecked(digits).parse() }
)
);
thinking the unsafe
is justified since the digit1
parser only accepts ASCII digits...
I think for correctness, memory mapped IO would be done through volatile
pointers which have exactly the required semantics of telling the compiler that the memory pointed to can be changed without it making that change. But I also suspect this would prohibit a lot of compiler optimizations and impact performance considerably.
I agree with everyone who said "yes". &
means "target memory will not change". When using mmap
, it is your responsibility to guarantee that. Converting to &str
is but one example -- the compiler is also free to optimize code like let x = slice[0]; let y = slice[0];
into let x = slice[0]; let y = x;
. So, no matter whether you create an &str
, if you turn an mmap
'd file into a &[u8]
, mutating that file causes the program to have UB.
I have no idea how to make use of the fact that the mutation happens in a separate process. TBH I do not see an argument for why it makes this problem any less serious. mmap
is entirely uncharted territory as far as formal models go (the ones I have seen, anyway), but once two processes mmap
the same file, isn't that the same as two threads in a single process sharing some memory? And at that point we know very well that mutating memory where another thread holds a &[u8]
is UB.
So, the least we need to do is tell the compiler by adding some type that internally employs UnsafeCell
. For example, we could use &[Cell<u8>]
. Now at least modification in general is not UB any more.
However, the optimization I mentioned above is still valid in this case. Worse, the inverse optimization is still valid as well! You might be reading the u8
or Cell<u8>
once, but the compiler is free to turn that into two reads and assume they yield the same result! That is because the compiler can assume that you have no data races.
So maybe &[AtomicU8]
works, then? Maybe. As @adamreichold mentioned, this is territory where volatile
enters the picture, and it is pretty unclear to me whether non-volatile atomic accesses are "good enough" here. Notice that atomic does not "imply" volatile. For example, even with atomic variables, the compiler may turn let x = var.load(ordering); let y = var.load(ordering);
into let x = var.load(ordering); let y = x;
(but not vice versa, that is a key difference to non-atomic accesses). With volatile, it is not allowed to do this.
That said, I cannot think of any reason to actually use volatile
here, or even think of any case where adding volatile
removes UB. Adding volatile
helps control the exact accesses that hit the memory subsystem, but I do not think it is ever needed from a UB perspective. So, to the best of my knowledge, &[AtomicU8]
(and making every single access Relaxed
) is okay. It is also rather horrible to use, I assume.
I was thinking about this as well, but as you continue this just seems to point at the fact that &[u8]
is not a useful type for memory backed by a mapped file.
I'd say it would enable a sort of type checking, i.e. if &volatile [u8]
would exist, it could not be used to e.g. back a str
instance and would have to be copied into some &[u8]
accessible memory before validation of UTF-8 would make sense.
But I also agree that using volatile
pointers does not magically make e.g. a C parser safe from external modification issues, but if it expects "normal" pointers you either have to cast the qualifier away or copy the memory contents.
Maybe put differently, if memchr
had the signature volatile void* memchr(volatile void*,int,size_t)
I would expect it be safe to use on memory-mapped files whereas void* memchr(void*,int,size_t)
gives me no such guarantees.
FWIW, I have a similar issue where a Rust lib was written that exposes a Rust interface to an FFI backed by a C IPC lib whereby the messages the C lib delivers aren't "stable" - the underlying buffer can be mutated. The intended usage protocol is you read out all the data from the buffer into locals (where the locals are all Copy
types in Rust terms), and then call an API that indicates whether you raced with an update; if you raced, you're supposed to discard (and not look at) the values read into the locals and repeat the reads. So it's sort of like a CAS loop where you keep going until you get a stable set of reads.
While thinking about the proper Rust API to expose, we punted on exposing a &[u8]
to the callers precisely for the reason that it's a lie as far as rustc is concerned. Still need to come back to this topic (left just the FFI bindings for now). This is essentially the same thing as the mmap problem discussed here.
Both &[Cell<u8>]
(or, equivalently, &Cell<[u8]>
) and &[AtomicU8]
also serve that role.
@vitalyd So "who" may perform those mutations? Would they happen from other threads?
Yes, I do not disagree with that, I just try to say that C's CV qualifiers do play a similar role in that language. And basically have the same downside there, as the intended benefit of treating file contents just like any other memory region is lost due to no useful library function being equipped to handle volatile pointers.
Yeah, it'd be another process actually. It's essentially the same mmap case (the underlying IPC is formed off mmap'd files), just in a different usage scenario from ripgrep (which is, presumably, mostly reading/mmap'ing unchanging files). This is all subtle in C as well, it's just that Rust adds the extra UB concerns around (erroneously) exposing this type of memory as slices.
I merely wanted to throw this out here since I recently had to think about this same thing (and haven't yet come to a good conclusion on how to best model this in Rust).
Is the issue really MMAP instead of Shared Memory? If I have a file that is mmapped with an exclusive lock, then I could safely mutate it. If I have a file mmapped with a share-lock (meaning it cannot be modified elsewhere) then I can safely read it. If I have mmapped file that is not locked in any fashion, then treating it as a slice or array is undefined behavior, no?
EDIT: So, I think the following would make sense:
- An API may return an exclusively-locked block of shared memory (MMAP or otherwise) as a mutable reference
- An API may return a share-locked block of shared memory as only a non-mutable/shared reference
- An API may return a non-locked block of shared memory only as a raw pointer (de-referencing the pointer is unsafe)
IIRC only windows allows to reliably lock files, on Linux lock is just an advisory.
I think a good approach to consider will be to move unsafe
from mapping creation to acquiring &[u8]
from it. In other words Mmap
will implement Read
(with volatile intrinsics), but will not deref to [u8]
. You will have to use explicit unsafe method for it. Sometimes it's convenient to pass slices to mapped data around, to help with it we could introduce opaque MmapSlice<'a>
type, which will behave like a slice (i.e. it will support indexing and slicing), but will not expose &[u8]
directly, but only via Read
trait and unsafe method.
Quite often when you process data you will need to copy it either way sooner or later. With the described approach a good chunk of applications will be able to use completely safe Rust while working with memory mapped files.