How unsafe is mmap?

But as established by the example of using str, just reading could definitely be enough for UB. I think I can generally not expect that an external library is not allowed to try to turn the byte slice into a string and then use that string to e.g. parse it into an integer.

Actually, lazy as I am I even did that myself

named!(pub parse_u32<&[u8], u32>,
    map_res!(
        digit1,
        |digits| unsafe { from_utf8_unchecked(digits).parse() }
    )
);

thinking the unsafe is justified since the digit1 parser only accepts ASCII digits...

I think for correctness, memory mapped IO would be done through volatile pointers which have exactly the required semantics of telling the compiler that the memory pointed to can be changed without it making that change. But I also suspect this would prohibit a lot of compiler optimizations and impact performance considerably.

1 Like

I agree with everyone who said "yes". :slight_smile: & means "target memory will not change". When using mmap, it is your responsibility to guarantee that. Converting to &str is but one example -- the compiler is also free to optimize code like let x = slice[0]; let y = slice[0]; into let x = slice[0]; let y = x;. So, no matter whether you create an &str, if you turn an mmap'd file into a &[u8], mutating that file causes the program to have UB.

I have no idea how to make use of the fact that the mutation happens in a separate process. TBH I do not see an argument for why it makes this problem any less serious. mmap is entirely uncharted territory as far as formal models go (the ones I have seen, anyway), but once two processes mmap the same file, isn't that the same as two threads in a single process sharing some memory? And at that point we know very well that mutating memory where another thread holds a &[u8] is UB.


So, the least we need to do is tell the compiler by adding some type that internally employs UnsafeCell. For example, we could use &[Cell<u8>]. Now at least modification in general is not UB any more.

However, the optimization I mentioned above is still valid in this case. Worse, the inverse optimization is still valid as well! You might be reading the u8 or Cell<u8> once, but the compiler is free to turn that into two reads and assume they yield the same result! That is because the compiler can assume that you have no data races.

So maybe &[AtomicU8] works, then? Maybe. As @adamreichold mentioned, this is territory where volatile enters the picture, and it is pretty unclear to me whether non-volatile atomic accesses are "good enough" here. Notice that atomic does not "imply" volatile. For example, even with atomic variables, the compiler may turn let x = var.load(ordering); let y = var.load(ordering); into let x = var.load(ordering); let y = x; (but not vice versa, that is a key difference to non-atomic accesses). With volatile, it is not allowed to do this.

That said, I cannot think of any reason to actually use volatile here, or even think of any case where adding volatile removes UB. Adding volatile helps control the exact accesses that hit the memory subsystem, but I do not think it is ever needed from a UB perspective. So, to the best of my knowledge, &[AtomicU8] (and making every single access Relaxed) is okay. It is also rather horrible to use, I assume.

6 Likes

I was thinking about this as well, but as you continue this just seems to point at the fact that &[u8] is not a useful type for memory backed by a mapped file.

I'd say it would enable a sort of type checking, i.e. if &volatile [u8] would exist, it could not be used to e.g. back a str instance and would have to be copied into some &[u8] accessible memory before validation of UTF-8 would make sense.

But I also agree that using volatile pointers does not magically make e.g. a C parser safe from external modification issues, but if it expects "normal" pointers you either have to cast the qualifier away or copy the memory contents.

Maybe put differently, if memchr had the signature volatile void* memchr(volatile void*,int,size_t) I would expect it be safe to use on memory-mapped files whereas void* memchr(void*,int,size_t) gives me no such guarantees.

FWIW, I have a similar issue where a Rust lib was written that exposes a Rust interface to an FFI backed by a C IPC lib whereby the messages the C lib delivers aren't "stable" - the underlying buffer can be mutated. The intended usage protocol is you read out all the data from the buffer into locals (where the locals are all Copy types in Rust terms), and then call an API that indicates whether you raced with an update; if you raced, you're supposed to discard (and not look at) the values read into the locals and repeat the reads. So it's sort of like a CAS loop where you keep going until you get a stable set of reads.

While thinking about the proper Rust API to expose, we punted on exposing a &[u8] to the callers precisely for the reason that it's a lie as far as rustc is concerned. Still need to come back to this topic (left just the FFI bindings for now). This is essentially the same thing as the mmap problem discussed here.

3 Likes

Both &[Cell<u8>] (or, equivalently, &Cell<[u8]>) and &[AtomicU8] also serve that role.

@vitalyd So "who" may perform those mutations? Would they happen from other threads?

1 Like

Yes, I do not disagree with that, I just try to say that C's CV qualifiers do play a similar role in that language. And basically have the same downside there, as the intended benefit of treating file contents just like any other memory region is lost due to no useful library function being equipped to handle volatile pointers.

Yeah, it'd be another process actually. It's essentially the same mmap case (the underlying IPC is formed off mmap'd files), just in a different usage scenario from ripgrep (which is, presumably, mostly reading/mmap'ing unchanging files). This is all subtle in C as well, it's just that Rust adds the extra UB concerns around (erroneously) exposing this type of memory as slices.

I merely wanted to throw this out here since I recently had to think about this same thing (and haven't yet come to a good conclusion on how to best model this in Rust).

Is the issue really MMAP instead of Shared Memory? If I have a file that is mmapped with an exclusive lock, then I could safely mutate it. If I have a file mmapped with a share-lock (meaning it cannot be modified elsewhere) then I can safely read it. If I have mmapped file that is not locked in any fashion, then treating it as a slice or array is undefined behavior, no?

EDIT: So, I think the following would make sense:

  • An API may return an exclusively-locked block of shared memory (MMAP or otherwise) as a mutable reference
  • An API may return a share-locked block of shared memory as only a non-mutable/shared reference
  • An API may return a non-locked block of shared memory only as a raw pointer (de-referencing the pointer is unsafe)
4 Likes

IIRC only windows allows to reliably lock files, on Linux lock is just an advisory.

I think a good approach to consider will be to move unsafe from mapping creation to acquiring &[u8] from it. In other words Mmap will implement Read (with volatile intrinsics), but will not deref to [u8]. You will have to use explicit unsafe method for it. Sometimes it's convenient to pass slices to mapped data around, to help with it we could introduce opaque MmapSlice<'a> type, which will behave like a slice (i.e. it will support indexing and slicing), but will not expose &[u8] directly, but only via Read trait and unsafe method.

Quite often when you process data you will need to copy it either way sooner or later. With the described approach a good chunk of applications will be able to use completely safe Rust while working with memory mapped files.

5 Likes

But I really want mmap to yield a "real" &[u8] without any gotchas so I can actually use that memory like any other memory, almost as much in C as in Rust.

Hence I think I would rather have the semantics of mmap be extended, e.g. adding a MAP_SNAPSHOT flag as a sort of converse to MAP_PRIVATE so that I get a page with a copy of the original file content as soon as someone else modifies the underlying file and hence can safely work with a void* or &[u8] without caring about what other processes do. (I wonder what the performance of such a flag implemented in e.g. the Linux kernel would be and whether this creates a class of local denial of service attacks.)

3 Likes

On Linux, I believe if the "sticky" bit is set on the file, then, Linux enforces "Mandatory Locks". So, you could have an MMAP API that returned reference/slices when it has checked that the underlying file has the sticky-bit checked and that the appropriate lock has been taken. Any UB at that point would be the result of a malicious/misbehaving other application on the system that has permissions need to muck with things. At that point, that is not something Rust can ever protect against and would be out-of-scope as far as UB is concerned. Is that not the case?

See: https://www.in-ulm.de/~mascheck/various/permissions/mandatory.txt.html

1 Like

From the manual page of fcntl:

Mandatory locking
       Warning:  the  Linux implementation of mandatory locking is unreliable.  See BUGS below.  Because of these bugs, and the fact that the feature is believed to be little used, since Linux 4.5, mandatory locking has been made an optional feature, governed by a
       configuration option (CONFIG_MANDATORY_FILE_LOCKING).  This is an initial step toward removing this feature completely.
2 Likes

Question for you: Are advisory locks sufficient? If they aren't, then nothing is to my mind because a malicious/misbehaving other application process can always screw up my memory if it has the right permissions and any UB from that wouldn't be the fault or purview of Rust to correct. If a process/application has permission to futz with a file that I'm futzing with, then we are in a sort of mutual trust relationship whereby we both agree to "Play by the Rules". In this case, use of "Advisory Locks" appropriately. Would it not, in that case, be OK for Rust MMAP implementation to treat Advisory Locks as if they were mandatory? Any UB would not be the result of Rust.

1 Like

But it's not "real", as it does not have the same aliasing guarantees which must be upheld by &[u8]. If we'll get reliable cross-platform locks (I guess target degree of reliability is up for discussion), then it can be made "real". You can safely get *const u8, the thing you actually work with in C, but again you will need unsafe to use it. Having "fake" &[u8] may break stuff. (e.g. &[u8] to &str conversion, compiler mis-optimizations, etc.)

Not sure. Maybe? To me personally it looks a bit less iron-clad than I would've preferred. But I'll leave judgment to people more familiar with the topic.

1 Like

I think this is more or less the status quo, i.e. creating a memory mapping in Rust is unsafe and you have to ensure that nobody modifies the file by external means outside of Rust's type system. If you have e.g. a tightly controlled environment where you just know that this won't happen then this might be enough.

But it does make a lot of use cases unsafe that would be really nice to have, e.g. grepping random files should be both fast and safe without any extra precautions. Of course, the kernel or root can always screw up my memory and any language-implied memory safety is gone, but with the status quo even an unprivileged user could.

1 Like

To my mind, not really. It would be a user with sufficient privileges to muck with the files I'm looking at.

That's a fundamental limitation though, that other code was compiled assuming no volatile-style accesses are needed.

I do not know the details of that API but agree with the general sentiment -- another process can also open /proc/$PID/mem and mess with any data you have, and we consider that their fault and not yours. :slight_smile:

The question is where to put the limit here. AFAIK most programs will ignore advisory locks? Basically if it doesn't take an admin to break, and especially if it can quickly happen accidentally, then considering that "misbehavior" is putting the burden unfairly in the wrong place.

2 Likes

Hmm. I'd like to say this is the territory of atomics, not volatile. After all, it's not a memory-mapped hardware device or something; the only ways the memory can change are:

  • When another CPU writes through a different mapping to the same physical address (likely, though not necessarily, in a different process); or
  • If the kernel pages it out, then back in with different contents.

From the current process's perspective, the second case should be strictly 'less weird' than the first; changing mappings usually involves a strong barrier, so you should get sequential consistency and all that.

Regarding the first, the only issue would be if atomics don't operate correctly when multiple clients are accessing the same physical memory at different virtual addresses. According to this thread:

https://stackoverflow.com/questions/51463312/are-lock-free-atomics-address-free-in-practice

…they do in practice, while the C++ standard encourages but does not require it.

That does have some weird consequences when handling multiple mappings in the same process. For example, the compiler would not be allowed to assume that two pointers do not alias even if it knows that they are aligned and non-equal (e.g. because the user tested whether they were equal and control flow only reaches this point if they're not) – at least if all accesses to those pointers are using atomics.

1 Like