How unsafe is mmap?

But I really want mmap to yield a "real" &[u8] without any gotchas so I can actually use that memory like any other memory, almost as much in C as in Rust.

Hence I think I would rather have the semantics of mmap be extended, e.g. adding a MAP_SNAPSHOT flag as a sort of converse to MAP_PRIVATE so that I get a page with a copy of the original file content as soon as someone else modifies the underlying file and hence can safely work with a void* or &[u8] without caring about what other processes do. (I wonder what the performance of such a flag implemented in e.g. the Linux kernel would be and whether this creates a class of local denial of service attacks.)

4 Likes

On Linux, I believe that if the set-group-ID bit is set on a file while its group-execute bit is cleared (and the filesystem is mounted with the mand option), Linux enforces "mandatory locks". So you could have an mmap API that returns references/slices once it has checked that the underlying file is marked this way and that the appropriate lock has been taken. Any UB at that point would be the result of a malicious or misbehaving other application on the system that has the permissions needed to muck with things. That is not something Rust can ever protect against, so it would be out of scope as far as UB is concerned. Is that not the case?

See: https://www.in-ulm.de/~mascheck/various/permissions/mandatory.txt.html

1 Like

From the manual page of fcntl:

Mandatory locking
       Warning: the Linux implementation of mandatory locking is unreliable. See BUGS below. Because of these bugs, and the fact that the feature is believed to be little used, since Linux 4.5, mandatory locking has been made an optional feature, governed by a configuration option (CONFIG_MANDATORY_FILE_LOCKING). This is an initial step toward removing this feature completely.
2 Likes

Question for you: are advisory locks sufficient? If they aren't, then to my mind nothing is, because a malicious or misbehaving process can always screw up my memory if it has the right permissions, and any UB from that wouldn't be the fault or purview of Rust to correct. If a process has permission to futz with a file that I'm futzing with, then we are in a sort of mutual trust relationship in which we both agree to "play by the rules", which in this case means using advisory locks appropriately. Would it not, in that case, be OK for a Rust mmap implementation to treat advisory locks as if they were mandatory? Any UB would not be the result of Rust.

1 Like

But it's not "real", as it does not have the aliasing guarantees that &[u8] must uphold. If we get reliable cross-platform locks (I guess the target degree of reliability is up for discussion), then it can be made "real". You can safely get a *const u8, the thing you actually work with in C, but again you will need unsafe to use it. Having a "fake" &[u8] may break stuff (e.g. &[u8] to &str conversion, compiler mis-optimizations, etc.).

Not sure. Maybe? To me personally it looks a bit less iron-clad than I would've preferred. But I'll leave judgment to people more familiar with the topic.

1 Like

I think this is more or less the status quo, i.e. creating a memory mapping in Rust is unsafe and you have to ensure that nobody modifies the file by external means outside of Rust's type system. If you have e.g. a tightly controlled environment where you just know that this won't happen then this might be enough.

But it does make a lot of use cases unsafe that would be really nice to have, e.g. grepping random files should be both fast and safe without any extra precautions. Of course, the kernel or root can always screw up my memory and any language-implied memory safety is gone, but with the status quo even an unprivileged user could.

1 Like

To my mind, not really. It would be a user with sufficient privileges to muck with the files I'm looking at.

That's a fundamental limitation though, that other code was compiled assuming no volatile-style accesses are needed.

I do not know the details of that API but agree with the general sentiment: another process can also open /proc/$PID/mem and mess with any data you have, and we consider that their fault and not yours. :)

The question is where to put the limit here. AFAIK most programs will ignore advisory locks? Basically if it doesn't take an admin to break, and especially if it can quickly happen accidentally, then considering that "misbehavior" is putting the burden unfairly in the wrong place.

2 Likes

Hmm. I'd like to say this is the territory of atomics, not volatile. After all, it's not a memory-mapped hardware device or something; the only ways the memory can change are:

  • When another CPU writes through a different mapping to the same physical address (likely, though not necessarily, in a different process); or
  • If the kernel pages it out, then back in with different contents.

From the current process's perspective, the second case should be strictly 'less weird' than the first; changing mappings usually involves a strong barrier, so you should get sequential consistency and all that.

Regarding the first, the only issue would be if atomics don't operate correctly when multiple clients are accessing the same physical memory at different virtual addresses. According to this thread:

…they do in practice, while the C++ standard encourages but does not require it.

That does have some weird consequences when handling multiple mappings in the same process. For example, the compiler would not be allowed to assume that two pointers do not alias even if it knows that they are aligned and non-equal (e.g. because the user tested whether they were equal and control flow only reaches this point if they're not) – at least if all accesses to those pointers are using atomics.

1 Like

I’ve seen mixed messages around atomics and volatile. For example, N4455 "No Sane Compiler Would Optimize Atomics" suggests that they’re different and therefore shouldn’t be equated, yet it also says that the standard mandates address freedom, and then says that this part is non-normative. One would think this type of thing wouldn’t have so much ambiguity :)

It's probably partly the reason things like Spectre and Meltdown are a thing. So much ambiguous definition of how memory should behave and over-optimization (at a different level, but, similar to my mind).

In my primary use case (the body-image crate) I support, as an optional feature, creating a read-only mmap of a temporary, unlinked file, which should prevent any other process from interacting with the same file, at least on Linux. My current belief, backed by testing, is that this is sufficient to not require propagating the unsafe in my interface. However, I'm also now exposing an alternate constructor that takes an arbitrary File, which could be used to mmap a linked file or a file opened read-write. Perhaps I should mark that alternate constructor unsafe and/or make sure to document all the potential pitfalls?
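As a rough sketch of the "temporary, unlinked file" idea using only std (this is not the body-image crate's actual code, and `unlinked_temp_file` and `read_all` are names made up for illustration): the file is created, filled, and then unlinked while the handle stays open. It assumes Unix semantics, where the data remains reachable through the open handle. There is still a short window between creation and unlinking, so this narrows the exposure rather than proving that nobody else can touch the file.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{self, Read, Seek, SeekFrom, Write};
use std::path::Path;

/// Hypothetical helper: creates a file at `path`, writes `contents`, then
/// unlinks the path while keeping the handle open. On Unix the data stays
/// reachable through the handle, but the name is gone from the directory.
/// (On Windows, removing a file that is still open may fail.)
pub fn unlinked_temp_file(path: &Path, contents: &[u8]) -> io::Result<File> {
    let mut file = OpenOptions::new()
        .read(true)
        .write(true)
        .create_new(true) // fail rather than reuse an existing file
        .open(path)?;
    file.write_all(contents)?;
    fs::remove_file(path)?; // unlink: the name disappears, the open file lives on
    file.seek(SeekFrom::Start(0))?; // rewind so later reads start at the beginning
    Ok(file)
}

/// Hypothetical helper: reads everything still reachable through the handle.
pub fn read_all(file: &mut File) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    file.read_to_end(&mut buf)?;
    Ok(buf)
}
```

A handle obtained this way is the kind of thing one might feel comfortable mapping behind a safe interface, whereas a constructor taking an arbitrary File cannot know how that file was opened.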

To some extent, these concerns seem incongruous for a systems language and potential C replacement like Rust. Is there not any relational database or other obvious mmap use case that can be built with LLVM/Clang on Linux?

At RustConf there were several talks on embedded systems that expose peripherals via known memory addresses with R/W interfaces. Do all of these applications also lose out on Rust's safety guarantees?

Also, I really didn't understand the suggestions above on the use of &[Cell<u8>], but I feel it is likely important. Is there any further reading on that?

2 Likes

I agree that the current semantics of mmap do not allow one to safely use &[u8] without external guarantees. But I also think that anything less than &[u8] is almost bound to be less useful than using std::io::Read directly if one is parsing arbitrary byte streams, i.e. no parser will work efficiently when applied directly to &[AtomicU8]. It is also not clear to me whether a successful parse at some point in time means anything when the underlying memory is allowed to change.
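To make the cost concrete, here is a minimal sketch of what working with &[AtomicU8] tends to look like: since ordinary parsers want &[u8] or &str, the bytes get copied out one atomic load at a time before parsing. The function names `copy_out` and `parse_utf8` are made up for this illustration, and the shared region is simulated with a plain Vec rather than a real mapping.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

/// Copies a shared, possibly changing byte region into an owned buffer.
/// Each byte is loaded atomically, so there is no data race, but the copy as
/// a whole is not a consistent snapshot: writers may interleave with it.
pub fn copy_out(shared: &[AtomicU8]) -> Vec<u8> {
    shared.iter().map(|b| b.load(Ordering::Relaxed)).collect()
}

/// Validates the private copy as UTF-8. The result describes the copy only;
/// it says nothing about what `shared` contains a moment later.
pub fn parse_utf8(shared: &[AtomicU8]) -> Option<String> {
    String::from_utf8(copy_out(shared)).ok()
}
```

At that point the parser is effectively running on a buffer that was read out of the mapping, which is roughly what std::io::Read would have given in the first place.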

Which is why I would prefer something like adding MAP_SNAPSHOT so that using &[u8] for memory-mapped IO actually is correct due to the kernel ensuring the necessary stability at least from the POV of my processes' address space.

I think the current status quo prevents a lot of important use cases from being safe, e.g. I cannot grep a project directory in which a build is currently running, since the build might overwrite a file while it is being grepped via a &[u8] based on a memory mapping. The problematic user is myself in this particular case, but it would also be rather impractical to force me to avoid these concurrent modifications.

Agreed, volatile is unnecessarily strong since even changes via (direct) I/O always reach the mappings via the page cache eventually. (Thinking of applications implemented in C++ and using shared memory, we usually stuff std::atomic-wrapped primitive types in there for communicating between processes relying on the properties mentioned in the linked SO question.)

I think this has to be accessed via std::ptr::{read_volatile, write_volatile} and any safe wrappers need to be built on top of that. But it is also a rather different use case, since one usually writes structured data to these locations to e.g. configure the peripherals instead of e.g. parsing arbitrary byte streams, so that building these safe abstractions is usually straightforward, if tedious. (If these peripherals do process byte streams, these are usually clocked in via a well-known memory-mapped register or transferred via DMA.)
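A minimal sketch of such a wrapper, assuming a single 32-bit register (the `Register` type and its methods are hypothetical and not taken from any particular embedded crate): the unsafe part is concentrated in the constructor, where the caller vouches for the address, and the accessors built on read_volatile and write_volatile are safe.

```rust
use std::ptr;

/// A hypothetical 32-bit memory-mapped register.
pub struct Register {
    addr: *mut u32,
}

impl Register {
    /// # Safety
    /// `addr` must be valid for reads and writes, properly aligned, and must
    /// stay that way for as long as the returned `Register` is in use.
    pub unsafe fn new(addr: *mut u32) -> Self {
        Register { addr }
    }

    /// Volatile read: the compiler may not elide or merge this access.
    pub fn read(&self) -> u32 {
        // SAFETY: the caller of `new` promised that `addr` is valid.
        unsafe { ptr::read_volatile(self.addr) }
    }

    /// Volatile write: the compiler may not elide or merge this access.
    pub fn write(&mut self, value: u32) {
        // SAFETY: the caller of `new` promised that `addr` is valid.
        unsafe { ptr::write_volatile(self.addr, value) }
    }

    /// Read-modify-write that sets the bits in `mask`.
    pub fn set_bits(&mut self, mask: u32) {
        let current = self.read();
        self.write(current | mask);
    }
}
```

On real hardware the address would come from the device's memory map; for experimenting on a desktop, a pointer to a local u32 can stand in for it.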

I tend to agree.

Oh wow, yes. I had not thought of these consequences of virtual memory before, but I think you are right.

Probably you want to define "non-aliasing" as "memory operations are independent", then it has a priori nothing to do with address equality.

Oh, they are certainly different, see here for example. But it is unclear whether one of them is strictly stronger than the other. I do not think that is the case.

Well, the reason C does not have these problems is that it does not care about safety (as in, providing a safe-to-use abstraction). ;) Rust is doing strictly more here than any other language has done before.

I am not sure if it helps, but you could have a look at the documentation of UnsafeCell. Basically, the difference between &[u8] and &[Cell<u8>] is that the former says "the memory this slice is stored in will not change", whereas the latter makes no such promise. &T is very different from C's *const T: The latter just says that memory will not be mutated through this pointer (but it may still be mutated through other pointers e.g. when we call some function), whereas &T says that the memory will not be mutated at all.


UnsafeCell is a way to opt out of that strong form of immutability, and that's why it is important in the context of mmap.
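A small std-only sketch of the difference, with function names invented for this example. `sum_frozen` takes &[u8], so the compiler may assume the bytes stay put for the duration of the borrow. `sum_shared` takes &[Cell<u8>], where every `get` is a fresh read because another shared reference may have called `set` in between. Note that Cell is single-threaded only, so this illustrates the type-level promise rather than being a recipe for a real mapping.

```rust
use std::cell::Cell;

/// Sums a slice that is guaranteed not to change while it is borrowed.
pub fn sum_frozen(bytes: &[u8]) -> u32 {
    bytes.iter().map(|&b| u32::from(b)).sum()
}

/// Sums a slice whose elements may change through other shared references;
/// no immutability is promised, so each `get` really reads the byte.
pub fn sum_shared(bytes: &[Cell<u8>]) -> u32 {
    bytes.iter().map(|b| u32::from(b.get())).sum()
}

/// Views an exclusively borrowed buffer as a slice of cells that can be
/// mutated through `&`. Built on `Cell::from_mut` and `as_slice_of_cells`.
pub fn as_cells(buf: &mut [u8]) -> &[Cell<u8>] {
    Cell::from_mut(buf).as_slice_of_cells()
}
```

With `as_cells`, two copies of the same &[Cell<u8>] can coexist and one can `set` a byte that the other then observes, which is exactly what &[u8] rules out.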

Did you mean to say "do not allow"?

Yes that would be very useful indeed. It doesn't even have to be a consistent snapshot taken at the time the file is opened, but the kernel would have to guarantee that once a page is actually loaded and observed by the process, it will not change again.

volatile is also too weak because it does not guard against the problems arising when another thread in the same process mmap's the same file (and hence performs concurrent atomic accesses to the same locations).

1 Like

Yes, sorry for the mix-up, fixed the post.

I'm pretty sure MySQL and friends just tell you not to mess with their files while the database is running, and they enforce it by running the RDBMS as its own user. Concurrent modification of the database backing files by separate processes is a misconfiguration, and is allowed to cause UB.

Too bad that's not really a general solution.

3 Likes

Yeah, they’re different. Whether one is stronger than the other is hard to answer because we’d need to define what “stronger” means.

volatile has no ordering effects but does prevent the compiler from eliminating accesses to the location. Atomics can provide ordering, depending on the specified memory order, but from what I gather their accesses may be coalesced (or some otherwise eliminated). This latter part implies you can’t substitute atomics in scenarios where you want volatile access, and thus you may need volatile atomics.
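For the ordering half of that comparison, a minimal sketch of release/acquire message passing between two threads (the function name `publish_and_read` is made up for this example). The Release store on `ready` paired with the Acquire load is what makes the earlier Relaxed store to `data` visible to the reader. A volatile access would provide no such ordering.

```rust
use std::sync::atomic::{AtomicBool, AtomicU8, Ordering};
use std::sync::Arc;
use std::thread;

/// Publishes `value` from a writer thread and reads it back on the caller's
/// thread, using a release/acquire pair on a flag for synchronization.
pub fn publish_and_read(value: u8) -> u8 {
    let data = Arc::new(AtomicU8::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let writer = {
        let (data, ready) = (Arc::clone(&data), Arc::clone(&ready));
        thread::spawn(move || {
            data.store(value, Ordering::Relaxed);
            // Release: the store above becomes visible to any thread that
            // observes `ready == true` through an Acquire load.
            ready.store(true, Ordering::Release);
        })
    };

    // Acquire: pairs with the Release store in the writer thread.
    while !ready.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    let seen = data.load(Ordering::Relaxed);
    writer.join().unwrap();
    seen
}
```

Whether the same reasoning carries over unchanged to two processes sharing a mapping is the open question discussed above; this sketch only covers threads within one process.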

As you pointed out above, atomic and volatile access seem to be somewhat orthogonal and volatile access seems to be unrelated since everything happens via the page cache, i.e. processes actually share access to pages in memory instead of the "file contents".

However, atomics do not seem to capture the full problem either, since this is not just about data races. Think of a single-core system where a process maps a file and turns the &[u8] into a &str. The same process then uses write to change that particular part of the file while the reference stays alive, and finally uses the &str, which no longer contains valid UTF-8 data.
I do not think any concurrency issues enter here and there is actually only a single user-space mapping of that page of memory involved, but the problem is still visible and even accessing the memory via &[AtomicU8] would not change things (except for not being able to back &str in the first place) as far as I understand.

That was expressed wrongly: AtomicU8, having interior mutability, does prevent one from building a verified-during-construction data structure like str. But so does Cell<u8>. What I mean is that atomic memory access does not enter the picture.

The usual definition for relative strength of memory accesses that I was assuming here is: A is stronger than B if, whenever B is used, it is correct to replace it by A. "Correct" here means that the transformation does not introduce new program behaviors (the same kind of correctness that compilers are subject to).

This is not about volatile nor atomic, but about Rust's & type. Just replace the write call by some unsafe code which writes to the slice. This is what UnsafeCell solves.