Linux x86_64: mmap: correct way to enlarge file, overwrite page

Context: write ahead logs, append-only btrees, database recovery

On x86_64 Linux with mmap, can anyone share 'correct' code (along with guarantees during OS crashes) for:

  1. enlarging a mmap-ed file

  2. overwriting an existing page?

'correct' here is defined as being able to make non-trivial, useful statements about which pages, after an OS crash, are:

  • guaranteed to be uncorrupted
  • may have corruptions


I don’t have any code of the kind you requested to share. But I suspect that no one else will be able to either, unless you are more specific about what kind of crashes you care about.

Unless there is some mechanism to prevent this that I am unaware of, I would think that a root user could arbitrarily corrupt the OS code on disk, so that on the next boot some arbitrary condition could corrupt arbitrary memory arbitrarily, with or without a crash.

Presumably, cases like that are uninteresting. But which interesting cases are left after those are removed?

I am also unaware of any mechanism that limits the scope of memory corruption caused by bugs in Linux itself. But I guess there might be ones I don't know about?

Assuming we leave out Linux bugs, and assume that you are running non-corrupted OS code, I am not sure what remaining sources of OS crashes there are.

Theoretically, even things like OOM conditions, for example, would be recoverable because a random high memory process would be killed. Is that the kind of thing you were referring to when you said “OS crashes”?

Valid criticism. Let me try to clarify:

I am operating under the assumption that: files not opened stay safe; files opened in read-only mode stay safe; and the only thing that may get corrupted are pages of memory that we open in rw-mode and make a modification to (anywhere on the page).

I am assuming that the kernel is bug-free (unrealistic, but this is part of the model.)

Not worried about OOM. I am only concerned about modified pages of mmapped memory: when they are written out, a page may be partially written, corrupted during the write, and so on.


This doesn't really relate to Rust; all the answers are language-independent.

The fundamental problem is that it's extremely hard to make any correctness claims when writing via mmap. The basic problems are:

  • Writes can happen early - the kernel can write out your pages at any time, including mid-modification, so you need to be prepared to have partial updates in your durable storage.
  • Writes may never happen - the kernel could defer the write indefinitely if there's no memory pressure or other cause to perform the write.
  • Writes can happen in any order.
  • The kernel might decide to rewrite unmodified pages - many filesystems have a preferred write size larger than a page (eg, a COW filesystem like btrfs), so if there's a clean page between two dirty ones it may write them all out together.
  • There's no explicit ordering between filesize changes and writes - you could extend the file and write some pages, but the kernel may write the pages to the filesystem before making the file size change durable, so if there's a crash the file will not contain that written data. Or you could end up with a zero-size file on some filesystems with some configs.
  • In general, intermixing mmap IO with file operations (eg read/write) may not be coherent (though I think that's less of a problem on modern kernels)

The precise semantics of a crash in the middle of a write operation are very filesystem dependent. You could get the effect of your operations with arbitrary reorderings, data and metadata updates could be unsynchronized, data updates can get tearing (new data intermixed with either old data, zero data or just plain garbage).

You may be able to mitigate this with tactical applications of msync and fsync, but getting it correct can be tricky. And you end up making a lot of expensive synchronous syscalls which can really tank performance.

Overall, I'd really recommend against mmap for IO which has strong durability requirements. And performance can also be elusive - you'll end up with a pagefault per page which can be a lot more overhead than a more explicit write if your IO is larger than a page. (There's a pile of more subtle performance costs like IPI from TLB shootdown when doing munmap or mprotect.)

Doing writes with something like iouring gives many more opportunities for both correctness and performance - you can insert async barrier/sync operations in the writes so you have much better control over the durability semantics (though you'll still be at the mercy of the specific filesystem semantics for the details).

