Towards a more perfect RustIO

In a single-threaded application, mmap is not worse than a synchronous file read, because you were going to block anyway and you have nothing better to do while you are being blocked (by definition of being single-threaded). So one gets all the benefits that were mentioned in the articles that you quoted, without significant drawbacks.

As soon as you start multitasking, however, be it by simultaneously processing multiple I/O requests or by overlapping I/O and compute tasks, the ability to reason about when your threads are going to block (and if possible avoid blocking altogether) is important, because it is what you can use to avoid the wasteful strategy of spawning one OS thread per concurrent task. This is where mmap becomes a problematic API, because it gives you thread blocking behaviours which are both unavoidable and unpredictable.

3 Likes

mmap has two big perf benefits over read()-based I/O:

  1. There’s no syscall overhead.
  2. There’s no duplicated memory between user and kernel - they both can share the same mapping.
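
For a concrete picture, here is a minimal sketch of that pattern, assuming the memmap2 crate as the mapping dependency (the unsafe block reflects the fact that another process can mutate the file under the mapping):

use std::fs::File;
use memmap2::Mmap;

fn sum_bytes(path: &str) -> std::io::Result<u64> {
    let file = File::open(path)?;
    // Safety caveat: another process truncating or mutating the file
    // while the map is alive is undefined behaviour.
    let map = unsafe { Mmap::map(&file)? };
    // No read() syscalls past this point: the kernel pages data in on
    // first touch, and user space shares the page cache's physical pages.
    Ok(map.iter().map(|&b| b as u64).sum())
}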

To avoid the type of thing that @HadrienG mentions, you can mmap files that are on tmpfs if the circumstances allow for it. Alternatively, you can prefault the mapping at mmap time, lock the pages, etc to avoid some of the latent accesses later on. There are still possible stalls, mostly on the write side. For example, if the background dirty mem ratio is above a threshold, Linux will block. There was also a “stable page writeback” issue at one point, which I don’t know offhand if it’s been changed. In this case, if a page is undergoing writeback, any writes to it are blocked until the writeback completes. This one is particularly nasty because it exhibits itself not because you’re dirtying pages at a rate exceeding kernel’s writeback pace, but rather unfortunate timing while not dirtying a lot of pages (can even be a single page).
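
As a hedged sketch of those mitigation tricks, assuming the memmap2 and libc crates (populate() corresponds to MAP_POPULATE on Linux):

use std::fs::File;
use memmap2::{Mmap, MmapOptions};

fn map_resident(path: &str) -> std::io::Result<Mmap> {
    let file = File::open(path)?;
    // Prefault the page tables at mmap time instead of on first access.
    let map = unsafe { MmapOptions::new().populate().map(&file)? };
    // Pin the pages in RAM so later reads cannot fault to disk.
    if unsafe { libc::mlock(map.as_ptr() as *const libc::c_void, map.len()) } != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(map)
}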

These types of things are why systems like seastar (and most databases, Postgres being the notable outlier) try to avoid the kernel as much as possible and manage things themselves.

4 Likes

So, would it be safe to say that mmap I/O has limited usefulness for achieving high performance? If the "ideal" Rust I/O library could optionally choose mmap when the right conditions are met (a single thread of computation, etc.), it might make sense to let it choose that method automagically. But the interface presented at a higher level shouldn't just be indexing into a slice, because that would force the higher-level API into unexpected/unpredictable blocking on each index operation. Rather, the higher-level API should look more like keeping one or more sliding windows of some automagically or manually chosen size, where the window's start point is moved with a "seek" tied to the underlying stream, mmap, buffer, or whatever is implemented at the lower level. That "seek" would return a Result<&[u8], E>, where the error can indicate "EAGAIN/not ready" or "unrecoverable I/O error". After a successful seek, you would know that you have a window-sized slice in actual (unpaged) memory to work with, and only when you needed the next window would you request it with another "seek". The underlying implementation could also do a certain amount of anticipatory/threaded read-ahead/read-behind of windows to make delays/blocking on seek unlikely in the sequential-access case.
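
A rough sketch of what that interface could look like, with every name invented here for illustration:

use std::io;

enum WindowError {
    NotReady,      // EAGAIN-style: the window is not resident yet
    Io(io::Error), // unrecoverable I/O error
}

trait WindowedSource {
    /// Slide the window so it starts at `offset`. On success, the
    /// returned slice is guaranteed to be in actual (unpaged) memory,
    /// so indexing into it never blocks. Implementations are free to do
    /// anticipatory read-ahead/read-behind in the background.
    fn seek(&mut self, offset: u64) -> Result<&[u8], WindowError>;

    /// Tune the window size manually (or leave it automagic).
    fn set_window_size(&mut self, size: usize);
}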

2 Likes

For anyone who wants a clear, empirical demonstration of this explanation, find a large directory and use ripgrep to search for something that doesn't exist. Compare the times it takes with and without memory maps.

e.g., On a checkout of the Linux kernel source code, no memory maps (-j controls the number of threads):

$ time rg zqzqzqzqzq --no-mmap -j1
real    0m0.665s
user    0m0.320s
sys     0m0.333s
$ time rg zqzqzqzqzq --no-mmap -j2
real    0m0.380s
user    0m0.375s
sys     0m0.358s
$ time rg zqzqzqzqzq --no-mmap -j3
real    0m0.266s
user    0m0.356s
sys     0m0.407s
$ time rg zqzqzqzqzq --no-mmap -j4
real    0m0.209s
user    0m0.369s
sys     0m0.414s
$ time rg zqzqzqzqzq --no-mmap -j8
real    0m0.129s
user    0m0.470s
sys     0m0.451s
$ time rg zqzqzqzqzq --no-mmap -j16
real    0m0.093s
user    0m0.624s
sys     0m0.608s
$ time rg zqzqzqzqzq --no-mmap -j32
real    0m0.150s
user    0m1.355s
sys     0m0.588s

with memory maps:

$ time rg zqzqzqzqzq --mmap -j1
real    0m0.731s
user    0m0.345s
sys     0m0.374s
$ time rg zqzqzqzqzq --mmap -j2
real    0m0.611s
user    0m0.477s
sys     0m0.640s
$ time rg zqzqzqzqzq --mmap -j3
real    0m0.692s
user    0m0.519s
sys     0m1.139s
$ time rg zqzqzqzqzq --mmap -j4
real    0m0.773s
user    0m0.676s
sys     0m1.449s
$ time rg zqzqzqzqzq --mmap -j8
real    0m0.921s
user    0m0.966s
sys     0m2.183s
$ time rg zqzqzqzqzq --mmap -j16
real    0m1.207s
user    0m1.234s
sys     0m2.738s
$ time rg zqzqzqzqzq --mmap -j32
real    0m1.279s
user    0m1.435s
sys     0m2.938s

Interestingly, the only reason ripgrep has memory map support at all is because it is faster. Errmm, faster in some cases:

$ ls -lh OpenSubtitles2016.raw.en
-rw-r--r-- 1 andrew users 9.3G Sep 10  2016 OpenSubtitles2016.raw.en

$ time rg zqzqzqzqzq OpenSubtitles2016.raw.en --no-mmap
real    0m2.029s
user    0m0.604s
sys     0m1.423s

$ time rg zqzqzqzqzq OpenSubtitles2016.raw.en --mmap
real    0m1.205s
user    0m0.826s
sys     0m0.376s

ripgrep has heuristics to decide when to use memory maps and when not to. They are far from perfect.

(Results above may differ based on your system's implementation of memory maps. :-))

11 Likes

I'm still not sure I understand the difference. The only thing you describe is that "page faults cause the thread to block." That is certainly true, but page faults occur regardless of multithreading when the system is under heavy memory pressure. The access patterns used also play a key role in how often page faults occur [1].

What @vitalyd goes on to mention about tmpfs is even worse [2], because it implies that each page fault could potentially result in touching swap space (writing dirty pages to persistent storage), versus mmap with read-only file backing, which doesn't have a "page out" process.

@BurntSushi provided a demonstration of how ripgrep's multi-threaded random access patterns perform worse under memory pressure with mmap than without. Understanding how (and when) your applications cause page faults is the reason tools like cachegrind exist.

However I'm not prepared to provide any empirical evidence, and I admit my sources could be completely wrong. At this point, I'm not adding anything productive to the conversation, so I'll just leave it at that.

[1] No fault to anyone. Just leaning on my earlier references, where Poul-Henning Kamp provides a detailed analysis of algorithmic access patterns vs memory pressure.

[2] Once again, from one of my earlier references, Howard Chu says about mmap in LMDB:

Since we’re using a file-backed mmap, none of our usage contributes to swap storms. Any clean page in the map can immediately be reused by the OS for any other purpose when memory pressure is high; no writeback of any kind is needed. And since LMDB’s default mode is to use a read-only mmap, by definition every page in the map is always clean. (If using a read/write mmap, then writes to the map create dirty pages that must eventually be written back to disk by the OS. When using anonymous memory, both clean and dirty pages must be written to swap if the OS wants to reuse the memory pages.)

It’s not worse; I failed to mention that swap should also be turned off. In fact, most servers I know of that want high reliability disable swap and overcommit.

mmap is a good option for mixed read/write workloads. If you’re doing sequential reads only, then it’s not necessarily better and can be worse, as @BurntSushi mentioned. Being able to share the mapping with the kernel, as mentioned, is a particularly nice benefit in use cases like a cache.

3 Likes

One other possible explanation is the overhead of creating and tearing down memory maps themselves. Namely, the Linux source tree contains a lot of files, but many of them are very small, so it is very possible that memory map overhead in multithreaded workloads could dominate search time.

A different kind of test might be a combination of the ones I showed, which consists of a large directory tree with large files. If memory maps win out there, then perhaps memory pressure itself isn't the culprit after all.

2 Likes

I think the important point is the difference between asking for an mmap of a large file, getting it, and then working with it as a slice as if it were all in memory (which it can never reliably be), versus working with a sliding-window slice that you can seek, where the underlying implementation may or may not be using mmap. The former will unpredictably block during index/dereference, whereas the latter will only block (or return the equivalent of EAGAIN) during a seek. Both options can, behind the scenes, try to keep likely-to-be-accessed parts of the file in memory (look-ahead/look-behind/frequently accessed windows/etc.). This argues for not exposing things simply as an mmapped slice, but possibly using an mmapped slice behind the scenes when it makes sense to do so. The part to work out, then, is what information the implementation needs from the environment, the OS, and the consumer of the API to base those decisions upon (i.e. whether or not to use an mmap implementation behind the scenes).

I guess I look at this as kind of like SQL query planning, where the planner takes into account the query, statistics of the underlying tuples and indexes, and cost factors for various heap, index, etc. look-up operations, to work towards choosing the best access pattern.

1 Like

For anyone interested in referencing the System.IO.Pipelines work by the dotnetfx core team, this is interesting: https://github.com/Drawaes/Leto

It's an implementation of the Pipe I/O and buffer traits for TLS cryptography. One interesting concept is that by having pluggable buffering traits, one can substitute a secure "crypto buffer" which uses VirtualAlloc() or the equivalent to prevent swapping sensitive data to disk, zeroing out buffers before they are returned to the heap, etc...

This would mean that if someone builds a pipeline up that looks like: socket -> tls -> someprotocol then the buffers handed to the "someprotocol" part would automatically be secure buffers -- even if the protocol decoding is in a completely separate cargo crate or library. Brilliant!
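
To illustrate the idea (the trait and SecureBuf are hypothetical; a real implementation would allocate via VirtualAlloc()/mlock() rather than a plain Vec):

trait PipeBuffer {
    fn as_mut_slice(&mut self) -> &mut [u8];
}

struct SecureBuf(Vec<u8>);

impl PipeBuffer for SecureBuf {
    fn as_mut_slice(&mut self) -> &mut [u8] {
        &mut self.0
    }
}

impl Drop for SecureBuf {
    fn drop(&mut self) {
        // Zero the contents before the memory returns to the allocator,
        // using volatile writes so the loop is not optimized away.
        for b in self.0.iter_mut() {
            unsafe { std::ptr::write_volatile(b, 0) };
        }
    }
}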

5 Likes

I think I like the general idea of this proposal. Let me try to rephrase it using my own words to see if I got it. The idea is to request a managed slice into the file from the API. If you get it, then you are sure that the file contents are actually there in RAM. If not, then the API returns a "Not ready" indicator.

This means that the API implementation can either internally allocate buffers and manage a seastar-style user-mode I/O scheduler, or use the locking/touching tricks that @vitalyd mentioned to make memory mappings more predictable.

Getting into the details, for better performance / scalability, there are two lessons which I think we should learn from epoll-style network I/O:

  • Ability to query multiple files in a single call. Otherwise, querying N files necessarily results in O(N) syscalls, even if the underlying OS can do better, which is bad for the "multiple small files" use case.
  • Ability for the I/O thread to block when there is nothing to do on any active file. Otherwise, this situation results in the application polling the kernel in a loop, which is bad for power consumption and system-wide performance.
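
In API terms, those two lessons might combine into something like this (hypothetical names, loosely modelled on epoll/mio):

use std::io;
use std::time::Duration;

#[derive(Clone, Copy, PartialEq, Eq)]
struct Token(usize); // identifies one registered file

trait Selector {
    /// Block until at least one registered file has a window ready (or
    /// the timeout expires), returning every ready file in one call
    /// rather than with O(N) per-file queries.
    fn wait_ready(&mut self, timeout: Option<Duration>) -> io::Result<Vec<Token>>;
}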

Also, we did not go into the details of what the API parameters should look like, but it is important to design it so that redundant information is not passed to the implementation over and over again on each I/O call, as that causes costly parameter re-validation.

Window size is an important tuning parameter, as it affects syscall frequency vs working set size. It would probably make sense to have it chosen automatically by the implementation by default (or even allow it to dynamically vary at runtime as the implementation analyzes the application's data access patterns), but let the user tune it manually when needed.

1 Like

I think for "requirements" the best place to look is to clone most of the capabilities of the System.IO.Pipelines package being worked on by the dotnetfx core team. It is designed to eliminate a wide class of "wheel reinvention" that is IMHO an issue with Rust application code (or any other language for that matter).

Please read the following two blog articles for some background on the issues and some potential solutions:

In practice, the core of the issue is moving the buf parameter of std::io::Read::read() from an "in parameter" to a "return value", and ideally also making it a parametric type, not just [u8]. This inversion of control is important for improving the flexibility of general streaming data processing, not just I/O. For example, DSP or physical sensor I/O can often produce a stream of integers or floats. Similarly, parsers often have a lexer stage that returns a stream of enums. Compatibility with non-UTF-8 text sources may benefit from a stream of u16 or char. Etc...
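
In code, the inversion looks roughly like this (PipeRead is a hypothetical name for illustration):

use std::io;

// std today: the caller supplies the buffer.
//   fn read(&mut self, buf: &mut [u8]) -> io::Result<usize>;

// Inverted and parametric: the implementation hands a buffer out.
trait PipeRead {
    type Item; // u8 for raw I/O; could be char, u16, f32, a lexer enum...
    fn read(&mut self) -> io::Result<&[Self::Item]>;
}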

This is similar to the behaviour of BufReader, which is the closest existing example of what I'd like to see. However, it's a struct, not a trait, making it totally useless for building up pipelines of I/O "handlers".

In general, there is enormous overlap with the tokio and futures crates. For example, take this method from the dotnetfx core team's work:

https://github.com/dotnet/corefx/blob/87ed90733ec4b307c4d39b541ad4ad893f438f4a/src/System.IO.Pipelines/src/System/IO/Pipelines/PipeWriter.cs#L32-L35

The Task being returned by that method is essentially the same as the already existing Flush future. This could be used via and_then() in user code to chain a closure to be invoked when the reader is complete.

2 Likes

Note there’s a BufRead trait.

1 Like

We’ve mentioned this in passing, but the bytes crate (used in tokio) has traits to represent buffers. So if you’ve not looked yet, you may want to do that to see if its API makes the kinds of things you’d like possible. It’s probably the closest thing to “buffer traits” that I’m aware of.

3 Likes

To achieve portability without locking the user into an inflexible thick runtime, I think it is most correct to have a layered stack that looks like this:

  1. Transport layer that moves data around with an implementation-defined granularity (bytes for Unix pipes, pages for mmap(), packets for IP, 32-bit integers for microcontroller I/O ports...). At this layer, data is not yet validated, and therefore cannot generally be assumed to have any higher semantics than "a bunch of bytes" => you do not control this layer, and for some use cases you cannot ignore it
  2. Structured messaging protocol, sitting on top of the transport, that uses whatever transaction granularity and data semantics are convenient for the programmer, implementing any complex IO and validation pattern needed to build those transactions on top of what the transport layer provides.

IMHO, the purpose of the Rust standard library's Read trait is to correctly interface with layer 1. The C# pipeline library is more focused on layer 2, which in Rust is the job of higher-level abstractions like the BufRead trait or tokio.

We should probably consider the prospect of redesigning these two layers of abstractions as related, but ultimately separate tasks.

5 Likes

So, just trying to summarize the discussion on Read so far: I think we would like to complement its single currently mandatory method...

fn read(&mut self, buf: &mut [u8]) -> Result<usize>

...with a second method based on internally managed storage that looks like this (all names are open to bikeshedding)

fn read_managed(&mut self) -> Result<&[u8]>

Note that with this API, you cannot call read_managed() if you are still holding a slice that you got from a previous call to it. That is intentional: it is a prerequisite for "sliding windows" use cases.
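
A usage sketch of that contract (the trait is defined standalone here only to keep the example self-contained):

use std::io;

trait ReadManaged {
    /// Returns the next window, or an empty slice at end of stream.
    fn read_managed(&mut self) -> io::Result<&[u8]>;
}

fn count_newlines<R: ReadManaged>(src: &mut R) -> io::Result<usize> {
    let mut total = 0;
    loop {
        // The borrow of `src` ends at the bottom of each iteration,
        // which is exactly what lets the window slide forward.
        let window = src.read_managed()?;
        if window.is_empty() {
            return Ok(total);
        }
        total += window.iter().filter(|&&b| b == b'\n').count();
    }
}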

The implementation must guarantee that the underlying data is actually present in RAM, that no data races can occur with concurrent memory mappings, and that reads from the underlying slice will not block. If you want to use full-blown mmap(), in all its blocking and unsafe glory, then that is OS-specific and you should go for the memmap-rs crate.

One thing which read() can do and read_managed() cannot is to tune the buffer size. To some extent, this is on purpose: we sometimes want it to be automatic. But it is an important tuning parameter for performance vs memory footprint, and therefore it should be possible to control it. We could do it this way:

fn transfer_granularity(&self) -> usize;
fn max_window_size(&self) -> usize;
// winsize must be nonzero, a multiple of transfer_granularity(), and at most max_window_size()
fn set_window_size(&mut self, winsize: usize) -> Result<()>;

To what extent these methods could/should have "sane defaults" for existing Read implementations is left as an exercise to the reader. If there are no sane defaults, then we may want to add these facilities as an extra "ReadManaged" subtrait of Read, rather than directly into the Read trait.

Another question that must be resolved is whether such an API could be used in no_std environments. I suspect that it couldn't, which would be another argument in favor of the ReadManaged subtrait approach.

I'm not sure how epoll-style nonblocking readout and readout from N different sources should be handled, maybe this could use inspiration from mio for network I/O. Does anyone have opinions on that?

(One opinion which I personally have is that not every I/O source may have a nonblocking mode. For example, I'm not sure that CPU I/O ports can always be peeked without blocking in low-level operations. If that is the case, we may not want to provide a nonblocking interface for everything that has a Read/ReadManaged implementation, but only in cases where it makes sense.)

4 Likes

Writes are a little bit harder than reads, because at some point you must commit your writes to the underlying storage, and that may not be automatic on all implementations (or, on the contrary, it may be automatic AND block your application without warning). The most promising path I can think of is to provide not just a writable slice, but a wrapper around a writable slice, with the following semantics:

  • Writer implementation must guarantee that no blocking writes or data races will occur as long as the slice wrapper is in scope (that is stronger/more useful than a naive mmap!)
  • Writer trait provides a way to eagerly commit writes to I/O device, in a configurable fashion (e.g. nonblocking mode, schedule multiple writes at once).
  • Wrapper caches any information needed for the separate commit transaction to be fast.

I initially thought about automatically committing on Drop, but I think this is not a good idea after all, because there is no way to cleanly handle I/O errors in a Drop implementation.

As soon as writes come into the equation, another thing to think about is transactional semantics, or lack thereof. Do we want storage writes to be atomic, or to provide a convincing illusion of being so? I personally think that this should not be the case, because such transactional semantics are very expensive to provide and are IMHO best provided by higher-level layers such as SQLite.

Code mockup of how that would extend the existing Write trait:

fn begin_write<'a>(&'a mut self) -> Result<WriteWrapper<'a>>;
fn commit_write<'a, 'b: 'a>(&'b mut self, write: WriteWrapper<'a>) -> Result<()>;

// Some Deref/DerefMut magic to access &mut [u8] backing the WriteWrapper
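
To make that comment concrete, a hedged sketch of the hypothetical wrapper:

use std::ops::{Deref, DerefMut};

struct WriteWrapper<'a> {
    buf: &'a mut [u8],
    // ...plus whatever bookkeeping commit_write() needs, cached here.
}

impl<'a> Deref for WriteWrapper<'a> {
    type Target = [u8];
    fn deref(&self) -> &[u8] {
        &*self.buf
    }
}

impl<'a> DerefMut for WriteWrapper<'a> {
    fn deref_mut(&mut self) -> &mut [u8] {
        &mut *self.buf
    }
}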

The buffer size considerations which I discussed concerning Read also apply here.

2 Likes

You can't add read_managed to the Read trait, as it assumes that the reader has some kind of underlying buffer, which is not always true (e.g. a non-buffered File); thus it must be in a separate trait.

This restriction is too severe and will make this API unusable for many use cases, but, unfortunately, as I've mentioned in the parent thread, without "borrow regions" supported by traits we can't describe the desired API. Also, if I am not mistaken, we would want GATs (generic associated types) as well, since a reader can either own the underlying buffer or borrow it.
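
For reference, a GAT-based shape might look like this (purely hypothetical sketch, written as if GATs were already available):

trait LendingRead {
    // The window may borrow from self (the mmap case) or own its data
    // (the buffer case); a generic associated type covers both.
    type Window<'a>: AsRef<[u8]>
    where
        Self: 'a;

    fn read_managed(&mut self) -> std::io::Result<Self::Window<'_>>;
}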

4 Likes

Essentially, Rust has most of the pieces already, like a LEGO kit that's just been opened and poured out on the floor.

There's the bytes::Buf trait which looks an awful lot like the System.IO.Pipeline stuff, there's the std::io::BufRead trait, which is also very similar, but doesn't appear to be directly compatible with the Buf trait unless I'm missing something. There's the std::io::BufWriter struct, which unfortunately does not have a BufWrite trait that is symmetric with BufRead...

The downside of the bytes crate are:

  • It is for... bytes only. That's okay for low-level I/O, but that's it. No Buf<char> or Buf<u16> for you!
  • It (optionally) pulls in Serde which in my opinion should be orthogonal to buffer management, and seems to be done for convenience. Due to this mixing of "parsing a buffer" and "manage a buffer" concerns, it also pulls in the byteorder crate which it ought not to require.
  • It appears to be primarily designed to create a single buffer at a time, instead of a pool of buffers, unless I'm missing something...
  • Buf implements Read instead of BufRead...

The tokio effort seems very much a work-in-progress, and still appears (at a first glance) to be missing the crucial separation of buffer management from read() and write() calls. For example, even the latest async "sync vectored read" passes the buffer in.

https://github.com/tokio-rs/tokio/blob/f98b81e527dfba6750bf82875af8725bae86854a/tokio-tcp/src/stream.rs#L414-L418

Am I missing something?

It feels like Rust is really close to having all the pieces required to implement efficient zero-copy I/O with all the advantages outlined in the System.IO.Pipelines blog, but it just doesn't seem to be "wired up".

2 Likes

@newpavlov, can you elaborate on cases where you would like to keep multiple memory-mapped windows around instead of either allocating a bigger one or moving it around within the file?

I am asking because in combination with Seek, this worsens the memory-safety troubles of memory-mapping (it means that one can get a "window already mapped" error on read_managed even if a file is opened only once in a single application), and it is not immediately clear to me what one gets in return.

Expanding on this further, one argument against memory-mapping is that on Linux at least, it worsens concurrent filesystem I/O from a race condition to a full-blown data race. On a theoretical level, since there is no way to know for sure who is manipulating a file on this OS, the only truly safe way to access a memory-mapped file is with volatile reads and writes. On a practical level, we can use lockfiles and add some global state to the managed IO library so that at least accesses made via the managed IO library are safe.

3 Likes

A simple example is a file which you want to parse and which contains big binary blobs. You don't want to process those blobs at the parser level (or copy them to the heap); instead you would like to pass them further along. Don't forget that the discussed API should be able to work not only with memory-mapped files, but also with buffers already loaded into memory.
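
As a tiny illustration of the zero-copy requirement (the Record type and length-prefixed framing are invented here):

use std::convert::TryInto;

struct Record<'a> {
    len: usize,
    blob: &'a [u8], // zero-copy view into the window, passed onward
}

fn parse_record(window: &[u8]) -> Option<Record<'_>> {
    // Assume a 4-byte little-endian length prefix, then the blob itself.
    let len = u32::from_le_bytes(window.get(..4)?.try_into().ok()?) as usize;
    let blob = window.get(4..)?.get(..len)?;
    Some(Record { len, blob })
}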

Can you elaborate on the "window already mapped" error? In my understanding, if we forget for a moment that a mapped file can be changed by another process and assume that we work with read-only files, then from the user's perspective a memory-mapped file will behave the same way as a simple owned buffer.

And I don't think that having a restricted read_managed will help with the data race: you still have a &[u8] which can be changed by another process behind your back; the only difference is that you'll have one slice at a time.

3 Likes