Towards a more perfect RustIO

To achieve portability without locking the user into an inflexible thick runtime, I think it is most correct to have a layered stack that looks like this:

  1. Transport layer that moves data around with an implementation-defined granularity (bytes for Unix pipes, pages for mmap(), packets for IP, 32-bit integers for microcontroller I/O ports). At this layer, data is not yet validated, and therefore cannot generally be assumed to have any higher semantics than “a bunch of bytes”. You do not control this layer, and for some use cases you cannot ignore it.
  2. Structured messaging protocol, sitting on top of the transport, that uses whatever transaction granularity and data semantics are convenient for the programmer, implementing any complex I/O and validation patterns needed to build those transactions on top of what the transport layer provides. (A rough sketch of both layers follows.)
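Purely to make the layering concrete, here is a rough sketch of the two layers as traits (all names here, Transport, MessageProtocol, receive, next_message, are hypothetical):

use std::io::Result;

// Layer 1: raw transport with an implementation-defined granularity.
// Data is unvalidated; it is only "a bunch of bytes" at this level.
trait Transport {
    /// Moves at most one transport-granule of data into `buf`.
    fn receive(&mut self, buf: &mut [u8]) -> Result<usize>;
}

// Layer 2: a structured protocol built on top of some transport, using
// whatever transaction granularity is convenient for the programmer.
trait MessageProtocol {
    type Message;
    /// Performs as much transport I/O and validation as needed to
    /// assemble the next complete message.
    fn next_message(&mut self) -> Result<Self::Message>;
}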

IMHO, the purpose of the Rust standard library’s Read trait is to correctly interface with layer 1. The C# pipeline library is more focused on layer 2, which in Rust is the job of higher-level abstractions like the BufRead trait or tokio.

We should probably consider the prospect of redesigning these two layers of abstractions as related, but ultimately separate tasks.

5 Likes

So, just trying to summarize the discussion on Read so far, I think we would like to complement its single currently mandatory method…

fn read(&mut self, buf: &mut [u8]) -> Result<usize>

…with a second method based on internally managed storage that looks like this (all names are open to bikeshedding)

fn read_managed(&mut self) -> Result<&[u8]>

Note that with this API, you cannot call read_managed() if you are still holding a slice that you got from a previous call to it. That is intentional: it is a prerequisite for “sliding windows” use cases.
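For a feel of how the borrow checker would enforce that, the closest existing analogue is BufRead::fill_buf(), which has the same &mut self -> Result<&[u8]> shape:

use std::io::{BufRead, Result};

fn demo<R: BufRead>(reader: &mut R) -> Result<()> {
    let window = reader.fill_buf()?;      // mutably borrows `reader` while `window` is in use
    // let overlap = reader.fill_buf()?;  // error: `reader` is still borrowed here
    let n = window.len();
    reader.consume(n);                    // OK: `window` is no longer used at this point
    Ok(())
}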

The implementation must guarantee that the underlying data is actually present in RAM, that no data races can occur with concurrent memory mappings, and that reads from the underlying slice will not block. If you want to use full-blown mmap(), in all its blocking and unsafe glory, then that is OS-specific and you should go for the memmap-rs crate.

One thing which read() can do and read_managed() cannot is to tune the buffer size, since with read() the caller picks the size of the buffer it passes in. To some extent, this is on purpose: we sometimes want it to be automatic. But it is an important tuning parameter for performance vs memory footprint, and therefore it should be possible to control it. We could do it this way:

fn transfer_granularity(&self) -> usize;
fn max_window_size(&self) -> usize;
// winsize must be greater than zero, a multiple of transfer_granularity(),
// and no larger than max_window_size()
fn set_window_size(&mut self, winsize: usize) -> Result<()>;

To what extent these methods could/should have “sane defaults” for existing Read implementations is left as an exercise to the reader. If there are no sane defaults, then we may want to add these facilities as an extra “ReadManaged” subtrait of Read, rather than directly into the Read trait.

Another question that must be resolved is whether such an API could be used in no_std environments. I suspect that it couldn’t, which would be another argument in favor of the separate ReadManaged subtrait approach.
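If we go the subtrait route, a rough sketch of what it could look like (every name still open to bikeshedding, and the defaults question above left unanswered):

use std::io::{Read, Result};

trait ReadManaged: Read {
    /// Returns a window of internally managed storage containing the next
    /// chunk of data. The window borrows `self`, so only one window can be
    /// live at a time.
    fn read_managed(&mut self) -> Result<&[u8]>;

    /// Smallest unit the underlying transport can move.
    fn transfer_granularity(&self) -> usize;

    /// Largest window the implementation is able to provide.
    fn max_window_size(&self) -> usize;

    /// `winsize` must be non-zero, a multiple of `transfer_granularity()`,
    /// and no larger than `max_window_size()`.
    fn set_window_size(&mut self, winsize: usize) -> Result<()>;
}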

I’m not sure how epoll-style nonblocking readout and readout from N different sources should be handled; maybe this could take inspiration from mio for network I/O. Does anyone have opinions on that?

(One opinion which I personally have is that not every I/O source may have a nonblocking mode. For example, I’m not sure that CPU I/O ports can always be peeked without blocking in low-level operations. If that is the case, we may not want to provide a nonblocking interface for everything that has a Read/ReadManaged implementation, but only in cases where it makes sense.)

4 Likes

Writes are a little bit harder than reads, because at some point you must commit your writes to the underlying storage, and that may not be automatic on all implementations (or, on the contrary, it may be automatic AND block your application without warning). The most promising path I can think of is to provide not just a writable slice, but a wrapper around a writable slice, with the following semantics:

  • The Writer implementation must guarantee that no blocking writes or data races will occur as long as the slice wrapper is in scope (which is stronger/more useful than a naive mmap!)
  • The Writer trait provides a way to eagerly commit writes to the I/O device, in a configurable fashion (e.g. nonblocking mode, scheduling multiple writes at once).
  • The wrapper caches any information needed for the separate commit transaction to be fast.

I initially thought about automatically committing on Drop, but I think this is not a good idea after all, because there is no way to cleanly handle I/O errors in a Drop implementation.

As soon as writes come into the equation, another thing to think about is transactional semantics, or lack thereof. Do we want storage writes to be atomic, or to provide a convincing illusion of being so? I personally think that this should not be the case, because such transactional semantics are very expensive to provide and are IMHO best provided by higher-level layers such as SQLite.

Code mockup of how that would extend the existing Write trait:

fn begin_write<'a>(&'a mut self) -> Result<WriteWrapper<'a>>;
fn commit_write<'a, 'b: 'a>(&'b mut self, write: WriteWrapper<'a>) -> Result<()>;

// Some Deref/DerefMut magic to access &mut [u8] backing the WriteWrapper
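To spell out that “magic”, a rough sketch of such a wrapper (the type, its fields and the cached offset are hypothetical placeholders for whatever bookkeeping a real implementation would need):

use std::ops::{Deref, DerefMut};

struct WriteWrapper<'a> {
    window: &'a mut [u8], // writable slice handed out by the Write implementation
    offset: u64,          // cached position so that commit_write() can be fast
}

impl<'a> Deref for WriteWrapper<'a> {
    type Target = [u8];
    fn deref(&self) -> &[u8] {
        &*self.window
    }
}

impl<'a> DerefMut for WriteWrapper<'a> {
    fn deref_mut(&mut self) -> &mut [u8] {
        &mut *self.window
    }
}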

The buffer size considerations which I discussed concerning Read also apply here.

2 Likes

You can’t add read_managed to the Read trait, as it assumes that the reader has some kind of underlying buffer, which is not always true (e.g. a non-buffered File); thus it must be in a separate trait.

This restriction is too severe and will make this API unusable for many use cases, but unfortunately, as I’ve mentioned in the parent thread, without “borrow regions” supported by traits we can’t describe the desired API. Also, if I am not mistaken, we would also like to have GATs, since a reader can either own the underlying buffer or borrow it.
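For reference, a rough sketch of the kind of signature that GATs would make expressible (a hypothetical “lending read” trait, purely to illustrate the owned-or-borrowed flexibility):

use std::io::Result;

// The window type is chosen by the implementation: it may borrow from `self`
// (a slice into an internal buffer or mapping) or be an owned buffer.
trait LendingRead {
    type Window<'a>: AsRef<[u8]>
    where
        Self: 'a;

    fn read_window(&mut self) -> Result<Self::Window<'_>>;
}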

4 Likes

Essentially, Rust has most of the pieces already, like a LEGO kit that’s just been opened and poured out on the floor.

There’s the bytes::Buf trait, which looks an awful lot like the System.IO.Pipelines stuff; there’s the std::io::BufRead trait, which is also very similar, but doesn’t appear to be directly compatible with the Buf trait unless I’m missing something. And there’s the std::io::BufWriter struct, which unfortunately does not have a BufWrite trait that is symmetric with BufRead.

The downsides of the bytes crate are:

  • It is for… bytes only. That’s okay for low-level I/O, but that’s it. No Buf<char> or Buf<u16> for you!
  • It (optionally) pulls in Serde, which in my opinion should be orthogonal to buffer management and seems to have been done for convenience. Due to this mixing of “parse a buffer” and “manage a buffer” concerns, it also pulls in the byteorder crate, which it ought not to require.
  • It appears to be primarily designed to create a single buffer at a time, instead of a pool of buffers, unless I’m missing something…
  • Buf implements Read instead of BufRead

The tokio effort seems very much a work in progress, and still appears (at first glance) to be missing the crucial separation of buffer management from read() and write() calls. For example, even the latest async “sync vectored read” passes the buffer in.

Am I missing something?

It feels like Rust is really close to having all the pieces required to implement efficient zero-copy I/O with all the advantages outlined in the System.IO.Pipelines blog, but it just doesn’t seem to be “wired up”.

2 Likes

@newpavlov, can you elaborate on cases where you would like to keep multiple memory-mapped windows around instead of either allocating a bigger one or moving it around within the file?

I am asking because in combination with Seek, this worsens the memory-safety troubles of memory-mapping (it means that one can get a “window already mapped” error on read_managed even if a file is opened only once in a single application), and it is not immediately clear to me what one gets in return.

Expanding on this further, one argument against memory-mapping is that on Linux at least, it worsens concurrent filesystem I/O from a race condition to a full-blown data race. On a theoretical level, since there is no way to know for sure who is manipulating a file on this OS, the only truly safe way to access a memory-mapped file is with volatile reads and writes. On a practical level, we can use lockfiles and add some global state to the managed IO library so that at least accesses made via the managed IO library are safe.

3 Likes

A simple example is a file which you want to parse and which contains big binary blobs. You don’t want to process those blobs at the parser level (or copy them to the heap); instead you would like to pass them further along. Don’t forget that the discussed API should be able to work not only with memory-mapped files, but also with buffers already loaded into memory.

Can you elaborate regarding the “window already mapped” error? In my understanding, if we forget for a moment that a mapped file can be changed by another process and assume that we work with read-only files, then from the user’s perspective a memory-mapped file will behave in the same way as a simple owned buffer.

And I don’t think that a restricted read_managed will help with data races: you still have a &[u8] which can be changed by another process behind your back; the only difference is that you’ll have one slice at a time.

3 Likes

An example that occurs to me would be a page- or extent-oriented database file, doing some compaction and cleaning. You’d have a write window and a set of read windows collecting records from sparsely-populated pages, which can then be freed and recycled for more writes. Probably something with larger pages than your typical postgresql MVCC - perhaps a CoW VM disk image file.

2 Likes

I see. In this case, we may want to allow for concurrent slices, which can be done by allowing one to create a read/write slice from an &self (and using as much interior mutability as necessary behind the scenes).

You would also need to wrap read slices in addition to write slices, because you must be able to discard the underlying memory mapping or buffer when it is not used anymore (which was done automatically in the previous API).

End result would look like this:

fn read_managed<'a>(&'a self) -> Result<ReadWrapper<'a>>;
fn begin_write<'a>(&'a self) -> Result<WriteWrapper<'a>>;

One important question that must be resolved is whether the file slices should be borrowed (as I am currently proposing) or owned. The use case that @newpavlov presents would benefit from owned slices that can be e.g. passed from the IO thread to some processing thread, but it is also important to realize that such owned slices would come at a cost. IO sources, such as Files, would have to be atomically reference-counted + synchronized so that their destruction is delayed until all slices in flight have been discarded.
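A rough sketch of what the owned variant implies (hypothetical types): the window keeps its source alive through an Arc, which is exactly the reference-counting and delayed-destruction cost mentioned above.

use std::ops::Range;
use std::sync::Arc;

struct Source {
    // stands in for the file handle, mapping or in-memory buffer
    data: Vec<u8>,
}

// An owned window can be sent to another thread because it keeps its
// source alive on its own instead of borrowing it.
struct OwnedWindow {
    source: Arc<Source>,
    range: Range<usize>,
}

impl OwnedWindow {
    fn bytes(&self) -> &[u8] {
        &self.source.data[self.range.clone()]
    }
}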

Here is the perspective which I am coming from regarding the unsafety problems of memory-mapped files:

  • There is no such thing as a read-only file on writable storage media, which is the vast majority of computer storage today. That could only be achieved by an OS with a strong commitment towards file immutability, and current OSes do not provide that.
  • On current-gen OSes, “read-only” filesystem flags are just a protection against basic usage errors and documentation for system administrators. They are not a useful protection against mmap-induced data races:
    • The fact that a file was marked as read-only at the time you opened it doesn’t mean that it will remain read-only for the entire time you are using it.
    • Even if the whole filesystem on which a file is located were mounted in read-only mode from your perspective, it can still be mounted in writable mode elsewhere (especially in VM or distributed filesystem scenarios).

Therefore, from my perspective, files should always be assumed to be shared mutable resources, and any attempt to provide a safe interface to memory-mapping must provide a synchronization protocol through which one can avoid data races between programs which concurrently mmap the same file.

The simplest reasonably efficient synchronization protocol that can be used here is to enforce reader-writer lock semantics between concurrent users of a given file region. I’m open to other synchronization protocol suggestions if you have them.

One problem is that as a crufty heir of UNIX, Linux does not even provide the required building blocks for enforcing such synchronization system-wide. All we may (or may not) be able to achieve is to guarantee that consenting applications can opt in to such a synchronization protocol. This is what I would like to achieve here. In this prospect, opening a file would be unsafe with a “make sure no one else will be touching that file as long as it is open” contract, but further transactions on that file could be considered safe if that core contract is upheld.

A reader-writer lock normally works by blocking until the target resource is free. However, in the case of file access, I think this is inappropriate, because it could result in an application hanging permanently on file open if another application is currently manipulating the same file. For an mmap use case, I think it would be more appropriate to assume that concurrent read+write or write+write memory mappings normally do not happen, and that if they happen it is the result of a bug. In this case, returning an error from the API when a racy memory mapping is requested would be the right thing to do.

I agree with you here. The hazards exist in any case, and must be handled, due to cross-process access to shared files. I just thought that the “one slice at a time” model would ease reasoning in the common case where a file is only opened by one application at a time, and is only opened once in that application.

If you think that the ergonomics (and implementation simplicity) benefits of only allowing one slice at a time are not worth it in the face of the increased flexibility that concurrent slices bring, then we can go for the more complex, but more flexible solution sketched above.

3 Likes

I believe the right approach to a read trait which will allow us to create several zero-copy views into the underlying buffer is the “borrow regions” functionality. In the read_managed and seek methods we have a mutable part (the counter) and an immutable part (the buffer), so we need some way to convey this information to the borrow checker. All other workarounds will be less ergonomic or efficient.
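To illustrate the split that “borrow regions” would let a trait express: the borrow checker already accepts it for concrete struct fields, but there is no way to promise the same split through a trait method today (types hypothetical):

struct BufferedReader {
    buf: Vec<u8>, // immutable part: the underlying buffer
    pos: usize,   // mutable part: the read cursor
}

impl BufferedReader {
    // Works as an inherent method because the compiler sees the disjoint
    // fields; a Read-style trait method cannot express this split.
    fn window_and_cursor(&mut self) -> (&[u8], &mut usize) {
        let window = &self.buf[self.pos..]; // shared borrow of `buf` only
        let cursor = &mut self.pos;         // mutable borrow of `pos` only
        (window, cursor)
    }
}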

As for the hazards associated with memory-mapped files, I have a feeling that we should change perspective a bit. Instead of making the creation of mapped files unsafe, it could be better to make the acquisition of &[u8] unsafe instead (same for &mut [u8]). So you would be able to safely create mapped files, slice them (instead of &[u8] you’d get an opaque MmapSlice<'a>), index them (a volatile read of one byte), and read and write data via the existing Read and Write traits, whose implementations would use volatile operations under the hood. You would be able to get a zero-copy &[u8] from MmapSlice<'a> using an unsafe method, but then you’d have to deal with the potential problems yourself. Hopefully this will allow us to localize some of the problems.
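A rough sketch of that idea (MmapSlice and its methods are hypothetical):

use std::marker::PhantomData;
use std::ptr;
use std::slice;

struct MmapSlice<'a> {
    ptr: *const u8,
    len: usize,
    _mapping: PhantomData<&'a [u8]>, // ties the handle to the mapping's lifetime
}

impl<'a> MmapSlice<'a> {
    /// Safe single-byte access via a volatile read.
    fn get(&self, idx: usize) -> Option<u8> {
        if idx < self.len {
            // SAFETY: `idx` is in bounds of the mapped region described by `ptr`/`len`.
            Some(unsafe { ptr::read_volatile(self.ptr.add(idx)) })
        } else {
            None
        }
    }

    /// Zero-copy view of the mapping. Unsafe because another process may
    /// mutate the file behind our back, violating `&[u8]`'s immutability.
    unsafe fn as_bytes(&self) -> &'a [u8] {
        unsafe { slice::from_raw_parts(self.ptr, self.len) }
    }
}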

3 Likes

Regarding use cases, you mention Network, Database, and File IO. File IO in particular is a deep topic that doesn’t end with the POSIX API, and this is a place where Rust can do really well.

Generally there are three main use cases of File IO:

  1. Object - a file with write isolation and no dirty reads. It is only available when the application has finished writing it. (e.g. media files).
  2. Log - a file that is continually appended to. Dirty reads are possible, but only with record-based framing (e.g. you can see the last record, but you cannot see a corrupt piece of a record being written). (e.g. WAL for a db).
  3. Memory-mapped - persistent caches, with recoverability provided by logs (e.g. actual db files).

Objects + Logs get you most of what you want. Being able to e.g. open up a typed channel to a file system sink (mpsc::Receiver or crossbeam-io::channel::Receiver) would be great.
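As a sketch of that idea using only std (names hypothetical), a dedicated writer thread can act as the file-system sink behind an mpsc channel:

use std::fs::File;
use std::io::{BufWriter, Write};
use std::sync::mpsc;
use std::thread;

// Records sent on the returned channel are length-prefixed and appended to
// the log file by a dedicated writer thread.
fn spawn_log_writer(file: File) -> mpsc::Sender<Vec<u8>> {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    thread::spawn(move || {
        let mut out = BufWriter::new(file);
        for record in rx {
            // Length-prefix each record so a torn write is detectable on recovery.
            let len = (record.len() as u32).to_le_bytes();
            if out.write_all(&len).and_then(|_| out.write_all(&record)).is_err() {
                break;
            }
        }
        let _ = out.flush();
    });
    tx
}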

4 Likes

From an API perspective, I would like to see a more fluent IO API for Rust. The current one effectively forces an imperative programming style:

use std::fs::File;
use std::io::{Read, Result};

fn foo() -> Result<String> {
    let mut contents = String::new();
    File::open("foo.txt")?.read_to_string(&mut contents)?;
    Ok(contents)
}

To have the API create and return the buffer as a Result would be more ergonomic IMHO:

...
fn foo() -> Result<String> {
    File::open("foo.txt")?.read_to_string()
}

This kind of stuff adds up fast. The former (our current API) feels much clunkier for cases in which the buffer is not being re-used. For these cases, I’d like to see a variant of the IO routines which create and return a Result<BufferType>, rather than having them passed in.
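As a sketch, something close to this could even be prototyped today as an extension trait (names hypothetical):

use std::io::{Read, Result};

// The buffer is created and returned instead of being passed in.
trait ReadToOwned: Read {
    fn read_to_string_owned(&mut self) -> Result<String> {
        let mut contents = String::new();
        self.read_to_string(&mut contents)?;
        Ok(contents)
    }
}

impl<R: Read> ReadToOwned for R {}

With that in scope, the second example above would read File::open("foo.txt")?.read_to_string_owned().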

3 Likes

I kind of skimmed many parts of this thread, but I did like @peter_bertok’s mention of iterators. Would it be a good idea/possible to implement IO/networking/streaming in the same fashion as iterators? You could do something like:

let file1 = File::open("foo.txt").unwrap();
let file2 = File::open("bar.doc").unwrap();
let combined = file1.into_stream()
    .add_file(file2)
    .mmap(mmap::Options::default())
    .collect::<String>();

You could optionally add memory maps. You could add other methods for processing files and things with encodings or compressed formats. Each method would add a struct like the iterators do, and each struct would contain various configuration settings for that step (like the name of a file, or how large of memory maps to use, etc), then either process the data using a closure or collect the data into a vector of a certain type (integers, floats, whatever) or a String.
Each file could have an encoding specified or detected, and the data would not necessarily have to be represented as u8’s; it could be an associated type.

Just my quick thoughts from reading this and the other thread. It might be a horrible idea, I haven’t thought it out much but it would be nice to have an easy, flexible, and generic way to deal with data coming from some source (network, file, database, even a reference/slice to some data already in memory).

1 Like

Added this to the OP to highlight some recent new developments with respect to Tokio.

1 Like

This is already supported by the Read trait via fn chain(). This kind of thing is commonly required for compressed file formats that are provided in chunks or parts, such as the OOXML packaging format used by Microsoft Office.
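For reference, a minimal example of chaining two files into one reader with the existing API:

use std::fs::File;
use std::io::{Read, Result};

fn read_both() -> Result<String> {
    let part1 = File::open("foo.txt")?;
    let part2 = File::open("bar.doc")?;
    let mut combined = String::new();
    // chain() presents the two readers as one continuous stream.
    part1.chain(part2).read_to_string(&mut combined)?;
    Ok(combined)
}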

I’ve been thinking about this a bit more, and the requirements are quite complex in the general case, much in the same way iterators and parser combinators have to support a wide range of scenarios.

I keep thinking of the worst-case scenarios, with the theory that working backwards from there to the simplest scenarios will then handle everything in between. Two good examples are:

  • Handling thousands of HTTP sockets using an efficient select()-style I/O, with:
    • TLS encryption ("CryptRead").
    • Chunked transfer encoding. This means that first you have to parse a header, and then chunks of data. Luckily, HTTP length-prefixes the chunks.
    • Each format has its own independent decoder, which itself takes a Read-derived source.
    • Some of these formats have 2-3 stage decoders, such as first a decompressor, and then a UTF-16 to UTF-8 converter, etc…
    • The final stage is possibly a parser such as the nom crate, JSON, XML, etc…
  • Handling a streaming format where the chunked layers are not length-prefixed. This can occur in some cases when nested formats are used and the inner format has a terminator marker instead of a length prefix.

In all cases, it’s important to be able to interact with some of the layer handlers:

  • Retrieve/verify authenticated encryption (AEAD) success or failure.
  • Retrieve TLS certificates.
  • Retrieve/verify compression checksums.
  • Retrieve the output of a parser or other consumer of the data.

So essentially a solution would have to support:

  • Format conversion that can change the length of the data (UTF-16 <-> UTF-8).
  • Format conversion that can’t change the length of the data, and is hence more efficient to perform in-place over the same buffer.
  • Chaining and Muxing multiple sources into multiple targets (HTTP/2).
  • Creating a subset similarly to slicing or skip(offs).take(len), e.g.: starting from some offset, create a child reader with a provided exact length to read.
  • Creating a subset that terminates itself (based on a marker in the stream).
  • Switching inner decoders mid-stream. E.g.: after decoding one chunk, the remaining data must then be passable to a different handler, which may have to be dynamically chosen (e.g.: based on a header).
  • Converting a stream to an iterator in various ways. E.g.: bytes(), lines(), packets(), or whatever…

Interestingly, there are virtually no I/O errors that are meaningful in these scenarios other than “unexpected end of stream”. For example, it generally makes no sense to have an “access is denied” for a stream that is already open. The only exception that I can think of is if one of the sources is something like a “RetryReader” that automatically re-opens the source file or socket if interrupted.

Conversely, it is fairly important to support non-I/O errors such as invalid cryptographic stream, corrupt compression, checksum failed, parsing errors, unexpected end-of-data, etc…

4 Likes

@peter_bertok is it your intention to come up with a solution that duplicates the functionality of things like tokio and nom/peg/etc? Because it seems like some of that would normally be handled by other crates currently. I’m just wondering how you’d want to handle that. I thought this was more like a base crate for streaming and io stuff but what you’re describing is much more complex.

If your desired solution is too complex to easily implement with the resources available it might be prudent to start with a base project then expand to more complex features.

1 Like

My sense is that @peter_bertok is showing the terrain that a generalized approach must address. The next step would seem to be sketching a set of traits that would support such a wide span of processing architectures. Only at that point will it become apparent what issues are already addressed by existing Rust crates, thus needing only minor adaptation, and which issues require substantial new work.

Personally, I applaud the constructive direction that this thread is taking, rather than the antagonistic one of its predecessor.

9 Likes

The reason I allowed multiple files to be added was to address something I read earlier about retrieving multiple files at a time instead of one per syscall. I don’t know if the Read trait accounts for that, but I’m very skeptical that it would.

1 Like

That sounds like “vectored I/O”, aka “scatter-gather I/O”.
I believe it would be possible for Read, because an implementation could do the gather on the first byte read, and then buffer the rest internally (so it would likely also implement BufRead).

Writing would be harder, since by definition you must buffer; there is a BufWriter struct, but I couldn’t find a corresponding trait in the std docs.
I’m also wondering if a scatter/gather writer trait would make sense… The details are usually fairly platform-specific, and there needs to be a way to configure the targets before doing the aggregated write…
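As a point of reference, a minimal sketch of gather-writes with the std vectored-write API (Write::write_vectored and IoSlice), assuming a toolchain where those are available:

use std::io::{IoSlice, Result, Write};

// Hands several buffers to the OS in one call where the platform supports it
// (writev on Unix); implementations without native support fall back to
// writing the first non-empty buffer.
fn write_record<W: Write>(out: &mut W, header: &[u8], body: &[u8]) -> Result<usize> {
    let bufs = [IoSlice::new(header), IoSlice::new(body)];
    // write_vectored may perform a short write; a real caller would loop,
    // re-slicing the buffers, until everything has been written.
    out.write_vectored(&bufs)
}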

4 Likes

Do you know of any example where the JVM (NIO DirectByteBuffer) and a Rust program (with mio) share more than 2 GB of memory, zero-copy? It should be possible.