Towards a more perfect RustIO


#1

In response to issues raised by the OP of the thread Rust beginner notes & questions, regarding the efficiency and long-term viability of Rust I/O APIs in the stdlib (and elsewhere), I have created this thread as a place to work through those concerns, flesh out the requirements, and determine whether those requirements are already met by libraries on crates.io under development, or whether a new project needs to begin to address them.

In particular, the following comments on the above referenced thread are what this discussion thread should center upon:

And here are some comments (by myself) from the other thread regarding possible ways forward to explore this topic more deeply:

In particular, the last comment I made on the previous thread, which I’ll repeat here as the jumping off point for this thread:

  • It would be good to consider how this API could fit comfortably into the Redox (and perhaps TockOS) world
  • Linux/Unix/Windows/Redox (all similar performance with a user-level API that abstracts away any and all OS-specific issues [to the degree possible])
  • Use Cases:
    • Network IO
    • Database IO
    • File IO
    • Modern Memory/Storage Architectures
  • Cache friendliness at all levels
  • Ergonomic
  • Opinionated (make the “right” thing easy and the wrong thing impossible or difficult)
  • Async/Future friendly
  • Generator friendly
  • Reactive (pull vs push) / Back-Pressure friendly
  • EDIT: Expose a “Safe” C-API (interop)
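To make the "reactive pull / back-pressure" bullet a bit more concrete, here is a minimal sketch of what a pull-based, peekable byte source might look like. Everything here (`ByteSource`, `fill`, `consume`, `SliceSource`) is a hypothetical name invented for illustration, not an existing crate API:

```rust
use std::io;

// Hypothetical pull-based source: nothing is produced until the consumer
// calls `fill`, so back-pressure is implicit. The consumer can peek at the
// buffered window without copying and consume only what it has parsed.
trait ByteSource {
    /// Ensure at least `min` bytes are buffered (or EOF); return the window.
    fn fill(&mut self, min: usize) -> io::Result<&[u8]>;
    /// Mark `n` buffered bytes as consumed.
    fn consume(&mut self, n: usize);
}

/// Trivial in-memory implementation, for illustration only.
struct SliceSource<'a> {
    data: &'a [u8],
    pos: usize,
}

impl<'a> ByteSource for SliceSource<'a> {
    fn fill(&mut self, _min: usize) -> io::Result<&[u8]> {
        Ok(&self.data[self.pos..])
    }
    fn consume(&mut self, n: usize) {
        self.pos += n;
    }
}

fn main() -> io::Result<()> {
    let mut src = SliceSource { data: b"hello world", pos: 0 };
    let window = src.fill(5)?;
    assert_eq!(&window[..5], b"hello"); // peek without consuming
    src.consume(6); // consume "hello "
    assert_eq!(src.fill(1)?, b"world");
    Ok(())
}
```

A real implementation would refill from a device or socket in `fill`; the point of the sketch is only the peek/consume split that makes pull-based composition possible.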

Things to accomplish on this thread:

  • solidify the requirements
  • survey the existing crates.io efforts and how they do/do not meet the requirements
  • analyze the C# Pipelines API and Java NIO (and perhaps others) for ideas/inspiration and/or problems to avoid
  • spin up a repo to begin the Trait/Interface designs and make decisions about how this should all interoperate etc. (as I mentioned above).

QUESTION: To what degree is tokio already addressing the concerns raised? Can tokio be enhanced with the necessary functionality? Can a higher-level library be created over tokio/mio to address the issues? Or should a lower-level library be created that tokio and other higher-level libs could leverage, working in conjunction with mio?

Thoughts?


APIs for inspiration:

Existing relevant Crates on crates.io:

EDIT 2: Here is a Reddit discussion regarding some new developments on the Tokio front relevant to this topic: https://www.reddit.com/r/rust/comments/8yuo07/tokio_io_pool_a_runtime_for_io_heavy_applications/


Rust beginner notes & questions
Rust compared to C#
#2

I’ve mentioned it before on this forum, but seastar is a worthy model to look at. It’s essentially Linux only, however. Its principles are well thought out and could generalize to other platforms if the platform-specifics could be abstracted out nicely.

File I/O in particular is going to be problematic. Linux has fairly poor async I/O support. Most frameworks I’ve seen farm it out to a threadpool as a workaround. Seastar goes further and makes use of O_DIRECT (bypasses kernel page cache) and io_submit syscall for DMA transfers. They then also add their own I/O scheduler (and CPU scheduler as well) to drive I/O at a rate such that the underlying device isn’t saturated beyond capacity.

Seastar supports using dpdk networking or posix (kernel) stack.

Seastar also has its own memory allocator, which shards the memory across the # of cores allocated to the process. And then they build a futures/promise API over all of this :slight_smile:.

In essence, seastar tries to take the kernel out of the equation as much as possible so that it has as much control over the resources as it can.

One might claim they’re going too far. That’s certainly debatable. But for utmost performance, I think it’s the right approach. Certainly if you want a chance to do line rate packet processing or drive some of the more advanced storage devices. spdk might be worth investigating for the storage layer as well.

Tokio looks somewhat similar but it doesn’t go as deep down the rabbit hole as seastar (and tokio supports other OS’s). I’d be really curious what a Rust version of seastar would look like. For some domains, it would be the more appropriate starting point (if not the only one) than tokio.


#3

So, anyone like to spit-ball some high-level requirements and constraints further? Try not to focus on any real or perceived deficiencies of current Rust-related solutions; instead, focus on what the requirements of the “ideal” should be. We can then use that as a starting point to evaluate to what degree the requirements are already being met (or not) in the Rust ecosystem.


#4

A few more random thoughts on this topic …

tokio-codec seems very similar to the Pipelines API in terms of buffer management. Let’s take decoding as an example. Your Decoder impl receives a BytesMut, which you can peek at without consuming (e.g. when there aren’t enough available bytes to parse out a frame), or you can consume part of it with split_to() (e.g. when there are enough bytes for a frame). There may be some technicalities that make it different from Pipelines, but it very much seems the same: internally managed buffers provided by the library.
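The peek-or-consume shape of that Decoder pattern can be sketched with std types only. (The real tokio-codec trait operates on `BytesMut` and consumes via `split_to`, which avoids the copy this `Vec`-based illustration makes; `decode_frame` and the one-byte length prefix are invented here for the example.)

```rust
// Simplified, std-only sketch of the tokio-codec Decoder pattern: peek at an
// internally managed buffer and either consume a whole frame or ask for more.
// Frames here are length-prefixed: one length byte, then the payload.
fn decode_frame(buf: &mut Vec<u8>) -> Option<Vec<u8>> {
    let need = 1 + (*buf.first()? as usize);
    if buf.len() < need {
        return None; // not enough bytes yet: leave the buffer untouched
    }
    let frame = buf[1..need].to_vec(); // the copy a real BytesMut split avoids
    buf.drain(..need); // consume the frame
    Some(frame)
}

fn main() {
    let mut buf = vec![3, b'a', b'b']; // incomplete frame: peek, don't consume
    assert_eq!(decode_frame(&mut buf), None);
    buf.push(b'c'); // frame now complete
    assert_eq!(decode_frame(&mut buf), Some(b"abc".to_vec()));
    assert!(buf.is_empty());
}
```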

This is not zero-copy, however. Kernel buffers are copied into user space. And if you want to look at zero-copy from the DMA perspective of “here’s a buffer I give to the NIC to fill in”, it’s not even close. If the goal of this project is to have a chance of doing line-rate processing, as mentioned, you need kernel-bypass networking (and similarly for storage, à la spdk). If the goal is instead to provide a “normal” lib, then tokio should be examined more closely, because that’s where it’s fitting in, AFAICT.

Most people don’t use Java NIO directly. They’ll use something that builds on top or on the side, like Netty or Finagle. But NIO is essentially an abstraction over OS specific evented network I/O, which is what mio is (tokio is probably more like Netty or Finagle).

So, I think the first thing to identify here is what use cases to address, more specifically than just “network I/O” or “database I/O” - those are too high level to be meaningful. For example, most databases bypass kernel page cache and go directly against the block device (O_DIRECT or equivalent essentially). Once you do that, you’re going to need your own I/O scheduler.

On the networking side, I mentioned dpdk already. There are similar bypass technologies, such as Solarflare’s efvi; Mellanox has something similar. If you dip into this, you may be looking at building your own TCP stack (eg seastar has one). So as mentioned, the potential rabbit hole goes down pretty far :slight_smile:.

My impression is you’re probably interested more in the domain where tokio resides but I think it’s important to state the goals/use cases a bit more precisely. But going lower, such as bypass/user mode drivers for NICs/storage, is virtually empty in Rust right now AFAIK.


#5

I think what we’re shooting for here is something along the lines of NIO, where you can get buffers filled directly from DMA (or something similarly efficient) and/or mmap, have the buffer “page-flipped” to user space, and allow peeking/consuming the buffer with zero-copy all the way through a composable pipeline. Ideally, the API would offer high-level ops that, behind the scenes, do the most efficient thing available on whatever platform they run on. For example, to copy from a source to a destination, you could call a single copy(source, dest) API; behind the scenes, it might decompose into a series of underlying DMA moves without further involvement at the higher layer.
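A small existing example of that "call a high-level op, let the library pick the mechanism" idea is `std::io::copy`. The caller states intent only; for file-to-file copies on Linux, recent std versions specialize to `copy_file_range`/`sendfile` where possible and fall back to a buffered read/write loop otherwise (the in-memory demo below exercises only the generic path):

```rust
use std::io::{self, Cursor};

// The caller just says "copy src to dst"; io::copy is free to pick the best
// mechanism for the concrete types involved, falling back to a plain
// read/write loop for in-memory readers like this one.
fn copy_all(data: &[u8]) -> io::Result<Vec<u8>> {
    let mut src = Cursor::new(data.to_vec());
    let mut dst = Vec::new();
    io::copy(&mut src, &mut dst)?;
    Ok(dst)
}

fn main() {
    assert_eq!(copy_all(b"payload").unwrap(), b"payload".to_vec());
}
```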

That being said, I’d like to hear other opinions on what the requirements should be from other interested parties — for example, @peter_bertok, @BurntSushi, @jbowles, and others.


#6

NIO does not do any DMA. It’s simply a wrapper around something like epoll, kqueue, or a Windows IOCP.

In general, you’ll be hard pressed to do true DMA from userland.

But, talking about buffer management/memory transfer is fine but that’s a very granular task. What types of applications/scenarios would you be targeting? Are they any different from tokio? To use seastar again, it’s targeting the absolute performance critical systems and you can see to what lengths they go to - it’s targeting scenarios beyond mio/tokio, for example.


#7

I don’t believe that is true. NIO buffers are designed to be as low-level as possible and to leverage things like DMA and mmap when they can. From the Wikipedia article linked above:

NIO buffers

NIO data transfer is based on buffers (java.nio.Buffer and related classes). These classes represent a contiguous extent of memory, together with a small number of data transfer operations. Although theoretically these are general-purpose data structures, the implementation may select memory for alignment or paging characteristics, which are not otherwise accessible in Java. Typically, this would be used to allow the buffer contents to occupy the same physical memory used by the underlying operating system for its native I/O operations, thus allowing the most direct transfer mechanism, and eliminating the need for any additional copying. In most operating systems, provided the particular area of memory has the right properties, transfer can take place without using the CPU at all. The NIO buffer is intentionally limited in features in order to support these goals.

There are buffer classes for all of Java’s primitive types except boolean, which can share memory with byte buffers and allow arbitrary interpretation of the underlying bytes.


#8

The only NIO buffer of interest here is the DirectByteBuffer. This merely allows you to avoid java heap to C heap copies. The only thing NIO supports that’s close to what you’re after is, to use Linux as an example, a wrapper around sendfile. There’s no actual “custom” DMA that you can do with a NIO buffer beyond that, at least out of the box; certainly nothing more than you can do with a C buffer. To be clear, the type of DMA I was referring to is in the style of dpdk and its mbufs where you can have the device and the application share the buffers entirely, thus achieving true zero-copy.

NIO is to Java what mio is to Rust, roughly speaking.


#9

Much of this is new to me, tbh. I’m here to learn and contribute where I’m able. I’m new to Rust and haven’t really done systems programming (mostly web APIs, data engineering, machine learning). I’ve spent many days pushing text around with Go’s I/O interfaces (not sure Go is of much use here :slight_smile: https://golang.org/pkg/io/#Reader). As an example, the bufio package implements the io.Reader interface; you typically get a buffer with a default size, but the function NewReaderSize lets you define it: https://golang.org/src/bufio/bufio.go?s=1304:1354#L36
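For what it's worth, the Go `bufio.NewReaderSize` mentioned above has a direct analogue in Rust's std, `BufReader::with_capacity`, which takes an explicit buffer size instead of the default (8 KiB in current std). The `first_line_with_buf` helper is just a wrapper for the example:

```rust
use std::io::{BufRead, BufReader, Cursor};

// Rust analogue of Go's bufio.NewReaderSize: a buffered reader with an
// explicit internal buffer size instead of the default.
fn first_line_with_buf(data: &str, cap: usize) -> String {
    let mut reader = BufReader::with_capacity(cap, Cursor::new(data.to_owned()));
    let mut line = String::new();
    reader.read_line(&mut line).unwrap();
    line
}

fn main() {
    assert_eq!(first_line_with_buf("line one\nline two\n", 16), "line one\n");
}
```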

That said, discussion here is exciting and I’m willing to contribute where I’m able.


#10

In my mind, my original “requirement” that I felt that Rust was failing to meet was composable and efficient streaming data processing, of which I/O is only a small piece. I think a lot of API designers focus on the I/O part, because that makes up the majority of the code in the standard library, but in user code, it’s just an “endpoint”.

For the Rust user, the abstract trait is the important thing. The goal is to have as many things as possible fit into the trait without compromise, so that it can be composed. The model to copy is very much the current Iterator trait, which is abstract, composable, general-purpose, and efficient.

The super-fancy libraries like seastar are amazing, but I feel like that’s far off. The aim is not to copy that, because it would be a massive undertaking, but to enable it as a use-case without users having to rewrite their code that consumes data coming in from seastar.

I would say that the two “primary” I/O use-cases to focus on are the ones that will be common in the near future and enable fantastic performance not available in less flexible or legacy languages:

  1. User-mode network sockets and RDMA networking. Both write directly to a pool of buffers allocated in user space. This is practically a requirement for 40 Gbps or faster Ethernet, which is becoming common. Rust is an ideal language for many use-cases in this space, such as memcached-style service layers.
  2. Memory mapped files, and by extension, support for the new non-volatile memory that’s about to go mainstream and uses mmap-style I/O.

Orthogonal to this is support for asynchronous I/O. Whatever this solution does, it has to be extensible to support the tokio effort and efficient handling of many source streams. This solution doesn’t have to actually provide concrete implementations, it just has to be forwards-compatible with them.

From the users’ perspective, I’d like to see:

  1. Zero copy of bulk data unless necessary.
  2. Adapters for the legacy Read and Write traits.
  3. Composability much like the implementations of the Iterator trait, which build up composite types such as: Skip<Cycle<Chain<FooIter,BarIter>>>.
  4. Pre-built common handlers or adapters, such as BOM detection, transcoding, cryptography, compression, various buffering strategies, etc…
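The Iterator-style composability in point 3 is easy to see concretely: each adaptor call wraps the previous iterator in another layer of a nested concrete type, and the compiler monomorphizes the whole stack away.

```rust
// Each adaptor nests another layer onto the concrete type, ending up as
// Skip<Take<Cycle<Chain<slice::Iter<i32>, slice::Iter<i32>>>>> --
// all monomorphized away by the compiler, so composition costs nothing.
fn composed() -> Vec<i32> {
    let a = [1, 2, 3];
    let b = [4, 5];
    a.iter()
        .chain(b.iter())
        .cycle()
        .take(8)
        .skip(2)
        .copied()
        .collect()
}

fn main() {
    assert_eq!(composed(), vec![3, 4, 5, 1, 2, 3]);
}
```

The hope, as stated above, is that a streaming-I/O trait could compose the same way: buffering, transcoding, and decompression stages nesting like adaptors.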

Key issues I foresee are:

  • This style of buffer management has a higher chance of resulting in unsafe behaviour. For example: memory mapping a file as both read and write separately and then concurrently performing both read and write I/O on overlapping offsets. That violates the expectation of the user and the compiler that there are no mutable aliasing borrows of a section of memory borrowed immutably.
  • This style of I/O may be less elegant, less performant, or even unsafe if opportunistically reading ahead large amounts when one of the pipeline layers cannot handle this well. For example:
    • Reading opportunistically too far into a sparse file could cause unexpected I/O errors.
    • Reading too far into a file being modified concurrently may cache data that is not yet valid, such as database-like files being modified by other processes.
    • Reading a file using mmap that is simultaneously being appended to.
    • Reading ahead too far opportunistically may cause poor performance due to the chunks of data being processed by each pipeline stage not fitting into CPU caches.

It’s possible that most of the above may be solved by providing both “minimum required items” and a “maximum items” parameter. Alternatively, an appropriate pluggable “buffering strategy” trait implementation can have the same effect, but may be less flexible.
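The "minimum required / maximum items" idea in the last paragraph could be expressed as a pluggable strategy trait. Everything below (`BufferStrategy`, `read_ahead`, the two strategies) is a hypothetical sketch invented for illustration; no such trait exists in std or tokio:

```rust
// Hypothetical pluggable buffering strategy: given how many bytes the
// consumer needs right now, decide how far ahead to read.
trait BufferStrategy {
    /// Returns (min_bytes, max_bytes): read at least `min`, at most `max`.
    fn read_ahead(&self, needed: usize) -> (usize, usize);
}

/// Conservative strategy: never read past what was asked for, sidestepping
/// the sparse-file / concurrent-modification hazards listed above.
struct Exact;
impl BufferStrategy for Exact {
    fn read_ahead(&self, needed: usize) -> (usize, usize) {
        (needed, needed)
    }
}

/// Cache-friendly strategy: read ahead, but cap chunk size so each pipeline
/// stage's working set stays within a fixed budget (e.g. L2-sized).
struct Capped {
    cap: usize,
}
impl BufferStrategy for Capped {
    fn read_ahead(&self, needed: usize) -> (usize, usize) {
        // The max must still cover what the consumer actually needs.
        (needed, self.cap.max(needed))
    }
}

fn main() {
    assert_eq!(Exact.read_ahead(100), (100, 100));
    let c = Capped { cap: 64 * 1024 };
    assert_eq!(c.read_ahead(100), (100, 65536));
    assert_eq!(c.read_ahead(100_000), (100_000, 100_000));
}
```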


#11

on the point of composability and modelling after iterators, it’s perhaps obvious but worth emphasising: one should compose directly to the other. Taking an iterator of lines/records/frames from a stack of IO adaptors should be as natural and performant as possible. Of course for sequential IO, but also for other cases like skip, take, step_by (as forms of forward seeking), flatten (for coalescing), and some of the peeking and reversing adaptors.
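What "one should compose directly to the other" already looks like with today's std, where a BufRead stack feeds straight into iterator adaptors (the `every_other` helper is just a name for this example):

```rust
use std::io::{BufRead, BufReader, Cursor};

// An IO adaptor stack (here just BufReader over an in-memory reader) feeding
// directly into iterator adaptors: every other line (a form of forward
// seeking over records), at most two of them.
fn every_other(input: &str) -> Vec<String> {
    BufReader::new(Cursor::new(input.to_owned()))
        .lines()
        .map(|l| l.unwrap())
        .step_by(2)
        .take(2)
        .collect()
}

fn main() {
    assert_eq!(every_other("a\nb\nc\nd\ne\n"), vec!["a", "c"]);
}
```

The gap the post points at: `step_by` here still reads and discards the skipped lines; an I/O-aware design could turn such adaptors into actual forward seeks on the underlying source.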

on the point of possible unsafety with (what amounts to) memory and resource aliasing: there is some amazing work being done in japaric’s RTFM and embedded-hal on using compile-time type-safety to prevent resource allocation errors, for things like IO pins and DMA channels, and it’s all zero-cost. It probably isn’t directly applicable to more dynamic OS-level allocations, but it would be worth looking at and mining for ideas.


#12

Since you pinged me, I’d just like to note that while I love that this discussion is happening, it’s not really the kind of thing I’m good at. It’s hard for me to set requirements at this kind of granularity. I mean, I can say things like

  1. I/O must be composable.
  2. When required, applications and/or libraries should be capable of precise control over I/O to manage performance requirements, ideally without throwing out (1).

But I feel that this just kicks the can down the road, so I’m not sure how useful it is.

Also, my particular approach to these sorts of problems, especially in such a nascent state, is to pick a problem that can’t be easily solved today in Rust’s ecosystem that you think would be solved by improvements to Rust I/O’s story and just go out and build it. You’re going to learn a lot more and much more quickly that way.


#13

I was just about to reply to this thread and essentially say the above, just phrased differently.

I’m afraid that stating the goal as delivering a set of traits that someone else can use to build these things is … just not workable. It runs an extremely high chance of an “API designed in an ivory tower”. API/abstraction design is hard. The best way is to have a couple of usecases in concretion, and then see what can be generalized/abstracted out. Bottom up, if you will. Top down is virtually doomed from the beginning unless you’ve already built this thing many times and know exactly what needs to be done.

This is particularly true for things being discussed in this thread, like user space networking stacks. Every little detail matters. If you browse the dpdk-dev mailing list, you’ll see plenty of discussion around things like where to place a certain field - should it be in the 1st cacheline or 2nd. There’s just no way you can start this with a “set of traits first” type of thing.

Finally, let’s assume that, through some miracle, you come up with these traits. The Rust ecosystem isn’t really that far along yet — someone still has to do the hard work of actually implementing them; until then, nothing can actually be done, certainly no more than today.

So if someone really wants to take this on and has a lot of spare time, I’d suggest building something concrete first and then see if you have more insight into where things can be abstracted and how. Or, try copying seastar (with a Rust twist) whose authors have already done this exercise a few times.


#14

I agree with most of what you and @BurntSushi had to say about this. I agree that just stating what the traits should look like probably doesn’t lead anywhere.

To me, the “requirements,” to the degree there are any, seem to be congealing around the ability to process a multi-format document/file/source that might have multiple nested layers of differing content, each needing to be decoded/processed in particular ways with as little copying of buffers as possible (from the HW drivers all the way through user-space processing). At least, so far, that is the only somewhat unique requirement I’ve heard articulated.

Does anyone know of any particular use-cases that could be explored that aren’t already well-served in the Rust ecosystem?


#15

If you take the previous discussion, one of the things that kept coming up was the fact that ripgrep has a few shims for things that probably could be put into separate crates. e.g., ripgrep really just wants to search a &[u8] as directly as possible, but we often can’t search the contents of a file directly. It might need to pass through, say, decompression followed by transcoding before it can be meaningfully searched. As far as I know, there is nothing in the ecosystem that provides that sort of plug-and-play functionality. One of the points of disagreement (I think) I had in the prior thread was whether this was something fundamental or whether it was just because there was a missing crate or two.

So if you ignore what ripgrep specifically is doing, and take the problem of searching files that may need one or more transforms applied to them before being searched, then what is the ideal solution to that problem? Is what ripgrep has done ideal and we just need to put it in other crates? Or does there exist some better abstractions on which a solution can be built? (I believe @peter_bertok was strenuously arguing in favor of the latter.)
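One way to picture the "plug-and-play transforms" question with nothing but std: each stage is a `Box<dyn Read>` chosen at runtime (the way a searcher might choose after sniffing magic bytes), wrapping the previous one. The `Rot13` stage and `pipeline` function are invented for illustration; real stages would be decompressors or transcoders:

```rust
use std::io::{Cursor, Read, Result};

/// Illustration-only transform stage: a Read adapter applying ROT13.
/// A real pipeline would slot decompression or transcoding in here.
struct Rot13<R: Read>(R);

impl<R: Read> Read for Rot13<R> {
    fn read(&mut self, buf: &mut [u8]) -> Result<usize> {
        let n = self.0.read(buf)?;
        for b in &mut buf[..n] {
            *b = match *b {
                b'a'..=b'z' => (*b - b'a' + 13) % 26 + b'a',
                b'A'..=b'Z' => (*b - b'A' + 13) % 26 + b'A',
                other => other,
            };
        }
        Ok(n)
    }
}

/// Build the stack dynamically: each enabled stage boxes and wraps the
/// previous one, and the consumer sees only a single `dyn Read`.
fn pipeline(src: impl Read + 'static, rot: bool) -> Box<dyn Read> {
    let mut stack: Box<dyn Read> = Box::new(src);
    if rot {
        stack = Box::new(Rot13(stack));
    }
    stack
}

fn main() {
    let mut out = String::new();
    pipeline(Cursor::new("uryyb"), true)
        .read_to_string(&mut out)
        .unwrap();
    assert_eq!(out, "hello");
}
```

This is roughly the shim ripgrep carries today; the open question in the post is whether better abstractions than a chain of `Read` wrappers exist, e.g. ones that preserve zero-copy access to the underlying buffers.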


#16

The main point I was trying to articulate is that ripgrep managed to hit quite a number of the potential use-cases already, or would naturally in the future if extended to scenarios such as searching through common non-text document formats (OOXML, PDF, etc…).

Agreed.

I would start with:

  1. Fixing the small issue in memmap-rs blocking 32-bit support.
  2. Come up with a trait that allows both mmap and “traditional” I/O. I toyed around with wrapping memmap-rs in my experimental trait, and it doesn’t appear to be terribly difficult, just a bit fiddly.
  3. Experimentally rewrite ripgrep’s I/O layer in terms of that new interface.

Concurrently, I would plan ahead to:

  1. OOXML text searching, which is a good example because it’s got layers upon layers of streaming data that may not fit into memory.
  2. Decompression/Compression streams, which ripgrep appears to be doing via external tools and Unix pipes. These are also prime candidates for in-process decompression for higher performance. If the stream trait works as intended, both scenarios should be plug & play.

#17

I haven’t had a chance to read through this thread just yet (forgive me, but I spent a few hours going through OP’s thread that spawned this one, and I have some resources I would like to share). To start off, I am no expert with file I/O performance, so I will defer to developers with more experience in that area. But what I do know is that virtual memory is far older than I am, and it is almost always a bad thing to ignore. Cited here:

These articles cover mmap in combination with algorithms to offer performance insights. But the primary point is pretty clear: use virtual memory instead of streaming from persistent storage.


#18

Note that although good for single-task throughput, virtual memory is downright terrible for scalable multitasking, because with it you lose all hope of ever understanding at which point your code is going to block (every pointer dereference is potentially blocking, in a non-local way). This makes it almost impossible to apply common I/O scalability optimizations such as trying to confine all I/O inside of a small thread pool.


#19

I think it is safe to say that mmap is not so much about performance as convenience. It makes it “convenient” to treat a large file as a contiguous block of memory when it really isn’t, resulting in shite performance. It’s kind of like extending POSIX file I/O to the network; it’s just the wrong paradigm.


#20

IIUC, isn’t that the definition of virtual memory pressure? Single threaded applications can suffer from VM pressure just as easily.