Rust beginner notes & questions

I've made a career fixing issues exactly like this, so I dunno... it may be hyperbole, but it's effective. 8)

I've seen - repeatedly - clusters of servers worth millions running like slow molasses because someone used a too-small network buffer throughout the codebase. It was the default in C++. It's going to be the default in Rust. It's going to be slow too. It's not complicated.

Firstly, the BufRead or Read2 traits are 100% stream-based I/O. The only differences from Read are that the buffer is "handed to the user" instead of being "passed in", and that the source position is not forcibly advanced until the caller has consumed the buffer. If this is the default, API designers get more flexibility, and user code is more elegant and more efficient by default.

There are a couple of interacting / composable use-cases here, with mmap just being one. Picture a scenario where there are many libraries, some third-party, some low-level, some simple, some complex. Ideally you'd want to be able to compose these:

  • With a minimum of fuss.
  • With a minimum of unnecessary copies.

So the traditional Read model seems fine, but let's take the simplest scenario you're ever likely to face: copying. You just want to copy a file to a file, a file to a socket, or a socket to a file. Something simple like that.

Now in the naive model, the timeline would look something like:

let mut buf = [0u8; 64 * 1024]; // typically ~64KB or whatever
// ...
let n = src.read(&mut buf)?; // obviously this is a loop, but pretend it's unrolled
dst.write_all(&buf[..n])?;
let n = src.read(&mut buf)?;
dst.write_all(&buf[..n])?;

At no point is your program simultaneously reading and writing.

Secondly, if the data is coming in from a socket, then behind the scenes the src stream needs a second buffer for the network device to write into while you're writing to dst; otherwise the incoming packet data would be dropped on the floor during that time. So really, there's a device -> kernel copy and then a second kernel -> user copy. While you're receiving that second copy from the kernel, you aren't sending anything to the destination, which is itself a user -> kernel copy. We're up to 3 copies already, but logically we only wanted to do 1. Ugh.

There are all sorts of other complexities at play here that are commonly overlooked. For example, if the user-provided buffer is too small -- I've seen 512 bytes in legacy code -- the sheer number of kernel transitions will drag you down to a few hundred KB/sec no matter what the hardware is. Ouch.

If the buffer is too big then you still have problems:

  • It'll blow through your CPU caches and hit main memory. ~8KB is pushing it for L1 cache, 32KB for L2 cache, and a few MB for L3 cache. Sometimes this helps performance. Sometimes it doesn't. Your naive copy code has no chance of determining what to do reliably.
  • While you're synchronously waiting for the dst.write() to clear, the NIC can't just keep filling the kernel with data you're not consuming. It will eventually start dropping data on the floor, typically after around 64KB on older operating systems and 2-16MB on modern ones. Now you're getting retransmits and other fun inefficiencies. Your throughput graphs start to look like the teeth of a comb, and everybody scratches their heads wondering why the app can't get "full throughput" on modern equipment. People usually start going through Wireshark traces at this point and blaming the network team down the corridor. That rarely helps.
  • If this is, say, a web server handling thousands of simultaneous connections, you could exhaust main memory or create a massive backlog of IOPS on the destination drive, causing all sorts of grief. How is the "simple" copy code to know that it needs to reduce its buffer size due to server-wide contention? What if the buffer size is a constant six layers deep in some "ftp-rs" lib or something?
  • If this is a socket-to-socket copy, you're just contributing to buffer bloat, and now the aforementioned network engineers hate you with a burning passion.

None of this is news; everybody knows this is a problem, which is why asynchronous I/O libraries like tokio are a thing. Well... okay, not everybody: the classic GNU cp command-line tool sometimes uses wonky buffer sizes, does synchronous copies, and had a number of decades-old bugs that only got fixed recently. They're clearly noobs though. Weeeellll... okay, not just them: Microsoft messed up too back in the day with Windows Vista, which had notoriously slow file copies because they got the buffer sizes wrong. Oops.

Anyway, on to BufRead/Read2: by default, the timeline would pretty much look like the above, with only an extra consume() call, which is not much more than incrementing an integer:

let buf = src.read(0); // we *get* a buffer instead of passing one in...
dst.write(buf); // ideally, this ought to be a 'move' so we lose control of the buffer.
src.consume(buf.len()); // just copy as much as we can, fast as we can...
let buf = src.read(0); // The magic: nobody said this is the SAME buffer!!!
dst.write(buf);  // Now we're just passing buffers like a bucket brigade...
src.consume(buf.len());

It looks similar enough, right? Not exactly rocket science for the end user. It's not even one extra call, because we got to skip the buffer allocation at the beginning. However, behind the scenes, nothing stops the implementations of the src and dst streams from doing all sorts of magic.
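For comparison, here's a minimal sketch of that loop written against std's existing BufRead trait (fill_buf / consume), which already has the "hand the buffer to the caller" shape; the hypothetical Read2 mostly generalises who owns and sizes that buffer:

use std::io::{self, BufRead, Write};

// Copy everything from `src` to `dst` without allocating a buffer of our own:
// the reader decides the buffer size and hands us a view into its buffer.
fn copy_via_fill_buf<R: BufRead, W: Write>(src: &mut R, dst: &mut W) -> io::Result<u64> {
    let mut total = 0u64;
    loop {
        let buf = src.fill_buf()?;   // borrow the reader's internal buffer
        if buf.is_empty() {
            return Ok(total);        // EOF
        }
        dst.write_all(buf)?;         // push the whole chunk downstream
        let n = buf.len();
        src.consume(n);              // tell the reader we're done with it
        total += n as u64;
    }
}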

For example, imagine that src is a user-mode network library socket. Still a "socket", still a synchronous "stream", no fancy tokio asynchronous stuff in the picture at all.

The way these work is that when you open the src socket stream, the library allocates a pool of buffers, say ten 64KB buffers, in your process. As data comes in, the NIC writes it directly into the user process's memory space, bypassing the kernel's internal buffering and copying. When you request a buffer, the library "peels off" a filled one and you get to consume it. Meanwhile, the network adapter keeps asynchronously filling the remaining buffers in the background while you're busy sending data to dst. You get the buffer size that the network driver likes. The driver can increase the number of buffers for you if it thinks that's a good idea. The pool can be shared process-wide to prevent memory exhaustion.

But wait a minute... Rust has "move semantics" by default! If the API is designed the right way, then the default write operation will move the buffer, taking ownership of it. At that point, it is potentially free to perform the write asynchronously behind the scenes -- until you call flush() or whatever.

In other words, the naive file-copy scenario could have tokio-rs levels of asynchronous magic going on... or not... depending on the specific implementations of the reader and writer traits. This would be largely transparent to the read/write copy loop, which never even had to make a buffer-size decision (something it can't do properly out of context anyway).
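To make the ownership point concrete, here's a rough sketch of what such a pair of traits could look like -- the names and signatures are invented purely for illustration, not a proposal for the real API:

use std::io;

// The reader *produces* buffers it owns and sizes itself; the writer *consumes*
// them by value, so an implementation is free to hand out pooled buffers, or to
// queue the write asynchronously once it has taken ownership.
pub trait OwnedBufRead {
    type Buf: AsRef<[u8]>;
    /// Return the next filled buffer, or None at end of stream.
    fn next_buf(&mut self) -> io::Result<Option<Self::Buf>>;
}

pub trait OwnedBufWrite {
    type Buf: AsRef<[u8]>;
    /// Take ownership of the buffer; the actual write may happen lazily.
    fn write_buf(&mut self, buf: Self::Buf) -> io::Result<()>;
    fn flush(&mut self) -> io::Result<()>;
}

// The copy loop itself never allocates and never picks a buffer size.
pub fn copy_owned<R, W>(src: &mut R, dst: &mut W) -> io::Result<()>
where
    R: OwnedBufRead,
    W: OwnedBufWrite<Buf = R::Buf>,
{
    while let Some(buf) = src.next_buf()? {
        dst.write_buf(buf)?; // ownership moves; the buffer can be recycled downstream
    }
    dst.flush()
}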

Realistically, the mmap case isn't massively better than traditional I/O; it just lets the kernel share a memory space with the user-mode program, reducing the copies from 3 to 2 for a "copy" and from 2 to 1 for a "reading" use-case. That's not a huge win in practice. What tends to happen instead is that people skip the fiddly "window sliding" code they should be writing, incorrectly assume that mmap == &[u8], and end up with code that is simpler, much faster, and wrong.

Generally, the feedback from some people in this thread seems to be that they want Rust to provide "full control" and that this style of API design somehow "takes control away", except that it doesn't. The application developer still gets 100% of the control over buffer allocation; after all, they get to choose which trait implementation they want to use. Plain old File coupled with a single-buffer reader? You get crappy performance, but hey, that may be fine for your use-case. Choose a memory-mapped file reader or a user-mode network socket? Enjoy your blazing-fast zero-copy performance without having to rewrite your reader code! Let the std::io default decide for you? Enjoy your automatic performance boost when people finally realise that NVDIMMs and RDMA are the new standard and Rust gets updated to match.

Some people are happy to hand-roll things themselves and sticky-tape a bunch of complex libraries together, which works great... until they need to decrypt a stream, decompress its contents, and parse that data. Some of those steps are forced to copy, and then, ah well, nothing you can do. However, some streaming layers could benefit from being able to optionally pass huge chunks through as-is; but if you're composing streams out of third-party library code, you might be doing 6 or 7 copies in the various Read layers before you're done. Think of HTTP with its embedded binary streams, an automatic text decoder that detected UTF-8, or retrieving a binary BLOB via something like SqlDataReader.GetStream()...
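To make the layering concrete, here's a contrived sketch of such a stack (using flate2 purely as an example dependency); each adapter keeps its own internal buffer, so bytes get copied at every layer before the parser ever sees them:

use std::fs::File;
use std::io::{BufRead, BufReader};

use flate2::read::GzDecoder; // example third-party dependency, for illustration

// Every read through this stack is copied at each layer:
// kernel -> inner BufReader -> GzDecoder's output -> outer BufReader -> `line`.
fn parse_compressed(path: &str) -> std::io::Result<()> {
    let file = BufReader::new(File::open(path)?);
    let gz = GzDecoder::new(file);
    let mut reader = BufReader::new(gz);
    let mut line = String::new();
    reader.read_line(&mut line)?;
    // ...and a real parser would likely copy again into its own token buffers.
    Ok(())
}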

PS: I hope you like tiny buffers, because std::io::copy uses 8 KB, set with a compile-time constant.

And tokio's version (nearly 100 lines just to copy data!) uses 2 KB, also a compile-time constant.
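For what it's worth, with today's Read/Write traits the only way around those defaults is to hand-roll the loop with a buffer size of your own choosing -- a minimal sketch:

use std::io::{self, Read, Write};

// Like std::io::copy, but with a caller-chosen buffer size rather than the
// library's compile-time default.
fn copy_with_buf_size<R: Read, W: Write>(src: &mut R, dst: &mut W, buf_size: usize) -> io::Result<u64> {
    let mut buf = vec![0u8; buf_size];
    let mut total = 0u64;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            return Ok(total); // EOF
        }
        dst.write_all(&buf[..n])?;
        total += n as u64;
    }
}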

3 Likes

I'm intrigued by what you have to say regarding this topic. I've been looking for something interesting to dig my teeth into with respect to Rust. I'd like to spin up a Git repo to begin working on this idea, and I'd like your participation, at the very least in an advisory capacity; any collaboration would be appreciated.

Honestly, I see that as the only useful way to move forward on something like this. Continued discussion about it is probably not useful without starting to actually implement something.

Some things to note:

  • It would be good to consider how this API could fit comfortably into the Redox (and perhaps TockOS) world
  • Linux/Unix/Windows/Redox (all similar performance with a user-level API that abstracts away any and all OS-specific issues [to the degree possible])
  • Use Cases:
    • Network IO
    • Database IO
    • File IO
    • Modern Memory/Storage Architectures
  • Cache friendliness at all levels
  • Ergonomic
  • Opinionated (make the "right" thing easy and the wrong thing impossible or difficult)
  • Async/Future friendly
  • Generator friendly
  • Reactive (pull vs push) / Back-Pressure friendly
  • EDIT: Expose a "Safe" C-API (interop)

Anything else?

I'd like to start by creating a repo, spinning up a README.md in it, and beginning to lay out requirements and design the Traits/Interfaces/Abstractions first (ideally with your seemingly knowledgeable input).

Let me know if you are interested in any sort of collaboration on this. If not, I'll probably pursue it anyway because you've piqued my interest, but I feel you (and others) could add a lot of missing experience and knowledge to the endeavor.

7 Likes

To be honest, I feel like I'm learning alongside everyone else. 8)

I didn't start out with the zero-copy thing as a goal; I only noticed it as a possibility after reading about the System.IO.Pipelines design.

I think a lot more research coupled with some experimental API tire-kicking is the best bet, and I would definitely seek the involvement of the tokio guys. Asynchronous I/O is usually used when performance matters the most, and zero copy = more performance!

I haven't personally read through the full System.IO.Pipelines source yet, but I might over the weekend, time permitting. Similarly, it would be worth seeing what the Java guys did in the NIO library.

I also discovered that the borrow-checker doesn't like my naive design, making it a little painful to use. That might just be the way it is, but I'm not sure. A more experienced Rust API designer could probably provide hints...

2 Likes

Have you looked at mio? Also, see the book (hosted on GitBook).

EDIT: I knew about mio (in a trivial sense) before, but I hadn't yet spent much time digging into it. As I begin to, I think that starting a new project (as opposed to building on mio) is probably not useful, at least until I understand mio fully and am sure that it isn't meeting, or at least intending to meet, the needs you have described.

EDIT: Does anyone else know of other crates that should be examined in depth and understood before deciding whether something like what is proposed in this thread is needed or is already in the works?

1 Like

I did look at that; it's "just" doing things like wrapping the system-provided notification APIs. It's useful in scenarios where you might be reading from 1000 sockets at once, all of them trickling data into a dozen server CPU cores doing the processing.

The Read2 API is much simpler than this conceptually, and is only tangentially related. E.g.: it might be feasible to extend it with a fn peek_any(...) API somehow via a related trait or something.

PS: The Performance of Open Source Software | Parsing XML at the Speed of Light was one of the reasons I went down this rabbit-hole. I've recently been thinking about how to parse XML as fast as possible for use-cases similar to ripgrep. The common case is almost, but not quite, pass-through for a lot of the bulk data processing. If the encoding is UTF-8, which it usually is, then you can often pass the text on to the consumer as-is. Except you can't, because of &nbsp; and the like. You can also often pass the text through as &str, but just like with the mmap situation, that can get a bit iffy in memory-constrained environments.
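A rough sketch of that "almost pass-through" idea, assuming we already have a borrowed UTF-8 chunk in hand (the unescape helper here is hypothetical and only handles a couple of entities):

use std::borrow::Cow;

// Fast path: if the chunk contains no '&', there is nothing to decode and the
// borrowed slice can be handed to the consumer as-is, zero-copy.
fn decode_text(raw: &str) -> Cow<'_, str> {
    if !raw.contains('&') {
        Cow::Borrowed(raw)
    } else {
        Cow::Owned(unescape(raw))
    }
}

// Hypothetical helper: a real XML parser handles all entities and error cases.
fn unescape(raw: &str) -> String {
    raw.replace("&nbsp;", "\u{00A0}").replace("&amp;", "&")
}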

It's an interesting design problem, that's for sure! 8)

1 Like

Yes, I agree, as I look more into mio.

Yes, it is.

I think it might be advantageous to end/close this thread and spawn off a new thread to do the following:

  • solidify the requirements
  • survey the existing crates.io efforts and how they do/do not meet the requirements
    • analyze the C# System.IO.Pipelines API and Java NIO, and perhaps others, for ideas/inspiration and/or problems to avoid
  • spin up a repo to begin the Trait/Interface designs and make decisions about how this should all interoperate etc. (as I mentioned above).

EDIT: Unfortunately, today, I need to keep focused on other work, so, I'll try to get back to this when I can.

EDIT: I've started a new thread here (Towards a more perfect RustIO) to provide a place to discuss the "Rust IO" issue. Further discussion on this thread should be limited to other concerns (if any).

4 Likes

Idea: a Read-like trait that returns an owned buffer (like a buffer from the bytes crate). This sounds even more generic than BufRead / Read2 to me, because it handles the use cases where the buffer must be owned by the caller for later use. The trade-off is that it feels less zero-cost to the user, because there'll be an additional layer of refcounting; in exchange, true zero-copy becomes available for the additional use cases, like a kernel ring buffer.

FWIW, this is the approach taken for tokio's Streams: passing around owned buffers of the data.
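Something like this, assuming the bytes crate (the trait name and method are made up for illustration):

use bytes::Bytes;
use std::io;

pub trait OwnedRead {
    /// Return the next chunk as a refcounted, cheaply cloneable buffer that the
    /// caller may hold onto for as long as it likes, or None at end of stream.
    fn read_owned(&mut self) -> io::Result<Option<Bytes>>;
}

The refcount is the cost mentioned above: every chunk carries shared-ownership bookkeeping, even for callers that would have been happy with a plain borrow.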

1 Like

(Also, I'm told that if this can't make it into libstd, and it likely can't until there is at least one widely-used implementation, then it could maybe make it into mio.)

As promised, I've spun up a new thread for further discussion of the IO issues here (Towards a more perfect RustIO) to provide a place to discuss the "Rust IO" issue. Further discussion on this thread should be limited to other concerns (if any).

2 Likes

I request that you be a bit more careful with your words.
I now understand that you meant it as sympathetic horror, that he had to reinvent so many "wheels" himself; but on my first three readings I was sputtering that you dared insult one of the most advanced, stable, and high-quality Rust codebases out there.

@burntsushi has put a tremendous amount of thinking into his work, and has gotten literally world-record speeds out of it. There is a reason Visual Studio Code adopted it to power their search.

All your criticisms seem to boil down to "why is this not in std?!?", and Rust's answer is: "so that brilliant people like @burntsushi can experiment inside crates, find the perfect API, and then we'll take it into std" (caveat: better yet, keep it as an easily included crate; after all, "std is where libraries go to die" in other languages).

The reason these crates are not there yet is that we're still building them. Rust's advancement stage might be at "level 4", but age-wise it is still around Java 1.1 levels... there's only so much foundation you can build in three years with a mostly volunteer community.

I'm now taking a breather for a while; I find that I am physically angry from reading this...


Updated to add: I see that in the past 24 hours (posts I hadn't gotten to yet when writing the above), this thread has taken a tremendously productive turn. I'm happily surprised! Thank you @peter_bertok and @gbutler for turning this heated discussion in a productive direction! :heart:

14 Likes

Thanks for your flattering words! I know you didn't mean it that way, but a small note of clarity: no code should be held sacred or unassailable, especially code that others gravitate toward learning from. There are parts of ripgrep I'm proud of and parts that I'm not so proud of. Small iterative improvement is the name of the game. :slight_smile:

And yes, I am glad this conversation took a productive turn. It was frustrating at times!

13 Likes

Ok, I’ll bite - which parts are you not proud of? :slight_smile:

Humor aside, I do think it’s valuable to have Rust codebases that one can point people at when they ask “can you show me an example of good/canonical/modern/well-structured/etc Rust code”. Ripgrep would be one candidate that certainly crosses my mind.

3 Likes

Quite a bit!

  • It is hard to reason about the interaction between flags. There are a lot of flags, and every time a new one is added, it's almost guaranteed that there is some interaction that wasn't anticipated.
  • This one came up in the conversation above: there are two different implementations of search in ripgrep, where one operates over buffers and one operates over a single big slice. The latter doesn't have some features implemented, such as context handling. So one consequence of this is that any time you request contexts, ripgrep will never be able to use memory maps. This is a failing on my part to write a more generic search routine and this is fixed in my in-progress work on libripgrep.
  • While the UTF-16 transcoder wasn't a lot of work on my part, I really think it should live in a separate crate since it is generically useful.
  • The parallel directory iterator API is baroque.
  • The printer has become very complex. This is partially due to the fact that this is primarily where all the interactions between flags are resolved. I'm working on a rewrite of the printer in libripgrep, and there are some marginal improvements, but there is still case analysis strewn about everywhere that I know is going to make the code hard to read. I don't have any good ideas here unfortunately.
  • ripgrep's src/main.rs is a bit too busy for my taste. There's a lot going on and very little meaningful abstraction happening.
  • The ignore crate's directory traversal code is quite complex and also very busy.

4 Likes

Forgive my ignorance, but is this mostly a result of some flags being basic primitives (i.e. bool, usize, etc.)? I’m sure you’ve thought this through, but I’ll ask anyway: is there room to turn them into higher-level types and attach some behavior to them that they can apply to the output? Or would that create its own abstraction spaghetti?

Is there a particular method/fn that demonstrates the issue you’re describing?

1 Like

Dunno. I mean, the technique you're describing is one I've employed many times over the years, so I'm not unfamiliar with it. :slight_smile: To me, the complexity arises from the permutations of all the possible flags. There are probably competing (but perhaps not fundamental) pressures between reducing code duplication, abstraction soup (as you say), and performance (printing must be fast).

What I'm describing is emergent complexity in the interaction between different features. As such, there is no one bit of code I can just point to and say "yeah, clean this up." It is spread out all throughout ripgrep. The printer just happens to be one particularly highly concentrated part of it: https://github.com/BurntSushi/ripgrep/blob/master/src/printer.rs --- But it's not the only part. The part where CLI arguments are turned into a slightly higher-level set of structured config knobs also exhibits this: https://github.com/BurntSushi/ripgrep/blob/master/src/args.rs

The reason why I'm not proud of this is partially because I can use my eyes to look at this code and say "that's inscrutable." But it's also partially because it has empirical grounding. There are bugs and/or feature requests that exist due to the unexpected interaction between flags. It would take a while for me to collate those issues, but it does seem like a worthwhile dimension to track, so if I think of it, I will start tagging issues with a new label that indicates that I believe it's a consequence of this emergent complexity. Perhaps a solution will emerge once I have more data!

5 Likes

Just to add my 2 cents: it should be noted that the reason many are ecstatic about the Rust lang project is that it's the first real contender for taking the C/C++ king-of-the-hill crown. None of the other, more "elegant" languages are in the same league as Rust because, fundamentally, Rust targets the bare metal.

While many would like to see Rust used for web dev, there is a large need for the electrical engineering community to have a C/C++ replacement for DSP/embedded/IoT/smart toys. Thus, while Rust should be as elegant and easy to learn as possible, the dream is that Rust will power the compilers and interpreters of all those other 'elegant' languages for the next 30 years. That in itself sets Rust miles apart from most other language projects out there.

13 Likes

What are the "more elegant languages" you reference here?

I think he was thinking about the many modern programming languages which essentially cannot be implemented without a thick runtime (garbage collection, run-time compilation...).

Such a runtime allows programming languages to be much easier to use for common application development, but it is also a large price to pay in terms of memory requirements, achievable performance, and predictability. This is why languages with such runtime requirements are typically not used in scenarios such as low-level OS layers or foundational numeric libraries.

2 Likes

What are the “more elegant languages” you reference here?

Like Python. If Python could do concurrency, run on the metal, and reach C-like speeds, I would never need to use Rust. But the very constructs that allow those things make Rust significantly harder to learn than Python. The way Rust implements these things is admirable, and it makes any attempt at a comparison to Python rather silly, since Python lacks them completely.

Several software devs have written negative things about Rust on the web because they clearly have never used C for DSP and embedded, or needed to code for high-performance computing.

3 Likes