Rust beginner notes & questions

Having a flip through the memmap-rs crate shows that in this case it's technically correct to return Result<usize> based on the description of get_len(), but this is confusing for the user.

Files can be bigger than u32::MAX, and it's always been possible to memory-map files bigger than 4 GB on 32-bit platforms, exactly the same way it's been possible to read files of arbitrary size using streaming APIs since forever.

The "mapped window size" and the "file size" are distinct concepts; only the former is limited to isize (not usize!). The memmap-rs crate conflates the two in several places. Similarly, it uses the wrong integer type for the "offset into the file":

https://github.com/danburkert/memmap-rs/blob/b8b5411fcac76560a9dccafd512fa12044dd64b3/src/lib.rs#L89-L92

The offset should always be u64, even on 32-bit platforms. For example MapViewOfFileEx takes dwFileOffsetHigh and dwFileOffsetLow parameters so that 64-bit offsets can be passed in using two 32-bit DWORD parameters. I think this API has been there since... NT 3.11 or NT 4. I dunno. A long time, certainly.
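To make the high/low split concrete, here is a minimal sketch of how a 64-bit offset can be carried in two 32-bit halves. The helper names `split_offset` and `join_offset` are made up for illustration; they are not part of memmap-rs or the Win32 API.

```rust
/// Splits a 64-bit file offset into (high, low) 32-bit halves, the shape an
/// API with dwFileOffsetHigh / dwFileOffsetLow parameters expects.
/// (Hypothetical helper, for illustration only.)
fn split_offset(offset: u64) -> (u32, u32) {
    let high = (offset >> 32) as u32;
    let low = (offset & 0xFFFF_FFFF) as u32;
    (high, low)
}

/// Reassembles the two halves; the inverse of `split_offset`.
fn join_offset(high: u32, low: u32) -> u64 {
    (u64::from(high) << 32) | u64::from(low)
}
```

Because the offset stays a u64 all the way to the call site, the same code works unchanged on 32-bit and 64-bit targets.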

Submitting an issue for memmap-rs now...

5 Likes

Wouldn't it be more correct to say that you can mmap a 4GB window within a file larger than 4GB on the 32-bit platform? You can't actually map the whole file contiguously, right?

2 Likes

Yes.

You can map up to a 2 GB view (actually several!) into a file of any size on Win32. The upper half of the address space is reserved for the kernel and cannot be used for mapping, even with the /3GB flag (apparently).

I suspect Linux 32-bit has similar behaviour, but don't quote me on that.
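A rough sketch of the "several views into one big file" idea, with the file offset kept as u64 and only the view length as usize. `Window` and `windows` are hypothetical names, and no real mapping call is made here; it is assumed that `max_window` is a multiple of whatever mapping granularity the platform requires.

```rust
/// One view into a file: where it starts and how many bytes it covers.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
struct Window {
    offset: u64, // position in the file: always 64-bit
    len: usize,  // size of the mapped view: bounded by the address space
}

/// Yields consecutive windows covering `file_len` bytes, each at most
/// `max_window` bytes long.
fn windows(file_len: u64, max_window: usize) -> impl Iterator<Item = Window> {
    assert!(max_window > 0, "max_window must be non-zero");
    let mut offset = 0u64;
    std::iter::from_fn(move || {
        if offset >= file_len {
            return None;
        }
        let remaining = file_len - offset;
        // Capped by max_window, so the length always fits in usize.
        let len = remaining.min(max_window as u64) as usize;
        let window = Window { offset, len };
        offset += len as u64;
        Some(window)
    })
}
```

A 6 GiB file walked with 1 GiB windows gives six views, even where usize is only 32 bits wide.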

There's a diagram on this page: Memory-Mapped Files | Microsoft Docs

To quote the KB article:

"Multiple views may also be necessary if the file is greater than the size of the application’s logical memory space available for memory mapping (2 GB on a 32-bit computer)."

2 Likes

I feel compelled to add that this was a thing in the 90's for Sun boxes, and was used to hide latency and coalesce IOPS to disk.

1 Like

I suspect that was a battery-backed RAID controller cache, which is pretty common in all modern servers or SAN disk arrays.

Compared to that, the future looks amazing: Intel Launches Optane DIMMs Up To 512GB: Apache Pass Is Here!

This is just the beginning! Pretty soon you'll be seeing practically everyone running databases with 100% of the I/O coming directly from these. Compared to that, NVMe flash storage at a "mere" 3500 MB/s looks glacially slow!!

Somewhat ironically, the great work that has been done by the various C#, Java, JavaScript, and Rust core developers bringing asynchronous I/O programming to the masses is now hampering performance in this shiny new world of non-volatile memory. The overhead of even the most efficient futures-based async I/O is simply massive in comparison to the latency of NVDIMMs, which are byte-addressable and are only 2-3x slower than normal DIMMs. When your storage latency is measured in nanoseconds, individual instructions matter and user-to-kernel transitions are murder...

Also see: https://www.snia.org/sites/default/orig/SDC2013/presentations/GeneralSession/AndyRudoff_Impact_NVM.pdf

2 Likes

Intel have made great claims about Optane before, and delivered in the end something which barely matches a modern SSD (and therefore does not justify the vendor lock-in or the software performance pitfall of adding yet another slower storage device to the virtual address space). I will wait for a more impressive product before calling this the future of storage.

Moreover, asynchronous I/O is mostly used for network I/O, not disk I/O. Because for disk, we pay the price of yet another stupid legacy decision, namely the Linux kernel devs declaring that you do not need asynchronous storage I/O as the disk cache will take care of everything for you. Tell that to performance engineers who have to debug latency spikes caused by software randomly blocking, now that every pointer dereference can turn into a disk access...

4 Likes

A slice is a window of usize indexable range. The actual range may be less, i.e. isize on 32-bit Windows, but that’s beside the point.

3 Likes

I have never, not once, claimed to solve all problems related to memory maps.

4 Likes

Now that I've toyed around with my pseudo-code of what it might look like (Read2), found real issues (the avoidable mmap-related error in ripgrep), and realised that zero-copy is the future (NVDIMM), I feel like something akin to System.IO.Pipelines is necessary.

I'm not at all saying that it'll look exactly like my Read2 sample, that's just me doodling. A realistic example would probably have 3 to 4 low-level traits in some sort of hierarchy. It would obviously have to handle both read and write. It would probably need some restrictions on the types it can handle, likely Copy or Clone. I quickly discovered that the borrow checker makes this kind of API design... hard. Really hard. Someone with more Rust skill than me would have to put serious effort in.

The best chance this has is if the current futures and tokio refactoring effort embraces this I/O model and it eventually replaces legacy Rust I/O. Paging @carllerche and @anon15139276 -- you guys might be interested in this thread...

Both. 8(

Missing trivial things like hex-encoding in a standard library? In 2018? Too thin. A standard library that leans heavily on macros? Will never be properly IDE-friendly. The current direction is "Macros 2.0". People want more of this. Some of the same people that complain about compile times, I bet. Sigh...

I'm saying Rust has most of the language features needed to skip over the meandering path taken by other languages, a running-start if you will. But huge chunks of it seem to be starting at step #1 and repeating the same mistakes, accumulating the same baggage along the way.

At some point someone will say: "This Rust language is too complicated and messy, I've got this great idea for a far simpler language!" and the wheel will turn one more time. 8)

My $0.02 is that the pace of language development is accelerating. I think Rust tried to "stabilize" too early, and instead should have embraced true "versioning". I would love to see a language that has a file extension such as .rs1 and then .rs2 the year after, dropping source-level compatibility on the floor. The PowerShell guys nearly did this, and now they're pretending that PS6 Core on Linux is 100% compatible with .ps1 files written for Windows 2008 Server! 8)

I'm not sure about the latest C++ stuff, I've avoided C++ for a decade because of productivity-killing issues like the linker exploding in my face with gibberish errors because of external libraries being incompatible with each other.

For another real world example of zero-copy in the wild, refer to Java's NIO library. Note that just like how I mentioned that my sample Read2 trait would likely need a variety of implementations for different styles of buffering, the Java guys have HeapByteBuffer, DirectByteBuffer, and MappedByteBuffer. A lot of people assumed that this style of I/O implies dynamically growing heap allocation only, but that's not a constraint at all.

No, it was host cpu memory, though it was mapped in the kernel with a device driver layer that used it exactly like RAID controllers would later.

My apologies, I misunderstood. You did claim though that BOM-detection was something you've solved without much difficulty. That may be true, but it's an especially simple case in a larger class of problems, including things like smoothly handling memory-mapped I/O and then layering things like BOM-detection on top and also doing so safely on 32-bit platforms.

Or even 64-bit platforms, actually, as I was pointing out above: https://github.com/BurntSushi/ripgrep/issues/922

I'd also like to apologise for picking on ripgrep, in no way is this reflective of the quality of your coding. I'd rather that the situation be that people with less skill than you be able to write Rust code that is fast and correct. This takes special care and attention in the std library to "steer people into the pit of success" instead of the "pit of failure". Right now I suspect a lot of people are falling into pits full of stabby, pointy things...

You did, and you're far better at Rust than I am. What chance do I have?

Yeah, and? I have that goal as well! (But to be clear, I don't like your phrasing, but I get your drift.) This very thing is a large part of why I go out and library-ize a lot of the code I write. Some day, maybe those shims will become library-ized and others will be able to do what ripgrep does without needing to rewrite those shims. I see no reason why this cannot be done. I see no reason why the Read trait isn't composable enough to build that kind of infrastructure. I see no reason why memory maps couldn't be made to work correctly on 32 bit systems if there is demand for it.

Like, we get it. You're unhappy because your favored abstraction hasn't been built in Rust yet and library-ized. You've made your point. The next step is (for someone) to prototype it and solve a real problem that folks care about with it. This navel gazing is pointless IMO. I tried to tell you this dozens of comments ago.

11 Likes

I may be missing a subtle point, but to me the proposed Read2 trait:

trait Read2  {
    type Data; //  = u8; // with associated type defaults.
    type Error; // = (); // with associated type defaults.

    /// Returns at least 'items', which can be 0 for best-effort.
    fn peek(&mut self, items: usize ) -> Result<&[Self::Data],Self::Error>;

    /// Can consume any number of items, acting much like `skip()`.
    fn consume(&mut self, items: usize ) -> Result<(), Self::Error>;
}

looks awfully similar to the std::io::BufRead trait:

pub trait BufRead {
    /// Fills the internal buffer of this object, returning the buffer contents.
    ///
    /// This function is a lower-level call. It needs to be paired with the consume method to function properly.
    /// When calling this method, none of the contents will be "read" in the sense that later calling read may
    /// return the same contents. As such, consume must be called with the number of bytes that are
    /// consumed from this buffer to ensure that the bytes are never returned twice.
    fn fill_buf(&mut self) -> Result<&[u8]>;

    /// Tells this buffer that `amt` bytes have been consumed from the buffer,
    /// so they should no longer be returned in calls to `read`.
    fn consume(&mut self, amt: usize);
    ...
}

edit: to elaborate a little more, I'd say imo Read and BufRead are two approaches to the IO API which mainly differ on who owns the buffer, the caller (for Read) or the IO construct (for BufRead).

Depending on the underlying OS APIs, one or the other may be the cheapest (after all, implementing BufRead directly for an mmapped file rather than on top of Read makes a lot of sense). But neither is intrinsically more general or always better than the other, and both are likely worth having and keeping.
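To illustrate the "who owns the buffer" distinction, here is a small sketch doing the same job both ways with the standard traits. The function names are made up for the example, and the 4096-byte buffer is an arbitrary guess, which is rather the point.

```rust
use std::io::{self, BufRead, Read};

/// Caller-owned buffer: we pick the size and `read` copies data into it.
fn sum_with_read<R: Read>(mut src: R) -> io::Result<u64> {
    let mut buf = [0u8; 4096]; // the size is the caller's guess
    let mut total = 0u64;
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            return Ok(total); // end of input
        }
        total += buf[..n].iter().map(|&b| u64::from(b)).sum::<u64>();
    }
}

/// Source-owned buffer: `fill_buf` lends us whatever the source has ready,
/// and `consume` tells it how much of that we used.
fn sum_with_bufread<R: BufRead>(mut src: R) -> io::Result<u64> {
    let mut total = 0u64;
    loop {
        let chunk = src.fill_buf()?;
        if chunk.is_empty() {
            return Ok(total); // end of input
        }
        total += chunk.iter().map(|&b| u64::from(b)).sum::<u64>();
        let used = chunk.len();
        src.consume(used);
    }
}
```

Both functions accept a plain `&[u8]`, since slices implement `Read` and `BufRead`; in the second case the "buffer" is simply the slice itself, with no copy.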

4 Likes

Ah, okay, that sounds like a substantially different complaint than the one about streams/pipes. So are you no longer concerned that the existence of a stream API is fundamentally a mistake? (I understand you still prefer the "pipe-like" API and think it's a fatal flaw for a language not to support it in the standard library, but I don't see a connection between this and wishing that streams didn't exist.)

I've heard the argument that macros increase compile time, but I'm not sure if it's supported by data, and I'd love to see a fuller analysis. rustc provides -Ztime-passes for basic profiling, and as far as I've seen (e.g. in this post), macro expansion is generally not a major time-sink; the general consensus seems to be that codegen is the biggest contributor to slow compiles. That said, perhaps macro usage is a major contributor to the amount of code being generated.

As for IDE-friendliness, you may be right. In principle I can't think of any reason why the Rust Language Server (RLS) or another IDE-backend couldn't provide pretty robust support for macros, since they are better integrated into the language proper than preprocessor-based macros. But as it stands, it appears that RLS has some major struggles with macros, though the IntelliJ plugin has apparently made good progress here.

1 Like

Surely you don't think that's a bad thing! I'm sure even Rust's most vocal proponents are aware that no one language (at least not this early in the history of computing) will become the "one true language" (even for a given design space) indefinitely into the future.

That's...a really fascinating idea, but one that, at first glance, seems to have only downsides. Backwards compatibility is certainly a pain, but it appears to be a sine qua non for industry adoption. When you say "dropping source-level compatibility", do you mean that some kind of lower-level compatibility (e.g. with FFI between different language versions) should still be maintained?

That post seems to reaffirm that the mmap approach wouldn't really be preferable to the stream approach unless the data in question is already in-memory on a fast hardware device. Since that's not something that can generally be assumed, I still don't see the existence of memory mapping as a good reason to avoid stream-based IO by default, especially in the case of networked or distributed systems (since data over IP by definition isn't already in memory). Am I missing something, or is that not the argument you're trying to make?

I would argue that BufRead/Read2 approach is always better, but this takes some insight into API design. An incredibly common "learning curve" I see goes like this:

Q: I did this I/O code! It's slow! Can someone help please?
A: You're using Read, but you're doing too many kernel calls because you're consuming a few bytes at a time. You should use BufReader. This "lesson" is right there in the doco in the first example.
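A small sketch of that lesson, using a wrapper that counts `read` calls as a stand-in for kernel transitions. `CountingReader`, `read_byte_by_byte`, and `compare_call_counts` are hypothetical names made up for this example.

```rust
use std::cell::Cell;
use std::io::{self, BufReader, Read};
use std::rc::Rc;

/// Wraps a reader and counts calls to `read`, standing in for kernel calls.
struct CountingReader<R> {
    inner: R,
    calls: Rc<Cell<usize>>,
}

impl<R: Read> Read for CountingReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.calls.set(self.calls.get() + 1);
        self.inner.read(buf)
    }
}

/// Consumes `src` one byte at a time and returns how many bytes were seen.
fn read_byte_by_byte<R: Read>(mut src: R) -> io::Result<usize> {
    let mut byte = [0u8; 1];
    let mut seen = 0;
    while src.read(&mut byte)? == 1 {
        seen += 1;
    }
    Ok(seen)
}

/// Returns (calls without buffering, calls with BufReader) for `len` bytes.
fn compare_call_counts(len: usize) -> io::Result<(usize, usize)> {
    let data = vec![0u8; len];

    let raw_calls = Rc::new(Cell::new(0));
    let raw = CountingReader { inner: &data[..], calls: Rc::clone(&raw_calls) };
    read_byte_by_byte(raw)?;

    let buffered_calls = Rc::new(Cell::new(0));
    let counted = CountingReader { inner: &data[..], calls: Rc::clone(&buffered_calls) };
    // BufReader pulls large chunks from `counted` and serves bytes from memory.
    read_byte_by_byte(BufReader::new(counted))?;

    Ok((raw_calls.get(), buffered_calls.get()))
}
```

For 10,000 bytes the unbuffered path makes one underlying call per byte plus one for end-of-input, while the buffered path needs only a handful.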

This was the underlying root cause of poor performance of the pre-Firefox Mozilla web browser for years. Practically all I/O it did was line-by-line, unbuffered, and flushed between lines when writing text files like the JavaScript profile contents. Never mind that terrible performance, this was downright dangerous. The Mozilla suite also handled POP3/IMAP email, and tens of thousands of people -- including me -- lost all of their mail data because the Mozilla suite would shred their profile on exit by cutting files in half. Oops. I submitted a bug ticket, which I discovered was one of dozens and was promptly ignored along with everyone else for a decade. Thousands of desperate people had submitted comments along the lines of "Please help! I've lost all my emails!!!".

The Mozilla team basically ignored this because it was just too hard to dig through all of the I/O code littered throughout the codebase and carefully ensure that all of it is appropriately buffered, flushed only at safe transaction points, and file replace operations were appropriately atomic. Firefox thankfully now uses SQLite for most I/O which doesn't have these issues.

So what's the true root cause here? The issue is that treating the "Read a buffer / Write a buffer" API as a simple common denominator is a trap. It's a pit full of pointy spikes. It's the C/C++ approach. Professionals fall into it all of the time. The entire Mozilla team did. It's not efficient for reads, it's dangerous for writes, and it doesn't scale to even user-mode applications like Firefox, let alone high-performance servers. Like I was saying in this thread, it can't even handle memory-mapped file I/O, which dates back to at least the 90s.

What the "API user" actually wants from any I/O is typically: "Give me as much data as is efficiently available right now, and I'll see how much I can consume, most likely all of it. Don't stop reading just because I'm processing data."

Read doesn't do this. You provide a pre-constrained buffer of some fixed size for each call. You have to guess at what is a good size for this. Your guess will be wrong. If you make this buffer too big, then the inherent copy in the API will blow through your L1/L2 CPU caches and your performance will be bad. If you ask for too little, then you will spam the kernel with transitions and your performance will be terrible. If you try to layer things on top of each other (ZipStream on ChunkStream on CryptoStream) then you will have an absolute nightmare holding onto bytes not consumed by the various layers as they reach the end of their roles. As the consumer of this API, everything you do is difficult and likely to be bad.

There is no scenario, ever, where Read is truly easier for the API user. The single call vs the 2 calls may seem like a "lighter weight" API, but this is just going to lead to poor performance, unnecessary copies, cache thrashing, and even lost data and crying users. Always. Every time. Everywhere. To the point that the Mozilla guys failed to fix the bug for a decade.

Sure, it's possible that superhuman developers will not fall into this trap. I admit I fell into this trap at least a few times when I was a junior developer. I bet everyone reading this forum did at one point or another.

Meanwhile, the BufRead/Read2 style of API design allows the system with the knowledge -- the platform I/O library -- to make the judgement call of the best buffer size. The user can provide a minimum and allow the platform to provide that plus a best-effort extra on top. The best effort can dynamically grow to be the entire file if mmap is available. Or... most of the file if mmap is available and the platform is 32-bit. The API user can then wrap this in something consuming the input byte-by-byte such as a decompressor and not have to worry about the number of kernel calls. Similarly, the default non-tokio version can still use async I/O behind the scenes without the consumer being forced to use an async API themselves. It all just... works by default, as long as it is the default.
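As a sketch of the "minimum plus best-effort extra" behaviour, here is the Read2 doodle from earlier in the thread implemented for an in-memory byte slice, standing in for a mapped region. `SliceSource`, `SliceError`, and `count_newlines` are hypothetical names, and this is only one of many ways such a trait could be filled in.

```rust
/// The Read2 trait as doodled earlier in the thread.
trait Read2 {
    type Data;
    type Error;
    /// Returns at least `items`, which can be 0 for best-effort.
    fn peek(&mut self, items: usize) -> Result<&[Self::Data], Self::Error>;
    /// Can consume any number of items, acting much like `skip()`.
    fn consume(&mut self, items: usize) -> Result<(), Self::Error>;
}

#[derive(Debug, PartialEq)]
enum SliceError {
    UnexpectedEof,
}

/// Hypothetical in-memory source, standing in for an mmap'd region.
struct SliceSource<'a> {
    data: &'a [u8],
    pos: usize,
}

impl<'a> Read2 for SliceSource<'a> {
    type Data = u8;
    type Error = SliceError;

    fn peek(&mut self, items: usize) -> Result<&[Self::Data], Self::Error> {
        // Best effort: hand out everything that is left, with no copy.
        let rest = &self.data[self.pos..];
        if rest.len() < items {
            Err(SliceError::UnexpectedEof)
        } else {
            Ok(rest)
        }
    }

    fn consume(&mut self, items: usize) -> Result<(), Self::Error> {
        if items > self.data.len() - self.pos {
            return Err(SliceError::UnexpectedEof);
        }
        self.pos += items;
        Ok(())
    }
}

/// A consumer that never chooses a buffer size: it takes what it is given.
fn count_newlines<S: Read2<Data = u8>>(src: &mut S) -> Result<usize, S::Error> {
    let mut count = 0;
    loop {
        let chunk = src.peek(0)?;
        if chunk.is_empty() {
            return Ok(count);
        }
        count += chunk.iter().filter(|&&b| b == b'\n').count();
        let used = chunk.len();
        src.consume(used)?;
    }
}
```

Here `peek(0)` hands back the whole remaining slice in one go; a socket-backed implementation could return much less, and `count_newlines` would not need to change.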

3 Likes

In my experience, a claim like this is rarely true. So when I read it, I tend to discount the other things you say.

1 Like

I've made a career fixing issues exactly like this, so I dunno... it may be hyperbole, but it's effective. 8)

I've seen - repeatedly - clusters of servers worth millions running like slow molasses because someone used a too-small network buffer throughout the codebase. It was the default in C++. It's going to be the default in Rust. It's going to be slow too. It's not complicated.

Firstly, the BufRead or Read2 traits are 100% stream-based I/O. The difference between those and Read is only that the buffer is "handed to the user" instead of being "passed in" AND that the source position is not forcibly advanced after the buffer is available to consume. This gives the API designers more flexibility if this is the default, and the user code is more elegant and more efficient by default.

There are a couple of interacting / composable use-cases here, with mmap just being one. Picture a scenario where there are many libraries, some third-party, some low-level, some simple, some complex. Ideally you'd want to be able to compose these:

  • With a minimum of fuss.
  • With a minimum of unnecessary copies.

So the traditional Read model seems fine, but let's take the simplest scenario you're ever likely to face: copying. You just want to copy a file to a file, or a file to a socket, or a socket to a file. Something simple like that.

Now in the naive model, the timeline would look something like:

let buf = ...; // typically ~64KB or whatever
...
src.read(buf); // obviously this is a loop, but pretend it's unrolled.
dst.write(buf);
src.read(buf);
dst.write(buf);

At no point is your program simultaneously reading and writing.
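For reference, a runnable version of that naive timeline might look like the sketch below. `copy_with_fixed_buffer` is a made-up name, and the buffer size is left as a parameter precisely because the caller has to guess it.

```rust
use std::io::{self, Read, Write};

/// Copies everything from `src` to `dst` through one caller-sized buffer.
/// Each iteration first waits for a read, then waits for a write; the two
/// never overlap.
fn copy_with_fixed_buffer<R: Read, W: Write>(
    src: &mut R,
    dst: &mut W,
    buf_size: usize,
) -> io::Result<u64> {
    assert!(buf_size > 0, "a zero-sized buffer can never make progress");
    let mut buf = vec![0u8; buf_size];
    let mut copied = 0u64;
    loop {
        let n = src.read(&mut buf)?; // blocks until some data arrives
        if n == 0 {
            return Ok(copied); // end of input
        }
        dst.write_all(&buf[..n])?; // blocks until the data is accepted
        copied += n as u64;
    }
}
```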

Secondly, if the incoming data is a socket, then the src stream behind the scenes needs a second buffer for the network device to write to while you're writing to dst, otherwise the incoming packet data would be dropped on the floor during this time. So really, there's a device -> kernel and then a second kernel -> user copy. While you're receiving that second copy from the kernel, you aren't sending anything to the destination. Which is also a user -> kernel copy. We're up to 3 copies already, but logically we only wanted to do 1 copy. Ugh.

There are all sorts of other complexities at play here as well that are commonly overlooked. For example, if the user-provided buffer is too small -- I've seen 512 bytes in legacy code -- the kernel transitions will get you down as low as a few hundred KB/sec no matter what the hardware. Ouch.

If the buffer is too big then you still have problems:

  • It'll blow through your CPU caches and hit main memory. ~8KB is pushing it for L1 cache, 32KB for L2 cache, and a few MB for L3 cache. Sometimes this helps performance. Sometimes it doesn't. Your naive copy code has no chance of determining what to do reliably.
  • While you're synchronously waiting for the dst.write() to clear, the NIC can't arbitrarily fill up the kernel with junk you're not consuming. It eventually will start dropping data on the floor, typically around 64KB for older operating systems and up to 2-16MB on modern ones. Now you're getting retransmits and other fun inefficiencies. Your throughput graphs start to look like the teeth of a comb, and everybody is scratching their heads wondering why the app can't get "full throughput" on modern equipment. People usually start going through Wireshark traces at this point and blaming the network team down the corridor. That rarely helps.
  • If this is, say, a web server handling thousands of simultaneous connections, you could exhaust main memory or create a massive backlog of IOPS on the destination drive, causing all sorts of grief. How is the "simple" copy code to know that it needs to reduce its buffer size due to server-wide contention? What if the buffer size is a constant six layers deep in some "ftp-rs" lib or something?
  • If this is a socket-to-socket copy, you're just contributing to buffer bloat, and now the aforementioned network engineers hate you with a burning passion.

None of this is news; everybody knows that this is a problem, and that's why asynchronous I/O libraries like tokio are a thing. Well... okay, not everybody, the classic GNU cp command-line tool uses wonky buffer sizes sometimes, does synchronous copies, and had a number of decades-old bugs that only got fixed recently. They're clearly noobs though. Weeeellll.. okay, not just them, Microsoft messed up too back in the day with Windows Vista, which had notoriously slow file copies because they got the buffer sizes wrong. Oops.

Anyway, on to BufRead/Read2: by default, the timeline would pretty much look like the above, with only an extra consume() call, which is not much more than incrementing an integer:

let buf = src.read(0); // we *get* a buffer instead of passing one in...
dst.write(buf); // ideally, this ought to be a 'move' so we lose control of the buffer.
src.consume(buf.len()); // just copy as much as we can, fast as we can...
let buf = src.read(0); // The magic: nobody said this is the SAME buffer!!!
dst.write(buf);  // Now we're just passing buffers like a bucket brigade...
src.consume(buf.len());

It looks similar enough, right? Not exactly rocket-science for the end user. It's not even 1 extra call, because we got to skip the buffer allocation at the beginning. However, behind the scenes, nothing stops the implementation of the src and dst streams doing all sorts of magic.
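The closest thing in today's standard library is BufRead, so here is a sketch of the same loop using `fill_buf` and `consume` in place of the hypothetical `src.read(0)`. Note the differences from the doodle: the chunk is only borrowed, not moved, and `copy_borrowed_chunks` is a made-up name.

```rust
use std::io::{self, BufRead, Write};

/// Copies everything from `src` to `dst` without choosing a buffer size:
/// the source lends us whatever it has ready.
fn copy_borrowed_chunks<R: BufRead, W: Write>(src: &mut R, dst: &mut W) -> io::Result<u64> {
    let mut copied = 0u64;
    loop {
        let chunk = src.fill_buf()?; // we *get* a buffer instead of passing one in
        if chunk.is_empty() {
            return Ok(copied); // end of input
        }
        dst.write_all(chunk)?;
        let used = chunk.len();
        src.consume(used); // little more than bumping a position
        copied += used as u64;
    }
}
```

When `src` is a byte slice, the first `fill_buf` returns the entire input and the loop body runs once; when it is a `BufReader` over a file or socket, the chunk size is whatever that reader was configured with.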

For example, imagine that src is a user-mode network library socket. Still a "socket", still a synchronous "stream", no fancy tokio asynchronous stuff in the picture at all.

The way these work is that when you open the src socket stream, the library allocates a pool of buffers, say ten 64KB buffers in your process. As data comes in, the NIC directly writes the data into the user process memory space, bypassing the internal kernel buffering and copying. When you request a buffer, it "peels off" a filled one and you get to consume it. Meanwhile, the network adapter continues asynchronously writing to the remaining buffers in the background while you're busy sending data to dst. You get the buffer size that the network driver likes. It can increase the number of buffers for you if the driver thinks this is a good idea. The buffer can be shared process-wide to prevent memory exhaustion.

But wait a minute.. Rust has "move semantics" by default! If the API is designed the right way, then the default write operation will move the buffer, taking ownership of it. At this point, it is potentially free to asynchronously perform the write behind the scenes -- until you call flush() or whatever.
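One way this could look, purely as a sketch: a writer that takes each buffer by value and hands it to a worker thread, so the caller returns immediately and cannot touch the buffer again. `MovingWriter`, `write_owned`, and `finish` are hypothetical names; nothing like this exists in std::io, and a real design would need back-pressure rather than an unbounded queue.

```rust
use std::io::{self, Write};
use std::sync::mpsc::{self, Sender};
use std::thread::{self, JoinHandle};

/// Hypothetical writer that owns each buffer it is given and performs the
/// actual write on a background thread.
struct MovingWriter<W> {
    tx: Sender<Vec<u8>>,
    worker: JoinHandle<io::Result<W>>,
}

impl<W: Write + Send + 'static> MovingWriter<W> {
    fn new(mut dst: W) -> Self {
        let (tx, rx) = mpsc::channel::<Vec<u8>>();
        let worker = thread::spawn(move || -> io::Result<W> {
            // Runs until every sender has been dropped.
            for buf in rx {
                dst.write_all(&buf)?;
            }
            dst.flush()?;
            Ok(dst)
        });
        MovingWriter { tx, worker }
    }

    /// Takes the buffer by value: after this call the caller no longer has it.
    fn write_owned(&mut self, buf: Vec<u8>) -> io::Result<()> {
        self.tx
            .send(buf)
            .map_err(|_| io::Error::new(io::ErrorKind::BrokenPipe, "writer thread stopped"))
    }

    /// Waits for all queued writes to complete and returns the inner writer.
    fn finish(self) -> io::Result<W> {
        let MovingWriter { tx, worker } = self;
        drop(tx); // closing the channel lets the worker loop end
        worker
            .join()
            .map_err(|_| io::Error::new(io::ErrorKind::Other, "writer thread panicked"))?
    }
}
```

The copy loop that uses it stays synchronous in appearance: `w.write_owned(buf)?` in the loop, then `w.finish()?` where the thread talks about calling flush().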

In other words, the naive file-copy scenario could have tokio-rs level asynchronous magic going on... or not... depending on the specific implementations of the reader and writer traits. This would be relatively transparent to the read/write copy loop, which never even got to make a buffer-size decision, which it can't do properly anyway out-of-context.

Realistically the mmap case isn't a massively better scenario than traditional I/O, it just allows the kernel to share a memory space with the user-mode program, reducing the copies from 3 to 2 for a "copy," and from 2 to 1 for a "reading" use-case. That's not a massive win in practice. Realistically, what tends to happen is people skip over the fiddly "window sliding" code they should be writing, incorrectly assume that mmap == &[u8], then their code can be simpler, much faster, and wrong.

Generally in this thread the feedback from some people seems to be that they want Rust to provide "full control" and that this style of API design is somehow "taking control away", except that it's not. The application developer still gets 100% of the control over buffer allocation. After all, they get to choose which trait implementation they want to use. Plain old File coupled with a single-buffer reader? You get crappy performance, but hey, that may be fine for your use-case. Choose a memory-mapped file reader or a user-mode network socket? Enjoy your blazing-fast 0-copy performance without having to rewrite your reader code! Let the std::io default decide for you? Enjoy your automatic performance boost when people finally realise that NVDIMMs and RDMA are the new standard and Rust gets updated to match.

Some people are happy to hand-roll things themselves and sticky-tape a bunch of complex libraries together, which works great... until they need to decrypt a stream, decompress its contents, and parse that data. Some of these steps are forced to copy, and then, ah well, there's nothing you can do. However, some streaming libraries could benefit from being able to optionally pass through huge chunks as-is, but if you're composing streams from third-party library code then you might be doing 6 or 7 copies in the various Read layers before you are done. Think of HTTP with its embedded binary streams, or an automatic text decoder that detected UTF-8, or retrieving a binary BLOB via something like SqlDataReader.GetStream()...

PS: I hope you like tiny buffers, because std::io::copy uses 8 KB, set with a compile time constant:

https://github.com/rust-lang/rust/blob/e5277c1457d397f22ba18a1d40c1318729becbb4/src/libstd/io/util.rs#L47-L55

And tokio's version (nearly 100 lines just to copy data!) uses 2KB, also a compile-time constant:

https://github.com/tokio-rs/tokio/blob/f1a7caea3fb805c5b1b1fe1ba1551910e4a95911/tokio-io/src/io/copy.rs#L35-L48

3 Likes