Rust beginner notes & questions

I think the case you lay out for read2 over read is compelling, in the sense that an API like read2 needs to be created for Rust. It may even need to be pushed into the standard library eventually, with read reimplemented in terms of read2. That being said, I don't see why it isn't entirely appropriate to create and iterate on such an API on crates.io, and only worry about whether it belongs in the standard library once the kinks are worked out.

As has been pointed out, even C# with .NET Core is moving away from a monolithic standard library and towards a model similar to Cargo (NuGet).

You seem to have a lot of low-level yet enterprise-class experience that could be good for the Rust community. I don't think anyone here takes your insights lightly; there is just disagreement about the balance between what needs to be in the standard library and what needs to be developed and nurtured on crates.io.

8 Likes

I don't think this blindly assumes a file can't be greater than u32::MAX (equivalent to usize on 32-bit); rather, I think it says, "If I'm on a platform with a maximum directly addressable range of 32 bits, I can't mmap a file bigger than that." Granted, there are ways to do paging/windowing and such (as you've described), but this is simply a wrapper around the platform's mmap implementation, and that implementation can't map a region larger than the platform's usize.

Flipping through the memmap-rs crate shows that in this case it's technically correct to return Result<usize>, given the description of get_len(), but it's confusing for the user.

Files can be bigger than u32::MAX, and it's always been possible to memory-map files bigger than 4 GB on 32-bit platforms, exactly the same way it's been possible to read files of arbitrary size using streaming APIs since forever.

The "mapped window size" and the "file size" are distinct concepts; only the former is limited to isize (not usize!). The memmap-rs crate conflates the two in several places. Similarly, it uses the wrong integer type for the "offset into the file":

https://github.com/danburkert/memmap-rs/blob/b8b5411fcac76560a9dccafd512fa12044dd64b3/src/lib.rs#L89-L92

The offset should always be u64, even on 32-bit platforms. For example, MapViewOfFileEx takes dwFileOffsetHigh and dwFileOffsetLow parameters so that 64-bit offsets can be passed in as two 32-bit DWORD parameters. I think this API has been there since... NT 3.1 or NT 4. I dunno. A long time, certainly.
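To make the point concrete, here's a minimal sketch of why u64 offsets cost nothing even on 32-bit targets: a 64-bit offset splits cleanly into the two DWORD halves that the Win32 mapping APIs expect. The function name `split_offset` is mine, just for illustration.

```rust
// Sketch: splitting a 64-bit file offset into the two 32-bit DWORD
// halves that MapViewOfFile/MapViewOfFileEx expect. This works the
// same on 32-bit and 64-bit targets, which is why an mmap wrapper's
// offset parameter should be u64 rather than usize.
fn split_offset(offset: u64) -> (u32, u32) {
    let high = (offset >> 32) as u32; // dwFileOffsetHigh
    let low = offset as u32;          // dwFileOffsetLow
    (high, low)
}

fn main() {
    // An offset past 4 GB is representable even though it could
    // never fit in a 32-bit usize.
    let (high, low) = split_offset(5 * 1024 * 1024 * 1024); // 5 GB
    assert_eq!(high, 1);
    assert_eq!(low, 1024 * 1024 * 1024); // 1 GB past the 4 GB boundary
}
```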

Submitting an issue for memmap-rs now...

5 Likes

Wouldn't it be more correct to say that you can mmap a 4GB window within a file larger than 4GB on the 32-bit platform? You can't actually map the whole file contiguously, right?

2 Likes

Yes.

You can map up to a 2 GB view (actually several!) into a file of any size on Win32. The upper half of the address space is reserved for the kernel and cannot be used for mapping, even with the /3GB flag (apparently).

I suspect Linux 32-bit has similar behaviour, but don't quote me on that.

There's a diagram on this page: Memory-Mapped Files | Microsoft Docs

To quote the KB article:

"Multiple views may also be necessary if the file is greater than the size of the application’s logical memory space available for memory mapping (2 GB on a 32-bit computer)."

2 Likes

I feel compelled to add that this was a thing in the 90's for Sun boxes, and was used to hide latency and coalesce IOPS to disk.

1 Like

I suspect that was a battery-backed RAID controller cache, which is pretty common in all modern servers or SAN disk arrays.

Compared to that, the future looks amazing: Intel Launches Optane DIMMs Up To 512GB: Apache Pass Is Here!

This is just the beginning! Pretty soon you'll be seeing practically everyone running databases with 100% of the I/O coming directly from these. Compared to that, NVMe flash storage at a "mere" 3500 MB/s looks glacially slow!!

Somewhat ironically, the great work that has been done by the various C#, Java, JavaScript, and Rust core developers bringing asynchronous I/O programming to the masses is now hampering performance in this shiny new world of non-volatile memory. The overhead of even the most efficient futures-based async I/O is simply massive in comparison to the latency of NVDIMMs, which are byte-addressable and only 2-3x slower than normal DIMMs. When your storage latency is measured in nanoseconds, individual instructions matter and user-to-kernel transitions are murder...

Also see: https://www.snia.org/sites/default/orig/SDC2013/presentations/GeneralSession/AndyRudoff_Impact_NVM.pdf

2 Likes

Intel have made great claims about Optane before, and delivered in the end something which barely matches a modern SSD (and therefore does not justify the vendor lock-in or the software performance pitfall of adding yet another slower storage device to the virtual address space). I will wait for a more impressive product before calling this the future of storage.

Moreover, asynchronous I/O is mostly used for network I/O, not disk I/O. Because for disk, we pay the price of yet another stupid legacy decision, namely the Linux kernel devs declaring that you do not need asynchronous storage I/O as the disk cache will take care of everything for you. Tell that to performance engineers who have to debug latency spikes caused by software randomly blocking, now that every pointer dereference can turn into a disk access...

4 Likes

A slice is a window of at most usize indexable range. The actual range may be less, i.e., isize on 32-bit Windows, but that's beside the point.

3 Likes

I have never, not once, claimed to solve all problems related to memory maps.

4 Likes

Now that I've toyed around with my pseudo-code of what it might look like (Read2), found real issues (the avoidable mmap-related error in ripgrep), and realised that zero-copy is the future (NVDIMM), I feel like something akin to System.IO.Pipelines is necessary.

I'm not at all saying that it'll look exactly like my Read2 sample; that's just me doodling. A realistic example would probably have 3 to 4 low-level traits in some sort of hierarchy. It would obviously have to handle both reads and writes. It would probably need some restrictions on the types it can handle, likely Copy or Clone. I quickly discovered that the borrow checker makes this kind of API design... hard. Really hard. Someone with more Rust skill than me would have to put serious effort in.

The best chance this has is if the current futures and tokio refactoring effort embraces this I/O model and it eventually replaces legacy Rust I/O. Paging @carllerche and @anon15139276 -- you guys might be interested in this thread...

Both. 8(

Missing trivial things like hex-encoding in a standard library? In 2018? Too thin. A standard library that leans heavily on macros? Will never be properly IDE-friendly. The current direction is "Macros 2.0". People want more of this. Some of the same people that complain about compile times, I bet. Sigh...

I'm saying Rust has most of the language features needed to skip over the meandering path taken by other languages, a running start, if you will. But huge chunks of it seem to be starting at step #1 and repeating the same mistakes, accumulating the same baggage along the way.

At some point someone will say: "This Rust language is too complicated and messy, I've got this great idea for a far simpler language!" and the wheel will turn one more time. 8)

My $0.02 is that the pace of language development is accelerating. I think Rust tried to "stabilize" too early, and instead should have embraced true "versioning". I would love to see a language that has a file extension such as .rs1 and then .rs2 the year after, dropping source-level compatibility on the floor. The PowerShell guys nearly did this, and now they're pretending that PS6 Core on Linux is 100% compatible with .ps1 files written for Windows 2008 Server! 8)

I'm not sure about the latest C++ stuff, I've avoided C++ for a decade because of productivity-killing issues like the linker exploding in my face with gibberish errors because of external libraries being incompatible with each other.

For another real world example of zero-copy in the wild, refer to Java's NIO library. Note that just like how I mentioned that my sample Read2 trait would likely need a variety of implementations for different styles of buffering, the Java guys have HeapByteBuffer, DirectByteBuffer, and MappedByteBuffer. A lot of people assumed that this style of I/O implies dynamically growing heap allocation only, but that's not a constraint at all.

No, it was host cpu memory, though it was mapped in the kernel with a device driver layer that used it exactly like RAID controllers would later.

My apologies, I misunderstood. You did claim though that BOM-detection was something you've solved without much difficulty. That may be true, but it's an especially simple case in a larger class of problems, including things like smoothly handling memory-mapped I/O and then layering things like BOM-detection on top and also doing so safely on 32-bit platforms.

Or even 64-bit platforms, actually, as I was pointing out above: https://github.com/BurntSushi/ripgrep/issues/922

I'd also like to apologise for picking on ripgrep; in no way is this reflective of the quality of your coding. I'd rather we reach a situation where people with less skill than you are able to write Rust code that is fast and correct. That takes special care and attention in the std library to "steer people into the pit of success" instead of the "pit of failure". Right now I suspect a lot of people are falling into pits full of stabby, pointy things...

You did, and you're far better at Rust than I am. What chance do I have?

Yeah, and? I have that goal as well! (But to be clear, I don't like your phrasing, though I get your drift.) This very thing is a large part of why I go out and library-ize a lot of the code I write. Some day, maybe those shims will become library-ized and others will be able to do what ripgrep does without needing to rewrite those shims. I see no reason why this cannot be done. I see no reason why the Read trait isn't composable enough to build that kind of infrastructure. I see no reason why memory maps couldn't be made to work correctly on 32-bit systems if there is demand for it.

Like, we get it. You're unhappy because your favored abstraction hasn't been built in Rust yet and library-ized. You've made your point. The next step is (for someone) to prototype it and solve a real problem that folks care about with it. This navel gazing is pointless IMO. I tried to tell you this dozens of comments ago.

11 Likes

I may be missing a subtle point, but to me the proposed Read2 trait:

trait Read2 {
    type Data;  // = u8; with associated type defaults.
    type Error; // = (); with associated type defaults.

    /// Returns at least `items` items; `items` can be 0 for best-effort.
    fn peek(&mut self, items: usize) -> Result<&[Self::Data], Self::Error>;

    /// Can consume any number of items, acting much like `skip()`.
    fn consume(&mut self, items: usize) -> Result<(), Self::Error>;
}

looks awfully similar to the std::io::BufRead trait:

pub trait BufRead {
    /// Fills the internal buffer of this object, returning the buffer contents.
    ///
    /// This function is a lower-level call. It needs to be paired with the consume method to function properly.
    /// When calling this method, none of the contents will be "read" in the sense that later calling read may
    /// return the same contents. As such, consume must be called with the number of bytes that are
    /// consumed from this buffer to ensure that the bytes are never returned twice.
    fn fill_buf(&mut self) -> Result<&[u8]>;

    /// Tells this buffer that `amt` bytes have been consumed from the buffer,
    /// so they should no longer be returned in calls to `read`.
    fn consume(&mut self, amt: usize);
    ...
}

edit: to elaborate a little more, I'd say IMO Read and BufRead are two approaches to the I/O API which mainly differ on who owns the buffer: the caller (for Read) or the I/O construct (for BufRead).

Depending on the underlying OS APIs, one or the other may be the cheapest (after all, implementing BufRead directly for a memory-mapped file rather than on top of Read makes a lot of sense). But neither is intrinsically more general or always better than the other, and both are likely worth having and keeping.
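To illustrate the point about implementing BufRead directly over a mapped file: once the bytes are already in memory, fill_buf() can just hand out a sub-slice with no copy. Here's a minimal, hypothetical sketch (the `MappedBytes` name is mine; a real mmap wrapper would hold the OS mapping rather than a plain slice):

```rust
use std::io::{self, BufRead, Read};

// Hypothetical wrapper over bytes that are already in memory,
// which is the shape an mmapped view presents. Implementing
// BufRead directly means fill_buf() hands out a sub-slice:
// no copy into a caller-supplied buffer is needed.
struct MappedBytes<'a> {
    data: &'a [u8],
    pos: usize,
}

impl<'a> Read for MappedBytes<'a> {
    // Read can be implemented on top of BufRead, but note that it
    // forces the copy that fill_buf()/consume() avoid.
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let chunk = self.fill_buf()?;
        let n = chunk.len().min(buf.len());
        buf[..n].copy_from_slice(&chunk[..n]);
        self.consume(n);
        Ok(n)
    }
}

impl<'a> BufRead for MappedBytes<'a> {
    fn fill_buf(&mut self) -> io::Result<&[u8]> {
        // Everything not yet consumed is "available", zero-copy.
        Ok(&self.data[self.pos..])
    }
    fn consume(&mut self, amt: usize) {
        self.pos = (self.pos + amt).min(self.data.len());
    }
}

fn main() {
    let bytes: &[u8] = b"hello\nworld\n";
    let mut r = MappedBytes { data: bytes, pos: 0 };
    let mut line = String::new();
    r.read_line(&mut line).unwrap();
    assert_eq!(line, "hello\n");
}
```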

4 Likes

Ah, okay, that sounds like a substantially different complaint than the one about streams/pipes. So are you no longer concerned that the existence of a stream API is fundamentally a mistake? (I understand you still prefer the "pipe-like" API and think it's a fatal flaw for a language not to support it in the standard library, but I don't see a connection between this and wishing that streams didn't exist.)

I've heard the argument that macros increase compile time, but I'm not sure it's supported by data, and I'd love to see a fuller analysis. rustc provides -Ztime-passes for basic profiling, and as far as I've seen (e.g. in this post), macro expansion is generally not a major time-sink; the general consensus seems to be that codegen is the biggest contributor to slow compiles. That said, perhaps macro usage is a major contributor to the amount of code being generated.

As for IDE-friendliness, you may be right. In principle I can't think of any reason why the Rust Language Server (RLS) or another IDE-backend couldn't provide pretty robust support for macros, since they are better integrated into the language proper than preprocessor-based macros. But as it stands, it appears that RLS has some major struggles with macros, though the IntelliJ plugin has apparently made good progress here.

1 Like

Surely you don't think that's a bad thing! I'm sure even Rust's most vocal proponents are aware that no one language (at least not this early in the history of computing) will become the "one true language" (even for a given design space) indefinitely into the future.

That's...a really fascinating idea, but one that, at first glance, seems to have only downsides. Backwards compatibility is certainly a pain, but it appears to be a sine qua non for industry adoption. When you say "dropping source-level compatibility", do you mean that some kind of lower-level compatibility (e.g. with FFI between different language versions) should still be maintained?

That post seems to reaffirm that the mmap approach wouldn't really be preferable to the stream approach unless the data in question is already in-memory on a fast hardware device. Since that's not something that can generally be assumed, I still don't see the existence of memory mapping as a good reason to avoid stream-based IO by default, especially in the case of networked or distributed systems (since data over IP by definition isn't already in memory). Am I missing something, or is that not the argument you're trying to make?

I would argue that the BufRead/Read2 approach is always better, but this takes some insight into API design. An incredibly common "learning curve" I see goes like this:

Q: I did this I/O code! It's slow! Can someone help please?
A: You're using Read, but you're doing too many kernel calls because you're consuming a few bytes at a time. You should use BufReader. This "lesson" is right there in the doco in the first example.
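That "lesson" in code, a minimal sketch (using Cursor as a stand-in for a File so it's self-contained):

```rust
use std::io::{BufRead, BufReader, Cursor};

// The usual fix: wrap the raw reader in BufReader so that small
// reads are served from an in-memory buffer instead of each one
// becoming a separate kernel call. Cursor stands in for a File
// here so the sketch is self-contained and runnable.
fn main() {
    let raw = Cursor::new(b"line one\nline two\n".to_vec());
    let buffered = BufReader::new(raw);

    // Each line is now consumed from the buffer; the underlying
    // reader is only hit in large chunks.
    let count = buffered.lines().filter_map(Result::ok).count();
    assert_eq!(count, 2);
}
```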

This was the underlying root cause of the pre-Firefox Mozilla web browser's poor performance for years. Practically all I/O it did was line-by-line, unbuffered, and flushed between lines when writing text files like the JavaScript profile contents. Never mind the terrible performance; this was downright dangerous. The Mozilla suite also handled POP3/IMAP email, and tens of thousands of people -- including me -- lost all of their mail data because the Mozilla suite would shred their profile on exit by cutting files in half. Oops. I submitted a bug ticket, which I discovered was one of dozens, and it was promptly ignored along with everyone else's for a decade. Thousands of desperate people had submitted comments along the lines of "Please help! I've lost all my emails!!!".

The Mozilla team basically ignored this because it was just too hard to dig through all of the I/O code littered throughout the codebase and carefully ensure that all of it is appropriately buffered, flushed only at safe transaction points, and file replace operations were appropriately atomic. Firefox thankfully now uses SQLite for most I/O which doesn't have these issues.

So what's the true root cause here? The issue is that the "read a buffer / write a buffer API is a simple common denominator" is a trap. It's a pit full of pointy spikes. It's the C/C++ approach. Professionals fall into it all the time. The entire Mozilla team did. It's not efficient for reads, it's dangerous for writes, and it doesn't scale to even user-mode applications like Firefox, let alone high-performance servers. Like I was saying in this thread, it can't even handle memory-mapped file I/O, which dates back to at least the 90s.

What the "API user" actually wants from any I/O is typically: "Give me as much data as is efficiently available right now, and I'll see how much I can consume, most likely all of it. Don't stop reading just because I'm processing data."

Read doesn't do this. You provide a pre-constrained buffer of some fixed size for each call. You have to guess at what is a good size for this. Your guess will be wrong. If you make this buffer too big, then the inherent copy in the API will blow through your L1/L2 CPU caches and your performance will be bad. If you ask for too little, then you will spam the kernel with transitions and your performance will be terrible. If you try to layer things on top of each other (ZipStream on ChunkStream on CryptoStream) then you will have an absolute nightmare holding onto bytes not consumed by the various layers as they reach the end of their roles. As the consumer of this API, everything you do is difficult and likely to be bad.
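The shape Read forces on the caller looks something like this minimal sketch; the 4096 here is exactly the kind of guessed magic number described above:

```rust
use std::io::Read;

// The Read-style loop: pick a buffer size up front (a guess),
// copy data into it on every call, and deal with partial fills
// yourself. Too big thrashes the CPU caches; too small spams
// the kernel with calls.
fn count_newlines(mut src: impl Read) -> std::io::Result<usize> {
    let mut buf = [0u8; 4096]; // guessed size: too big? too small?
    let mut total = 0;
    loop {
        let n = src.read(&mut buf)?; // always copies into our buffer
        if n == 0 {
            break; // EOF
        }
        total += buf[..n].iter().filter(|&&b| b == b'\n').count();
    }
    Ok(total)
}

fn main() {
    // A byte slice implements Read, so the sketch is self-contained.
    let data: &[u8] = b"a\nb\nc\n";
    assert_eq!(count_newlines(data).unwrap(), 3);
}
```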

There is no scenario, ever, where Read is truly easier for the API user. The single call vs the 2 calls may seem like a "lighter weight" API, but this is just going to lead to poor performance, unnecessary copies, cache thrashing, and even lost data and crying users. Always. Every time. Everywhere. To the point that the Mozilla guys failed to fix the bug for a decade.

Sure, it's possible that superhuman developers will not fall into this trap. I admit I fell into this trap at least a few times when I was a junior developer. I bet everyone reading this forum did at one point or another.

Meanwhile, the BufRead/Read2 style of API design allows the system with the knowledge -- the platform I/O library -- to make the judgement call of the best buffer size. The user can provide a minimum and allow the platform to provide that plus a best-effort extra on top. The best effort can dynamically grow to be the entire file if mmap is available. Or... most of the file if mmap is available and the platform is 32-bit. The API user can then wrap this in something consuming the input byte-by-byte, such as a decompressor, and not have to worry about the number of kernel calls. Similarly, the default non-tokio version can still use async I/O behind the scenes without the consumer being forced to use an async API themselves. It all just... works by default, as long as it is the default.
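For contrast with the Read-style loop, here is the same task in the fill_buf()/consume() style, a minimal sketch: the caller holds no buffer and guesses no sizes; the implementation (heap buffer, mmap view, whatever) decides how much to hand out on each call.

```rust
use std::io::BufRead;

// The BufRead/Read2 shape: ask the source for whatever it has,
// consume as much as you can, repeat. No caller-side buffer and
// no guessed sizes; the implementation chooses the chunk size.
fn count_newlines(mut src: impl BufRead) -> std::io::Result<usize> {
    let mut total = 0;
    loop {
        let chunk = src.fill_buf()?; // "as much as is efficiently available"
        if chunk.is_empty() {
            break; // EOF
        }
        let n = chunk.len();
        total += chunk.iter().filter(|&&b| b == b'\n').count();
        src.consume(n); // we processed everything we were given
    }
    Ok(total)
}

fn main() {
    // A byte slice implements BufRead: fill_buf() returns the whole
    // remaining slice at once, zero-copy -- the same shape an
    // mmap-backed implementation could provide.
    let data: &[u8] = b"one\ntwo\n";
    assert_eq!(count_newlines(data).unwrap(), 2);
}
```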

3 Likes

In my experience, a claim like this is rarely true. So when I read it, I tend to discount the other things you say.

1 Like