Iterators and eliminating *all* runtime bounds checks


#1

Hi, I’m not very fluent in Rust.

I’m also told that “new users can only put 2 links in a post”, so I’ll write links like [L0] which means to look up “L0” in https://play.golang.org/p/8BPZBnE3ZD which I’m using as a pastebin. Sorry.

I recently released Puffs [L0], which is not about Rust per se, but the /r/rust sub-reddit discussed Puffs at [L1], and Rust also came up often in Hacker News’ Puffs discussion at [L2].

One discussion point was about implicit bounds checking and the efficiency of an array indexing expression like “a[i]”. Y’all don’t need telling that, in Rust, it’s more idiomatic to work with iterators instead of indexing expressions. Combined with a for loop, array bounds checks can be eliminated entirely, within the loop body.
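For example (a minimal illustration, not from the Puffs code), summing a slice needs no per-element bounds check; the only per-iteration test is the loop’s own “is there a next element?” check:

fn sum(a: &[u32]) -> u32 {
    let mut total = 0;
    // The iterator's "is there a next element?" test drives the loop;
    // no separate bounds check is needed for each element access.
    for &x in a {
        total += x;
    }
    total
}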

Still, it’s not obvious to me how the Rust equivalent of my Puffs code at [L3] would be able to eliminate all of the array or slice bounds checks. I’m using “bounds checks” here to also include an iterator hitting the end of the slice.


Here’s a brief aside about Puffs syntax. It is a curly-brace, C-like language, with a few idiosyncrasies. These aren’t really relevant to my questions (below), but some of you might be curious.

There are “assert etc” statements in that file. Puffs assertions are entirely compile-time concepts; they do not have a runtime representation. There is also no notion in Puffs of debug and release modes.

Some method calls have a question mark: “foo?(arguments)”. Roughly speaking, the question mark indicates a call (usually involving I/O) that can result in a co-routine suspension (e.g. “my write buffer is full” or “my read buffer is empty”). The arguments also have to be explicitly named: “foo?(name0:val0, name1:val1)” and not “foo?(val0, val1)”.

Integer types can be refined: “var x u32[..511]” means that x is a 32-bit integer that is no larger than 511.


Back to the main topic… I am iterating, but I’m not consuming exactly one byte per iteration. Each iteration (of the “while:loop” outer loop) consumes a variable number of bits (because that’s how Huffman coding works), and if there aren’t enough bits held in local variables (“bits” and “n_bits”), then more are pulled from the underlying source of byte data (in my code, called “in.src”). To repeat, each iteration consumes a variable number of bits, and depending on “n_bits”'s ebb and flow, therefore consumes a variable number of bytes.

I’m happy to be proven wrong, but I believe that that means that I couldn’t use the “for val in bytes_iter { etc }” mechanism if I tried to write this in Rust.

Instead, IIUC, every time I (conditionally) read the next byte (in Puffs, a “read_u8?” call, which happens at up to 12 places during that outer loop), the equivalent Rust code would be calling an Iterator<u8>'s next method. Sorry if I’m mangling Rust syntax or terminology. As I said, I’m not fluent in Rust.

The thing is that the callee would have to make a bounds check (comparing to the end of the slice), and the caller would have to handle both the None and Some cases of the Option returned, right? The compiler can optimize the two if-checks (one callee, one caller) into one, but the compiler can’t optimize the one to zero, right?

In Puffs, runtime bounds checks can be entirely eliminated because the compile time assertion mechanism propagates concepts like “starting with at least 12 bytes of data available in my read buffer, and conditionally reading 2 of those bytes, there is guaranteed to be at least 10 bytes of data available, regardless of which if-branch was taken”. Is there a similar compile time concept in Rust that could reduce that one conditional to zero (within the loop body)?


There was also some talk of dependent types in those Hacker News / Reddit discussions. IIUC there are proposals for dependent types in Rust, but they’re still proposals and not yet confirmed to be in Rust? I am not an expert on dependent types, but if there is an expert here that can show how those language proposals would also solve my problem (getting to zero runtime bounds checks), I’m happy to listen.


#2

At the risk of stating the obvious, I think this is not aiming for the same niche as Rust.

Rust aims to be general-purpose, whereas Puffs is specifically targeted at the task of parsing the subset of computer file formats that its restricted feature set can grasp.

One consequence of Rust’s general-purpose focus is that it must be easy to write readable and maintainable Rust code, whereas Puffs seems to aim for a usage scenario where code is written once, transpiled to C, and left to rest forever:

  • The documentation puts a high emphasis on interoperability with a pure C/++ toolchain, which in the envisioned compilation model is only achieved when code written in Puffs is left unmodified.
  • The checker is designed to be “dumb and fast”, which from an implementor’s point of view translates to “easy to write”, but from a programmer’s point of view translates to “painful to use”. Usually, the dumber a static checker is, the more its user needs to refrain from using specific programming constructs (e.g. pointers and modular arithmetic) and to write redundant assertions over and over again to help it with its task. And indeed, your code examples do show many traces of this.
  • This was justified in one discussion thread IIRC, on the ground that it is fine for the target application domain since file formats change rarely.

Personally, if I were going to the trouble of using a special and verbose programming language in order to assert, at compile time, a property that must absolutely hold true (such as memory safety in the presence of untrusted input), I would rather go all the way to something like SPARK, in order to make the code more maintainable, leverage the power of more competent theorem provers that save me from the need to repeat myself over and over again, and prove other parts of the specification along the way, so that my effort is amortized. But to each their own.


#3

In safe Rust, bounds-check elimination is left up to LLVM. Iterator’s next() gets inlined, so the optimizer can see what is actually being checked, and doesn’t need to create a literal Some/None wrapper at any point.

For slices there’s also get_unchecked() which you can use when you are sure the check is not needed.

in Puffs, a “read_u8?” call, which happens at up to 12 places during that outer loop), the equivalent Rust code would be calling an Iterator’s next method.

If the read_u8? suspends execution if the buffer is empty, there must be a check there, right? So in this case I think iter.next() is equivalent.

If you know you have n bytes left in the buffer, in Rust you can do this:

fn sum_first_n(buf: &[u8], n: usize) -> u32 {
    let buf = &buf[0..n]; // this will bounds check (once), and panic if buf has fewer than n bytes
    let mut total = 0;
    for i in 0..n {
        total += u32::from(buf[i]); // this will not bounds check, because LLVM can see i < n
    }
    total
}

You can use the ASM button in the playground to see how code is compiled:

https://play.rust-lang.org/?gist=00e7568a195caaf758ad41002a129859&version=nightly


#4

Any time you can use the “is i in bounds or not” question to guide control flow, you are effectively omitting bounds checks. Stated differently, the program then contains only the conditionals it needs in order to guide its control flow.

This happens in a typical for loop:

for element in iterator { .. }

Before each iteration, we get the next element from the iterator, and if there’s no next element, the loop stops. So the “has next?” question is a “bounds check” but it’s one that guides the control flow of the program.

If we expand upon that to make a Rust program that uses two bytes per iteration (below), we have to use a construct one step less sugary than the regular for loop. In this program we check, once per iteration, whether there is a second byte, and emit an error if there is not. In that sense there are no bounds checks in this program, only the checks necessary to guide program flow.

A program that did not report this error case would instead have to use either a bounds-checked access or an unsafe unchecked access at that point.

(playground link)

#[derive(Debug)]
enum MyError {
    PrematureEnd,
}

// This parser reads a sequence of i16 numbers as plain bytes in little endian order
fn parser(data: &[u8]) -> Result<Vec<i16>, MyError> {
    let mut iterator = data.iter();
    let mut result = Vec::new();
   
    // Get the first next byte
    while let Some(&byte_1) = iterator.next() {
        // Try to get a second byte this iteration
        // if there is no next byte, return an error
        let byte_2 = match iterator.next() {
            Some(&byte_2) => byte_2,
            None => return Err(MyError::PrematureEnd),
        };
        
        // we have two bytes, so produce the next result and continue
        let next_value = (byte_1 as u16 | ((byte_2 as u16) << 8)) as i16;
        result.push(next_value);
    }
   
    Ok(result)
}

What could be improved here? It would be better if we could use implied or nonlocal information. For example, in your programming language, is it possible to check up front that the length is divisible by two? In that case you make that check once instead of once per loop iteration, and I don’t think Rust has a nice built-in way to do the same thing at the moment, except when the compiler can eliminate bounds checks as an optimization.
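To make that concrete, here is a sketch (my code, reusing the MyError type from above; the optimizer behaviour is not guaranteed) that does the length check once, up front. Inside the loop, the slice pattern’s “are there at least two bytes left?” test is also the loop-exit test, so the body needs no separate error branch:

fn parser_upfront(data: &[u8]) -> Result<Vec<i16>, MyError> {
    // One check, up front: reject inputs whose length is not a multiple of two.
    if data.len() % 2 != 0 {
        return Err(MyError::PrematureEnd);
    }
    let mut result = Vec::with_capacity(data.len() / 2);
    let mut rest = data;
    // Each iteration peels two bytes off the front of the slice. The pattern
    // match both checks for and provides those two bytes, and ends the loop
    // when fewer than two bytes remain (which, after the up-front check, only
    // happens when the slice is empty).
    while let [byte_1, byte_2, tail @ ..] = rest {
        result.push((*byte_1 as u16 | ((*byte_2 as u16) << 8)) as i16);
        rest = tail;
    }
    Ok(result)
}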


#5

I think in Rust the way you’d mimic dependent types is by careful encoding of invariants in the types, and then using unsafe/unchecked code in the implementation. This is laborious and prone to type explosion (and human mistakes!), so having a language- (compiler-) assisted means of getting this right (with potentially fewer types and less boilerplate) would be nice.
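A hypothetical sketch of that approach (the type and method names are mine): the constructor checks the invariant once, and the implementation then relies on it internally, while the public API stays safe:

/// Invariant carried by the type: the wrapped slice holds at least 12 bytes.
struct AtLeast12<'a>(&'a [u8]);

impl<'a> AtLeast12<'a> {
    /// The one and only runtime length check lives in the constructor.
    fn new(s: &'a [u8]) -> Option<Self> {
        if s.len() >= 12 { Some(AtLeast12(s)) } else { None }
    }

    /// No further runtime check: the invariant established by `new` justifies
    /// the unchecked conversion, and the unsafety stays inside this impl.
    fn first_twelve(&self) -> &[u8; 12] {
        // SAFETY: `new` guaranteed self.0.len() >= 12.
        unsafe { &*(self.0.as_ptr() as *const [u8; 12]) }
    }
}

A caller can then index the returned &[u8; 12] with constant indices; an out-of-bounds constant index on an array is rejected at compile time, and an in-bounds one needs no runtime check.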


#7

I agree absolutely that Puffs is not aiming for the same niche as Rust. I just had people telling me “Rust can already do this”, and I wanted to learn from this Rust forum whether that’s entirely true.

Drifting off topic, but I discuss SPARK in https://github.com/google/puffs/blob/master/doc/related-work.md


As I said, the optimizer can eliminate the literal Some/None wrapper, but next() still has to do at least one check (to see if it’s hit the end), right?

Yes, but get_unchecked is an unsafe method. The purpose of the OP is to have no unsafe code whatsoever.

There’s one up-front check per loop iteration, not per read_u8? call. There are a variable number of read_u8? calls within each iteration, that number ranges from 0 to 12.

Once again, I’m not just iterating over the bytes in sequence. I don’t think I can do a “for foo in bar” iteration. I’m reading a variable number of bytes, interspersed with other computation, on each iteration of the outer loop.


Well, that’s what I think too, but I wanted to see if this Rust forum thought otherwise.


Yes, the point of the OP is that, even allowing for human mistakes, there are no calls to unsafe code.


#8

You could try creating a temporary iterator from buf[0..12].iter() before the reads. If LLVM is smart, it will allow up to 12 next() calls without checks.
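A rough sketch of that idea (the wrapper function and names are mine): the slicing does the single bounds check, and whether the subsequent next() checks fold away is up to LLVM:

fn read_two_of_twelve(buf: &[u8]) -> (u8, u8) {
    // One bounds check here (panics if fewer than 12 bytes remain).
    let mut peek = buf[0..12].iter();
    // These unwraps cannot fail, because the window has length 12. After
    // inlining, LLVM can usually fold away both the Option and the check,
    // though that is an optimizer outcome rather than a guarantee.
    let b0 = *peek.next().unwrap();
    let b1 = *peek.next().unwrap();
    (b0, b1)
}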


#9

Possibly. But then every time the loop continues, we’d have to advance the original iterator by however many bytes the temporary iterator consumed. What’s the idiomatic way to do that? Skimming https://doc.rust-lang.org/std/iter/trait.Iterator.html doesn’t suggest an obvious solution to me, although I’m happy to be proven wrong.

Would I need to keep an out-of-band count of how many bytes the temporary iterator consumed, and then explicitly call next() on the original iterator that many times? It doesn’t sound like a net win on performance.


#10

slice = &slice[n..] is Rust’s way of doing pointer arithmetic.

But it seems you need quite a special construct that isn’t in the stdlib.
If you insist on iterators, you could make your own slice iterator that can give out a sub-iterator which does a fixed-length, ahead-of-time bounds check and then also advances its parent iterator.
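As a minimal sketch of the re-slicing idiom (the helper name is mine), a loop body that knows how many bytes it just consumed can “advance the pointer” like this, with no unsafe code:

fn advance<'a>(buf: &mut &'a [u8], consumed: usize) {
    let current: &'a [u8] = *buf;
    // One bounds check (consumed <= current.len(), panicking otherwise);
    // this re-slice plays the role of C's `ptr += consumed`.
    *buf = &current[consumed..];
}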


#11

I don’t know if I insist on iterators. I just have my (non-Rust) thing and had multiple comments on the theme of “Rust iterators can already do that”.

OK, but is that going to be as efficient as C’s “*ptr++” (with only one underlying pointer for your suggested child-and-parent iterator mechanism), without resorting to anything “unsafe”? Having rustc and/or llvm recognize and optimize that pattern is possibly a bit too niche. My thing (Puffs) is explicitly a niche thing, but Rust is a general purpose language.


#12

Just quickly: others have already noted that the use of iterators can remove intermediate bounds checks, and that the use of Option will be optimised away. I think one of your remaining questions was about the remaining check within the iterator.

Specifically, in the buf[0..n] example above, the 0..n iterator will still have a check to see whether it has reached n yet.

However, if n is a small constant known at compile time (as in the 12-byte example), then I’d expect the loop to be unrolled and even that check removed. Furthermore, in a similar situation, every language is going to either unroll the loop or keep a check that the loop is finished.


#13

I saw your arguments, but I was unconvinced:

  • Software reproducibility needs are best addressed via containers or package managers that preserve precise versions, rather than by attempting to make the software as simple as possible. Changing a single line of code is enough to break reproducibility.
  • The Trusted Computing Base argument would be convincing if you weren’t using Go for the implementation. That is quite a thick runtime/stdlib to trust, and from a semantic point of view its compile-time checking is among the most permissive of any statically typed programming language. I’m not saying that other languages used to implement theorem provers are any better (after all, Z3 is written in C++ ^^), just that being a smaller program written in Go doesn’t seem like a significant enough improvement to state “my trusted computing base is much better than yours”.
  • I don’t see how proof time is such a big problem in your target usage scenario, when the proof is only carried out rarely (because the code is rarely modified). A simple prover that fails fast because it isn’t able to infer simple program properties, requiring lots of manual effort, isn’t necessarily better than a complex prover which takes more time but succeeds more often. It keeps programmers busier, not more productive.

#14

It may optimize to be as efficient as ptr++, but this is not guaranteed. If you want that guarantee, you can use raw pointers in Rust.

Note that “unsafe” in Rust is not as bad as it may seem. An unsafe block is merely as (un)safe as every line of code in a C program, and in Rust, if you have a special requirement for performance or memory usage that the safe subset of the language can’t satisfy, you can do whatever you need and then wrap it in a safe API, so that the unsafety is limited to one function or module. That’s how most of the Rust stdlib is implemented.

So if you’re generating code from another language that enforces extra safety guarantees, you can rely on that and use a few unsafe calls in Rust without making anything actually unsafe. That doesn’t diminish the value of safety in Rust, because safety will still be enforced for all the other code that interacts with it.
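As a hypothetical sketch of that pattern (this helper is mine, not from any real code generator): the generated code keeps one small unsafe helper whose precondition is exactly the property the source language has already proven, and everything around it stays safe:

// Precondition (proven by the generating language's own checker, not by rustc):
// pos + 2 <= buf.len().
#[inline]
unsafe fn read_u16_le_unchecked(buf: &[u8], pos: usize) -> u16 {
    debug_assert!(pos + 2 <= buf.len());
    // SAFETY: the caller guarantees that pos and pos + 1 are in bounds.
    u16::from(*buf.get_unchecked(pos)) | (u16::from(*buf.get_unchecked(pos + 1)) << 8)
}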


#15

Thanks, but you’re missing a point I’ve made a number of times in this thread. I’m not iterating over bytes. I’m iterating over Huffman codes, which take a variable number of bits, and therefore a variable number of bytes. The number of bytes needed per iteration depends on some computation made (and some bits and bytes read) inside that loop body. There is no “small constant n” that the compiler can unroll.


OK, we’ll agree to disagree. If you want to use something like SPARK, go right ahead. Programming is not a zero sum game where I ‘win’ when you ‘lose’.

But I’m drifting off topic.


Yep, to quote myself from https://groups.google.com/d/topic/puffslang/2z61mNTAMns/discussion

“But pointer arithmetic is explicitly unsafe in e.g. Go and Rust. Sure, you can hide the unsafe parts as private implementation and provide an unsafe-free public API. Still, as future maintainers edit the code, proof of safety (and the assumptions that they rely upon) are merely comments, not compiler enforced. It’s not about pointer arithmetic per se, but in cases like https://github.com/madler/zlib/issues/245 it’s far from obvious whether even the author of some widely used C code can be sure that it is safe, just from the comments.”


#16

I suspect @willu was addressing this part of your original post:

Even if there’s no constant, the compiler may unroll the loop by some unroll factor. This is a common form of loop optimization: split the loop into pre, main, and post loops. The main portion is unrolled, and its bounds checks are eliminated because the range has been verified in the pre-loop section. The post loop then iterates the remaining few times.

This isn’t guaranteed though, so if Puffs encodes that in the language (and thus guarantees) then it can do this better.


#17

The “loop unrolling” and “compiler optimization” arguments are not convincing. They’re good for doing things like getting rid of the Option wrapper, but if you really want to guarantee no bounds checks:

Slice iterators implement Clone

You can look ahead, and then conditionally decide whether to advance the top iterator or not, by cloning the top iterator and then doing your work with it:

use std::mem;
fn skip_all_spaces(a: &[u8]) {
    let mut i = a.iter();
    while let Some(&c) = i.next() {
        let mut j = i.clone();
        // in a real parser, this would consume variable-width items,
        // instead of just skipping spaces
        let mut skipped_spaces = false;
        while let Some(&c2) = j.next() {
            if c2 != b' ' { break }
            skipped_spaces = true;
        }
        // when we've consumed our variable-width token
        // we can "fast-forward" the main iterator without repeatedly calling next()
        // naturally, we can perform this conditionally
        if skipped_spaces {
            mem::swap(&mut i, &mut j);
            println!("{}", c);
        }
    }
}

playground link

You can extract a slice from a slice iterator

You can pull off whole “chunks at a time” by converting the iterator back into a slice, splitting it, and then turning it back into an iterator:

https://doc.rust-lang.org/stable/std/slice/struct.Iter.html#method.as_slice

fn handle_payload(_payload: &[u8]) { /* stub, so that this excerpt compiles on its own */ }

fn treat_abc_as_one_big_token(a: &[u8]) {
    let mut i = a.iter();
    while let Some(&c) = i.next() {
        if c == b'a' {
            let mut j = i.clone();
            if j.next() == Some(&b'b') {
                if j.next() == Some(&b'c') {
                    // got the "abc" token; let's assume it should be followed by a four-byte thing
                    let sl = j.as_slice();
                    if sl.len() >= 4 {
                        // rust?! y u no have `fn try_split_at_mut(&[T]) -> Option<(&[T], &[T])>`?!
                        // it's not a big deal; this should be trivially optimizable, even if `4` is replaced by some variable
                        // as long as it's exactly the same
                        let (payload, rest) = sl.split_at(4);
                        handle_payload(payload);
                        i = rest.iter();
                    } else {
                        // got "abc" without its payload
                    }
                } else {
                    // got a b followed by something other than c
                }
            } else {
                // got an a followed by something other than b
            }
        }
    }
}

playground link


#18

As someone else noted, I was replying to a different part of your original comment. However, I don’t see the variable number of bytes as a big deal. Simply make an iterator over Huffman codes. On the other hand, if you’ve got a great language for this sort of stuff, great. 🙂


#19

@nigeltao have you looked at nom yet?
It’s a “parser combinator” library that is currently the state of the art in iterator-based (file) parsing in Rust. If anything, it would make a great example to see how others do it. (They explicitly mention being able to handle byte streams as bitstreams.)

One of its current feats-of-strength is that gcouprie used it to write file-format parsers for VLC (see this slide deck from RustConf2016).

I barely understand how it functions under the hood, but it has many facilities for matching variable-length elements in a byte/bit stream.


#20

He does mention nom at the end of the “related work” documentation: https://github.com/google/puffs/blob/master/doc/related-work.md


#21

Sure, with extremely short loops like “while (p < q) { x += uint32(*p++); }” or memcpy-like functions (both of which advance by a fixed number of bytes per iteration), compilers can and do unroll them, e.g. 4 or 8 times (emitting code like “p += 8”, since 8 is the unroll count times the fixed advance of 1), with a preparatory step beforehand and a clean-up step afterwards. But in my case, the loop body is 280 lines of code, with a variable advance. I’d be surprised if a compiler unrolled that, let alone guaranteed it.


I’m confused about how your skip_all_spaces example helps my problem. I don’t ever want to un-read bytes, so I don’t see the benefit of cloning the i iterator to j, working solely with j within the loop, then mem::swap’ping i and j at the end. Why don’t you just always use the i iter? You mention conditionality, but I don’t need conditionality for my problem.

An earlier comment said that “You could try creating a temporary iterator from buf[0..12].iter() before the reads”, with the point of the explicit 12 being to eliminate bounds checks further along the loop body. Sure, here, a mem::swap would be faster than explicitly calling next() on the outer iterator. But if you cap the inner iterator at 12, this cap would propagate back to the outer iterator when you mem::swap, right?

OK, even if the compiler can somehow make this whole dance as cheap as C’s “*ptr++”, your treat_abc_as_one_big_token example shows an explicit “if sl.len() >= 4” check within the loop body, which is exactly the sort of bounds check I’d like to eliminate.


Well, spinning out an iterator over Huffman codes would basically mean factoring out my 280-line loop body into a 200-line iterator-producer and an 80-line iterator-consumer, but it still wouldn’t eliminate any byte-buffer bounds checks. It would just move them out of my function implementation per se and into the iterator-producer implementation. Subsequent unrolling of the iterator-consumer part won’t help with that.