Code generation for iterator chains

Hi,

Every now and then I take a look at the code Rust generates for the code I write, to make sure it does what I expect. That gives me confidence in my code, which I think is important in a systems language.

Anyway, I was playing around a bit with iterators. Since they can remove quite a bit of code, I tried this out.

I implemented this code:

    fn get_rect_by_handle(&self, handle: u64) -> Option<Rect> {
        for dock in &self.d0 {
            if dock.handle == handle {
                return Some(dock.rect);
            }
        }

        for dock in &self.d1 {
            if dock.handle == handle {
                return Some(dock.rect);
            }
        }

        None
    }

Then I rewrote it like this

    fn get_rect_by_handle(&self, handle: u64) -> Option<Rect> {
        self.d0
            .iter()
            .chain(self.d1.iter())
            .find(|&dock| dock.handle == handle)
            .map(|dock| dock.rect)
    }

The first version (with the two loops) generates code like this:

_ZN5Split18get_rect_by_handle20h5c2c7519bc9bc06fmcaE:
    .cfi_startproc
    mov    rax, qword ptr [rsi]
    mov    rcx, qword ptr [rsi + 16]
    add    rax, -16
    shl    rcx, 4
    .align    16, 0x90
.LBB0_1:
    test    rcx, rcx
    je    .LBB0_5
    cmp    rax, -16
    je    .LBB0_5
    add    rcx, -16
    cmp    qword ptr [rax + 16], 2
    lea    rax, [rax + 16]
    jne    .LBB0_1
    jmp    .LBB0_4
.LBB0_5:
    mov    rax, qword ptr [rsi + 24]
    mov    rcx, qword ptr [rsi + 40]
    add    rax, -16
    shl    rcx, 4
    .align    16, 0x90
.LBB0_6:
    test    rcx, rcx
    je    .LBB0_9
    cmp    rax, -16
    je    .LBB0_9
    add    rcx, -16
    cmp    qword ptr [rax + 16], 2
    lea    rax, [rax + 16]
    jne    .LBB0_6
.LBB0_4:
    mov    rax, qword ptr [rax + 8]
    mov    qword ptr [rdi + 4], rax
    mov    dword ptr [rdi], 1
    ret
.LBB0_9:
    mov    eax, dword ptr [rip + const4699+8]
    mov    dword ptr [rdi + 8], eax
    mov    rax, qword ptr [rip + const4699]
    mov    qword ptr [rdi], rax
    ret

And the second one generates this:

    .cfi_startproc
    mov    rcx, qword ptr [rsi]
    mov    r9, qword ptr [rsi + 16]
    shl    r9, 4
    add    r9, rcx
    mov    rdx, qword ptr [rsi + 24]
    mov    r8, qword ptr [rsi + 40]
    shl    r8, 4
    add    r8, rdx
    xor    r10d, r10d
    .align    16, 0x90
.LBB0_1:
    movzx    eax, r10b
    cmp    eax, 1
    je    .LBB0_6
    cmp    eax, 2
    jne    .LBB0_3
    cmp    rdx, r8
    je    .LBB0_15
    lea    rsi, [rdx + 16]
    mov    r10b, 2
    mov    r11, rcx
    jmp    .LBB0_12
    .align    16, 0x90
.LBB0_6:
    cmp    rcx, r9
    je    .LBB0_15
    lea    r11, [rcx + 16]
    mov    r10b, 1
    jmp    .LBB0_5
    .align    16, 0x90
.LBB0_3:
    cmp    rcx, r9
    je    .LBB0_10
    lea    r11, [rcx + 16]
    xor    r10d, r10d
.LBB0_5:
    mov    rsi, rdx
    mov    rax, rcx
    jmp    .LBB0_13
.LBB0_10:
    cmp    rdx, r8
    je    .LBB0_15
    lea    rsi, [rdx + 16]
    mov    r10b, 2
    mov    r11, r9
.LBB0_12:
    mov    rax, rdx
.LBB0_13:
    cmp    qword ptr [rax], 2
    mov    rcx, r11
    mov    rdx, rsi
    jne    .LBB0_1
    test    rax, rax
    je    .LBB0_15
    mov    rax, qword ptr [rax + 8]
    mov    qword ptr [rdi + 4], rax
    mov    dword ptr [rdi], 1
    ret
.LBB0_15:
    mov    eax, dword ptr [rip + const4376+8]
    mov    dword ptr [rdi + 8], eax
    mov    rax, qword ptr [rip + const4376]
    mov    qword ptr [rdi], rax
    ret

The second one generates quite a few more instructions, so I'm wondering: is it to be expected that the second version will (from what it seems, at least) be slower than the first one?

Cheers!

What compiler switches did you use?

I did the test on the playground, so with the default parameters there (and in Release mode). The results are quite similar/identical on Stable/Beta/Nightly.

The chaining seems to be the stumbling point here. Using

    self.d0.iter().find(|&dock| dock.handle == handle)
        .or_else(|| self.d1.iter().find(|&dock| dock.handle == handle))
        .map(|dock| dock.rect)

gives me the same timing as the for-loop version, while the chain version takes twice as much time.

Perhaps it's because of chain(), which turns two adjacent array scans into something that's harder to optimize. Maybe this is faster:

    self.d0
        .iter()
        .find(|&dock| dock.handle == handle)
        .map(|dock| dock.rect)
        .or_else(|| self.d1
            .iter()
            .find(|&dock| dock.handle == handle)
            .map(|dock| dock.rect))

Thanks for the replies and suggestions!

Yeah, it seems like the chaining makes it harder for the compiler to optimize the code. In my current case it isn't a big deal, but it's good to be aware of, and I find it interesting to look at the code gen anyway :)

A similar problem with chaining is present in D and C++ too. Perhaps rustc could tackle such problems with something similar to the high-level rewrite rules of the Haskell compiler:

https://downloads.haskell.org/~ghc/7.0.1/docs/html/users_guide/rewrite-rules.html

But it's harder to do in Rust, because Rust is less pure than Haskell (purity is not a binary thing; it has many shades).

Here's another example using the VecDeque iterator.

My take on it is that VecDeque iteration, or equivalently .chain() iteration, really corresponds to two plain loops, one after the other, and the optimizer would need a pass that tries to identify and optimize that case. Since these lazy iterator chains are not common in C++, it's no wonder that the optimization passes that rustc has mimicked from a C++ compiler don't handle this at all.
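
For what it's worth, you can hand-roll the two-loop shape for a VecDeque with as_slices(), which exposes the deque's two contiguous halves. A minimal sketch (find_in_deque is just an illustrative name):

    use std::collections::VecDeque;

    // Hand-rolled "two plain loops" search over a VecDeque: as_slices()
    // exposes the two contiguous halves of the ring buffer, and each
    // half is then scanned as an ordinary slice.
    fn find_in_deque(dq: &VecDeque<u64>, needle: u64) -> Option<u64> {
        let (front, back) = dq.as_slices();
        front
            .iter()
            .find(|&&x| x == needle)
            .or_else(|| back.iter().find(|&&x| x == needle))
            .copied()
    }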

Making this fast can be done by implementing find on the Chain iterator type, instead of letting the default impl call next over and over: chain_find.rs · GitHub
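
The idea, roughly, looks like this (a standalone sketch: MyChain is an illustrative stand-in for std's Chain, whose fields are private, and this is not the actual std implementation):

    // A two-iterator chain with a specialized `find` that searches
    // each half in its own tight loop.
    struct MyChain<A, B> {
        a: Option<A>,
        b: B,
    }

    impl<A, B> Iterator for MyChain<A, B>
    where
        A: Iterator,
        B: Iterator<Item = A::Item>,
    {
        type Item = A::Item;

        fn next(&mut self) -> Option<A::Item> {
            // Poll the first iterator until it is exhausted, then fall
            // through to the second; this is the state machine the
            // default `find` has to re-check on every step.
            if let Some(a) = &mut self.a {
                if let Some(x) = a.next() {
                    return Some(x);
                }
                self.a = None;
            }
            self.b.next()
        }

        fn find<P>(&mut self, mut predicate: P) -> Option<Self::Item>
        where
            Self: Sized,
            P: FnMut(&Self::Item) -> bool,
        {
            // Specialized find: two plain scans in sequence, so each
            // half optimizes like a hand-written for-loop.
            if let Some(a) = &mut self.a {
                if let Some(x) = a.find(&mut predicate) {
                    return Some(x);
                }
                self.a = None;
            }
            self.b.find(predicate)
        }
    }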

This problem is well known (and studied) in the C++ world too (in Rust such ideas could lead to some kind of hierarchical iterator traits):

I started reading that paper, but there seems to be a fundamental misunderstanding of generic programming (or maybe it's just the presentation order of the paper, starting with the data structures). Iterators result from a classification of the access patterns in algorithms, not data structures. So we start with a bunch of algorithms for things like partition, rotate, stable-partition, sort, etc. We take the definitions of the algorithms and generalise the data access patterns, and we then give these access patterns names like ForwardIterator, BidirectionalIterator, IndexedIterator, etc.

If there is some class of algorithms that do not fit these patterns, it does not invalidate the existing classifications (as all those useful algorithms still exist). The existence of algorithms that require a BifurcatingIterator does not make the algorithms that require a ForwardIterator suddenly not exist. (Edit: the paper makes this point as well when it says that segmented-iterators are orthogonal to forward, backward etc.)

The starting point should be to look at non-generic algorithms over one-off data-structures and look for the access patterns that could be abstracted into a new kind of iterator. (Chapter 7 of "Elements of Programming" seems relevant here).

Edit: this quote in particular: "Additionally, a fully generic algorithm ought to be usable with both segmented and nonsegmented iterators." exemplifies this misunderstanding. We already have a fully generic algorithm, but it may not be as efficient as a specialised one in all cases. The paper makes this point earlier when it talks about the bidirectional-reverse algorithm being more efficient than the forward-reverse algorithm, so obviously we should use bidirectional-reverse if the data-structure supports efficient bidirectional access.

Edit2: Coming from both a C++ and Haskell background, I agree with you that Rust should have a hierarchy of iterator traits. What C++ already has, and what that paper assumes as a baseline, already seems more sophisticated than what Rust has in terms of iterators. I just find Stepanov's books to be clearer on where we should be going with this kind of generics.
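
To make the "hierarchy of iterator traits" idea concrete, here is a minimal sketch in the spirit of the paper's segmented iterators (SegmentedSeq, seg_find, and TwoVecs are all made up for illustration; nothing like them exists in std): a sequence exposes its contiguous segments, and an algorithm written against that structure gets one tight slice loop per segment.

    // A segmented sequence exposes its contiguous pieces as slices.
    trait SegmentedSeq {
        type Item;
        /// The contiguous segments, in order: one for a Vec, two for a
        /// VecDeque or for a chain of two vectors.
        fn segments(&self) -> Vec<&[Self::Item]>;
    }

    // A "hierarchical" find: the outer loop walks segments, the inner
    // search is a plain slice scan that the optimizer handles well.
    fn seg_find<S: SegmentedSeq>(
        seq: &S,
        mut pred: impl FnMut(&S::Item) -> bool,
    ) -> Option<&S::Item> {
        for seg in seq.segments() {
            if let Some(found) = seg.iter().find(|&x| pred(x)) {
                return Some(found);
            }
        }
        None
    }

    // Example impl mirroring the d0/d1 pair from the original post.
    struct TwoVecs<T> {
        d0: Vec<T>,
        d1: Vec<T>,
    }

    impl<T> SegmentedSeq for TwoVecs<T> {
        type Item = T;
        fn segments(&self) -> Vec<&[T]> {
            vec![self.d0.as_slice(), self.d1.as_slice()]
        }
    }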

Not all parts of D are well designed, but its ranges design is good enough. So take a look at D as well.

Ranges are something the STL didn't do, and I have been wondering if that was deliberate, as algorithms that take a pair of iterators are more general. What would be nice is a way to have free iterators that are also constrained to be part of a bounded range. Maybe this could be confirmed statically. I think this is an interesting area to look at.

Ranges are probably coming in the next C++.

Pairs of iterators are a bit more general; there are few algorithms and usages that become worse using ranges, and in practice, in most cases, this isn't a significant disadvantage. On the other hand, ranges are safer, simpler to use, and lead to more succinct code that's more readable.

I guess it depends on what you are trying to do. As I am interested in generic programming, where the aim is to present each algorithm in its most general form with no loss in efficiency, it does matter.

Most of the algorithms in the STL were created to implement one algorithm, stable-partition. Composing the more complex algorithms out of the simpler ones requires the freedom of iterators. Even in my parser combinator library, which uses ranges, you need a range for the file but an iterator to represent the location in the file.

The D ranges have the same purposes. Take a look at the D standard library and D code. In general the efficiency of D code is not significantly different from C++ code if you use similar algorithms and data access patterns. The D type system is more flexible and simpler to use than the C++ one (with better handling of compile-time stuff); this often makes average D code use more specialized code than C++ (take a look at the D memory allocators library).

On average, D coding with ranges doesn't seem significantly worse than C++ coding with iterators. Take a look at the situation more generally: languages (with their standard libraries) are ecologies; they aren't collections of small features.

Have you pushed a PR already?

But is this a general enough solution?

Yup:

It's certainly not "general". But it's something we can do now, easily, while I don't see an overhaul of the whole Iterator philosophy coming any time soon.

I will take a look. My first objective is to port the EoP algorithms to Rust as they are. I do appreciate the advantages of ranges, but how do you deal with algorithms like this:

    pub fn partitioned<I, P>(f: I, l: &I, mut p: P) -> bool
    where
        I: Iterator + Readable,
        P: FnMut(&I::ValueType) -> bool,
    {
        // Precondition: readable_bounded_range(f, l)
        let g = find_if(f, l, &mut p);
        *l == find_if_not(g, l, p)
    }

'g' returned from find is clearly a point, not a range; however, we could construct a new range from [g, l) at the cost of some performance.

At the moment I am considering if 'rangeness' could be a type system property relating two values, so that ranges can be implicit.
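
For comparison, here is a minimal sketch of the point-based style with plain slice indices as positions (find_if, find_if_not, and partitioned are illustrative re-implementations, not library items); re-forming [g, l) is then free, since a position is just an index into the same sequence:

    // EoP-style algorithms with slice indices as positions. A "range"
    // is implicit in a pair (f, l) of indices into the same slice, so
    // the half-open range [g, l) is re-formed at no cost.
    fn find_if<T>(s: &[T], f: usize, l: usize, p: &mut impl FnMut(&T) -> bool) -> usize {
        let mut i = f;
        while i < l && !p(&s[i]) {
            i += 1;
        }
        i
    }

    fn find_if_not<T>(s: &[T], f: usize, l: usize, p: &mut impl FnMut(&T) -> bool) -> usize {
        let mut i = f;
        while i < l && p(&s[i]) {
            i += 1;
        }
        i
    }

    fn partitioned<T>(s: &[T], mut p: impl FnMut(&T) -> bool) -> bool {
        // Precondition: [0, s.len()) is a readable bounded range.
        let g = find_if(s, 0, s.len(), &mut p);
        s.len() == find_if_not(s, g, s.len(), &mut p)
    }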

I tend not to favour the 'chaining' syntax, as I think it is harder to read. I also prefer classic function syntax to method notation, especially now that Rust has multiple dispatch on traits, so the first argument is no longer special.