Rust vs. C++: Fine-grained Performance

Here's what I have now:

                if word.count_ones() == 7 {
                    sevens.push((word, 0))
                } else {
                    words.push(word)
                }

...

    sevens.sort_by(|a, b| b.cmp(a));
    let mut place = 0;
    for i in 0..sevens.len() {
        if sevens[i].0 != sevens[place].0 {
            place += 1;
            sevens[place] = sevens[i];
        }
        sevens[place].1 += 1;
    }
    sevens.resize(place + 1, (0, 0));
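The sort-and-compact counting can also be packaged as a small self-contained function. A sketch (the name `count_dups` and the bare `u32` element type are just illustrative here, standing in for the `(word, count)` pairs):

```rust
// Sketch of the sort-then-compact counting above: sort descending,
// then merge equal neighbors while tallying occurrences.
fn count_dups(mut sevens: Vec<u32>) -> Vec<(u32, u32)> {
    sevens.sort_by(|a, b| b.cmp(a)); // descending order
    let mut out: Vec<(u32, u32)> = Vec::new();
    for s in sevens {
        match out.last_mut() {
            // same value as the previous entry: bump its count
            Some(last) if last.0 == s => last.1 += 1,
            // new value: start a fresh (value, count) entry
            _ => out.push((s, 1)),
        }
    }
    out
}

fn main() {
    println!("{:?}", count_dups(vec![3, 1, 3, 2, 3]));
}
```

Unlike the in-place version, this one also behaves sensibly on an empty input (no `resize` to a dummy element).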

I am very happy with this.

In response to improvements posted here and on IRC, I have had to delete most of the complaints in the original text. I don't see any need to use scan() any more, and of late I have not encountered slowdowns using mutable variables outside the scope of loops, even with rustc 1.4. for &e in &v is still much slower than for &e in v.iter(), though, even for outermost loops.
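For reference, the two loop forms are semantically identical: `for &e in &v` goes through `IntoIterator`, which for `&Vec<T>` is defined in terms of `iter()`, so any speed gap between them must be a codegen artifact of that compiler version. A quick sanity check:

```rust
fn main() {
    let v = vec![1u32, 2, 3, 4];

    // Sugar form: iterates via IntoIterator for &Vec<u32>.
    let mut s1 = 0;
    for &e in &v {
        s1 += e;
    }

    // Explicit form: calls iter() directly.
    let mut s2 = 0;
    for &e in v.iter() {
        s2 += e;
    }

    assert_eq!(s1, s2); // both sum to 10
}
```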


String::from("/usr/share/dict/words")

=>

"/usr/share/dict/words".into()
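The two forms produce the same `String`; `.into()` just needs the target type to be known from context:

```rust
fn main() {
    let a = String::from("/usr/share/dict/words");
    // `.into()` works here because the target type is annotated.
    let b: String = "/usr/share/dict/words".into();
    assert_eq!(a, b);
}
```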

Also:

                let a = match score + 3 * count
                    { 26 ... 32 => { any = true; 'A' }, _ => 'a' };
                *outc = a as u8 + (25 - rest.trailing_zeros()) as u8;
                (any, rest & rest - 1)
            });

You can remove a cast (generally, the fewer casts there are in your code, the better):

                let a = match score + 3 * count
                    { 26 ... 32 => { any = true; b'A' }, _ => b'a' };
                *outc = a + (25 - rest.trailing_zeros()) as u8;
                (any, rest & rest - 1)
            });

trailing_zeros() returns a u32, and unfortunately Rust is not yet able to see that it's going to fit in a u8. Since you then subtract it from 25, proving that last cast to u8 safe would require a smarter type system (something like refinement typing), because it depends on an invariant about the "word" values that can't be expressed in the current Rust type system.
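To make the types concrete (the bit pattern below is just an illustrative value, not one from the actual word list):

```rust
fn main() {
    let rest: u32 = 1 << 2; // a mask with only bit 2 set
    let tz = rest.trailing_zeros(); // this is a u32...
    // ...so even after `25 - tz` a cast to u8 is still required:
    let letter = b'a' + (25 - tz) as u8;
    assert_eq!(letter, b'x'); // bit 2 from the top-down numbering maps to 'x'
}
```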

The latest version of your Rust code on GitHub has improved, but it's still not easy to understand... I like iterator chains, but when they are full of mutations I often prefer regular imperative code.

Edit: if you're making a benchmark then don't forget to show the flags you use with both compilers. For the Rust code you sometimes want to use switches like -C opt-level=3 -C target-cpu=native -C lto.

In your code "len" is also used with a special value, -1. In Rust you'd usually use an Option to represent such cases. With this change your code becomes:

    let mut len = Some(0);

    for c in io::BufReader::new(file).bytes().filter_map(Result::ok) {
        if c == b'\n' {
            if let Some(l2) = len {
                if l2 >= 5 {
                    if word.count_ones() == 7 {
                        sevens.push((word, 0))
                    } else {
                        words.push(word)
                    }
                }
            }
            word = 0;
            len = Some(0);
        } else if len.is_some() && c >= b'a' && c <= b'z' {
            word |= 1 << (25 - (c - b'a'));
            if let Some(l3) = len {
                len = if word.count_ones() <= 7 { Some(l3 + 1) } else { None }
            }
        } else {
            len = None
        }
    }

The code is a bit awkward because of the Option unwrapping.

Instead of code like this:

fn main() {
    let x = Some(10);
    if let Some(y) = x {
        if y > 5 {
            println!("Big");
        }
    }
}

I'd like to write:

fn main() {
    let x = Some(10);
    if let Some(y) = x && y > 5 {
        println!("Big");
    }
}

With that the code becomes reasonable:

    let mut len = Some(0);

    for c in io::BufReader::new(file).bytes().filter_map(Result::ok) {
        if c == b'\n' {
            if let Some(l2) = len && l2 >= 5 {
                if word.count_ones() == 7 {
                    sevens.push((word, 0))
                } else {
                    words.push(word)
                }
            }
            word = 0;
            len = Some(0);
        } else if let Some(l3) = len && c >= b'a' && c <= b'z' {
            word |= 1 << (25 - (c - b'a'));
            len = if word.count_ones() <= 7 { Some(l3 + 1) } else { None }
        } else {
            len = None
        }
    }
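Until something like that exists, one way to flatten the nesting when the bound value is only needed in the condition is `Option::map_or` (just a partial workaround: it doesn't help when the body needs the unwrapped value):

```rust
fn main() {
    let x = Some(10);
    // map_or(false, ...) turns the Option test and the comparison
    // into a single boolean condition.
    if x.map_or(false, |y| y > 5) {
        println!("Big");
    }
}
```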

No one has answered this yet.

Let's discuss how to improve scan() instead.

Uhm, your http://danluu.org/cpu-bugs/ link seems to be incorrect.

I think the right link is: We should expect more CPU bugs

I'm not sure there's much to discuss. The method is stable and hence cannot be changed, and the map/filter_map versions seem like a perfectly good improvement to me. (Maybe you'd have preferred me to be less tongue-in-cheek and to have left that joking first sentence out...)

This code:

fn main() {
    print!("{:?}", (0 .. 11)
                   .scan(0, |fac, x| { *fac = *fac + x; Some(*fac) })
                   .collect::<Vec<_>>());
}

Outputs:

[0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55]

In Haskell there's a scan, with this semantics:

scanl f z [x1, x2, ...] == [z, z `f` x1, (z `f` x1) `f` x2, ...]

Its Haskell implementation:

scanl :: (b -> a -> b) -> b -> [a] -> [b]
scanl f q ls = q : (case ls of
                       []   -> []
                       x:xs -> scanl f (f q x) xs)

See it in action:

Prelude> scanl (+) 0 [0 .. 10]
[0,0,1,3,6,10,15,21,28,36,45,55]

Where the Haskell [0 .. 10] is similar to (0 .. 11) in Rust.

There are two differences:

  • The usage of the Haskell scanl is shorter and more readable;
  • The Haskell scanl outputs an extra value at the beginning: the given seed "z". That value is often useful in the code I've written. And when you don't need that first value it's easy to skip, while if it's missing it's harder (and less efficient?) to add it back with something like a chain(&[z]).

So with the Rust standard library I'd like something like:

(0 .. 11).good_scan(0, |a, b| *a + b)

To yield:

[0, 0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55]
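Lacking that, the Haskell behaviour can be approximated today by prepending the seed with `iter::once` (at the cost of an extra `chain` adapter):

```rust
use std::iter::once;

fn main() {
    let z = 0;
    // Prepend the seed, then run the usual scan over the rest.
    let v: Vec<i32> = once(z)
        .chain((0..11).scan(z, |acc, x| {
            *acc += x;
            Some(*acc)
        }))
        .collect();
    assert_eq!(v, [0, 0, 1, 3, 6, 10, 15, 21, 28, 36, 45, 55]);
}
```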

Same reason you can't write let *count = 1. The left-hand side of let is a pattern to bind, not a reassignment. It binds new names. Since count is a newly bound name, it doesn't exist yet, and there is no useful way to dereference it.

I think you have misunderstood the question. I am not asking why let statements won't do something (and indeed they do it), I am asking why assignment won't do something.

Thank you, that was a versioning error, fixed.

Thank you, fixed.

Yes, sorry, I misread. The explanation still works a little bit, just in reverse: assignments are not pattern bindings, so you can't use tuples on the LHS. (In that case, the deref-operator is not relevant.)

Why? And why can't Rust be changed to accept reasonable and not-bug-prone code like this?

fn main() {
    let (mut x, mut y) = (1, 2);
    (x, y) = (10, 20);
}

And likewise

    let (mut a, mut b) = (2, 3);
    let (p, q) = (&mut a, &mut b);
    *(p, q) = (4, 5);
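For now the assignment has to go through each reference separately:

```rust
fn main() {
    let (mut a, mut b) = (2, 3);
    {
        let (p, q) = (&mut a, &mut b);
        // No tuple form today: assign through each reference on its own.
        *p = 4;
        *q = 5;
    }
    assert_eq!((a, b), (4, 5));
}
```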

Note you can move the "use" inside the main:

    fn main() {
        use std::io::prelude::{ Read, Write };
        use std::{fs, io, env, process};

On my system your C++14 version runs in about 0.16 seconds, while the Rust code runs in 0.09 or 0.10 seconds. I don't know why.

Generally I love to use iterator chains of map/filter/fold and the like, but in code full of mutations I think they are less clear than regular imperative code. So I've converted the last part of your Rust code into something more similar to the C++14 version. On my system this Rust code is equally fast, and to me it's clearer:

    for &(seven, count) in &sevens {
        let scores = first_part(&words, seven);

        let mut out = *b".......\n";
        let mut rest = seven;
        let mut any = false;
        for (&score, outc) in scores.iter().zip(out.iter_mut().rev().skip(1)) {
            let points = score + 3 * count;
            let a = if points >= 26 && points <= 32 { any = true; b'A' } else { b'a' };
            *outc = a + (25 - rest.trailing_zeros()) as u8;
            rest &= rest - 1;
        }

        if any { sink.write(&out).unwrap(); };
    }
}

Then I've tried to use regular imperative code for the section that computes the scores too. But as you have probably seen, the result is slower: it runs in 0.16 seconds (the same as the C++14 code):

    for &(seven, count) in &sevens {
        let mut scores = [0u16; 7];
        for word in &words {
            if word & !seven == 0 {
                let mut rest = seven;
                for score in scores.iter_mut() {
                    if (word & rest & !(rest - 1)) != 0 {
                        *score += 1;
                    }
                    rest &= rest - 1;
                }
            }
        }

I don't know why that's slower. To understand the situation better, I've moved that code into a function:

for &(seven, count) in &sevens {
    let scores = first_part(&words, seven);

Where:

#[inline(never)]
fn first_part(words: &[u32], seven: u32) -> [u16; 7] {
    let mut scores = [0u16; 7];

    for word in words {
        if word & !seven == 0 {
            let mut rest = seven;
            for score in scores.iter_mut() {
                if (word & rest & !(rest - 1)) != 0 {
                    *score += 1;
                }
                rest &= rest - 1;
            }
        }
    }

    scores
}
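As an aside, the two bit tricks in that inner loop are: `rest & !(rest - 1)` isolates the lowest set bit, and `rest & (rest - 1)` (written `rest &= rest - 1` above) clears it. A small demonstration with an arbitrary mask:

```rust
fn main() {
    let rest: u32 = 0b101_1000;
    // Isolate the lowest set bit.
    assert_eq!(rest & !(rest - 1), 0b000_1000);
    // Clear the lowest set bit.
    assert_eq!(rest & (rest - 1), 0b101_0000);
}
```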

I've also moved the original code out into a function:

#[inline(never)]
fn first_part(words: &[u32], seven: u32) -> [u16; 7] {
    words
    .iter()
    .filter(|&word| word & !seven == 0)
    .fold([0u16; 7], |mut scores, word| {
        scores
        .iter_mut()
        .fold(seven, |rest, score| {
            if word & rest & !(rest - 1) != 0 { *score += 1 }
            rest & rest - 1
        });
        scores
    })
}

This is the asm of first_part with the original code (with fold):

_ZN10first_part20h0910ff8aaedb046ceaaE:
    pushq   %r15
    pushq   %r14
    pushq   %r13
    pushq   %r12
    pushq   %rsi
    pushq   %rdi
    pushq   %rbp
    pushq   %rbx
    subq    $48, %rsp
    leal    -1(%r9), %r11d
    movl    %r9d, %r14d
    negl    %r14d
    andl    %r9d, %r14d
    andl    %r9d, %r11d
    movl    %r9d, %eax
    notl    %eax
    leal    -1(%r11), %r9d
    movl    %r11d, %ebp
    negl    %ebp
    movl    %ebp, 20(%rsp)
    andl    %r11d, %r9d
    leal    -1(%r9), %edi
    movl    %r9d, %ebp
    negl    %ebp
    movl    %ebp, 16(%rsp)
    andl    %r9d, %edi
    leal    -1(%rdi), %esi
    movl    %edi, %ebp
    negl    %ebp
    movl    %ebp, 12(%rsp)
    andl    %edi, %esi
    leal    -1(%rsi), %ebx
    movl    %esi, %ebp
    negl    %ebp
    movl    %ebp, 8(%rsp)
    andl    %esi, %ebx
    movl    %ebx, %ebp
    negl    %ebp
    movl    %ebp, 4(%rsp)
    leal    -1(%rbx), %r15d
    andl    %ebx, %r15d
    movl    %r15d, %ebp
    negl    %ebp
    movl    %ebp, (%rsp)
    leaq    (%rdx,%r8,4), %r8
    movl    $0, 44(%rsp)
    movl    $0, 40(%rsp)
    movl    $0, 36(%rsp)
    movl    $0, 32(%rsp)
    movl    $0, 28(%rsp)
    movl    $0, 24(%rsp)
    xorl    %r13d, %r13d
    jmp .LBB0_1
.LBB0_3:
    movl    %r14d, %r10d
    andl    %r12d, %r10d
    cmpl    $1, %r10d
    sbbw    $-1, %r13w
    movl    %r12d, %ebp
    andl    20(%rsp), %ebp
    andl    %r11d, %ebp
    cmpl    $1, %ebp
    movl    24(%rsp), %ebp
    sbbw    $-1, %bp
    movl    %ebp, 24(%rsp)
    movl    %r12d, %ebp
    andl    16(%rsp), %ebp
    andl    %r9d, %ebp
    cmpl    $1, %ebp
    movl    28(%rsp), %ebp
    sbbw    $-1, %bp
    movl    %ebp, 28(%rsp)
    movl    %r12d, %ebp
    andl    12(%rsp), %ebp
    andl    %edi, %ebp
    cmpl    $1, %ebp
    movl    32(%rsp), %ebp
    sbbw    $-1, %bp
    movl    %ebp, 32(%rsp)
    movl    %r12d, %ebp
    andl    8(%rsp), %ebp
    andl    %esi, %ebp
    cmpl    $1, %ebp
    movl    36(%rsp), %ebp
    sbbw    $-1, %bp
    movl    %ebp, 36(%rsp)
    movl    %r12d, %ebp
    andl    4(%rsp), %ebp
    andl    %ebx, %ebp
    cmpl    $1, %ebp
    movl    40(%rsp), %ebp
    sbbw    $-1, %bp
    movl    %ebp, 40(%rsp)
    andl    (%rsp), %r12d
    andl    %r15d, %r12d
    cmpl    $1, %r12d
    movl    44(%rsp), %ebp
    sbbw    $-1, %bp
    movl    %ebp, 44(%rsp)
    .align  16, 0x90
.LBB0_1:
    cmpq    %rdx, %r8
    je  .LBB0_4
    movl    (%rdx), %r12d
    addq    $4, %rdx
    testl   %eax, %r12d
    jne .LBB0_1
    jmp .LBB0_3
.LBB0_4:
    movw    %r13w, (%rcx)
    movl    24(%rsp), %eax
    movw    %ax, 2(%rcx)
    movl    28(%rsp), %eax
    movw    %ax, 4(%rcx)
    movl    32(%rsp), %eax
    movw    %ax, 6(%rcx)
    movl    36(%rsp), %eax
    movw    %ax, 8(%rcx)
    movl    40(%rsp), %eax
    movw    %ax, 10(%rcx)
    movl    44(%rsp), %eax
    movw    %ax, 12(%rcx)
    movq    %rcx, %rax
    addq    $48, %rsp
    popq    %rbx
    popq    %rbp
    popq    %rdi
    popq    %rsi
    popq    %r12
    popq    %r13
    popq    %r14
    popq    %r15
    retq

This is the asm of first_part with the modified code (with loops):

_ZN10first_part20h1a62e25daf793c42eaaE:
    pushq   %r14
    pushq   %rsi
    pushq   %rdi
    pushq   %rbp
    pushq   %rbx
    subq    $16, %rsp
    movq    $0, 6(%rsp)
    movq    $0, (%rsp)
    testq   %r8, %r8
    je  .LBB0_4
    movl    %r9d, %eax
    notl    %eax
    leal    -1(%r9), %ebp
    movl    %r9d, %r10d
    negl    %r10d
    andl    %r9d, %r10d
    andl    %r9d, %ebp
    leal    -1(%rbp), %esi
    movl    %ebp, %r9d
    negl    %r9d
    andl    %ebp, %r9d
    andl    %ebp, %esi
    leal    -1(%rsi), %edi
    movl    %esi, %r11d
    negl    %r11d
    andl    %esi, %r11d
    andl    %esi, %edi
    leal    -1(%rdi), %esi
    movl    %edi, %r14d
    negl    %r14d
    andl    %edi, %r14d
    andl    %edi, %esi
    leal    -1(%rsi), %ebp
    movl    %esi, %edi
    negl    %edi
    andl    %esi, %edi
    andl    %esi, %ebp
    leal    -1(%rbp), %esi
    movl    %ebp, %ebx
    negl    %ebx
    andl    %ebp, %ebx
    andl    %ebp, %esi
    movl    %esi, %ebp
    negl    %ebp
    andl    %esi, %ebp
    shlq    $2, %r8
    .align  16, 0x90
.LBB0_2:
    movl    (%rdx), %esi
    testl   %eax, %esi
    jne .LBB0_3
    testl   %esi, %r10d
    je  .LBB0_7
    incw    (%rsp)
    movl    (%rdx), %esi
.LBB0_7:
    testl   %esi, %r9d
    je  .LBB0_9
    incw    2(%rsp)
    movl    (%rdx), %esi
.LBB0_9:
    testl   %esi, %r11d
    je  .LBB0_11
    incw    4(%rsp)
    movl    (%rdx), %esi
.LBB0_11:
    testl   %esi, %r14d
    je  .LBB0_13
    incw    6(%rsp)
    movl    (%rdx), %esi
.LBB0_13:
    testl   %esi, %edi
    je  .LBB0_15
    incw    8(%rsp)
    movl    (%rdx), %esi
.LBB0_15:
    testl   %esi, %ebx
    je  .LBB0_17
    incw    10(%rsp)
    movl    (%rdx), %esi
.LBB0_17:
    testl   %esi, %ebp
    je  .LBB0_3
    incw    12(%rsp)
.LBB0_3:
    addq    $4, %rdx
    addq    $-4, %r8
    jne .LBB0_2
.LBB0_4:
    movq    (%rsp), %rax
    movq    6(%rsp), %rdx
    movq    %rdx, 6(%rcx)
    movq    %rax, (%rcx)
    movq    %rcx, %rax
    addq    $16, %rsp
    popq    %rbx
    popq    %rbp
    popq    %rdi
    popq    %rsi
    popq    %r14
    retq

The two asm listings are quite different. But I don't yet know what causes the significant performance difference... Opinions are welcome.

Generally the C++ code benefits from __builtin_expect() on key conditionals, when run on Haswell. In particular, on if (!(word & ~seven)) ...

Can you show the C++ code with those __builtin_expect()?

Just the one has been enough, for me:

    if (__builtin_expect(!(word & ~seven), false))

You can expect for &seven in &sevens to be slow, too. Use sevens.iter() instead.