Rust vs. C++: Fine-grained Performance

[Edit: I am going to stop updating this version of the article, except for
strict factual correctness. The up-to-date (and better!) version lives at
http://cantrip.org/rust-vs-c++.html .
Many thanks to all who have helped make it so much better.]

[This is an introductory article I have been working on since the 1.2 times,
mainly aimed at C++ coders curious about Rust.]

Rust vs. C++: Fine-grained Performance

by Nathan Myers <ncm@cantrip.org>

If Rust is to take on work previously reserved to C++, we need to know
how well it does what C++ does best. What's fast, what's slow? What's
harder to do, what's easier? I wouldn't know how to look up answers for
those questions, but I can write programs.

I had a C++ program that was just the right length to experiment
with -- one printed page -- and that did nothing tricky to express
in an unfamilar language.[1] (It generates all possible versions of a puzzle
called "Spelling Bee" found in the New York Times Magazine.)

I began by transcribing the program straight across to equivalent Rust
code. The Rust program turned out close to the same length, but only
half as fast. As I made the Rust code more idiomatic, it got faster. At
the same time, I worked to speed up the C++ program, still hewing
to the original one-page limit. After each change, I checked
performance. Few programs get this much attention to optimization.

The C++ version now runs three times as fast as when I started; about as
fast, I think, as it can be made without making it longer, or parallel,
or using third-party libraries. In 82 ms on modern hardware, it performs
some 190 million basic operations (at a cycle per iteration (!)), filtering
to 5 million more-complex operations (at under 20 cycles per). Meanwhile,
the Rust program does about the same operations in about the same time:
a percent or two faster or slower on various hardware. Many variations
that seemed like they ought to run the same speed or faster turned out
slower.

Below, I present each program in fragments. The code may be denser than
you are used to, just to keep it to one printed page. When I write "much
slower", below, it might mean 1.3x to 2x, not the order of magnitude it
might mean outside systems programming. Huge swaths of both languages
are ignored: C++ templates, destructors, futures, lambdas; Rust channels,
threads, traits, cells, lifetimes, borrowing, modules, macros. Those are
essential to really using the language, but this project is deliberately
about the nitty-gritty of coding.

So, the programs. First, dependencies: headers, modules.

C++:

#include <iostream>
#include <fstream>
#include <vector>
#include <iterator>
#include <string>
#include <algorithm>
#include <bitset>

And Rust:

use std::io::prelude::*;
use std::{fs,io,env,process};

Rust wins, here. Rust provides much of its standard library by default,
or with just the first line above, and supports ganging module uses
on one line. Rust's module system is exemplary; any future language
that ignores its lessons courts failure.

Next, argument and input-file processing.

C++:

int main(int argc, char** argv) {
    std::string name = (argc > 1) ? argv[1] : "/usr/share/dict/words";
    std::ifstream fs;
    std::istream& file = name == "-" ? std::cin : (fs.open(name), fs);
    if (!file)
        return std::cerr << "file open failed: \"" << name << "\"\n", 1;

And Rust:

fn main() {
    let (stdin, name) = (io::stdin(),
        env::args().nth(1).unwrap_or(String::from("/usr/share/dict/words")));
    let file : Box<io::Read> = match &*filename {
        "-" => Box::new(stdin.lock()), 
        _ => match fs::File::open(&*filename) {
            Ok(f) => Box::new(f),
            Err(x) => {
                writeln!(io::stderr(), "{}: \"{}\"", x, filename).unwrap();
                process::exit(1)
            }
        }
    };

Point, C++. This stuff just takes a lot of code to do in Rust. Most
people coding Rust would let a toy program like this report usage errors
by "panicking", although it produces very ugly output. Rust does make
it hard to accidentally ignore an I/O error result, which is good so
long as people don't get used to deliberately ignoring them; but in
Rust, it's easiest to let them be caught.

Both programs take an optional file name on the command line, and can
read from stdin, which is convenient for testing. On standard Linux
systems the "words" file is a list of practical English words, including
proper names, contractions, and short words that must be filtered out.
Notable, here, is the use of Box for type-erasure so io::stdin can
be substituted for the fs::File handle. The need for &*filename is
just weird. I don't mind locking io::stdin, to get faster input, but
requiring that the call to lock() be in a separate statement is even
more weird.

Data structure and input setup follows, along with the input loop header.

C++:

    std::vector<unsigned> words;
    std::vector<std::pair<unsigned,int>> sevens;
    std::bitset<32> word;  int len = 0;
    for (std::istreambuf_iterator<char> in(file), eof; in != eof; ++in) {

Rust:

    let (mut sevens, mut words) = (Vec::new(), Vec::new());
    let (mut word, mut len) = (0u32, 0);
    for c in io::BufReader::new(file).bytes().filter_map(Result::ok) {

One point here for Rust. Rust integer types support count_ones(). The
C++ version needs std::bitset for its member count() (which would be
size() if bitset were a proper C++ set) because it is the only way in
C++ to get at the POPCNT instruction without using a non-standard
compiler intrinsic like Gcc's __builtin_popcountl. Using bitset<32>
instead of <26> suppresses some redundant masking operations. C++
bitset doesn't have an operator< (yet), so the bitsets are actually
stored as regular unsigned. (Since the smallest bitset<> on Gcc/amd64
is 64 bits, storing unsigned is more efficient anyway.) Rust has no
equivalent to bitset (yet), so we're lucky we needed just 26 bits.

The actual types of the Rust sevens and words vectors are deduced from
the way they are used way further down in the program. The filter_map
call strips off a Result wrapping, discarding any file-reading errors.

Next, we have the the input state machine.

C++:

        if (*in == '\n') {
            if (len >= 5) {
                if (word.count() == 7)
                    sevens.emplace_back(word.to_ulong(), 0);
                else words.push_back(word.to_ulong());
            }
            word = 0, len = 0;
        } else if (len != -1 && *in >= 'a' && *in <= 'z') {
            word.set(25 - (*in - 'a'));
            len = (word.count() <= 7) ? len + 1 : -1;
        } else { len = -1; }
    }

And Rust:

        if c == b'\n' {
            if len >= 5 {
                if word.count_ones() == 7 {
                        sevens.push(word)
                } else { words.push(word) }
            }
            word = 0; len = 0;
        } else if len != -1 && c >= b'a' && c <= b'z' {
            word |= 1 << (25 - (c - b'a'));
            len = if word.count_ones() <= 7 { len + 1 } else { -1 }
        } else { len = -1 }
    }

These are exactly even. The state machine is straightforward: gather up
and store eligible words, and skip past ineligible words. On earlier
versions of the Rust compiler, I had to use an iterator pipeline, using
.scan(), match, .filter(), and .collect(), at twice the line count,
to get tolerable performance. A match would work here, but the code
would be longer.

Next, we need to sort the collection of seven-different-letter words,
and count duplicates.

C++:

    std::sort(sevens.begin(), sevens.end(),
        [](auto& a, auto& b) { return a.first > b.first; });
    auto p = sevens.begin();
    for (auto s = p; s != sevens.end(); ++p->second, ++s)
        if (s->first != p->first)
            *++p = *s;
    sevens.resize(p + 1 - sevens.begin());

And Rust:

    sevens.sort();
    let sevens : Vec<_> = sevens.iter().skip(1).chain([0].iter())
        .scan((sevens[0], 1u16), |&mut (ref mut prev, ref mut count), &seven|
            if seven != *prev {
               let (p,c) = (*prev, *count); *prev = seven; *count = 1;
               Some(Some((p,c)))
            } else { *count += 1; Some(None) }
         ).filter_map(|pair| pair).collect();

These are close to even. In Rust this could have been a loop like the
C++ code, but that turned out much, much slower. I would rather have the
Rust sevens ordered from high to low, as in the C++ version, but Rust
seems not to offer a nice way to do it. Instead, the code below walks
it backwards. (Walking a container backwards is less pleasant in C++.)

The arguments to scan() are baroque: I had to ask on IRC to learn
how to set up the ref mut elements. The two different scan exits
possible (akin to "break" and "continue") needed double Option
wrapping and unwrapping. The filter_map call that follows strips off
the inner Some wrapping. The call .collect() at the end drives the
lazy iterator to completion, and builds the new sevens vector.

The program to this point is all setup, accounting for small fraction
of run time. Using <map> or BTreeMap, respectively, would make this
last fragment unnecessary, in exchange for 2% more total run time.

I don't know why I can write "let (p,c) = (*prev, *count)", but not
"(*prev, *count) = (seven, 1)". Arguably, too, the operator *
should be overloaded distributively for tuples. Rules on when you have
to use the unary * operator on pointers are unjustifiably finicky.
It would have been simpler to require & when you really want the
address, as with a C++ reference, but it's far too late to fix that.

Rust's convenience operations for booleans, by the way, are curiously
neglected, vs. Result and Option. Some code would read better if
I could write something like:

    ).filter_map(|c|, is(c).then_some(|| f(c))). ...

Code for then_some is just a one-liner, but to be useful it needs to
be standard. Arguably, bool should be an alias for Option(()).

The main loop is presented below, in two phases. The first half is where
the program spends most of its time.

C++:

   for (auto sevencount : sevens) {
        unsigned const seven = sevencount.first;
        short scores[7] = { 0, };
        for (unsigned word : words)
            if (!(word & ~seven)) {
                unsigned rest = seven;
                for (int place = 7; --place >= 0; rest &= rest - 1)
                    if (word & rest & -rest)
                        ++scores[place];
            }

And Rust:

    let stdout = io::stdout();
    let mut sink = io::BufWriter::new(stdout.lock());
    for (&seven, &count) in sevens.iter().rev() {
        let scores = words.iter()
            .filter(|&word| word & !seven == 0)
            .fold([0u16;7], |mut scores, word| {
                scores.iter_mut().fold(seven, |rest, score| {
                   if word & rest & !(rest - 1) != 0
                       { *score += 1 }
                   rest & rest - 1
                });
                scores
            });

This is close to even.

Again, the first two lines in the Rust code seem excessive just to get
faster output. The ".filter" line is executed 190M times. Only some
720K iterations reach the outer ".fold()", but the inner loop runs
5M times and *score is incremented 3M times. That loop is where the
program spends more time than anywhere else. The "fold()" with its
scores state passed along from one iteration to the next is much faster
than the equivalent loop with outer-scope state variables. The two
nested "fold()" calls, as with "collect()" above, drive the lazy
iterators to completion.

The second phase does output based on the scores accumulated above.

C++:

        bool any = false;
        unsigned rest = seven;
        char buf[8]; buf[7] = '\n';
        for (int place = 7; --place >= 0; rest &= rest - 1) {
            int points = scores[place] + 3 * sevencount.second;
            char a = (points >= 26 && points <= 32) ? any = true, 'A' : 'a';
            buf[place] = a + (25 - std::bitset<32>(~rest & (rest - 1)).count());
        }
        if (any)
            std::cout.rdbuf()->sputn(buf, sizeof(buf));
    }
}

And Rust:

        let mut out = *b".......\n";
        let (any, _) = scores.iter().zip(out.iter_mut().rev().skip(1))
            .fold((false, seven), |(mut any, rest), (&score, outc)| {
                let a = match score + 3 * count
                    { 26 ... 32 => { any = true; 'A' }, _ => 'a' };
                *outc = a as u8 + (25 - rest.trailing_zeros()) as u8;
                (any, rest & rest - 1)
            });
        if any
            { sink.write(&out).unwrap(); };
    }
}

I call this about even, too.

Rust's trailing_zeros() maps to the instruction CTZ. C++ offers
no direct equivalent, but bitset<>::count() serves. I found that
iterating over a small array in Rust with (e.g.) "array.iter()" was
much faster than with "&array", although it should be the same. The
latter version seems to take longer to initialize, which matters when,
as here, the iteration is nested inside another. I suppose that will be
fixed someday. (It might be fixed already!) As before, using outer-scope
variables would be much slower than passing fold's state along to its
next iteration.

The fold walks the out array backward, skipping the newline, and
pairing each byte with a corresponding score and a one-bit in seven.
The output is built of u8 bytes instead of proper Rust characters
because operating on character and string types would be slowed by
runtime error checks and conversions. (The algorithm used works only
with ASCII anyhow. Unlike in C++, the out elements are initialized
twice. People complain online about the few choices available for
initializing arrays, which requires the arrays, frequently, to be made
unnecessarily mutable.

Curiously, most versions of the C++ program run only half as fast as
they should on Intel Haswell chips, probably because of branch prediction
failures.[2] (Wrapping the "!(word & ~seven)" in
"__builtin_expect(..., false)" works around the bug.) It's possible
that Gcc will learn someday to step around the Haswell bug by itself,
or new microcode will fix it, but I'm amazed that Intel released Haswell
that way.[3] I don't know yet if it's fixed in Broadwell or Skylake.

Rust has some rough edges, but coding in it was kind of fun. As with
C++, if a Rust program compiles at all, it generally works, more or
less. Rust's support for generics is improving, but is still well short
of what a Rusty STL would need. The compiler was slow, but they're
working hard on that, and I believe its speed will be unremarkable by
this time next year. (I could forgive its slowness if it kept its
opinions on redundant parentheses to itself.) Rust's iterator
primitives string together nicely, for the most part, although scan()
is a bit of a monster. (I have been assured that the documentation for
it is being improved.)

It is a signal achievement to match C++ in low-level performance and
brevity while surpassing it in safety, with reasonable prospects to match
its expressive power in the near future. C++ is a rapidly moving target,
held back only by legacy compatibility requirements, so Rust will need to
keep moving fast just to keep up. While Rust could "jump the shark"
any time, thus far there's every reason to expect to see, ten years
on, recruiters advertising for warm bodies with ten years' production
experience coding Rust.

[1] http://github.com/ncm/nytm-spelling-bee/
[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67153
[3] Maybe I shouldn't be: http://danluu.com/cpu-bugs/

(Thanks to Steve Klabnik and "eddyb" for help improving this article.)

8 Likes

The article is nice, similar comparisons are quite needed. Your Rust code is too much compressed, I can't understand it well. If an idiomatic version of the Rust version is twice longer than the C++ version, then let it be twice longer.

Some quick comments (I haven't yet compiled your code):

Is b'a' enough?

Can you use a sort_by_key or sort_by?

Yep.

Let's see what people answer to this.

Please explain better what you mean.

I think Rust devs don't like this idea.

Here something is missing.

Don't just suppose, do file Rust bug reports!

Thank you. I have fixed the lacuna before "Unlike in C++", and the "('a' as u8)".

Why do you use rs_main() instead of just a main?

Your Rust code a bit reformatted:

use std::io::prelude::{ Read, Write };
use std::{ fs, io, env, process };

fn main() {
    let stdin = io::stdin();
    let filename = env::args().nth(1).unwrap_or("words.txt".into());

    let file : Box<Read> = match &*filename {
        "-" => Box::new(stdin.lock()),
        _ => match fs::File::open(&*filename) {
            Ok(f) => Box::new(f),
            Err(x) => {
                writeln!(io::stderr(), "{}: \"{}\"", x, filename).unwrap();
                process::exit(1);
            }
        }
    };

    let mut sevens = vec![];
    let mut words = vec![];
    let mut word = 0u32;
    let mut len = 0;

    for c in io::BufReader::new(file).bytes().filter_map(Result::ok) {
        if c == b'\n' {
            if len >= 5 {
                if word.count_ones() == 7 {
                    sevens.push(word)
                } else {
                    words.push(word)
                }
            }
            word = 0;
            len = 0;
        } else if len != -1 && c >= b'a' && c <= b'z' {
            word |= 1 << (25 - (c - b'a'));
            len = if word.count_ones() <= 7 { len + 1 } else { -1 }
        } else {
            len = -1
        }
    }

    sevens.sort();

    let sevens : Vec<_> =
        sevens
        .iter()
        .skip(1)
        .chain(&[0])
        .scan((sevens[0], 1u16), |&mut (ref mut prev, ref mut count), &seven|
            if seven != *prev {
                let pair = (*prev, *count);
                *prev = seven;
                *count = 1;
                Some(Some(pair))
            } else {
                *count += 1;
                Some(None)
            }
         )
         .filter_map(|pair| pair)
         .collect();

    let stdout = io::stdout();
    let mut sink = io::BufWriter::new(stdout.lock());

    for &(seven, count) in sevens.iter().rev() {
        let scores =
            words
            .iter()
            .filter(|&word| word & !seven == 0)
            .fold([0u16; 7], |mut scores, word| {
                scores
                .iter_mut()
                .fold(seven, |rest, score| {
                    if word & rest & !(rest - 1) != 0 { *score += 1 }
                    rest & rest - 1
                });
                scores
            });

        let mut out = *b".......\n";
        let (any, _) =
            scores
            .iter()
            .zip(out.iter_mut().rev().skip(1))
            .fold((false, seven), |(mut any, rest), (&score, outc)| {
                let a = match score + 3 * count {
                    26 ... 32 => { any = true; b'A' },
                    _ => b'a'
                };
                *outc = a + (25 - rest.trailing_zeros()) as u8;
                (any, rest & rest - 1)
            });

        if any { sink.write(&out).unwrap(); }
    }
}

One trick with scan is to just not use it at all: it is inherited from Haskell where mutability is finely controlled, but Rust has mutation, and so one can do scan's job with just map or filter_map, e.g.

let mut prev = sevens[0];
let mut count = 1;

sevens.iter().skip(1).chain([0]).iter()
    .filter_map(|&seven| {
        if seven != prev {
            let p = mem::replace(&mut prev, seven);
            let c = mem::replace(&mut count, 1);
            Some((p, c))
        } else {
            count += 1;
            None
        }
    });

This code is better-looking, but no longer fits on one printed page. (I.e. 66 lines, 80 columns.)

The github version is altered slightly to work well with Google Bench (recommended). See the Makefile targets gbench and gbench.run. Cargo might have equally clever benchmarking; if it can do the C++ benching as well, I might switch to it.

Usually whenever I have tried using variables outside the scope of the loop, the program run time doubled. I don't know why. I originally coded the fragment with the sort using a regular loop. It took as long as the whole rest of the program.

I'm afraid I can't really comment on the comparison without knowing what the regular loop code looked like; e.g. did it still use the .skip/.chain iterator (like my filter_map version), or did it use indexing?

It had for seven in sevens {, and sevenscounts.push((prev, count)). I know random indexing is slow. I would like to do it the way the C++ code does, in place, instead of generating another vector, but that seems to require indexing.

I think this is a valuable perspective for devs interested in Rust, but I'm not sure what the real value is in ignoring much of what makes either language unique and introducing an arbitrary constraint on the code; the single printed page.

I tried this, and it didn't slow down the code at all. I will switch to something like this. But using mem::replace doesn't seem like an improvement over

    let pair = (prev, count); prev = seven; count = 1;
    Some(pair)

The constraint to one page provides discipline. How concisely and clearly you can express yourself in limited space reveals a great deal about a language. Nobody could do this in Java at all. You could do it in Perl, and easily, but nobody could read it, and it would be slow. Allow it to be longer, and you invite in all kinds of foofaraw that seems clever but obscures essentials.

1 Like

That's a fair point. I suppose it is an easy enough space constraint to picture too, so it's relatable.

Sweet analysis. Thanks for publishing it.

1 Like

But using mem::replace doesn't seem like an improvement over

So do that; it's even more succinct compared to scan...

For sevenscounts, did you try using with_capacity?

Random indexing shouldn't be too slow, especially in a tight loop where prediction should be good. In a C++ implementation that bounds checks vector::operator[] it should be equalish. I don't really understand what you're doing, since the code in the OP is either wrong or confusing (in the C++ version, how are you using ++ on an element of sevens, i.e. std::pair<unsigned,int>&?), but have you tried it?

I do test the code religiously, besides benchmarking it. The C++ code is certainly allowed to alter the value of a member of a pair in a vector, if you are asking about the ++p->second operation. That's doing the same job as the Rust count += 1, but in-place. I'll try an indexed version and see how it goes.

I wouldn't expect provisioning space in the vectors to make any measurable difference, but I'll measure and then we'll know.

I meant that. I see, the version on GitHub is different; never mind my comment about operator[]. You also have some mismatches in the Rust version - name vs. filename etc.

Edit: If you want to use the double-iterator approach in Rust, you should be able to do it by making the count field a Cell and using immutable iterators.

More canonical (and succinct) is to use the helpers on Result rather than match it against Ok/Err.

    let file: Box<Read> = match &*filename {
        "-" => Box::new(stdin.lock()),
        _ => Box::new(fs::File::open(&*filename).unwrap_or_else(|err| -> _ {
                 writeln!(io::stderr(), "{}: \"{}\"", err, filename).unwrap();
                 process::exit(1);
             }))
    };

Thank you, this is great!

I guess sort_by() is new. I am using it now. Rustc 1.6 has more new goodies than I expected!