We need non-trivial idiomatic workloads for meaningful performance evaluation


#1

Rust’s current approaches to evaluating optimizations are:

  • I did it and now rustc builds faster/smaller
  • The #[bench] I made just for this seems to do better

The first is problematic because rustc is neither idiomatic nor all-encompassing. The latter is problematic because micro-benching correctly is super hard, and even if you do it right, there might be unintended consequences for a different workload.
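
For reference, the kind of ad-hoc micro-benchmark being referred to is usually just a nightly `#[bench]` function; the workload below is a hypothetical stand-in, only meant to illustrate the pattern and its pitfalls:

```rust
// Nightly-only micro-benchmark sketch. The function and workload here are
// hypothetical, standing in for whatever the optimization actually touched.
#![feature(test)]
extern crate test;

use test::{black_box, Bencher};

// Toy workload standing in for the code under test.
fn sum_of_squares(n: u64) -> u64 {
    (0..n).map(|i| i * i).sum()
}

#[bench]
fn bench_sum_of_squares(b: &mut Bencher) {
    // black_box keeps the optimizer from constant-folding the whole loop away.
    b.iter(|| sum_of_squares(black_box(10_000)));
}
```

Even with `black_box`, this only measures a hot loop with everything sitting in cache, which is exactly the limitation being complained about here.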

Therefore I believe we need to collect up a selection of idiomatic non-trivial pieces of code to more robustly evaluate optimizations against. Things that upon inspection should perform well, or at least are “how you would do it”. Heck, maybe even some examples of common mistakes or bad idioms. Things that leverage lots of functionality. Things that we can track and perhaps gate against.

The benchmarks game files are perhaps the closest things we have to this today, but they’re a bit tainted by the fact that they’re written by us to make Rust look good, rather than to get a job done.


#2

What about:

  • A simple interpreter (maybe Scheme and/or BASIC), because it includes string parsing, data structures, and graph operations. We could also run a huge number of existing test programs in that interpreter without having to write them ourselves.
  • Conway’s Game of Life: we could use a simple version with a fixed-size field to evaluate performance on contiguous chunks of memory, and another version that also has reallocations and many cache misses (a rough sketch of the fixed-size version follows below).
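
As a rough illustration of the fixed-size variant, here is a minimal sketch; the grid dimensions, the flat `Vec<u8>` layout, and the toroidal wrap-around are arbitrary choices for the sketch, not a proposed benchmark harness:

```rust
// Minimal fixed-size Game of Life step over a flat, contiguous buffer.
// Grid size and wrap-around behaviour are arbitrary choices for this sketch.
const W: usize = 256;
const H: usize = 256;

fn step(cur: &[u8], next: &mut [u8]) {
    for y in 0..H {
        for x in 0..W {
            let mut live = 0u8;
            for dy in [H - 1, 0, 1] {
                for dx in [W - 1, 0, 1] {
                    if dx == 0 && dy == 0 {
                        continue;
                    }
                    // Toroidal (wrap-around) neighbours.
                    live += cur[((y + dy) % H) * W + (x + dx) % W];
                }
            }
            let alive = cur[y * W + x] == 1;
            next[y * W + x] = match (alive, live) {
                (true, 2) | (true, 3) | (false, 3) => 1,
                _ => 0,
            };
        }
    }
}

fn main() {
    let mut cur = vec![0u8; W * H];
    let mut next = vec![0u8; W * H];
    // Seed a glider.
    for &(x, y) in &[(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)] {
        cur[y * W + x] = 1;
    }
    for _ in 0..100 {
        step(&cur, &mut next);
        std::mem::swap(&mut cur, &mut next);
    }
    println!("live cells: {}", cur.iter().map(|&c| c as u32).sum::<u32>());
}
```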

#3

What about a load generator to accompany the micro-benchmarker? AFAICT, benchmarks are inaccurate because things fit nicely in cache, so memory reads are minimized, and because they run on a dev box, so contention for other resources is minimized. Also, for multi-threaded applications, with no load on the TLB, context switches are reasonably fast.

So maybe we mix the benchmark runs in with some sort of random data generation and combinatorial matching scheme that runs in multiple threads. If we’re benching data structures, maybe leverage quickcheck to generate GBs of data.
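
To make that concrete, a crude version could just spawn memory-hungry background threads around the measured section; everything below (thread count, buffer size, stride, and the sort standing in for the code under test) is an arbitrary placeholder, and quickcheck could replace the hand-rolled noise generation:

```rust
// Crude load-generator sketch: hammer memory from background threads while
// timing the code under test. All sizes and thread counts are arbitrary.
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::Instant;

fn main() {
    let stop = Arc::new(AtomicBool::new(false));

    // Background threads churn through large buffers to pollute caches and the TLB.
    let noise: Vec<_> = (0..4)
        .map(|_| {
            let stop = Arc::clone(&stop);
            thread::spawn(move || {
                let mut buf = vec![0u64; 1 << 22]; // ~32 MiB per thread
                let mut i = 0usize;
                while !stop.load(Ordering::Relaxed) {
                    let idx = i % buf.len();
                    buf[idx] = buf[idx].wrapping_add(i as u64);
                    i = i.wrapping_add(4099); // strided access to defeat the prefetcher
                }
                buf[0] // return something read from the buffer so the work isn't trivially removable
            })
        })
        .collect();

    // The "benchmark": sort some data while the noise threads are running.
    let mut data: Vec<u64> = (0..1_000_000).rev().collect();
    let start = Instant::now();
    data.sort();
    println!("sorted under load in {:?}", start.elapsed());

    stop.store(true, Ordering::Relaxed);
    for h in noise {
        let _ = h.join();
    }
}
```

Whether something like this yields stable, comparable numbers is exactly the open question; the point is only to get cache and TLB pressure into the measurement.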

If we don’t like that idea, Big Data applications are a good source of resource drain, as they can be both CPU- and IO-intensive. Let’s go to work on parallel document clustering/classification in Rust, like a clone of Vowpal Wabbit.


#4

I have a benchmark game for CSV processing that @BurntSushi used to tune up the performance of rust-csv. My game is terrible, but all benchmarks are terrible for the reasons @Gankro mentions. Still, it’s one that you can add. I think @BurntSushi adapted the test into his own larger-scale test to flush out more performance issues.

Other examples could be those written for other systems where performance is claimed to be a goal, e.g. the Hadoop examples have wordcount, terasort, page rank, and a pi calculation. Also, as @rrichardson says: a parallel document classifier like Vowpal Wabbit. Or something like Lucene to index documents (as opposed to just classifying them). I would also add a Rust implementation of RDDs to the list.
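
For the wordcount case in particular, the single-threaded core is tiny; a sketch along these lines (the command-line input and whitespace tokenization are placeholder choices), with the interesting part for benchmarking being how it behaves on large inputs and how it parallelizes:

```rust
// Minimal wordcount sketch; the input path and whitespace tokenization are
// placeholders, and a real benchmark would parallelize across chunks or files.
use std::collections::HashMap;
use std::env;
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let path = env::args().nth(1).expect("usage: wordcount <file>");
    let reader = BufReader::new(File::open(path)?);

    let mut counts: HashMap<String, u64> = HashMap::new();
    for line in reader.lines() {
        for word in line?.split_whitespace() {
            *counts.entry(word.to_lowercase()).or_insert(0) += 1;
        }
    }

    // Print the ten most frequent words.
    let mut top: Vec<_> = counts.into_iter().collect();
    top.sort_by(|a, b| b.1.cmp(&a.1));
    for (word, n) in top.into_iter().take(10) {
        println!("{word}\t{n}");
    }
    Ok(())
}
```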

A simple interpreter (maybe Scheme and/or BASIC), because it includes string parsing, data structures, and graph operations. We could also run a huge number of existing test programs in that interpreter without having to write them ourselves.

An interpreter is a real rabbit hole. If you want performance, you add in all sorts of optimizations, and I expect a lot could eventually be thrown over the fence to LLVM. That turns out to be a lot of effort for something that people won’t necessarily be using.

Conway’s Game of Life

+1


#5

@nrc has been looking into setting up a benchmarking bot and may be interested in your efforts.