What about a load generator to accompany the micro-benchmarker? AFAICT, micro-benchmarks tend to be inaccurate for a few reasons: the working set fits nicely in cache, so memory reads are minimized; the run happens on an otherwise idle dev box, so contention for other resources is minimized; and for multi-threaded applications, with no pressure on the TLB, context switches are unrealistically fast.
So maybe we mix the benchmark runs in with some sort of random data generation and combinatorial matching scheme running across multiple threads. If we're benching data structures, we could leverage quickcheck to generate GBs of input data.
If we don't like that idea, Big Data applications are a good source of resource drain, since they can be both CPU- and IO-intensive. Let's go to work on parallel document clustering/classification in Rust, something like a clone of Vowpal Wabbit.