Success story: new Rustacean beating C perf in first week

Hi everyone, I'd like to share a success story of one of my co-workers, who beat C-performance in his first week of learning Rust.
This shows, to me, the great potential Rust has as a language: high-level programming, with great low-level performance, accessible to newcomers.

TL;DR: prototypes and benchmark timings here

Background:
My coworker is optimising scripts for a bioinformatics web-service our institute hosts (CRISPR AnalyzeR). It is currently written in a mix of R and Perl, and some parts are not as fast as we'd like.
On the tram to work, he mentioned that he was rewriting a particularly slow Perl script (0m45.006s on the benchmark file) in C, and had gotten it down to 0m4.409s in low-level C, an admirable 10x speedup.
We talked about how hard it was to get back into all the nitty-gritty pointer handling of C, which was my cue to preach the Rust Gospel: productive high-level concepts, memory safe and fast! (with an easy transition to multi-threading for even more performance later!)

I sent him some links to cargo and the Rust book, and he said he'd give Rust a try.
The next morning he had an initial working version in Rust, coming in at 0m14.334s.

Now, before we go into the performance, just the fact that he could get a working version in 24 hours is a good sign in itself. The script iterates over a biological data file (fastQ), which has a multi-line record structure. If the record identifier matches a regex, it is copied to the output file. Quite simple all in all, but still, this means my colleague got himself up to speed in:

  • The Rust language,
  • The cargo ecosystem, and
  • The regex crate

all in 24 hours. So much for Rust being "too hard to be adopted" :slight_smile:

We were both surprised that Rust was so much slower than C, and checked the usual "compiled with --release?" (yes, it was).
Then I noticed that we were spending a whopping 10s of our 14 seconds in system time, which caused me to look at the code.
Two inserts of std::io::BufWriter later, and we're suddenly at 0m3.866s, beating C by about 540 ms (roughly 12%)!
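
The fix amounts to wrapping the raw file handles in buffers, so that each small write lands in memory first and does not turn into a system call of its own. A minimal sketch of the idea, not the actual prototype: the name `copy_matching_records`, the strict four-line record layout, and the plain substring test standing in for the regex are all assumptions, chosen to keep the example standard-library only.

```rust
use std::io::{self, BufRead, BufReader, BufWriter, Read, Write};

/// Copies every four-line FASTQ record whose header line contains `needle`
/// from `input` to `output` and returns the number of records copied.
/// Hypothetical helper: the substring test stands in for the regex match.
pub fn copy_matching_records<R: Read, W: Write>(
    input: R,
    output: W,
    needle: &str,
) -> io::Result<usize> {
    // Without these wrappers, each small read or write can become its own
    // system call, which shows up as "sys" time.
    let reader = BufReader::new(input);
    let mut writer = BufWriter::new(output);

    let mut record: Vec<String> = Vec::with_capacity(4);
    let mut copied = 0;
    for line in reader.lines() {
        record.push(line?);
        if record.len() == 4 {
            if record[0].contains(needle) {
                for l in &record {
                    writeln!(writer, "{}", l)?;
                }
                copied += 1;
            }
            record.clear();
        }
    }
    // Flush explicitly so that write errors are reported here and not
    // silently dropped when the BufWriter goes out of scope.
    writer.flush()?;
    Ok(copied)
}
```

With real files, `input` would be the handle returned by `File::open` and `output` the one returned by `File::create`; the function body stays the same.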

Cue great enthusiasm at my colleague's department, and another 2 people interested in Rust.
All in all, quite the success :slight_smile:

P.S. and if the perf story isn't good enough by itself; when I cloned the rust version, I did "cargo run --release" and everything magically worked, downloading dependencies and everything.
When I ran the C version with "make; ./bin/prototype", all I got was "SEGFAULT" :joy:


Nice story! There is actually room for possible improvement too (of course, I haven't tried it, so I don't know for sure):

  • Use read_line to amortize allocation of the line. (This may not do much if the underlying allocator is already doing this for you.)
  • Even better, use read_until with b'\n' to both amortize allocation and read a line without doing UTF-8 validation. Then do your regex match on &[u8] directly. Other than using read_until instead of a line iterator, you'll need to do use regex::bytes::Regex; instead of use regex::Regex;.
  • When I move the core search code out of ripgrep, you should be able to do line-by-line searching even faster (by never actually enumerating all the lines in the first place).
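
The first two suggestions can be sketched with the standard library alone. This is an illustration under assumptions, not code from the thread: `count_matching_lines` is a made-up name, and a byte-prefix test stands in for `regex::bytes::Regex` so that the example needs no external crate.

```rust
use std::io::{self, BufRead};

/// Counts the lines of `reader` that start with `prefix`, reusing a single
/// byte buffer and never validating the input as UTF-8.
/// Hypothetical helper: the prefix test stands in for a byte-level regex.
pub fn count_matching_lines<R: BufRead>(mut reader: R, prefix: &[u8]) -> io::Result<usize> {
    let mut buf: Vec<u8> = Vec::new(); // allocated once, reused for every line
    let mut count = 0;
    loop {
        buf.clear(); // keeps the capacity, drops the old contents
        // read_until appends up to and including the delimiter;
        // a return value of 0 means end of input.
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break;
        }
        // Strip the trailing newline (and a carriage return, if present).
        let mut line: &[u8] = &buf;
        if line.last() == Some(&b'\n') {
            line = &line[..line.len() - 1];
        }
        if line.last() == Some(&b'\r') {
            line = &line[..line.len() - 1];
        }
        if line.starts_with(prefix) {
            count += 1;
        }
    }
    Ok(count)
}
```

Because the line is only ever looked at as `&[u8]`, a stray non-UTF-8 byte in the input does not cause an error, which `lines()` or `read_line` would report.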

Thanks for the improvement suggestions :heart_exclamation:

  • for a second WIP prototype (the SAM_mapper subfolder in the same repo), my colleague was indeed using read_line() with a re-used String as a buffer, but the logic became very convoluted because we were manipulating the buffer in multiple spots (String::clear() in at least 3 places). I refactored that out again for readability.
    If you really think that it'll help, I can put it back in later, once the required logic has solidified somewhat, but I'd probably try to skip directly to a streaming API (using nom)
  • regex::bytes seems like a worthwhile optimisation: fastq files are generally ASCII, so UTF-8 validation is useless overhead. I wasn't aware that all that was going on in the background!
  • I'd love to see ripgrep's search code available in re-usable form; I was already thinking about ways to write everything in a streaming form, but honestly, once you go from 45 seconds down to 3.6s, there isn't much incentive to keep fiddling with the 3 seconds, when there are still other 40s+ scripts around :wink:

All in all, it's becoming a great learning experience for both my colleague and myself. He has great fun learning a new language and getting awesome performance, I have great fun trying to refactor his "I'm new, coming from C" code into more rustic idioms. Learning all round :mortar_board: :slight_smile:


The cool thing here is that even if your input contained non-ASCII data, you could still skip the UTF-8 validation because the regex engine will do it for you implicitly as part of the match. :slight_smile:

Haha yeah, totally fair. That's a good perspective to take!


That's awesome! I'm glad your coworker was able to get productive this quickly. I'm also very excited to see Rust used at an institute like DKFZ!

Looking at the code, and in addition to what was already suggested above, you might want to try to use String::with_capacity(n) instead of String::from("") and further down use fq_seq.clone_from(l).
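
A minimal sketch of what that could look like. The variable names `fq_seq` and `l` are taken from the suggestion above; the helper names and the capacity of 256 are assumptions, not values from the original code.

```rust
/// Creates the reusable sequence buffer. The capacity is a guess at a
/// typical read length, not a value taken from the original code.
pub fn new_sequence_buffer() -> String {
    String::with_capacity(256)
}

/// Keeps a copy of the most recent sequence line without allocating a new
/// String per record. `remember_sequence` is a hypothetical helper name.
pub fn remember_sequence(fq_seq: &mut String, l: &String) {
    // Unlike `*fq_seq = l.clone()`, clone_from can reuse fq_seq's existing
    // heap buffer when it is already large enough.
    fq_seq.clone_from(l);
}
```

If `l` is a `&str` and not a `&String`, `fq_seq.clear()` followed by `fq_seq.push_str(l)` reuses the buffer in the same way.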


Glad you think so :slight_smile:
I would caution against premature enthusiasm: Rust is still far off from being "institutional" at the DKFZ. As far as I'm aware, 3 out of 3,000 people use Rust at the DKFZ (me, and the two people I badgered into it :wink: ).
Still, that's 200% more than 2 months ago (just me), and infinitely more than 12 months ago :smiley:

Thanks also for the String::clone_from tip, I'd read about it before, but it isn't yet at the forefront of my brain. You're right that this is exactly the right use-case for it.
Saving those few extra allocations each iteration may help a bit! I'll have to discuss with my colleague which with_capacity() values make sense.


If you learn ways to improve your Rust code, you can later use them for the other scripts too.

Very true! I didn't mean to imply I won't be using all the wonderful tips! :scream:
Just that this script is probably not getting many more improvements; the other slowpokes come first. Only 24 hours in a day, after all.


If you are running these small applications on Linux, you can probably achieve a massive performance increase by 1) installing and compiling to the MUSL target, 2) setting transparent_hugepage to madvise, and 3) disabling jemalloc and using the system allocator instead.

Disabling Jemalloc

Set this at the top of your main.rs file:

#![feature(alloc_system)]
extern crate alloc_system;
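
The snippet above needs a nightly compiler. From Rust 1.28 onward the same choice can be made on stable through `#[global_allocator]`, and since Rust 1.32 the system allocator is already the default, so the opt-out only matters on older toolchains. A minimal sketch; the static's name `GLOBAL` is arbitrary, and `boxed_sum` is a hypothetical function that exists only to exercise the allocator:

```rust
use std::alloc::System;

// Route every heap allocation in the program through the platform allocator
// (malloc on Linux) instead of a bundled one.
#[global_allocator]
static GLOBAL: System = System;

/// Hypothetical helper that allocates, only to show the allocator in use.
pub fn boxed_sum(values: &[u64]) -> u64 {
    let copy: Vec<u64> = values.to_vec(); // heap allocation served by GLOBAL
    copy.iter().sum()
}
```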

Installing the MUSL target

rustup target add x86_64-unknown-linux-musl

Compiling to MUSL target

cargo build --release --target x86_64-unknown-linux-musl

Setting Transparent Hugepages to Madvise

sudo sh -c "echo madvise > /sys/kernel/mm/transparent_hugepage/enabled"

Thanks for the tips! We are indeed running on Linux, so I'll definitely give them a try.

Could you explain the reasoning behind them though?
I'm guessing that using the system allocator probably saves the initialisation time of jemalloc, but why use musl? Wouldn't the system libc follow the same argument as the system allocator?
And Hugepages? That sounds like performance wizardry! Please teach me how that's supposed to work! :slight_smile:

Edit to add:

  • Using the system allocator seems to drop run-time by another 300-500 ms on my PC :smiley:
    It's getting hard to measure though, since the variance between runs is also in that range.
  • Using the musl-target seems to be slower by about a second, with a lot more variance
  • I don't have root access to my desktop, so I can't try the hugepage tuning.
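
When the gain is about as large as the run-to-run variance, repeating the measurement and looking at the spread helps to decide whether a change is real. A rough sketch, assuming the work can be wrapped in a closure; `time_runs` and `mean_and_stddev` are made-up names, and for whole binaries a loop around `time` gives the same information.

```rust
use std::time::{Duration, Instant};

/// Runs `work` `runs` times and returns each wall-clock duration.
pub fn time_runs<F: FnMut()>(runs: usize, mut work: F) -> Vec<Duration> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            work();
            start.elapsed()
        })
        .collect()
}

/// Returns (mean, sample standard deviation) in seconds.
/// Returns None when fewer than two samples are given.
pub fn mean_and_stddev(samples: &[Duration]) -> Option<(f64, f64)> {
    if samples.len() < 2 {
        return None;
    }
    let secs: Vec<f64> = samples.iter().map(Duration::as_secs_f64).collect();
    let n = secs.len() as f64;
    let mean = secs.iter().sum::<f64>() / n;
    let var = secs.iter().map(|s| (s - mean).powi(2)).sum::<f64>() / (n - 1.0);
    Some((mean, var.sqrt()))
}
```

A difference between two builds is only convincing when it is clearly larger than the standard deviation of each set of runs.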

I'll ask my colleague to run a new set of benchmarks on his machine, so that they're comparable to the numbers I posted before.

Basically, the GNU libc is rather bloated in comparison to MUSL, and cannot be compiled statically into the binary. MUSL tends to lead to less overhead, and makes the produced binary truly portable as it will have zero dynamic dependencies.

As for jemalloc, it's given me a lot of performance issues in my projects -- consuming volumes of CPU time to allocate and deallocate memory where the system allocator doesn't struggle.

And as for transparent hugepages, they have extremely severe performance consequences with Rust binaries and jemalloc when set to always instead of madvise. A good kernel maintainer will compile the kernel with madvise instead of always.


Thanks for the explanations!
I can see the first two points; we'll have to do some benchmarking to see which way it swings for us.

For huge-pages, that still sounds like black magic to me, but I've found some documentation to keep me occupied.
Our openSUSE (desktop) boxes have this though:

cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

Seems like the openSUSE kernel team could use some prompting :slight_smile:
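
In that output the kernel marks the active mode with square brackets, so the check is easy to script. A small sketch, with `active_thp_mode` and `current_thp_mode` as hypothetical helper names; the sysfs path is the one used in the commands above.

```rust
use std::fs;

/// Extracts the active mode from the contents of the sysfs file, where the
/// kernel marks the current choice with square brackets.
pub fn active_thp_mode(contents: &str) -> Option<&str> {
    contents
        .split_whitespace()
        .find(|word| word.starts_with('[') && word.ends_with(']'))
        .map(|word| &word[1..word.len() - 1])
}

/// Reads the setting from the running system. Returns None if the file is
/// missing (non-Linux, or a kernel built without transparent hugepages).
pub fn current_thp_mode() -> Option<String> {
    let contents = fs::read_to_string("/sys/kernel/mm/transparent_hugepage/enabled").ok()?;
    active_thp_mode(&contents).map(str::to_owned)
}
```

Reading the file needs no root access; only changing the value does.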

So, it's been a while, and my colleague has continued fiddling with the original application in between his other tasks.
It's now basically finished, so you can think of this post as the closing update.

tl;dr: Rust now does more for us, and our implementation got even faster! :smiley:

A big part of the speed increase was discovering the wonderful "needletail" crate (crates.io).
Needletail uses byte-reading, memchr and reused buffers to do (almost) zero-copy iteration. Basically all the things you wonderful people suggested we do, upthread :wink:
In practice, it builds a reader over a file, and that reader then calls your user-supplied closure with each record in the file. Your closure gets handed a lifetime-bounded record backed directly by the read buffer!
As a bonus, it transparently reads gzipped files, which was a separate step in our old perl-based system.
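
To show the shape of that pattern, here is a standard-library sketch of a closure-driven record reader. It is explicitly not needletail's API: `FastqRecord`, `read_line_bytes` and `for_each_record` are made-up names, the record layout is assumed to be strictly four lines, and each line is copied once into a reused scratch buffer, so it avoids per-record allocation without being truly zero-copy.

```rust
use std::io::{self, BufRead};

/// A record that borrows its fields from the reader's scratch buffers.
pub struct FastqRecord<'a> {
    pub id: &'a [u8],
    pub seq: &'a [u8],
    pub qual: &'a [u8],
}

/// Reads one line into `buf` (replacing its contents), without the trailing
/// newline. Returns false at end of input.
fn read_line_bytes<R: BufRead>(reader: &mut R, buf: &mut Vec<u8>) -> io::Result<bool> {
    buf.clear();
    if reader.read_until(b'\n', buf)? == 0 {
        return Ok(false);
    }
    while matches!(buf.last(), Some(&b'\n') | Some(&b'\r')) {
        buf.pop();
    }
    Ok(true)
}

/// Calls `callback` once per four-line record. The record cannot outlive
/// the call, which is what allows the buffers to be reused.
pub fn for_each_record<R, F>(mut reader: R, mut callback: F) -> io::Result<()>
where
    R: BufRead,
    F: for<'a> FnMut(FastqRecord<'a>),
{
    let (mut id, mut seq, mut plus, mut qual) = (Vec::new(), Vec::new(), Vec::new(), Vec::new());
    while read_line_bytes(&mut reader, &mut id)? {
        let complete = read_line_bytes(&mut reader, &mut seq)?
            && read_line_bytes(&mut reader, &mut plus)?
            && read_line_bytes(&mut reader, &mut qual)?;
        if !complete {
            return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "truncated record"));
        }
        callback(FastqRecord { id: &id, seq: &seq, qual: &qual });
    }
    Ok(())
}
```

The borrow checker enforces the contract: a closure that tries to stash `rec.seq` somewhere without copying it will not compile.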

As suggested, we switched to the regex::bytes submodule, to avoid all the UTF-8 validation when reading our (pure-ASCII) fastq files.

In addition to Needletail, we've also started using Rust-bio (crates.io) to handle our nucleotides in the file. They have a nice, fast, reverse complement algorithm, that even works on raw bytes!
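
For readers who haven't met the operation: a reverse complement reverses the sequence and swaps A with T and C with G. A naive byte-level sketch of the idea follows; it is not Rust-bio's implementation, and the function names are illustrative only.

```rust
/// Returns the complement of one nucleotide byte, preserving case.
/// Bytes other than A, C, G, T (for example N) are returned unchanged.
pub fn complement(base: u8) -> u8 {
    match base {
        b'A' => b'T',
        b'T' => b'A',
        b'C' => b'G',
        b'G' => b'C',
        b'a' => b't',
        b't' => b'a',
        b'c' => b'g',
        b'g' => b'c',
        other => other,
    }
}

/// Reverse complement of a sequence given as raw ASCII bytes.
pub fn reverse_complement(seq: &[u8]) -> Vec<u8> {
    seq.iter().rev().map(|&b| complement(b)).collect()
}
```

Since it works on `&[u8]`, it slots in directly after byte-level reading and matching, with no conversion to `String` in between.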

To repeat the context: we use this in production for our CRISPR AnalyzeR web service, which lets you look for CRISPR/Cas9 target sites in your samples, bringing sophisticated biological screens into the reach of labs without dedicated bioinformatics staff.
Speeding up this web service means more customer-friendliness (less waiting!), and fewer resources expended on our side.

timings with time:

-- original PERL version: --
1) gunzip inputfile:                      real 0m 9.674s
2) extract matching fastQ reads:          real 0m46.133s
3) Map to reference genome with bowtie2:  real 0m30.978s  (constant)
4) Map reads to genes:                    real 2m 3.399s
   TOTAL:                                    * 3m30.184s *

++ RUST version ++
1+2) extract fastQ reads from gzip'd input:   real 0m26.334s
  ALT: same, non-gzip'd input:               (real 0m10.971s)
3) Map to reference genome with bowtie2:      real 0m30.978s (constant)
4) Map reads to genes:                        real 0m19.490s
   TOTAL                                           1m16.802s (63% shorter overall!)
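
The totals and the percentage can be re-derived from the individual steps. A small sketch that parses the `XmY.YYYs` strings printed by `time` and adds them up; `parse_time` and `total_seconds` are made-up helper names.

```rust
/// Parses a duration in the `XmY.YYYs` format printed by `time`
/// (optional spaces after the `m`) into seconds.
pub fn parse_time(text: &str) -> Option<f64> {
    let (minutes, rest) = text.trim().split_once('m')?;
    let seconds = rest.trim().strip_suffix('s')?;
    let minutes: f64 = minutes.trim().parse().ok()?;
    let seconds: f64 = seconds.trim().parse().ok()?;
    Some(minutes * 60.0 + seconds)
}

/// Sums a list of step timings; None if any entry fails to parse.
pub fn total_seconds(steps: &[&str]) -> Option<f64> {
    steps.iter().map(|s| parse_time(s)).sum()
}
```

Feeding in the four Perl steps gives 210.184 s and the three Rust steps give 76.802 s, so the saving is 1 - 76.802 / 210.184, or about 63%.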

Updated code is (still) here on github

For those of you wondering why these timings are longer than the original 3-4 seconds I posted: the 3 seconds time was for an incomplete port of the original, just the bare minimum of iterating and regex-matching (but of course, the same incomplete feature-set for both C and Rust).
These new timings are for the full, does-everything, version. Features cost runtime :slight_smile:

Thanks everyone for helping us save an entire two minutes from our webservice!
