Rust vs Go string manipulation -- performance

I'm trying to learn Rust by writing some libraries for parsing large text files used in bioinformatics. Much of this text file processing involves splitting lines of text into their respective fields (e.g. tab-delimited or white space delimited).

I'm finding that Rust's standard string manipulation functions, such as split and split_whitespace are very slow in comparison to similar functions in Go.

To illustrate this I've written Rust and Go example code that does essentially the same processing using equivalent structs and functions:

  1. Rust example -- Rust code
  2. Go example -- Go code

Benchmarking these implementations by running them against modestly large files (e.g. [1] below), I'm finding that the Rust executable (built using cargo --release) is about 4x-5x slower than the respective Go executable.

The bottleneck in the Rust code is, as you might suspect, here:

    fn parse(s: &str) -> Record {
        // Record::new()
        Record {
            words: s.split_whitespace().map(str::to_owned).collect(),
        }
    }

If I replace the Record creation and string splitting with the commented out empty struct creation (and make the same respective edits in the Go code), then the Rust implementation is faster than the Go implementation (i.e. illustrating that the act of struct creation in Rust is faster than that in Go).

My real world code is of course more complex than the examples illustrated above, but I've boiled down the major bottle necks I'm encountering so far to variations on string splitting.

Is there a more idiomatic way I should be handling this type of string processing in my Rust code? Any tips or suggestions for more performant string processing in Rust would be greatly appreciated.

[1] For testing I'm using the the unzipped version of this file (ftp link).

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz

Sorry for the verbatim text but Discord only wants new users to create two links per post :frowning:

6 Likes

If you can, use Record<'_> and avoid str::to_owned step.

Or use Record { unsplit_words: s.to_owned() } and expose self.unsplit_words.split_whitespace() iterator if you don't need random access.

If you know max/typical number of words, try using ArrayVec or SmallVec instead of Vec, or at least use Vec::with_capacity(typical number of parts) and .extend it instead of collecting.

Try using jemallocator.

And in case it's not allocation that's costly, but splitting, try cutting corners on Unicode support with split(' ') or maybe even s.as_bytes().split(b' ').

5 Likes

I apologize, I'm not sure what you're suggesting here. Do you mean to restructure Record to use explicit lifetimes and something like words: Vec<&'a str> instead? If so, I tried going this route and ran into various problems with lifetimes related to the lines from the BufRead; I think the key issue being that I don't really grok lifetimes yet in Rust.

I imagine this will likely get you a lot closer to the Go performance - the docs for the Fields method in Go say (emphasis mine):

Fields splits the string s around each instance of one or more consecutive white space characters, as defined by unicode.IsSpace, returning a slice of substrings of s or an empty slice if s contains only white space.

This implies (but I don't know for sure!) that Go isn't allocating a new string for each entry, it's just returning subsections of the original string. However, by calling to_owned in your Rust code, you're explicitly asking for a new String to be allocated for each substring. Using string slices definitely sounds like a better idea if you're working with a massive piece of input data.

Here is a minimal example showing how to use string slices for your Record type.

Also, the obligatory suggestion: are you running your Rust application in release mode (e.g. cargo run --release)? Without that, your application will be compiled without optimizations. This should be the first thing you do before trying anything else!

10 Likes

Yes, I was suggesting Vec<&'a str> which would reference a string stored elsewhere instead of storing it itself. And yes, it would complicate lifetimes.

In a for line in buf.lines() loop each line lives only for one loop iteration, so the borrow checker won't let you keep a Record that references a line that has been thrown away. You'd need all_lines = buf.lines().collect() to keep them all, and then make Record<'s> that reference any of the lines.

Or just load the whole file with std::fs::read_to_string() and then split it on newlines instead of using BufReader. You're ending up with the whole file in memory anyway.

11 Likes

This is indeed exactly the problem. In Go, this is quite natural to do because of its GC. This is actually a really nice example of something that the presence of a GC really helps with. The "natural" way to write the code also happens to be quite fast because it minimizes allocation. In Rust, the straight-forward translation ends up allocating new space for every field, which slows it down substantially.

The suggestions you've got are all good, but depending on what I was doing, I would try to basically do what Go is doing, but more explicitly. Namely, I'd store the full line in each record and then add a separate Vec of offsets for each word. It's not precisely equivalent to what Go does, but it's pretty close.

Here's the code:

use std::env;
use std::error::Error;
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::time::Instant;

#[derive(Debug)]
struct Record {
    words: String,
    offsets: Vec<(usize, usize)>,
}

impl Record {
    fn parse(s: &str) -> Record {
        Record {
            words: s.to_owned(),
            offsets: s.split_whitespace().map(|x| {
                let start =
                    (x.as_ptr() as isize - s.as_ptr() as isize) as usize;
                let end = start + x.len();
                (start, end)
            }).collect(),
        }
    }

    fn word(&self, i: usize) -> &str {
        let (start, end) = self.offsets[i];
        &self.words[start..end]
    }
}

#[derive(Debug)]
struct Table {
    records: Vec<Record>,
}

impl Table {
    fn new() -> Table {
        Table {
            records: Vec::new(),
        }
    }

    fn parse(mut r: impl BufRead) -> Table {
        let mut tbl = Table::new();
        let mut line = String::new();
        while r.read_line(&mut line).unwrap() > 0 {
            let rec = Record::parse(&line);
            tbl.records.push(rec);
            line.clear();
        }
        tbl
    }
}

fn main() -> Result<(), Box<dyn Error>> {
    let args: Vec<_> = env::args().collect();
    if args.len() < 2 {
        panic!("Must specify file argument")
    }

    let file = File::open(&args[1]).unwrap();

    let timer = Instant::now();
    let tbl = Table::parse(BufReader::new(file));
    let duration = timer.elapsed();

    println!("Time elapsed to parse records: {:?}", duration);
    println!("Number of records: {}", tbl.records.len());
    println!("word 7 of record 62167: {:?}", tbl.records[62167].word(7));

    Ok(())
}

And for the comparison, here's Rust:

$ time ./target/release/rust ../ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf
Time elapsed to parse records: 1.107500897s
Number of records: 62168
word 7 of record 62167: "AA=T;AC=1;AN=1233;DP=4935;NS=1233;VT=SNP;AMR_AF=0.0000;AF=0.0008;AFR_AF=0.0000;EUR_AF=0.0042;SAS_AF=0.0000;EAS_AF=0.0000"

real    1.175
user    0.658
sys     0.515
maxmem  1587 MB
faults  0

And now Go:

$ time ./pgo ../ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf
Time elapsed to parse records: 1.284818437s
Number of records: 62168
word 7 of record 62167: AA=T;AC=1;AN=1233;DP=4935;NS=1233;VT=SNP;AMR_AF=0.0000;AF=0.0008;AFR_AF=0.0000;EUR_AF=0.0042;SAS_AF=0.0000;EAS_AF=0.0000

real    1.347
user    4.902
sys     0.503
maxmem  1441 MB
faults  0

@kornel's suggestion of just loading the entire file into memory is also a good one. If your files are only a couple hundred MBs, then that might actually be the best thing.

21 Likes

I had written essentially the same code for the Record<'a> implementation as your example but I can't figure out how to combine it with Table<'a> and BufRead (see kornel's answer below) to make all the lifetimes work. Might you be able to illustrate this?

And yes, I was running with --release.

In this scenario can all_lines be sorted as a Vec<String> in the Table itself? I tried to implement this, but ran into lifetime issues again.

For the purposes of this illustrative example I'm simply reading the whole file into memory, but in general wouldn't do so, hence the desire to use BufRead instead of std::fs::read_to_string.

Thanks! This helps me grok what's going on..

In an attempt to be clever and test my understanding of lifetimes I tried to combine your approach with the suggestions of kornel and 17cupsofcoffee re: having Record hold Vec<&'a str>. I came up with this new version of Record:

#[derive(Debug)]
struct Record<'a> {
    line: String,
    words: Vec<&'a str>,
}

impl<'a> Record<'a> {
    fn new() -> Record<'a> {
        Record {
            line: String::new(),
            words: Vec::new(),
        }
    }

    fn parse(s: &'a str) -> Record<'a> {
        let mut r = Record::new();
        r.line = s.to_owned();
        r.words = r.line.split_whitespace().collect();
        r
    }
}

I thought that in this case the lifetime of Record.line would be the same as that of Rec.words and that everything should just work. But I'm running into the message that r.line is owned by the current function. I'm guessing this is me running into the borrow checker again?

Right. If you could do something like that simply, then that's certainly what I would have done instead of going the more indirect approach of using offsets. Basically, lifetimes are annotations in a Rust program. You can't use them to define lifetimes. When you parameterize Record over 'a, then you're saying that it is tied to the lifetime of some other data that it does not own. Consider what would happen if, for example, something tried to mutate or empty the contents of line while you had a Vec<&'a str> pointing into line. In your code, there is nothing that would prevent such a mutation, which could very easily lead to UB.

If you want words to be a Vec<&str> that points into the sibling owned field of line, then AFAIK, you still need to use unsafe to accomplish that. See also: GitHub - Kimundi/owning-ref-rs: A library for creating references that carry their owner with them.

In this case there is a better solution, don't store line as a String, but as &'a str. This way you can directly refer to subslices of it.

#[derive(Debug)]
struct Record<'a> {
    line: &'a str,
    words: Vec<&'a str>,
}

impl<'a> Record<'a> {
    fn new() -> Record<'a> {
        Record {
            line: "",
            words: Vec::new(),
        }
    }

    fn parse(s: &'a str) -> Record<'a> {
        let mut r = Record::new();
        r.line = s;
        r.words = r.line.split_whitespace().collect();
        r
    }
}
1 Like

Have you tried putting that code into the OP's original program? That only works if you read the entire file into memory, or if you can throw the record away before parsing the next line. The OP's code is reading it line by line and accumulating all records.

4 Likes

I seem to have missed that somehow, thanks for the correction

3 Likes

I don't see how that is possible.

One has a huge string in memory and wants to split it into a million sub-strings. To my naive mind all those new sub-strings are allocated somewhere on the heap. I don't see how Go having GC or Rust not makes any difference.

Now, to my naive old C programmer mind, if one has a huge string in memory and want's to parse and locate parts of it one does not need to allocate anything anywhere. GC or otherwise. On only has to create references to those sub strings in the existing string.

That is to say the result of splitting should just be a vector of slices of the huge input string. Nothing new created anywhere, no copies, no allocations.

I suspect that if one did that it would be way faster than Go.

Or perhaps that is what has been suggested above already...

1 Like

Taking a sub-slice of a string in Go or Rust does not involve an allocation. However, a sub-slice in Rust is a borrowed value where as a sub-slice in Go is indistinguishable in the type system from its original string. In Rust, if you convert a borrowed string into an owned string, then that involves a heap allocation of the sub-slice's string data. In Go, no such thing is required because you don't need to do any conversion from a borrowed sub-slice to an owned sub-slice. A garbage collector lets you throw away the original string that you sliced. The sub-slices will still point to it, so memory isn't freed. But at no point are new allocations created for the data inside the sub-slices.

9 Likes

@ZiCog -- here are two specific examples that I believe illustrate the problem clearly.

  1. Allowed string slices
  2. Disallowed string slices

These have the same basic code, but note that the line borrowing in the Table::parse in the second example is not allowed.

2 Likes

However, if we give up wanting to reproduce the exact same program, and we are only interested in making it fast(er), then reading the entire string is indeed probably going to be even faster. (Of course, as long as it fits in memory. In bioinformatics, it's not uncommon to have huge strings that don't.)

1 Like

Seems like a violation of the spirit of the OP's program though. There are lots of different tricks. You might use memory maps for example.

Sometimes a GC makes things more convenient without completely sacrificing performance. I don't think that should be interpreted as a controversial statement. I personally am always interested in finding such cases.

6 Likes

Perhaps I am missing a point here.

One of the first programs I wrote in Rust was to tackle the problem of finding all the anagrams in the Insane British English dictionary you get with Debian.

Being of a C mind set and knowing nothing of Rust's "slices" what do I do? I read the whole dictionary file into memory, I scan it for line endings and skip any spaces, I create my own structs that contain indices to the start and end of each word in that string. Then I set about finding the anagrams in there.

No allocations required anywhere apart from loading the dictionary in the first place and recording those word begin/end structs.

The end result was a program that worked as fast as those who set the challenge and were coding in C/C++. See here: GitHub - ZiCog/insane-british-anagram-rust: Rust program to find anagrams in the Debian british-english-insane dictionary file.

No problems with ownership or borrowing in sight.

Having learned more about Rust and slices etc I thought I should revisit that code and replace my home made range structs with Rust slices. I like to think that would work as well or better.

I don't believe in that. In particular, I don't see why the goal should be solving a problem in exactly the same way. If performance is desired, why couldn't we rewrite the program in a manner that yields the same results but faster? Unless we are creating a benchmark for the allocator or for copying strings, that seems like a pointless restriction for a real-life problem.