I'm trying to learn Rust by writing some libraries for parsing large text files used in bioinformatics. Much of this processing involves splitting lines of text into their respective fields (e.g. tab- or whitespace-delimited).
I'm finding that Rust's standard string manipulation functions, such as split and split_whitespace, are very slow compared to similar functions in Go.
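For concreteness, the kind of splitting I mean is simply this (the sample line below is made up for illustration, not taken from the real data):

fn main() {
    // A made-up whitespace-delimited line, standing in for one line of the real input
    let line = "Y\t2655180\trs11575897\tG\tA\t100\tPASS";
    let fields: Vec<&str> = line.split_whitespace().collect();
    assert_eq!(fields, ["Y", "2655180", "rs11575897", "G", "A", "100", "PASS"]);
}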
To illustrate this, I've written Rust and Go example programs that do essentially the same processing using equivalent structs and functions.
Benchmarking these implementations against modestly large files (e.g. [1] below), I'm finding that the Rust executable (built with cargo build --release) is about 4x-5x slower than the corresponding Go executable.
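The Rust side of the benchmark is essentially a loop that reads the file line by line and feeds each line to the parse function shown below; roughly like this (a simplified sketch, with a placeholder file path rather than my actual setup):

use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() {
    // Placeholder path; in the benchmark this is the unzipped VCF file from [1]
    let file = File::open("input.vcf").expect("failed to open input file");
    let reader = BufReader::new(file);

    let mut records = Vec::new();
    for line in reader.lines() {
        // parse() and the Record struct are as shown further down
        records.push(parse(&line.unwrap()));
    }
    println!("parsed {} records", records.len());
}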
The bottleneck in the Rust code is, as you might suspect, here:
fn parse(s: &str) -> Record {
    // Record::new()
    Record {
        words: s.split_whitespace().map(str::to_owned).collect(),
    }
}
If I replace the Record creation and string splitting with the commented-out empty struct creation (and make the corresponding edits in the Go code), then the Rust implementation is faster than the Go implementation, i.e. the act of struct creation itself is faster in Rust than in Go.
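For clarity, the empty struct creation I'm swapping in is nothing more than this (a sketch, assuming Record::new() just builds an empty words vector):

impl Record {
    // "Empty struct creation": no splitting, no per-field allocations
    fn new() -> Record {
        Record { words: Vec::new() }
    }
}

// so the comparison variant of parse() becomes just:
fn parse(_s: &str) -> Record {
    Record::new()
}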
My real-world code is of course more complex than the examples shown above, but I've boiled the major bottlenecks I'm encountering so far down to variations on string splitting.
Is there a more idiomatic way I should be handling this type of string processing in my Rust code? Any tips or suggestions for more performant string processing in Rust would be greatly appreciated.
[1] For testing I'm using the unzipped version of this file (FTP link):
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrY.phase3_integrated_v2a.20130502.genotypes.vcf.gz
Sorry for the verbatim link, but Discord only allows new users to include two links per post.