Optimizing string search code for large files

The following program performs quite fast... Far faster than any other equivalent program I've written in other languages on my system. Before I show it, please note that I am aware that this type of optimization is machine-dependent and I do have a pretty fast machine with 12 cores/24 logical and memory is also not a constraint for me. As such, this program is intended to load a very large text file (I tested with a file with over 10 million lines of text) and search each line for a term. In this example, the term is hardcoded as "test" and the file has the English dictionary repeated about 20 times over.

I used rayon to speed up the processing and it has quite a large impact - without rayon, the processing phase takes 700+ milliseconds (ms) on my machine. With the par_lines() method, it takes only ~170ms!

But, I haven't written a Rust program in a while and I am by no means an expert. Am I making unnecessary memory operations here? Can we speed this up even more? Note that I chose to indent the benchmarking lines to differentiate them from the core program. Thanks. I'm on nightly and benchmarking with --release, btw.

extern crate rayon;
use rayon::prelude::*;
use std::io;
use std::io::prelude::*;
use std::fs::File;
use std::time::Instant;

fn main() -> io::Result<()> {
    let now = Instant::now();

    let mut f = File::open("words_huge.txt")?;
        let time_to_open = now.elapsed().as_millis();
        
    let mut whole_file = String::new();

    let n = f.read_to_string(&mut whole_file)?;
        let time_to_read = now.elapsed().as_millis() - time_to_open;

    let mut lines = whole_file.par_lines(); // Split it into lines
        let time_to_line_split = now.elapsed().as_millis() - time_to_read;

    let mut results: Vec<&str> = lines.filter(|l| l.contains("test")).collect();
        let time_to_process = now.elapsed().as_millis() - time_to_line_split;
    
    for r in results
    {
        println!("{}", r);
    }

    let time_to_print = now.elapsed().as_millis() - time_to_process;
    println!("Time to open: {}", time_to_open);
    println!("Time to read: {}", time_to_read);
    println!("Time to line split: {}", time_to_line_split);
    println!("Time to process: {}", time_to_process);
    println!("Time to print: {}", time_to_print);
    println!("Total time elapsed: {}", now.elapsed().as_millis());
    Ok(())
}

So how about that utf-8, are you maybe working with ascii only data, or data where non-ascii doesn't matter? You could work on bytes to avoid utf8-validation. I've not found this to be a tremendous optimization, but depending on what you really do with the data it might add up.

You can speed up that chopping up into lines using the memchr crate, like I did here (note this is on u8 because I don't care for utf8).

Then I guess you can try applying your filter via rayon, too (not used rayon a lot, I think you need par_iter() here?).

That contains check seems ripe for simd too, but you didn't say a lot about the properties (length? is it ascii only?). A quick google brought up https://github.com/jneem/teddy, maybe try that? Or try the regex crate which I understand has optimizations for substring search that might do this, too?

Overall, maybe another approach is better, too. Do you really need the lines, or just the occurences of the pattern? If so, why split into newlines at all? Just a thought :slight_smile:

Just a couple of small notes for ya. :slightly_smiling_face:

It should already be using SIMD. See here: https://github.com/rust-lang/rust/blob/316a391dcb7d66dc25f1f9a4ec9d368ef7615005/src/libcore/str/pattern.rs#L305 --- Although, I guess it isn't as good as I thought, since it's only using the fallback implementations provided by memchr (which should be auto-vectorized, but they are much slower than the _mm_cmpeq approach through more explicit vectorization).

Teddy is primarily for multiple pattern matching, and shouldn't beat a well optimized single string searcher. Teddy is now also available as part of the aho-corasick crate.

Using a Regex here could be faster, but only incidentally, perhaps because it will use better optimized versions of memchr.

4 Likes

moveax41h,

Can we speed this up even more?

I'll say. If it takes 170ms to find "test" in 10 million lines of dictionary using a fast 12 core machine you have a long way to go!

If I concatenate a bunch of copies of the british-english-insane dictionary from Debian into a 100MByte, 10 million line file, I can find all the occurrences of "test" in 342ms. This on a 7 year old 4 core desk top box. Does grep even use multiple cores?

$ ls -lh big8.dict
-rw-rw-rw- 1 zicog zicog 105M Aug 31 16:37 big8.dict
$ wc -l  big8.dict
10468416 big8.dict
$ time grep test big8.dict | wc -l
9856

real    0m0.330s
user    0m0.250s
sys     0m0.078s
$
$ time grep test big8.dict > res.txt

real    0m0.328s
user    0m0.234s
sys     0m0.078s
$

@ZiCog
Also, I'm not running on bare metal though... Sorry for the confusion. I've run this benchmark in a virtual machine with only 8 logical cores allocated and 32GB of RAM. But that should be enough for me to further optimize this given those specs.I do all of my dev in VMs these days but I could run it on the host just for benchmarking at some point if I wanted to. At this point, I am new to this type of work so things that may be obvious to you are not so obvious to me. For example:

I would assume that at some point, the overhead of starting more threads is not worth it, so I'm not sure that using all 12 proc cores would actually speed this up. I will have to experiment some more with that. But nevertheless, I'm currently looking to make code changes to optimize and I can up the hardware specs later on if desired.

Btw I'm on a Ryzen 9 3900X.

I think it was mentioned already but if your files are of the order of a few megabytes or gigabytes I'd be reading the whole thing into memory and scanning it there.

Forget about splitting it into vectors of words or whatever, that is a huge waste of time with memory allocations and such. Just scan in place.

As it happens my first ever Rust program scans the british-english-insane dictionary of Debian and finds all the anagrams in it. I read it into memory and scan it byte by byte. https://github.com/ZiCog/insane-british-anagram-rust/blob/master/src/main.rs (That's probably not such good Rust code from a Rust newbie so read with caution).

If the whole job is taking 100s of ms then firing up 8 or 16 or so cores to handle it in parallel should see great gains. They are all acting independently and starting a thread on a core is only a small chunk of time compared to the work they are doing.

I have not tried it yet but I have read great things about the "nom" parser so that might be of interest to you: https://github.com/Geal/nom

1 Like

The first thing I notice is:

let mut lines = whole_file.par_lines(); // Split it into lines
let time_to_line_split = now.elapsed().as_millis() - time_to_read;

You're measuring nothing here! The par_lines method returns a lazy iterator, and it does nothing until the call to collect.

1 Like

Thanks for the suggestions. Unfotunately it looks like Teddy is built to match on multiple patterns but only returns the first match. I'm looking to search for 1 pattern only and return multiple matches. I didn't see functionality there for it to do that as the only method which returns a vector is "patterns" which presumably just returns the pattern vector rather than all of the matches. The find() method states that it returns the first match or None.

@alice you're right, this was evident by the fact that my readout always displayed 0ms for the read portion.

You can build an index and get down to 2ms. That's what I can get from solr.

As I said above, Teddy is specifically designed for searching for multiple patterns. If you want multiple occurrences, then use find_iter. The packed sub-module is specifically a lower level API. If you need to match multiple patterns, you should just use AhoCorasick. But it sounds like you don't need to match multiple patterns, so there's no reason to be looking at the aho-corasick crate or the Teddy algorithm.

solr? Not sure what this is.

https://lucene.apache.org/solr/