Learning parallel processing with Rust

Hi all, I'm new to Rust, so I created a little script as a learning test. The goal is to process a file as fast as possible. Right now it takes about 8 minutes. The file is about 11 GB, my CPU has 10 cores, and I have about 16 GB of RAM. This is my code:

use std::fs;
use std::sync::Arc;
use regex::Regex;
use std::time::Instant;
use std::thread;
use rayon::prelude::*;


fn main() {

    // read the file
    let start_reading_time = Instant::now();

    let file_as_str = match fs::read_to_string("testing.txt") {
        Ok(data) => data,
        Err(e) => {
            eprintln!("Error reading file: {}", e);
            return;
        }
    };
    let end_reading_time = Instant::now();
    let elapsed_reading_time = end_reading_time - start_reading_time;

    println!("elapsed reading time : {:?}", elapsed_reading_time);


    // split string into array lines
    let start_split_time = Instant::now();
    let lines: Vec<String> = file_as_str.lines().par_bridge().map(|f|{
        f.to_string()
    }).collect();
    let end_split_time = Instant::now();
    let elapsed_split_time = end_split_time - start_split_time;

    println!("elapsed split time : {:?}", elapsed_split_time);

    // main process
    let start_time = Instant::now();

    let arc_lines = Arc::new(lines);

    let lines_handle = {
        let arc_lines = Arc::clone(&arc_lines);
        thread::spawn(move || {
            println!("count_lines  : {}", count_lines(arc_lines));
        })
    };

    let spaces_handle = {
        let arc_lines = Arc::clone(&arc_lines);
        thread::spawn(move || {
            println!("count_spaces  : {}", count_spaces(arc_lines));
        })
    };

    let words_handle = {
        let arc_lines = Arc::clone(&arc_lines);
        thread::spawn(move || {
            println!("count_words  : {}", count_words(arc_lines));
        })
    };

    let paragraphs_handle = {
        let arc_lines = Arc::clone(&arc_lines);
        thread::spawn(move || {
            println!("count_paragraphs  : {}",count_paragraphs(arc_lines));
        })
    };

    lines_handle.join().expect("Thread panicked");
    spaces_handle.join().expect("Thread panicked");
    words_handle.join().expect("Thread panicked");
    paragraphs_handle.join().expect("Thread panicked");

    let end_time = Instant::now();
    let elapsed_time = end_time - start_time;

    println!("elapsed main process time : {:?}", elapsed_time);
}

fn count_lines(file_as_str: Arc<Vec<String>>) -> usize {
    file_as_str.par_iter().count()
}

fn count_spaces(file_as_str: Arc<Vec<String>>) -> usize {
    let re = Regex::new(r"\s+").unwrap();
    file_as_str.par_iter().map(|x|{
        re.find_iter(x).count()
    }).sum()
}

fn count_words(file_as_str:Arc<Vec<String>>) -> usize {
    let re = Regex::new(r"\b\w+\b").unwrap();
    file_as_str.par_iter().map(|x|{
        re.find_iter(x).count()
    }).sum()
}

fn count_paragraphs(file_as_str: Arc<Vec<String>>) -> usize {
    let re = Regex::new(r"\n\s*\n").unwrap();
    file_as_str.par_iter().map(|x|{
        re.find_iter(x).count()
    }).sum()
}

Is it still possible to optimize my code? If yes, may I have a hint? Thanks all.

Disclaimer: I'm not an optimisation expert by any means.

First, file_as_str.par_iter().count() makes no sense to do. Just use file_as_str.len().

But I would rewrite all the counter functions as a single function that loops over a simple &str, finding and counting lines, spaces, words, and paragraphs in one pass. Reading from memory is slow, so I'd try to do it all in one pass, probably using memchr rather than regex. Incidentally, I wouldn't do the line-breaking pre-pass at all. I'd also try memory-mapping the file instead of a regular read, since the file size is close to the amount of RAM you have; let the OS handle paging for you.
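A one-pass counter along those lines might look like the sketch below. To be clear about what's mine: the `Counts` struct and the exact definitions are my own, "word" here means a run of non-whitespace (wc-style, slightly different from the `\b\w+\b` regex), and whitespace runs include newlines, so the numbers won't match the regex versions exactly.

```rust
// Single-pass counter sketch: one walk over the bytes tracks whether
// we are inside a word, inside a whitespace run, and whether the
// current line is blank, so all four counts fall out of the same loop.

#[derive(Debug, Default, PartialEq)]
struct Counts {
    lines: usize,
    spaces: usize,     // runs of whitespace, analogous to \s+
    words: usize,      // runs of non-whitespace, wc-style
    paragraphs: usize, // blocks of text separated by blank lines
}

fn count_all(text: &str) -> Counts {
    let mut c = Counts::default();
    let mut in_word = false;
    let mut in_space = false;
    let mut line_blank = true;    // current line is all whitespace so far
    let mut in_paragraph = false; // text seen since the last blank line

    for &b in text.as_bytes() {
        let is_space = b.is_ascii_whitespace();
        if is_space && !in_space {
            c.spaces += 1; // count each whitespace run once
        }
        in_space = is_space;

        if b == b'\n' {
            c.lines += 1;
            if line_blank && in_paragraph {
                c.paragraphs += 1; // a blank line closes a paragraph
                in_paragraph = false;
            }
            line_blank = true;
            in_word = false;
        } else if is_space {
            in_word = false;
        } else {
            if !in_word {
                c.words += 1;
            }
            in_word = true;
            line_blank = false;
            in_paragraph = true;
        }
    }
    if !line_blank {
        c.lines += 1; // final line without a trailing newline
    }
    if in_paragraph {
        c.paragraphs += 1; // final paragraph may not end in a blank line
    }
    c
}

fn main() {
    let c = count_all("hello world\n\nfoo bar baz\n");
    assert_eq!((c.lines, c.spaces, c.words, c.paragraphs), (3, 5, 5, 2));
    println!("{:?}", c);
}
```

The inner loop is branchy but touches each byte exactly once; memchr would let you skip ahead to the next interesting byte instead of testing every one.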

Then, I'd use the multiple cores by splitting the input file (as a single string) into N roughly equal pieces (where N is the number of threads). For each boundary, adjust it forwards or back until it lands on a paragraph boundary to simplify counting. This vector of N sub-strings can then be fed through my existing counting function using into_par_iter.

One pass (minimising reads) on independent regions (hopefully minimise cache thrashing), with each thread getting a roughly equal amount of work.
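A sketch of that splitting scheme, with the caveat that `chunk_at_paragraphs` and the word counter are my own illustrative stand-ins: each cut is nudged forward to just past the next "\n\n", and I use std scoped threads to keep the example dependency-free — with rayon, the chunks vector would go through into_par_iter() instead.

```rust
use std::thread;

/// Split `text` into roughly `n` pieces, nudging each cut forward to
/// just past the next "\n\n" so no word or paragraph straddles a chunk.
/// (Byte indices; non-ASCII input would need snapping to a char boundary.)
fn chunk_at_paragraphs(text: &str, n: usize) -> Vec<&str> {
    let target = text.len() / n.max(1) + 1;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < text.len() {
        let mut end = (start + target).min(text.len());
        if end < text.len() {
            match text[end..].find("\n\n") {
                Some(off) => end += off + 2, // land just after the blank line
                None => end = text.len(),    // no more paragraph breaks
            }
        }
        chunks.push(&text[start..end]);
        start = end;
    }
    chunks
}

/// Stand-in for the full one-pass counter.
fn count_words(s: &str) -> usize {
    s.split_whitespace().count()
}

fn main() {
    let text = "alpha beta\n\ngamma delta epsilon\n\nzeta";
    let chunks = chunk_at_paragraphs(text, 3);
    // one thread per chunk; partial counts are summed after the joins
    let total: usize = thread::scope(|scope| {
        let handles: Vec<_> = chunks
            .iter()
            .copied()
            .map(|c| scope.spawn(move || count_words(c)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    });
    assert_eq!(total, 6);
    println!("words: {total}");
}
```

Because every chunk starts right after a blank line, per-chunk line/word/paragraph counts can simply be summed without worrying about items split across boundaries.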

Then I'd test dividing into, say, 100*N chunks and chunks of a fixed target size (say, 4MiB) to see if those are faster (due to making the amount of work more granular in case different parts of the input file take different amounts of time to process).

At this point, I would then feed the input to something like the standard wc program and, on realising it's significantly faster, actually use that in practice instead. :smiley:

For starters I'd compile and run in release mode. There's no way an 11 GB file takes 8 minutes to be chewed through on a modern machine where the entire thing fits in RAM, unless you are running without optimizations.
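Concretely, that means passing --release; debug builds can easily be an order of magnitude slower on this kind of byte-crunching (the binary name below is a placeholder):

```shell
# debug builds (the default) are unoptimized; always benchmark with:
cargo run --release
# or build first, then run the optimized binary:
cargo build --release && ./target/release/your-binary
```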

Your program copies the file into newly allocated memory twice: once in read_to_string and again when splitting it into lines. Both of these allocations can be eliminated, and replaced with counting as you read the file rather than afterward — you'll need to use a custom state machine instead of regex, but that's a good programming exercise itself. “Streaming” processing, that does not keep the entire data set in memory, is an important paradigm for working with large files efficiently.
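A minimal sketch of that streaming idea (my own illustration, counting only words here with wc-style semantics): read the file in fixed-size chunks and update the state machine as bytes arrive, so the whole 11 GB never sits in memory at once. The `in_word` flag carried across chunk boundaries is exactly the state that replaces the regex.

```rust
use std::io::{self, Read};

// Count words from any reader without buffering the whole input:
// a "word" is a run of non-whitespace bytes, and the in_word flag
// survives across read() calls so words split over a chunk boundary
// are still counted once.
fn count_words_streaming<R: Read>(mut reader: R) -> io::Result<usize> {
    let mut buf = [0u8; 64 * 1024]; // 64 KiB read chunks
    let mut words = 0;
    let mut in_word = false;
    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        for &b in &buf[..n] {
            if b.is_ascii_whitespace() {
                in_word = false;
            } else if !in_word {
                words += 1;
                in_word = true;
            }
        }
    }
    Ok(words)
}

fn main() -> io::Result<()> {
    // Any `Read` works, so an in-memory cursor stands in for a File here.
    let words = count_words_streaming(io::Cursor::new("one two\nthree"))?;
    assert_eq!(words, 3);
    println!("words: {words}");
    Ok(())
}
```

With a real file you'd wrap `File::open("testing.txt")?` in the same function; memory use stays at 64 KiB regardless of file size.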

(This is not necessarily the fastest approach, nor the most parallel, though.)


Since you always join these, check out std::thread::scope.

That'll be easier to use because it can take borrows, and thus you'll be able to avoid all the Arcing.
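Roughly like this (a simplified sketch, with stand-in counter bodies): the scoped threads borrow lines directly, so no Arc, no clone, and no move of owned data.

```rust
use std::thread;

fn main() {
    let lines: Vec<String> = vec!["hello world".into(), "".into(), "foo".into()];

    // Scoped threads may borrow `lines` because the scope guarantees
    // they all finish before `lines` goes out of scope.
    let (n_lines, n_words) = thread::scope(|s| {
        let lines_h = s.spawn(|| lines.len());
        let words_h = s.spawn(|| {
            lines.iter().map(|l| l.split_whitespace().count()).sum::<usize>()
        });
        // any handles not joined explicitly are joined when the scope ends
        (lines_h.join().unwrap(), words_h.join().unwrap())
    });

    assert_eq!((n_lines, n_words), (3, 3));
    println!("count_lines: {n_lines}, count_words: {n_words}");
}
```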

I've seen this question before somewhere. Is this from some l33tcode test?

I think the consensus last time was that the parallelism wasn't worth it.