Multiplex batches of read file lines to rayon tasks

Hello,
I am making a CLI tool to produce a histogram out of a JSON log file (one JSON object per line) -- it counts the distinct occurrences of a particular JSON object field.

I had no problems following the recommended BufReader.read_line idiom with a single mutable String buffer and making it work reliably.

But now I'd like to parallelise it. I gather there are a lot of ways to do this, but, whether due to an XY problem or because of how performant my equivalent Elixir code is, I'd like the parallel implementation to roughly do the following:

  1. Serially (single thread) read the file.
  2. Make and collect batches of, say, 500 lines.
  3. Send each batch to a separate thread.
  4. Each separate thread does JSON parsing of each line in its batch and updates the shared-state histogram (a HashMap<String, u64>).

In the IRC channel I got told that "a quick loop of try_sends and a blocking send for each line (make a thread pool, send work to them from a single reader thread over a bounded channel)" would likely be the quickest solution without involving rayon.
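For concreteness, here is my rough understanding of that suggestion in code form (a sketch with invented names like `histogram` and `n_workers`; the JSON parsing is stubbed out as plain line counting -- corrections welcome):

```rust
use std::collections::HashMap;
use std::sync::mpsc::sync_channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Batch `lines` into groups of `batch_size`, fan the batches out to
/// `n_workers` threads over a bounded channel, and merge the per-worker
/// histograms at the end.
fn histogram(
    lines: impl Iterator<Item = String>,
    batch_size: usize,
    n_workers: usize,
) -> HashMap<String, u64> {
    // Bounded channel: the reader blocks when the workers fall behind.
    let (tx, rx) = sync_channel::<Vec<String>>(8);
    // std's mpsc Receiver can't be shared between threads directly,
    // so wrap it in a mutex. (A crossbeam channel would avoid this.)
    let rx = Arc::new(Mutex::new(rx));

    let workers: Vec<_> = (0..n_workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut local: HashMap<String, u64> = HashMap::new();
                loop {
                    // Lock only for the duration of this one statement.
                    let msg = rx.lock().unwrap().recv();
                    match msg {
                        Ok(batch) => {
                            for line in batch {
                                // Real code would serde_json::from_str the
                                // line and count a field; this counts the
                                // raw line instead.
                                *local.entry(line).or_insert(0) += 1;
                            }
                        }
                        Err(_) => break, // sender dropped: no more work
                    }
                }
                local // each worker returns its private histogram
            })
        })
        .collect();

    // Single reader: accumulate batches and send them down the channel.
    let mut batch = Vec::with_capacity(batch_size);
    for line in lines {
        batch.push(line);
        if batch.len() == batch_size {
            tx.send(std::mem::take(&mut batch)).unwrap();
        }
    }
    if !batch.is_empty() {
        tx.send(batch).unwrap();
    }
    drop(tx); // closing the channel lets the workers exit their loops

    // Merge the per-worker maps instead of sharing one behind a lock.
    let mut merged = HashMap::new();
    for w in workers {
        for (k, v) in w.join().unwrap() {
            *merged.entry(k).or_insert(0) += v;
        }
    }
    merged
}

fn main() {
    let lines = ["a", "b", "a", "c", "a"].iter().map(|s| s.to_string());
    let h = histogram(lines, 2, 4);
    println!("{:?}", h.get("a")); // Some(3)
}
```

Merging per-worker maps at the end avoids contending on a single shared HashMap behind a Mutex.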

My issue is: I am still rather new and don't know many of Rust's idioms yet. Also, rayon seems not to need the concept of batches; it claims it can automatically adapt to load, so maybe the batch-producing part can be scrapped entirely?

Can somebody help with some sample code snippets? I am not looking to have my homework written for me; I just need a good starting point.

(F.ex. I have no clue how I could translate a BufReader.read_line loop into a 500-line chunk producer, or even into a simple iterator producer.)
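EDIT: for later readers, here is the kind of chunk producer I had in mind, sketched with std::iter::from_fn so no Iterator impl is needed (`batches` is my own name; the 500 comes from the plan above):

```rust
use std::io::{BufRead, BufReader, Read};

/// Turn any reader into an iterator of up-to-`batch_size`-line batches,
/// driven by a single read_line loop with one reused String buffer.
fn batches<R: Read>(rd: R, batch_size: usize) -> impl Iterator<Item = Vec<String>> {
    let mut rd = BufReader::new(rd);
    std::iter::from_fn(move || {
        let mut batch = Vec::with_capacity(batch_size);
        let mut line = String::new();
        while batch.len() < batch_size {
            line.clear(); // reuse the same buffer for every read_line
            match rd.read_line(&mut line) {
                Ok(0) => break, // EOF
                // Each line must become an owned String anyway if it is
                // going to be sent to another thread, so this per-line
                // allocation is hard to avoid here.
                Ok(_) => batch.push(line.trim_end().to_string()),
                Err(e) => panic!("read error: {e}"),
            }
        }
        if batch.is_empty() { None } else { Some(batch) }
    })
}

fn main() {
    let data = "one\ntwo\nthree\nfour\nfive\n";
    for batch in batches(data.as_bytes(), 2) {
        println!("{:?}", batch);
    }
}
```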

The following snippet would open the logs file and parse/process every line in parallel. I think rayon is the go-to library for such a problem; it is designed for data parallelism. But please do not try to combine mpsc or Mutexes with rayon, as that may cause deadlocks.

use rayon::prelude::*;
use serde::Deserialize;
use std::io::BufRead;

/// This describes the following JSON object: {"f1": 34}
#[derive(Deserialize, Debug)]
struct LogLine {
    f1: u32,
}

fn main() {
    let fd = std::fs::File::open("logs").unwrap();
    let reader = std::io::BufReader::new(fd);

    reader
        .lines()        // split into lines serially
        .filter_map(Result::ok)
        .par_bridge()   // parallelize
        .filter_map(|line: String| serde_json::from_str(&line).ok()) // filter out bad lines
        .for_each(|v: LogLine| {
           // do some processing (in parallel)
           println!("X={}", v.f1);
        });
}

(Fully edited the previous version of this since it was quite the rookie question.)

Doesn't .lines() make a new String for each line? Is there a way to make the BufReader.read_line idiom (with a single mutable String buffer) yield an Iterator? I am still very new and I am not sure I could properly implement Iterator myself.

You can use a single string buffer like this:

use rayon::prelude::*;
use serde::Deserialize;
use std::io::Read;

/// This describes the following JSON object: {"f1": 34}
#[derive(Deserialize, Debug)]
struct LogLine {
    f1: u32,
}

fn main() {
    let mut fd = std::fs::File::open("logs").unwrap();
    let len = fd.metadata().unwrap().len() as usize;
    let mut file = String::with_capacity(len);
    fd.read_to_string(&mut file).unwrap(); // This reads the entire file into memory.
    drop(fd);

    file.lines()        // split to lines serially
        .par_bridge()   // parallelize
        .filter_map(|line| serde_json::from_str(line).ok())
        .for_each(|v: LogLine| {
            // do some processing (in parallel)
            println!("X={}", v.f1);
        });
}

This uses str::lines instead of BufRead::lines, which returns references into the same string instead of a new allocation per line.

That would read the entire file into memory. Since I expect those files to be 1 GB or more... I suppose it's still not a big deal, but for now I will go with the one String allocation per line.

I mean, you could also change it to read max 100 MB into memory, and then process all the complete lines in that, and then do the next 100 MB of lines.

That's just the thing: still learning so fine-tuning those algorithms still doesn't come naturally -- I don't know most of the Rust traits and my brain still doesn't bend that way. :slight_smile:

You could do it like this playground.
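In case the playground link rots, the general chunk-and-carry idea looks something like this (not the exact playground code; CHUNK_SIZE is tiny here just to exercise the carry-over, and you'd use something like 100 MB for real files):

```rust
use std::io::Read;

const CHUNK_SIZE: usize = 16; // tiny for demonstration; ~100 MB in practice

/// Read `rd` in fixed-size chunks, carrying any cut-off trailing line over
/// to the next round, and call `f` on every complete line.
fn for_each_line<R: Read>(mut rd: R, mut f: impl FnMut(&str)) {
    let mut s: Vec<u8> = Vec::with_capacity(CHUNK_SIZE);
    loop {
        // `s` may still hold the incomplete tail carried over from the
        // previous round, hence reading only CHUNK_SIZE - s.len() more.
        let n = rd
            .by_ref()
            .take((CHUNK_SIZE - s.len()) as u64)
            .read_to_end(&mut s)
            .unwrap();
        if n == 0 && s.is_empty() {
            break; // EOF and nothing carried over
        }
        // Everything up to the last '\n' is complete lines; the remainder
        // belongs to the next chunk. (Single lines longer than CHUNK_SIZE
        // are not handled by this sketch.)
        let end = match s.iter().rposition(|&b| b == b'\n') {
            Some(i) => i + 1,
            None => s.len(), // EOF without a trailing newline
        };
        for line in s[..end].split(|&b| b == b'\n').filter(|l| !l.is_empty()) {
            f(std::str::from_utf8(line).unwrap());
        }
        s.drain(..end); // keep only the leftover tail for the next round
        if n == 0 {
            break; // EOF: the final partial chunk has been processed
        }
    }
}

fn main() {
    let data = "alpha\nbeta\ngamma\ndelta\nepsilon\nzeta\neta\ntheta\n";
    let mut count = 0;
    for_each_line(data.as_bytes(), |_line| count += 1);
    println!("{count}"); // 8
}
```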


Can I ask ... what does this do:

fd.by_ref().take((CHUNK_SIZE - s.len()) as u64).read_to_end(&mut s).unwrap();

.. specifically (CHUNK_SIZE - s.len())

Aren't CHUNK_SIZE and s.len() the same?

Also, sorry @dimitarvp for "hijacking" this post, but I hope you'll find this helpful because I roughly want to understand the same:

Say I read the file into a Vec of Vecs, i.e. a file_vec containing rec_vecs, like this:

// smf is a byte array like [u8; number]
let smf_bytes = smf.len();
let mut pos = 0;
let mut file_vec: Vec<Vec<u8>> = Vec::new();
while pos < smf_bytes {
    // each record starts with a 2-byte big-endian length
    let rec_len = u16::from_be_bytes(smf[pos..pos + 2].try_into().unwrap());
    let vec_size = rec_len as usize;
    let mut rec_vec = vec![0u8; vec_size];
    // println!("pos:{} rec_len:{} vec_size:{}", pos, rec_len, vec_size);
    rec_vec.copy_from_slice(&smf[pos..pos + vec_size]); // no .to_vec() needed
    file_vec.push(rec_vec);
    pos += vec_size;
}
file_vec.par_iter()
    .for_each(|rec| do_it(rec));

... if my file_vec grows above, say, 500 MB, send it to rayon to process.
Not exactly 500 MB, because I don't want to handle broken records -- so file_vec can grow to around 500 MB, such that it only ever contains complete, unbroken rec_vecs.
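The flush-at-a-size-cap part could be sketched like this, building on the loop above (`process_in_batches`, `max_bytes`, and `flush` are invented names; `flush` is where the rayon par_iter would go):

```rust
/// Walk length-prefixed records in `smf`, accumulating whole records into a
/// batch and flushing whenever the batch grows past `max_bytes`. Only
/// complete records ever enter a batch, so none is split across flushes.
fn process_in_batches(smf: &[u8], max_bytes: usize, mut flush: impl FnMut(&[Vec<u8>])) {
    let mut batch: Vec<Vec<u8>> = Vec::new();
    let mut batch_bytes = 0;
    let mut pos = 0;
    while pos + 2 <= smf.len() {
        // each record starts with a 2-byte big-endian length that counts
        // the whole record, prefix included
        let rec_len = u16::from_be_bytes(smf[pos..pos + 2].try_into().unwrap()) as usize;
        assert!(rec_len >= 2 && pos + rec_len <= smf.len(), "corrupt record length");
        batch_bytes += rec_len;
        batch.push(smf[pos..pos + rec_len].to_vec());
        pos += rec_len;
        if batch_bytes >= max_bytes {
            flush(&batch); // e.g. batch.par_iter().for_each(|rec| do_it(rec))
            batch.clear();
            batch_bytes = 0;
        }
    }
    if !batch.is_empty() {
        flush(&batch); // whatever is left at EOF
    }
}

fn main() {
    // Two fake records, each a 2-byte length prefix plus payload.
    let mut smf = Vec::new();
    smf.extend_from_slice(&4u16.to_be_bytes());
    smf.extend_from_slice(b"ab");
    smf.extend_from_slice(&5u16.to_be_bytes());
    smf.extend_from_slice(b"cde");
    let mut flushes = 0;
    process_in_batches(&smf, 4, |batch| {
        flushes += 1;
        println!("batch of {} record(s)", batch.len());
    });
    println!("{flushes} flushes"); // 2 flushes
}
```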

EDIT:
It would be amazing to make use of new libraries that will help with this problem -
https://docs.rs/blocking/0.2.0/blocking/ ... if it's relevant here.

No, s.len() returns the length, not the capacity, so in the first iteration the length is 0, and in later iterations it is how many bytes were in the incomplete line that was transferred to the next chunk.

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.