Multithreading performance measurement question

I wrote a Rust program which has 1 thread reading a file into a Buffered Reader and then line-by-line, feeding that data to a thread pool via a channel. The thread pool has like 8-9 threads which are in spinlock waiting for work to come to them out of the channel. When the channel has data hit the receiving end, the next free thread will grab the data and work on it.

I definitely see a performance boost versus when I had the program using only a single-threaded model. The program works fairly well, actually.

However, I have no idea if the number of threads I have spinning is really all that efficient… For example, what if my main thread is only reading fast enough to keep 3 of the threads in the thread pool busy and 6 threads are pretty much just spinlocking at any given time? How would I know that I need to have 2 threads reading from the file and feeding the rest of the threads through 2 separate channels, for example? I am not sure how to go about actually measuring this. Any tips or ideas would be appreciated, thanks.

Is there a reason you’re using a spinlock? That doesn’t seem optimal or necessary. I’m going to put down a few high level thoughts for now.

Generally, with a producer/consumer model where the rate of work production differs from rate of consumption, you’ll want to implement some form of backpressure. Backpressure will force the producer(s) to slow down when consumer(s) fall behind. In Rust’s stdlib this can be accomplished by connecting the producer and consumer with a bounded channel - if a producer would exceed the capacity of the channel, it blocks until capacity is made available (ie consumer makes some progress).

There are many ways to measure this and tune. One can measure the work rate imbalance by instrumenting the queue depth and/or tracking how often the producer blocks and/or how often the consumer has nothing to pull off the queue. You can also run benchmarks and use empirical observations to see how many consumers you need before throughput flattens out (assuming you’re mostly throughout oriented). You can also use a threadpool that does work stealing internally, and/or a pool that manages concurrency level dynamically (ie injects new threads if consumers are being outpaced, reaps threads that stay idle for too long, etc).

Whether you need more producers or consumers depends on which side is outpacing the other and whether you have room to increase one or the other. For example, a producer(s) doing disk I/O may hit the I/O device limits and adding additional ones either doesn’t help or makes things worse.

1 Like

Let me just show you what I’m doing; I often get confused with multithreading lingo… But by spinlock I mean “a loop which continues to iterate until work is ready.”

However, it may be that I am not truly using a spinlock since the channel docs claim that it blocks the thread until there is work… But I’m not sure how it blocks it (does it pause the thread’s execution or does it execute a spinlock internally?)… Anyway, this is the type of thing I’m talking about:

extern crate spmc;
use spmc::channel;
use std::thread;
use std::io::BufReader;
use std::fs::File;
use std::io::BufRead;
use std::u32::MAX;
use std::env;
use std::sync::Arc;

fn main() -> Result<(), std::io::Error> 
{
    let args: Vec<String> = env::args().collect();
    match args.len()
    {
        2 => {},
        _ => return Err(std::io::Error::new(std::io::ErrorKind::Other, "Incorrect number of args"))
    }
    let filename = &args[1] as &str;
    let f1 = File::open(filename)?;

    let mut br = BufReader::new(f1);

    let mut vecData: Vec<String> = Vec::new();     

    let (tx, rx) = spmc::channel();

    let mut handles = Vec::new();
    for n in 0..5 {
        let rx = rx.clone();
        handles.push(thread::spawn(move || {
            loop
            {
                let mut line_to_check: Arc<String> = rx.recv().unwrap();
                if line_to_check.contains("test")
                {
                    println!("HIT: {}", line_to_check);
                }
            }

        }));
    }
    let mut input_str = String::new();
    let mut bytes_read: usize = 1;
    while bytes_read != 0 {
        let mut is_copy = input_str.clone();
        bytes_read = 
        match br.read_line(&mut is_copy)
        {
            Ok(num) => num,
            Err(err) => return Err(std::io::Error::new(std::io::ErrorKind::Other, "read_line failed...\n"))
        };

        let str_arc : Arc<String> = Arc::new(is_copy);
        tx.send(str_arc);
    }


    Ok(())
}

Note that if I am understanding the docs properly, the statement let mut line_to_check: Arc<String> = rx.recv().unwrap(); should actually cause the thread to “block” if there is no work and thus I am not clear on whether or not this is truly a spinlock in the sense that the loop will execute as fast as possible until there is work…

Ah ok - spinlock usually refers to a critical section implementation (like a mutex) that uses spinning, rather than thread suspension, to acquire a lock (as the name suggests). It’s somewhat niche and intended for cases where you’re willing to burn CPU rather than sleep the thread and critical section length is short/fast. Some impls of critical sections are hybrids: they may spin a bit and then give up and go to sleep.

Anyway, I’m not familiar with the spmc crate but it likely uses OS primitives to put a thread to sleep and then wake it (eg futex on Linux). If the channel is empty, a thread doing recv()‘ will go to sleep (maybespmc` spins a bit but that would be an impl detail anyway), and thus won’t waste CPU; when work arrives, it’ll get woken up.

Have you looked into using a threadpool crate rather than manage channels directly? You probably also don’t need to wrap the String in an Arc - just submit the String to the channel (or pool).

The other suggestion I’d have is to try and submit “chunky” work to the worker threads - this amortizes the synchronization/concurrency cost you pay every time you communicate via a channel/pool. In other words, try not to be too chatty, but while also trying to be granular enough to spread the load across the workers.

2 Likes