I'm doing a task similar to one described in this thread -- one thread reads from a file chunks of variable length, sends it to a channel, and the worker threads process these items.
I thought of a scheme with a vector of vectors protected by one or more mutexes, but it seems to have bottlenecks. Instead, maybe I should use a carousel: worker threads read from one channel, then send the vectors back over another channel, while the file reader thread does the opposite.
I wonder, is this worth the hassle?
In the current scheme, I create 1K vectors per second. I know their size beforehand (the first 4 bytes indicate the size) and don't increase capacity, so it's 1 allocation per chunk + 1 channel write.
The carousel version would have 2 channel operations (1 read & 1 write) and 0 allocations per chunk.
Current reader code
move || -> Result<(), ThreadCrash> {
    loop {
        let chunk_len = match u32::read_from_io(&mut file_reader) {
            Ok(v) => Ok(v),
            // A clean EOF at a chunk boundary means we're done.
            Err(e) if e.kind() == std::io::ErrorKind::UnexpectedEof => break,
            Err(e) => Err(ThreadCrash::new(&format!("{e}"))),
        }?;
        let mut bytes = vec![0; chunk_len as usize];
        file_reader
            .read_exact(&mut bytes)
            .map_err(|e| ThreadCrash::new(&format!("{e}")))?;
        jobs.send(bytes).map_err(|e| ThreadCrash::new(&format!("{e}")))?;
    }
    Ok(()) // the closure's declared return type is Result<(), ThreadCrash>
}
The link you provided doesn't work; it is not even clickable. That said, the answer is to benchmark and profile your code.
Many factors can affect performance, including the specific OS and CPU you are running on. Other code in your application can also affect things: something that looks good in a microbenchmark might not be the top performer in the actual application. This can happen because the CPU cache is a limited resource, and a slower algorithm that uses less cache might end up faster in practice.
In your case the performance of allocation is likely to differ based on which allocator you use; Rust allows overriding this with the global_allocator attribute. Some allocators might take locks, some might have thread-local pools for reusing allocations on the same thread. Allocators are also unpredictable in that they tend to differ wildly between best-case and worst-case performance: if the allocator needs to ask the OS to map more memory rather than reuse memory from an existing pool, it will be slower. This is why allocators are avoided in hard real-time code, where you want a predictable worst case.
I suspect that would be largely measurement error - did you use something like hyperfine that can estimate that for you?
1k/s is pretty much nothing for most machines (ballpark around a million operations), so it shouldn't matter at all for most things you do. You say it's getting bottlenecked though - I'm curious where. If it's CPU bound, then a profiler with something like a flamechart is a good bet (though since this is job-based, look for one that sorts the stacks by duration or alphabetically so it merges common stacks); if it's IO bound, or something fancier like lock or cache contention, you need to get into more specific tooling.
Both channels and general-purpose allocators are fundamentally built on at least memory synchronization, so both follow the general pattern of being basically free when not under contention. That would be a slight win for channels in theory, since they are not shared across the whole process and are therefore unlikely to be contended. It's possible that's what you're seeing, but again, I would expect 1k/s to be well below where that becomes an issue.
Rapidly deallocating and reallocating with the system allocator can cause fragmentation, which can hurt performance overall, though it's much less of a problem with modern allocators. If you're really trying to optimize, you could start looking at arena allocators simply to keep the related memory together, but that is quite possibly just churn without value unless you can see it in a profiler.
I think it'll depend massively on your allocator and the exact pattern of usage.
Modern allocators have smart bucketing and thread awareness, so if you're deallocating something then reallocating something the same size in the same thread it's typically incredibly fast.
If the sizes vary widely and you're using them across a channel so the thread locality optimizations never apply, then the practical allocation performance might be completely different.
Depending how your reader works, you could also try a Bytes-like scheme where the reader will allocate a huge buffer and give out shared chunks of it so that it's one allocation per 10 MB (say) instead of one allocation per chunk.
But really, my main thought is that if it's worth having multiple worker threads -- especially to process something you read from IO -- then the amount of work you're doing per chunk is probably high enough that none of this ought to matter. For "large" chunks, I'd expect the IO time to dominate over an extra allocation.
If a thread locks a mutex and is then suspended by the OS, none of the other threads that are waiting to acquire that mutex can make progress until the original one is unsuspended and releases the lock. It's also just easier to reason about channels than mutexes.
When I tried writing it, I realized my scheme had an issue: I wanted the reader to store data under the mutexes one by one, which means any worker that took too long and made the reader thread wait on its mutex would make the others wait too. So I abandoned the idea.
If you know the sizes beforehand and don't grow them, I'd recommend using boxed arrays instead. If you really want to do something fancy, you could even calculate the total number of elements, make one massive array for them all, and split it up into slices so all your data is neatly compact in RAM.
This would give you a few benefits, ranging from a single allocation instead of many, to having all your data compact, which potentially means better cache access and less memory waste due to fragmentation.
Also remember that allocations often involve mutex access themselves.