and now we want to convert a BigImage into a Vec<Tile>. Is this fundamentally a slow operation? I can't seem to break 1 GB/s. (Tried: reading in order; writing in order; multithreading with rayon.)
There appears to be a fundamental issue with cache misses after every 4 * 16 = 64 bytes read or written (one tile row is exactly one cache line).
If we write tile by tile, we suffer a cache miss after every 64 bytes read.
If we read line by line, we suffer a cache miss after every 64 bytes written.
Is there a clever solution around this problem?
(I ran perf; 75% of cpu-clock is spent in mov-family instructions.)
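For concreteness, here is a minimal sketch of the conversion being discussed, assuming a row-major u32 image of width 8192 and 16×16 tiles (the names `to_tiles`, `W`, and `T` are illustrative, not the poster's actual API):

```rust
const W: usize = 8192; // image width in pixels (assumption)
const T: usize = 16;   // tile side

// Copy a row-major image into 16x16 tiles, tile-major order.
// Each tile row is 16 * 4 = 64 bytes: exactly one cache line,
// so every row copied for a tile touches a fresh line.
fn to_tiles(img: &[u32], height: usize) -> Vec<[u32; T * T]> {
    let mut tiles = Vec::with_capacity((W / T) * (height / T));
    for ty in (0..height).step_by(T) {
        for tx in (0..W).step_by(T) {
            let mut tile = [0u32; T * T];
            for row in 0..T {
                let src = (ty + row) * W + tx;
                tile[row * T..(row + 1) * T].copy_from_slice(&img[src..src + T]);
            }
            tiles.push(tile);
        }
    }
    tiles
}
```

The inner `copy_from_slice` is the 64-byte unit the thread is talking about: each one reads from a source address 32 KiB away from the previous one.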
What is the bandwidth for just cloning the vector? (to see if 1GB/s can be improved)
Is it faster if you make the transformation in-place? (by swapping [u32; 4]'s in the vector, maybe there would be less cache pressure)
I think I have the same kind of problem with big matrix transposition or multiplication. I tried different approaches, but this kind of algorithm always requires walking through all of the memory on each iteration, unless you know something about the data (e.g. a sparse or symmetric matrix).
That kinda makes sense. When you go from reading row 1 of a tile to reading row 2, you need to skip 8192 - 16 = 8176 pixels, which isn't great for your cache.
What about playing around with the std::intrinsics::prefetch_read_data() intrinsic? (prefetch_read_instruction() is for instruction memory, not data.) At the start of each tile's row you might be able to prefetch the next row (or 3 rows ahead, or whatever) so it's already in cache by the time you want to read it.
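The intrinsic above needs nightly; on stable Rust the same hint is available as `_mm_prefetch` in `std::arch::x86_64`. A hedged sketch of the idea (the function name and the 3-rows-ahead distance are assumptions, and on non-x86_64 targets the prefetch compiles away):

```rust
// Copy one 16x16 tile, prefetching the source row three rows ahead
// of the one currently being read. `w` is the image row pitch in pixels.
fn copy_tile_with_prefetch(img: &[u32], w: usize, x: usize, y: usize) -> [u32; 256] {
    let mut tile = [0u32; 256];
    for row in 0..16 {
        // Hint the cache to start fetching a row we'll need soon.
        #[cfg(target_arch = "x86_64")]
        if row + 3 < 16 {
            use std::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};
            let ahead = (y + row + 3) * w + x;
            // Safety: `ahead` is in bounds, and prefetch is only a hint anyway.
            unsafe { _mm_prefetch::<_MM_HINT_T0>(img.as_ptr().add(ahead) as *const i8) };
        }
        let src = (y + row) * w + x;
        tile[row * 16..row * 16 + 16].copy_from_slice(&img[src..src + 16]);
    }
    tile
}
```

Whether this wins anything depends on how far ahead the hardware prefetcher already runs; it is worth benchmarking a few prefetch distances.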
Not sure if it helps, but what if you try to "prime" the cache by reading the words in each tile in column-major order? Once you have read the first word of each line of the current tile, all the data for that tile should be in the cache. It should also be prefetcher-friendly, because the distance between consecutive reads is constant (8192 pixels × 4 bytes = 32768 bytes; not sure the prefetcher "gets" strides that big, but it's worth trying).
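A minimal sketch of that priming idea, assuming 16×16 tiles of u32 (the function name is hypothetical; `black_box` just stops the compiler from deleting the priming reads):

```rust
// "Prime" the cache: touch one word per source row first (constant
// stride = the row pitch), then copy row by row from warm cache lines.
fn copy_tile_primed(img: &[u32], w: usize, x: usize, y: usize) -> [u32; 256] {
    let mut sum = 0u32;
    for row in 0..16 {
        // One read per line pulls the whole 64-byte cache line in.
        sum = sum.wrapping_add(img[(y + row) * w + x]);
    }
    std::hint::black_box(sum); // keep the priming loop from being optimized out
    let mut tile = [0u32; 256];
    for row in 0..16 {
        let src = (y + row) * w + x;
        tile[row * 16..row * 16 + 16].copy_from_slice(&img[src..src + 16]);
    }
    tile
}
```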
pub fn test_00() {
    let x = vec![2_u8; 1_000_000_000];
    let start = std::time::Instant::now();
    let y = x.clone();
    let end = std::time::Instant::now();
    std::hint::black_box(&y); // keep the clone from being optimized out in release builds
    println!("cloning 1gb: {:?}", end - start);
}
so there is a reported 10x difference between what I can achieve with Vec::clone and what the Linux bandwidth tool is claiming.
pub fn test_00() {
    use rayon::prelude::{IntoParallelRefIterator, ParallelIterator};
    use std::sync::Arc;

    let x = Arc::new(vec![2_u8; 10_000_000]);
    let start = std::time::Instant::now();
    let lst = vec![(); 10_000];
    let ans = lst
        .par_iter()
        .map(|_| {
            let y = x.as_ref().clone();
            std::hint::black_box(&y); // keep the clone from being optimized out
            0
        })
        .collect::<Vec<_>>();
    let end = std::time::Instant::now();
    std::hint::black_box(ans);
    println!("time: {:?}", end - start);
}
Prefetching memory outside your allocation should be perfectly fine, which means you can blindly prefetch memory without needing to do any bounds checks.