Tiling an image fundamentally slow?

Suppose we have


pub struct BigImage {
    data: Vec<u32>, // 8192 * 8192 pixels
}

pub struct Tile([[u32; 16]; 16]);

and now we want to convert a BigImage into a Vec<Tile>. Is this fundamentally a slow operation? I can't seem to break 1 GB/s. (Tried: reading in order; writing in order; multithreading with rayon.)

There appears to be a fundamental issue here: lots of cache misses, one after every 4 * 16 = 64 bytes read or written.

If we focus on writing in tiles, then we suffer a cache miss after every 64 bytes read.

If we focus on reading in a line, then we suffer a cache miss after every 64 bytes written.

Is there a clever solution around this problem?

(I ran perf; 75% of cpu-clock is spent in mov-family instructions.)
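
For reference, a minimal sketch of the naive version I'm benchmarking looks like this (the constants and loop structure are just the obvious ones):

pub fn tile_naive(img: &BigImage) -> Vec<Tile> {
    const W: usize = 8192; // image width/height in pixels
    const T: usize = 16;   // tile width/height in pixels
    let mut tiles = Vec::with_capacity((W / T) * (W / T));
    for ty in 0..W / T {
        for tx in 0..W / T {
            let mut tile = [[0_u32; T]; T];
            for row in 0..T {
                // Each 64-byte row copy is followed by a jump of one
                // whole image row (32 KiB), hence the cache misses.
                let src = (ty * T + row) * W + tx * T;
                tile[row].copy_from_slice(&img.data[src..src + T]);
            }
            tiles.push(Tile(tile));
        }
    }
    tiles
}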

2 Likes

I wonder whether you really need that conversion. Can't you just simulate two-dimensional indexing on the vector you already have?

What is the bandwidth for just cloning the vector? (To see whether the 1 GB/s can be improved on at all.)

Is it faster if you do the transformation in place? (By swapping [u32; 4]s within the vector; maybe there would be less cache pressure.)

I think I have the same kind of problem with big matrix transposition or multiplication. I've tried different approaches, but this kind of algorithm always means walking through all of the memory on each pass, unless you know something about the data (e.g. a sparse or symmetric matrix).

That kinda makes sense. When you go from reading row 1 of a tile to reading row 2, you'll need to skip 8192 - 16 = 8176 pixels, which isn't great for your cache.

What about playing around with the std::intrinsics::prefetch_read_data() intrinsic (nightly-only)? At the start of each tile's row you might be able to prefetch the next row (or 3 rows ahead, or whatever), so it's already in cache by the time you want to read it.

4 Likes

No, many matrix / image processing techniques do exactly this "blocking" as a first step to speed up later operations.
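
For the conversion itself, one standard blocking arrangement (my sketch, not tested against your setup) is to fill one 16-row strip of tiles at a time: the reads stay fully sequential, and the strip's 512 in-progress tiles are only 512 KiB, small enough to stay resident in a typical L2 while they're filled:

pub fn tile_strips(img: &BigImage) -> Vec<Tile> {
    const W: usize = 8192; // image width/height in pixels
    const T: usize = 16;   // tile width/height in pixels
    let tiles_per_row = W / T; // 512

    let mut tiles = Vec::with_capacity(tiles_per_row * tiles_per_row);
    // One strip of 512 partially filled tiles: 512 * 1 KiB = 512 KiB.
    let mut strip = vec![[[0_u32; T]; T]; tiles_per_row];

    for ty in 0..tiles_per_row {
        for row in 0..T {
            // A fully sequential read of one image row...
            let base = (ty * T + row) * W;
            let src_row = &img.data[base..base + W];
            // ...scattered as 64-byte chunks into every tile in the strip.
            for (tx, chunk) in src_row.chunks_exact(T).enumerate() {
                strip[tx][row].copy_from_slice(chunk);
            }
        }
        tiles.extend(strip.iter().map(|t| Tile(*t)));
    }
    tiles
}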

1 Like

Not sure if it helps, but what if you try to "prime" the cache by reading the words in each tile in column-major order? Once you have read the first word of each line of the current tile, all the data for that tile should be in the cache. And it should be prefetcher-friendly, because the distance between the words that are read is constant: one full image row, 8192 * 4 = 32768 bytes. (Not sure if the prefetcher "gets" strides as big as that, but it's worth trying.)
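
Something like this, assuming the 8192 * 8192 layout from the first post (prime_tile is my own name; std::hint::black_box just keeps the dummy loads from being optimized away):

fn prime_tile(img: &BigImage, tx: usize, ty: usize) {
    // Touch the first word of each of the tile's 16 rows before copying:
    // each load pulls in a full 64-byte line, so afterwards the whole
    // tile should be resident. The stride between loads is one image
    // row, 8192 * 4 = 32768 bytes.
    let mut sum = 0_u32;
    for row in 0..16 {
        sum = sum.wrapping_add(img.data[(ty * 16 + row) * 8192 + tx * 16]);
    }
    std::hint::black_box(sum);
}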

This is the type of black magic I'm looking for. Is there a way to do this on Rust stable? I'm okay with lots of unsafe here.

Output of Linux bandwidth tool:

Sequential read (64-bit), size = 256 B, loops = 515899392, 25144.0 MB/s
Sequential read (64-bit), size = 384 B, loops = 359311957, 26291.8 MB/s
Sequential read (64-bit), size = 512 B, loops = 270532608, 26373.8 MB/s
Sequential read (64-bit), size = 640 B, loops = 216425880, 26381.7 MB/s

Cloning a vec (takes 500-600 ms for 1 GB):

pub fn test_00() {
    let x = vec![2_u8; 1_000_000_000]; // 1 GB of data
    let start = std::time::Instant::now();
    let _y = x.clone(); // underscore avoids an unused-variable warning
    let end = std::time::Instant::now();
    println!("cloning 1 GB: {:?}", end - start);
}

So there is a reported ~10x difference between what I can achieve with Vec::clone and what the Linux bandwidth tool is claiming.

Okay, Rust can get ~20 GB/s via:

pub fn test_00() {
    use rayon::prelude::{IntoParallelRefIterator, ParallelIterator};
    use std::sync::Arc;

    // A 10 MB buffer, cloned 10_000 times in parallel => ~100 GB copied.
    let x = Arc::new(vec![2_u8; 10_000_000]);
    let start = std::time::Instant::now();

    let lst = vec![(); 10_000];

    let _ans = lst
        .par_iter()
        .map(|_| {
            let _y = x.as_ref().clone();
            0
        })
        .collect::<Vec<_>>();

    let end = std::time::Instant::now();
    println!("time: {:?}", end - start);
}

This does 100 GB in ~4-5 seconds, i.e. roughly 20-25 GB/s.

Oh, it looks like the corresponding instruction is available under core::arch::x86_64::_mm_prefetch().

Prefetching memory outside your allocation should be perfectly fine, so that means you can blindly prefetch memory without needing to do any bounds checks.
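
A minimal sketch of how that could be wired into the tile loop from the first post; the _MM_HINT_T0 locality hint and the 4-row look-ahead are guesses to tune:

#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

pub fn tile_prefetch(img: &BigImage) -> Vec<Tile> {
    const W: usize = 8192;  // image width/height in pixels
    const T: usize = 16;    // tile width/height in pixels
    const AHEAD: usize = 4; // look-ahead distance in image rows (tune me)
    let mut tiles = Vec::with_capacity((W / T) * (W / T));
    for ty in 0..W / T {
        for tx in 0..W / T {
            let mut tile = [[0_u32; T]; T];
            for row in 0..T {
                let src = (ty * T + row) * W + tx * T;
                // Hint at the row AHEAD rows further down. wrapping_add
                // sidesteps pointer-arithmetic UB near the end of the
                // buffer, and _mm_prefetch never faults, so no bounds
                // check is needed.
                #[cfg(target_arch = "x86_64")]
                unsafe {
                    let p = img.data.as_ptr().wrapping_add(src + AHEAD * W);
                    _mm_prefetch::<_MM_HINT_T0>(p as *const i8);
                }
                tile[row].copy_from_slice(&img.data[src..src + T]);
            }
            tiles.push(Tile(tile));
        }
    }
    tiles
}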

6 Likes

Thanks! I'll give this a try; it's precisely the type of trick/technique I was hoping for.

  1. Code compiles and runs.

  2. _mm_prefetch appears to be having an effect, as I can definitely increase the runtime by fetching useless things. 🙂

  3. Not obvious which pattern improves performance yet. (tried the obvious ones)

6 Likes

I gave it a try:

input order: 4.679652550118881 GB/s
output order: 6.04657130938373 GB/s

2 Likes

Would you mind also benching vec![2_u8; 10_000_000_000], just so we can get a sense of what your machine's memory system is capable of?

0.56 s for 10_000_000_000 bytes (~16 GiB/s)

1 Like
