Tiling an image fundamentally slow?

Suppose we have


pub struct BigImage {
    data: Vec<u32>, // 8192 * 8192 pixels
}

pub struct Tile([[u32; 16]; 16]);

and now we want to convert a BigImage into a Vec<Tile>. Is this fundamentally a slow operation? I can't seem to break 1 GB/s. (Tried: reading in order; writing in order; multithreading with rayon.)

There appears to be a fundamental issue here: lots of cache misses, one after every 4 * 16 = 64 bytes read or written.

If we focus on writing in tiles, then we suffer a cache miss after every 64 bytes read.

If we focus on reading in a line, then we suffer a cache miss after every 64 bytes written.

Is there a clever solution around this problem?

(I ran perf; 75% of cpu-clock is spent in mov-family instructions.)
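
For reference, a minimal sketch of the naive version I'm benchmarking looks like this (the constants and loop structure are just the obvious ones):

pub fn tile_naive(img: &BigImage) -> Vec<Tile> {
    const W: usize = 8192; // image width/height in pixels
    const T: usize = 16;   // tile width/height in pixels
    let mut tiles = Vec::with_capacity((W / T) * (W / T));
    for ty in 0..W / T {
        for tx in 0..W / T {
            let mut tile = [[0_u32; T]; T];
            for row in 0..T {
                // Each 64-byte row copy is followed by a jump of one
                // whole image row (32 KiB), hence the cache misses.
                let src = (ty * T + row) * W + tx * T;
                tile[row].copy_from_slice(&img.data[src..src + T]);
            }
            tiles.push(Tile(tile));
        }
    }
    tiles
}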

2 Likes

I wonder whether you really need that conversion. Can't you just simulate two-dimensional indexing on the vector you already have?

What is the bandwidth for just cloning the vector? (To see whether the 1 GB/s can be improved on at all.)

Is it faster if you do the transformation in place? (By swapping [u32; 4]s within the vector; maybe there would be less cache pressure.)

I think I have the same kind of problem with big matrix transposition or multiplication. I've tried different approaches, but this kind of algorithm always means walking through all of the memory on each pass, unless you know something about the data (e.g. a sparse or symmetric matrix).

That kinda makes sense. When you go from reading row 1 of a tile to reading row 2, you'll need to skip 8192 - 16 = 8176 pixels, which isn't great for your cache.

What about playing around with the std::intrinsics::prefetch_read_data() intrinsic (nightly-only)? At the start of each tile's row you might be able to prefetch the next row (or 3 rows ahead, or whatever), so it's already in cache by the time you want to read it.

4 Likes

No, many matrix / image processing techniques do exactly this "blocking" as a first step to speed up later operations.
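
For the conversion itself, one standard blocking arrangement (my sketch, not tested against your setup) is to fill one 16-row strip of tiles at a time: the reads stay fully sequential, and the strip's 512 in-progress tiles are only 512 KiB, small enough to stay resident in a typical L2 while they're filled:

pub fn tile_strips(img: &BigImage) -> Vec<Tile> {
    const W: usize = 8192; // image width/height in pixels
    const T: usize = 16;   // tile width/height in pixels
    let tiles_per_row = W / T; // 512

    let mut tiles = Vec::with_capacity(tiles_per_row * tiles_per_row);
    // One strip of 512 partially filled tiles: 512 * 1 KiB = 512 KiB.
    let mut strip = vec![[[0_u32; T]; T]; tiles_per_row];

    for ty in 0..tiles_per_row {
        for row in 0..T {
            // A fully sequential read of one image row...
            let base = (ty * T + row) * W;
            let src_row = &img.data[base..base + W];
            // ...scattered as 64-byte chunks into every tile in the strip.
            for (tx, chunk) in src_row.chunks_exact(T).enumerate() {
                strip[tx][row].copy_from_slice(chunk);
            }
        }
        tiles.extend(strip.iter().map(|t| Tile(*t)));
    }
    tiles
}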

1 Like

Not sure if it helps, but what if you try to "prime" the cache by reading the words in each tile in column-major order? Once you have read the first word of each line of the current tile, all the data for that tile should be in the cache. And it should be prefetcher-friendly, because the distance between the words that are read is constant: one full image row, 8192 * 4 = 32768 bytes. (Not sure if the prefetcher "gets" strides as big as that, but it's worth trying.)
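
Something like this, assuming the 8192 * 8192 layout from the first post (prime_tile is my own name; std::hint::black_box just keeps the dummy loads from being optimized away):

fn prime_tile(img: &BigImage, tx: usize, ty: usize) {
    // Touch the first word of each of the tile's 16 rows before copying:
    // each load pulls in a full 64-byte line, so afterwards the whole
    // tile should be resident. The stride between loads is one image
    // row, 8192 * 4 = 32768 bytes.
    let mut sum = 0_u32;
    for row in 0..16 {
        sum = sum.wrapping_add(img.data[(ty * 16 + row) * 8192 + tx * 16]);
    }
    std::hint::black_box(sum);
}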

This is the type of black magic I'm looking for. Is there a way to do this on Rust stable? I'm okay with lots of unsafe here.

Output of Linux bandwidth tool:

Sequential read (64-bit), size = 256 B, loops = 515899392, 25144.0 MB/s
Sequential read (64-bit), size = 384 B, loops = 359311957, 26291.8 MB/s
Sequential read (64-bit), size = 512 B, loops = 270532608, 26373.8 MB/s
Sequential read (64-bit), size = 640 B, loops = 216425880, 26381.7 MB/s

Cloning a vec (takes 500-600 ms for 1 GB):

pub fn test_00() {
    let x = vec![2_u8; 1_000_000_000]; // 1 GB of data
    let start = std::time::Instant::now();
    let _y = x.clone(); // underscore avoids an unused-variable warning
    let end = std::time::Instant::now();
    println!("cloning 1 GB: {:?}", end - start);
}

So there is a reported ~10x difference between what I can achieve with Vec::clone and what the Linux bandwidth tool is claiming.

Okay, Rust can get ~20 GB/s via:

pub fn test_00() {
    use rayon::prelude::{IntoParallelRefIterator, ParallelIterator};
    use std::sync::Arc;

    // A 10 MB buffer, cloned 10_000 times in parallel => ~100 GB copied.
    let x = Arc::new(vec![2_u8; 10_000_000]);
    let start = std::time::Instant::now();

    let lst = vec![(); 10_000];

    let _ans = lst
        .par_iter()
        .map(|_| {
            let _y = x.as_ref().clone();
            0
        })
        .collect::<Vec<_>>();

    let end = std::time::Instant::now();
    println!("time: {:?}", end - start);
}

This does 100 GB in ~4-5 seconds, i.e. roughly 20-25 GB/s.

Oh, it looks like the corresponding instruction is available under core::arch::x86_64::_mm_prefetch().

Prefetching memory outside your allocation should be perfectly fine, so that means you can blindly prefetch memory without needing to do any bounds checks.
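
A minimal sketch of how that could be wired into the tile loop from the first post; the _MM_HINT_T0 locality hint and the 4-row look-ahead are guesses to tune:

#[cfg(target_arch = "x86_64")]
use core::arch::x86_64::{_mm_prefetch, _MM_HINT_T0};

pub fn tile_prefetch(img: &BigImage) -> Vec<Tile> {
    const W: usize = 8192;  // image width/height in pixels
    const T: usize = 16;    // tile width/height in pixels
    const AHEAD: usize = 4; // look-ahead distance in image rows (tune me)
    let mut tiles = Vec::with_capacity((W / T) * (W / T));
    for ty in 0..W / T {
        for tx in 0..W / T {
            let mut tile = [[0_u32; T]; T];
            for row in 0..T {
                let src = (ty * T + row) * W + tx * T;
                // Hint at the row AHEAD rows further down. wrapping_add
                // sidesteps pointer-arithmetic UB near the end of the
                // buffer, and _mm_prefetch never faults, so no bounds
                // check is needed.
                #[cfg(target_arch = "x86_64")]
                unsafe {
                    let p = img.data.as_ptr().wrapping_add(src + AHEAD * W);
                    _mm_prefetch::<_MM_HINT_T0>(p as *const i8);
                }
                tile[row].copy_from_slice(&img.data[src..src + T]);
            }
            tiles.push(Tile(tile));
        }
    }
    tiles
}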

6 Likes

Thanks! I'll give this a try; it's precisely the type of trick/technique I was hoping for.

  1. Code compiles and runs.

  2. _mm_prefetch appears to be having an effect, as I can definitely increase the runtime by fetching useless things. 🙂

  3. Not obvious which pattern improves performance yet. (tried the obvious ones)

6 Likes

I gave it a try:

input order: 4.679652550118881 GB/s
output order: 6.04657130938373 GB/s

2 Likes

Would you mind also benching vec![2_u8; 10_000_000_000], just so we can get a sense of what your machine's memory system is capable of?

0.56 s for 10_000_000_000 bytes (~16 GiB/s)

1 Like
