# How to optimize an image comparison function

Hi Rustaceans!

I'm writing a function to compare two `RgbImage` structs (from the `image` crate). I want a function that compares both and returns a value in the `0..1` range indicating how different the two images are, based on their RGB channel values and a simple subtraction. For example, a completely black image would be `1.0` different from a completely white image.

My current `diff()` function looks like this:

```rust
use image::RgbImage;

// Actual diff is weighted by luminosity bias
const LUMA_R: f64 = 0.2126;
const LUMA_G: f64 = 0.7152;
const LUMA_B: f64 = 0.0722;

pub fn diff(a: &RgbImage, b: &RgbImage) -> f64 {
    let (w, h) = a.dimensions();
    let num_pixels = w * h;

    let mut diff_sum_r: i32 = 0;
    let mut diff_sum_g: i32 = 0;
    let mut diff_sum_b: i32 = 0;

    let samples_a = a.as_flat_samples().samples; // Returns &[u8]
    let samples_b = b.as_flat_samples().samples; // Returns &[u8]

    let skip_step = 1; // Temporary; higher values skip pixels
    let mut pos: usize = 0;
    let pos_step: usize = skip_step * 3; // Fixed since we know the layout is RGBRGBRGB

    for _ in (0..num_pixels).step_by(skip_step) {
        diff_sum_r += (samples_a[pos] as i32 - samples_b[pos] as i32).abs();
        diff_sum_g += (samples_a[pos + 1] as i32 - samples_b[pos + 1] as i32).abs();
        diff_sum_b += (samples_a[pos + 2] as i32 - samples_b[pos + 2] as i32).abs();
        pos += pos_step;
    }

    let lr = LUMA_R / 255.0;
    let lg = LUMA_G / 255.0;
    let lb = LUMA_B / 255.0;
    let diff_sum = diff_sum_r as f64 * lr + diff_sum_g as f64 * lg + diff_sum_b as f64 * lb;

    diff_sum / (num_pixels as f64 / skip_step as f64)
}
```

This function works fine. My problem is that it's slower than I'd expect: on a release build, an image of 1500x1500 pixels takes about 10 ms to check. That's not terrible, but considering that other, more complicated code (using `clone()`, and even a rect painter with `put_pixels()`) executes in ~1 ms, it seems like I'm doing something wrong.

Those 10 ms already come after a lot of optimization! The code used to use pixel enumerations, `channel()`, etc. That `.as_flat_samples().samples` call seems to be the closest I can get to the raw data without extra conversions or copying.

I'm new to Rust (I come from GC languages, with some C/C++ experience), so I'm sure I just don't know which shortcuts to take to make the comparison more performant.

So, any tips on what I should be doing? The loop is of course the big bottleneck in the whole execution, but I think I've extracted all I can from it. I hate the two `as i32` casts, but they're necessary to get `abs()` to work on what were previously `u8`s.

Other things I've tried:

• `max(a, b) - min(a, b)`: slower
• `if a < b { b - a } else { a - b }`: same speed
• higher level calls like `get_pixel`, `enumerate_pixels`: also slower

Any other ideas?

C-style `for` loops are the worst in Rust. Iterators are the fastest.

Don't use indexing with `[]` in loops. It will add bounds checks, and possibility of panics from the bounds checks prevents optimizations of the code around them.

It's even worse when you combine `[]` with a `for _ in` loop, which obscures how the loop index relates to the loop counter, and prevents the optimizer from eliminating bounds checks.

You can use `get_unchecked(i)` instead of `[i]`. Or use `for (a, b) in samples_a.iter().zip(samples_b)` to iterate on both slices.

Maybe `samples_a.chunks_exact(3).zip(samples_b.chunks_exact(3))` will work too if the optimizer can see that the chunks have a fixed size and allow fast indexing into them. If you could cast `[u8*3]` to `[struct RGB]` then you wouldn't have to worry about offsets, and it'd optimize better.


Like @kornel said, you want to avoid loops, especially ones where you index into things. You can fold an iterator to get the values:

```rust
fn sums_chunked(samples_a: &[u8], samples_b: &[u8]) -> (i32, i32, i32) {
    samples_a
        .chunks_exact(3)
        .zip(samples_b.chunks_exact(3))
        .fold((0, 0, 0), |(r, g, b), (p_a, p_b)| {
            (
                r + (p_a[0] as i32 - p_b[0] as i32).abs(),
                g + (p_a[1] as i32 - p_b[1] as i32).abs(),
                b + (p_a[2] as i32 - p_b[2] as i32).abs(),
            )
        })
}
```

I don't know if the optimizer will know that the chunk accesses are fine, so here's a version with a custom adapter that might optimize better (or not, I haven't checked):

```rust
struct Triples<I, T>
where
    I: Iterator<Item = T>,
{
    iterator: I,
}

impl<I, T> Triples<I, T>
where
    I: Iterator<Item = T>,
{
    fn new(iterator: I) -> Self {
        Self { iterator }
    }
}

impl<I, T> Iterator for Triples<I, T>
where
    I: Iterator<Item = T>,
{
    type Item = (T, T, T);

    fn next(&mut self) -> Option<Self::Item> {
        let a = self.iterator.next()?;
        let b = self.iterator.next()?;
        let c = self.iterator.next()?;
        Some((a, b, c))
    }
}

fn sums_triples(samples_a: &[u8], samples_b: &[u8]) -> (i32, i32, i32) {
    let diffs = samples_a
        .iter()
        .copied()
        .zip(samples_b.iter().copied())
        .map(|(a, b)| (a as i32 - b as i32).abs());

    Triples::new(diffs).fold((0, 0, 0), |(r, g, b), (d_r, d_g, d_b)| {
        (r + d_r, g + d_g, b + d_b)
    })
}
```
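To get a quick feel for what the adapter yields, here's a condensed, self-contained version (same behavior, written against the `I::Item` associated type) that groups any iterator into triples:

```rust
// Condensed version of the Triples adapter: wraps any iterator and yields
// its items three at a time, dropping a trailing partial group.
struct Triples<I: Iterator> {
    iterator: I,
}

impl<I: Iterator> Iterator for Triples<I> {
    type Item = (I::Item, I::Item, I::Item);

    fn next(&mut self) -> Option<Self::Item> {
        Some((
            self.iterator.next()?,
            self.iterator.next()?,
            self.iterator.next()?,
        ))
    }
}

fn main() {
    let triples: Vec<_> = Triples { iterator: 1..=6 }.collect();
    assert_eq!(triples, vec![(1, 2, 3), (4, 5, 6)]);
}
```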

Thanks, friends! This is really informative and has really helped me understand how well rustc optimizes code.

I'm still digesting it all, but for reference, here's some info on my optimization efforts.

Iterating over two slices at once is not something I knew I could do! I think the reason my `enumerate_pixels()` version was slower is that I was iterating over `a` but indexing into `b`, so any optimization was lost.

I did a first pass iterating over the samples with `.zip()` and `chunks_exact()`:

```rust
for (p_a, p_b) in samples_a.chunks_exact(3).zip(samples_b.chunks_exact(3)) {
    diff_sum_r += (p_a[0] as i32 - p_b[0] as i32).abs();
    diff_sum_g += (p_a[1] as i32 - p_b[1] as i32).abs();
    diff_sum_b += (p_a[2] as i32 - p_b[2] as i32).abs();
}
```

That already moves my average iteration time down from 9 ms to 7 ms. Changing `a[n]` to `*a.get_unchecked(n)` (and the same for `b`) inside an `unsafe {}` block around the whole loop didn't change anything, so maybe it does allow fast indexing? The compiler doesn't catch an out-of-range index into `p_a` at compile time, though. I wasn't able to cast to a `[u8; 3]` (I don't know how), so I wonder if there's anything to be gained there at all. Still checking.

The second solution, `sums_chunked()`, also gave times of about 7 ms on average. It's cool to learn about `fold()`. In this case it doesn't seem to add any additional optimization, but it's an elegant extraction of the loop.

The version with the custom adapter is interesting; a lot of new stuff for me to read and learn about, for sure. In this case it didn't help much, though; the average time went to about 8 ms. (I guess I need to start measuring sub-ms averages now.)

Anyway, I think in all of those cases I'm now hitting a performance wall because of the `i32` casts and the calls to `abs()`: the bottleneck has moved from reading the values to actually calculating with them.

As a last clue: if I use a single flat iterator over the samples, without the chunks (and thus no indexing inside the loop), I get a 4 ms result. It doesn't help in my case, since I need different calculations for each channel, but it might indicate that the chunk indexing is indeed not optimized, since that's the only practical difference in the loop body.
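For what it's worth, if the casts and `abs()` are the wall, newer Rust (1.60+) has `u8::abs_diff`, which computes the absolute difference directly on unsigned values; a sketch of what the chunked loop could look like with it (the helper name `sums_abs_diff` is made up here):

```rust
// Per-channel sums of absolute differences using u8::abs_diff
// (stable since Rust 1.60), avoiding the i32 casts and abs() entirely.
fn sums_abs_diff(samples_a: &[u8], samples_b: &[u8]) -> (u64, u64, u64) {
    samples_a
        .chunks_exact(3)
        .zip(samples_b.chunks_exact(3))
        .fold((0, 0, 0), |(r, g, b), (p_a, p_b)| {
            (
                r + p_a[0].abs_diff(p_b[0]) as u64,
                g + p_a[1].abs_diff(p_b[1]) as u64,
                b + p_a[2].abs_diff(p_b[2]) as u64,
            )
        })
}

fn main() {
    let sums = sums_abs_diff(&[0, 0, 0, 10, 20, 30], &[5, 5, 5, 0, 0, 0]);
    assert_eq!(sums, (15, 25, 35));
}
```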

The compiler may elide bounds checks when it can prove them at compile time, but that's just a performance optimization. Bounds violations are a runtime error, so the compiler will never fail to compile just because it knows you've indexed outside an array; it could, however, optimize the access into an unconditional panic.
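A tiny demonstration of that point: a runtime-index access always compiles, and an out-of-bounds access only shows up as a panic when the code actually runs:

```rust
use std::panic;

// Indexing with a runtime value always compiles; the bounds check fires at runtime.
fn get(v: &[i32], i: usize) -> i32 {
    v[i]
}

fn main() {
    let v = [1, 2, 3];
    assert_eq!(get(&v, 1), 2);

    // Out-of-bounds is a runtime panic, not a compile error.
    let result = panic::catch_unwind(|| get(&v, 10));
    assert!(result.is_err());
}
```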

