Scaling algorithm

I am making a scaling algorithm for my homemade bitmap.
Is there maybe a way to speed things up for the "one line (repeat pixels)"?
In this example we are scaling up with an int value (2,3,4 etc.)

Maybe a speedy "copy one value multiple times to dst pointer"?
Could an multithread operation speed things up?
Any tips welcome...

Pseudo code:

foreach source line
{
	let mut src = src_line_ptr; // unsafe pixel pointer to beginning of source line
	let mut dst = dst_line_ptr; // unsafe pixel pointer to beginning of dest line

	// one line (repeat pixels)
	for x in 0..self.width
	{
		for dx in 0..scale
		{
			*dst = *src;
			dst = dst.add(1);
		}
		src = src.add(1);
	}
	// increment line pointers etc.
}

I would expect llvm to generate code that branches on scale and then optimize the code using simd instructions. But it's better to check.

Aha yes I was thinking that.... how can I check it?

I like https://rust.godbolt.org/

You can put a small sample code and look at the assembly for different compiler options. Make sure to use -Copt-level=3 for a release builds.

With raw pointers this might be hard for the compiler to vectorize because src and dst could overlap. Just hoisting the load *src out of the innermost loop is not necessarily sound to the compiler if dst and src could alias. But you know they don't and can move the load out of the loop.

Is the scaling factor known at compile-time? If so how much faster is the code when you make the scaling factor a constant?

1 Like

That is correct. When doing that, the compiler will optimize the inner loop to a call of memset, and unroll the outer loop for multiples of 8. Compiler Explorer

Manual loops and raw pointers will give you C-like performance, which isn't the fastest that Rust can do. Rust can be faster than this when you use safe types with extra guarantees.

For example, slice.copy_from_slice() will give the optimizer a guarantee that source and destination never overlap, which allows generating simpler, faster code. The manual loops and raw pointers don't give such guarantee, so the compiler will need to use a slower byte-by-byte copy, or at least emit a check and generate both slower and faster version.

For similar reason in many cases Rust Iterators will be faster than for x loops and raw pointer arithmetic.

I recommend starting with safe code. Use the standard library whenever it has something applicable. Even for unsafe pointers there are functions like ptr::copy_nonoverlapping. Then check if it optimizes well. You can use assert! outside of loops to give hints to the optimizer.

6 Likes

It's possible. If you want to experiment, add rayon and try out its parallel iterators.

However, invoking additional threads, or preparing to be able to do it based on workload as rayon does, does have its own costs β€” loops become less simple. So if you do this, you want to parallelize only some β€œchunks” rather than individual pixels. A simple strategy would be to write a rayon parallel iteration over lines of your image, but use the existing code you've got for the individual pixels of each line. You would call par_chunks_mut() on your output vector to get an iterator over &mut [YourPixelType]s, and write into those using a plain for x in 0..self.width.

(And as was already mentioned β€” you don't need to, and therefore shouldn't, use raw pointers for this.)

3 Likes

Thanks for the tips. I was suspecting safe code could potentially be faster and better.
So I will try out some slice and iterator magic for the lines and individual pixels first.
And maybe compare the performance with raw pointers.

Hmm, I thought this is what copy_nonoverlapping is for?

You make the non-aliasing guarantee when you call copy_nonoverlapping. If you don't call it, and make a loop with .add(1), then you'll get longer code that handles overlapping pointers.

Both pointers are advancing at different speeds. You have to keep reloading from src in case dst catches up with it.

FYI: this is the rewritten complete routine. It performs about the same as the pointer based function.
with 320x160 to 3200x1600 (10 times enlarge). Rgba is a u32 struct.
Although the copies are definitely non-overlapping i could not find a slice-based routine for that.

    pub unsafe fn draw_to_scaled_up_rusty(&self, dst: &mut Self, scale: usize)
    {
        debug_assert!(dst.width == self.width * scale as i32);
        debug_assert!(dst.height == self.height * scale as i32);

        let mut dst_y: i32 = 0;
        for y in 0..self.height
        {
            let src_line_slice: &[Rgba] = self.get_line_slice(y);
            let dst_line_slice: &mut[Rgba] = dst.get_line_slice_mut(dst_y);

            let mut src_x: usize = 0;
            let mut dst_x: usize = 0;

            // one line (repeat pixels)
            for &src_pixel in src_line_slice.iter()
            {
                dst_line_slice[dst_x..dst_x + scale].fill(src_pixel);
                src_x += 1;
                dst_x += scale;
            }

            // copy line to multiple lines
            let fs: usize = dst.index_of_xy(0, dst_y);
            for i in 1..scale
            {
                let ds: usize = dst.index_of_xy(0, dst_y + i as i32);
                dst.pixels.copy_within(fs..fs + dst.width as usize, ds);
            }
            dst_y += scale as i32;
        }
    }

You don't need to ask for a specifically non-overlapping copy routine for slice references, because the borrow checker ensures there is never overlap.

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.