SciPy gaussian_filter port

Hi all. I've been trying to port a Python program at my job and it went well for the most parts except when I tried to code a Rust version of scipy.ndimage.filters.gaussian_filter doc. In brief, it's doing a convolution between some padded data and a weights vector, for all element of the input array.

Here's my clean version. The problem is that it's super slow! (Yes, I'm calling this function thousands of times) For an input image of dimension [134, 156, 130] and a weights array of length 5, this version takes 180ms to run, while SciPy takes around 23ms.

So I tried several things in the past days... going from a clean version to an unsafe low level version. Here's the fast version (you can call the tests and bench if you clone this project). I actually got faster then SciPy at around 17ms. The slow parts were

  • ndarray high level tools like Zip and dot.
  • to much math operations for indexing.
  • the bounds check.

The trouble is:

  • OMG, it's ugly! Rust has spoiled me and now I'm used to beautiful code :slight_smile:
  • The assert! makes everything faster, which seems weird. It's probably because it allows the compiler to avoid some bounds check, but I'm using unsafe r/w so I don't understand why it matters.
  • I can't use unsafe at work. It's forbidden. Of course I could put this in an open source crate and then I'll be able to use it at work but I would argue that there should be no unsafe even on an official crate.

Is there a way to remove the unsafe and stay fast "enough"? We don't really care if we're 5-10% slower than SciPy but 50% slower is too slow! Right now, replacing the r/w with their safe version makes the application around 55% slower. So, do you have any suggestions?

1 Like

Can you describe the memory access pattern of gaussian_filter/clean.rs at master · nilgoyette/gaussian_filter · GitHub ? Is it running through memory in sequence, or is it jumping all over memory?

In the case D = Ix3 (3 dimensions), it's running through memory in sequence only in one of the dimension. The first or the last depending if data is in 'c' or 'f' order. The other dimensions will have a stride > 1. That's one of the reason why the buffer is useful. The other reason being that I need to pad the edges.

Have you tried putting the buffers out of the loop?

let mut output = Array::zeros(data.dim());
let mut buffer = Array1::zeros(n + 2 * half);
1 Like

I am not familiar with nd-array notation. For the cases where stride > 1, are you already doing:

  1. do n reads (of stride > 1), read into Vec of length n
  2. do 1-d convolution on Vec of length n
  3. do n writes (of stride > 1)

?

@elsuizo No, I didn't. It would be a single allocation instead of 3. I'll try. It won't fix the unsafe problem though.

@zeroexcuses If I understand correctly what you suggest, yes, this is what lanes() does.

Zip::from(data.lanes(Axis(d))).and(output.lanes_mut(Axis(d)))

This is zipping over all 1D arrays of data and output along the current direction. I removed Zip in my fast version because it was too slow, but I didn't have the heart to replace lanes() with a manual version. It could be worth a try, but it won't fix the unsafe problem.

yeah i test in my computer and not difference.Can you share how you measure times in python programs?

It's written in the readme of the repo.

Can you call ArrayBase in ndarray - Rust on the slice? I'm still not 100% convinced it's contiguous in memory for every axis (and if it's not, we have horrible memory access cache pattern during the dot product).

I'm still not 100% convinced it's contiguous in memory for every axis

No no! I may have been unclear, sorry. As I wrote, it's NOT contiguous for every axis. It's contiguous only for one of the axis (0 or 2) but we can't really take advantage of this because the other axes won't be.

The dot product is slow even with a contiguous buffer (see clean version). The way I see it, there should be no "fancy" call in the inner loop. Slicing (.slice(s![..])) and dot seem to be fancy calls :slight_smile:

Therefore, create a temporary vec, copy the non-contiguous axis into the contiguous vector so that loop-dot-product uses contiguous memory.

Yes ... but this is exactly what I'm doing in the clean version. This is what buffer is used for. It's an Array1 in the clean version and a Vec in the fast version. Maybe I should have commented each ndarray line before asking for help. It might not be the easiest library to understand when you've never used it. I could have asked on the ndarray "forum" but the maintainers are really busy nowadays.

Btw, I'm a fairly experienced programmer. I don't think that I did an "easy" error like the ones you suggest. That's why my question was about "removing unsafe without getting too slow" and not on general optimization tricks, because I already know the stuff about branch prediction, memory contiguity, never calculating the same results, etc.

My fault, I somehow missed this.