Hi all. I've been trying to port a Python program at my job and it went well for the most parts except when I tried to code a Rust version of scipy.ndimage.filters.gaussian_filterdoc. In brief, it's doing a convolution between some padded data and a weights vector, for all element of the input array.
Here's my clean version. The problem is that it's super slow! (Yes, I'm calling this function thousands of times) For an input image of dimension [134, 156, 130] and a weights array of length 5, this version takes 180ms to run, while SciPy takes around 23ms.
So I tried several things in the past days... going from a clean version to an unsafe low level version. Here's the fast version (you can call the tests and bench if you clone this project). I actually got faster then SciPy at around 17ms. The slow parts were
ndarray high level tools like Zip and dot.
to much math operations for indexing.
the bounds check.
The trouble is:
OMG, it's ugly! Rust has spoiled me and now I'm used to beautiful code
The assert! makes everything faster, which seems weird. It's probably because it allows the compiler to avoid some bounds check, but I'm using unsafe r/w so I don't understand why it matters.
I can't use unsafe at work. It's forbidden. Of course I could put this in an open source crate and then I'll be able to use it at work but I would argue that there should be no unsafe even on an official crate.
Is there a way to remove the unsafe and stay fast "enough"? We don't really care if we're 5-10% slower than SciPy but 50% slower is too slow! Right now, replacing the r/w with their safe version makes the application around 55% slower. So, do you have any suggestions?
In the case D = Ix3 (3 dimensions), it's running through memory in sequence only in one of the dimension. The first or the last depending if data is in 'c' or 'f' order. The other dimensions will have a stride > 1. That's one of the reason why the buffer is useful. The other reason being that I need to pad the edges.
This is zipping over all 1D arrays of data and output along the current direction. I removed Zip in my fast version because it was too slow, but I didn't have the heart to replace lanes() with a manual version. It could be worth a try, but it won't fix the unsafe problem.
Can you call ArrayBase in ndarray - Rust on the slice? I'm still not 100% convinced it's contiguous in memory for every axis (and if it's not, we have horrible memory access cache pattern during the dot product).
I'm still not 100% convinced it's contiguous in memory for every axis
No no! I may have been unclear, sorry. As I wrote, it's NOT contiguous for every axis. It's contiguous only for one of the axis (0 or 2) but we can't really take advantage of this because the other axes won't be.
The dot product is slow even with a contiguous buffer (see clean version). The way I see it, there should be no "fancy" call in the inner loop. Slicing (.slice(s![..])) and dot seem to be fancy calls
Yes ... but this is exactly what I'm doing in the clean version. This is what buffer is used for. It's an Array1 in the clean version and a Vec in the fast version. Maybe I should have commented each ndarray line before asking for help. It might not be the easiest library to understand when you've never used it. I could have asked on the ndarray "forum" but the maintainers are really busy nowadays.
Btw, I'm a fairly experienced programmer. I don't think that I did an "easy" error like the ones you suggest. That's why my question was about "removing unsafe without getting too slow" and not on general optimization tricks, because I already know the stuff about branch prediction, memory contiguity, never calculating the same results, etc.