I am looking at the performance of Rust for an algorithm that has branches in addition to floating point operations. The code shown below runs about 2 times slower than my equivalent C++ code. Here are my questions:
What improvements could I make so that the code is more idiomatic Rust?
At one point the function returned the output vector, but I dropped that implementation because of its performance. What other tricks can I use to make the code run faster?
The input vector is created with let mut input: Vec<f32> = Vec::new(); and the output vector in the same manner, using let mut output: Vec<T> = Vec::new();. Are these vectors aligned on a 32-bit boundary?
You might want to have a look at this thread from a couple of weeks ago: it has a relatively thorough investigation into performance improvements of floating-point code.
In terms of idiomaticness, I'd rewrite your loop to factor out the common write to destination_value, something like:
let transform = |&cur| {
if cur == null_value {
null_replacement
} else if cur < min_value {
min_value
} else if cur > max_value {
max_value
} else {
cur * scale + shift
}
};
for (val, dest) in input.iter().map(transform).zip(output.iter_mut()) {
*dest = val;
}
According to Godbolt, this generates different assembly than the original, but I'm not sure if it's actually any faster (or slower). I'm not too knowledgeable in optimization, but this seems like somewhere the C++ compiler may be autovectorizing. You could try implementing it using SIMD and operating over f32x8s, but that'd have to be done by hand.
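One stable-Rust way to nudge the autovectorizer, short of hand-written SIMD, is to process the data in fixed-width chunks so the inner loop has a trip count LLVM can see. This is a sketch under my own assumptions (the function name, the parameter set, and the chunk width of 8 are my choices, not from the thread), reusing the transform closure above:

```rust
/// Sketch: scale-and-shift with null/min/max handling, written so the
/// 8-wide inner loop has a fixed trip count that LLVM can vectorize.
fn scale_and_shift(
    input: &[f32],
    output: &mut [f32],
    null_value: f32,
    null_replacement: f32,
    min_value: f32,
    max_value: f32,
    scale: f32,
    shift: f32,
) {
    assert_eq!(input.len(), output.len());
    let transform = |cur: f32| {
        if cur == null_value {
            null_replacement
        } else if cur < min_value {
            min_value
        } else if cur > max_value {
            max_value
        } else {
            cur * scale + shift
        }
    };
    let in_chunks = input.chunks_exact(8);
    let in_rem = in_chunks.remainder();
    let out_chunks = output.chunks_exact_mut(8);
    for (src, dst) in in_chunks.zip(out_chunks) {
        // Fixed-width inner loop: trip count of 8 is known at compile time.
        for i in 0..8 {
            dst[i] = transform(src[i]);
        }
    }
    // Scalar tail for lengths that are not a multiple of 8.
    let tail_start = input.len() - in_rem.len();
    for (src, dst) in in_rem.iter().zip(&mut output[tail_start..]) {
        *dst = transform(*src);
    }
}
```

Whether this actually beats the plain zip loop would need checking on Godbolt or with a benchmark; the point is only to give the optimizer a constant-length inner loop to work with.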
Can you post your original C++ code so that we know what we are up against when implementing it in Rust?
I'm a relative newbie to Rust, but over the last year I have recast numerous bits of C and C++ into Rust and have always been able to match them in performance, and sometimes do better.
What I have learned along the way is:
Don't listen to all the advice to use the functional programming style, iter this, zip that. Often that results in more complex, obfuscated code that performs badly.
BUT, when you get it right Rust will fly.
Be wary of those who suggest "unsafe" to get performance. I'm sure it is required sometimes, but it has never been true for me.
export
void
scaleAndShift_ispc(uniform int N,
uniform float input[],
uniform float output[],
uniform float nullValue,
uniform float nullReplacement,
uniform float minValue,
uniform float maxValue,
uniform float scale,
uniform float shiftValue,
uniform bool verbose = true)
{
N = N & ~(programCount-1); // Using aligned memory
foreach (i = 0 ... N)
{
const float currentValue = input[i];
float value = currentValue * scale + shiftValue;
if (currentValue == nullValue)
{
value = nullReplacement;
}
if (currentValue < minValue)
{
value = minValue;
}
if (currentValue > maxValue)
{
value = maxValue;
}
output[i] = value;
}
}
Thanks for the extra set of eyes. It has been fun to learn generics and now to see how I can convert this old functional code into something that makes sense in Rust.
How fast is clang without flags like -ffast-math or -Ofast, which trade correctness for speed? Just -O3? That would be a fairer comparison, since both clang and rustc use LLVM as a backend.
Please don't refer to other people's code as "unintelligible line noise" (or not "human readable") just because it uses coding styles or features that are less familiar to you. Even when meant in fun, some people find this discouraging. Also, it often causes derails into subjective arguments about style, in topics that were originally not about that.
Sometimes I forget that in this modern world we are supposed to keep our feelings in our pockets and pretend we don't have an opinion for fear of upsetting someone, somewhere.
The challenges with this code broadly fall into two areas. The first is all of the conditionals: they cause modern x86 processors to stall. That is the reason I included the AVX version, which has no processor stalls. The second is that this is a simplified version of DAXPY. DAXPY is a constant multiplied by a vector, plus a vector. Here we have a constant multiplied by a vector plus a constant, as well as the branching.
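Spelled out (writing s for scale and t for shift, and using the override order from the posted kernels, where later checks win), the two computations are:

```latex
% DAXPY, as in BLAS:
y_i \leftarrow \alpha x_i + y_i
% The kernel here (last-write-wins: null, then min, then max):
\mathrm{out}_i =
\begin{cases}
\text{maxValue} & \text{if } \mathrm{in}_i > \text{maxValue},\\
\text{minValue} & \text{else if } \mathrm{in}_i < \text{minValue},\\
\text{nullReplacement} & \text{else if } \mathrm{in}_i = \text{nullValue},\\
s \, \mathrm{in}_i + t & \text{otherwise.}
\end{cases}
```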
Interestingly enough, the solution posted by @asymmetrikon produces pretty good vectorized code when compiled with -C opt-level=3 -C target-cpu=haswell -- godbolt.
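For anyone reproducing this locally rather than on godbolt, the same flags can be passed directly (the file name here is hypothetical; substitute your own):

```shell
# Inspect the generated assembly with the flags used above
rustc -C opt-level=3 -C target-cpu=haswell --emit asm scale_and_shift.rs

# Equivalent for a cargo project
RUSTFLAGS="-C opt-level=3 -C target-cpu=haswell" cargo build --release
```

Note that -C target-cpu=haswell (or native) is what enables AVX2 codegen; without it, rustc targets a baseline x86-64 and the loop will at best use SSE2.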
@jbw I know it may not be possible, but I will ask anyway: are you able to provide a version that compiles on Compiler Explorer? We are missing AlignmentAllocator_t, and even if we can imagine what it does, it would help quite a lot to have something more to work with.
I think that it could be interesting to see if using some sort of aligned vector (using this crate for instance) could help the compiler. @vorner also wrote a post pretty recently about trying to "cheat" with SIMD, maybe his crate could help as well.
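Without pulling in a crate, one way to get 32-byte-aligned storage on stable Rust is an over-aligned wrapper type; this is a generic sketch of the idea (the type and function names are mine, and this is not the API of the crates linked above):

```rust
/// A 32-byte-aligned block of eight f32s. A Vec of these is guaranteed
/// to start on a 32-byte boundary, which is what AVX aligned loads want.
#[repr(C, align(32))]
#[derive(Clone, Copy, Default)]
struct AlignedBlock([f32; 8]);

/// Allocate a zeroed, 32-byte-aligned buffer of len_blocks * 8 floats.
fn aligned_f32_buffer(len_blocks: usize) -> Vec<AlignedBlock> {
    vec![AlignedBlock::default(); len_blocks]
}
```

You would still need to view the Vec<AlignedBlock> as a flat &[f32] at the call site, e.g. via a pointer cast or a crate like bytemuck, and whether the alignment actually buys anything here would need measuring.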
Of course, the last resort would be to use explicit AVX intrinsics like you already did in the C++ version, but that is obviously pretty hard to maintain.
EDIT: this version seems to work slightly better because the slice has an alignment of 256 bytes, but it's still pretty far from clean vectorized code.