Performance questions

Warning: new Rust user here.

I am looking at the performance of Rust for an algorithm that has branches in addition to floating-point operations. The code shown below runs about 2 times slower than my equivalent C++ code. Here are my questions:

  • What improvements could I make so that the code is more idiomatic Rust?
  • At one time the function was returning the output vector, but I dropped that implementation because of performance. What other tricks can I use to make the code run faster?
  • The input vector is created by let mut input: Vec<f32> = Vec::new(); and the output vector in the same manner using let mut output: Vec<T> = Vec::new();. Are these vectors aligned on a 32-byte boundary?
use std::ops::{Add, Mul};

fn doit<T>(input: &[T],
           output: &mut [T],
           null_value: T,
           null_replacement: T,
           min_value: T,
           max_value: T,
           scale: T,
           shift: T,
           _verbose: bool)
    where T: PartialOrd + Copy + Mul<Output=T> + Add<Output=T> + Default
{
    for (destination_value, current_value) in output.iter_mut().zip(input)
    {
        if *current_value == null_value
        {
            *destination_value = null_replacement;
            continue;
        }

        if *current_value < min_value
        {
            *destination_value = min_value;
            continue;
        }

        if *current_value > max_value
        {
            *destination_value = max_value;
            continue;
        }

        *destination_value = (*current_value) * scale + shift;
    }
}
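
On the alignment question, a quick empirical check (a sketch, not part of the original code): Vec only guarantees its element type's alignment, which is 4 bytes for f32, so any 32-byte alignment is an accident of the allocator.

```rust
// Sketch: inspect the actual alignment of a Vec<f32>'s buffer.
// Vec guarantees only align_of::<f32>() == 4 bytes; 32-byte alignment,
// as the C++ AlignmentAllocator_t provides, is not guaranteed.
fn main() {
    let v: Vec<f32> = vec![0.0; 1024];
    let addr = v.as_ptr() as usize;
    println!("addr % 4  = {}", addr % 4);   // always 0
    println!("addr % 32 = {}", addr % 32);  // allocator-dependent
}
```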

You might want to have a look at this thread from a couple of weeks ago: it has a relatively thorough investigation into performance improvements of floating-point code.


I've read the thread, and now I'll try to implement something that is faster than my C++ code, C, assembly, or ISPC.

In terms of idiomatic-ness, I'd rewrite your loop code to factor out the common writing to destination_value, like

let transform = |&cur| {
    if cur == null_value {
        null_replacement
    } else if cur < min_value {
        min_value
    } else if cur > max_value {
        max_value
    } else {
        cur * scale + shift
    }
};

for (val, dest) in input.iter().map(transform).zip(output.iter_mut()) {
    *dest = val;
}

According to Godbolt, this generates different assembly than the original, but I'm not sure if it's actually any faster (or slower). I'm not too knowledgeable in optimization, but this seems like a place where C++ is maybe autovectorizing. You could try implementing it using SIMD and operating over f32x8s, but that'd have to be done by hand.
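
Another reshaping that often helps the auto-vectorizer (a sketch with illustrative names, not from the original post): compute the scaled value unconditionally, then overwrite it with the special cases. The straight-line selects are easier for LLVM to turn into vector blends than the early `continue`s; the null check runs last so that, as in the original, the null replacement wins over the range clamps.

```rust
// Sketch: select-style body instead of early exits.
fn doit_select(input: &[f32], output: &mut [f32],
               null_value: f32, null_replacement: f32,
               min_value: f32, max_value: f32,
               scale: f32, shift: f32) {
    for (dest, &cur) in output.iter_mut().zip(input) {
        let mut v = cur * scale + shift;
        if cur < min_value { v = min_value; }
        if cur > max_value { v = max_value; }
        // last, so the null sentinel takes precedence over the clamps
        if cur == null_value { v = null_replacement; }
        *dest = v;
    }
}

fn main() {
    let input = [-1.0f32, 0.5, 2.0, 10.0];
    let mut output = [0.0f32; 4];
    // null = -1.0 -> 99.0; 0.5 -> 0.5*2+1 = 2.0; 2.0 -> 5.0;
    // 10.0 is above max_value 5.0, so it becomes 5.0
    doit_select(&input, &mut output, -1.0, 99.0, 0.0, 5.0, 2.0, 1.0);
    println!("{:?}", output); // [99.0, 2.0, 5.0, 5.0]
}
```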

This sounds like an interesting challenge.

Can you post your original C++ code so that we know what we are up against when implementing it in Rust?

I'm a relative newbie to Rust, but over the last year I have recast numerous bits of C and C++ into Rust and have always been able to match them in performance, and sometimes do better.

What I have learned along the way is:

  1. Don't listen to all the advice to use the functional programming style, iter this, zip that. Often that results in more complex, obfuscated code that performs badly.

  2. BUT, when you get it right Rust will fly.

  3. Be wary of those who suggest "unsafe" to get performance. I'm sure that is sometimes required, but it has never been true for me.


Here is the C++ code:

namespace Test
{

   template< typename T >
   boost::chrono::microseconds
   straightForward(const std::vector< T, Utils::AlignmentAllocator_t< T, 32 > > &input,
                   std::vector< T, Utils::AlignmentAllocator_t< T, 32 > > &output,
                   const T nullValue,
                   const T nullReplacement,
                   const T minValue,
                   const T maxValue,
                   const T scale,
                   const T shift,
                   const bool verbose = true)
   {
      typedef std::vector< T, Utils::AlignmentAllocator_t< T, 32 > > Container_t;

      if (boost::distance(input) != boost::distance(output))
      {
         const std::string message = boost::str(boost::format("sizes differ expected: %d found: %d")
                                                % boost::distance(input)
                                                % boost::distance(output));

         BOOST_THROW_EXCEPTION(Exception::Arithmetic_t()
                               << Exception::ErrorString_t(message));
      }

      const Utils::HighResolutionTimer_t timer;

      typename Container_t::const_iterator srcIter;
      typename Container_t::iterator destIter;
      const typename Container_t::const_iterator srcEnd = boost::end(input);

#ifdef __INTEL_COMPILER
#pragma ivdep
#endif /* __INTEL_COMPILER */

      for (srcIter = boost::begin(input), destIter = boost::begin(output);
           srcIter != srcEnd;
           ++srcIter, ++destIter)
      {
         const typename Container_t::value_type currentValue = *srcIter;

         if (currentValue == nullValue)
         {
            *destIter = nullReplacement;
            continue;
         }

         if (currentValue < minValue)
         {
            *destIter = minValue;
            continue;
         }

         if (currentValue > maxValue)
         {
            *destIter = maxValue;
            continue;
         }

         *destIter = currentValue * scale + shift;
      }

      const boost::chrono::microseconds duration
         = boost::chrono::duration_cast< boost::chrono::microseconds >(timer.elapsed());

      return duration;
   } // straightForward
} // namespace Test

Normally, the fastest code on Intel is written using ISPC:


export
void
scaleAndShift_ispc(uniform int N,
                   uniform float input[],
                   uniform float output[],
                   uniform float nullValue,
                   uniform float nullReplacement,
                   uniform float minValue,
                   uniform float maxValue,
                   uniform float scale,
                   uniform float shiftValue,
                   uniform bool verbose = true)
{
   N = N & ~(programCount - 1); // Using aligned memory
   foreach (i = 0 ... N)
   {
      const float currentValue = input[i];
      float value = currentValue * scale + shiftValue;

      if (currentValue == nullValue)
      {
         value = nullReplacement;
      }

      if (currentValue < minValue)
      {
         value = minValue;
      }

      if (currentValue > maxValue)
      {
         value = maxValue;
      }

      output[i] = value;
   }
}

Thanks for the extra set of eyes. It has been fun to learn generics and now to see how I can convert this old functional code into something that makes sense in Rust.

For completeness, this is the fastest single thread AVX code that I can write:

namespace Test
{
   boost::chrono::microseconds 
   AVX(const std::vector< float, Utils::AlignmentAllocator_t< float, 32 > > &input,
       std::vector< float, Utils::AlignmentAllocator_t< float, 32 > > &output,
       const float nullValue,
       const float nullReplacement,
       const float minValue,
       const float maxValue,
       const float scale,
       const float shift,
       const bool verbose = true)
   {
      if (boost::distance(input) != boost::distance(output))
      {
         const std::string message = boost::str(boost::format("sizes differ expected: %d found: %d")
                                                % boost::distance(input)
                                                % boost::distance(output));         
         
         BOOST_THROW_EXCEPTION(Exception::Arithmetic_t()
                               << Exception::ErrorString_t(message));
      }
      
      //
      // Ignore unrolling 
      //

      const std::size_t todo = boost::distance(input);
      
      if ((todo % 4) != 0)
      {
         const std::string message = boost::str(boost::format("Only supports multiples of 4.  Left over %d")
                                                % (todo % 4));
         
         BOOST_THROW_EXCEPTION(Exception::Arithmetic_t()
                               << Exception::ErrorString_t(message));
      }

      const Utils::HighResolutionTimer_t timer;

#ifdef __AVX__
      //
      // A conditional of the form: if (x) y=a; else y=b;
      //
      // can be rewritten as: y = (x & a) | (~x & b);
      //
      // The advantage of using the bit-wise operators is that
      // branching, with its associated overhead is not needed.
      //

      const float *srcIter;
      float *destIter;
      std::size_t i;
      for (srcIter = &input[0], destIter = &output[0], i = 0;
           i < todo;
           i += 8, srcIter += 8, destIter += 8)
      {
         //
         // Load the data into the register
         //

         const __m256 in = _mm256_load_ps(srcIter);
         
         //
         // Handle missing observation
         //
         
         const __m256 nullValueVector = _mm256_set1_ps(nullValue);
         const __m256 nullValueReplacementVector = _mm256_set1_ps(nullReplacement);
         const __m256 nullMask = _mm256_cmp_ps(in, nullValueVector, _CMP_EQ_OS);
         __m256 out = _mm256_blendv_ps(in, nullValueReplacementVector, nullMask);

         //
         // Handle numbers less than or equal to minValue
         //
         
         const __m256 minValueVector = _mm256_set1_ps(minValue);
         const __m256 minMask = _mm256_cmp_ps(in, minValueVector, _CMP_LE_OS);
         out = _mm256_blendv_ps(out, minValueVector, minMask);

         //
         // Handle numbers greater than or equal to maxValue
         //
         
         const __m256 maxValueVector = _mm256_set1_ps(maxValue);
         const __m256 maxMask = _mm256_cmp_ps(in, maxValueVector, _CMP_GE_OS);
         out = _mm256_blendv_ps(out, maxValueVector, maxMask);

         //
         // Shift and scale
         //
         
         __m256 scaleAndShiftMask = _mm256_or_ps(nullMask, minMask);
         scaleAndShiftMask = _mm256_or_ps(scaleAndShiftMask, maxMask);
         __m256 scaleAndShift = _mm256_andnot_ps(scaleAndShiftMask, out);
         const __m256 scaleVector = _mm256_set1_ps(scale);
         scaleAndShift = _mm256_mul_ps(scaleAndShift, scaleVector);
         const __m256 shiftVector = _mm256_set1_ps(shift);
         scaleAndShift = _mm256_add_ps(scaleAndShift, shiftVector);

         //
         // flip left and right since we would otherwise need ~scaleAndShiftMask,
         // or simply use _mm256_blendv_ps
         //
         
         out = _mm256_blendv_ps(scaleAndShift, out, scaleAndShiftMask);

         //
         // Store the data
         //

         _mm256_store_ps(destIter, out);
      } // for

#endif /* __AVX__ */

      const boost::chrono::microseconds duration
         = boost::chrono::duration_cast< boost::chrono::microseconds >(timer.elapsed());

      return duration;
   } // AVX
} // namespace Test

How fast is clang with no flags like -ffast-math or -Ofast, which trade correctness for speed, just -O3? That would be a fairer comparison, since both clang and rustc use LLVM as the backend.

c++  -DBOOST_FILESYSTEM_VERSION=3 -DUSING_ISPC -I/opt/local/include -I/Users/josephwinston/src/Prototypes/ScaleAndShift/include -I/Users/josephwinston/src/Prototypes/ScaleAndShift -I/Users/josephwinston/src/Prototypes/ScaleAndShift/x86_64-apple-darwin17.7.0/demo/Benchmark  -mavx -std=c++0x -Wall -Wextra -Wreorder -Wno-unused-parameter -Wno-long-long -Wno-trigraphs -mavx -O3 -DNDEBUG -Wuninitialized -arch x86_64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk   -o CMakeFiles/Benchmark.dir/main.cpp.o -c /Users/josephwinston/src/Prototypes/ScaleAndShift/demo/Benchmark/main.cpp
c++  -mavx -std=c++0x -Wall -Wextra -Wreorder -Wno-unused-parameter -Wno-long-long -Wno-trigraphs -mavx -O3 -DNDEBUG -Wuninitialized -arch x86_64 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -Wl,-search_paths_first -Wl,-headerpad_max_install_names  CMakeFiles/Benchmark.dir/main.cpp.o scaleAndShift_ispc.o scaleAndShift_ispc_sse2.o scaleAndShift_ispc_sse4.o scaleAndShift_ispc_avx.o  -o ../../bin/Benchmark.app/Contents/MacOS/Benchmark  /opt/local/lib/libboost_chrono-mt.dylib /opt/local/lib/libboost_system-mt.dylib
   void
   simpleC(const size_t NumberOfElements,
           const float *input,
           float *output,
           const float nullValue,
           const float nullReplacement,
           const float minValue,
           const float maxValue,
           const float scale,
           const float shift,
           const bool verbose)
   {
      //
      // assume sizes of input and output are the same
      //
      
      for (size_t i = 0; i < NumberOfElements; ++i) {
         const float currentValue = input[i];
         float value = currentValue * scale + shift;
         
         if (currentValue == nullValue)
         {
            value = nullReplacement;
         }
         
         if (currentValue < minValue)
         {
            value = minValue;
         }
         
         if (currentValue > maxValue)
         {
            value = maxValue;
         }
         
         output[i] = value;
      }
   }
Mean = 142.06 microseconds per loop
Min = 131 microseconds per loop
Max = 286 microseconds per loop

What now?

I ask for a human readable version of our posters problem and somebody flags my post as "offensive".

I don't understand.

jbw,

Ah, thank you. Let me sleep on it.

How big a problem are we looking at, how many NumberOfElements?

Please don't refer to other people's code as "unintelligible line noise" (or not "human readable") just because it uses coding styles or features that are less familiar to you. Even when meant in fun, some people find this discouraging. Also, it often causes derails into subjective arguments about style, in topics that were originally not about that.


OK. I can sympathize with that.

Sometimes I forget that in this modern world we are supposed to keep our feelings in our pockets and pretend we don't have an opinion for fear of upsetting someone, somewhere.

Or perhaps I just phrase things badly.

The challenge with this code broadly falls into two areas. The first is all of the conditionals: they cause modern x86 processors to stall. That is the reason I included the AVX version, which has no processor stalls. The second is making a simplified version of DAXPY. DAXPY is a constant multiplied by a vector, plus a vector. Here we have a constant multiplied against a vector, plus a constant, as well as the branching.
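
The rewrite y = (x & a) | (~x & b) from the AVX version's comments can be sketched in scalar Rust on f32 bit patterns (illustrative only, not from the thread):

```rust
// Scalar analogue of the per-lane mask produced by _mm256_cmp_ps,
// combined with bitwise and/or instead of a branch.
fn select_bits(cond: bool, a: f32, b: f32) -> f32 {
    // all-ones mask when cond is true, all-zeros otherwise
    let mask: u32 = if cond { u32::MAX } else { 0 };
    f32::from_bits((mask & a.to_bits()) | (!mask & b.to_bits()))
}

fn main() {
    println!("{}", select_bits(true, 1.5, 2.5));  // 1.5
    println!("{}", select_bits(false, 1.5, 2.5)); // 2.5
}
```

On modern x86 this is what a blendv does per lane; the point is that no branch, and hence no possible misprediction stall, is involved.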

I begin to see the problem.

Comparison is a bit tricky around here; I don't have a machine in the house with AVX instructions.

Interestingly enough, the solution posted by @asymmetrikon produces pretty good vectorized code when compiled with -C opt-level=3 -C target-cpu=haswell -- godbolt.

@jbw I know that maybe it is not possible, but I will try to ask anyway: are you able to provide us with a version that compiles on compiler explorer? We are missing the AlignmentAllocator_t, and even if we can imagine what it does, it would help quite a lot to have something more to work with.

I think that it could be interesting to see if using some sort of aligned vector (using this crate for instance) could help the compiler. @vorner also wrote a post pretty recently about trying to "cheat" with SIMD, maybe his crate could help as well.

Obviously, the last resort would be to use explicit AVX intrinsics as you already did in the C++ version, but that is pretty hard to maintain.

EDIT: this version seems to work slightly better because the slice has an alignment of 256 bytes, but it's still pretty far from a clean vectorized code.
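
For reference, a sketch (assumed, not from the thread) of getting a 32-byte-aligned f32 buffer straight from the raw allocator, the same guarantee the C++ side gets from AlignmentAllocator_t< float, 32 >; the aligned-vector crates mentioned above wrap this pattern in a safe Vec-like type.

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

fn main() {
    let n = 1024usize;
    // request n f32s with 32-byte alignment
    let layout = Layout::from_size_align(n * std::mem::size_of::<f32>(), 32).unwrap();
    unsafe {
        let ptr = alloc_zeroed(layout) as *mut f32;
        assert!(!ptr.is_null());
        assert_eq!(ptr as usize % 32, 0); // the requested alignment holds
        // view as a slice; it must not outlive the allocation
        let buffer = std::slice::from_raw_parts_mut(ptr, n);
        buffer[0] = 1.0;
        println!("aligned to 32: {}", ptr as usize % 32 == 0);
        dealloc(ptr as *mut u8, layout);
    }
}
```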