Auto-vectorize sum with mask

Hi,

I am trying to auto-vectorize the following masked sum, but I am struggling to achieve the performance I would expect for this operation.

The slow function is simd_nonnull_sum: the inner loop over the validity mask is not being auto-vectorized well, and I observe a 4x-5x performance degradation compared with the non-masked sum (also below). I don't observe this difference when I use packed_simd (which requires nightly).

For completeness, I present the whole benchmark setup.

Any ideas how this can be further improved / help the compiler?

use std::convert::TryInto;

use criterion::{criterion_group, criterion_main, Criterion};

const LANES: usize = 16;

static U16_MASK: [u16; 16] = [
    1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768,
];

fn simd_nonnull_sum(values: &[f32], bitmap: &[u8]) -> f32 {
    assert!(values.len() <= bitmap.len() * 8);

    let mut chunks = values.chunks_exact(LANES);
    let mut validity_chunks = bitmap.chunks_exact(2); // 16 bits = LANES

    let sum = chunks.by_ref().zip(validity_chunks.by_ref()).fold(
        [0.0f32; LANES],
        |mut acc, (chunk, validity_chunk)| {
            // select
            let chunk: [f32; LANES] = chunk.try_into().unwrap();
            let validity_chunk = u16::from_ne_bytes(validity_chunk.try_into().unwrap());
            let mut selected_chunk = [0.0f32; LANES];
            (0..LANES).for_each(|i| {
                selected_chunk[i] = if validity_chunk & U16_MASK[i] != 0 {
                    chunk[i]
                } else {
                    0.0
                }
            });

            // sum
            (0..LANES).for_each(|i| {
                acc[i] += selected_chunk[i];
            });
            acc
        },
    );

    let mut reduced = 0.0f32;
    (0..LANES).for_each(|i| {
        reduced += sum[i];  // this is NOT right as it needs the mask, but ignore. It is easily fixable.
    });
    reduced
}

fn simd_sum(values: &[f32]) -> f32 {
    let chunks = values.chunks_exact(LANES);
    let remainder = chunks.remainder();

    let sum = chunks.fold([0.0f32; LANES], |mut acc, chunk| {
        let chunk: [f32; LANES] = chunk.try_into().unwrap();
        (0..LANES).for_each(|i| {
            acc[i] += chunk[i];
        });
        acc
    });

    let remainder: f32 = remainder.iter().copied().sum();

    let mut reduced = 0.0f32;
    (0..LANES).for_each(|i| {
        reduced += sum[i];
    });
    reduced + remainder
}

fn naive_sum(values: &[f32]) -> f32 {
    values.iter().sum()
}

fn add_benchmark(c: &mut Criterion) {
    let values = (0..513 * 7).map(|x| x as f32).collect::<Vec<_>>();

    c.bench_function("simd_sum", |b| b.iter(|| simd_sum(&values)));

    c.bench_function("naive_sum", |b| b.iter(|| naive_sum(&values)));

    let mask = (0..513 * 7).map(|x| x as u8).collect::<Vec<_>>();
    c.bench_function("simd_nonnull_sum", |b| {
        b.iter(|| simd_nonnull_sum(&values, &mask))
    });
}

criterion_group!(benches, add_benchmark);
criterion_main!(benches);

Can you run

perf record -e branch-instructions,branch-misses,cache-misses,cache-references,cpu-cycles,instructions,ref-cycles,stalled-cycles-backend,stalled-cycles-frontend,alignment-faults,bpf-output,context-switches,cpu-clock,major-faults,minor-faults,page-faults,task-clock,duration_time,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-prefetch-misses,L1-dcache-prefetches,L1-dcache-store-misses,L1-dcache-stores,L1-icache-load-misses,L1-icache-loads,LLC-load-misses,LLC-loads,LLC-prefetch-misses,LLC-prefetches,LLC-store-misses,LLC-stores,branch-load-misses,branch-loads,dTLB-load-misses,dTLB-loads,dTLB-store-misses,dTLB-stores,iTLB-load-misses,iTLB-loads cargo test -- --nocapture test_sgemm

followed by

perf report

and dump the assembly instruction code of the functions being evaluated ?


I always avoid indexing operations when iterating over collections, because the optimizer may fail to see the correlation between 0..LANES and the type [_; LANES] in more complex code, which may be the case in your branching variant. You should take a look at the assembly output and make sure no bounds checks are generated. If you see inappropriate bounds checks, I recommend using Iterator::zip.

Example:

(0..LANES).for_each(|i| {
    selected_chunk[i] = if validity_chunk & U16_MASK[i] != 0 {
        chunk[i]
    } else {
        0.0
    }
});

would become:

selected_chunk
    .iter_mut()
    .zip(U16_MASK.iter())
    .zip(chunk.iter())
    .for_each(|((selected_val, &mask), &val)| {
        *selected_val = if validity_chunk & mask != 0 {
            val
        } else {
            0.0
        }
    });

(The iter_mut()/iter() calls are needed when zipping, since arrays don't implement zip directly; I haven't benchmarked this.)


Thanks a lot @zeroexcuses.

What I did so far:

cargo build --release --example bla
sudo perf record target/release/examples/bla
sudo perf report

and that brings up a user interface to explore perf.data. I was unable to use all those events, as many of them are not available. Is there a way to request "all supported events"? How can I dump the instructions from there?

  1. Profiling rust code: cpu bound? l1 data cache miss? l1 instr cache miss? - #4 by JosephOsborn from @JosephOsborn has two really good links on using perf.

  2. My usage of perf is fairly primitive. I pretty much only use perf record, perf stat, perf report. The only events I really look at are (1) instructions executed and (2) L1 cache misses. The UI should sort the functions by decreasing order of % of selected event.

  3. (I don't know if this is good or bad practice). I actually set opt_level = 3 for both test and bench profiles. Then I run stuff via cargo test. This has the benefit that it keeps the debug symbols, so when perf report displays the assembly code, it inlines the rust code (not always 100% accurate, but quite helpful to sorta match rust to assembly).

  4. The main thing for dumping the instructions is that (1) regardless of what event you sort on, it should sort the functions by % of that event they got, and (2) if you hit Enter, it should jump you to the function, which shows the x86 assembly.
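For a non-interactive dump, something along these lines should work; perf annotate's --stdio and --symbol flags print the annotated assembly directly, though the symbol name below is just a hypothetical example:

```shell
# Record a release build (keep debug symbols so perf can inline Rust source)
cargo build --release --example bla
perf record --output=perf.data target/release/examples/bla

# Interactive report: hitting Enter on a function shows its assembly
perf report --input=perf.data

# Or dump the annotated assembly of one function to stdout;
# replace the symbol name with the one perf report shows for your function
perf annotate --stdio --input=perf.data --symbol=bla::simd_nonnull_sum
```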


One more thing: I'm not sure if sudo is necessary. I'm running on Fedora with fairly strict security settings, and all my calls to perf are as regular user.

I would probably write a version with unsafe blocks and unchecked indexing. Not because that's necessarily the final code, but it's important as part of development of this to get an understanding of what can be achieved, to be able to work with the best possible case for the compiler. It can inform you what the shape of the code needs to be (and which operations need safe wrappers).

You never need unchecked indexing if all you do is iterate over a slice.

Unchecked (unsafe) indexing makes sense if you need random access and you can prove it won't be out-of-bounds, e.g. trying to keep a reference to an element of a Vec while retaining the ability to push to it. That's a case where you can neither take a regular reference nor a raw pointer, because pushing may reallocate and thus invalidate any previous references/pointers, as represented by the method taking &mut self. However, pushing an element at the end will not cause previous elements to be reordered or removed (relative to the start of the list), i.e. if you have checked the access to an existing element before pushing, you can keep the index around and do an unchecked access after pushing.


If the U16_MASK values are always powers of 2 you can replace the following lines:

selected_chunk[i] = if validity_chunk & U16_MASK[i] != 0 {
  chunk[i]
} else {
   0.0
}

with one of these 3 possibilities (they seem to produce the same assembly code):

selected_chunk[i] = ((validity_chunk & U16_MASK[i]) / U16_MASK[i]) as f32 * chunk[i]; // 1
// selected_chunk[i] = ((validity_chunk & (1 << i)) >> i) as f32 * chunk[i];          // 2
// selected_chunk[i] = ((validity_chunk & U16_MASK[i]) >> i) as f32 * chunk[i];       // 3

If U16_MASK[i] is a power of 2 it has exactly one bit set, so (validity_chunk & U16_MASK[i]) is either 0 or U16_MASK[i], and therefore ((validity_chunk & U16_MASK[i]) / U16_MASK[i]) is 0 or 1 (U16_MASK[i] must not be 0).
Finally, ((validity_chunk & U16_MASK[i]) / U16_MASK[i]) as f32 * chunk[i] is equal to either 0.0 or chunk[i].

Another solution is to decompose this loop into two parts (powers of 2 are not necessary in this case):
1. Store the result of validity_chunk & U16_MASK[i] != 0 into an array.
2. Use the computed array to fill selected_chunk.

let mut selected_chunk = [0.0f32; LANES];
let mut choosen_chunk = [0; LANES];
choosen_chunk
    .iter_mut()
    .zip(U16_MASK.iter())
    .for_each(|(choose, mask)| *choose = (validity_chunk & mask != 0) as i32);
selected_chunk
    .iter_mut()
    .zip(chunk.iter())
    .zip(choosen_chunk.iter())
    .for_each(|((sc, &c), &choose)| *sc = c * (choose as f32));

For choosen_chunk I have tried different types: bool, i32 and f32. It seems that i32 is the best.

I couldn't find any solution without casting to f32.
However, casting bool to i32 will produce 0 or 1 according to the rustdoc. Casting 0i32 to 0.0f32 or 1i32 to 1.0f32 seems ok according to the Rustonomicon's section on casts ("casting from an integer to float will produce the floating point representation of the integer, rounded if necessary (rounding to nearest, ties to even)"), since 0.0 and 1.0 are exactly representable floating point numbers (see the example section of Wikipedia). But I am not sure if multiplying a floating point number x (which is not NaN, sNaN, -inf or +inf) by 1.0 always returns x.