Faster way to read bytes from a bitstream

    pub fn read_n_bytes(&mut self, n: usize, input: &[u8], out: &mut [u8]) {
        let left_mask = 0b00001111;
        let right_mask = 0b11110000;
        for i in 0..n {
            let left_part = input[i] & left_mask;
            let right_part = input[i + 1] & right_mask;
            out[i] = left_part | right_part;
        }
    }

The function reads bytes from a bitstream. The bytes are "unaligned" in the bitstream, ie our wanted byte will have some number of bits from the left byte and the rest from the right byte. For simplicity lets just say the bytes are always "4 bits" off. so each byte is 4 bits from first byte and 4 bits from second byte. My question is can we do better than this? To me this sounds like a great fit for simd but all my attempts are much slower.

other info:

  • Typical read is 100-1000 bytes.
  • I cannot change the format.
  • removing temp variables seems to have no effect. Same with iterators and unsafe.

Doing it in the idiomatic way (windows() + no explicit length) and asserting the length of the input before zipping it with the output results in autovectorized code and no bounds checks/panics in the loop body.

5 Likes

In such loops in Rust temporary variables cost nothing, but indexing with [] is very expensive.

Especially [i + 1] destroys performance, because LLVM can't prove it won't overflow, so it can't remove bounds checks. When bounds checks exists, the expression has side effects which forbids almost all optimizations, because any smarter version of the loop wouldn't panic exactly the same way, or wouldn't leave memory in exactly the same state while panicking.

BTW: when writing hot loops, use:

3 Likes

Thanks for your example, this is exactly what I though should be faster but when I try to benchmark it seems to be slower. Maybe there is some issue with how I'm measuring things?

This generates nice assembly, but needs additional work to handle edge cases (first byte, remainder, and last half-byte of u64, and I probably got endian wrong…)

    out.iter_mut().zip(
        input.chunks_exact(8)
            .filter_map(|c| c.try_into().ok())
            .map(|c| u64::from_ne_bytes(c) << 4)
            .flat_map(|c| c.to_ne_bytes()))
    .for_each(|(out, inp)| {*out = inp})

but anyway, the core idea is that chunks_exact + u64::from_?e_bytes optimize well, and you can take it from there to process 8 bytes at once (LLVM will even unroll it to use more registers and read 32 bytes at once)

1 Like

Don't do DIY benchmarks with Instant, they will be too noisy. Use criterion or nightly bencher. Or just check if assembly looks reasonable in cargo-show-asm or godbolt.

4 Likes

I'm not good enough to look at assembly and determine how fast it will be. So benchmarking is my only option.

fn criterion_benchmark(c: &mut Criterion) {
    let n = 10000;
    let input = vec![0_u8; n + 1];
    let mut out = vec![0_u8; n];
    black_box(&input);
    c.bench_function("old ", |b| {
        b.iter(|| black_box(read_n_bytes(&input, &mut out)));
    });
    c.bench_function("new ", |b| {
        b.iter(|| black_box(read_n_bytes_forum(&input, &mut out)));
    });
    c.bench_function("new + u64", |b| {
        b.iter(|| black_box(read_n_bytes_forum2(&input, &mut out)));
    });
}

Gives me:

old
time: [191.38 ns 191.49 ns 191.65 ns]
change: [+1.7023% +2.2877% +2.7720%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
3 (3.00%) high mild
1 (1.00%) high severe

new
time: [4.4196 µs 4.4219 µs 4.4244 µs]
change: [+2.7719% +3.6283% +4.5932%] (p = 0.00 < 0.05)
Performance has regressed.
Found 11 outliers among 100 measurements (11.00%)
2 (2.00%) high mild
9 (9.00%) high severe

new + u64
time: [11.542 µs 11.546 µs 11.550 µs]
change: [+6.7497% +6.9318% +7.0704%] (p = 0.00 < 0.05)
Performance has regressed.

What am I doing wrong?

I'm sorry if I'm completely wrong here, as stated I'm not very good at assembly.

My suspicion is that the original already produced ~optimal assembly.

The hot loop in the generated original ASM:

.LBB0_8:
movups  xmm2, xmmword ptr [rdi + r9]  // load input[idx..idx+16]
andps   xmm2, xmm0 // logical and
movups  xmm3, xmmword ptr [rdi + r9 + 1] // load input[idx+1..idx+1+16]
andps   xmm3, xmm1 // logical and
orps    xmm3, xmm2 // logical or
movups  xmmword ptr [rdx + r9], xmm3 // output[idx..idx+16] = xmm3[..]
add     r9, 16 // idx += 16
cmp     rax, r9  // if idx == n(or maybe do last loop without simd)  {exit loop}
jne     .LBB0_8 // if idx == n(or maybe, do last loop without simd)  {exit loop}

Seems quite 1:1 to the rust code but with simd operations.

I'm a bit surprised that the idiomatic does so much worse so maybe there is something else going on?

If i understand correctly this is the hot loop in the idiomatic solution:

movzx   r8d, word ptr [rdi + rsi + 7]
movd    xmm3, r8d
movzx   r8d, word ptr [rdi + rsi + 6]
movd    xmm4, r8d
punpcklwd       xmm4, xmm3
movzx   r8d, word ptr [rdi + rsi + 5]
movd    xmm3, r8d
movzx   r8d, word ptr [rdi + rsi + 4]
movd    xmm5, r8d
punpcklwd       xmm5, xmm3
punpckldq       xmm5, xmm4
movzx   r8d, word ptr [rdi + rsi + 3]
movd    xmm3, r8d
movzx   r8d, word ptr [rdi + rsi + 2]
movd    xmm4, r8d
punpcklwd       xmm4, xmm3
movzx   r8d, word ptr [rdi + rsi + 1]
movd    xmm3, r8d
movzx   r8d, word ptr [rdi + rsi]
movd    xmm6, r8d
punpcklwd       xmm6, xmm3
punpckldq       xmm6, xmm4
punpcklqdq      xmm6, xmm5
movdqa  xmm3, xmm6
pand    xmm3, xmm0
packuswb        xmm3, xmm3
pand    xmm3, xmm1
psrlw   xmm6, 8
packuswb        xmm6, xmm6
pand    xmm6, xmm2
por     xmm6, xmm3
movq    qword ptr [rdx + rsi], xmm6
add     rsi, 8
cmp     rax, rsi

There is quite a lot going on compared to the other loop.

I did not know this was a thing. So nightly has something that's more like criterion built-in?

Try generating assembly with -Ctarget-cpu=znver4. That has AVX512 which has more straightforward ops that are easier to understand. If it spews a whole bunch of vpinsrb instructions that's usually an indication of very poor vectorization, effectively operating on 1 byte at a time which is no better than SISD code. Autovectorization is fickle and needs to be golfed into good shape.

My suspicion is that the original already produced ~optimal assembly.

Yeah that already looks good.

1 Like

It’s a baby criterion. Like many things in Rust, it has died from the perfect-is-enemy-of-good syndrome, but still enough of it exists to break CLI arguments of criterion… :confused:

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.