Performance impact of bounds checking slices

I have to process binary data read over the network or from files. I’ve been using the byteorder crate and slices to extract fields from protocol messages, with lines a bit like

let x = read_u16(&buf[10..12]);
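(For context, `read_u16` here is a small helper; with the byteorder crate it would be something like `BigEndian::read_u16`. A std-only equivalent, assuming big-endian network byte order, might look like:)

```rust
// Assumed helper: reads a big-endian u16 from the first two bytes of the slice.
// (With the byteorder crate this would be BigEndian::read_u16.)
fn read_u16(buf: &[u8]) -> u16 {
    u16::from_be_bytes([buf[0], buf[1]])
}
```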

Is this a good way to process binary protocols?

I was using godbolt and in general it looks like this optimizes nicely to very efficient code. One Rust feature which will help catch bugs in my code is bounds checking on slices. I like this, but wondered how clever the optimizer is. I observed that if you make different slices from the same buffer, the order in which you do it affects the efficiency of the generated code. In the code below, plus1 does two bounds checks and is less efficient than plus2. Introducing a dummy variable also makes the code more efficient, as shown in plus3.

Are there any suggestions or best practices about how I should write my code so that it optimizes well?

If I have a lot of slices into a buffer, it would be really nice to pay for only one bounds check. This may happen anyway, since real code would likely check the message length directly before slicing into the buffer.

https://godbolt.org/z/rxDt59

pub fn plus1(buf: &[u8]) -> u16 {
    let x = read_u16(&buf[0..2]);
    let y = read_u16(&buf[2..4]);

    x + y
}

pub fn plus2(buf: &[u8]) -> u16 {
    let y = read_u16(&buf[2..4]);
    let x = read_u16(&buf[0..2]);

    x + y
}

pub fn plus3(buf: &[u8]) -> u16 {
    let _dummy = buf[3];
    let x = read_u16(&buf[0..2]);
    let y = read_u16(&buf[2..4]);

    x + y
}
example::plus1:
        push    rax
        cmp     rsi, 1
        jbe     .LBB1_3
        cmp     rsi, 3
        jbe     .LBB1_4
        movzx   eax, word ptr [rdi + 2]
        add     ax, word ptr [rdi]
        pop     rcx
        ret
.LBB1_3:
        mov     edi, 2
        call    qword ptr [rip + _ZN4core5slice20slice_index_len_fail17hf626520261f048feE@GOTPCREL]
        ud2
.LBB1_4:
        mov     edi, 4
        call    qword ptr [rip + _ZN4core5slice20slice_index_len_fail17hf626520261f048feE@GOTPCREL]
        ud2

example::plus2:
        push    rax
        cmp     rsi, 3
        jbe     .LBB2_1
        movzx   eax, word ptr [rdi]
        add     ax, word ptr [rdi + 2]
        pop     rcx
        ret
.LBB2_1:
        mov     edi, 4
        call    qword ptr [rip + _ZN4core5slice20slice_index_len_fail17hf626520261f048feE@GOTPCREL]
        ud2

example::plus3:
        push    rax
        cmp     rsi, 4
        jb      .LBB3_2
        movzx   eax, word ptr [rdi]
        add     ax, word ptr [rdi + 2]
        pop     rcx
        ret
.LBB3_2:
        mov     rdx, rsi
        lea     rdi, [rip + .L__unnamed_1]
        mov     esi, 3
        call    qword ptr [rip + _ZN4core9panicking18panic_bounds_check17h273e49a380d01fb7E@GOTPCREL]
        ud2

The differences in the generated code are actually related to error reporting. If you access up to index 2 first, the code must be able to report whether that access is a problem (e.g. index 2 with length 1) before it checks whether it can reach index 4, so it needs two comparisons. If you access 0..4 first, there’s no risk of needing to report index 2 as a problem, so a single length check suffices.

If you’re concerned: accessing the last index first, as you do with your dummy variable, can help eliminate the conditional branches. That being said, don’t jump to the conclusion that an additional jbe is going to be a performance problem. Cold conditional branches like this cost very little on modern machines, and I’ve personally never had this sort of code come up high enough in my profiling results that I’ve bothered optimizing it. Measure first!


Put an assert!(buf.len() >= 4) at the beginning of the function.

That’s good practice anyway (pre-verifying your preconditions), and LLVM knows the length has been checked, so it will remove both bounds checks.
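A minimal sketch of that suggestion (the `read_u16` helper is assumed; here it is written std-only, reading big-endian):

```rust
// Assumed helper: reads a big-endian u16 from the first two bytes.
fn read_u16(buf: &[u8]) -> u16 {
    u16::from_be_bytes([buf[0], buf[1]])
}

pub fn plus_checked(buf: &[u8]) -> u16 {
    // One up-front length check; the optimizer can then prove the
    // two slice operations below are in bounds and elide their checks.
    assert!(buf.len() >= 4);
    let x = read_u16(&buf[0..2]);
    let y = read_u16(&buf[2..4]);
    x + y
}
```

The assert also gives a clearer panic message than a raw slice-index failure would, since it states the real precondition (message too short) rather than whichever individual index happened to trip first.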
