Find first non-matching byte in array with SIMD

I am trying to find a fast way to find the first non-space character in an array of UTF-8 string bytes.

It's possible to get the compiler to generate vectorised code for some of this without using explicit SIMD intrinsics. For example, this gets vectorised:

const LANES: usize = 16;

#[repr(C, align(16))]
pub struct Bools([bool; LANES]);

pub fn spaces_mask(bytes: &[u8; LANES]) -> Bools {
    let mut mask = Bools([false; LANES]);
    for i in 0..LANES {
        mask.0[i] = bytes[i] == b' ';
    }
    mask
}

But I cannot find a good way to produce vectorised code for finding the first unset bit in the mask. My first attempt:

pub fn first_non_space(bytes: &[u8; LANES]) -> u32 {
    let mut mask = Bools([false; LANES]);
    for i in 0..LANES {
        mask.0[i] = bytes[i] == b' ';
    }

    let mut bits: u32 = 0;
    for i in 0..LANES {
        if mask.0[i] {
            bits |= 1 << i;
        }
    }
    bits.trailing_ones()
}

Filling up mask is vectorized, but calculating bits isn't.

This is better - it mostly gets vectorized - but it's limited to 16 lanes (so it can't exploit AVX2), and it's still not as efficient as a movemask operation:

#[repr(C, align(16))]
pub struct Bytes([u8; LANES]);

pub fn first_non_space2(bytes: &[u8; LANES]) -> u32 {
    let mut mask = Bytes([0u8; LANES]);
    for i in 0..LANES {
        mask.0[i] = (bytes[i] == b' ') as u8 * 0xFF;
    }

    let u = u128::from_le_bytes(mask.0);
    // Little-endian: byte 0 is least significant, so count ones from bit 0.
    u.trailing_ones() / 8
}

Is there any way to get the compiler to generate a movemask operation without resorting to std::arch?

Both examples here: Compiler Explorer
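For reference, the std::arch formulation I'm trying to avoid would look roughly like this (my sketch; x86_64-only, relying on SSE2 being baseline on that target, with a scalar fallback added just so the snippet compiles everywhere):

```rust
#[cfg(target_arch = "x86_64")]
pub fn first_non_space_sse2(bytes: &[u8; 16]) -> u32 {
    use core::arch::x86_64::*;
    // SAFETY: SSE2 is always available on x86_64, and the unaligned load
    // reads exactly 16 bytes from a valid &[u8; 16].
    unsafe {
        let chunk = _mm_loadu_si128(bytes.as_ptr() as *const __m128i);
        // 0xFF in each byte lane that equals b' ', 0x00 elsewhere.
        let spaces = _mm_cmpeq_epi8(chunk, _mm_set1_epi8(b' ' as i8));
        // Pack the sign bit of each lane into the low 16 bits.
        let mask = _mm_movemask_epi8(spaces) as u32;
        // Leading run of spaces = consecutive set bits from bit 0.
        mask.trailing_ones()
    }
}

#[cfg(not(target_arch = "x86_64"))]
pub fn first_non_space_sse2(bytes: &[u8; 16]) -> u32 {
    // Scalar fallback for other architectures.
    bytes.iter().position(|&b| b != b' ').unwrap_or(16) as u32
}
```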

A few more attempts, all of which produce very lengthy ASM with 15 branches:

pub unsafe fn first_non_space3(bytes: &[u8; LANES]) -> usize {
    let start = bytes.as_ptr();
    for i in 0..LANES {
        if start.add(i).read() != b' ' {
            return i;
        }
    }
    LANES
}

pub unsafe fn first_non_space4(bytes: &[u8; LANES]) -> usize {
    let mut p = bytes.as_ptr();
    for i in 0..LANES {
        if p.read() != b' ' {
            return i;
        }
        p = p.offset(1);
    }
    LANES
}

pub fn first_non_space5(bytes: &[u8; LANES]) -> usize {
    let mut it = bytes.iter();
    while it.clone().next() == Some(&b' ') {
        it.next();
    }
    it.as_slice().as_ptr() as usize - bytes.as_ptr() as usize
}

This feels similar-ish to the problem from another thread.

Could you show the entire code? My gut feeling is that you don’t actually need this particular code to be vectorized. That is, it’s ok if finding the index of the first non-whitespace in a lane of 16 bytes is scalar.

What you want to vectorize is rather “is there a non-whitespace in this lane at all?” If this is vectorized, the algorithm becomes:

  • find the lane which contains non-ws, using vectorized code
  • inside this lane, find a specific index using simple scalar loop.
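A minimal sketch of that two-step shape (function names made up; the all-spaces check is the part the autovectorizer can turn into a single compare-and-reduce):

```rust
const CHUNK: usize = 16;

// Step 1: vectorizable "is this whole chunk spaces?" - a single reduction,
// with no data-dependent early exit for LLVM to trip over.
fn chunk_all_spaces(chunk: &[u8; CHUNK]) -> bool {
    let mut all = true;
    for i in 0..CHUNK {
        all &= chunk[i] == b' ';
    }
    all
}

// Step 2: scalar scan only inside the chunk known to hold the answer.
pub fn first_non_space_two_step(bytes: &[u8]) -> usize {
    let mut offset = 0;
    for chunk in bytes.chunks_exact(CHUNK) {
        let chunk: &[u8; CHUNK] = chunk.try_into().unwrap();
        if !chunk_all_spaces(chunk) {
            return offset + chunk.iter().position(|&b| b != b' ').unwrap();
        }
        offset += CHUNK;
    }
    // Scalar scan over the remaining tail (fewer than CHUNK bytes).
    offset
        + bytes[offset..]
            .iter()
            .position(|&b| b != b' ')
            .unwrap_or(bytes.len() - offset)
}
```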

Thanks very much for your suggestion.

This is for a JavaScript parser (OXC). The hope is to speed up the lexer by consuming sequences of bytes in blocks.

Some actual code:

pub fn consume_spaces(bytes: &mut std::slice::Iter<u8>) {
    while bytes.clone().next() == Some(&b' ') {
        bytes.next();
    }
}

pub fn consume_whitespace(bytes: &mut std::slice::Iter<u8>) {
    while matches!(bytes.clone().next(), Some(&b' ' | &b'\t' | &b'\n' | &b'\r')) {
        bytes.next();
    }
}

pub fn consume_typical_identifier(bytes: &mut std::slice::Iter<u8>) {
    while let Some(&b) = bytes.clone().next() {
        // Exit if doesn't match a-z or A-Z
        if (b.wrapping_sub(b'A') & 0b11011111) >= 26 {
            break;
        }
        bytes.next();
    }
}
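To show how these are used: the iterator is left positioned at the first byte that doesn't match, so the lexer can keep going from there (repeating consume_whitespace here so the snippet stands alone):

```rust
pub fn consume_whitespace(bytes: &mut std::slice::Iter<u8>) {
    while matches!(bytes.clone().next(), Some(&b' ' | &b'\t' | &b'\n' | &b'\r')) {
        bytes.next();
    }
}

// Skips the leading whitespace and returns the rest of the input.
fn demo() -> &'static [u8] {
    let src: &[u8] = b"  \t\nfoo bar";
    let mut it = src.iter();
    consume_whitespace(&mut it);
    it.as_slice()
}
```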

The problem is that these functions mostly consume 16 bytes or fewer, as typical JS code tends not to be deeply indented, and identifiers are often quite short. With a lane width of 32, a match on the first chunk would happen 90% of the time.

So, for this use case, "find the lane which contains non-ws" will very often conclude "the first lane". So I feel like the 2-step process won't have as much impact as it might where you're chewing through large stretches of bytes without a match.

I gather from reading the code for memchr that using movemask is the most efficient approach. I was just hoping there might be a way to get at least close to memchr's performance without having to write specialised code for each CPU architecture using core::arch::x86_64 etc.


Ah, yeah, these are indeed two different problems:

  • finding a rare character in huuuge string
  • quickly finding a common character in a small string

Indeed, you probably want some very specific SIMD instructions here, and something like portable-simd would be the best bet, but that's unstable :frowning:


Thanks for coming back.

Oh damn! Unfortunately OXC has a stable-only policy, so portable SIMD is not an available option.

It'd be nice if there were some formulation of "normal" Rust code which the compiler could recognise as a search, and turn into a movemask operation itself - in the same way that, as your article points out, you can encourage the compiler to use SIMD instructions in other cases by using certain patterns.

I'm surprised there isn't, because I assume "find the first something" is a fairly common case, and such an auto-optimization could benefit a lot of code.

Here's how memchr looks in portable-simd, which should probably be reasonably easy to convert to memnotchr as you need here: https://rust-lang.zulipchat.com/#narrow/stream/257879-project-portable-simd/topic/memchr/near/363116955

The reason this optimization doesn't happen in LLVM is that for LLVM it's very important that the loop have a predictable trip count, and when there's two exits -- particularly when one is content-dependent, like here -- it doesn't know that. That's made worse because LLVM -- where this optimization would have to happen, realistically -- is made mostly for C++ code. And C++ code generally can't over-read past the pointers that it's definitely going to read in the code anyway, since that might be UB. So since the obvious way to phrase this in Rust is a way that doesn't read the data after the found position, it's probably illegal for LLVM to actually produce the SIMD-optimized form of it in many cases.
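To make the over-read point concrete (my sketch, illustrating the shapes rather than claiming either vectorizes): the early-exit formulation never reads past the found position, so LLVM can't legally widen the loads, while a branch-free formulation unconditionally touches all 16 bytes, which is the precondition for wide loads to even be on the table:

```rust
// Early exit: stops reading at the first non-space. Widening this to a
// 16-byte load would read memory the source never touches - the transform
// LLVM generally can't prove safe for C++-style code.
pub fn early_exit(bytes: &[u8; 16]) -> usize {
    for i in 0..16 {
        if bytes[i] != b' ' {
            return i;
        }
    }
    16
}

// Branch-free: unconditionally reads every byte, so a single wide load is
// legal (whether LLVM actually emits one is a separate question).
pub fn full_read(bytes: &[u8; 16]) -> usize {
    let mut first = 16;
    // Scan backwards so `first` ends up as the smallest matching index.
    for i in (0..16).rev() {
        if bytes[i] != b' ' {
            first = i;
        }
    }
    first
}
```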

Thanks for coming back @scottmcm. Yes, portable-simd would make this pretty easy. Unfortunately, OXC has a stable-only policy, so portable-simd is not an option in this case.

Your point about LLVM is slightly above my pay grade, but I think I get the general idea. I had hoped a formulation like this might resolve the issue of unpredictable trip count:

pub fn first_non_space6(bytes: &[u8; LANES]) -> u32 {
    let mut mask = Bytes([0u8; LANES]);
    for i in 0..LANES {
        mask.0[i] = (bytes[i] != b' ') as u8 * 0xFF;
    }

    let u = u128::from_le_bytes(mask.0);
    u.trailing_zeros() / 8
}

u128::from_le_bytes(mask.0).trailing_zeros() / 8 is essentially doing exactly what movemask + tzcnt does - extract the sign bits from mask.0 and count zeros.

Using u128 is a hack, but maybe a std function like [u8; N]::leading_zeros() or [bool; N]::leading_false() could implement this optimization when N is a power of 2?
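As a user-space stand-in for that hypothetical std method (the name first_false is invented, and N is fixed at 16 here so the mask fits exactly in a u128), the trick can be wrapped up like this:

```rust
// Hypothetical helper: index of the first false in a 16-element bool mask,
// or 16 if every element is true. Built on the u128 formulation above.
pub fn first_false(mask: [bool; 16]) -> u32 {
    let mut bytes = [0u8; 16];
    for i in 0..16 {
        // 0xFF for true, 0x00 for false - the sign-bit pattern movemask wants.
        bytes[i] = (mask[i] as u8) * 0xFF;
    }
    // Leading run of true lanes = trailing run of all-ones bytes.
    u128::from_le_bytes(bytes).trailing_ones() / 8
}
```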