U16 slice to u8 iterator

The significance of the statement that @RustyYato quoted is that std could change the algorithm at any time to one that doesn’t have this property (maximizing the length of the middle slice), and it would not be considered a breaking change: if your unsafe code would exhibit UB in that case, it’s not std’s problem, but yours.

Thank you for clarifying. I have to think more about that. It would seem to render the value of the function very low... and difficult to imagine how to use: I'm not sure how to use the output of the function without relying on the middle being as requested.

I thought the only thing that might change was the size of each block. If the function gave me the prefix where I was expecting the middle, the empty slots would likely be a problem.

I base this conclusion on the idea that changes to the algorithm might change how much gets allocated to each bucket (each block), but that each block will keep the advertised layout. It takes time to maximize the size of the middle chunk; the middle is the preferred bucket because it allows more processing with every tick. A different algorithm might be faster overall by putting less into the preferred, and subsequently faster, middle block.

That change in the performance dynamics makes sense in the future (or depending on the specific trade-offs). But changing which data ends up in which chunk doesn't seem useful.

The purpose of the function is to optimize a loop over Ts when you can take advantage of loading multiple of them at a time with high alignment if that's available.

In such a situation, you always need a loop over the prefix and suffix anyway to correctly handle them, since the assumption is that you don't know the alignment of the input data. So if the middle happened to be empty, well, it might run slower than you expected, but the correct prefix and suffix handling will mean that the correct result will be computed anyway.

Thanks all for your valuable inputs.

To summarize, so people can find the best solution in this thread:

/// Copy a u16 slice into a u8 slice using native endianness.
/// Stops when either the u16s or the u8 pairs run out.
pub fn u16_to_u8_cpy(bytes: &[u16], my_bytes: &mut [u8]) {
    bytes.iter()
        .zip(my_bytes.chunks_exact_mut(2))
        .for_each(|(a, b)| b.copy_from_slice(&a.to_ne_bytes()));
}

This generates the following assembly (with -C opt-level=3):

example::u16_to_u8_cpy:
        push    rax
        shr     rcx
        cmp     rcx, rsi
        cmovae  rcx, rsi
        test    rcx, rcx
        je      .LBB0_2
        mov     rax, rdi
        add     rcx, rcx
        mov     rdi, rdx
        mov     rsi, rax
        mov     rdx, rcx
        call    qword ptr [rip + memcpy@GOTPCREL]
.LBB0_2:
        pop     rax
        ret

It optimizes this way starting from Rust 1.36.
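
As a quick sanity check, a caller might look like this (the printed byte order assumes a little-endian target):

fn main() {
    let src: [u16; 3] = [0x0102, 0x0304, 0x0506];
    let mut dst = [0u8; 6];
    u16_to_u8_cpy(&src, &mut dst);
    // prints [2, 1, 4, 3, 6, 5] on little-endian targets
    println!("{:?}", dst);
}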

Exactly. I wish I had described it the way you did, as that's my understanding too. The point I was trying to make is that the function that processed the middle chunk (my function) would likely fail if it received the prefix or suffix. It was designed to process larger swaths of memory; gaps created by feeding it smaller chunks would be problematic (not UB, just wrong).

So, I would find it difficult to believe the docs for align_to leave open the possibility that, at some later point in time, the function I just described should anticipate anything other than the middle, ever.

There won't be gaps -- the result will always cover the entire input -- the middle just might not be as long as you wished.
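
A minimal sketch of that invariant (the assertion is guaranteed to hold; the individual lengths are not):

let data = [0u8; 13];
// SAFETY: u32 supports any bit pattern, u8 has no padding.
let (prefix, middle, suffix) = unsafe { data.align_to::<u32>() };
// The three slices always cover the whole input, in order, and `middle`
// is correctly aligned for u32 -- but how the 13 bytes are distributed
// among them is unspecified.
assert_eq!(
    prefix.len() + middle.len() * std::mem::size_of::<u32>() + suffix.len(),
    data.len()
);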

Here's an example of what I understand to be a correct use: Rust Playground

fn u32_contains_zero_byte(v: u32) -> bool {
    // https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
    0 != !((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F)
}

pub fn contains_zero_byte(x: &[u8]) -> bool {
    let (prefix, middle, suffix) =
        // SAFETY: `u32` can support any bit pattern, `u8` contains no padding.
        unsafe { x.align_to::<u32>() };
    prefix.iter().copied().any(|x| x == 0)
        || middle.iter().copied().any(u32_contains_zero_byte)
        || suffix.iter().copied().any(|x| x == 0)
}

That will be better able to take advantage of bigger aligned reads when the middle is longer, but if the prefix or suffix is longer than expected, it just naturally works -- and there's no extra code to handle those cases, because it's always possible for the prefix or suffix (or both) to be non-empty (such as a 6-element slice where the first is at 0xXXXXXXX3), so that code is needed anyway.
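
To see all three slices non-empty in practice, one can force misalignment by slicing at an odd offset (a sketch: the actual split depends on the buffer's runtime address, so the printed lengths are illustrative, not guaranteed):

let buf = [1u8; 16];
let six = &buf[3..9]; // 6 bytes, very likely not 4-byte aligned
let (p, m, s) = unsafe { six.align_to::<u32>() };
// With the current algorithm and a 4-aligned `buf`, this prints
// prefix=1 middle=1 suffix=1.
println!("prefix={} middle={} suffix={}", p.len(), m.len(), s.len());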

@scottmcm You have me doubting myself. Perhaps it has to do with the specifics of my implementation.

Here is where I believe the logic in the code depends on the middle chunk returned by align_to being 128 bits. Anything less "is not part of the plan" :))

...
    // Mmap::align_to
    //
    // prefix: &[u8] ... aka head_u8
    // shorts: &[__m128] ... aka body_vectors
    // suffix: &[u8] with max length of 15 (not guaranteed but how it plays out now)... aka tail_u8
    //
    let (head_u8, body_vectors, tail_u8) =
        unsafe { bytes.align_to::<__m128>() };

    ...

    while simdinput_cnt <= iter_cnt {
        // load a 64-byte slice of the data into the registers
        let input = unsafe {
            SimdInput::new(body_vectors.get_unchecked(simdinput_cnt as usize..))
        };

        #[cfg(debug_assertions)]
        input.show();

        // transform the 64-bytes -> 64-bit structure
        input.structure(&mut set_bits, &mut inside_str);
        SimdInput::crush_set_bits(
            &mut struct_acc,
            set_bits,
            codepoint_cnt,
            &mut array_idx,
        );

        #[cfg(debug_assertions)]
        {
            println!(
                "🟢 simdinput_cnt: {} of len: {}",
                simdinput_cnt, iter_cnt
            );
            input.show();
        }

        simdinput_cnt += INPUT_LENGTH; // simdinput => raw input
        codepoint_cnt += 64; // codepoint => setbits
    }

...
    // Maintain continuity of the memory whilst padding to SimdInput
    // load the remaining 128
    // load tail (0-15 x u8)
    let padded_input = unsafe {
        SimdInput::new_with_padding(
            body_vectors.get_unchecked(simdinput_cnt as usize..),
            tail_u8,
        )
    };

    // reset the set_bits b/c the logic relies on any unused
    // memory being set to zero.
    set_bits = 0;
    padded_input.structure(&mut set_bits, &mut inside_str);
    SimdInput::crush_set_bits(
        &mut struct_acc,
        set_bits,
        codepoint_cnt,
        &mut array_idx,
    );

    #[cfg(debug_assertions)]
    println!(
        "\n📋\ntail_u8 len: {}\n{}",
        tail_u8.len(),
        ByteReport::_u8_as_str(tail_u8)
    );

    // 🎉 The index result!
    #[cfg(debug_assertions)]
    {
        println!("🎉 index:\n{:?}", struct_acc);
        println!("len: {:?}", struct_acc.len());
    }
    StructureIndex(cast_vec(struct_acc))

    /// padding in the number of 128-bit vectors
    pub(crate) fn new_with_padding(ptr: &[__m128], tail: &[u8]) -> Self {
        let load = ptr.len();
        println!("---------------------------------------------------------------------------------");
        println!("Last iteration");
        println!("Number of vectors to load {}", load);
        println!("The tail len: {}", tail.len());
        println!("---------------------------------------------------------------------------------\n");

        assert!(
            load < 4,
            "The SSE/AVX simdinput is not being initialized correctly"
        );
        assert!(
            tail.len() < 16,
            "The SSE/AVX tail_u8 is not being initialized correctly"
        );

        let mut padded_tail: [u8; 16] = [0; 16];
        tail.iter()
            .enumerate()
            .for_each(|(i, value_u8)| padded_tail[i] = *value_u8);

        println!("The now padded tail: {:?}", &padded_tail);

        unsafe {
            let padded_tail = padded_tail.get_unchecked(0..).as_ptr();

            match load {
                3 => Self {
                    v0: _mm_load_si128(ptr.as_ptr() as *const __m128i),
                    v1: _mm_load_si128(ptr.as_ptr().add(1) as *const __m128i),
                    v2: _mm_load_si128(ptr.as_ptr().add(2) as *const __m128i),
                    v3: _mm_loadu_si128(padded_tail as *const __m128i),
                },
                2 => Self {
                    v0: _mm_load_si128(ptr.as_ptr() as *const __m128i),
                    v1: _mm_load_si128(ptr.as_ptr().add(1) as *const __m128i),
                    v2: _mm_loadu_si128(padded_tail as *const __m128i),
                    v3: _mm_setzero_si128(),
                },
                1 => Self {
                    v0: _mm_load_si128(ptr.as_ptr() as *const __m128i),
                    v1: _mm_loadu_si128(padded_tail as *const __m128i),
                    v2: _mm_setzero_si128(),
                    v3: _mm_setzero_si128(),
                },
                0 => Self {
                    v0: _mm_loadu_si128(padded_tail as *const __m128i),
                    v1: _mm_setzero_si128(),
                    v2: _mm_setzero_si128(),
                    v3: _mm_setzero_si128(),
                },
                _ => panic!(
                    "The SSE/AVX simdinput is not being initialized correctly"
                ),
            }
        }
    }

I don't see you using head_u8 or tail_u8 in there? That's almost certainly wrong.

align_to doesn't actually guarantee that the suffix has a max length of 15

You are really good at catching me when I go astray and risk creating confusion. In practice, in this instance, the algorithm would "zero out" the third chunk at 16. That makes sense given what I was specifying for the minimum size of the middle chunk: 128 bits. So the 16 x 8 = 128 -> zero-suffix arithmetic was useful... also, I had to juggle keeping track of bits vs bytes; it was a useful reminder of how u8 relates to 128 when my eyes and brain start to fry while jumping around the code base.

To be clear, and to the essence of your point: there is no guarantee that this will be the case. Should the std lib choose to switch the algorithm, it is within the contract to allocate any number of u8 to either the prefix or suffix.

I think Miri used to always return all the data in the first slice.

You are correct. I updated the code to include what I did with the tail. A couple of notes. (1) To @SkiFire13's point about the align_to guarantee, at the very least I run a runtime assertion making sure the size does not go higher than what I expect. (2) The code does not process the prefix. In practice, it was not necessary: the files are validated to ensure they meet the ASCII spec. The way I was thinking about it, that means I have multiples of u8, and given 128, any "remainder" will be in the suffix; anything below u8 would be in the prefix, so "not to worry" - a convenient and somewhat lazy conclusion on my part. I'll update the code to cover these bases in a more complete and useful manner.
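
For reference, the runtime guard mentioned in (1) might look something like this (a sketch reusing the names from the snippet above, not the exact code):

let (head_u8, body_vectors, tail_u8) = unsafe { bytes.align_to::<__m128>() };
// Fail fast if align_to's split ever differs from what the SIMD loops assume.
assert!(head_u8.is_empty(), "unhandled prefix of {} bytes", head_u8.len());
assert!(tail_u8.len() < 16, "suffix longer than one 128-bit vector");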

I took another look at the docs for Mmap::align_to and docs for slice::align_to:

This method splits the slice into three distinct slices: prefix, correctly aligned middle slice of a new type, and the suffix slice. The middle slice will have the greatest length possible for a given type and input slice.

... are we talking about the same align_to? The docs for std::alloc::Layout::align_to are a different animal.

That actually comes from its implementation of Deref<Target = [u8]>, so they're the same method. The documentation differs because it's quite old: the last build of the mmap crate was done in June 2018, before align_to was stabilized. When align_to was stabilized in September 2018, its documentation was changed to warn that the middle slice may not have the greatest length possible.
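
In other words, the call auto-derefs to the slice method. Making the deref explicit shows where the method actually lives (mmap here stands for any value whose type implements Deref<Target = [u8]>):

let bytes: &[u8] = &mmap; // explicit deref to the underlying byte slice
// SAFETY: same argument as before -- u32 tolerates any bit pattern.
let (p, m, s) = unsafe { bytes.align_to::<u32>() }; // <[u8]>::align_to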
