The significance of the statement that @RustyYato quoted is that std
could change the algorithm at any time to one that doesn’t have this property (maximizing the length of the middle slice), and it would not be considered a breaking change: if your unsafe code would exhibit UB in that case, it’s not std
’s problem, but yours.
Thank you for clarifying. I have to think more about that. It would seem to render the value of the function to “very low”... and difficult to imagine. I’m not sure how to use the output of the function without relying on the middle being as requested.
I thought the only thing that might change is the size of each block. If the function gave me the prefix where I was expecting middle, the empty slots would likely be a problem.
I base this conclusion on the fact that changes to the algorithm might change “how much” gets allocated to each bucket (each block), but that each block will have the advertised layout. It takes time to maximize the size of the middle chunk; it, middle, being the preferred bucket because it allows more processing with every tick. A different algorithm might be faster net net putting less into the preferred and subsequently faster middle block.
That change in performance dynamic makes sense in the future (or depending on the specific trade-offs). But changing where to point to the different chunks doesn’t seem useful.
The purpose of the function is to optimize a loop over T
s when you can take advantage of loading multiple of them at a time with high alignment if that's available.
In such a situation, you always need a loop over the prefix and suffix anyway to correctly handle them, since the assumption is that you don't know the alignment of the input data. So if the middle happened to be empty, well, it might run slower than you expected, but the correct prefix and suffix handling will mean that the correct result will be computed anyway.
Thanks all for your valuable inputs.
To summarize, so peoples can find the best solution on this thread:
/// Copy u16 slice to u8 slice.
pub fn u16_to_u8_cpy(bytes: &[u16], my_bytes: &mut [u8]) {
bytes.iter()
.zip(my_bytes.chunks_exact_mut(2))
.for_each(|(a, b)| b.copy_from_slice(&a.to_ne_bytes()));
}
Generate this assembly code (with -C opt-level=3
):
example::u16_to_u8_cpy:
push rax
shr rcx
cmp rcx, rsi
cmovae rcx, rsi
test rcx, rcx
je .LBB0_2
mov rax, rdi
add rcx, rcx
mov rdi, rdx
mov rsi, rax
mov rdx, rcx
call qword ptr [rip + memcpy@GOTPCREL]
.LBB0_2:
pop rax
ret
It works starting from 1.36+.
Exactly. I wish I described it as you had as that’s my understanding too. The point I was trying to make is that the function that processed the middle chunk (my function), would likely fail if it received the prefix or suffix. It was designed to process larger swaths of memory. Gaps created by providing it smaller chunks would be problematic (not UB, but just wrong).
So, I would find it difficult to believe the docs for align_to
leave open the possibility that at any later point in time the function I just described should anticipate anything else other than the middle, ever.
There won't be gaps -- the result will always cover the entire input -- the middle just might not be as long as you wished.
Here's an example of what I understand to be a correct use: Rust Playground
fn u32_contains_zero_byte(v: u32) -> bool {
// https://graphics.stanford.edu/~seander/bithacks.html#ZeroInWord
0 != !((((v & 0x7F7F7F7F) + 0x7F7F7F7F) | v) | 0x7F7F7F7F)
}
pub fn contains_zero_byte(x: &[u8]) -> bool {
let (prefix, middle, suffix) =
// SAFETY: `u32` can support any bit pattern, `u8` contains no padding.
unsafe { x.align_to::<u32>() };
prefix.iter().copied().any(|x| x == 0)
|| middle.iter().copied().any(u32_contains_zero_byte)
|| suffix.iter().copied().any(|x| x == 0)
}
That will be better able to take advantage of bigger aligned reads when the middle is longer, but if the prefix or suffix is longer than expected, it just naturally works -- and there's no extra code to handle those cases, because it's always possible for the prefix or suffix (or both) to be non-empty (such as a 6-element slice where the first is at 0xXXXXXXX3
), so that code is needed anyway.
@scottmcm You have me doubting myself. Perhaps it has to do with the specifics of my implementation.
Here is where I believe the logic in the code depends on the middle chunk returned by align_to
being 128bits. Anything less "is not part of the plan" :))
...
// Mmap::align_to
//
// prefix: &[u8] ... aka head_u8
// shorts: &[_128] ... aka body_vectors
// suffix: &[u8] with max length of 15 (not guaranteed but how it plays out now)... aka tail_u8
//
let (head_u8, body_vectors, tail_u8) =
unsafe { bytes.align_to::<__m128>() };
...
while simdinput_cnt <= iter_cnt {
// load a 64-byte slice of the data into the registers
let input = unsafe {
SimdInput::new(body_vectors.get_unchecked(simdinput_cnt as usize..))
};
#[cfg(debug_assertions)]
input.show();
// transform the 64-bytes -> 64-bit structure
input.structure(&mut set_bits, &mut inside_str);
SimdInput::crush_set_bits(
&mut struct_acc,
set_bits,
codepoint_cnt,
&mut array_idx,
);
#[cfg(debug_assertions)]
{
println!(
"🟢 simdinput_cnt: {} of len: {}",
simdinput_cnt, iter_cnt
);
input.show();
}
simdinput_cnt += INPUT_LENGTH; // simdinput => raw input
codepoint_cnt += 64; // codepoint => setbits
}
...
// Maintain continuity of the memory whilst padding to SimdInput
// load the remaining 128
// load tail (0-16 x u8)
let padded_input = unsafe {
SimdInput::new_with_padding(
body_vectors.get_unchecked(simdinput_cnt as usize..),
tail_u8,
)
};
// reset the set_bits b/c the logic relies on any unused
// memory be set to zero.
set_bits = 0;
padded_input.structure(&mut set_bits, &mut inside_str);
SimdInput::crush_set_bits(
&mut struct_acc,
set_bits,
codepoint_cnt,
&mut array_idx,
);
#[cfg(debug_assertions)]
println!(
"\n📋\ntail_u8 len: {}\n{}",
tail_u8.len(),
ByteReport::_u8_as_str(tail_u8)
);
// 🎉 The index result!
#[cfg(debug_assertions)]
{
println!("🎉 index:\n{:?}", struct_acc);
println!("len: {:?}", struct_acc.len());
}
StructureIndex(cast_vec(struct_acc))
/// padding in the number of 128 vectors
pub(crate) fn new_with_padding(ptr: &[__m128], tail: &[u8]) -> Self {
let load = ptr.len();
println!("---------------------------------------------------------------------------------");
println!("Last iteration");
println!("Number of vectors to load {}", load);
println!("The tail len: {}", tail.len());
println!("---------------------------------------------------------------------------------\n");
assert!(
load < 4,
"The SSE/AVX simdinput is not being initialized correctly"
);
assert!(
tail.len() < 16,
"The SSE/AVX tail_u8 is not being initialized correctly"
);
let mut padded_tail: [u8; 16] = [0; 16];
tail.iter()
.enumerate()
.for_each(|(i, value_u8)| padded_tail[i] = *value_u8);
println!("The now padded tail: {:?}", &padded_tail);
unsafe {
let padded_tail = padded_tail.get_unchecked(0..).as_ptr();
match load {
3 => Self {
v0: _mm_load_si128(ptr.as_ptr() as *const __m128i),
v1: _mm_load_si128(ptr.as_ptr().add(1) as *const __m128i),
v2: _mm_load_si128(ptr.as_ptr().add(2) as *const __m128i),
v3: _mm_loadu_si128(padded_tail as *const __m128i),
},
2 => Self {
v0: _mm_load_si128(ptr.as_ptr() as *const __m128i),
v1: _mm_load_si128(ptr.as_ptr().add(1) as *const __m128i),
v2: _mm_loadu_si128(padded_tail as *const __m128i),
v3: _mm_setzero_si128(),
},
1 => Self {
v0: _mm_load_si128(ptr.as_ptr() as *const __m128i),
v1: _mm_loadu_si128(padded_tail as *const __m128i),
v2: _mm_setzero_si128(),
v3: _mm_setzero_si128(),
},
0 => Self {
v0: _mm_loadu_si128(padded_tail as *const __m128i),
v1: _mm_setzero_si128(),
v2: _mm_setzero_si128(),
v3: _mm_setzero_si128(),
},
_ => panic!(
"The SSE/AVX simdinput is not being initialized correctly"
),
}
}
}
I don't see you using head_u8
or tail_u8
in there? That's almost certainly wrong.
align_to
doesn't actually guarantee that the suffix has a max length of 15
You are really good at catching me when I go-astray and risk creating confusion. In practice, in this instance, the algorithm would "zero-out" the third chunk at 16. That makes sense given what I was specifying for the minimum size of the middle chunk at 128 bits. So, the 16 x 8 = 128 -> zero suffix was useful... also, I had to juggle keeping track of bits vs bytes; it was a useful reminder of how the u8
relates to 128
when my eyes and brain start to fry whilst jumping around the code base.
To be clear, and to the essence of your point: there is no guarantee that this will be the case. Should the std lib choose to switch the algorithm, it is within the contract to allocate any number of u8
to either the prefix or suffix.
I think miri used to always return all the data in the first slice
You are correct. I updated the code to include what I did with the tail. A couple of notes. (1) to @SkiFire13 point about the align_to
guarantee, at the every least, I run a runtime assertion making sure the size does not go higher than what I expect. (2) the code does not process the prefix. In practice, it was not necessary. The files are validated to ensure it meets the ascii spec. The way I was thinking about it, that means I have multiples of u8
, and given 128
, any "remainder" will be in the suffix, anything below u8
would be in the prefix, so "not to worry" - a convenient and somewhat lazy conclusion on my part. I'll update the code to cover these bases in a more complete and useful manner.
I took another look at the docs for Mmap::align_to
and docs for slice::align_to
:
This method splits the slice into three distinct slices: prefix, correctly aligned middle slice of a new type, and the suffix slice. The middle slice will have the greatest length possible for a given type and input slice.
... are we talking about the same align_to
? the docs for std::alloc::Layout::align_to
is a different animal.
That actually comes from its implementation of Deref<Target = [u8]>
, so they're the same. The documentation is different because it's pretty old, the last build of the mmap
crate was done in June 2018, before align_to
was stabilized. When it was stabilized, in September 2018, the documentation was changed to warn that the middle slice may not have the greatest length possible.
This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.