How to Properly Align a Boxed Slice?

I have set up the following benchmark:

pub fn criterion_benchmark(c: &mut Criterion) {
    const NUM_INPUTS: usize = 64;
    const NUM_HIDDEN_0: usize = 1;
    const NUM_OUTPUTS: usize = 128;

    let mut vertices = vec![
        vec![0.0; NUM_INPUTS].into_boxed_slice(),
        vec![0.0; NUM_INPUTS + NUM_HIDDEN_0].into_boxed_slice(),
        vec![0.0; NUM_INPUTS + NUM_HIDDEN_0 + NUM_OUTPUTS].into_boxed_slice(),
    ]
    .into_boxed_slice();

    let edges = vec![
        vec![0.0; NUM_INPUTS * NUM_HIDDEN_0].into_boxed_slice(),
        vec![0.0; (NUM_INPUTS + NUM_HIDDEN_0) * NUM_OUTPUTS].into_boxed_slice(),
    ]
    .into_boxed_slice();

    c.bench_function("feed_forward", |b| {
        b.iter(|| feed_forward(&mut vertices, &edges))
    });
}

This is the relevant function definition:

pub fn feed_forward(vertices: &mut [Box<[f32]>], edges: &[Box<[f32]>]) { [...] }

The issue I have is with this line, which the feed_forward function executes:

lgc_f32x8(f32x8::from_slice_unaligned(scalar)).write_to_slice_unaligned(scalar);

I'd much rather call f32x8::from_slice_aligned and f32x8::write_to_slice_aligned instead. For that, my slice needs to be aligned to at least 32 bytes.

How would I go about that? Do I need to mess around with std::alloc, or is there another solution, either in the standard library or in a third-party crate?

The typical way SIMD code works is to have "front matter" and "end matter" handlers that deal with the unaligned parts, while the main chunk of the slice is processed with full SIMD.

If you want to guarantee you're aligned properly for an x8 SIMD type, the easy way is to do your allocation as that type.
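A minimal sketch of "allocating as that type", using only the standard library. Lanes8 is a hypothetical stand-in for f32x8 (eight f32 values with 32-byte alignment), and alloc_aligned and as_scalars_mut are illustrative names, not part of any crate. The same idea should carry over to a Box<[f32x8]>, assuming the real SIMD type has the same size and alignment.

```rust
/// Hypothetical stand-in for `f32x8`: eight `f32` lanes, 32-byte aligned.
#[derive(Clone, Copy)]
#[repr(C, align(32))]
pub struct Lanes8(pub [f32; 8]);

/// Allocates zeroed, 32-byte-aligned storage for at least `len` floats.
/// The length is rounded up to a whole number of 8-lane chunks.
pub fn alloc_aligned(len: usize) -> Box<[Lanes8]> {
    let chunks = (len + 7) / 8;
    vec![Lanes8([0.0; 8]); chunks].into_boxed_slice()
}

/// Views the chunked storage as a flat `f32` slice whose start is
/// 32-byte aligned because the allocation was made as `Lanes8`.
pub fn as_scalars_mut(chunks: &mut [Lanes8]) -> &mut [f32] {
    // SAFETY: `Lanes8` is `repr(C)` around `[f32; 8]`; its size is 32 bytes
    // with no padding, so the memory holds `chunks.len() * 8` valid `f32`s.
    unsafe {
        std::slice::from_raw_parts_mut(chunks.as_mut_ptr() as *mut f32, chunks.len() * 8)
    }
}
```

Because the allocator sees the 32-byte alignment requirement through the element type, no direct use of std::alloc is needed. The trade-off is that the buffer length is always a multiple of 8, so the caller has to track the logical length separately.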

I just found out about slice::align_to_mut, which seems to be perfectly suited for my case and fits your description well.
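For illustration, here is a rough sketch of the front matter / main chunk / end matter split done with slice::align_to_mut. AlignedChunk is a hypothetical 32-byte-aligned stand-in for f32x8 and apply_in_place is an illustrative name. The inner scalar loop over each chunk only marks the place where an aligned SIMD load and store would go.

```rust
/// Hypothetical stand-in for `f32x8`: eight `f32` lanes, 32-byte aligned.
#[repr(C, align(32))]
pub struct AlignedChunk(pub [f32; 8]);

/// Applies `op` to every element of `data`, splitting the slice into an
/// unaligned head, a 32-byte-aligned middle, and an unaligned tail.
pub fn apply_in_place(data: &mut [f32], op: impl Fn(f32) -> f32) {
    // SAFETY: any eight `f32` values form a valid `AlignedChunk`, and
    // `align_to_mut` only places correctly aligned memory in `middle`.
    let (head, middle, tail) = unsafe { data.align_to_mut::<AlignedChunk>() };

    // "Front matter": elements before the first 32-byte boundary.
    for x in head.iter_mut() {
        *x = op(*x);
    }
    // Main chunk: each element is one aligned group of eight floats.
    // Real code would use an aligned SIMD load/store here instead.
    for chunk in middle.iter_mut() {
        for x in chunk.0.iter_mut() {
            *x = op(*x);
        }
    }
    // "End matter": leftover elements after the last full chunk.
    for x in tail.iter_mut() {
        *x = op(*x);
    }
}
```

Note that the documentation for align_to_mut does not promise the middle part is as large as possible, so the head and tail loops have to stay correct for any split.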

Nonetheless, the main advantage I see in my method is that receiving a slice that is already correctly aligned will save me the branch logic for the "front matter". I hope this optimization will have a small but noticeable impact on performance for shorter slices.

Afterwards, I want to explore getting rid of the "end matter" branch logic by adding zeroed padding at the end of the slice, so that its length is an exact multiple of the f32x8 width.
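The padding idea could look roughly like the sketch below. LANES, padded_len, zero_padded and apply_padded are hypothetical names chosen for illustration. This block only addresses the length, not the alignment: a plain Box<[f32]> is not guaranteed to start on a 32-byte boundary.

```rust
/// Number of `f32` lanes in one `f32x8`.
pub const LANES: usize = 8;

/// Rounds `len` up to the next multiple of `LANES`.
pub fn padded_len(len: usize) -> usize {
    (len + LANES - 1) / LANES * LANES
}

/// Copies `values` into a new buffer whose length is a multiple of `LANES`,
/// filling the extra slots with zeros.
pub fn zero_padded(values: &[f32]) -> Box<[f32]> {
    let mut out = vec![0.0f32; padded_len(values.len())];
    out[..values.len()].copy_from_slice(values);
    out.into_boxed_slice()
}

/// Applies `op` chunk by chunk. With a padded buffer there is no remainder,
/// so no separate "end matter" loop is needed.
pub fn apply_padded(data: &mut [f32], op: impl Fn(f32) -> f32) {
    debug_assert_eq!(data.len() % LANES, 0);
    for chunk in data.chunks_exact_mut(LANES) {
        // Real code would load the chunk into an `f32x8` here.
        for x in chunk.iter_mut() {
            *x = op(*x);
        }
    }
}
```

One caveat: the padding slots are also run through op, so they stop being zero unless op maps 0 to 0. That is harmless as long as later stages ignore them or multiply them by zero weights.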
