Memory Alignment for AVX

I'm working on some digital signal processing code for an SDR, and to speed things up I implemented my dot-product function using AVX instructions. Everything works, but it's actually slower than my generic one that uses iterators: ~70ms for the iterator version vs ~170ms with AVX instructions.

I believe the problem is that I have to allocate my buffers on a 32-byte boundary to use the AVX instructions, which requires me to allocate new memory and copy the buffers over on every call. My question is: could I make an allocator that always allocates memory on a 32-byte boundary, and set it as the system allocator? What I'm currently doing requires me to call both alloc and dealloc by hand (at which point, why am I using Rust?), so my thought is to swap out the system allocator for one that always allocates on a 32-byte boundary.

If making all allocations on a 32-byte boundary is a bad idea (I'm thinking it might be for small allocations?), then is there an easy way to allocate just these buffers on a 32-byte boundary, but have Rust automatically de-allocate them at the proper time, like a normal Vec? Is it as simple as making my own Vec that implements the Drop trait, but allocates on a 32-byte boundary?
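For reference, the "own Vec with Drop" idea could look roughly like the sketch below. This is only a minimal illustration, not a full Vec replacement: `AlignedBuf` and its `zeroed` constructor are hypothetical names, the buffer is fixed-size and holds only `f32`, and zero-length buffers are simply rejected. It uses `std::alloc` with a 32-byte `Layout` and frees the memory in `Drop`, so the caller never calls dealloc directly.

```rust
use std::alloc::{alloc_zeroed, dealloc, handle_alloc_error, Layout};
use std::ops::{Deref, DerefMut};
use std::ptr::NonNull;

/// Hypothetical fixed-size buffer of f32 values aligned to 32 bytes.
pub struct AlignedBuf {
    ptr: NonNull<f32>,
    len: usize,
}

impl AlignedBuf {
    const ALIGN: usize = 32;

    /// Layout for `len` f32 values on a 32-byte boundary.
    fn layout(len: usize) -> Layout {
        let size = len
            .checked_mul(std::mem::size_of::<f32>())
            .expect("buffer size overflows usize");
        Layout::from_size_align(size, Self::ALIGN).expect("invalid layout")
    }

    /// Allocate a zero-filled buffer of `len` elements.
    pub fn zeroed(len: usize) -> Self {
        // Assumption of this sketch: zero-sized allocations are not supported.
        assert!(len > 0, "zero-length buffers are not handled in this sketch");
        let layout = Self::layout(len);
        // SAFETY: the layout has a non-zero size.
        let raw = unsafe { alloc_zeroed(layout) } as *mut f32;
        let ptr = NonNull::new(raw).unwrap_or_else(|| handle_alloc_error(layout));
        AlignedBuf { ptr, len }
    }
}

impl Deref for AlignedBuf {
    type Target = [f32];
    fn deref(&self) -> &[f32] {
        // SAFETY: ptr points to `len` initialized (zeroed) f32 values.
        unsafe { std::slice::from_raw_parts(self.ptr.as_ptr(), self.len) }
    }
}

impl DerefMut for AlignedBuf {
    fn deref_mut(&mut self) -> &mut [f32] {
        // SAFETY: same as above, and we hold a unique borrow of self.
        unsafe { std::slice::from_raw_parts_mut(self.ptr.as_ptr(), self.len) }
    }
}

impl Drop for AlignedBuf {
    fn drop(&mut self) {
        // SAFETY: the pointer came from alloc_zeroed with this same layout.
        unsafe { dealloc(self.ptr.as_ptr() as *mut u8, Self::layout(self.len)) }
    }
}
```

Because it derefs to a slice, something like `let mut buf = AlignedBuf::zeroed(1024); buf[0] = 1.0;` works the way it would with a Vec, and the memory is released when `buf` goes out of scope.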

Why can't you just use _mm256_loadu_ps instead? IIRC, on modern CPUs it performs the same as the aligned load when the data happens to be aligned.
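A rough sketch of what a dot product with unaligned loads might look like, so that plain slices can be passed in with no copying. The function names `dot`, `dot_avx` and `dot_scalar` are made up for illustration and are not the original poster's code; the sketch assumes equal-length `f32` slices and falls back to the iterator version when AVX is not available.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Plain iterator version, used as the fallback.
fn dot_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// AVX version using unaligned loads, so the slices need no special alignment.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn dot_avx(a: &[f32], b: &[f32]) -> f32 {
    let n = a.len().min(b.len());
    let chunks = n / 8;
    let mut acc = _mm256_setzero_ps();
    for i in 0..chunks {
        // loadu has no alignment requirement on the pointer.
        let va = _mm256_loadu_ps(a.as_ptr().add(i * 8));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i * 8));
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    // Horizontal sum: spill the 8 lanes to an array and add them up.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    let mut sum: f32 = lanes.iter().sum();
    // Handle the tail that does not fill a whole 8-float vector.
    for i in chunks * 8..n {
        sum += a[i] * b[i];
    }
    sum
}

/// Dispatches to the AVX version when the CPU supports it.
pub fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "slices must have the same length");
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was just checked at runtime.
            return unsafe { dot_avx(a, b) };
        }
    }
    dot_scalar(a, b)
}
```

Note that the vectorized version sums in a different order than the iterator version, so results can differ slightly in the last bits for general floating-point input.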

@newpavlov, that worked like a champ! I'd just completely overlooked those instructions as I was cribbing off some C++ code from elsewhere. Thanks!