I'm trying to understand an alignment issue (an older bug I worked around) that showed up on macOS specifically, and I don't have access to that platform except through GitHub Actions.
It revolves around code that looks something like this, using an align(32) struct inside a thread local - align(32) is used here for compatibility with AVX (SIMD) loads.
There are a couple of questions. Maybe the outer and inner pointers could be different (MaskBuffer vs MaskBuffer.buffer in the code) - though I wouldn't expect them to be. Maybe thread-local storage (TLS) on macOS doesn't respect the type's required alignment?
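For reference, a minimal sketch of the setup described above. This is not the actual code: the buffer size, the `repr(C)` (added so the field is guaranteed to sit at offset 0), and the helper name `mask_buffer_addrs` are assumptions made for illustration. It returns both the outer (struct) and inner (field) addresses so they can be compared and checked against the 32-byte boundary.

```rust
use std::cell::RefCell;

// Hypothetical size; the real buffer size is not given in this thread.
const MASK_BUF_SIZE: usize = 256;

// 32-byte alignment so the buffer can be the target of aligned AVX loads.
// repr(C) is an assumption here: it pins the single field to offset 0.
#[repr(C, align(32))]
struct MaskBuffer {
    buffer: [u8; MASK_BUF_SIZE],
}

thread_local! {
    static MASK_BUF: RefCell<MaskBuffer> =
        RefCell::new(MaskBuffer { buffer: [0; MASK_BUF_SIZE] });
}

/// Returns (address of the struct, address of its inner buffer)
/// for the calling thread's copy of the thread local.
fn mask_buffer_addrs() -> (usize, usize) {
    MASK_BUF.with(|cell| {
        let mb = cell.borrow();
        let outer = &*mb as *const MaskBuffer as usize;
        let inner = mb.buffer.as_ptr() as usize;
        (outer, inner)
    })
}
```

If the platform honours the type's alignment in TLS, both addresses should be equal and a multiple of 32; a debug assertion on `inner % 32 == 0` is the kind of check that tripped here.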
I haven't found any previous discussions about this, but I guess using align(32) in thread locals is also rare.
When I added slightly different debug assertions, the problem went away. But this is also a problem where the arbitrariness of code layout plays a role, so it's expected not to reproduce very reliably.
You could also try adding a dummy u32 to the struct to see if maybe the align(32) isn’t working. I’m assuming that would get the right alignment.
I did come across this issue that supposedly was fixed in Xcode years ago, but maybe there is still a similar issue. Their temporary fix seemed to be adding a dummy variable for alignment.
This is what the debug assertions are doing - they check the alignment that way. I'm probably overly cautious here - in the sense I should maybe just report a bug.
I am not sure what the reason for an alignment like that is; is the idea that you would like to do unsafe aligned loads on AVX512? Do you have a benchmark supporting the need for that over unaligned loads?
I think it would be useful to have a smaller example, e.g. something like this, that would allow checking in godbolt which instructions the compiler emits.
Also, did you try to run the tests under Miri on the CI? Soundness should be independent of the architecture if you are not using architecture-specific unsafe code. Are you?
It's quite some time since I was actively looking at the simd kernels we use, but it's using the AVX/FMA instruction sets (not AVX512), and aligned AVX loads natively require 32-byte alignment. It would probably be possible to change the kernel and create a benchmark that shows the difference, if any. But I don't think it should be necessary to motivate - it is not over-aligned, it's the native alignment of that SIMD type, and why not support it? (The byte buffer here however, is used generically, regardless if we're currently computing a f32 or f64 matrix mult, regardless of kernel, so it's written with u8 and can't really be written with any single SIMD type).
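To make the aligned/unaligned distinction concrete, here is a minimal sketch (not the actual kernel code; `Aligned8`, `load_both` and `roundtrip` are names invented for illustration). `_mm256_load_ps` requires a 32-byte-aligned address, while `_mm256_loadu_ps` accepts any address. The sketch assumes x86_64 and uses runtime detection so it simply returns `None` elsewhere.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m256, _mm256_load_ps, _mm256_loadu_ps, _mm256_storeu_ps};

// Eight f32 values, aligned to the native 32-byte alignment of __m256.
#[repr(C, align(32))]
struct Aligned8([f32; 8]);

/// Loads the same data with the aligned and the unaligned AVX load.
/// Safety: the caller must ensure the CPU supports AVX.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn load_both(data: &Aligned8) -> ([f32; 8], [f32; 8]) {
    let p = data.0.as_ptr();
    // The aligned load is only valid if this holds.
    debug_assert_eq!(p as usize % 32, 0);
    let mut out_a = [0.0f32; 8];
    let mut out_u = [0.0f32; 8];
    unsafe {
        let a: __m256 = _mm256_load_ps(p); // requires 32-byte alignment
        let u: __m256 = _mm256_loadu_ps(p); // no alignment requirement
        _mm256_storeu_ps(out_a.as_mut_ptr(), a);
        _mm256_storeu_ps(out_u.as_mut_ptr(), u);
    }
    (out_a, out_u)
}

/// Returns Some((aligned result, unaligned result)) when AVX is available.
fn roundtrip(values: [f32; 8]) -> Option<([f32; 8], [f32; 8])> {
    let data = Aligned8(values);
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // Safe to call: AVX support was just checked at runtime.
            return Some(unsafe { load_both(&data) });
        }
    }
    let _ = &data;
    None
}
```

Both loads return the same values; the difference is only in what the instruction demands of the address, which is why the buffer's alignment matters for the aligned variant.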
Yes, the matrix kernels use std::arch simd intrinsics with unsafe and they have worked the same way since Rust 1.28 or so, stable Rust, so they have been working well.
I'm not quite sure in what way you are inquiring about soundness - it seems to be simpler than that here - is the struct aligned to the requested boundary or not. (The example code, that trips the debug assertion on macOS, has now come through a green Miri run. However, I'm unsure how to interpret it - it's good, but I'm not sure what it means with the SIMD intrinsics, and due to gigantic slowness it's only testing small matrices. I would also assume that when Miri is executing, it's not detecting that AVX is enabled, so we can't really test the code you wanted to test, it's skipping all the explicit SIMD then.)
Thanks. I see, so, you are hard-coding the instruction set on the binary. Ignore my comment on soundness, then. Thanks for clarifying. I can't offer further help, unfortunately.
(below is not related to this problem in particular, just a general info exchange).
In arrow2 we had allocations along cache lines (so 128 bytes on Intel x86, which is also aligned with AVX512 and AVX/FMA), but I could not find a performance difference in using this vs native alignment. Thus, we dropped specific alignments and now use Rust's native allocator, which gave us amazing compatibility with Vec.
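As a rough illustration of what such an over-aligned allocation looks like with only the standard library (this is a sketch, not arrow2's actual code; `AlignedBytes` and `CACHE_ALIGN` are hypothetical names, and 128 is the figure mentioned above):

```rust
use std::alloc::{alloc_zeroed, dealloc, handle_alloc_error, Layout};

// Alignment taken from the figure quoted above; an assumption for this sketch.
const CACHE_ALIGN: usize = 128;

/// A zero-initialised byte buffer whose start address is a multiple of CACHE_ALIGN.
struct AlignedBytes {
    ptr: *mut u8,
    layout: Layout,
}

impl AlignedBytes {
    fn new(len: usize) -> Self {
        // The global allocator must not be asked for zero-sized allocations.
        assert!(len > 0, "length must be non-zero");
        let layout = Layout::from_size_align(len, CACHE_ALIGN).expect("valid layout");
        let ptr = unsafe { alloc_zeroed(layout) };
        if ptr.is_null() {
            handle_alloc_error(layout);
        }
        AlignedBytes { ptr, layout }
    }

    fn as_slice(&self) -> &[u8] {
        // Valid: ptr points to layout.size() initialised (zeroed) bytes.
        unsafe { std::slice::from_raw_parts(self.ptr, self.layout.size()) }
    }

    fn addr(&self) -> usize {
        self.ptr as usize
    }
}

impl Drop for AlignedBytes {
    fn drop(&mut self) {
        // Must be freed with the same layout it was allocated with.
        unsafe { dealloc(self.ptr, self.layout) }
    }
}
```

A `Vec<u8>` only promises the alignment of `u8`, and `Vec::from_raw_parts` requires the allocation to have been made with that alignment, so a buffer like this cannot be converted to or from a `Vec` without copying. That is the compatibility cost of a custom alignment.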
The other thing we use is packed_simd2 to give us the instructions for free. For example, in architectures that support it, we hit AVX512 instructions. We then use multiversion to compile to different instruction sets, with runtime detection.
There was a discussion on Apache Arrow's mailing list about the need to use cache-friendly alignments, with people from HPC arguing that aligned allocations are beneficial for vertical ops, but so far I was unable to observe any differences in Rust. This is why I was curious as to whether there was a simple example demonstrating that the alignment was bringing advantages.