Elusive alignment issue in a thread local (macOS only?)

I'm trying to understand an alignment issue (an older bug that I worked around) that showed up specifically on macOS, which I don't have access to except through GitHub Actions.

It revolves around code that looks something like this: an align(32) struct inside a thread local, where the 32-byte alignment is there for compatibility with aligned AVX (SIMD) loads.

/// Is the MaskBuffer's alignment respected?
#[repr(align(32))]
struct MaskBuffer {
    buffer: [u8; KERNEL_MAX_SIZE],
}

thread_local! {
    static MASK_BUF: UnsafeCell<MaskBuffer> =
        UnsafeCell::new(MaskBuffer { buffer: [0; KERNEL_MAX_SIZE] });
}

There are a couple of questions. Maybe the outer and inner pointers could differ (MaskBuffer vs. MaskBuffer.buffer in the code), though I wouldn't expect them to. Or maybe thread-local storage (TLS) on macOS doesn't respect the type's required alignment?
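To make the first question concrete, here is a minimal sketch that prints what the compiler reports for the type and compares the outer and inner addresses for an ordinary (non-TLS) value. The value of KERNEL_MAX_SIZE and the helper name outer_and_inner_addr are assumptions for illustration only; the real constant is not shown above.

```rust
use std::mem::{align_of, size_of};

// Hypothetical size, chosen only for this sketch.
const KERNEL_MAX_SIZE: usize = 1024;

#[repr(align(32))]
struct MaskBuffer {
    buffer: [u8; KERNEL_MAX_SIZE],
}

/// Returns (address of the struct, address of its `buffer` field).
fn outer_and_inner_addr(m: &MaskBuffer) -> (usize, usize) {
    let outer = m as *const MaskBuffer as usize;
    let inner = m.buffer.as_ptr() as usize;
    (outer, inner)
}

fn main() {
    // A plain local value, so this says nothing about TLS placement.
    let m = MaskBuffer { buffer: [0; KERNEL_MAX_SIZE] };
    let (outer, inner) = outer_and_inner_addr(&m);
    println!("align_of::<MaskBuffer>() = {}", align_of::<MaskBuffer>());
    println!("size_of::<MaskBuffer>()  = {}", size_of::<MaskBuffer>());
    println!("outer % 32 = {}", outer % 32);
    println!("offset of buffer inside the struct = {}", inner - outer);
}
```

If the type-level numbers look right here but the thread-local address is still misaligned, that would point at TLS placement rather than at the struct layout.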

I haven't found any previous discussions about this, but I guess using align(32) in thread locals is also rare.

Evidence of a failed debug assertion on macOS (on the MaskBuffer.buffer pointer in that case) is here: Revert "FIX: Align mask buffer pointer manually" · bluss/matrixmultiply@58ea050. It's in a test PR where I revert the earlier workaround, to try to understand the problem better. (macOS 10.15, Rust 1.56)
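For readers wondering what aligning a buffer pointer manually can look like in general, here is a rough sketch. It is not the code from the matrixmultiply commit; PaddedBuffer, aligned_mut, ALIGN and the value of KERNEL_MAX_SIZE are all made up for illustration. The idea is to over-allocate by ALIGN - 1 bytes and skip forward to the first aligned address.

```rust
// Hypothetical values, chosen only for this sketch.
const KERNEL_MAX_SIZE: usize = 1024;
const ALIGN: usize = 32;

/// Over-sized storage that does not rely on any alignment of its own.
struct PaddedBuffer {
    bytes: [u8; KERNEL_MAX_SIZE + ALIGN - 1],
}

impl PaddedBuffer {
    fn new() -> Self {
        PaddedBuffer { bytes: [0; KERNEL_MAX_SIZE + ALIGN - 1] }
    }

    /// Returns a KERNEL_MAX_SIZE-byte slice whose start is ALIGN-byte aligned.
    fn aligned_mut(&mut self) -> &mut [u8] {
        let addr = self.bytes.as_ptr() as usize;
        // Bytes to skip to reach the next multiple of ALIGN (0 if already aligned).
        let pad = (ALIGN - addr % ALIGN) % ALIGN;
        &mut self.bytes[pad..pad + KERNEL_MAX_SIZE]
    }
}

fn main() {
    let mut buf = PaddedBuffer::new();
    let slice = buf.aligned_mut();
    println!("start % {} = {}", ALIGN, slice.as_ptr() as usize % ALIGN);
    println!("usable length = {}", slice.len());
}
```

The cost is up to 31 wasted bytes, and the result no longer depends on where the storage happens to be placed.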

When I added slightly different debug assertions, the problem went away. But this is a problem where arbitrary details of code layout play a role, so I don't expect it to reproduce reliably.

Does anyone know more? :slight_smile:

The original bug report of the macOS issue was: Test involving many 6x6 matrices fails randomly on Mac OS · Issue #55 · bluss/matrixmultiply

Maybe, as linked in the bug, it is caused by this code in rustc, if we still have it in Rust 1.56? Do not allow LLVM to increase a TLS's alignment on macOS. by kennytm · Pull Request #51828 · rust-lang/rust

Well, you could try printing the address of the buffer and see if it's divisible by 32.
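A minimal sketch of that check, assuming a made-up KERNEL_MAX_SIZE and hypothetical helper names (mask_buf_addr, addrs_from_threads). It reads the thread-local buffer's address on a few freshly spawned threads, since each thread gets its own copy and the placement may differ between them.

```rust
use std::cell::UnsafeCell;
use std::thread;

// Hypothetical size, chosen only for this sketch.
const KERNEL_MAX_SIZE: usize = 1024;

#[repr(align(32))]
struct MaskBuffer {
    buffer: [u8; KERNEL_MAX_SIZE],
}

thread_local! {
    static MASK_BUF: UnsafeCell<MaskBuffer> =
        UnsafeCell::new(MaskBuffer { buffer: [0; KERNEL_MAX_SIZE] });
}

/// Address of the `buffer` field in the current thread's MASK_BUF.
fn mask_buf_addr() -> usize {
    MASK_BUF.with(|cell| {
        let p: *mut MaskBuffer = cell.get();
        // SAFETY: `p` points to the live thread-local value; we only take the
        // field's address and never read or write through it.
        unsafe { std::ptr::addr_of!((*p).buffer) as usize }
    })
}

/// Spawns `n` threads and collects each thread's buffer address.
fn addrs_from_threads(n: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..n).map(|_| thread::spawn(mask_buf_addr)).collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let main_addr = mask_buf_addr();
    println!("main thread: {:#x} % 32 = {}", main_addr, main_addr % 32);
    for addr in addrs_from_threads(4) {
        println!("spawned:     {:#x} % 32 = {}", addr, addr % 32);
    }
}
```

Printing rather than asserting keeps the run going, so a CI log would show every address even if only some threads are affected.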

You could also try adding a dummy u32 to the struct to see if maybe the align(32) isn’t working. I’m assuming that would get the right alignment.

I did come across this issue that was supposedly fixed in Xcode years ago, but maybe a similar issue still exists. Their temporary fix seemed to be adding a dummy variable for alignment.

The alignment for u32 is 4.
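A small sketch of the difference, with made-up struct names (WithDummy, WithAlign): a u32 field only raises the struct's alignment to that of u32, while repr(align(32)) requests 32 directly. The value 4 for u32 holds on the usual targets but is platform-defined.

```rust
use std::mem::align_of;

/// A dummy u32 field raises the alignment only to that of u32.
#[allow(dead_code)]
struct WithDummy {
    _dummy: u32,
    buffer: [u8; 64],
}

/// repr(align(32)) requests a 32-byte alignment for the whole struct.
#[allow(dead_code)]
#[repr(align(32))]
struct WithAlign {
    buffer: [u8; 64],
}

fn main() {
    println!("align_of::<u32>()       = {}", align_of::<u32>());
    println!("align_of::<WithDummy>() = {}", align_of::<WithDummy>());
    println!("align_of::<WithAlign>() = {}", align_of::<WithAlign>());
}
```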

This is what the debug assertions are doing - they check the alignment that way. I'm probably overly cautious here - in the sense I should maybe just report a bug.

Oops. Always make that mistake when I read align(N) vs repr(uN). Think my brain needs to be upgraded to a new version.

I'm not sure what the reason for an alignment like that is; is the idea that you would like to do unsafe aligned loads on AVX512? Do you have a benchmark supporting the need for that over unaligned loads?

I think it would be useful to have a smaller example, e.g. something like this, that would let you check in godbolt which instructions the compiler emits.

Also, did you try running the tests under Miri on the CI? Soundness should be independent of the architecture if you are not using architecture-specific unsafe code. Are you?

It's been quite some time since I was actively looking at the SIMD kernels we use, but they use the AVX/FMA instruction sets (not AVX512), and aligned AVX loads natively require 32-byte alignment. It would probably be possible to change the kernel and create a benchmark that shows the difference, if any. But I don't think it should be necessary to motivate it: it is not over-aligned, it's the native alignment of that SIMD type, so why not support it? :slightly_smiling_face: (The byte buffer here, however, is used generically, regardless of whether we're currently computing an f32 or f64 matrix multiplication and regardless of kernel, so it's written with u8 and can't really be written with any single SIMD type.)
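As a rough illustration of why the 32-byte alignment matters, here is a sketch that is not taken from the matrixmultiply kernels. Aligned32, sum8_avx and sum8 are made-up names, and the runtime feature check is an assumption of this example. _mm256_load_ps is the aligned load and requires a 32-byte aligned address, whereas _mm256_loadu_ps would accept any address.

```rust
/// Eight f32 values with the 32-byte alignment that an aligned AVX load needs.
#[repr(align(32))]
struct Aligned32([f32; 8]);

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn sum8_avx(data: &Aligned32) -> f32 {
    use std::arch::x86_64::{_mm256_load_ps, _mm256_storeu_ps};
    let mut out = [0.0f32; 8];
    // SAFETY: the caller guarantees AVX is available; `data` is 32-byte
    // aligned by its type, and `out` has room for eight f32 values.
    unsafe {
        // Aligned load: the address must be a multiple of 32.
        let v = _mm256_load_ps(data.0.as_ptr());
        // Unaligned store: no alignment requirement on `out`.
        _mm256_storeu_ps(out.as_mut_ptr(), v);
    }
    out.iter().sum()
}

/// Sums the eight values, using the AVX path when it is detected at runtime.
fn sum8(data: &Aligned32) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // SAFETY: AVX support was just detected on this CPU.
            return unsafe { sum8_avx(data) };
        }
    }
    // Scalar fallback for other CPUs and architectures.
    data.0.iter().sum()
}

fn main() {
    let data = Aligned32([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]);
    println!("address % 32 = {}", &data as *const Aligned32 as usize % 32);
    println!("sum = {}", sum8(&data));
}
```

If the buffer were not actually 32-byte aligned, the aligned load is the instruction that would be expected to fault, which is why a misplaced thread local matters here.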

Yes, the matrix kernels use std::arch SIMD intrinsics with unsafe, and they have worked the same way on stable Rust since Rust 1.28 or so, so they have been working well.

I'm not quite sure in what way you are inquiring about soundness. The question here seems simpler than that: is the struct aligned to the requested boundary or not? (The example code that trips the debug assertion on macOS has now come through a green Miri run. However, I'm unsure how to interpret it. It's good, but I'm not sure what it means with the SIMD intrinsics, and because Miri is so slow it's only testing small matrices. I would also assume that Miri doesn't detect AVX as enabled when it executes, so we can't really test the code you wanted to test: it skips all the explicit SIMD.)


Thanks. I see, so, you are hard-coding the instruction set on the binary. Ignore my comment on soundness, then. Thanks for clarifying. I can't offer further help, unfortunately.

(below is not related to this problem in particular, just a general info exchange).

In arrow2 we had allocations aligned to cache lines (so 128 bytes on Intel x86, which is also compatible with AVX512 and AVX/FMA), but I could not find a performance difference between this and native alignment. Thus, we dropped specific alignments and now use Rust's native allocator, which gave us amazing compatibility with Vec.

The other thing we use is packed_simd2 to give us the instructions for free. For example, in architectures that support it, we hit AVX512 instructions. We then use multiversion to compile to different instruction sets, with runtime detection.

There was a discussion on Apache Arrow's mailing list about the need to use cache-friendly alignments, with people from HPC arguing that aligned allocations are beneficial for vertical ops, but so far I have been unable to observe any differences in Rust. This is why I was curious whether there was a simple example demonstrating that the alignment brings advantages. :slight_smile:

