macOS, TLS and alignment

Hi, we're looking at a seemingly macOS-specific bug affecting matrixmultiply. It involves the following code.

// set up buffer for masked (redirected output of) kernel
const KERNEL_MAX_SIZE: usize = 8 * 8 * 4;
const KERNEL_MAX_ALIGN: usize = 32;

struct MaskBuffer {
    buffer: [u8; KERNEL_MAX_SIZE],
}

// Use thread local if we can; this is faster even in the single threaded case because
// it is possible to skip zeroing out the array.
#[cfg(feature = "std")]
thread_local! {
    static MASK_BUF: UnsafeCell<MaskBuffer> =
        UnsafeCell::new(MaskBuffer { buffer: [0; KERNEL_MAX_SIZE] });
}

It would seem that on macOS, buffer is not always 32-byte aligned. Could this be a problem with TLS, or is something else going on?
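For anyone who wants to try reproducing this, here is a minimal sketch of how the alignment could be checked on several threads. It assumes the struct requests its alignment with `#[repr(align(32))]`, which the snippet above does not show, and the helper name `mask_buf_misalignment` is made up for illustration; it is not part of matrixmultiply.

```rust
use std::cell::UnsafeCell;
use std::mem::align_of;
use std::thread;

const KERNEL_MAX_SIZE: usize = 8 * 8 * 4;
const KERNEL_MAX_ALIGN: usize = 32;

// Assumption: the real type asks for 32-byte alignment like this.
#[repr(align(32))]
#[allow(dead_code)]
struct MaskBuffer {
    buffer: [u8; KERNEL_MAX_SIZE],
}

thread_local! {
    static MASK_BUF: UnsafeCell<MaskBuffer> =
        UnsafeCell::new(MaskBuffer { buffer: [0; KERNEL_MAX_SIZE] });
}

/// Hypothetical helper: address of this thread's buffer modulo
/// KERNEL_MAX_ALIGN. A result of 0 means the buffer is aligned.
fn mask_buf_misalignment() -> usize {
    MASK_BUF.with(|cell| cell.get() as usize % KERNEL_MAX_ALIGN)
}

fn main() {
    // The type itself should always report the requested alignment.
    assert_eq!(align_of::<MaskBuffer>(), KERNEL_MAX_ALIGN);

    println!("main thread: offset {}", mask_buf_misalignment());

    // Each spawned thread gets its own TLS slot, so check several of them.
    let handles: Vec<_> = (0..4)
        .map(|i| thread::spawn(move || (i, mask_buf_misalignment())))
        .collect();
    for handle in handles {
        let (i, offset) = handle.join().unwrap();
        println!("thread {}: offset {}", i, offset);
    }
}
```

A nonzero offset on any thread would suggest that the TLS slot, rather than the type's declared alignment, is where the 32-byte requirement gets lost.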

macOS + TLS has apparently been involved in other interesting issues before:

If anyone has more information, it would be valuable.