AtomicU16, godbolt

Can I get some help parsing the output of Compiler Explorer?

I am trying to figure out whether reading from [AtomicU16; ...] is as fast as reading from [u16; ...], but I'm getting far more x86_64 asm than expected.

EDIT: better link: Compiler Explorer

You got unoptimized assembly. Add -Copt-level=3 (or just -O, although I think it's only optimization level 2) to the options.

It is not going to be the same speed: the version with atomics is not vectorized. I guess x86 doesn't apply strong orderings to SSE (or AVX if you specify -Ctarget-cpu=native, which is a good idea) instructions.

Actually, this depends heavily on usage, right? I just ran a benchmark of memcpy-ing 2 G worth of data, and it seems DRAM latency dwarfs everything (the AtomicU16 version was 'slightly faster', which is just noise).
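For anyone who wants to repeat that kind of comparison, here is a minimal sketch of a timing harness. It is not the benchmark from the post above: `copy_plain`, `copy_atomic` and `time_copies` are hypothetical names, and a single `Instant`-based measurement like this is noisy, so treat the numbers as a rough indication only.

```rust
use std::sync::atomic::{AtomicU16, Ordering};
use std::time::{Duration, Instant};

/// Copy from a plain slice; the compiler is free to vectorize this loop.
pub fn copy_plain(dst: &mut [u16], src: &[u16]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d = *s;
    }
}

/// Copy from a slice of atomics, one relaxed load per element.
pub fn copy_atomic(dst: &mut [u16], src: &[AtomicU16]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d = s.load(Ordering::Relaxed);
    }
}

/// Time one pass of each copy over `len` elements.
/// Hypothetical helper: returns (plain time, atomic time).
pub fn time_copies(len: usize) -> (Duration, Duration) {
    let plain: Vec<u16> = (0..len).map(|i| i as u16).collect();
    let atomic: Vec<AtomicU16> = (0..len).map(|i| AtomicU16::new(i as u16)).collect();
    let mut dst = vec![0u16; len];

    let start = Instant::now();
    copy_plain(&mut dst, &plain);
    // Keep the optimizer from discarding the copy.
    std::hint::black_box(&dst);
    let plain_time = start.elapsed();

    let start = Instant::now();
    copy_atomic(&mut dst, &atomic);
    std::hint::black_box(&dst);
    let atomic_time = start.elapsed();

    (plain_time, atomic_time)
}
```

With a buffer much larger than the caches, both loops are likely to be limited by memory bandwidth, which would match the observation above; with a small buffer that stays in cache, the missing vectorization is more likely to show up.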

Notably, it's not just that it isn't using SSE; it's not widening the loads to dword or qword either.

So I think it's more complex than just SSE -- these are atomic loads, and the LLVM LangRef says their ordering parameters determine

"which other atomic instructions on the same address they synchronize with"

So it's not obvious to me what the semantics are supposed to be if you use atomics of different widths and base addresses that overlap the same byte. Maybe doing a wider atomic load would break the synchronization edges between those atomic operations and anything that might be writing to those same places because it wouldn't be the "same address", and thus that's not allowed.

And I don't know what the memory coherence subsystems on x86 would do with merged atomics either. A note posted recently points out that, at least for "semaphores", x86 says you should use consistent addresses and widths: https://twitter.com/at_tcsc/status/1501712444451741696

Out of curiosity, is there any legitimate reason to do this? I don't see how overlapping atomics would even pass the type/borrow checker.

If I understand correctly, they're talking about the following:

pub fn cp1(dst: &mut [u16; 2], src: &[u16; 2]) {
    for i in 0..2 {
        dst[i] = src[i];
    }
}

pub fn cp2(dst: &mut [u16; 2], src: &[std::sync::atomic::AtomicU16; 2]) {
    for i in 0..2 {
        dst[i] = src[i].load(std::sync::atomic::Ordering::Relaxed);
    }
}

In the first case, both u16s can be copied as one u32. In the second case, however, the hardware must perform an atomic read of each u16 independently, since otherwise there might be a race between this code reading src[0] and some other code writing src[1].
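To make the "some other code writing src[1]" scenario concrete, here is a small sketch. `cp2` is repeated so the block compiles on its own; `read_while_writing` is a hypothetical helper, and the stored value 7 is arbitrary.

```rust
use std::sync::atomic::{AtomicU16, Ordering};
use std::thread;

/// Same shape as `cp2` above: one relaxed atomic load per element.
pub fn cp2(dst: &mut [u16; 2], src: &[AtomicU16; 2]) {
    for i in 0..2 {
        dst[i] = src[i].load(Ordering::Relaxed);
    }
}

/// One thread writes src[1] while another copies both elements.
/// Hypothetical helper, only meant to illustrate the scenario.
pub fn read_while_writing() -> [u16; 2] {
    let src = [AtomicU16::new(1), AtomicU16::new(2)];
    let mut dst = [0u16; 2];

    thread::scope(|s| {
        // Writer: touches only src[1].
        s.spawn(|| src[1].store(7, Ordering::Relaxed));
        // Reader: loads src[0] and src[1] as two separate atomic reads.
        s.spawn(|| cp2(&mut dst, &src));
    });

    dst
}
```

Because every access goes through an atomic, this is well-defined even though the threads run concurrently: the first element of the result is always 1, and the second is either 2 or 7 depending on timing. The same pattern with a plain `[u16; 2]` would be rejected by the borrow checker, since the writer would need a `&mut` to data the reader is borrowing.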