AtomicU16, godbolt

zeroexcuses · March 30, 2022, 6:16am

Can I get some help parsing the output of Compiler Explorer?

I am trying to figure out if reading from [AtomicU16; ... ] is as fast as reading from [u16; ...], but I'm getting far more x86_64 asm than expected.

EDIT: better link: Compiler Explorer

chrefr · March 30, 2022, 6:24am

You got unoptimized assembly. Add -Copt-level=3 (or just -O, although I think it's only optimization level 2) to the options.

chrefr · March 30, 2022, 6:27am

It is not going to be the same speed: the version with atomics is not vectorized. I guess x86 doesn't apply strong orderings to SSE (or AVX if you specify -Ctarget-cpu=native, which is a good idea) instructions.

zeroexcuses · March 30, 2022, 6:36am

Actually, this depends heavily on usage, right? I just ran a benchmark memcpy-ing 2 G worth of data, and it seems DRAM latency dwarfs everything (the AtomicU16 version was 'slightly faster' due to noise).
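For anyone who wants to repeat this kind of comparison, here is a rough sketch of such a benchmark. The function names copy_plain and copy_atomic, the buffer size, and the repeat count are all made up for illustration, and the buffer is deliberately small enough to stay in cache so that memory traffic is less likely to hide the difference. Treat the timings as noisy hints, not measurements.

```rust
use std::hint::black_box;
use std::sync::atomic::{AtomicU16, Ordering};
use std::time::Instant;

/// Copy from a plain slice; the optimizer is free to vectorize this loop.
pub fn copy_plain(dst: &mut [u16], src: &[u16]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d = *s;
    }
}

/// Copy from a slice of atomics, one relaxed load per element.
pub fn copy_atomic(dst: &mut [u16], src: &[AtomicU16]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d = s.load(Ordering::Relaxed);
    }
}

fn main() {
    // Assumed sizes, chosen only for illustration.
    const N: usize = 1 << 16; // 128 KiB of u16 per buffer
    const REPEATS: usize = 1_000;

    let plain: Vec<u16> = (0..N).map(|i| i as u16).collect();
    let atomic: Vec<AtomicU16> = (0..N).map(|i| AtomicU16::new(i as u16)).collect();
    let mut dst = vec![0u16; N];

    let start = Instant::now();
    for _ in 0..REPEATS {
        copy_plain(&mut dst, &plain);
        // Keep the compiler from discarding the copy.
        black_box(&dst);
    }
    println!("plain:  {:?}", start.elapsed());

    let start = Instant::now();
    for _ in 0..REPEATS {
        copy_atomic(&mut dst, &atomic);
        black_box(&dst);
    }
    println!("atomic: {:?}", start.elapsed());
}
```

Build with optimizations (for example -C opt-level=3), otherwise both loops are dominated by unoptimized bounds checks and iterator overhead.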

scottmcm · March 30, 2022, 7:25am

Notably it's not just not using SSE; it's not widening them to dword or qword either.

So I think it's more complex than just SSE: these are atomic loads, and the LLVM LangRef says the ordering parameters determine

"which other atomic instructions on the same address they synchronize with"

So it's not obvious to me what the semantics are supposed to be if you use atomics of different widths and base addresses that overlap the same byte. Maybe doing a wider atomic load would break the synchronization edges between those atomic operations and anything that might be writing to those same places because it wouldn't be the "same address", and thus that's not allowed.

And I don't know what the memory coherence subsystems on x86 would do with merged atomics either. This note was posted recently that, at least for "semaphores", x86 says you should use consistent addresses and widths: https://twitter.com/at_tcsc/status/1501712444451741696

zeroexcuses · March 30, 2022, 8:20am

Out of curiosity, is there any legitimate reason to do this? I don't see how overlapping atomics would even pass the type/borrow checker.

Cerber-Ursi · March 30, 2022, 8:59am

If I understand correctly, they're talking about the following:

pub fn cp1(dst: &mut [u16; 2], src: &[u16; 2]) {
    for i in 0..2 {
        dst[i] = src[i];
    }
}

pub fn cp2(dst: &mut [u16; 2], src: &[std::sync::atomic::AtomicU16; 2]) {
    for i in 0..2 {
        dst[i] = src[i].load(std::sync::atomic::Ordering::Relaxed);
    }
}

In the first case, both u16s can be copied as one u32. In the second case, however, the hardware must perform an atomic read of each u16 independently, since otherwise there might be a race between this code reading src[0] and some other code writing src[1].
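To make the "some other code writing src[1]" scenario concrete, here is a small sketch. cp2 is the function from above; read_while_writing is a hypothetical helper added only for this illustration. It runs a writer thread that stores to src[1] while the main thread calls cp2 on the same array, which is allowed because AtomicU16 can be modified through a shared reference.

```rust
use std::sync::atomic::{AtomicU16, Ordering};
use std::thread;

pub fn cp2(dst: &mut [u16; 2], src: &[AtomicU16; 2]) {
    for i in 0..2 {
        dst[i] = src[i].load(Ordering::Relaxed);
    }
}

/// Hypothetical helper: read both elements while another thread
/// keeps writing to src[1] only.
pub fn read_while_writing() -> [u16; 2] {
    let src = [AtomicU16::new(7), AtomicU16::new(0)];
    let mut dst = [0u16; 2];

    thread::scope(|s| {
        // The writer never touches src[0].
        s.spawn(|| {
            for v in 1..=1000u16 {
                src[1].store(v, Ordering::Relaxed);
            }
        });

        // Concurrent reader: each element is loaded atomically on its own.
        cp2(&mut dst, &src);
    });

    dst
}

fn main() {
    let out = read_while_writing();
    // src[0] is never written, so it must still hold its initial value.
    assert_eq!(out[0], 7);
    // src[1] holds whatever value the writer had stored at load time.
    assert!(out[1] <= 1000);
    println!("{:?}", out);
}
```

The same program cannot be written with cp1, since &[u16; 2] gives no way for another thread to write src[1] while the reference is live.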
