Correct way to use the SIMD intrinsics

Hello

I'm trying to research how to do SIMD correctly with just the bare-bones intrinsics that are already stable in std. I'm a bit unsure how far the equivalence between __m128i and, say, [u16; 8] goes, and how to work with these things.

Let's say I have this:

#[repr(align(16))]
struct Array([u16; 800]);

And I want to sum the array together. I've already checked that sse2 is available, so use of _mm_add_epi16 should be OK. But for that, I need to get my hands on the __m128i values. What is the difference (in what is or isn't UB, how fast they would be, etc.) between:

  • Transmuting aligned chunks of the array into them.
  • Casting pointers to aligned chunks of the array and dereferencing them.
  • Using some other intrinsic to load them (I haven't found one for the u16/128-bit combination, but there seem to be float ones, for example _mm_load_pd).
  • Keeping an array of [__m128i; 100] around instead.

Is any of these methods preferred? Why? If I wanted to have a u16x8 type, is it better to keep an (aligned) [u16; 8] inside, or a __m128i (looking around the libraries out there, I seem to be able to find both ways)? Is it safe to deref the latter to [u16; 8] by simply casting the pointer type? I think the unsafe code guidelines suggest it is safe, but I'm not completely sure. Does that hold for the float vectors and mask vectors as well?
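For concreteness, a minimal sketch of the kind of summing loop in question, assuming wrapping u16 arithmetic is acceptable. It uses _mm_load_si128 (the aligned integer load; it is shared by all integer element widths, which may be why no u16-specific one turned up) and a by-value transmute at the end:

use std::arch::x86_64::{__m128i, _mm_add_epi16, _mm_load_si128, _mm_setzero_si128};

#[repr(align(16))]
struct Array([u16; 800]);

#[target_feature(enable = "sse2")]
unsafe fn sum(arr: &Array) -> u16 {
    let mut acc = _mm_setzero_si128();
    // 800 u16s = 100 chunks of 8 lanes; each chunk stays 16-byte aligned
    // because the array starts aligned and 8 * size_of::<u16>() == 16.
    for chunk in arr.0.chunks_exact(8) {
        let v = _mm_load_si128(chunk.as_ptr() as *const __m128i);
        acc = _mm_add_epi16(acc, v); // lane-wise wrapping add
    }
    // By-value transmute back to lanes for the final horizontal fold.
    let lanes: [u16; 8] = std::mem::transmute(acc);
    lanes.iter().copied().fold(0u16, u16::wrapping_add)
}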


An option I've seen in code working with SIMD vectors is to use a union:

union u16x8 {
    a: [u16; 8],
    b: __m128i
}

That handles the alignment requirement and allows you to easily poke at the individual values or work with the whole vector. It does require unsafe to access, but you'll probably be in unsafe-land anyway working with the intrinsics.
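A minimal sketch of that union in use; repr(C) is added here so the zero-offset layout of both fields is actually guaranteed (more on that below), and the lane values are just for illustration:

use std::arch::x86_64::{__m128i, _mm_add_epi16};

#[repr(C)]
union U16x8 {
    a: [u16; 8],
    b: __m128i,
}

fn main() {
    unsafe {
        let x = U16x8 { a: [1, 2, 3, 4, 5, 6, 7, 8] };
        let y = U16x8 { a: [8, 7, 6, 5, 4, 3, 2, 1] };
        // Work with the whole vector...
        let sum = U16x8 { b: _mm_add_epi16(x.b, y.b) };
        // ...or poke at the individual values.
        assert_eq!(sum.a, [9u16; 8]);
    }
}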


I don't know if there is a rock-solid authoritative source for this, but here's a small dump from my brain:

N.B. I've exclusively used SIMD with its integer APIs for accelerating string-related algorithms. I've never used it for floating point, so that puts me in a bit of a niche and could potentially make my experience narrow. So, I guess, just take what I say with an appropriate amount of salt. 🙂


So, you think that the same approach with union, but for __m128d, could lead to UB by assigning the "wrong" kind of NaN into one of the array elements?

Possibly. Maybe @RalfJung or @gnzlbg could provide more guidance?

Please always make your unions repr(C) if you rely on the offset of all fields being 0! Like for structs, we make no layout guarantees for repr(Rust) unions. (But I see @BurntSushi already said that. Still, this probably cannot be repeated often enough given how frequently this mistake is made.)

I would say it is safe to assume that you can transmute the SIMD types to/from integer/float arrays of the right size -- though please keep alignment in mind! &mut [u8; 16] and &mut __m128i are not mutually transmutable. By-value transmutes are fine, though.
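To make the alignment point concrete, a small demonstration (the 16-byte alignment of __m128i is guaranteed; [u8; 16] is only 1-aligned):

use std::arch::x86_64::__m128i;
use std::mem::{align_of, transmute};

fn main() {
    // A reference cast from &mut [u8; 16] to &mut __m128i could be misaligned...
    assert_eq!(align_of::<[u8; 16]>(), 1);
    assert_eq!(align_of::<__m128i>(), 16);
    // ...but a by-value transmute just moves 16 bytes and is fine.
    let bytes = [0u8; 16];
    let v: __m128i = unsafe { transmute::<[u8; 16], __m128i>(bytes) };
    let back: [u8; 16] = unsafe { transmute(v) };
    assert_eq!(bytes, back);
}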

However, I unfortunately cannot really help much with anything more SIMD-specific, as I barely know anything about that subject. This is the first time I have heard about signalling-NaN concerns for SIMD specifically. For normal floats Rust has no UB there: f32::from_bits is a safe way to create any possible bit pattern as a float. Maybe SIMD has stricter rules, though? I wouldn't know. But I assume there is a safe way to build a SIMD float vector from individual floats, and thus, with from_bits, a safe way to get signalling NaNs into it -- so they cannot possibly be UB? I don't even know our SIMD APIs, so... yeah, I am probably not of much help here, sorry.
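A sketch of that argument: building a vector that contains a signalling NaN through entirely safe scalar steps (0x7F80_0001 is assumed here as a signalling-NaN bit pattern for f32):

use std::arch::x86_64::{__m128, _mm_set_ps};

fn main() {
    // Safe for scalar floats: any bit pattern is a valid f32.
    let snan = f32::from_bits(0x7F80_0001);
    assert!(snan.is_nan());
    // _mm_set_ps builds the vector from individual lanes (lowest lane last).
    let v: __m128 = unsafe { _mm_set_ps(1.0, 2.0, 3.0, snan) };
    // If this is fine, signalling NaNs in a __m128 cannot themselves be UB.
    let lanes: [f32; 4] = unsafe { std::mem::transmute(v) };
    assert!(lanes[0].is_nan());
}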


Right, thanks for the pointers ‒ I know about the repr(C) and alignment requirements. I wanted to know if there's anything more to be aware of.

My intention is/was to wrap the __m128i or __m128d (or the larger ones) but allow indexing into them and dereffing them to [f64; 2] for a convenient API. But if f64 has more valid values than an element of __m128d, then I can't do DerefMut/IndexMut for them :-(.
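For reference, a sketch of what such a wrapper might look like (a hypothetical F64x2 newtype; whether the DerefMut half is sound is exactly the open question about valid bit patterns):

use std::arch::x86_64::__m128d;
use std::ops::{Deref, DerefMut};

#[repr(transparent)]
struct F64x2(__m128d);

impl Deref for F64x2 {
    type Target = [f64; 2];
    fn deref(&self) -> &[f64; 2] {
        // __m128d is 16 bytes and 16-aligned; [f64; 2] only needs 8-byte
        // alignment, so the cast in this direction cannot be misaligned.
        unsafe { &*(self as *const F64x2 as *const [f64; 2]) }
    }
}

impl DerefMut for F64x2 {
    fn deref_mut(&mut self) -> &mut [f64; 2] {
        unsafe { &mut *(self as *mut F64x2 as *mut [f64; 2]) }
    }
}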

Yeah I think someone more familiar with how LLVM and its vector support works would help here. Maybe @comex? (In particular, are all bit patterns valid for types like __m128d?)

cc @Lokathor

Hello.

There are a few questions that have come up and maybe been answered, so I'll just throw out some points; feel free to follow up with anything I don't answer:

  • I'd suggest using the bytemuck crate for this sort of data-casting situation (see the sketch after this list). Disclosure: I wrote the crate.
  • All the SIMD types are transmutable to/from [foo; bar] arrays of various element types and widths. As long as the bit counts add up it's fine, so you can have [f32; 4] or [u8; 16] or whatever you want that adds up to 128 bits, and you can transmute it to any of the __m128 types (float, double, or integer).
  • If you change your number type the results will be nonsensical, but it's still legal to do. Just as you can transmute [f64; 2] to [f32; 4], you can cast __m128 to __m128d. The cast preserves the bits, not the numbers, but if that's what you want, well, that's what you can do.
  • You may also care to check out the safe_arch crate. It's in active development, so I haven't really given it a big announcement, but it'll add up a bunch of u8s just fine even in its current state. Or you could just have a look for some hints and then write your own thing.
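A minimal sketch of the bytemuck route (this relies on bytemuck's Pod impls for the x86 vector types, which it ships on x86/x86_64):

use std::arch::x86_64::__m128i;

fn main() {
    let lanes: [u16; 8] = [1, 2, 3, 4, 5, 6, 7, 8];
    // By-value casts; bytemuck checks that the sizes match.
    let v: __m128i = bytemuck::cast(lanes);
    let bytes: [u8; 16] = bytemuck::cast(v); // any width adding up to 128 bits
    assert_eq!(bytes.len(), 16);
}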

As to your main question from the start about loading: integer loading can be done with a normal cast from [u8; 16], or with any of the set intrinsics, or you could pass a &[u8; 16] to the unaligned load intrinsic if you're using sub-chunks of a big array, or something like that.
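A sketch of that unaligned-load route (again assuming sse2 has been checked): _mm_loadu_si128 has no alignment requirement, so any 16-byte sub-chunk of a big buffer works:

use std::arch::x86_64::{__m128i, _mm_loadu_si128};

#[target_feature(enable = "sse2")]
unsafe fn load_chunk(bytes: &[u8; 16]) -> __m128i {
    // No alignment requirement, unlike _mm_load_si128.
    _mm_loadu_si128(bytes.as_ptr() as *const __m128i)
}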

As to the best form to keep the data in while in memory, I guess it depends on what you're doing with it. I'd just use a huge pile of bytes and unaligned loads and stores unless it was very critical to get that extra ounce of performance.

