Correct way to use the SIMD intrinsics

Hello

I'm trying to research how to do SIMD correctly with just the bare-bones intrinsics that are already stable in std. I'm a bit unsure how far the equivalence between __m128i and, say, [u16; 8] goes, and how to work with these things.

Let's say I have this:

#[repr(align(16))]
struct Array([u16; 800]);

And I want to sum the array together. I've already checked that sse2 is available, so use of _mm_add_epi16 should be OK. But for that, I need to get my hands on the __m128i values. What is the difference (in what is or isn't UB, how fast they would be, etc.) between:

  • Transmuting aligned chunks of the array into them.
  • Casting pointers to aligned chunks of the array and dereferencing them.
  • Using some other intrinsic to load them (I haven't found one for the u16/128-bit combination, but there seem to be float ones, for example _mm_load_pd).
  • Keeping an array of [__m128i; 100] around instead.

Is any of these methods preferred? Why? If I wanted to have a u16x8 type, is it better to keep an (aligned) [u16; 8] inside, or a __m128i (looking around the libraries out there, I seem to be able to find both ways)? Is it safe to deref the latter to [u16; 8] by simply casting the pointer type? I think the unsafe code guidelines suggest it is safe, but I'm not completely sure. Does that hold for the float vectors and mask vectors as well?
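For concreteness, a minimal sketch of the kind of summing loop in question, assuming wrapping u16 arithmetic is acceptable. It uses _mm_load_si128 (the aligned integer load; it is shared by all integer element widths, which may be why no u16-specific one turned up) and a by-value transmute at the end:

use std::arch::x86_64::{__m128i, _mm_add_epi16, _mm_load_si128, _mm_setzero_si128};

#[repr(align(16))]
struct Array([u16; 800]);

#[target_feature(enable = "sse2")]
unsafe fn sum(arr: &Array) -> u16 {
    let mut acc = _mm_setzero_si128();
    // 800 u16s = 100 chunks of 8 lanes; each chunk stays 16-byte aligned
    // because the array starts aligned and 8 * size_of::<u16>() == 16.
    for chunk in arr.0.chunks_exact(8) {
        let v = _mm_load_si128(chunk.as_ptr() as *const __m128i);
        acc = _mm_add_epi16(acc, v); // lane-wise wrapping add
    }
    // By-value transmute back to lanes for the final horizontal fold.
    let lanes: [u16; 8] = std::mem::transmute(acc);
    lanes.iter().copied().fold(0u16, u16::wrapping_add)
}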


An option I've seen in code working with SIMD vectors is to use a union:

union u16x8 {
    a: [u16; 8],
    b: __m128i
}

That handles the alignment requirement and allows you to easily poke at the individual values or work with the whole vector. It does require unsafe to access, but you'll probably be in unsafe-land anyway working with the intrinsics.
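A minimal sketch of that union in use; repr(C) is added here so the zero-offset layout of both fields is actually guaranteed (more on that below), and the lane values are just for illustration:

use std::arch::x86_64::{__m128i, _mm_add_epi16};

#[repr(C)]
union U16x8 {
    a: [u16; 8],
    b: __m128i,
}

fn main() {
    unsafe {
        let x = U16x8 { a: [1, 2, 3, 4, 5, 6, 7, 8] };
        let y = U16x8 { a: [8, 7, 6, 5, 4, 3, 2, 1] };
        // Work with the whole vector...
        let sum = U16x8 { b: _mm_add_epi16(x.b, y.b) };
        // ...or poke at the individual values.
        assert_eq!(sum.a, [9u16; 8]);
    }
}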


I don't know if there is a rock-solid authoritative source for this, but here's a small dump from my brain:

N.B. I've exclusively used SIMD with its integer APIs for accelerating string-related algorithms. I've never used it for floating point, so that puts me in a bit of a niche and could potentially make my experience narrow. So, I guess, just take what I say with an appropriate amount of salt. 🙂


So, you think that the same approach with union, but for __m128d, could lead to UB by assigning the "wrong" kind of NaN into one of the array elements?

Possibly. Maybe @RalfJung or @gnzlbg could provide more guidance?

Please always make your unions repr(C) if you rely on the offset of all fields being 0! Like for structs, we make no layout guarantees for repr(Rust) unions. (But I see @BurntSushi already said that. Still, this probably cannot be repeated often enough given how frequently this mistake is made.)

I would say it is safe to assume that you can transmute the SIMD types to/from integer/float arrays of the right size -- though please keep alignment in mind! &mut [u8; 16] and &mut __m128i are not mutually transmutable. By-value transmutes are fine, though.
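To make the alignment point concrete, a small demonstration (the 16-byte alignment of __m128i is guaranteed; [u8; 16] is only 1-aligned):

use std::arch::x86_64::__m128i;
use std::mem::{align_of, transmute};

fn main() {
    // A reference cast from &mut [u8; 16] to &mut __m128i could be misaligned...
    assert_eq!(align_of::<[u8; 16]>(), 1);
    assert_eq!(align_of::<__m128i>(), 16);
    // ...but a by-value transmute just moves 16 bytes and is fine.
    let bytes = [0u8; 16];
    let v: __m128i = unsafe { transmute::<[u8; 16], __m128i>(bytes) };
    let back: [u8; 16] = unsafe { transmute(v) };
    assert_eq!(bytes, back);
}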

However, I unfortunately cannot really help much with anything more SIMD-specific, as I barely know anything about that subject. This is the first time I have heard about signalling-NaN concerns for SIMD specifically. For normal floats Rust has no UB there: f32::from_bits is a safe way to create any possible bit pattern as a float. Maybe SIMD has stricter rules, though? I wouldn't know. But I assume there is a safe way to build a SIMD float vector from individual floats, and thus, with from_bits, a safe way to get signalling NaNs into it -- so they cannot possibly be UB? I don't even know our SIMD APIs, so... yeah, I am probably not of much help here, sorry.
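A sketch of that argument: building a vector that contains a signalling NaN through entirely safe scalar steps (0x7F80_0001 is assumed here as a signalling-NaN bit pattern for f32):

use std::arch::x86_64::{__m128, _mm_set_ps};

fn main() {
    // Safe for scalar floats: any bit pattern is a valid f32.
    let snan = f32::from_bits(0x7F80_0001);
    assert!(snan.is_nan());
    // _mm_set_ps builds the vector from individual lanes (lowest lane last).
    let v: __m128 = unsafe { _mm_set_ps(1.0, 2.0, 3.0, snan) };
    // If this is fine, signalling NaNs in a __m128 cannot themselves be UB.
    let lanes: [f32; 4] = unsafe { std::mem::transmute(v) };
    assert!(lanes[0].is_nan());
}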


Right, thanks for the pointers ‒ I know about the repr(C) and alignment requirements. I wanted to know if there's anything more to be aware of.

My intention is/was to wrap the __m128i or __m128d (or the larger ones) but allow indexing into them and dereffing them to [f64; 2] for a convenient API. But if f64 has more valid values than an element of __m128d, then I can't do DerefMut/IndexMut for them :-(.
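For reference, a sketch of what such a wrapper might look like (a hypothetical F64x2 newtype; whether the DerefMut half is sound is exactly the open question about valid bit patterns):

use std::arch::x86_64::__m128d;
use std::ops::{Deref, DerefMut};

#[repr(transparent)]
struct F64x2(__m128d);

impl Deref for F64x2 {
    type Target = [f64; 2];
    fn deref(&self) -> &[f64; 2] {
        // __m128d is 16 bytes and 16-aligned; [f64; 2] only needs 8-byte
        // alignment, so the cast in this direction cannot be misaligned.
        unsafe { &*(self as *const F64x2 as *const [f64; 2]) }
    }
}

impl DerefMut for F64x2 {
    fn deref_mut(&mut self) -> &mut [f64; 2] {
        unsafe { &mut *(self as *mut F64x2 as *mut [f64; 2]) }
    }
}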

Yeah I think someone more familiar with how LLVM and its vector support works would help here. Maybe @comex? (In particular, are all bit patterns valid for types like __m128d?)

cc @Lokathor

Hello.

There are a few questions that have come up and maybe been answered, so I'll just throw out some points; feel free to follow up with anything I don't answer:

  • I'd suggest using the bytemuck crate for this sort of data-casting situation (see the sketch after this list). Disclosure: I wrote the crate.
  • All the SIMD types are transmutable to/from [foo; bar] arrays of various element types and widths. As long as the bit counts add up it's fine, so you can have [f32; 4] or [u8; 16] or whatever you want that adds up to 128 bits, and you can transmute it to any of the __m128 types (float, double, or integer).
  • If you change your number type the results will be nonsensical, but it's still legal to do. Just as you can transmute [f64; 2] to [f32; 4], you can cast __m128 to __m128d. The cast preserves the bits, not the numbers, but if that's what you want, well, that's what you can do.
  • You may also care to check out the safe_arch crate. It's in active development, so I haven't really given it a big announcement, but it'll add up a bunch of u8s just fine even in its current state. Or you could just have a look for some hints and then write your own thing.
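A minimal sketch of the bytemuck route (this relies on bytemuck's Pod impls for the x86 vector types, which it ships on x86/x86_64):

use std::arch::x86_64::__m128i;

fn main() {
    let lanes: [u16; 8] = [1, 2, 3, 4, 5, 6, 7, 8];
    // By-value casts; bytemuck checks that the sizes match.
    let v: __m128i = bytemuck::cast(lanes);
    let bytes: [u8; 16] = bytemuck::cast(v); // any width adding up to 128 bits
    assert_eq!(bytes.len(), 16);
}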

As to your main question from the start about loading: integer loading can be done with a normal cast from [u8; 16], or with any of the set intrinsics, or you could pass a &[u8; 16] to the unaligned load intrinsic if you're using sub-chunks of a big array, or something like that.
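A sketch of that unaligned-load route (again assuming sse2 has been checked): _mm_loadu_si128 has no alignment requirement, so any 16-byte sub-chunk of a big buffer works:

use std::arch::x86_64::{__m128i, _mm_loadu_si128};

#[target_feature(enable = "sse2")]
unsafe fn load_chunk(bytes: &[u8; 16]) -> __m128i {
    // No alignment requirement, unlike _mm_load_si128.
    _mm_loadu_si128(bytes.as_ptr() as *const __m128i)
}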

As to the best form to keep the data in while in memory, I guess it depends on what you're doing with it. I'd just use a huge pile of bytes and unaligned loads and stores unless it was very critical to get that extra ounce of performance.

