So, Rust's vmull_high_p64 is supposed to take two 128-bit registers in ARMv8 Neon and calculate the carry-less (polynomial) multiplication of their upper 64 bits. The intrinsic in turn is meant to use the PMULL2 instruction (Documentation – Arm Developer).
I have a couple of questions about this:
Aside from the name, neither the Rust documentation nor Arm's intrinsics documentation explicitly specifies that vmull_high_p64 uses the upper halves; you really have to look at the architecture documentation for PMULL/PMULL2. Should this information be part of the Rust documentation? Or is there a reason to keep the description minimal and rely on the Arm documentation?
I decided to check the code for vmull_high_p64, and it seems that the upper 64 bits are first extracted and placed in the lower 64 bits of other registers, and then vmull_p64 is used to perform the multiplication on those lower 64 bits. I know that intrinsic types like poly128_t and __m128i are not the real CPU types, but rather map to LLVM 64- and 128-bit types, and LLVM then takes care of translating them. Does LLVM take care of compiling this back to the correct vmull_high_p64 internally?
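To make the "extract the upper halves, then multiply the lower halves" shape concrete, here is a minimal portable sketch in plain Rust. It is only a model of the arithmetic, not the stdarch source and not what the hardware executes; `clmul64` and `clmul_high` are hypothetical names chosen for illustration.

```rust
/// Carry-less (polynomial over GF(2)) multiply of two 64-bit values.
/// Portable model of the 64x64 -> 128 polynomial multiply; hypothetical helper.
fn clmul64(a: u64, b: u64) -> u128 {
    let mut acc = 0u128;
    for i in 0..64 {
        if (b >> i) & 1 == 1 {
            // XOR instead of add: nothing carries between bit positions.
            acc ^= (a as u128) << i;
        }
    }
    acc
}

/// Model of the "high" variant: move the upper 64 bits of each 128-bit
/// input down, then reuse the ordinary 64-bit multiply.
fn clmul_high(a: u128, b: u128) -> u128 {
    let a_hi = (a >> 64) as u64;
    let b_hi = (b >> 64) as u64;
    clmul64(a_hi, b_hi)
}
```

In this model the lower halves of the inputs never influence the result, which is the behaviour the "high" name implies.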
I cannot find the LLVM documentation for vmull_high_p64, or even for Intel's 'pclmulqdq'. Where can I find it? Judging by the speed of my code, the intrinsics seem to be compiled correctly for Intel, so are they just undocumented?
Finally, if Rust doesn't really expose intrinsics because it relies on LLVM, why redefine all the intrinsics again instead of defining types and methods that more closely resemble the LLVM code that these "fake" intrinsics are translated to? Right now I find myself defining the general behaviour with traits and writing different implementations for each architecture. Why is it necessary to do this abstraction twice when LLVM has already done it?
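For readers wondering what that trait-per-architecture layer looks like, here is a rough sketch under some assumptions. The trait `Clmul`, the types `Portable`, `Neon` and `Pclmul`, and the alias `Native` are all hypothetical names; only `vmull_p64`, `_mm_set_epi64x` and `_mm_clmulepi64_si128` are existing `core::arch` intrinsics. It assumes a little-endian target and falls back to the portable loop when the CPU feature is not detected at run time.

```rust
/// Hypothetical abstraction: one carry-less 64x64 -> 128 multiply per backend.
trait Clmul {
    fn clmul(a: u64, b: u64) -> u128;
}

/// Bit-by-bit fallback that works on any target.
struct Portable;
impl Clmul for Portable {
    fn clmul(a: u64, b: u64) -> u128 {
        let mut acc = 0u128;
        for i in 0..64 {
            if (b >> i) & 1 == 1 {
                acc ^= (a as u128) << i;
            }
        }
        acc
    }
}

#[cfg(target_arch = "aarch64")]
struct Neon;
#[cfg(target_arch = "aarch64")]
impl Clmul for Neon {
    fn clmul(a: u64, b: u64) -> u128 {
        // vmull_p64 needs the "aes" feature (which provides PMULL).
        #[target_feature(enable = "neon,aes")]
        unsafe fn inner(a: u64, b: u64) -> u128 {
            core::arch::aarch64::vmull_p64(a, b)
        }
        if std::arch::is_aarch64_feature_detected!("aes") {
            unsafe { inner(a, b) }
        } else {
            Portable::clmul(a, b)
        }
    }
}

#[cfg(target_arch = "x86_64")]
struct Pclmul;
#[cfg(target_arch = "x86_64")]
impl Clmul for Pclmul {
    fn clmul(a: u64, b: u64) -> u128 {
        #[target_feature(enable = "sse2,pclmulqdq")]
        unsafe fn inner(a: u64, b: u64) -> u128 {
            use core::arch::x86_64::*;
            // Put each operand in the low 64-bit lane; the upper lane is zero.
            let x = _mm_set_epi64x(0, a as i64);
            let y = _mm_set_epi64x(0, b as i64);
            // Selector 0x00 multiplies the low lane of x by the low lane of y.
            let r = _mm_clmulepi64_si128::<0x00>(x, y);
            unsafe { core::mem::transmute::<__m128i, u128>(r) }
        }
        if std::is_x86_feature_detected!("pclmulqdq") {
            unsafe { inner(a, b) }
        } else {
            Portable::clmul(a, b)
        }
    }
}

// Pick a backend at compile time; other targets use the portable loop.
#[cfg(target_arch = "aarch64")]
type Native = Neon;
#[cfg(target_arch = "x86_64")]
type Native = Pclmul;
#[cfg(not(any(target_arch = "aarch64", target_arch = "x86_64")))]
type Native = Portable;
```

Calling code would then use `Native::clmul(a, b)` without mentioning the architecture. Depending on the compiler version and edition, the `unsafe` blocks may produce lint warnings, since newer toolchains treat more of these intrinsics as safe inside a matching `#[target_feature]` function.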
For all the arch stuff, you're expected to look at the vendor docs. There are way too many intrinsics for us to bother writing them up, especially since they're not how we would have arranged/named/structured them in Rust.
Ok, so essentially the "natural" way is still nightly.
Referring to the Arm documentation would be OK if the intrinsic translated to the actual instruction, but it doesn't. That is why the main question is still: what does vmull_high_p64 ultimately map to?
If vmull_high_p64 maps to vmull_p64, then vmull_high_p64 should not exist in the aarch64 crate, because it would be misleading: it would not compile to what it claims to be. If I understand correctly, in C++ intrinsics are translated directly by the compiler, but that's impossible in Rust because it always compiles to LLVM first, correct?
So it feels like it does not even make sense to have intrinsics in Rust unless they are provided directly by LLVM. Intrinsics are supposed to translate directly to CPU instructions, so if you have a chain of compilers, as soon as one of them does not expose an intrinsic, how does it make sense for any compiler after that to "reconstruct/simulate" it? Are these intrinsics in Rust meant to be a temporary solution until the simd crate is stable?
Clang also compiles C++ to LLVM IR first, same as Rust; in general, anything that is true of Rust's compilation backends is true of C++ as compiled by Clang. In both Rust and C++ as compiled by Clang, intrinsics are translated into LLVM intrinsics for the relevant CPU instructions, and LLVM guarantees to output certain machine instructions for certain intrinsics.
For example, for both C++ and Rust, an intrinsic in the vmull_p64 family causes the compiler to emit the llvm.aarch64.neon.pmull64 LLVM intrinsic. LLVM in turn defines this as producing the pmull AArch64 instruction.
Well, it could be LLVM, or it could be Rust. For intrinsics, you need to compare what happens with the intrinsic as used in Clang and as used in Rust to determine where the issue lies.
It's a little bit above my level at the moment, but if I interpret the hint correctly, it seems that LLVM will pattern-match the usual combination of extraction plus vmull_p64 back into what vmull_high_p64 is supposed to be, effectively undoing the split.
The bug itself seems to be about this pattern match not working in all cases at that time.
Is this correct? I still find it weird that they decided to do it this way (is there a general design principle trying to keep the number of LLVM function names to a minimum?).