zero_me function given in the first post, it just jumps to memset. I'm not sure what causes the difference, but it might just be a consequence of my artificial test code. Either way, they're not so different that it's worth me trying to pessimistically optimise.
lea rdx, [4*rsi]
xor esi, esi
jmp qword ptr [rip + memset@GOTPCREL]
EDIT: More refined testing shows that, in this case, it all gets shrunk down to the following, no matter which method you use:
mov dword ptr [rsp + 24], 0
mov qword ptr [rsp + 16], 0
In the case of zeroing larger slices the results are similar, and using unsafe code doesn't change anything:
xorps xmm0, xmm0
movups xmmword ptr [rsp + 96], xmm0
movups xmmword ptr [rsp + 84], xmm0
movups xmmword ptr [rsp + 68], xmm0
movups xmmword ptr [rsp + 52], xmm0
movups xmmword ptr [rsp + 36], xmm0
So there really is no difference at all.
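For anyone wanting to reproduce this, here is a rough sketch of the kind of variants being compared. These are not the exact functions from the thread: the names `zero_loop`, `zero_fill` and `zero_unsafe` are made up for illustration, and the `u32` element type is only a guess based on the `4*rsi` scaling in the memset call above.

```rust
use std::ptr;

/// Hypothetical safe variant: assign zero to each element in a loop.
pub fn zero_loop(buf: &mut [u32]) {
    for x in buf.iter_mut() {
        *x = 0;
    }
}

/// Hypothetical safe variant using the standard library's `slice::fill`.
pub fn zero_fill(buf: &mut [u32]) {
    buf.fill(0);
}

/// Hypothetical unsafe variant: overwrite the whole slice with zero bytes.
pub fn zero_unsafe(buf: &mut [u32]) {
    // SAFETY: the pointer and length come from a valid `&mut [u32]`,
    // and an all-zero bit pattern is a valid `u32`.
    // Note that `write_bytes` counts in elements, not bytes.
    unsafe { ptr::write_bytes(buf.as_mut_ptr(), 0, buf.len()) };
}
```

Building with optimisations and dumping the assembly (for example `rustc -O --crate-type lib --emit asm`) lets you compare the three side by side. The exact instructions you get will depend on the compiler version, the target, and whether the slice length is known at compile time.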