Equivalent of `shld`?

Does Rust have an equivalent to the shld assembly instruction?

I may be wrong, but from the Shl trait implementors doc it looks like neither f32 nor f64 implements Shl.

Missing feature? To be honest, I have never needed this asm feature. In which field is it useful? Vectorization? Crypto? Just out of curiosity.

Ah sorry, shld is weirdly named and has nothing to do with floats. It is a shift left that, instead of filling the vacated low bits with zeros, fills them with the high bits of a second source operand.

The equivalent for u32 would be something like:

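const BITS: u32 = 32;
// Only valid for 0 < cnt < 32: with cnt == 0 the right shift would be by the
// full width, which panics in debug builds.
fn shld(dest: u32, src: u32, cnt: u32) -> u32 {
    (dest << cnt) | (src >> (BITS - cnt))
}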

It can be useful for shifting the bits of multiple integers as though they were one.
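To make the "multiple integers as though they were one" use concrete, here is a minimal sketch. The name `shl_words` and the little-endian word order (least significant word first) are assumptions made for illustration, not something from the instruction itself.

```rust
/// Shift a multi-word integer left by `cnt` bits, in place.
/// `words[0]` is the least significant word (an assumption of this sketch).
/// `cnt` must be in 1..=31 so that both shifts stay in range.
pub fn shl_words(words: &mut [u32], cnt: u32) {
    assert!((1..32).contains(&cnt), "cnt must be in 1..=31");
    // Go from the most significant word downwards, so each step still
    // sees the unshifted lower neighbour it borrows bits from.
    for i in (1..words.len()).rev() {
        // Same pattern as shld: the top bits of the lower word fill the
        // low bits vacated in the higher word.
        words[i] = (words[i] << cnt) | (words[i - 1] >> (32 - cnt));
    }
    // The lowest word has no neighbour and is filled with zeros.
    if let Some(lowest) = words.first_mut() {
        *lowest <<= cnt;
    }
}
```

Each loop iteration is exactly one shld-style step, which is why the instruction is handy for big-integer code.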

You could try https://rust.godbolt.org/ and see if there's a pattern of shifts, masks and casts that lets LLVM reduce it to the shld instruction.

Ah it seems it's very picky. If I make cnt a const it works. So:

const BITS: usize = 32;
const CNT: usize = 3;
pub fn shld(dest: u32, src: u32) -> u32 {
    (dest << CNT) | (src >> (BITS - CNT))
}

Gives

example::shld:
  shld edi, esi, 3
  mov eax, edi
  ret

But if I use a variable then it doesn't work.
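For a variable count, one alternative worth trying is to widen to u64, which also avoids the out-of-range shift when the count is 0. This is only a sketch: the name `shld_wide` is made up, and I have not checked whether LLVM turns it into a shld instruction.

```rust
/// Variable-count shld for u32, computed through a u64 intermediate.
/// The count is masked to 5 bits, mirroring what the x86 instruction
/// does for 32-bit operands.
pub fn shld_wide(dest: u32, src: u32, cnt: u32) -> u32 {
    let cnt = cnt & 31;
    // Concatenate dest:src into one 64-bit value, dest in the high half.
    let wide = (u64::from(dest) << 32) | u64::from(src);
    // Shift the pair left and keep the high half: the top bits of `src`
    // have moved into the low bits of `dest`.
    ((wide << cnt) >> 32) as u32
}
```

With a count of 0 this returns `dest` unchanged, which matches the instruction's behaviour.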

Ok, I completely misunderstood what I read -- "double precision left shift" made me think of floats.

That's interesting: I tried the following in C++ to see what we get:

#include <cstdint>

std::uint32_t
shld(std::uint32_t dest, std::uint32_t src, std::uint8_t cnt) {
    return (dest << cnt) | (src >> (32 - cnt));
}

Clang is able to optimize the code, resulting in exactly what you are expecting. GCC cannot. Rust performs better than GCC, but worse than Clang. Considering that both Clang and rustc are based on LLVM, we are probably missing some optimization opportunities here.

If someone is interested and thinks that something can (and should) be done, here is the LLVM IR from both clang and rustc:

clang

define i32 @shld(i32, i32, i8 zeroext) local_unnamed_addr #0 !dbg !7 {
  call void @llvm.dbg.value(metadata i32 %0, metadata !19, metadata !DIExpression()), !dbg !22
  call void @llvm.dbg.value(metadata i32 %1, metadata !20, metadata !DIExpression()), !dbg !23
  call void @llvm.dbg.value(metadata i8 %2, metadata !21, metadata !DIExpression()), !dbg !24
  %4 = zext i8 %2 to i32, !dbg !25
  %5 = shl i32 %0, %4, !dbg !26
  %6 = sub nsw i32 32, %4, !dbg !27
  %7 = lshr i32 %1, %6, !dbg !28
  %8 = or i32 %7, %5, !dbg !29
  ret i32 %8, !dbg !30
}

rustc

define i32 @_ZN7example4shld17hc11e3013da8f49d3E(i32, i32, i8) unnamed_addr #0 !dbg !4 {
  call void @llvm.dbg.value(metadata i32 %0, metadata !11, metadata !DIExpression()), !dbg !14
  call void @llvm.dbg.value(metadata i32 %1, metadata !12, metadata !DIExpression()), !dbg !14
  call void @llvm.dbg.value(metadata i8 %2, metadata !13, metadata !DIExpression()), !dbg !14
  %3 = and i8 %2, 31, !dbg !15
  %4 = zext i8 %3 to i32, !dbg !15
  %5 = shl i32 %0, %4, !dbg !15
  %6 = sub i8 0, %2, !dbg !16
  %7 = and i8 %6, 31, !dbg !17
  %8 = zext i8 %7 to i32, !dbg !17
  %9 = lshr i32 %1, %8, !dbg !17
  %10 = or i32 %5, %9, !dbg !15
  ret i32 %10, !dbg !18
}

At first glance, there are these "and"s that (maybe?) could be omitted.
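One plausible reading of those "and"s: with overflow checks off, Rust defines an oversized shift amount to wrap to the bit width, whereas in C++ shifting by 32 or more is undefined behaviour, so clang has nothing to mask. The sketch below spells out in Rust what the rustc IR appears to compute. The name `shld_masked` and the use of the `wrapping_*` methods are my own way of writing it, not the code that produced the IR.

```rust
/// What the rustc IR above seems to compute, written with explicit
/// wrapping operations. Both shift amounts are reduced modulo 32.
pub fn shld_masked(dest: u32, src: u32, cnt: u8) -> u32 {
    // `and i8 %2, 31` followed by `shl`
    let left = dest.wrapping_shl(u32::from(cnt));
    // `sub i8 0, %2`, `and i8 %6, 31`, then `lshr`:
    // (0 - cnt) & 31 equals (32 - cnt) & 31.
    let right = src.wrapping_shr(u32::from(cnt.wrapping_neg()));
    left | right
}
```

Note that for a count of 0 this yields `dest | src` rather than `dest`, so the masked form is not a drop-in replacement for the instruction in that case.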

https://stackoverflow.com/questions/39281925/making-g-use-shld-shrd-instructions

Good point @vitalyd!
Indeed, using -C target-cpu=haswell, rustc produces the following asm:

shlx    ecx, edi, edx
neg     dl
shrx    eax, esi, edx
or      eax, ecx
ret

which looks a bit better than the output from GCC, which produces:

movzx   edx, dl
mov     eax, 32
sub     eax, edx
shrx    eax, esi, eax
shlx    esi, edi, edx
or      eax, esi
ret

But you already made the point: benchmark! :wink:

However, I am still curious about the difference in the IR between clang and rustc. Do you have any idea?

I'm not sure why rustc emits the "and"s in the LLVM IR. If I'm looking at this level of detail, I typically don't care about the IR that's emitted, only the resulting asm. My understanding is that rustc doesn't do much IR-level optimization, and mostly leans on the LLVM backend to optimize unneeded things away.

And yeah, as the SO discussion mentions, the better (=faster) instruction sequence appears to be arch-dependent.
