Equivalent of `shld`?

chrisd · September 18, 2018, 10:09am

Does Rust have an equivalent to the shld assembly instruction?

dodomorandi · September 18, 2018, 10:27am

I may be wrong, but from the Shl trait implementors doc it looks like neither f32 nor f64 implements Shl.

Missing feature? To be honest, I have never needed this asm feature. In which field is it useful? Vectorization? Crypto? Just out of curiosity.

chrisd · September 18, 2018, 10:59am

Ah sorry, shld is weirdly named and isn't to do with floats. It means the same as shift left but instead of filling with zeros it uses the bits from another source.

The equivalent for u32 would be something like:

const BITS: u32 = 32;
fn shld(dest: u32, src: u32, cnt: u32) -> u32 {
    (dest << cnt) | (src >> (BITS-cnt))
}

It can be useful for shifting the bits of multiple integers as though they were one.
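A minimal sketch of that multi-word use, assuming `cnt` is always below 32. The helper names `shld32` and `shl_pair` are hypothetical and not from the thread. Note that the expression above shifts `src` right by 32 when `cnt` is 0, which panics in debug builds, so the sketch handles that case separately.

```rust
/// Shift `dest` left by `cnt`, filling the vacated low bits with the
/// top bits of `src`. Assumes `cnt < 32`; `cnt == 0` returns `dest`.
pub fn shld32(dest: u32, src: u32, cnt: u32) -> u32 {
    debug_assert!(cnt < 32);
    if cnt == 0 {
        // Avoid `src >> 32`, which is a shift overflow for u32.
        dest
    } else {
        (dest << cnt) | (src >> (32 - cnt))
    }
}

/// Shift a 64-bit value stored as two u32 halves (hi, lo) left by
/// `cnt` bits, as though the halves were one integer.
pub fn shl_pair(hi: u32, lo: u32, cnt: u32) -> (u32, u32) {
    (shld32(hi, lo, cnt), lo << cnt)
}
```

For example, `shl_pair(0x0000_0001, 0x8000_0000, 4)` yields `(0x18, 0)`, the same as shifting the 64-bit value `0x1_8000_0000` left by 4 and splitting the result.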

kornel · September 18, 2018, 11:09am

You could try https://rust.godbolt.org/ and see if there's a pattern of shifts, masks and casts that lets LLVM reduce it to the shld instruction.
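One candidate pattern to try there, as a sketch only: widen both operands into a single u64, shift, and keep the high half. The function name `shld_wide` is hypothetical, and whether LLVM turns this into shld on a given target is something to check in the compiler output rather than assume. The count is masked to 5 bits here, which matches how the x86 instruction treats the count for 32-bit operands.

```rust
/// Double-precision left shift via a 64-bit intermediate.
/// `dest` forms the high half, `src` the low half; the count is
/// reduced modulo 32, so no shift can overflow.
pub fn shld_wide(dest: u32, src: u32, cnt: u32) -> u32 {
    let wide = ((dest as u64) << 32) | (src as u64);
    ((wide << (cnt & 31)) >> 32) as u32
}
```

This form has no special case for a count of 0: `shld_wide(d, s, 0)` is simply `d`.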

chrisd · September 18, 2018, 11:29am

Ah it seems it's very picky. If I make cnt a const it works. So:

const BITS: usize = 32;
const CNT: usize = 3;
pub fn shld(dest: u32, src: u32) -> u32 {
    (dest << CNT) | (src >> (BITS - CNT))
}

Gives

example::shld:
  shld edi, esi, 3
  mov eax, edi
  ret

But if I use a variable for the count, the compiler no longer emits shld.

dodomorandi · September 18, 2018, 11:55am

Ok, I completely misunderstood what I read -- "double precision left shift" made me think of floats.

That's interesting: I tried the following in C++ to see what we get:

#include <cstdint>

std::uint32_t
shld(std::uint32_t dest, std::uint32_t src, std::uint8_t cnt) {
    return (dest << cnt) | (src >> (32 - cnt));
}

Clang is able to optimize the code, resulting in exactly what you are expecting. GCC cannot. Rust performs better than GCC, but worse than Clang. Considering that both Clang and rustc are based on LLVM, we are probably missing some optimization opportunities here.

If someone is interested and thinks that something can (and should) be done, here is the LLVM IR from both clang and rustc:

clang

define i32 @shld(i32, i32, i8 zeroext) local_unnamed_addr #0 !dbg !7 {
  call void @llvm.dbg.value(metadata i32 %0, metadata !19, metadata !DIExpression()), !dbg !22
  call void @llvm.dbg.value(metadata i32 %1, metadata !20, metadata !DIExpression()), !dbg !23
  call void @llvm.dbg.value(metadata i8 %2, metadata !21, metadata !DIExpression()), !dbg !24
  %4 = zext i8 %2 to i32, !dbg !25
  %5 = shl i32 %0, %4, !dbg !26
  %6 = sub nsw i32 32, %4, !dbg !27
  %7 = lshr i32 %1, %6, !dbg !28
  %8 = or i32 %7, %5, !dbg !29
  ret i32 %8, !dbg !30
}

rustc

define i32 @_ZN7example4shld17hc11e3013da8f49d3E(i32, i32, i8) unnamed_addr #0 !dbg !4 {
  call void @llvm.dbg.value(metadata i32 %0, metadata !11, metadata !DIExpression()), !dbg !14
  call void @llvm.dbg.value(metadata i32 %1, metadata !12, metadata !DIExpression()), !dbg !14
  call void @llvm.dbg.value(metadata i8 %2, metadata !13, metadata !DIExpression()), !dbg !14
  %3 = and i8 %2, 31, !dbg !15
  %4 = zext i8 %3 to i32, !dbg !15
  %5 = shl i32 %0, %4, !dbg !15
  %6 = sub i8 0, %2, !dbg !16
  %7 = and i8 %6, 31, !dbg !17
  %8 = zext i8 %7 to i32, !dbg !17
  %9 = lshr i32 %1, %8, !dbg !17
  %10 = or i32 %5, %9, !dbg !15
  ret i32 %10, !dbg !18
}

At first glance, there are these "and"s that (maybe?) could be omitted.

vitalyd · September 18, 2018, 12:08pm

https://stackoverflow.com/questions/39281925/making-g-use-shld-shrd-instructions

dodomorandi · September 18, 2018, 12:17pm

Good point @vitalyd!
Indeed, using -C target-cpu=haswell, rustc produces the following asm

shlx    ecx, edi, edx
neg     dl
shrx    eax, esi, edx
or      eax, ecx
ret

which looks a bit better than GCC, which produces

movzx   edx, dl
mov     eax, 32
sub     eax, edx
shrx    eax, esi, eax
shlx    esi, edi, edx
or      eax, esi
ret

But you already made the point: benchmark!

However, I am still curious about the difference in the IR between clang and rustc. Do you have any idea?

vitalyd · September 18, 2018, 12:33pm

I'm not sure why rustc emits the and in the LLVM IR. If I'm looking at this level of detail, I typically don't care about the IR that's emitted - only the resulting asm. My understanding is rustc doesn't do much IR level optimization, and mostly leans on the LLVM backend to optimize unneeded things away.
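One possible explanation, offered as an assumption rather than something confirmed in this thread: Rust gives a shift by an over-wide count a defined result, whereas in C++ a shift by 32 or more on a 32-bit value is undefined behavior, so clang is free to leave the count unmasked. With `wrapping_shl`, the count is reduced modulo the bit width, which corresponds to the `and ..., 31` in the IR. A small sketch (the function names are hypothetical):

```rust
/// Left shift with the count reduced modulo the bit width of u32.
pub fn shl_masked(x: u32, cnt: u32) -> u32 {
    x.wrapping_shl(cnt)
}

/// The same operation with the mask written out by hand.
pub fn shl_masked_explicit(x: u32, cnt: u32) -> u32 {
    x << (cnt & 31)
}
```

So `shl_masked(1, 32)` is `1` and `shl_masked(1, 33)` is `2`, while `1u32.checked_shl(32)` returns `None`.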

And yeah, as the SO discussion mentions, the better (=faster) instruction sequence appears to be arch-dependent.
