Why do the non-panicking variants of left bit-shift work the way they do?

So I’ve just been directed to the non-panicking versions of << (shl): overflowing_shl and wrapping_shl. When reading their descriptions I was quite confused; this was not what I expected them to do. In fact, the descriptions were so confusing to me that I thought they were a documentation error.

Now it seems to me that there are two sensible ways you could possibly want left bit-shift to work.

  1. Shift the bits, dropping any bits that overflow beyond the capacity of the relevant integer type. A return field or panic that indicates if this happened would be useful. This seems equivalent to doing the mathematical shift (·2ⁿ), and casting the result to the relevant integer type, at least for unsigned integers.
  2. Rotate the bits. This can never fail.

Now, what overflowing_shl and wrapping_shl do, and, in fact, what << does, fits neither of these! And I don’t understand why. Is this aspect of Rust designed to conform to what hardware currently does, or is there a different reason for the current behaviours?
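To make the confusion concrete, here is what the variants actually do for a `u32` (a small demo, not from the original post):

```rust
fn main() {
    // wrapping_shl wraps the *shift amount* modulo 32, not the value:
    assert_eq!(1u32.wrapping_shl(32), 1); // 32 % 32 == 0, so no shift happens at all
    // overflowing_shl performs the same wrapped shift and flags rhs >= 32:
    assert_eq!(1u32.overflowing_shl(32), (1, true));
    // checked_shl at least refuses an overlong shift outright:
    assert_eq!(1u32.checked_shl(32), None);
    // Plain `<<` with rhs >= 32 panics in debug builds and wraps the
    // shift amount in release builds.
    println!("ok");
}
```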


Why would you expect a shift left to rotate the bits? That is a rotate left.


As far as I'm aware, that should be the reason, yes.

I agree all three functions are confusing and a pain to use.

Yes, that is the historical reason.

But actually I think it's a design mistake, for the following reasons:

What the hardware does actually depends on which hardware instructions you use. On x86, the scalar shl instruction indeed has this weird behavior (it masks the shift amount). But the SSE2 shift instruction (pslld) has the sane behavior of returning 0 if you shift all bits out. So this design choice ends up fighting against the hardware when those vector instructions could otherwise be used.

The weird hardware instructions are only really directly useful if you know the shift size is smaller than the number of bits (because you never really want the crazy behavior for larger amounts). But in this case, if you know, then most of the time the compiler also knows. So it would be able to use the crazy hardware instruction anyway.

When you don't know whether the shift size is small enough, this only forces you to deal with the special case separately. So again, this design doesn't help your performance and only makes code harder to write and more error prone.

This only helps performance in the rare case when you have a variable shift amount that you know is smaller than 32 or 64, but the compiler doesn't know that. It allows the compiler to use the weird machine instruction without checking. But I think that's a pretty small gain. The extra check would only be one or two extra instructions and can be done branchlessly.
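That extra check can be written in today's Rust without any explicit branching in the source; a minimal sketch (the helper name is mine):

```rust
// Left shift that flushes to zero for overlong shift amounts.
// checked_shl returns None when n >= 32, so unwrap_or(0) supplies the
// "all bits shifted out" result; LLVM typically lowers this to a shift
// plus a compare-and-cmov rather than a branch.
fn shl_or_zero(x: u32, n: u32) -> u32 {
    x.checked_shl(n).unwrap_or(0)
}

fn main() {
    assert_eq!(shl_or_zero(1, 5), 32);
    assert_eq!(shl_or_zero(u32::MAX, 31), 1 << 31);
    assert_eq!(shl_or_zero(1, 32), 0); // overlong shift flushes to zero
}
```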


The odd part about wrapping_shl is that the number being left shifted isn't the one wrapping (that's a rotate left); it's the shift amount which is wrapped into the domain of 0..N (for the shifted uN).
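A short demonstration of that distinction (my example): the shift amount wraps mod 32, while the shifted value does not rotate.

```rust
fn main() {
    assert_eq!(1u32.wrapping_shl(33), 2);          // amount 33 wraps to 1
    assert_eq!(0x8000_0000u32.wrapping_shl(1), 0); // the top bit is dropped...
    assert_eq!(0x8000_0000u32.rotate_left(1), 1);  // ...whereas rotate keeps it
}
```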

Given a time machine, I agree with @tczajka; shifts ideally should've been defined to shift all bits out of the value, leaving zero behind for large shift values, essentially truncating the abstract bitstring result back to the representable number of bits.

(Someone should submit an API Change Proposal for .truncating_shl(_) and friends, I think. I have no good names to offer for a truncating version with an overflow flag.)

In theory, we could add the truncating behavior as a valid behavior for the shift operator over an edition, since this is a language choice rather than a library choice. In practice, though, it's probably rather difficult to do so; you'd need to somehow tell the optimizing backend that either behavior (but not UB) is acceptable, and expressions in the standard library need to be valid on all editions. (Theoretically the standard library shouldn't have any operator shifts which overflow, but are we willing to bet on that more so than we already do?)

truncating_shl isn't that great of a name because x << 1 also already truncates the top bit. any_shl? unbounded_shl?
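For reference, a sketch of the truncating/unbounded semantics under discussion (free-function form and name are mine):

```rust
// Shift left as if the value were an infinite bitstring, then truncate
// to 32 bits: any shift amount >= 32 yields zero instead of wrapping
// the amount or panicking.
fn unbounded_shl(x: u32, n: u32) -> u32 {
    if n < u32::BITS { x << n } else { 0 }
}

fn main() {
    assert_eq!(unbounded_shl(1, 4), 16);
    assert_eq!(unbounded_shl(u32::MAX, 32), 0);
    assert_eq!(unbounded_shl(u32::MAX, 1000), 0);
}
```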

I don't understand this. Presumably the operator would still be deterministic with this change, it would just be different from what it is now so it would compile to different LLVM IR.

This would cause a panic in debug mode, so it would be a bug, wouldn't it?

I suppose that is correct; I was more focused on the out-of-range behavior. I suppose you could make a weak argument for saturating_shl, but that name's certainly not much better than wrapping_shl. In fact, it's relying on the same sleight of hand, in that the shift amount is "saturating" to a full shift out, rather than talking about a quality of the shift itself.

Perhaps exhausting_shl, since the bits are "exhausted" when you shift by N or more? Or flushing_shl. I don't dislike unbounded_shl either. any_shl feels specifically like what the operators do, where you get a compile-time switch between behaviors (-Coverflow-checks).

But on the other hand, truncating_shl isn't wrong, either. It is consistently truncating the resultant shifted bitstring, no matter how much it's been shifted by. On the other other hand, that wrapping_shl and overflowing_shl are talking about the wrapping/overflowing of the shift amount makes a _ing_shl talking about the operation rather than the amount questionable, even though that makes it match the other ops more closely.

Tongue-in-cheek suggestion: euclid_shl, for no reason other than in analogy to div_euclid being the "mathematically pure" choice of integer division semantics. I don't think Euclid actually defined anything related to shifting fixed-width binary numbers. (Though with how much Euclid did, it's at least possible.)

The counter reason I presumed it might not be is that release mode semantics for operators was IMHO intended to be a "what the hardware makes fast" choice, with the overflow condition occurring always being a bug. (If it was "reasonable default", overflowing shifts would flush to zero.) The problem is that there isn't just one answer for the "fast" shift, as you've noted, since the normal shift instruction truncates/wraps the shift amount first, but the SSE shift flushes to zero.


To that point, I suspect that if Rust ever gets a target for a machine which saturates by default instead of wrapping, while it probably won't become the default on that target, there's a somewhat low but possible chance that saturating will become a second option for the overflow-checks=off behavior of operators, alongside today's choice of wrapping. Same goes for if we get a target where overflow traps, giving us an abort option for overflow-checks. The likelihood of such a target becoming popular enough to motivate extensions to core semantics is on its own rather low, thanks to the mathematical conveniences of wrapping arithmetic on two's complement integers, but the growing support of hardware-level sanitizing architectures like CHERI makes it a decidedly nonzero chance.

The choice does seem to matter for nonvectorized shifts, but essentially not matter for vectorized shifts in current LLVM [godbolt].

Worked example

Currently, for

pub fn wrapping_shl(lhs: u32, rhs: u32) -> u32 {
    lhs.checked_shl(rhs % 32).unwrap()
}

pub fn saturating_shl(lhs: u32, rhs: u32) -> u32 {
    lhs.checked_shl(rhs).unwrap_or(0)
}
LLVM uses shl for both when targeting x86-64-v2, shlx for x86-64-v3. This results in saturating_shl being more machine code, requiring the cmp and cmove. This suggests that if overflow-checks=off means overflow should choose "what the hardware makes fast," at least for LLVM emitting x86_64 it does seem like the wrapping version is better.

My attempt at giving autovectorization the best chance

use std::simd::*;

pub fn wrapping_shl_x4(lhs: u32x4, rhs: u32x4) -> u32x4 {
    let lhs = lhs.to_array();
    let rhs = rhs.to_array();
    let out = lhs.zip(rhs).map(|(lhs, rhs)| wrapping_shl(lhs, rhs));
    u32x4::from_array(out)
}

pub fn saturating_shl_x4(lhs: u32x4, rhs: u32x4) -> u32x4 {
    let lhs = lhs.to_array();
    let rhs = rhs.to_array();
    let out = lhs.zip(rhs).map(|(lhs, rhs)| saturating_shl(lhs, rhs));
    u32x4::from_array(out)
}

pub fn vector_shl_x4(lhs: u32x4, rhs: u32x4) -> u32x4 {
    lhs << rhs
}
results, for both, in emitted code that reads and shifts each element individually rather than using a vectorized operation. Only the version using std::simd results in a vectorized shift in LLVM, and AIUI that's because it bottoms out in an intrinsic stating to do so (and, currently, wrapping the shift amount).

LLVM's IR only supports shifts where the shift amount is less than the bitwidth of the shifted type; wider shifts create poison (deferred UB). This means the backend's instruction selection has to detect code that wraps or flushes the shift amount and fold it into the shift instruction; it can't be done during the general optimization passes. However, in theory at least, this shouldn't need to prevent autovectorization, as the vectorized version just directly does the non-vectorized operations on vectorized types.

Also, FWIW, in my attempt, both vectorized wrapping and flushing shifts seem to generate essentially identical assembly, with the wrapping version shorter by a single instruction. This does seem to indicate that it's not the semantics of shifts getting in the way of vectorization at all, but rather just limitations of the autovectorization pass not handling shifts. It may also be an opportunity for x86-64-v3 instruction selection to improve a bit, to take advantage of vpsllvd's overlong shift semantics the way that the nonvectorized version takes advantage of shlx's semantics to elide the source-level mask of the shift amount.

Yes, with some caveats.

std is distributed compiled in release mode without overflow checks for most of the stdlib, so if it's internal, most people wouldn't see the panic. Additionally, std uses #[rustc_inherit_overflow_checks] a lot to avoid that and get the compilation's overflow-checks setting for the standard arithmetic operators; I'm not confident enough to say whether or not any stdlib functionality is doing so for shifts.

It's these caveats that make up the "are we willing to bet on that." std shouldn't be relying on the behavior of overlong shifts, but there is a risk that changing the behavior of the shift operator over editions could result in std visibly behaving differently based on the caller's edition. On the other hand, if we can reliably ensure that std's code always gets the current edition behavior, even if it's inheriting overflow checks, then that's sufficient to mix editions as with any other edition change.

I'm not saying that it would cause problems, just that there's more potential for it to cause problems than other language-only edition changes.

I see, I misread what you said about the new edition.

I think if you're going to change the semantics in a new edition, it would be a lot more useful to say that the result is always 0 when you shift by 32 or more (for all the reasons I stated in my first post).

If there is a need for the "do what your particular hardware does fastest with incorrect results for large shifts" version, I think it would be better suited as a specialized function for advanced use, such as std::intrinsics::hardware_shl, rather than as the default << operator.

It's a stretch to call shifting by 32 or more an "overflow"; it's a surprising artificial limitation (as OP noted). Normally an "overflow" is a situation where an arithmetic operation result doesn't fit in a type. But that's not what this is. u32::MAX >> 32 definitely fits. And it's not clear why u32::MAX << 32 is more of an overflow than u32::MAX << 31. Both fit (or both don't fit) just as much, so it's strange to call one of them and not the other an "overflow". Both cases are useful, for example:

fn low_bits(n: u32) -> u32 {
    // watch out, it doesn't work for n==0 !
    u32::MAX >> (32 - n)
}
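A variant that also covers n == 0 is possible today via checked_shr (the name is mine; it still assumes n ≤ 32, since 32 - n would underflow otherwise):

```rust
// Mask of the n lowest bits, total over 0..=32: checked_shr returns
// None when the shift amount is 32 (i.e. n == 0), so unwrap_or(0)
// supplies the correct empty mask.
fn low_bits_total(n: u32) -> u32 {
    u32::MAX.checked_shr(32 - n).unwrap_or(0)
}

fn main() {
    assert_eq!(low_bits_total(0), 0);
    assert_eq!(low_bits_total(1), 1);
    assert_eq!(low_bits_total(8), 0xFF);
    assert_eq!(low_bits_total(32), u32::MAX);
}
```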

It's an overflow because the shift amount, despite being typed as u32 in Rust, actually has the domain 0..N. That's the value which is overflowing/wrapping down to the smaller domain; u8 >> u3, u16 >> u4, u32 >> u5, u64 >> u6, and u128 >> u7.

I'm only restating this for clarity. I fully agree that in every other case of overflowing_op, the value which is overflowing is the result of the computation overflowing the same-width output range; with wrapping_op, the result of the computation wrapping back into the same-width output range. The shifts are the odd ones out here.

Ultimately, this is LLVM's semantics leaking out of Rust's abstractions. LLVM says it's UB for the shift rhs for iN to be ≥N, thus the shift rhs got exposed into Rust as wrapping/overflowing from its natural domain down to the operation domain.


I feel like a core point is that left bit-shift, discarding anything that goes over the boundary, is perfectly well-defined for every positive (and even negative) integer shift amount. There’s no real reason the default behaviour shouldn’t just shift everything out when the amount gets too large. I also believe a core misunderstanding of my original posting was that I didn’t understand that shifting out some bits is often a core component of what you want a bit-shift to do. That’s why I suggested the (to some, ridiculous) option of just rotating.

There is. A barrel shifter naturally behaves the way Rust makes shifts behave.

Which means many CPU designs did shifts like that.

Of course it's not too hard to add a tiny bit of logic to make a “too large” shift behave in a different fashion, and some architectures did that… in a limited fashion (ARM is a prime example: a shift by 128 would zero everything, but a shift by 256 would keep everything intact).

What Rust does is just simply the best compromise between what hardware makes available and what people can deal with.


TBH, given a time machine, I think we'd define shift-left-by-BITS-or-more to just be zero. That's most consistent with the general "the result is equivalent to doing it in infinite precision, then truncating" rule that we follow in most places. And it's still quite cheap in hardware to do that check, if it's not the ISA behaviour. After all, even today we emit << as bitwise-and then shift to LLVM.
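Written out in source, that mask-then-shift lowering looks like this (a sketch of what release-mode `<<` with overflow checks off is equivalent to, not actual compiler code):

```rust
// With overflow checks off, `x << n` behaves as if the shift amount
// were masked to the type's bit width first, keeping LLVM's shift
// amount in its valid 0..32 range.
fn shl_as_lowered(x: u32, n: u32) -> u32 {
    x << (n & (u32::BITS - 1))
}

fn main() {
    assert_eq!(shl_as_lowered(1, 1), 2);
    assert_eq!(shl_as_lowered(1, 33), 2); // 33 & 31 == 1
}
```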

Rust is usually pretty good about the default one being the one that does things properly, with the "faster but weirder" ones being methods with longer names. It ought to be that for shifts too. But it's too late to change now.

Though one day I hope we get the fully general Integer<MIN, MAX> type that can move all the overflow problems to the at-the-end truncation rather than needing to worry about them on every intermediate value.

For example, if we had that then (x + y) / 2 would just work on those types, and we'd not need things like Add midpoint function for all integers and floating numbers by Urgau · Pull Request #92048 · rust-lang/rust · GitHub to handle the edge cases.
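For context, the edge case that PR works around is the intermediate x + y overflowing before the division. One classic overflow-free formulation (my example, not from the PR):

```rust
// Overflow-free floor midpoint of two u32s:
// x + y == (x ^ y) + 2 * (x & y), so the floor average is
// (x & y) + ((x ^ y) >> 1), with no intermediate overflow.
fn midpoint(x: u32, y: u32) -> u32 {
    (x & y) + ((x ^ y) >> 1)
}

fn main() {
    assert_eq!(midpoint(2, 4), 3);
    // (x + y) / 2 would overflow here; this formulation does not:
    assert_eq!(midpoint(u32::MAX, u32::MAX - 1), u32::MAX - 1);
}
```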


This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.