How to avoid double-parsing when checking for an item in a `vec`?

Hi,
I'm working on something that involves going through a vec, checking for something (let's call it Foo), and popping that foo out of the vec. Right now, this looks like:

if let Some(Really::Long(Thing::Foo(_))) = l.last() {
   if let Some(Really::Long(Thing::Foo(owned_foo))) = l.pop() {
      return owned_foo
   } else {
      unreachable!()
   }
}

Because I only want to pop if l.last() is a Foo, I have to double-check everything, which is getting really annoying. I hope I'm missing something really obvious here. Any help would be very much appreciated.

1 Like

Why don't you always pop it and .push() it back if it's not what you want?

match l.pop() {
    None => {}
    Some(Really::Long(Thing::Foo(foo))) => return foo,
    Some(other) => l.push(other),
}
5 Likes

That's admittedly probably the best solution here, but wouldn't that result in a lot of unnecessary cloning? These are pretty deeply-nested structs and I want to avoid moves as much as possible.

Moves don't ever clone, they are a simple memcpy at worst. So long as mem::size_of::<T>() isn't very large, moving T won't cause any meaningful slowdown.

8 Likes

Huh, I didn't know that. Thanks!

I'm surprised - that doesn't optimise as nicely as I thought it would.

Given these types:

pub struct Foo {
    first: u16,
    big_field: [u8; 512],
}

pub enum Really {
    Long(Thing),
    Short(String),
}

pub enum Thing {
    Foo(Foo),
    Bar(Vec<u8>),
}

and this function:

pub fn take_last_foo(items: &mut Vec<Really>) -> Option<Foo> {
    match items.pop() {
        None => None,
        Some(Really::Long(Thing::Foo(foo))) => Some(foo),
        Some(other) => {
            items.push(other);
            None
        }
    }
}

We can ask the playground for the generated assembly when using a nightly compiler in release mode.

playground::take_last_foo: # @playground::take_last_foo
	push	rbp
	push	r15
	push	r14
	push	r13
	push	r12
	push	rbx
	sub	rsp, 520
	mov	r14, rdi
	mov	rax, qword ptr [rsi + 16]
	test	rax, rax
	je	.LBB3_10
	mov	r12, rsi
	lea	r15, [rax - 1]
	mov	qword ptr [rsi + 16], r15
	mov	rbx, qword ptr [rsi]
	mov	rcx, r15
	shl	rcx, 9
	lea	rdx, [rcx + 8*rax]
	add	rdx, -8
	movzx	ecx, word ptr [rbx + rdx]
	lea	rax, [rbx + rdx + 2]
	mov	rbp, qword ptr [rbx + rdx + 8]
	mov	r13, qword ptr [rbx + rdx + 16]
	lea	rsi, [rbx + rdx]
	add	rsi, 24
	test	ecx, ecx
	je	.LBB3_5
	cmp	ecx, 3
	je	.LBB3_10
	mov	word ptr [rsp], cx
	mov	ecx, dword ptr [rax]
	mov	dword ptr [rsp + 2], ecx
	movzx	eax, word ptr [rax + 4]
	mov	word ptr [rsp + 6], ax
	mov	qword ptr [rsp + 8], rbp
	mov	qword ptr [rsp + 16], r13
	lea	rdi, [rsp + 24]
	mov	edx, 496
	call	qword ptr [rip + memcpy@GOTPCREL]
	cmp	r15, qword ptr [r12 + 8]
	jne	.LBB3_9
	mov	rdi, r12
	mov	rsi, r15
	call	alloc::raw_vec::RawVec<T,A>::reserve_for_push
	mov	rbx, qword ptr [r12]
	mov	r15, qword ptr [r12 + 16]

.LBB3_9:
	mov	rax, r15
	shl	rax, 9
	lea	rdi, [rax + 8*r15]
	add	rdi, rbx
	mov	rsi, rsp
	mov	edx, 520
	call	qword ptr [rip + memcpy@GOTPCREL]
	inc	r15
	mov	qword ptr [r12 + 16], r15

.LBB3_10:
	xor	eax, eax
	jmp	.LBB3_11

.LBB3_5:
	movzx	ecx, word ptr [rax + 4]
	mov	word ptr [r14 + 6], cx
	mov	eax, dword ptr [rax]
	mov	dword ptr [r14 + 2], eax
	lea	rdi, [r14 + 24]
	mov	edx, 492
	call	qword ptr [rip + memcpy@GOTPCREL]
	mov	qword ptr [r14 + 8], rbp
	mov	qword ptr [r14 + 16], r13
	mov	ax, 1

.LBB3_11:
	mov	word ptr [r14], ax
	mov	rax, r14
	add	rsp, 520
	pop	rbx
	pop	r12
	pop	r13
	pop	r14
	pop	r15
	pop	rbp
	ret
	mov	rbx, rax
	mov	rdi, rsp
	call	core::ptr::drop_in_place<playground::Really>
	mov	rdi, rbx
	call	_Unwind_Resume@PLT
	ud2

In particular, there's a call to alloc::raw_vec::RawVec<T,A>::reserve_for_push() that I'd expect to optimise out because we're guaranteed to have enough capacity for the push(). The capacity field never changed and was valid to begin with, but I'm guessing LLVM can't see that because it doesn't know what capacity was before take_last_foo() was called.

There's also a memcpy() of 496 bytes onto the stack which I don't understand and another memcpy() of 492 bytes to the return location (r14, which contains the value from rdi) when returning None, but I'd assume you can skip that second big copy because the bytes are all uninit.

3 Likes

unreachable! is probably about as good as you can do, here.

You'll see the compiler do the same in various places:

It's almost certain to optimize out, and it's safe.

There are probably evil ways to do this with unsafe, but it's probably not worth it 99.99% of the time.

1 Like

So I did some fiddling around on godbolt and found a few interesting things.

When I did a version of the function based on the redundant check:

pub fn redundant_check(items: &mut Vec<Really>) -> Option<Foo>{
    if let Some(Really::Long(Thing::Foo(foo))) = items.last(){
        if let Some(Really::Long(Thing::Foo(foo))) = items.pop(){
            Some(foo)
        } else{
            unreachable!()
        }
    } else {
        None
    }
}

I got this:

example::redundant_check:
        push    rbx
        sub     rsp, 528
        mov     rbx, rdi
        mov     rcx, qword ptr [rsi + 16]
        sub     rcx, 1
        jb      .LBB5_2
        mov     rax, qword ptr [rsi]
        mov     rdx, rcx
        shl     rdx, 9
        lea     rdx, [rdx + 8*rcx]
        cmp     word ptr [rax + rdx], 0
        je      .LBB5_4
.LBB5_2:
        mov     word ptr [rbx], 0
.LBB5_3:
        mov     rax, rbx
        add     rsp, 528
        pop     rbx
        ret
.LBB5_4:
        add     rax, rdx
        mov     qword ptr [rsi + 16], rcx
        lea     rdi, [rsp + 8]
        mov     edx, 520
        mov     rsi, rax
        call    qword ptr [rip + memcpy@GOTPCREL]
        cmp     word ptr [rsp + 8], 0
        jne     .LBB5_5
        lea     rsi, [rsp + 10]
        lea     rdi, [rbx + 2]
        mov     edx, 514
        call    qword ptr [rip + memcpy@GOTPCREL]
        mov     word ptr [rbx], 1
        jmp     .LBB5_3
.LBB5_5:
        lea     rdi, [rip + .L__unnamed_1]
        lea     rdx, [rip + .L__unnamed_2]
        mov     esi, 40
        call    qword ptr [rip + core::panicking::panic@GOTPCREL]
        jmp     .LBB5_7
        mov     rbx, rax
        lea     rdi, [rsp + 8]
        call    core::ptr::drop_in_place<core::option::Option<example::Really>>
        mov     rdi, rbx
        call    _Unwind_Resume@PLT
.LBB5_7:
        ud2

There's still a double memcpy there, which is less than ideal. However, if I use a version based on unreachable_unchecked:

pub fn redundant_check_unreachable(items: &mut Vec<Really>) -> Option<Foo>{
    if let Some(Really::Long(Thing::Foo(foo))) = items.last(){
        if let Some(Really::Long(Thing::Foo(foo))) = items.pop(){
            Some(foo)
        } else{
            unsafe{ std::hint::unreachable_unchecked() }
        }
    } else {
        None
    }
}

It optimizes to this:

example::redundant_check_unreachable:
        push    rbx
        mov     rbx, rdi
        mov     rcx, qword ptr [rsi + 16]
        sub     rcx, 1
        jb      .LBB6_1
        mov     rax, rsi
        mov     rsi, qword ptr [rsi]
        mov     rdx, rcx
        shl     rdx, 9
        lea     rdx, [rdx + 8*rcx]
        cmp     word ptr [rsi + rdx], 0
        je      .LBB6_4
.LBB6_1:
        xor     eax, eax
        jmp     .LBB6_5
.LBB6_4:
        add     rsi, rdx
        mov     qword ptr [rax + 16], rcx
        add     rsi, 2
        lea     rdi, [rbx + 2]
        mov     edx, 514
        call    qword ptr [rip + memcpy@GOTPCREL]
        mov     ax, 1
.LBB6_5:
        mov     word ptr [rbx], ax
        mov     rax, rbx
        pop     rbx
        ret

Which is much better, and only has one memcpy. In the version with pop+push, LLVM doesn't seem to be able to see through Vec::push or Vec::pop very well, and in the version with the redundant check, LLVM doesn't seem to realize the redundant check is in fact redundant and winds up effectively popping twice because of it. It might be worth a bug report, especially since the redundant check version in particular seems to be LLVM shooting itself in the foot.

3 Likes

It's hard for LLVM to know that len <= capacity.

I experimented by adding assert!(items.len() <= items.capacity()) at the beginning, but surprisingly, it doesn't help. Adding assert!(items.len() < items.capacity()); just before push does help.

I reduced the issue to this:

pub fn foo(a: u32, b: u32) -> u32 {
    assert!(a <= b);
    assert!(a != 0);
    if a - 1 < b { 1234 } else { 5678 }
}

The optimizer isn't able to figure out that a - 1 < b is always true, which is surprising.

However, if I replace assert macros by if statements, it is able to figure it out:

pub fn foo(a: u32, b: u32) -> u32 {
    if a > b { 0 }
    else if a == 0 { 0 }
    else if a - 1 < b { 1234 }
    else { 5678 }
}
1 Like

I have filed an LLVM ticket about missed optimization.

5 Likes

Part of the problem here is

LLVM could look at the a != 0 and turn sub i32 %a, 1 into sub nuw i32 %a, 1, at which point knowing that it's smaller than b is feasible, because it doesn't need to worry about the wrap. But it throws away that information in canonicalizing sub nuw i32 %a, 1 into add i32 %a, -1, which can definitely overflow, and thus what it actually sees is a.wrapping_add(u32::MAX) < b, which is much harder to prove always true.

You'll notice that the same example flipped the other way

pub fn foo(a: u32, b: u32) -> u32 {
    assert!(a <= b);
    assert!(b != u32::MAX);
    if a < b + 1 { 1234 } else { 5678 }
}

does optimize as expected, because the b + 1 gets a nuw that sticks around long enough to be used in a proof: https://rust.godbolt.org/z/Moo9r4odb

3 Likes