Dereference results in two memcpy calls


#1

Hi,

I am trying to understand why this simple code:

pub fn crazy_stuff() {
  let mut array = *b"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
}

results in two memcpy calls:

example::crazy_stuff:
.Lfunc_begin0:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 112
        mov     eax, 36
        mov     ecx, eax
        lea     rdx, [rbp - 72]
        lea     rsi, [rbp - 36]
        lea     rdi, [rip + .Lbyte_str.0]
.Ltmp0:
        mov     r8, rsi
        mov     qword ptr [rbp - 80], rdi
        mov     rdi, r8
        mov     r8, qword ptr [rbp - 80]
        mov     qword ptr [rbp - 88], rsi
        mov     rsi, r8
        mov     qword ptr [rbp - 96], rdx
        mov     rdx, rcx
        mov     qword ptr [rbp - 104], rcx
        call    memcpy@PLT
        mov     rcx, qword ptr [rbp - 88]
        mov     rdx, qword ptr [rbp - 96]
        mov     rdi, rdx
        mov     rsi, rcx
        mov     rdx, qword ptr [rbp - 104]
        call    memcpy@PLT
        add     rsp, 112
        pop     rbp
        ret
.Ltmp1:
.Lfunc_end0: 

I understand that the dereference should result in a copy of my buffer onto the stack, but where is the second copy coming from?

Thank you


#2

It’s not useful to over-analyze this kind of thing in debug mode. In release mode, this crazy_stuff gets optimized away completely, and even in general, useless memcpy calls will be eliminated. But if you really want to dig in, you can use the playground to examine rustc’s MIR output:

fn crazy_stuff() -> (){
    let mut _0: ();                      // return place
    scope 1 {
        let mut _1: [u8; 36];            // "array" in scope 1 at src/main.rs:4:7: 4:16
    }
    scope 2 {
    }
    let mut _2: [u8; 36];
    let mut _3: &[u8; 36];

    bb0: {                              
        StorageLive(_1);                 // bb0[0]: scope 0 at src/main.rs:4:7: 4:16
        StorageLive(_2);                 // bb0[1]: scope 0 at src/main.rs:4:19: 4:59
        StorageLive(_3);                 // bb0[2]: scope 0 at src/main.rs:4:20: 4:59
        _3 = const ByVal(Ptr(MemoryPointer { alloc_id: AllocId(0), offset: 0 })):&[u8; 36]; // bb0[3]: scope 0 at src/main.rs:4:20: 4:59
                                         // ty::Const
                                         // + ty: &[u8; 36]
                                         // + val: Value(ByVal(Ptr(MemoryPointer { alloc_id: AllocId(0), offset: 0 })))
                                         // mir::Constant
                                         // + span: src/main.rs:4:20: 4:59
                                         // + ty: &[u8; 36]
                                         // + literal: const ByVal(Ptr(MemoryPointer { alloc_id: AllocId(0), offset: 0 })):&[u8; 36]
        _2 = (*_3);                      // bb0[4]: scope 0 at src/main.rs:4:19: 4:59
        _1 = move _2;                    // bb0[5]: scope 0 at src/main.rs:4:19: 4:59
        StorageDead(_2);                 // bb0[6]: scope 0 at src/main.rs:4:58: 4:59
        StorageDead(_3);                 // bb0[7]: scope 0 at src/main.rs:4:59: 4:60
        StorageDead(_1);                 // bb0[8]: scope 0 at src/main.rs:5:1: 5:2
        return;                          // bb0[9]: scope 0 at src/main.rs:5:2: 5:2
    }
}

So _1 is the array local, and it looks like _2 is a temporary for the dereferenced value that will be written to array. The first memcpy fills _2 from the static data, and the second moves it into _1.
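For comparison, here is a minimal sketch (the function name is just for illustration): if you only borrow the literal instead of dereferencing it, no stack copy is needed at all, even in debug builds.

```rust
// Hypothetical sketch: borrowing the byte-string literal yields a
// `&'static` reference into static memory, so rustc emits no memcpy
// for it, even without optimizations.
pub fn no_copy() -> u8 {
    let array = b"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"; // borrow, not a copy
    array[0]
}
```

The copies in crazy_stuff exist only because `*` forces the array to be materialized by value in the local.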


#3

I have a feeling that this over-reliance on optimizers will eventually bite Rust in the rump.


#4

I believe it already does: rustc ships a lot of IR to LLVM to optimize, and so compile times go up.


#5

In that case, please explain why I get the same thing in release mode ;).

https://play.rust-lang.org/?gist=bca922fa877dcaed45b3ac2a8b7a0916&version=nightly&mode=release


#6

There’s no copying there. Did you look at the asm and see something otherwise?


#7

I see no memcpy at all:

playground::crazy_stuff:
	subq	$120, %rsp
	movups	.Lbyte_str.1(%rip), %xmm0
	movaps	%xmm0, (%rsp)
	movq	.Lbyte_str.1+29(%rip), %rax
	movq	%rax, 29(%rsp)
	movups	.Lbyte_str.1+16(%rip), %xmm0
	movaps	%xmm0, 16(%rsp)
	movb	$0, 2(%rsp)
	movq	%rsp, %rax
	movq	%rax, 40(%rsp)
	leaq	2(%rsp), %rax
	movq	core::fmt::num::<impl core::fmt::Display for u8>::fmt@GOTPCREL(%rip), %rcx
	movq	%rcx, 48(%rsp)
	movq	%rax, 56(%rsp)
	movq	%rcx, 64(%rsp)
	leaq	.Lbyte_str.5(%rip), %rax
	movq	%rax, 72(%rsp)
	movq	$3, 80(%rsp)
	leaq	.Lbyte_str.6(%rip), %rax
	movq	%rax, 88(%rsp)
	movq	$2, 96(%rsp)
	leaq	40(%rsp), %rax
	movq	%rax, 104(%rsp)
	movq	$2, 112(%rsp)
	leaq	72(%rsp), %rdi
	callq	std::io::stdio::_print@PLT
	addq	$120, %rsp
	retq


#8

I think it’s pretty common to start with a simple-minded translation of the code, get it correct, then let the optimizer go to town. Rust could do some optimizations up front in MIR before handing it off to the backend, and IIRC they are planning to do so.


#9

If you’re analyzing functions, it’s better to use godbolt: https://godbolt.org/g/Lok13j As you can see, the function is fully optimized.

Not only does it connect the assembly with the source code (when it can), it also leaves out the unnecessary utility assembly (because the playground compiles a binary, not a lib).


#10

Just an FYI for whoever doesn’t know, but you can add

#![crate_type = "lib"]

at the top and it’ll compile as a lib.


#11

The absence of memcpy calls does not mean no memory copying is happening; in this case it is optimized to use xmm registers.

As for the dual copy, I’m not sure how I read this assembly the first time; there seems to be only one copy there. I am still a bit unsure why the stack is so big.


#12

OK, yes: if you look at the LLVM IR, there is just one memcpy call in the optimized version.


#13

I think most of the stack is likely due to the printing you’re doing. If you take a dummy function like this:

pub fn crazy_stuff() -> u8 {
  let mut array = *b"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
  unsafe {std::ptr::write_volatile(&mut array[2] as *mut _, 4u8);}
  array[24]
}

It produces the following asm:

playground::crazy_stuff:
	subq	$1, %rsp
	movb	$4, (%rsp)
	movb	$97, %al
	addq	$1, %rsp
	retq

So the array isn’t materialized at all. But I guess the printing, which I know expands to a bunch of messy glue code, presumably spooks the optimizer.


#14

fmt::Arguments captures a reference to each value – I’m guessing that LLVM thinks that a pointer within the array could legally access other parts of the same array in the callee.

You can force it to use temporaries with println!("{}-{}", {array[0]}, {array[2]}), and then the array is optimized away.
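A sketch of that trick (assuming the same 36-byte array as the original example; the function name is made up):

```rust
pub fn print_two() {
    let array = *b"aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa";
    // Each braced argument is a block expression that evaluates to a copy
    // of the byte, so fmt::Arguments captures references to these small
    // temporaries rather than into `array`, letting the array itself be
    // optimized away.
    println!("{}-{}", { array[0] }, { array[2] });
}
```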

I wonder if there’s a way to help LLVM understand that it’s not legal in Rust for those original references to access the rest of the object?


#15

But you can always write unsafe { std::ptr::read((val as *const u8).offset(100500)) }, so I don’t think it will be possible without LTO. (Well, we could mark those references somehow, but I doubt LLVM has such functionality.)


#16

Have the unsafe Rust guidelines addressed this kind of possibility? It seems like an obvious hazard, if not outright UB, since you can’t know what’s happening in the rest of the object. The other parts could be mutably borrowed elsewhere, for instance.
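A small sketch of that hazard (all safe code, names just for illustration): a shared reference to one element can legally coexist with a mutable borrow of a neighboring element, so a read through the shared reference that reached past its own element would alias the mutable borrow.

```rust
fn main() {
    let mut array = [1u8, 2, 3];
    // Split into disjoint mutable slices: `left` covers [0], `right` covers [1..].
    let (left, right) = array.split_at_mut(1);
    let a: &u8 = &left[0]; // shared borrow of element 0 only
    right[0] = 42;         // element 1 is mutably borrowed elsewhere
    // Reading element 1 through a pointer derived from `a` at this point
    // would alias the mutable borrow: exactly the hazard described above.
    assert_eq!(*a, 1);
}
```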


#17

LLVM does not know about any guidelines; you must somehow prove (or at least declare) to it that the given reference will be used only for reading one byte. I think the potential performance improvements are dwarfed by the added complexity.


#18

Sure, this comes in two parts – decide whether it should be legal, and then express that as much as possible to the backend. We have a similar situation with mutable aliasing, which still isn’t declared noalias AFAIK.


#19

That was put into nightly about a month ago: https://github.com/rust-lang/rust/pull/50744