Help understanding this assembly

In C, if you have a struct point with x and y members and a struct rect with x, y, width, and height, you can just cast a pointer to a rect into a pointer to a point and access the x and y members as if the rect were a point, rather than having to copy x and y to another location in memory first. In Rust, you have to initialize a new point from the rect, which at least appears to require copying x and y. I was curious whether the Rust compiler is smart enough to avoid that copy and just access the x and y in the rect directly, so I came up with this test code in the playground.

#[derive(Debug)]
struct Point {
    x: i32,
    y: i32,
}

#[derive(Debug)]
struct Rectangle {
    x: i32,
    y: i32,
    width: i32,
    height: i32,
}

fn main() {
    let r = Rectangle {
        x: 88,
        y: 88,
        width: 10,
        height: 10,
    };
    println!("r = {:?}", r);
    let p = Point { x: r.x, y: r.y };
    println!("p = {:?}", p);
}

(Playground)

Looking at the assembly generated, it appears that yes, it is just using the x and y in the rect directly rather than copying them to a new location, but I still have some more questions about this assembly.

First, it looks like the function allocates 104 bytes of stack space, but the highest address I can see being referenced is rsp + 80, so it only appears to be using 88 bytes of stack. Am I missing something or is the compiler allocating more than it needs?

Second, what are the arguments to the print function? It looks like it is passed a pointer in rdi that points to an array of arguments built on the stack. Those arguments seem to be some hard coded numbers, a pointer to some other hard coded data, and a pointer to a pointer to a pointer to the struct. The address of the fmt function also appears to be placed on the stack, but it appears to be at rsp + 16, thus comes before the rsp + 32 pointer that is being passed into the print function via rdi. Why is that? And why the triple indirection?

Here is the assembly I annotated with comments:

playground::main:
	push	r14
	push	rbx
#allocate 104 bytes of stack.  Why so much?
	sub	rsp, 104
#compute Address of struct Point.  Why not just use mov immd?
	movaps	xmm0, xmmword ptr [rip + .LCPI2_0]
#store address to stack
	movaps	xmmword ptr [rsp + 80], xmm0
#compute and store pointer to pointer to struct on stack.  Why double indirection?
	lea	rax, [rsp + 80]
	mov	qword ptr [rsp + 8], rax
#compute address of format function
	lea	rax, [rip + <playground::Rectangle as core::fmt::Debug>::fmt]
#store to stack
	mov	qword ptr [rsp + 16], rax
#compute address of some data and store to stack
	lea	rax, [rip + .L__unnamed_1]
	mov	qword ptr [rsp + 32], rax
#add more arguments to the call stack
	mov	qword ptr [rsp + 40], 2
	mov	qword ptr [rsp + 48], 0
#compute pointer to pointer to pointer to struct and store to the stack.  WTF?
	lea	rbx, [rsp + 8]
	mov	qword ptr [rsp + 64], rbx
	mov	qword ptr [rsp + 72], 1
#call print function with the triple pointer and a value of 1 packed into a struct and passed by reference
	mov	r14, qword ptr [rip + std::io::stdio::_print@GOTPCREL]
	lea	rdi, [rsp + 32]
	call	r14
#Copy address of struct Point to another location on the stack. WHY?
	mov	rax, qword ptr [rsp + 80]
	mov	qword ptr [rsp + 24], rax
#compute pointer to pointer to struct and store on the stack, even though it is already there
	lea	rax, [rsp + 24]
	mov	qword ptr [rsp + 8], rax
#compute and store another format helper function pointer to the stack
	lea	rax, [rip + <playground::Point as core::fmt::Debug>::fmt]
	mov	qword ptr [rsp + 16], rax
#build some more arguments on the stack, including pointer to pointer to pointer to struct in rbx
	lea	rax, [rip + .L__unnamed_2]
	mov	qword ptr [rsp + 32], rax
	mov	qword ptr [rsp + 40], 2
	mov	qword ptr [rsp + 48], 0
	mov	qword ptr [rsp + 64], rbx
	mov	qword ptr [rsp + 72], 1
	lea	rdi, [rsp + 32]
#call function again by passing a struct by reference
	call	r14
#cleanup and return
	add	rsp, 104
	pop	rbx
	pop	r14
	ret

Did you do this in release mode or debug mode? There seem to be a few too many instructions for release mode.
Also, please format the asm.

Most of the instructions come from the formatting machinery used by println!(). (You can see the actual function calls using Tools > Expand Macros in the Playground.) If we separate the print statements into their own functions (Rust Playground), the assembly for main() is simplified considerably:

.LCPI4_0:
	.long	88
	.long	88
	.long	10
	.long	10

playground::main:
	sub	rsp, 40
	movaps	xmm0, xmmword ptr [rip + .LCPI4_0]
	movaps	xmmword ptr [rsp + 16], xmm0
	lea	rdi, [rsp + 16]
	call	playground::print_rectangle
	mov	rax, qword ptr [rsp + 16]
	mov	qword ptr [rsp + 8], rax
	lea	rdi, [rsp + 8]
	call	playground::print_point
	add	rsp, 40
	ret

(Note that despite the #[inline(never)] attributes, LLVM has still inlined the "r" and "p" strings into the two print functions. There's no trivial way to prevent it from doing this.) You can observe that the two fields are still copied on the stack. This is because the compiler gives all variables on the stack their own address, even when they don't exist at the same time. I'm not well-versed in the reason for this; perhaps it's because LLVM thinks the called functions might try to write to the addresses, even though the immutable references disallow this.

The value at [rsp + 80] is 16 bytes long, so the function uses 96 bytes of that allocation, on top of the 24 bytes already taken by the return address and the two push instructions. 96 + 24 = 120 is not a multiple of 16, so the compiler rounded 96 up to 104 to bring the total to 128, because the stack must be 16-byte aligned.
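
The rounding can be checked with a small sketch. This is a simplified model, not how LLVM actually lays out frames: it assumes the x86-64 System V rule that the frame (return address, pushed registers, and the sub rsp amount) must add up to a multiple of 16, and the function name stack_adjustment is made up for illustration.

```rust
/// Smallest `sub rsp, N` that covers `locals` bytes while keeping the whole
/// frame (return address + pushed registers + N) a multiple of 16 bytes.
/// Simplified model for illustration only.
pub fn stack_adjustment(locals: u64, pushed_registers: u64) -> u64 {
    const RETURN_ADDRESS: u64 = 8; // pushed by the `call` instruction
    const ALIGNMENT: u64 = 16; // required stack alignment at call sites

    // Bytes already on the stack before the `sub rsp` executes.
    let fixed = RETURN_ADDRESS + 8 * pushed_registers;
    // Round the whole frame up to the next multiple of 16.
    let total = (fixed + locals + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    total - fixed
}
```

With 96 bytes of locals and two pushed registers (r14 and rbx), stack_adjustment(96, 2) gives the 104 seen in the listing.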

Release mode.

That does not appear to be true. Both calls to print are using the location rsp+16.

Oh yeah, xmm0 is 128 bits, isn't it? Why the heck is it computing a pointer (which is 64 bits) in a 128-bit register? And the return address and two pushes already moved rsp by the correct amount, so they should not be counted in the sub rsp.

The first is using rsp + 16, and the second is using rsp + 8, if I read it correctly. Note the lea rdi, [rsp + X] instructions.

Now I'm not sure where I got the 16 from, but both are using:

lea	rdi, [rsp + 32]

Your original assembly is misleading, since println!() is a heavyweight operation that is inlined into main(). Instead, I proposed an alternative program:

#[inline(never)]
fn print_rectangle(name: &str, value: &Rectangle) {
    println!("{} = {:?}", name, value);
}

#[inline(never)]
fn print_point(name: &str, value: &Point) {
    println!("{} = {:?}", name, value);
}

fn main() {
    let r = Rectangle {
        x: 88,
        y: 88,
        width: 10,
        height: 10,
    };
    print_rectangle("r", &r);
    let p = Point { x: r.x, y: r.y };
    print_point("p", &p);
}

which produces the assembly I copied above:

.LCPI4_0:
	.long	88
	.long	88
	.long	10
	.long	10

playground::main:
	sub	rsp, 40
	movaps	xmm0, xmmword ptr [rip + .LCPI4_0]
	movaps	xmmword ptr [rsp + 16], xmm0
	lea	rdi, [rsp + 16]
	call	playground::print_rectangle
	mov	rax, qword ptr [rsp + 16]
	mov	qword ptr [rsp + 8], rax
	lea	rdi, [rsp + 8]
	call	playground::print_point
	add	rsp, 40
	ret

This removes the triple-references, but it does not remove the copies.

That is very strange. I don't see the "p" or "r" arguments being passed into either print function. It looks like the first one is only given a single argument at rsp + 16, which for some reason is a 128 bit address of the hard coded coordinates? And then this same address is copied to rsp + 8 and passed to the second print function.

It looks like the compiler hard coded the "r" into print_rectangle rather than passing it in like it's supposed to.

Recall that lea only computes an address and stores it in a register, without touching memory, while mov with a memory operand actually reads from or writes to that address.

The two movaps instructions copy the 128-bit value of r, from [rip + .LCPI4_0] to [rsp + 16] via xmm0. The lea and call pass the address [rsp + 16] to print_rectangle.

Then, the two mov instructions copy the 64-bit value of p, from [rsp + 16] to [rsp + 8] via rax. The lea and call pass the address [rsp + 8] to print_point.
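
The 128-bit and 64-bit widths follow from the sizes of the two structs. A minimal sketch that checks this with std::mem::size_of; the structs are repeated so the snippet stands alone, and the helper name sizes_in_bits is made up for illustration.

```rust
use std::mem::size_of;

#[allow(dead_code)]
struct Point {
    x: i32,
    y: i32,
}

#[allow(dead_code)]
struct Rectangle {
    x: i32,
    y: i32,
    width: i32,
    height: i32,
}

/// Returns (bits in a Rectangle, bits in a Point).
pub fn sizes_in_bits() -> (usize, usize) {
    // Four i32 fields are 16 bytes, which fits exactly one xmm register;
    // two i32 fields are 8 bytes, which fits one 64-bit register such as rax.
    (size_of::<Rectangle>() * 8, size_of::<Point>() * 8)
}
```

So the values moved through xmm0 and rax are the structs themselves, not pointers to them.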

Ahh, right... so that is actually moving the entire Rectangle structure onto the stack. So I guess my version was only using double indirect pointers and smartly reusing the x,y part of the Rectangle rather than copying, but for some reason, your version does copy that 64 bit portion to another location on the stack for the second call. I wonder why that is and why the "r" and "p" strings are hard coded into the respective print functions rather than being passed as arguments.

That argument is always "r" -- this function is only called in one place with that argument, so it may as well be hard-coded inside the function, why not?

Actually, your original version also copies both values separately onto the stack. r is copied into [rsp + 80]:

	movaps xmm0, xmmword ptr [rip + .LCPI2_0]
	movaps xmmword ptr [rsp + 80], xmm0

Then, after r is printed, p is copied into [rsp + 24]:

	mov rax, qword ptr [rsp + 80]
	mov qword ptr [rsp + 24], rax

For reference, here's my own annotation of the println!() version:

# stack:
# [rsp + 8]:  arg_v1: fmt::ArgumentV1
# [rsp + 8]:  arg_v1.value: &fmt::Opaque
# [rsp + 16]: arg_v1.formatter: fn(&fmt::Opaque, &mut fmt::Formatter) -> fmt::Result
# [rsp + 24]: p: Point
# [rsp + 32]: args: fmt::Arguments
# [rsp + 32]: args.pieces: &[&str]
# [rsp + 48]: args.fmt: Option<&[fmt::rt::v1::Argument]>
# [rsp + 64]: args.args: &[fmt::ArgumentV1]
# [rsp + 80]: r: Rectangle

playground::main:
# enter
	push r14
	push rbx
	sub rsp, 104
# r = Rectangle { x: 88, y: 88, width: 10, height: 10 }
	movaps xmm0, xmmword ptr [rip + .LCPI2_0]
	movaps xmmword ptr [rsp + 80], xmm0
# arg_v1.value = transmute(&r)
	lea rax, [rsp + 80]
	mov qword ptr [rsp + 8], rax
# arg_v1.formatter = transmute(<Rectangle as fmt::Debug>::fmt)
	lea rax, [rip + <playground::Rectangle as core::fmt::Debug>::fmt]
	mov qword ptr [rsp + 16], rax
# args.pieces = &["r = ", "\n"][..]
	lea rax, [rip + .L__unnamed_1]
	mov qword ptr [rsp + 32], rax
	mov qword ptr [rsp + 40], 2
# args.fmt = None
	mov qword ptr [rsp + 48], 0
	lea rbx, [rsp + 8]
# args.args = &[arg_v1][..]
	mov qword ptr [rsp + 64], rbx
	mov qword ptr [rsp + 72], 1
# call io::_print(&args)
	mov r14, qword ptr [rip + std::io::stdio::_print@GOTPCREL]
	lea rdi, [rsp + 32]
	call r14
# p = Point { x: r.x, y: r.y }
	mov rax, qword ptr [rsp + 80]
	mov qword ptr [rsp + 24], rax
# arg_v1.value = transmute(&p)
	lea rax, [rsp + 24]
	mov qword ptr [rsp + 8], rax
# arg_v1.formatter = transmute(<Point as fmt::Debug>::fmt)
	lea rax, [rip + <playground::Point as core::fmt::Debug>::fmt]
	mov qword ptr [rsp + 16], rax
# args.pieces = &["p = ", "\n"][..]
	lea rax, [rip + .L__unnamed_2]
	mov qword ptr [rsp + 32], rax
	mov qword ptr [rsp + 40], 2
# args.fmt = None
	mov qword ptr [rsp + 48], 0
# args.args = &[arg_v1][..]
	mov qword ptr [rsp + 64], rbx
	mov qword ptr [rsp + 72], 1
# call io::_print(&args)
	lea rdi, [rsp + 32]
	call r14
# leave
	add rsp, 104
	pop rbx
	pop r14
	ret

Curiously, it does reuse the fmt::Arguments struct, probably because it is stored as a temporary instead of a full variable.
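
For comparison, here is roughly what that annotated sequence does at the source level: build a fmt::Arguments value with format_args! and hand it to a writer. This is a sketch rather than the exact macro expansion; it writes into a String through fmt::Write instead of calling the internal io::_print, and the function name render is made up for illustration.

```rust
use std::fmt::{self, Write};

#[derive(Debug)]
#[allow(dead_code)]
pub struct Rectangle {
    pub x: i32,
    pub y: i32,
    pub width: i32,
    pub height: i32,
}

/// Produces the same text as `println!("r = {:?}", r)`, but into a String.
pub fn render(r: &Rectangle) -> Result<String, fmt::Error> {
    let mut out = String::new();
    // format_args! creates the fmt::Arguments value: the literal pieces
    // ("r = " and "\n") plus one argument, which pairs a reference to `r`
    // with the Debug formatting function for Rectangle.
    out.write_fmt(format_args!("r = {:?}\n", r))?;
    Ok(out)
}
```

The extra levels of indirection in the assembly come from this structure: the writer receives a reference to the Arguments, which refers to a slice of arguments, each of which refers to the value being printed.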

The annoyance here is that even though I ask for #[inline(never)], LLVM still inlines the argument. To prevent it from doing it, you either have to declare the print_* functions in an extern "Rust" block, or give them another call site to prevent the single-use optimizations.

Instead of these tricks and attributes, mark your functions pub and compile the code as a library rather than as a binary (rename the main function in the playground).
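
A sketch of what that suggestion might look like, assuming the code is built as a library crate. The struct is repeated so the snippet stands alone; the exact assembly you get is not guaranteed.

```rust
#[derive(Debug)]
pub struct Rectangle {
    pub x: i32,
    pub y: i32,
    pub width: i32,
    pub height: i32,
}

// Because this function is `pub` in a library, code outside the crate may
// call it with any `name`, so the exported function has to keep both
// parameters instead of baking "r" into its body.
#[inline(never)]
pub fn print_rectangle(name: &str, value: &Rectangle) {
    println!("{} = {:?}", name, value);
}
```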

Because it could be called from another file and given a different argument.

Oops... darn... so the compiler isn't smart enough to avoid the copy eh?

Is there any way to get something like the C compatible pointer cast to reuse a subset of one structure without copying it?

It can't, because the function is local to this binary. Like I said, if you want to compile it as a standalone function, mark it pub and compile the code as a library.

Yes, but this binary could have other files that call that function. Or are you saying that this optimization is done at link time?

It can't, there are no other modules in your binary because your file doesn't have any mod statements.

Unfortunately not; even if you declare both as consts, LLVM will not fold them together during compilation or linking. Also, in C, reading a struct rect through a pointer to an unrelated struct point is generally treated as a strict-aliasing violation and therefore undefined behavior, so use a union or memcpy instead.
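
One design that sidesteps the copy, not something proposed in this thread, is to store a Point inside the Rectangle and borrow it. A minimal sketch, assuming you are free to change the layout of Rectangle: the origin field and the shares_storage helper are hypothetical names, and #[repr(C)] is only there so that the first field is guaranteed to sit at offset 0.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(C)]
pub struct Point {
    pub x: i32,
    pub y: i32,
}

#[derive(Debug, Clone, Copy, PartialEq)]
#[repr(C)]
pub struct Rectangle {
    pub origin: Point, // hypothetical field replacing the separate x and y
    pub width: i32,
    pub height: i32,
}

impl AsRef<Point> for Rectangle {
    fn as_ref(&self) -> &Point {
        // A plain borrow of the embedded Point: no bytes are copied.
        &self.origin
    }
}

/// True when the borrowed Point starts at the same address as the Rectangle.
pub fn shares_storage(r: &Rectangle) -> bool {
    let p: &Point = r.as_ref();
    std::ptr::eq(
        p as *const Point as *const u8,
        r as *const Rectangle as *const u8,
    )
}
```

A function that takes &Point can then be called with r.as_ref() or &r.origin, and it reads x and y straight out of the rectangle's storage.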