--emit=asm changes the produced machine code

I am running into an issue that has me scratching my head and I wonder if I am missing something or if it is a bug.

If I add --emit=asm to rustc the compiled binary completely changes. In particular, the code is “better” optimized without --emit=asm.

I put together a small testcase at https://github.com/rekka/weird-asm with samples of the objdump and more details:

No --emit=asm

RUSTFLAGS="-C target-cpu=native" cargo build --release

The code is vectorized and unrolled:

7ab0:	c4 c1 6d 59 74 f2 d8 	vmulpd -0x28(%r10,%rsi,8),%ymm2,%ymm6
7ab7:	c5 fd 10 7c f1 e0    	vmovupd -0x20(%rcx,%rsi,8),%ymm7
7abd:	c4 c1 45 58 7c f2 d0 	vaddpd -0x30(%r10,%rsi,8),%ymm7,%ymm7
7ac4:	c4 c1 45 58 7c f2 e0 	vaddpd -0x20(%r10,%rsi,8),%ymm7,%ymm7
7acb:	c5 c5 58 7c f0 e0    	vaddpd -0x20(%rax,%rsi,8),%ymm7,%ymm7
7ad1:	c4 c1 45 58 7c f0 08 	vaddpd 0x8(%r8,%rsi,8),%ymm7,%ymm7
7ad8:	c5 c5 59 fb          	vmulpd %ymm3,%ymm7,%ymm7
7adc:	c4 c1 45 59 7c f1 e0 	vmulpd -0x20(%r9,%rsi,8),%ymm7,%ymm7
7ae3:	c5 cd 58 f7          	vaddpd %ymm7,%ymm6,%ymm6
7ae7:	c4 c1 7d 11 74 f4 e0 	vmovupd %ymm6,-0x20(%r12,%rsi,8)
7aee:	c4 c1 6d 59 74 f2 f8 	vmulpd -0x8(%r10,%rsi,8),%ymm2,%ymm6
7af5:	c5 fd 10 3c f1       	vmovupd (%rcx,%rsi,8),%ymm7
7afa:	c4 c1 45 58 7c f2 f0 	vaddpd -0x10(%r10,%rsi,8),%ymm7,%ymm7
7b01:	c4 c1 45 58 3c f2    	vaddpd (%r10,%rsi,8),%ymm7,%ymm7
7b07:	c5 c5 58 3c f0       	vaddpd (%rax,%rsi,8),%ymm7,%ymm7
7b0c:	c4 c1 45 58 7c f0 28 	vaddpd 0x28(%r8,%rsi,8),%ymm7,%ymm7
7b13:	c5 c5 59 fb          	vmulpd %ymm3,%ymm7,%ymm7
7b17:	c4 c1 45 59 3c f1    	vmulpd (%r9,%rsi,8),%ymm7,%ymm7
7b1d:	c5 cd 58 f7          	vaddpd %ymm7,%ymm6,%ymm6
7b21:	c4 c1 7d 11 34 f4    	vmovupd %ymm6,(%r12,%rsi,8)
7b27:	48 83 c6 08          	add    $0x8,%rsi
7b2b:	48 83 c5 02          	add    $0x2,%rbp
7b2f:	0f 85 7b ff ff ff    	jne    7ab0 <_ZN9weird_asm21precomp_damped_jacobi17h4c145fd63e2bf0d0E+0x390>

With --emit=asm

RUSTFLAGS="-C target-cpu=native --emit=asm" cargo build --release

No vectorization:

79b0:	c5 fb 10 6c c5 f8    	vmovsd -0x8(%rbp,%rax,8),%xmm5
79b6:	48 83 c0 01          	add    $0x1,%rax
79ba:	c5 f3 59 f4          	vmulsd %xmm4,%xmm1,%xmm6
79be:	c4 c1 53 58 2c c3    	vaddsd (%r11,%rax,8),%xmm5,%xmm5
79c4:	c5 fb 10 64 c5 00    	vmovsd 0x0(%rbp,%rax,8),%xmm4
79ca:	c5 d3 58 ec          	vaddsd %xmm4,%xmm5,%xmm5
79ce:	c4 c1 53 58 2c c0    	vaddsd (%r8,%rax,8),%xmm5,%xmm5
79d4:	c4 c1 53 58 2c c6    	vaddsd (%r14,%rax,8),%xmm5,%xmm5
79da:	c5 d3 59 e8          	vmulsd %xmm0,%xmm5,%xmm5
79de:	c4 c1 53 59 2c c2    	vmulsd (%r10,%rax,8),%xmm5,%xmm5
79e4:	c5 cb 58 ed          	vaddsd %xmm5,%xmm6,%xmm5
79e8:	c4 c1 7b 11 6c c5 00 	vmovsd %xmm5,0x0(%r13,%rax,8)
79ef:	49 39 c1             	cmp    %rax,%r9
79f2:	75 bc                	jne    79b0 <_ZN9weird_asm21precomp_damped_jacobi17h4c145fd63e2bf0d0E+0x290>

This is both on the latest stable and nightly, Ubuntu 17.10 and macOS.

It has a performance impact on the binary as well.

I use --emit=asm to inspect the produced code and check that loops get autovectorized, but I noticed this difference when running perf record on the code built without --emit=asm. Does --emit=asm change the codegen in some way? That would make it much less useful.


Peeking at the compiler source, it looks like the issue is that rustc enables ThinLTO by default in release mode, but turns it off if certain output options are specified, including --emit=asm. A comment mentions this issue report:

The behavior does seem rather surprising. Forcing ThinLTO back on with -C lto=thin seems to work, though it produces a warning:

warning: ignoring emit path because multiple .s files were produced

Another option is to use -C lto=fat, but combining that with --emit=asm produces a fun internal error:

DICompileUnit not listed in llvm.dbg.cu
!19682 = distinct !DICompileUnit(language: DW_LANG_Rust, file: !2011, producer: "clang LLVM (rustc version 1.28.0-nightly (990d8aa74 2018-05-25))", isOptimized: true, runtimeVersion: 0, emissionKind: FullDebug, enums: !2012, globals: !19683)
LLVM ERROR: Broken module found, compilation aborted!

Looks like turning off debug info might be a workaround.

Thank you for looking into this. This might be at least part of the reason.

Indeed, disabling ThinLTO manually on the nightly compiler produces code that is not vectorized:

RUSTFLAGS="-C target-cpu=native -Z thinlto=no" cargo build --release

However, it looks less optimized than the code produced with --emit=asm, at least judging by the extra movs sprinkled around.

I suppose that if a ThinLTO pass further optimizes the output of the compile step, and the asm output only shows that compile step, it makes sense that the resulting binary does not match the asm output. But this makes --emit=asm quite useless for my use case. I have always assumed that the asm output would appear exactly in the binary. Also, the fact that --emit=asm disables ThinLTO without any warning is a bit unfortunate.

Luckily, objdump -d is easy enough, albeit a little less readable. So I will use that instead.
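For anyone else going the objdump route, here is a rough sketch of how the output could be narrowed down to one function. The helper name filter_fn is made up for illustration, the binary path target/release/weird-asm is an assumption based on the test case name, and the header pattern assumes GNU-style objdump output (llvm-objdump on macOS may format headers differently).

```shell
# Keep only the functions whose symbol name contains the given substring.
# Reads `objdump -d` output on stdin; assumes GNU-style "<symbol>:" headers.
filter_fn() {
    awk -v pat="$1" '
        # Function header, e.g. "0000000000007720 <some_symbol>:"
        /^[0-9a-f]+ <.*>:$/ { keep = (index($0, pat) > 0) }
        # Print every line that belongs to a matching function.
        keep { print }
    '
}

# Hypothetical usage (binary path and symbol substring are assumptions):
#   objdump -d --no-show-raw-insn target/release/weird-asm | filter_fn precomp_damped_jacobi
```

Dropping the raw instruction bytes with --no-show-raw-insn gets the listing a bit closer to what --emit=asm prints, and the disassembly comes from the binary that actually runs.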

@rekka FYI, I’ve just hit this behavior too. The difference comes from the fact that --emit=asm forces the number of codegen units to 1.
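If that is the cause, one thing that might keep the two outputs comparable is pinning the codegen unit count explicitly in both builds. This is only a sketch: the function name build_both is made up, and I am assuming (not verifying) that -C codegen-units=1 gives the plain build the same codegen setup that --emit=asm ends up with.

```shell
# Build twice with the same explicit settings: once for the binary that
# perf/objdump will see, once with assembly output for reading.
# Assumption: pinning codegen units to 1 keeps both builds comparable.
build_both() {
    flags="-C target-cpu=native -C codegen-units=1"

    # Plain release build.
    RUSTFLAGS="$flags" cargo build --release

    # Same flags, plus assembly output.
    RUSTFLAGS="$flags --emit=asm" cargo build --release
}
```

Note that this trades away whatever the default parallel codegen plus ThinLTO setup was buying, so the binary built this way may not be the fastest one.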

The performance issue between CG=1 and (CG=n + ThinLTO) seems to be known:
