I have a large project that showed runtime performance regressions of up to ~10% in some cases after updating the Rust toolchain from 1.78 to 1.79. Given how large the project is, I am not sure how to pinpoint exactly where the regression is coming from.
However, I also observed similar regression behavior in one of my iai-run microbenchmarks.
The microbenchmark (which, unfortunately, still contains a ton of proprietary code) shows 10 added instructions in a hot deserialization loop. First, note that I am not sure this microbenchmark regression corresponds to a regression on actual hardware: iai uses Callgrind to count executed instructions, but the latency/throughput of SIMD instructions is not modeled.
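For reference, the harness itself is nothing special; it is essentially the shape sketched below, where bench_deserialize and parse_u64 are made-up stand-ins, since the real benchmark wraps the proprietary deserializer:

use iai::black_box;

// Stand-in for the proprietary hot loop the real benchmark exercises.
fn parse_u64(bytes: &[u8; 8]) -> u64 {
    u64::from_le_bytes(*bytes)
}

fn bench_deserialize() -> u64 {
    // black_box keeps input and output from being constant-folded away,
    // so Callgrind reports a stable instruction count per run.
    black_box(parse_u64(black_box(&[1u8; 8])))
}

iai::main!(bench_deserialize);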
With that out of the way - the regression itself is interesting. In the benchmark my code basically returns a large Result<FlatStructAllRequired, E> (where E has a layout compatible with std::io::Error) from a function.
pub struct FlatStructAllRequired {
    pub f1: f64,
    pub f2: f64,
    pub f3: f64,
    pub f4: f64,
    pub f5: f64,
    pub f6: f64,
    pub f7: f64,
    pub f8: f64,
    pub f9: f64,
    pub f10: f64,
}
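The function whose epilogue changed has roughly the shape below; read_record and its byte-parsing body are simplified stand-ins for the proprietary code (and in the real code the error type only mirrors std::io::Error's layout, it is not io::Error itself):

use std::io;

// Simplified stand-in for the hot deserialization step: the real code is
// proprietary, but the return shape is the same.
#[inline(never)]
fn read_record(buf: &[u8]) -> Result<FlatStructAllRequired, io::Error> {
    if buf.len() < 80 {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "short buffer"));
    }
    // Read the i-th little-endian f64 out of the buffer.
    let f = |i: usize| {
        let mut b = [0u8; 8];
        b.copy_from_slice(&buf[i * 8..(i + 1) * 8]);
        f64::from_le_bytes(b)
    };
    Ok(FlatStructAllRequired {
        f1: f(0), f2: f(1), f3: f(2), f4: f(3), f5: f(4),
        f6: f(5), f7: f(6), f8: f(7), f9: f(8), f10: f(9),
    })
}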
I went over the generated assembly, and one thing that stuck out was indeed 10 extra instructions at the end of the function in v1.79.
1.78
5ea43: c5 f8 28 85 00 ff ff vmovaps -0x100(%rbp),%xmm0
5ea4a: ff
5ea4b: c5 f9 14 85 10 ff ff vunpcklpd -0xf0(%rbp),%xmm0,%xmm0
5ea52: ff
5ea53: c5 f8 28 8d e0 fe ff vmovaps -0x120(%rbp),%xmm1
5ea5a: ff
5ea5b: c5 f1 14 8d f0 fe ff vunpcklpd -0x110(%rbp),%xmm1,%xmm1
5ea62: ff
5ea63: c4 e3 75 18 c0 01 vinsertf128 $0x1,%xmm0,%ymm1,%ymm0
5ea69: c5 f8 28 8d 40 ff ff vmovaps -0xc0(%rbp),%xmm1
5ea70: ff
5ea71: c5 f1 14 8d 50 ff ff vunpcklpd -0xb0(%rbp),%xmm1,%xmm1
5ea78: ff
5ea79: c5 f8 28 95 20 ff ff vmovaps -0xe0(%rbp),%xmm2
5ea80: ff
5ea81: c5 e9 14 95 30 ff ff vunpcklpd -0xd0(%rbp),%xmm2,%xmm2
5ea88: ff
5ea89: c4 e3 6d 18 c9 01 vinsertf128 $0x1,%xmm1,%ymm2,%ymm1
5ea8f: 4d 89 3e mov %r15,(%r14)
5ea92: 49 89 5e 08 mov %rbx,0x8(%r14)
5ea96: c4 c1 7c 11 46 10 vmovups %ymm0,0x10(%r14)
5ea9c: c4 c1 7c 11 4e 30 vmovups %ymm1,0x30(%r14)
5eaa2: 48 81 c4 08 01 00 00 add $0x108,%rsp
5eaa9: 5b pop %rbx
5eaaa: 41 5e pop %r14
5eaac: 41 5f pop %r15
5eaae: 5d pop %rbp
5eaaf: c5 f8 77 vzeroupper
5eab2: c3 ret
vs 1.79
5ceed: c5 fb 10 85 20 ff ff vmovsd -0xe0(%rbp),%xmm0
5cef4: ff
5cef5: c5 fb 11 85 78 ff ff vmovsd %xmm0,-0x88(%rbp)
5cefc: ff
5cefd: c5 fb 10 85 28 ff ff vmovsd -0xd8(%rbp),%xmm0
5cf04: ff
5cf05: c5 fb 11 45 80 vmovsd %xmm0,-0x80(%rbp)
5cf0a: c5 fb 10 85 30 ff ff vmovsd -0xd0(%rbp),%xmm0
5cf11: ff
5cf12: c5 fb 11 45 88 vmovsd %xmm0,-0x78(%rbp)
5cf17: c5 fb 10 85 38 ff ff vmovsd -0xc8(%rbp),%xmm0
5cf1e: ff
5cf1f: c5 fb 11 45 90 vmovsd %xmm0,-0x70(%rbp)
5cf24: c5 fb 10 85 40 ff ff vmovsd -0xc0(%rbp),%xmm0
5cf2b: ff
5cf2c: c5 fb 11 45 98 vmovsd %xmm0,-0x68(%rbp)
5cf31: c5 fb 10 85 48 ff ff vmovsd -0xb8(%rbp),%xmm0
5cf38: ff
5cf39: c5 fb 11 45 a0 vmovsd %xmm0,-0x60(%rbp)
5cf3e: c5 fb 10 85 50 ff ff vmovsd -0xb0(%rbp),%xmm0
5cf45: ff
5cf46: c5 fb 11 45 a8 vmovsd %xmm0,-0x58(%rbp)
5cf4b: c5 fb 10 85 58 ff ff vmovsd -0xa8(%rbp),%xmm0
5cf52: ff
5cf53: c5 fb 11 45 b0 vmovsd %xmm0,-0x50(%rbp)
5cf58: c5 fc 10 45 98 vmovups -0x68(%rbp),%ymm0
5cf5d: c5 fc 11 43 30 vmovups %ymm0,0x30(%rbx)
5cf62: c5 fc 10 45 88 vmovups -0x78(%rbp),%ymm0
5cf67: c5 fc 11 43 20 vmovups %ymm0,0x20(%rbx)
5cf6c: 48 8b 85 68 ff ff ff mov -0x98(%rbp),%rax
5cf73: 48 89 03 mov %rax,(%rbx)
5cf76: 48 8b 85 70 ff ff ff mov -0x90(%rbp),%rax
5cf7d: 48 89 43 08 mov %rax,0x8(%rbx)
5cf81: c5 f8 10 85 78 ff ff vmovups -0x88(%rbp),%xmm0
5cf88: ff
5cf89: c5 f8 11 43 10 vmovups %xmm0,0x10(%rbx)
5cf8e: 48 81 c4 c8 00 00 00 add $0xc8,%rsp
5cf95: 5b pop %rbx
5cf96: 41 5e pop %r14
5cf98: 41 5f pop %r15
5cf9a: 5d pop %rbp
5cf9b: c5 f8 77 vzeroupper
From this it seems that the regression is in the function epilogue, where the contents of the FlatStructAllRequired struct are written into the caller's stack. Rust 1.78 used some clever SIMD shuffles (vunpcklpd + vinsertf128 to build YMM registers) to do this in fewer instructions, while 1.79 first copies the f64 fields one at a time with vmovsd into another stack area before issuing the wide stores.
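For context on why the epilogue does these writes at all: Result<FlatStructAllRequired, io::Error> is far too big to come back in registers, so rustc returns it through a hidden out-pointer - that is the pointer held in %r14 in the 1.78 listing and in %rbx in the 1.79 one. A hypothetical caller of the read_record sketch from above looks like this:

// Hypothetical caller: rustc passes read_record a hidden pointer to a
// stack slot in this frame, and read_record's epilogue copies the ten
// f64 fields into it - that copy is exactly where the two listings differ.
fn sum_first_and_last(buf: &[u8]) -> f64 {
    match read_record(buf) {
        Ok(rec) => rec.f1 + rec.f10,
        Err(_) => f64::NAN,
    }
}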
I tried to reproduce this behavior with a simple example on godbolt but failed. From this I suspect there is something else going on - perhaps some field/stack variable reordering somewhere? This would be corroborated by the fact that the epilogue has a different rsp adjustment: 0x108 in v1.78 vs 0xc8 in 1.79.
Do you have any suggestions on how I can efficiently narrow this regression down so that I can submit it to the rust-lang/rust repo?