I have a large project that showed runtime performance regressions of up to ~10% in some cases after updating the Rust toolchain from 1.78 to 1.79. Given how large the project is, I am not sure how to pinpoint exactly where the regression is coming from.
However, I also observed similar regression behavior in one of my iai-run microbenchmarks.
The microbenchmark (which, unfortunately, still contains a ton of proprietary code) shows 10 added instructions in a hot deserialization loop. First, note that I am not sure this microbenchmark regression corresponds to a regression on actual hardware: iai uses Callgrind to count executed instructions, but the latency/throughput of SIMD instructions is not modeled.
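For reference, the harness itself is nothing special; it is essentially the shape sketched below, where bench_deserialize and parse_u64 are made-up stand-ins, since the real benchmark wraps the proprietary deserializer:

use iai::black_box;

// Stand-in for the proprietary hot loop the real benchmark exercises.
fn parse_u64(bytes: &[u8; 8]) -> u64 {
    u64::from_le_bytes(*bytes)
}

fn bench_deserialize() -> u64 {
    // black_box keeps input and output from being constant-folded away,
    // so Callgrind reports a stable instruction count per run.
    black_box(parse_u64(black_box(&[1u8; 8])))
}

iai::main!(bench_deserialize);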
With that out of the way - the regression itself is interesting. In the benchmark my code basically returns a large Result<FlatStructAllRequired, E> (where E has a layout compatible with std::io::Error) from a function.
pub struct FlatStructAllRequired {
    pub f1: f64,
    pub f2: f64,
    pub f3: f64,
    pub f4: f64,
    pub f5: f64,
    pub f6: f64,
    pub f7: f64,
    pub f8: f64,
    pub f9: f64,
    pub f10: f64,
}
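The function whose epilogue changed has roughly the shape below; read_record and its byte-parsing body are simplified stand-ins for the proprietary code (and in the real code the error type only mirrors std::io::Error's layout, it is not io::Error itself):

use std::io;

// Simplified stand-in for the hot deserialization step: the real code is
// proprietary, but the return shape is the same.
#[inline(never)]
fn read_record(buf: &[u8]) -> Result<FlatStructAllRequired, io::Error> {
    if buf.len() < 80 {
        return Err(io::Error::new(io::ErrorKind::UnexpectedEof, "short buffer"));
    }
    // Read the i-th little-endian f64 out of the buffer.
    let f = |i: usize| {
        let mut b = [0u8; 8];
        b.copy_from_slice(&buf[i * 8..(i + 1) * 8]);
        f64::from_le_bytes(b)
    };
    Ok(FlatStructAllRequired {
        f1: f(0), f2: f(1), f3: f(2), f4: f(3), f5: f(4),
        f6: f(5), f7: f(6), f8: f(7), f9: f(8), f10: f(9),
    })
}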
I went over the generated assembly, and one thing that stuck out was indeed 10 extra instructions at the end of the function in v1.79.
1.78
5ea43: c5 f8 28 85 00 ff ff vmovaps -0x100(%rbp),%xmm0
5ea4a: ff
5ea4b: c5 f9 14 85 10 ff ff vunpcklpd -0xf0(%rbp),%xmm0,%xmm0
5ea52: ff
5ea53: c5 f8 28 8d e0 fe ff vmovaps -0x120(%rbp),%xmm1
5ea5a: ff
5ea5b: c5 f1 14 8d f0 fe ff vunpcklpd -0x110(%rbp),%xmm1,%xmm1
5ea62: ff
5ea63: c4 e3 75 18 c0 01 vinsertf128 $0x1,%xmm0,%ymm1,%ymm0
5ea69: c5 f8 28 8d 40 ff ff vmovaps -0xc0(%rbp),%xmm1
5ea70: ff
5ea71: c5 f1 14 8d 50 ff ff vunpcklpd -0xb0(%rbp),%xmm1,%xmm1
5ea78: ff
5ea79: c5 f8 28 95 20 ff ff vmovaps -0xe0(%rbp),%xmm2
5ea80: ff
5ea81: c5 e9 14 95 30 ff ff vunpcklpd -0xd0(%rbp),%xmm2,%xmm2
5ea88: ff
5ea89: c4 e3 6d 18 c9 01 vinsertf128 $0x1,%xmm1,%ymm2,%ymm1
5ea8f: 4d 89 3e mov %r15,(%r14)
5ea92: 49 89 5e 08 mov %rbx,0x8(%r14)
5ea96: c4 c1 7c 11 46 10 vmovups %ymm0,0x10(%r14)
5ea9c: c4 c1 7c 11 4e 30 vmovups %ymm1,0x30(%r14)
5eaa2: 48 81 c4 08 01 00 00 add $0x108,%rsp
5eaa9: 5b pop %rbx
5eaaa: 41 5e pop %r14
5eaac: 41 5f pop %r15
5eaae: 5d pop %rbp
5eaaf: c5 f8 77 vzeroupper
5eab2: c3 ret
vs 1.79
5ceed: c5 fb 10 85 20 ff ff vmovsd -0xe0(%rbp),%xmm0
5cef4: ff
5cef5: c5 fb 11 85 78 ff ff vmovsd %xmm0,-0x88(%rbp)
5cefc: ff
5cefd: c5 fb 10 85 28 ff ff vmovsd -0xd8(%rbp),%xmm0
5cf04: ff
5cf05: c5 fb 11 45 80 vmovsd %xmm0,-0x80(%rbp)
5cf0a: c5 fb 10 85 30 ff ff vmovsd -0xd0(%rbp),%xmm0
5cf11: ff
5cf12: c5 fb 11 45 88 vmovsd %xmm0,-0x78(%rbp)
5cf17: c5 fb 10 85 38 ff ff vmovsd -0xc8(%rbp),%xmm0
5cf1e: ff
5cf1f: c5 fb 11 45 90 vmovsd %xmm0,-0x70(%rbp)
5cf24: c5 fb 10 85 40 ff ff vmovsd -0xc0(%rbp),%xmm0
5cf2b: ff
5cf2c: c5 fb 11 45 98 vmovsd %xmm0,-0x68(%rbp)
5cf31: c5 fb 10 85 48 ff ff vmovsd -0xb8(%rbp),%xmm0
5cf38: ff
5cf39: c5 fb 11 45 a0 vmovsd %xmm0,-0x60(%rbp)
5cf3e: c5 fb 10 85 50 ff ff vmovsd -0xb0(%rbp),%xmm0
5cf45: ff
5cf46: c5 fb 11 45 a8 vmovsd %xmm0,-0x58(%rbp)
5cf4b: c5 fb 10 85 58 ff ff vmovsd -0xa8(%rbp),%xmm0
5cf52: ff
5cf53: c5 fb 11 45 b0 vmovsd %xmm0,-0x50(%rbp)
5cf58: c5 fc 10 45 98 vmovups -0x68(%rbp),%ymm0
5cf5d: c5 fc 11 43 30 vmovups %ymm0,0x30(%rbx)
5cf62: c5 fc 10 45 88 vmovups -0x78(%rbp),%ymm0
5cf67: c5 fc 11 43 20 vmovups %ymm0,0x20(%rbx)
5cf6c: 48 8b 85 68 ff ff ff mov -0x98(%rbp),%rax
5cf73: 48 89 03 mov %rax,(%rbx)
5cf76: 48 8b 85 70 ff ff ff mov -0x90(%rbp),%rax
5cf7d: 48 89 43 08 mov %rax,0x8(%rbx)
5cf81: c5 f8 10 85 78 ff ff vmovups -0x88(%rbp),%xmm0
5cf88: ff
5cf89: c5 f8 11 43 10 vmovups %xmm0,0x10(%rbx)
5cf8e: 48 81 c4 c8 00 00 00 add $0xc8,%rsp
5cf95: 5b pop %rbx
5cf96: 41 5e pop %r14
5cf98: 41 5f pop %r15
5cf9a: 5d pop %rbp
5cf9b: c5 f8 77 vzeroupper
From this it seems that the regression is in the function epilogue, where the contents of the FlatStructAllRequired struct are written into the caller's stack. Rust 1.78 used some clever SIMD shuffles (vunpcklpd + vinsertf128 to build YMM registers) to do this in fewer instructions, while 1.79 first copies the f64 fields one at a time with vmovsd into another stack area before issuing the wide stores.
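For context on why the epilogue does these writes at all: Result<FlatStructAllRequired, io::Error> is far too big to come back in registers, so rustc returns it through a hidden out-pointer - that is the pointer held in %r14 in the 1.78 listing and in %rbx in the 1.79 one. A hypothetical caller of the read_record sketch from above looks like this:

// Hypothetical caller: rustc passes read_record a hidden pointer to a
// stack slot in this frame, and read_record's epilogue copies the ten
// f64 fields into it - that copy is exactly where the two listings differ.
fn sum_first_and_last(buf: &[u8]) -> f64 {
    match read_record(buf) {
        Ok(rec) => rec.f1 + rec.f10,
        Err(_) => f64::NAN,
    }
}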
I tried to reproduce this behavior with a simple example on godbolt but failed. From this I suspect there is something else going on - perhaps some field/stack variable reordering somewhere? This would be corroborated by the fact that the epilogue has a different rsp adjustment: 0x108 in v1.78 vs 0xc8 in 1.79.
Do you have any suggestions on how I can efficiently narrow this regression down so that I can submit it to the rust-lang/rust repo?