Hi! Are you running nightly, or stable? I've noticed quite a perf regression on nightly: 51-ish seconds vs. 43-ish on stable.
Well, you heard well! The `[]` operator on vecs and slices does indeed perform bounds checking by default (though LLVM usually takes good care of removing the checks). Anyway, there's a much better solution than using raw pointers here! There's a `get_unchecked` (and a `get_unchecked_mut` counterpart) method on slices. If you only care about removing bounds checks, get rid of all the raw pointers and just use it instead of `offset`.
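To illustrate the difference, here is a minimal sketch (the function name and the summing loop are just an example, not your code): `get_unchecked` skips the bounds check but keeps the slice type, so the compiler retains length and aliasing information that a raw `offset` would throw away.

```rust
fn sum_unchecked(m: &[f64]) -> f64 {
    let mut acc = 0.0;
    for i in 0..m.len() {
        // Safety: `i` is always in bounds because it comes from `0..m.len()`.
        acc += unsafe { *m.get_unchecked(i) };
    }
    acc
}

fn main() {
    println!("{}", sum_unchecked(&[1.0, 2.0, 3.0])); // prints 6
}
```

(In this particular loop LLVM would elide the checks anyway, since the index is trivially bounded; `get_unchecked` matters in loops where it can't prove that.)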
Also, you have a use-after-free inside `main`!
let working_area = vec![0 as f64; ts_size * ts_size].as_mut_ptr();
The vector you allocate here is a temporary that is dropped at the end of this statement! That means the raw pointer points to deallocated memory. Fortunately, it seems that nothing overwrites that memory afterwards, so it "just works". Anyway, you should just change it to:
let mut working_area = vec![0 as f64; ts_size * ts_size];
...
total_e += vi.compute_dtw(vj, &mut working_area);
Assuming that `compute_dtw` has this signature:
fn compute_dtw(&self, other: &Self, m: &mut [f64]) -> f64
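For concreteness, here is a rough sketch of what a raw-pointer-free version could look like. This is not your implementation: the `Series` struct, the `data` field, and the use of the classic full-matrix DTW recurrence are all assumptions made for the example.

```rust
struct Series {
    data: Vec<f64>,
}

impl Series {
    /// Classic DTW with squared distance, using a full
    /// (n+1) x (m+1) cost matrix stored row-major in `work`.
    fn compute_dtw(&self, other: &Series, work: &mut [f64]) -> f64 {
        let n = self.data.len();
        let m = other.data.len();
        let w = m + 1;
        assert!(work.len() >= (n + 1) * w);
        for x in work.iter_mut() {
            *x = f64::INFINITY;
        }
        work[0] = 0.0;
        for i in 1..=n {
            for j in 1..=m {
                let d = self.data[i - 1] - other.data[j - 1];
                // Best of the three predecessors: up, left, diagonal.
                let best = work[(i - 1) * w + j]
                    .min(work[i * w + j - 1])
                    .min(work[(i - 1) * w + j - 1]);
                work[i * w + j] = d * d + best;
            }
        }
        work[n * w + m]
    }
}

fn main() {
    let a = Series { data: vec![0.0, 1.0, 2.0] };
    let b = Series { data: vec![0.0, 1.0, 2.0] };
    let mut work = vec![0.0; 16];
    println!("{}", a.compute_dtw(&b, &mut work)); // identical series -> 0
}
```

All the indices here are plain slice accesses, so the compiler keeps full aliasing information; swapping them for `get_unchecked` afterwards is a purely local change.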
Using raw pointers instead of Rust types makes the compiler forget all the important aliasing information. Also, you should use iterators where you can! The `.zip` method is really well implemented on slices' iterators. Unfortunately, in your current algorithm you're writing and reading the same row at once, so you can't really benefit from iterators here :(.
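As an aside, this squared-distance loop (just an illustration, not your inner loop) shows the kind of pattern where `.zip` shines: no explicit indexing at all, so there are no bounds checks to eliminate in the first place.

```rust
// Squared Euclidean distance between two slices via zipped iterators.
// The iterator stops at the shorter slice, so no index can go out of bounds.
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter()
        .zip(b.iter())
        .map(|(x, y)| (x - y) * (x - y))
        .sum()
}

fn main() {
    println!("{}", sq_dist(&[1.0, 2.0], &[3.0, 4.0])); // prints 8
}
```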
Anyway, the no-raw-pointers version is still as "slow" as your original one. I don't really know why, since the main loop looks really dense here:
┌──movsd -0x8(%rbp,%r14,8),%xmm0
│  movsd 0x0(%rbp,%r14,8),%xmm1
│  lea 0x0(%rbp,%rcx,1),%rbp
│  movsd -0x8(%rbp,%r14,8),%xmm2
│  movapd %xmm0,%xmm3
│  minsd %xmm1,%xmm3
│  cmpltsd %xmm2,%xmm0
│  minsd %xmm1,%xmm2
│  andpd %xmm0,%xmm3
│  andnpd %xmm2,%xmm0
│  orpd %xmm3,%xmm0
│  movsd (%rsi),%xmm1
│  subsd (%rbx,%r14,8),%xmm1
│  mulsd %xmm1,%xmm1
│  addsd %xmm0,%xmm1
│  movsd %xmm1,0x0(%rbp,%r14,8)
│  add $0x8,%rsi
│  dec %rdi
└──jne 2dd0
I don't know what tricks Java's JIT is pulling here if it's over 2× faster. Maybe it's a better choice of instructions? Maybe it just unrolls the whole loop? (it's 271 columns, right?) Maybe it's doing some vectorization (that would be quite magical)?
Are you 100% sure the Rust program is doing the same work as the Java one?
edit: I forgot the quite obvious thing to check! Setting the `RUSTFLAGS` environment variable to `-C target-cpu=native` makes the Rust code run in ~22 s on my machine (Skylake)!
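For reference, one way to pass the flag for a single build (assuming a cargo project; the flag itself is the only thing taken from the measurement above):

```shell
# Scope the env var to this one invocation so it doesn't leak
# into later builds:
RUSTFLAGS='-C target-cpu=native' cargo build --release
```

Note that `target-cpu=native` makes the binary non-portable: it may use instructions (e.g. AVX) that older CPUs don't have.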
The main loop now looks like this:
┌──vmovsd (%rax,%rdx,1),%xmm0
│  vmovsd (%rdx),%xmm1
│  vmovsd 0x8(%rdx),%xmm2
│  vminsd %xmm2,%xmm1,%xmm3
│  vminsd %xmm2,%xmm0,%xmm2
│  vcmpltsd %xmm0,%xmm1,%xmm0
│  vblendvpd %xmm0,%xmm3,%xmm2,%xmm0
│  vmovsd (%rcx,%r10,8),%xmm1
│  vsubsd (%rsi,%r9,8),%xmm1,%xmm1
│  vmulsd %xmm1,%xmm1,%xmm1
│  vaddsd %xmm1,%xmm0,%xmm0
│  vmovsd %xmm0,0x8(%rax,%rdx,1)
│  vmovsd (%rax,%rdx,1),%xmm1
│  vmovsd (%rbx,%rdx,1),%xmm2
│  vminsd %xmm0,%xmm1,%xmm3
│  vminsd %xmm0,%xmm2,%xmm0
│  vcmpltsd %xmm2,%xmm1,%xmm1
│  vblendvpd %xmm1,%xmm3,%xmm0,%xmm0
│  vmovsd 0x8(%rcx,%r10,8),%xmm1
│  vsubsd (%rsi,%r9,8),%xmm1,%xmm1
│  vmulsd %xmm1,%xmm1,%xmm1
│  vaddsd %xmm1,%xmm0,%xmm0
│  vmovsd %xmm0,0x8(%rbx,%rdx,1)
│  lea (%rbx,%rdx,1),%rdx
│  add $0x2,%r10
│  cmp %r10,%r14
└──jne 3470
Still no vectorization, but the loop is unrolled twice.