Performance issue with C-array-like computation (2 times worse than naive Java)

Hi !

Here is the background: at work we have, internally, some Java code doing heavy computation on time series. Performance being critical, I decided to try Rust (note: I'm a Rust beginner). However, I can't achieve good performance with my code; it actually ended up being 2 times slower than our Java code (51s for Rust, 21s for Java), built with cargo --release. Now, there is no "trick" on the Java side, no parallelism, nothing: it is plain naive, basically the same algorithm as the one I used in Rust. I can share the Rust code, but not the Java one.

Here is the code, with the Cargo.toml at the beginning (Pastebin link).
You can test it with 50words/50words_TEST.csv from the UCR Time Series Classification/Clustering page (it's a big archive to download, sorry for that!), e.g. cargo run --release -- /path/to/50words_TEST.csv

The algorithm uses a pre-allocated mutable 2D array of doubles that I implemented with a *mut f64 (so it's a matrix, called m), which is why I'm doing index computation in the offset functions.
The heavy computation is done in compute_dtw: first the cell m[0,0] is filled in, then the first line and the first column, then the rest of the matrix. In an attempt to speed things up, I tried using raw pointers (I heard that there is bound checking on vectors? I'm not sure, I'm only starting with Rust...).
So, once the main computation starts, no heavy allocation is done (or so I believe, I might be wrong). I also tried to inline the helper functions. Still, I can't get good performance.
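The offset helper is roughly of this shape (just a sketch with made-up names; the real code is in the pastebin):

// Row-major indexing into the dim*dim scratch matrix m:
// cell (line, col) lives at offset dim*line + col.
#[inline]
unsafe fn at(m: *const f64, dim: isize, line: isize, col: isize) -> f64 {
    *m.offset(dim * line + col)
}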

Does someone know why? Or have any suggestions?

Thanks a lot !
Cheers
Mat


You're right, but there are unsafe methods on Vec for doing accesses without checks,
so there is no need for pointer arithmetic.
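Roughly like this (a sketch with made-up names, same row-major dim*line + col layout as the matrix above):

// Unchecked reads/writes into a flat &mut [f64] (or Vec<f64>), with no raw
// pointer arithmetic. The caller must guarantee that line and col are in
// bounds, exactly as with offset.
fn relax_cell(m: &mut [f64], dim: usize, line: usize, col: usize) {
    unsafe {
        let d11 = *m.get_unchecked(dim * (line - 1) + col - 1);
        let d01 = *m.get_unchecked(dim * line + col - 1);
        let d10 = *m.get_unchecked(dim * (line - 1) + col);
        *m.get_unchecked_mut(dim * line + col) = d11.min(d01).min(d10);
    }
}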

I've had only a cursory look at the code, so I might be wrong, but the culprit seems to be the iteration pattern of

            // --- Compute DTW
            for idx_col in 1..dim as isize {
                for idx_line in 1..dim as isize {
                    *m.offset(dim*idx_line+idx_col) = {
                        // Compute ancestors
                        let d11 = *m.offset(dim*(idx_line-1) + idx_col-1);
                        let d01 = *m.offset(dim*(idx_line)   + idx_col-1);
                        let d10 = *m.offset(dim*(idx_line-1) + idx_col);
                        // Take the smallest ancestor and add the current distance
                        (if d11<d01 { if d11<d10 {d11} else {d10} }
                         else { if d01<d10 {d01} else {d10} }) + self.square_dist(idx_line, idx_col, other)
                            // The next line actually call cmath
                            // d11.min(d01).min(d10) + self.square_dist(idx_line, idx_col, other)
                    };
                }
            }

More precisely, the problem is that the inner loop is over idx_line, which is always multiplied by dim. That means that on each iteration of the inner loop you get a cache miss.

I think if you just change the order of the for loops you'll get a massive speedup.

I also think that there's a fair amount of premature optimization here :) If I were you, I would start with a simple Vec<Vec<f64>> implementation to get a baseline and establish bottlenecks, before embarking on the unsafe journey :)
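Something along these lines, for example (just a sketch, assuming square_dist is a plain squared difference and both series have the same length):

// Naive DTW baseline with a Vec<Vec<f64>> matrix and checked indexing.
// Assumes a and b have the same length; square_dist is inlined as (x - y)^2.
fn dtw_baseline(a: &[f64], b: &[f64]) -> f64 {
    let dim = a.len();
    let dist = |i: usize, j: usize| (a[i] - b[j]) * (a[i] - b[j]);
    let mut m = vec![vec![0.0f64; dim]; dim];

    m[0][0] = dist(0, 0);
    for j in 1..dim {
        m[0][j] = m[0][j - 1] + dist(0, j);
    }
    for i in 1..dim {
        m[i][0] = m[i - 1][0] + dist(i, 0);
        for j in 1..dim {
            let best = m[i - 1][j - 1].min(m[i][j - 1]).min(m[i - 1][j]);
            m[i][j] = best + dist(i, j);
        }
    }
    m[dim - 1][dim - 1]
}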


It would be useful to check which order the Java version uses!


This archive requires a password, and to get the password you need to read the article.
Is this hidden advertisement for the article?

Hi and thanks for your replies!
@davemilter: Thanks, I'll check that!

@matklad: that's a good hint, but the code behaves badly for the cache mainly because, for a given cell, we are looking at both the previous column and the previous row. Switching the loops is actually more costly on my system...
I'm looking for C-like performance, hence the raw memory accesses, and I'm very puzzled as to why my code is so slow compared to Java (which indeed uses the loops in the other order).
Maybe Java is performing some loop transformation to suit the access pattern?
On the other hand, if in the loop I remove

(if d11<d01 { if d11<d10 {d11} else {d10} }
                         else { if d01<d10 {d01} else {d10} }) + self.square_dist(idx_line, idx_col, other)

and put

d11+d10+d01

the code runs in 24s (Java is at 21s...).

@Dushistov: Please be sure that I'm truly sorry about that. This is becoming a standard dataset in time series, and as I work with the data on my computer I forgot about this part. All the material is freely available online, but now I'm not sure what I can post (the link to the paper – that would be advertising the paper – or the password...) or not...

I guess I can link to GitHub... Here is a link to an issue containing the password. Please tell me if this is out of place and I'll delete the link.
https://github.com/NLeSC/mcfly/issues/7

Again, I'm truly sorry about that; I did not realize that it could be interpreted that way. Thank you for pointing it out, I'll be more careful in the future.


Hi! Are you running nightly, or stable? I've noticed quite a perf regression on nightly – 51-ish seconds vs. 43-ish on stable.

Well, you heard right! The [] operator on vecs and slices does indeed perform bounds checking by default (although LLVM usually does a good job of removing the checks). Anyway, there's a much better solution than using raw pointers here! There's a get_unchecked (and its _mut counterpart) method on slices – if you only care about removing bounds checks, get rid of all the raw pointers and just use those instead of offset :)

Also, you have a use-after-free inside main!

let working_area = vec![0 as f64; ts_size * ts_size].as_mut_ptr();

The vector you allocate here is dropped at the end of this statement! This means that the raw pointer points to deallocated memory. Fortunately, it seems that you don't overwrite the memory afterwards and it "just works". Anyway, you should just change it to:

let mut working_area = vec![0 as f64; ts_size * ts_size];
...
total_e += vi.compute_dtw(vj, &mut working_area);

Assuming that compute_dtw has this signature:

fn compute_dtw(&self, other: &Self, m : &mut [f64]) -> f64

Using raw pointers instead of Rust types makes the compiler forget all the important information about aliasing. Also, you should use iterators where you can! The .zip method is really well implemented on slices' iterators. Unfortunately, in your current algorithm you're writing and reading the same row at once, so you can't really benefit from iterators there :(.
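For the parts that don't have that dependency, the zip pattern looks roughly like this (a sketch, not code from your pastebin):

// Pointwise squared differences between two series, written with zipped
// slice iterators: no indices, and no bound checks generated.
fn squared_diffs(a: &[f64], b: &[f64], out: &mut [f64]) {
    for ((x, y), o) in a.iter().zip(b.iter()).zip(out.iter_mut()) {
        *o = (x - y) * (x - y);
    }
}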

Anyway, the no-raw-pointers version is still as "slow" as your original one. I don't really know why, since the main loop looks really tight here:

β”Œβ”€β†’movsd  -0x8(%rbp,%r14,8),%xmm0
β”‚  movsd  0x0(%rbp,%r14,8),%xmm1 
β”‚  lea    0x0(%rbp,%rcx,1),%rbp  
β”‚  movsd  -0x8(%rbp,%r14,8),%xmm2
β”‚  movapd %xmm0,%xmm3            
β”‚  minsd  %xmm1,%xmm3            
β”‚  cmplts %xmm2,%xmm0            
β”‚  minsd  %xmm1,%xmm2            
β”‚  andpd  %xmm0,%xmm3            
β”‚  andnpd %xmm2,%xmm0            
β”‚  orpd   %xmm3,%xmm0            
β”‚  movsd  (%rsi),%xmm1           
β”‚  subsd  (%rbx,%r14,8),%xmm1    
β”‚  mulsd  %xmm1,%xmm1            
β”‚  addsd  %xmm0,%xmm1            
β”‚  movsd  %xmm1,0x0(%rbp,%r14,8) 
β”‚  add    $0x8,%rsi              
β”‚  dec    %rdi                   
└──jne    2dd0                   

I don't know what tricks Java's JIT is pulling here if it's over 2× faster. Maybe it's a better choice of instructions? Maybe it just unrolls the whole loop? (It's 271 columns, right?) Maybe it's doing some vectorization (that would be quite magical)?

Are you 100% sure the Rust program is doing the same work as the Java one?


Edit: I forgot the quite obvious thing to check!
Setting the RUSTFLAGS environment variable to '-C target-cpu=native' makes the Rust code run in 22s on my machine (Skylake)!

The main loop now looks like this:

β”Œβ”€β†’vmovsd (%rax,%rdx,1),%xmm0     
β”‚  vmovsd (%rdx),%xmm1            
β”‚  vmovsd 0x8(%rdx),%xmm2         
β”‚  vminsd %xmm2,%xmm1,%xmm3       
β”‚  vminsd %xmm2,%xmm0,%xmm2       
β”‚  vcmplt %xmm0,%xmm1,%xmm0       
β”‚  vblend %xmm0,%xmm3,%xmm2,%xmm0 
β”‚  vmovsd (%rcx,%r10,8),%xmm1     
β”‚  vsubsd (%rsi,%r9,8),%xmm1,%xmm1
β”‚  vmulsd %xmm1,%xmm1,%xmm1       
β”‚  vaddsd %xmm1,%xmm0,%xmm0       
β”‚  vmovsd %xmm0,0x8(%rax,%rdx,1)  
β”‚  vmovsd (%rax,%rdx,1),%xmm1     
β”‚  vmovsd (%rbx,%rdx,1),%xmm2     
β”‚  vminsd %xmm0,%xmm1,%xmm3       
β”‚  vminsd %xmm0,%xmm2,%xmm0       
β”‚  vcmplt %xmm2,%xmm1,%xmm1       
β”‚  vblend %xmm1,%xmm3,%xmm0,%xmm0 
β”‚  vmovsd 0x8(%rcx,%r10,8),%xmm1  
β”‚  vsubsd (%rsi,%r9,8),%xmm1,%xmm1
β”‚  vmulsd %xmm1,%xmm1,%xmm1       
β”‚  vaddsd %xmm1,%xmm0,%xmm0       
β”‚  vmovsd %xmm0,0x8(%rbx,%rdx,1)  
β”‚  lea    (%rbx,%rdx,1),%rdx      
β”‚  add    $0x2,%r10               
β”‚  cmp    %r10,%r14               
└──jne    3470                    

Still no vectorization, but the loop is unrolled twice.


Looking at the previous row doesn't really change much here – it's still in a predictable place for the cache. I think that for larger matrices you'd start seeing cache misses. In your case, a single column of cache lines fits in L1 (270×64 B ≈ 17 kB) and the whole matrix almost fits in L2. That's why you're not seeing cache misses. Why the column-row order is slightly faster than row-column is quite an interesting question, though.

Does it use O(N^2) memory as well? At least in the Rust version it is easy to reduce the memory to O(N) by storing only two rows of the matrix.

The JavaVM is quite good at performing static and even dynamic unrolling, and this often improves the performance of numeric kernels. LLVM doesn't seem equally capable so far.

I've managed to make the code two times faster on my machine (60 vs 30 seconds) with the following changes:

  1. use vectors and bound checked indexing instead of offsets and pointers

  2. allocate temporary storage every time instead of managing a shared buffer

  3. rewrite algorithm to use O(N) memory

Not sure which one brought the most benefit, but it definitely was not O(N^2) -> O(N) alone. The code is here: https://gist.github.com/matklad/0d1912d3bcd4a8a8a5b3e867e7466881
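The two-row idea looks roughly like this (a sketch of the recurrence only, not the exact code from the gist; it assumes square_dist is a plain squared difference):

// O(N)-memory DTW: keep only the previous and the current row of the matrix
// and swap them after each line.
fn dtw_two_rows(a: &[f64], b: &[f64]) -> f64 {
    let dim = b.len();
    let dist = |x: f64, y: f64| (x - y) * (x - y);

    let mut prev = vec![0.0f64; dim];
    let mut curr = vec![0.0f64; dim];

    // First line of the matrix.
    prev[0] = dist(a[0], b[0]);
    for j in 1..dim {
        prev[j] = prev[j - 1] + dist(a[0], b[j]);
    }
    // Remaining lines, using only two rows.
    for i in 1..a.len() {
        curr[0] = prev[0] + dist(a[i], b[0]);
        for j in 1..dim {
            let best = prev[j - 1].min(curr[j - 1]).min(prev[j]);
            curr[j] = best + dist(a[i], b[j]);
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[dim - 1]
}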


I tried to vectorize some things, and simply by precomputing min(d10, d11) I've made it a little faster – improved from 22s to 17s (full code).

I feel there's a lot of potential to vectorize, but I couldn't manage to untangle the serial dependency here. I've also tried to precompute square_dist in a vectorized way, but it didn't bring any speedup.
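For reference, the min(d10, d11) precomputation is roughly this (a sketch, not the exact code behind the link above; dists stands for the square_dist values of the current line, and curr[0] is assumed to be filled in already):

// mins[j] = min(prev[j-1], prev[j]) has no serial dependency, so LLVM is free
// to vectorize or reorder it; the second loop only carries the dependency
// through curr[j-1] (the d01 term).
fn relax_line(prev: &[f64], curr: &mut [f64], mins: &mut [f64], dists: &[f64]) {
    for j in 1..prev.len() {
        mins[j] = prev[j - 1].min(prev[j]);
    }
    for j in 1..prev.len() {
        curr[j] = mins[j].min(curr[j - 1]) + dists[j];
    }
}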

Also, let's not forget that there's an opportunity for parallelization of the main loop (not the one in the algorithm)!
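With Rayon, parallelizing that outer loop would look roughly like this (a sketch with made-up names; dtw_pair stands in for your compute_dtw):

// Cargo.toml: rayon = "..."
use rayon::prelude::*;

// Placeholder for the real per-pair computation (compute_dtw); here it is just
// a pointwise squared distance so that the sketch compiles on its own.
fn dtw_pair(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

// Each task handles one series against all the others, so no mutable state is
// shared between threads; any scratch buffer can live inside the closure.
fn total_dtw(series: &[Vec<f64>]) -> f64 {
    series
        .par_iter()
        .map(|a| series.iter().map(|b| dtw_pair(a, b)).sum::<f64>())
        .sum()
}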

@matklad I've tried your code, and I too am seeing that it's faster, but the gains are smaller on my machine. The timings below are as printed by the program:

Original code (without -C target-cpu=native) – 43s.
Your code – 35s.
(from now on, everything with native)
Original code – 22s.
Your code – 21s.
Your code with slight vectorization – 19s.
Original with slight vectorization – 17s.

Fun fact: your code on current nightly takes 109s :) – the indexing seems not to be inlined (get_unchecked either).


Thanks guys, you are awesome!
Just by adding the compilation flag, it came down to 28s.

Thanks a lot for your code suggestions! We have optimized algorithms in our Java code base; my purpose here is mainly to test Rust as a possible replacement. Thus, I'm not looking for algorithmic changes, but rather for things I did wrong or missed (like the flag or the use-after-free). Anyway, many thanks :) It's nice to have a cool community here!

@krdln: I'm using stable rustc 1.15.1. Thank you for pointing me in the right direction with vector usage! I'm pretty confident that the Java code and the Rust one are doing the same thing (same loops, O(N^2) matrix, etc.). Yes, this example has 270 columns, but in general it can be way bigger (for example, check HandOutlines from the archive).

@matklad: I would have thought that allocating every time would be very costly! The Java version I'm comparing with is also O(N^2).

Any idea why nightly is so slow?

Edit:
@krdln: So the loop is unrolled, that's great. Do you know if we can have some control over this, e.g. to unroll more times, or whether LLVM will eventually vectorize it?


Sorry to be pedantic, but "JavaVM" is too generic - I'll assume you mean Oracle Hotspot :)

Hotspot will look at the dynamic trip count of a loop (via the profiling it collects) and then unroll by up to 16, IIRC. There's a command-line flag to change the unroll factor, but I'm guessing @HerrmannM would've mentioned it if it had been tweaked.

Also, Hotspot doesn't do loop interchange or anything of the sort, so I don't think it's doing anything "magical" here. There's an LLVM flag to change the unroll factor (right?); maybe that can be tweaked to see if it makes a difference. Barring that, you can try unrolling by 16 manually :).

In general, Hotspot doesn't (currently, as of the latest Java 8 release) do autovectorization of anything except array copies (i.e. no vector math, no vector logical ops, etc.).


Maybe there are some funky cache-associativity collisions there and the cache isn't fully utilized due to replacement? I guess running both versions under perf might shed some light.

Hi @vitalyd !

Sorry about that, I should have given this info from the beginning:

java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

Nothing tweaked here; Java is launched without any options.

I'm trying to use LLVM flags, but I can't pass them to rustc (see the Auto-Vectorization in LLVM documentation). This fails:

 export RUSTFLAGS="-C target-cpu=native -C llvm-args=-fslp-vectorize-aggressive"

I would also like to try the flags from the Auto-Vectorization in LLVM documentation, but I can't pass them to LLVM through rustc...

Out of curiosity, how easy would it be to parallelize that loop? Let's say I don't want to spawn a thread per iteration, but I would like to work with a pool of threads, with something like "parallel iterators". I see there are Rayon and simple_parallel; do you have any experience with either of those? Or a recommendation for a crate?

In my version with O(N) memory, allocation is dwarfed by the actual computation. Passing pre-allocated buffers does not affect performance at all, and neither does switching to unchecked indexing.

With O(N^2), allocation should matter, I guess.

What does the Java version use to store the 2D matrix? The same single flat array of doubles? Or does it allocate a double[][]?