A performance problem compared with Julia

Rust code:

fn calc_n(n: u64) {
    print!("N={},", n);
    let now = std::time::Instant::now();
    let mut ans = 0u64;
    let (mut r1, mut r2, mut r3, mut r4, mut r5, mut r6);
    for a1 in 1..=(n >> 3) {
        r1 = n - a1;
        for a2 in a1..=r1 / 7 {
            r2 = r1 - a2;
            for a3 in a2..=r2 / 6 {
                r3 = r2 - a3;
                for a4 in a3..=r3 / 5 {
                    r4 = r3 - a4;
                    for a5 in a4..=(r4 >> 2) {
                        r5 = r4 - a5;
                        for a6 in a5..=r5 / 3 {
                            r6 = r5 - a6;
                            for a7 in a6..=(r6 >> 1) {
                                ans += a1 ^ a2 ^ a3 ^ a4 ^ a5 ^ a6 ^ a7 ^ (r6 - a7);
                            }
                        }
                    }
                }
            }
        }
    }
    println!("{}, cost={:?}", ans, now.elapsed());
}

fn main() {
    calc_n(100);
    calc_n(160);
    calc_n(300);
    calc_n(400);
    calc_n(500);
    calc_n(600);
}

Julia code:

function calcN(n::Int64)
    t1 = time()
    ans::Int64 = 0
   
    for a1=1:div(n, 8)
        r1 = n - a1
        for a2=a1:div(r1, 7)
            r2 = r1 - a2
            for a3=a2:div(r2, 6)
                r3 = r2 - a3
                for a4=a3:div(r3, 5)
                    r4 = r3 - a4
                    for a5=a4:div(r4, 4)
                        r5 = r4 - a5
                        for a6=a5:div(r5, 3)
                            r6 = r5 - a6
                            for a7=a6:div(r6, 2)
                                a8 = r6 - a7
                                ans += xor(xor(xor(a1, a2), xor(a3, a4)), xor(xor(a5, a6), xor(a7, a8)))
                            end
                        end
                    end
                end
            end
        end
    end
    println("n=$n, ans = $ans, cost = $(time() - t1)")
end

Using my laptop with an i7-8750H CPU, I found that Julia is 5x faster than Rust.

julia> calcN(100);
n=100, ans = 29892426, cost = 0.0006000995635986328

julia> calcN(160);
n=160, ans = 901994896, cost = 0.009427070617675781

julia> calcN(300);
n=300, ans = 109638857854, cost = 0.4203529357910156

julia> calcN(400);
n=400, ans = 1260273347474, cost = 2.5435290336608887

julia> calcN(500);
n=500, ans = 6722928203618, cost = 10.702456951141357

julia> calcN(600);
n=600, ans = 25125831176186, cost = 34.18349599838257

$ rustc test.rs -O 2>/dev/null && ./test
N=100,29892426, cost=1.616831ms
N=160,901994896, cost=30.246651ms
N=300,109638857854, cost=1.737976853s
N=400,1260273347474, cost=11.636545741s
N=500,6722928203618, cost=51.702947067s
N=600,25125831176186, cost=177.460155807s

I asked the same question on a Chinese Rust forum, and one reply said that the code generated by Julia uses AVX2 instructions, which is why Julia is much faster than Rust.
To the best of my knowledge, both Julia and Rust use LLVM as their backend.
Why can Julia generate much faster code than Rust?

Furthermore, let us suppose every instruction Julia emits is AVX2. Since the input of calcN is a 64-bit integer, AVX2's 256-bit registers could give at most a ~4x boost.
But in fact 34.18349599838257 × 4 < 177.460155807, so AVX2 cannot be the whole explanation for the performance gap.

Is this phenomenon worth filing an issue?


Does -C target-cpu=native help? Julia is a JIT language, so it will always optimize for the native CPU. Rust by default optimizes for SSE2 only.

$ rustc test-4.rs -O -C target-cpu=native && ./test-4
N=100,29892426, cost=1.731606ms
N=160,901994896, cost=31.866504ms
N=300,109638857854, cost=1.758067421s
N=400,1260273347474, cost=11.810247732s

Nothing changed.
Even with -C target-feature=+avx2,+fma, the Rust timings stay the same.

You could check rust.godbolt.org to see what assembly is generated, and whether Rust/LLVM generated something poorly.

The only fact I could find is that Rust does not generate AVX2 instructions.
Since I am not familiar with assembly, especially AVX assembly, I could not find anything more.

Now I'm preparing an issue about it, but I don't know whether this phenomenon is worth one.
Any suggestions?

Using a1..(n>>3)+1 instead of ..= in all the loops makes a dramatic improvement. I knew ..= was considered inferior to .., but the difference is far larger than I expected.
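For anyone wanting to try the change, here is a minimal sketch of the equivalence (the function names are mine, not from the benchmark): `lo..hi + 1` visits exactly the values of `lo..=hi`, as long as `hi + 1` does not overflow.

```rust
// Sketch: the half-open form visits the same values as the inclusive
// form (when `hi + 1` cannot overflow), but its iterator carries no
// extra "exhausted" flag.
fn sum_inclusive(lo: u64, hi: u64) -> u64 {
    let mut acc = 0;
    for x in lo..=hi {
        acc += x;
    }
    acc
}

fn sum_half_open(lo: u64, hi: u64) -> u64 {
    let mut acc = 0;
    for x in lo..hi + 1 {
        acc += x;
    }
    acc
}

fn main() {
    // Both loops agree on the result; only the generated code differs.
    assert_eq!(sum_inclusive(1, 100), sum_half_open(1, 100));
    println!("{}", sum_half_open(1, 100)); // 1 + 2 + ... + 100 = 5050
}
```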

Original code with ..= loops:

N=100,29892426, cost=1.999233ms
N=160,901994896, cost=37.641195ms
N=300,109638857854, cost=1.119226351s
N=400,1260273347474, cost=7.54105234s
N=500,6722928203618, cost=33.758585613s
N=600,25125831176186, cost=115.323697372s

With ..:

N=100,29892426, cost=1.477499ms
N=160,901994896, cost=22.607924ms
N=300,109638857854, cost=491.372439ms
N=400,1260273347474, cost=2.960076654s
N=500,6722928203618, cost=12.473686557s
N=600,25125831176186, cost=41.312578888s

Amazing, so it does. Would never have expected that.

On my old Intel i7-2600K (8) @ 3.401GHz, Rust and Julia are much closer (after gkcjones's mod):

julia> calcN(600)
n=600, ans = 25125831176186, cost = 61.74007701873779

vs

$ cargo run --release
...
N=600,25125831176186, cost=70.8863923s

Yeah, ..= has known performance problems compared to ..


Curious story... this is the first time I’m hearing about this. Could you provide an explanation, or a link to a related discussion with an explanation of why this is the case?


“Considered inferior to” might not have been the best words to describe it! I’ve only been learning and using Rust for the past six months or so (it’s a bit of a lock-down time-filler while the pubs are closed) and have read so many articles, URLO threads, r/rust threads, Stack Overflow posts, the Book of course, etc. that I’ve quite forgotten where most of the specifics I’ve absorbed came from. Somewhere along the way I got the impression that .. should be preferred over ..= where possible. I’ll try to remember/search a bit.

As for this case, the only explanation I can see for such a big difference would be if RangeInclusive::next() does something that prevents loop vectorization, but I haven’t looked at the assembly or the Range* source to confirm that. A quick look just now shows that RangeInclusive has an extra boolean field compared to plain Range, though.
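That extra field is easy to observe directly; exact sizes depend on the toolchain, so this sketch only checks the inequality:

```rust
use std::mem::size_of;
use std::ops::{Range, RangeInclusive};

fn main() {
    // Range<u64> is just { start, end }; RangeInclusive<u64> also has to
    // track whether iteration is finished, so it is strictly larger.
    println!("Range<u64>:          {} bytes", size_of::<Range<u64>>());
    println!("RangeInclusive<u64>: {} bytes", size_of::<RangeInclusive<u64>>());
    assert!(size_of::<RangeInclusive<u64>>() > size_of::<Range<u64>>());
}
```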


Ah, that makes a lot of sense. So the actual problem is the use of for loops instead of something like for_each, not the use of ..=.

Something like 0..=u8::MAX has to store an extra bool for whether it’s depleted, since you can’t have the lower bound run past the upper bound without overflow (and overflow would take you back to the initial 0..=u8::MAX range). And that extra bool is checked on every call to next, which is what for loops desugar to.
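A hand-rolled model of the two iterators (my own simplified sketch, not the actual std source) makes the difference concrete:

```rust
// Sketch of the state each iterator carries. Field and type names are
// invented for illustration.
struct SimpleRange {
    start: u64,
    end: u64, // models `start..end`
}

struct SimpleRangeInclusive {
    start: u64,
    end: u64,
    exhausted: bool, // models `start..=end`: the extra flag
}

impl Iterator for SimpleRange {
    type Item = u64;
    fn next(&mut self) -> Option<u64> {
        // One comparison per step; `start == end` marks the end.
        if self.start < self.end {
            let v = self.start;
            self.start += 1;
            Some(v)
        } else {
            None
        }
    }
}

impl Iterator for SimpleRangeInclusive {
    type Item = u64;
    fn next(&mut self) -> Option<u64> {
        // `start` can never pass `end` without overflowing (think
        // 0..=u64::MAX), so a separate flag is checked on every call.
        if self.exhausted {
            return None;
        }
        let v = self.start;
        if self.start == self.end {
            self.exhausted = true;
        } else {
            self.start += 1;
        }
        Some(v)
    }
}

fn main() {
    let half_open: u64 = SimpleRange { start: 1, end: 6 }.sum();
    let inclusive: u64 = SimpleRangeInclusive { start: 1, end: 5, exhausted: false }.sum();
    // Both yield 1, 2, 3, 4, 5, but the inclusive version pays an
    // extra branch on every `next` call.
    assert_eq!(half_open, inclusive);
    println!("{} {}", half_open, inclusive); // 15 15
}
```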


You should use i64 instead of u64 to compare the same thing between Rust and Julia, since Julia's Int64 is signed.
The generated assembly is not the same for i64 and u64.

Source code that corroborates this:


I decided to have a play with this this evening.

Findings so far:

  1. There is no benefit to using let (mut r1, …); a local immutable let in each loop body is fine.
  2. Using shift instead of divide offers no benefit. Modern C optimizers do that sort of thing just fine, and it seems that is true for Rust/LLVM too.
  3. Using i64 instead of u64 is noticeably slower.
  4. (_..=_).for_each(_) is actually slightly slower than a plain for loop in this case, so unfortunately that doesn’t let ..= off the hook. (But thanks for the explanation @steffahn.)
  5. With rustc -C opt-level=3 -C target-cpu=native -C codegen-units=1, the plain .. range version is only 2 seconds slower than Julia on my machine (Threadripper 2950X), but that’s still 2 seconds too slow :wink:
  6. I miss the pub.

-C lto is another one you might want to try. Seems to result in a reliable speedup on my PC, at least.

LTO does seem to reproducibly knock another second off N=600, which is surprising to me given that the code is all primitive operations and what I assume are lightweight inlines.

Slower than the for loop with ..=, or only slower than the for loop with ..? If (_..=_).for_each(_) is slower than the for _ in _..=_ loop, I’m still kind of curious why. Actually, I’m curious either way, unless it is only marginally slower than the for _ in _.._ loop.

(_..=_).for_each(_) was slower than for _ in (_..=_).

for loop: N=600,25125831176186, cost=115.075504448s,
for_each: N=600,25125831176186, cost=127.498682164s

Maybe this is due to inlining failing: the for_each implementation (using this try_fold code internally) calls the closure it gets in two places (lines 725 and 731), which can prohibit the whole thing from being inlined, because each nesting level would require duplicating the body, and at depth 7 that is a factor of 2^7 = 128, which is probably too much. What happens if you only turn the innermost or the two innermost loops into for_each? Not that this matters too much; I guess that using for with .. everywhere is pretty much the cleanest option.
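To make the suggestion concrete, here is a minimal two-level sketch (my own toy reduction, not the thread's eight-level benchmark) of converting only the innermost loop to for_each:

```rust
// Toy two-level reduction. Function names are invented for illustration.
fn calc_plain(n: u64) -> u64 {
    let mut ans = 0u64;
    for a1 in 1..n / 2 + 1 {
        let r1 = n - a1;
        for a2 in a1..r1 / 2 + 1 {
            ans += a1 ^ a2 ^ (r1 - a2);
        }
    }
    ans
}

fn calc_inner_for_each(n: u64) -> u64 {
    let mut ans = 0u64;
    for a1 in 1..n / 2 + 1 {
        let r1 = n - a1;
        // Only the innermost level goes through Iterator::for_each, so at
        // most one closure body would need to be duplicated when inlining.
        (a1..=r1 / 2).for_each(|a2| {
            ans += a1 ^ a2 ^ (r1 - a2);
        });
    }
    ans
}

fn main() {
    // The two variants must agree; only the generated code may differ.
    assert_eq!(calc_plain(100), calc_inner_for_each(100));
    println!("ok");
}
```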


Woah. Fastest yet. This beats both .. and Julia.

N=100,29892426, cost=1.118508ms
N=160,901994896, cost=17.40738ms
N=300,109638857854, cost=437.5456ms
N=400,1260273347474, cost=2.660773117s
N=500,6722928203618, cost=11.429060049s
N=600,25125831176186, cost=38.101639024s