# A performance problem compared with Julia

Rust code:

```rust
fn calc_n(n: u64) {
    print!("N={},", n);
    let now = std::time::Instant::now();
    let mut ans = 0u64;
    let (mut r1, mut r2, mut r3, mut r4, mut r5, mut r6);
    for a1 in 1..=n >> 3 {
        r1 = n - a1;
        for a2 in a1..=r1 / 7 {
            r2 = r1 - a2;
            for a3 in a2..=r2 / 6 {
                r3 = r2 - a3;
                for a4 in a3..=r3 / 5 {
                    r4 = r3 - a4;
                    for a5 in a4..=r4 >> 2 {
                        r5 = r4 - a5;
                        for a6 in a5..=r5 / 3 {
                            r6 = r5 - a6;
                            for a7 in a6..=r6 >> 1 {
                                ans += a1 ^ a2 ^ a3 ^ a4 ^ a5 ^ a6 ^ a7 ^ (r6 - a7);
                            }
                        }
                    }
                }
            }
        }
    }
    println!("{}, cost={:?}", ans, now.elapsed());
}

fn main() {
    calc_n(100);
    calc_n(160);
    calc_n(300);
    calc_n(400);
    calc_n(500);
    calc_n(600);
}
```

Julia code:

```julia
function calcN(n::Int64)
    t1 = time()
    ans::Int64 = 0

    for a1 = 1:div(n, 8)
        r1 = n - a1
        for a2 = a1:div(r1, 7)
            r2 = r1 - a2
            for a3 = a2:div(r2, 6)
                r3 = r2 - a3
                for a4 = a3:div(r3, 5)
                    r4 = r3 - a4
                    for a5 = a4:div(r4, 4)
                        r5 = r4 - a5
                        for a6 = a5:div(r5, 3)
                            r6 = r5 - a6
                            for a7 = a6:div(r6, 2)
                                a8 = r6 - a7
                                ans += xor(xor(xor(a1, a2), xor(a3, a4)), xor(xor(a5, a6), xor(a7, a8)))
                            end
                        end
                    end
                end
            end
        end
    end
    println("n=$n, ans = $ans, cost = $(time() - t1)")
end
```

Using my laptop with an i7-8750H CPU, I found that Julia is about 5x faster than Rust.

```text
julia> calcN(100);
n=100, ans = 29892426, cost = 0.0006000995635986328

julia> calcN(160);
n=160, ans = 901994896, cost = 0.009427070617675781

julia> calcN(300);
n=300, ans = 109638857854, cost = 0.4203529357910156

julia> calcN(400);
n=400, ans = 1260273347474, cost = 2.5435290336608887

julia> calcN(500);
n=500, ans = 6722928203618, cost = 10.702456951141357

julia> calcN(600);
n=600, ans = 25125831176186, cost = 34.18349599838257
```
```text
$ rustc test.rs -O 2>/dev/null && ./test
N=100,29892426, cost=1.616831ms
N=160,901994896, cost=30.246651ms
N=300,109638857854, cost=1.737976853s
N=400,1260273347474, cost=11.636545741s
N=500,6722928203618, cost=51.702947067s
N=600,25125831176186, cost=177.460155807s
```

I asked the same question on a Chinese Rust forum, and one reply said that the code generated by Julia uses AVX2 instructions, which is why Julia is so much faster than Rust.
To the best of my knowledge, both Julia and Rust use LLVM as their backend.
Why can Julia generate much faster code than Rust?

Furthermore, suppose every instruction Julia emits used AVX2. Since the input of `calcN` is a 64-bit integer, AVX2 could only give a ~4x boost (four 64-bit lanes per 256-bit register).
But in fact 34.18349599838257 * 4 ≈ 136.7 < 177.460155807,
so AVX2 cannot be the whole reason for the performance gap.

Is this phenomenon worth filing an issue?

2 Likes

Does `-C target-cpu=native` help? Julia is a JIT language, so it will always optimize for the native CPU. Rust by default optimizes for SSE2 only.

1 Like
```text
$ rustc test-4.rs -O -C target-cpu=native && ./test-4
N=100,29892426, cost=1.731606ms
N=160,901994896, cost=31.866504ms
N=300,109638857854, cost=1.758067421s
N=400,1260273347474, cost=11.810247732s
```

Nothing changed.
Even with `-C target-feature=+avx2,+fma`, the Rust timings stay the same.

You could check rust.godbolt.org to see what assembly is generated, and whether Rust/LLVM generated something poorly.

The only fact I could find is that Rust does not generate AVX2 instructions.
Since I am not familiar with assembly, especially AVX assembly, I could not find anything else.

Now I'm preparing an issue about it, but I don't know whether this phenomenon is worth one.
Any suggestions?

Using `a1..(n>>3)+1` instead of `..=` in all the loops makes a dramatic improvement. I knew `..=` was considered inferior to `..` but the difference is way more than I expected.
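As a minimal sanity check (not the benchmark itself), the two range forms visit exactly the same values, so the rewrite cannot change the answer:

```rust
fn main() {
    let n: u64 = 100;
    // `1..=n >> 3` and `1..(n >> 3) + 1` produce the same sequence of values;
    // only the iterator machinery differs (`RangeInclusive` carries an extra
    // "exhausted" flag that `next()` must check on every call).
    let inclusive: Vec<u64> = (1..=n >> 3).collect();
    let exclusive: Vec<u64> = (1..(n >> 3) + 1).collect();
    assert_eq!(inclusive, exclusive);
    assert_eq!(inclusive.len(), 12); // 1..=12 for n = 100
}
```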

Original code with `..=` loops:

```text
N=100,29892426, cost=1.999233ms
N=160,901994896, cost=37.641195ms
N=300,109638857854, cost=1.119226351s
N=400,1260273347474, cost=7.54105234s
N=500,6722928203618, cost=33.758585613s
N=600,25125831176186, cost=115.323697372s
```

With `..`:

```text
N=100,29892426, cost=1.477499ms
N=160,901994896, cost=22.607924ms
N=300,109638857854, cost=491.372439ms
N=400,1260273347474, cost=2.960076654s
N=500,6722928203618, cost=12.473686557s
N=600,25125831176186, cost=41.312578888s
```
8 Likes

Amazing, so it does. Would never have expected that.

On my old Intel i7-2600K (8) @ 3.401GHz, Rust and Julia are much closer (after gkcjones's mod):

```text
julia> calcN(600)
n=600, ans = 25125831176186, cost = 61.74007701873779
```

vs

```text
$ cargo run --release
...
N=600,25125831176186, cost=70.8863923s
```

Yeah, `..=` has known performance problems compared to `..`.

1 Like

Curious story... this is the first time I'm hearing about this. Could you provide an explanation, or a link to a related discussion with an explanation of why this is the case?

3 Likes

"Considered inferior to" might not have been the best words to describe it! I've only been learning and using Rust for the past 6 months or so (it's a bit of a lock-down time-filler while the pubs are closed) and have read so many articles, URLO threads, r/rust threads, Stack Overflow, the Book of course, etc. that I've quite forgotten where most of the specifics I've absorbed came from. Somewhere along the way I got the impression that `..` should be preferred over `..=` where possible. I'll try to remember/search a bit.

As for this case, the only explanation I can see for such a big difference would be if `RangeInclusive::next()` does something that prevents loop vectorization, but I haven't looked at the assembly or the `Range*` source to confirm that. A quick look just now shows that `RangeInclusive` has an extra boolean field compared to plain `Range`, though.

3 Likes

Ah, that makes a lot of sense. So the actual problem is the use of `for` loops instead of something like `for_each`, not the use of `..=`.

Something like `0..=u8::MAX` has to store an extra `bool` recording whether it's depleted, since you can't have the lower bound run past the upper bound without overflow (and overflow would take you back to the initial `0..=u8::MAX` range). And the extra bool is checked on every call to `next`, which is what `for` loops desugar to.
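That extra field shows up in the type's size. A quick sketch (the exact byte counts assume the current layout of `RangeInclusive`, which stores start, end, and the exhausted flag):

```rust
use std::mem::size_of;
use std::ops::{Range, RangeInclusive};

fn main() {
    // Range<u8> is just { start, end }: two bytes.
    // RangeInclusive<u8> additionally stores the "exhausted" bool,
    // which next() has to test on every iteration.
    assert_eq!(size_of::<Range<u8>>(), 2);
    assert!(size_of::<RangeInclusive<u8>>() > size_of::<Range<u8>>());
}
```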

10 Likes

You should use `i64` instead of `u64` to compare the same thing between Rust and Julia.
The generated assembly is not the same for `i64` and `u64`.
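One concrete reason the assembly differs (a sketch of the general phenomenon, not taken from the linked output): for unsigned integers, division by a power of two is exactly a shift, but signed division rounds toward zero while an arithmetic shift rounds toward negative infinity, so `i64` division needs extra fix-up instructions:

```rust
fn main() {
    // Unsigned: division by 8 and a right shift by 3 agree everywhere,
    // so the compiler can lower `n / 8` to a single shift.
    let n: u64 = 1234;
    assert_eq!(n / 8, n >> 3);

    // Signed: the two disagree on negative values, so `m / 8` cannot
    // be a bare shift and requires extra rounding-correction code.
    let m: i64 = -15;
    assert_eq!(m / 8, -1); // integer division rounds toward zero
    assert_eq!(m >> 3, -2); // arithmetic shift rounds toward -infinity
}
```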

Source code that corroborates this:

1 Like

I decided to have a play with this this evening.

Findings so far:

1. There is no benefit to declaring `let (mut r1…` up front: a local immutable `let` in each loop's body is fine.
2. Using shift instead of divide offers no benefit. Modern C optimizers do that sort of thing just fine, and it seems that is true for Rust/LLVM too.
3. Using `i64` instead of `u64` is noticeably slower.
4. `(_..=_).for_each(_)` is actually slightly slower than a plain `for` loop in this case, so unfortunately that doesn't let `..=` off the hook. (But thanks for the explanation @steffahn.)
5. With `rustc -C opt-level=3 -C target-cpu=native -C codegen-units=1`, the plain `..` range version is only 2 seconds slower than Julia on my machine (Threadripper 2950X), but that's still 2 seconds too slow.
6. I miss the pub.
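For point 1, a hypothetical sketch of what the outer two loop levels look like with per-loop immutable bindings (the five deeper loops are replaced by a stand-in accumulation, so this is not the real benchmark):

```rust
// Sketch only: outermost two levels of the nest, with a fresh immutable
// binding per iteration instead of an outer `let (mut r1, ...)` tuple.
fn calc_sketch(n: u64) -> u64 {
    let mut ans = 0u64;
    for a1 in 1..n / 8 + 1 {
        let r1 = n - a1; // rebound each iteration; optimizes the same
        for a2 in a1..r1 / 7 + 1 {
            let r2 = r1 - a2;
            ans += r2; // stand-in for the five deeper loops
        }
    }
    ans
}

fn main() {
    // n = 8: a1 = 1, r1 = 7, a2 = 1, r2 = 6.
    assert_eq!(calc_sketch(8), 6);
}
```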
10 Likes

`-C lto` is another one you might want to try. Seems to result in a reliable speedup on my PC, at least.

LTO does seem to reproducibly knock another second off N=600, which surprises me given that the code is all primitive operations and what I assume are lightweight inlines.

Slower than the `for` loop with `..=`, or only slower than the `for` loop with `..`? If `(_..=_).for_each(_)` is slower than the `for _ in _..=_` loop, I'm still kind of curious why. Actually, I'm curious either way, unless it's only marginally slower than the `for _ in _.._` loop.

`(_..=_).for_each(_)` was slower than `for _ in (_..=_)`.

for loop: `N=600,25125831176186, cost=115.075504448s`
for_each: `N=600,25125831176186, cost=127.498682164s`

Maybe this is due to inlining failing: the `for_each` implementation (which uses this `try_fold` code internally) calls the closure it gets in two places (lines 725 and 731), which can prevent the whole thing from being inlined, because each nesting level would require duplicating the body, and at depth 7 a factor of 2^7 = 128 is probably too much. What happens if you turn only the innermost one or two loops into `for_each`? Not that this matters too much; I guess that using `for` with `..` everywhere is pretty much the cleanest option.
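The shape being described can be sketched like this (a deliberately simplified model, not the actual `core` source): the closure ends up with two call sites, so inlining seven nested levels of it would duplicate the loop body exponentially:

```rust
// Simplified model of inclusive-range internal iteration: the closure
// `f` is called from two places, which can defeat inlining when loops
// built on it are nested deeply.
fn for_each_sketch<F: FnMut(u64)>(start: u64, end: u64, mut f: F) {
    let mut i = start;
    while i < end {
        f(i); // call site 1: every element except the last
        i += 1;
    }
    if start <= end {
        f(end); // call site 2: the final element of the inclusive range
    }
}

fn main() {
    let mut sum = 0;
    for_each_sketch(1, 5, |x| sum += x);
    assert_eq!(sum, 15); // 1 + 2 + 3 + 4 + 5
}
```

An exclusive range needs no such final-element special case, so its loop has a single call site and inlines cleanly.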

6 Likes

Woah. Fastest yet. This beats both `..` and Julia.

```text
N=100,29892426, cost=1.118508ms
N=160,901994896, cost=17.40738ms
N=300,109638857854, cost=437.5456ms
N=400,1260273347474, cost=2.660773117s
N=500,6722928203618, cost=11.429060049s
N=600,25125831176186, cost=38.101639024s
```
4 Likes