I was micro-benchmarking these two programs and was shocked to see Rust actually get beaten by Kotlin/JVM. I buffered stdout, took the lock before the print loop, and kept the overhead low. Where is it falling behind...
use std::io::Write;

fn main() {
    let stdout = std::io::stdout();
    let mut lock = stdout.lock();
    for i in 0..100_000_000 {
        write!(lock, "Hello, {}\t", i).unwrap();
    }
}
Okay, I hadn't noticed that; I was messing around and it got left out. Putting it back still doesn't improve performance, though. It's still an 8 KB buffer.
use std::io::Write;

fn main() {
    let stdout = std::io::stdout();
    let mut lock = stdout.lock();
    for i in 0..100_000_000 {
        lock.write_fmt(format_args!("Hello, {}\t", i)).unwrap();
    }
}
use std::io::Write;

fn main() {
    let stdout = std::io::stdout();
    let mut lock = stdout.lock();
    for i in 0..10_000_000 {
        lock.write(b"Hello, ");
        itoa::write(&mut lock, i);
    }
}
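Side note: itoa::write is the old itoa 0.4 API; on itoa 1.x the same thing goes through itoa::Buffer, whose format() hands back a &str. A rough 1.x equivalent of the snippet above:

use std::io::Write;

fn main() {
    let stdout = std::io::stdout();
    let mut lock = stdout.lock();
    // itoa 1.x: a reusable stack buffer; format() returns a &str slice into it.
    let mut itoa_buf = itoa::Buffer::new();
    for i in 0..10_000_000 {
        let _ = lock.write_all(b"Hello, ");
        let _ = lock.write_all(itoa_buf.format(i).as_bytes());
    }
}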
Can you redirect the output of both programs to a file to eliminate any terminal overhead? Then compare the two files to make sure they produce the same output.
For the record, with the original versions, if you just replace the ....unwrap(); with let _ = ...;, then locally the Rust version is slightly faster than the Kotlin/Java one (about 4%). Your Kotlin/Java code does no error checking, while the unwrap in the Rust version does, so you should compare the same code on both.
EDIT: For further comparison, if I remove the formatting code in both and just print static strings, they both run in a fraction of the time. So they don't seem to be bottlenecked on stdout, but on the integer-to-string formatting (and Rust is still faster with the error check removed).
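For reference, a sketch of that swap applied to the original loop:

use std::io::Write;

fn main() {
    let stdout = std::io::stdout();
    let mut lock = stdout.lock();
    for i in 0..100_000_000 {
        // Discard the io::Result instead of unwrapping it, so the loop does
        // no error checking, matching the Kotlin/JVM side.
        let _ = write!(lock, "Hello, {}\t", i);
    }
}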
I'm curious, what if you put the buffer back in the Rust code? Right now you're buffering in Java/Kotlin even though stdout is already buffered there, and that extra buffer might be what's helping, since it can effectively buffer twice as much. I just tried removing the buffer from the Rust side of your original code and it's slower than the Java/Kotlin version now.
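A sketch of what "putting the buffer back" could look like, here with an explicitly larger capacity than the 8 KB default mentioned above (the 64 KiB is arbitrary):

use std::io::{BufWriter, Write};

fn main() {
    let stdout = std::io::stdout();
    let lock = stdout.lock();
    // Wrap the lock in a BufWriter; 64 KiB instead of the 8 KiB default.
    let mut out = BufWriter::with_capacity(64 * 1024, lock);
    for i in 0..100_000_000 {
        write!(out, "Hello, {}\t", i).unwrap();
    }
}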
Ok, that helped.
Looks like the conversion was indeed the bottleneck.
As suggested, I used the itoa crate (why is it not in std yet?).
use std::io::Write;

fn main() {
    let stdout = std::io::stdout();
    let lock = stdout.lock();
    let mut buf = std::io::BufWriter::new(lock);
    for i in 0..100_000_000 {
        let _ = buf.write(b"Hello, ");
        let _ = itoa::write(&mut buf, i);
        let _ = buf.write(b"\t");
    }
}
$ hyperfine --warmup 1 './rust/target/release/rust' 'java -jar program.jar' -r 3
Benchmark #1: ./rust/target/release/rust
  Time (mean ± σ):      2.416 s ±  0.047 s    [User: 2.389 s, System: 0.025 s]
  Range (min … max):    2.362 s …  2.446 s    3 runs

Benchmark #2: java -jar program.jar
  Time (mean ± σ):      4.640 s ±  0.023 s    [User: 5.098 s, System: 0.147 s]
  Range (min … max):    4.618 s …  4.663 s    3 runs

Summary
  './rust/target/release/rust' ran
    1.92 ± 0.04 times faster than 'java -jar program.jar'
For comparison, an unrolled variant that folds the tab separator into the next write, so the loop body does two writes per iteration instead of three:

use std::io::Write;

fn main() {
    let stdout = std::io::stdout();
    let lock = stdout.lock();
    let mut buf = std::io::BufWriter::new(lock);
    let _ = buf.write(b"Hello, ");
    let _ = itoa::write(&mut buf, 0);
    for i in 1..100_000_000 {
        let _ = buf.write(b"\tHello, ");
        let _ = itoa::write(&mut buf, i);
    }
    let _ = buf.write(b"\t");
}
> hyperfine -w 1 -r 3 target/release/rust{,-unrolled}
Benchmark #1: target/release/rust
  Time (mean ± σ):      3.155 s ±  0.002 s    [User: 3.094 s, System: 0.055 s]
  Range (min … max):    3.153 s …  3.157 s    3 runs

Benchmark #2: target/release/rust-unrolled
  Time (mean ± σ):      2.804 s ±  0.093 s    [User: 2.732 s, System: 0.063 s]
  Range (min … max):    2.697 s …  2.860 s    3 runs

Summary
  'target/release/rust-unrolled' ran
    1.12 ± 0.04 times faster than 'target/release/rust'