Need help with I/O bottleneck: slower than JVM?

I was micro bench-marking these two codes and I was shocked to see Rust actually get beaten by Kotlin/JVM. I have buffered the stdout, put a lock in before the print loop and kept the overhead low. Where is it falling behind...

use std::io::Write;

fn main() {
    let stdout = std::io::stdout();
    let mut lock = stdout.lock();
    for i in 0..100_000_000 {
        write!(lock, "Hello, {}\t", i).unwrap();
    }
}
// build with:
// kotlinc program.kt -include-runtime -d program.jar
// run with:
// java -jar program.jar
import java.io.BufferedOutputStream

fun main() {
    val file = BufferedOutputStream(System.out, 8192)
    (0..100_000_000).forEach { file.write("Hello, $it\t".toByteArray()) }
}

I then check with hyperfine, Kotlin on JVM ends up beating Rust:

$ hyperfine --warmup 1 './rust/target/release/rust' 'java -jar program.j
ar'
Benchmark #1: ./rust/target/release/rust
  Time (mean ± σ):      5.390 s ±  0.077 s    [User: 5.361 s, System: 0.024 s]
  Range (min … max):    5.326 s …  5.591 s    10 runs
 
Benchmark #2: java -jar program.jar
  Time (mean ± σ):      4.712 s ±  0.032 s    [User: 5.151 s, System: 0.172 s]
  Range (min … max):    4.665 s …  4.770 s    10 runs
 
Summary
  'java -jar program.jar' ran
    1.14 ± 0.02 times faster than './rust/target/release/rust'
1 Like

stdout is already buffered, FYI. Adding another layer is just increasing the work done.

Okay I didn't notice that, I was messing around and it got left out unnoticed. It still does not improve any performance though. It's still an 8k buffer.

use std::io::Write;

fn main() {
    let stdout = std::io::stdout();
    let mut lock = stdout.lock();
    for i in 0..100_000_000 {
        lock.write_fmt(format_args!("Hello, {}\t", i)).unwrap();
    }
}

I don't know if it makes a difference but in the Rust code you're using "\t" but in Kotlin you're using "\n".

3 Likes

Also shouldn't matter, but that's usually written

write!(lock, "Hello, {}\t", i).unwrap();

Missed that too, updated the code in the OP

Format is not designed for speed. Use itoa cate:

use std::io::Write;

fn main() {
    let stdout = std::io::stdout();
    let mut lock = stdout.lock();
    for i in 0..10_000_000 {
        lock.write(b"Hello, ");
        itoa::write(&mut lock, i);
    }
}
4 Likes

Can you redirect the output of both programs to eliminate any shell overhead? Then, compare the two files to make sure that they produce the same output.

1 Like
$ hyperfine --warmup 1 './rust/target/release/rust >> /dev/null' 'java -
jar program.jar >> /dev/null' -r 3
Benchmark #1: ./rust/target/release/rust >> /dev/null
  Time (mean ± σ):      5.285 s ±  0.063 s    [User: 5.256 s, System: 0.023 s]
  Range (min … max):    5.219 s …  5.344 s    3 runs
 
Benchmark #2: java -jar program.jar >> /dev/null
  Time (mean ± σ):      4.677 s ±  0.048 s    [User: 5.136 s, System: 0.150 s]
  Range (min … max):    4.622 s …  4.710 s    3 runs
 
Summary
  'java -jar program.jar >> /dev/null' ran
    1.13 ± 0.02 times faster than './rust/target/release/rust >> /dev/null'

For note, of the original versions, if you just remove ....unwrap(); and replace it with let _ = ...; then locally the Rust version is faster than kotlin/java's by a little bit (4% faster). Your kotlin/java code is not doing error checks where the unwrap in the rust is doing error checks. Should compare the same code on both.

EDIT: And for further comparison, if I remove the formatting code in both to just print static strings for both then they both run in a fraction of the time, so it doesn't seem they are bottlenecked on the stdout, but rather the integer-to-string formatting for them both (and rusts is still faster with the error check removed).

1 Like

I removed the unwrap in favor of let _ = write!(lock, "Hello, {}\t", i);, JVM remains 1.11x faster.

So the bottleneck is the conversion

$ hyperfine --warmup 1 './rust/target/release/rust' 'java -jar program.j
ar' -r 3
Benchmark #1: ./rust/target/release/rust
  Time (mean ± σ):      5.216 s ±  0.095 s    [User: 5.188 s, System: 0.024 s]
  Range (min … max):    5.118 s …  5.307 s    3 runs
 
Benchmark #2: java -jar program.jar
  Time (mean ± σ):      4.699 s ±  0.101 s    [User: 5.196 s, System: 0.170 s]
  Range (min … max):    4.606 s …  4.806 s    3 runs
 
Summary
  'java -jar program.jar' ran
    1.11 ± 0.03 times faster than './rust/target/release/rust'

I'm curious, what if you put the buffer back in the rust code? Right now you are buffering in java/kotlin when stdout is already buffered there too, and adding the extra buffer might be what's helping since it's basically able to buffer twice as much? I just tried removing the buffer on the rust side away from your original code and it's slower than the java/kotlin now.

1 Like

Ok that helped.
Loos like the conversion was indeed the bottleneck

As suggested, I used the itoa crate (why is it not in the std yet?)

use std::io::Write;

fn main() {
    let stdout = std::io::stdout();
    let lock = stdout.lock();
    let mut buf = std::io::BufWriter::new(lock);
    for i in 0..100_000_000 {
        let _ = buf.write(b"Hello, ");
        let _ = itoa::write(&mut buf, i);
        let _ = buf.write(b"\t");
    }
}
$ hyperfine --warmup 1 './rust/target/release/rust' 'java -jar program.j
ar' -r 3
Benchmark #1: ./rust/target/release/rust
  Time (mean ± σ):      2.416 s ±  0.047 s    [User: 2.389 s, System: 0.025 s]
  Range (min … max):    2.362 s …  2.446 s    3 runs
 
Benchmark #2: java -jar program.jar
  Time (mean ± σ):      4.640 s ±  0.023 s    [User: 5.098 s, System: 0.147 s]
  Range (min … max):    4.618 s …  4.663 s    3 runs
 
Summary
  './rust/target/release/rust' ran
    1.92 ± 0.04 times faster than 'java -jar program.jar'
8 Likes

Manually unroll the loop helps as well.

use std::io::Write;

fn main() {
    let stdout = std::io::stdout();
    let lock = stdout.lock();
    let mut buf = std::io::BufWriter::new(lock);
    let _ = buf.write(b"Hello, ");
    let _ = itoa::write(&mut buf, 0);
    for i in 1..100_000_000 {
        let _ = buf.write(b"\tHello, ");
        let _ = itoa::write(&mut buf, i);
    }
    let _ = buf.write(b"\t");
}
> hyperfine -w 1 -r 3 target/release/rust{,-unrolled}
Benchmark #1: target/release/rust
  Time (mean ± σ):      3.155 s ±  0.002 s    [User: 3.094 s, System: 0.055 s]
  Range (min … max):    3.153 s …  3.157 s    3 runs

Benchmark #2: target/release/rust-unrolled
  Time (mean ± σ):      2.804 s ±  0.093 s    [User: 2.732 s, System: 0.063 s]
  Range (min … max):    2.697 s …  2.860 s    3 runs

Summary
  'target/release/rust-unrolled' ran
    1.12 ± 0.04 times faster than 'target/release/rust'
1 Like