Rust flate2 seems slower than Elixir/Erlang zlib

Hello.

I have been comparing a "gunzip" function between Elixir/Erlang and Rust, using zlib for Elixir and flate2 for Rust, with Rustler converting between Elixir and Rust data types. In previous comparisons Rust has usually worked out faster locally.

Comparing gunzip between zlib and flate2, Rust seems to be 3 times slower, though it uses less memory. I was expecting it to be much quicker, or at least comparable. The Elixir version manages over 1 million iterations per second, but Rust only a third of that.

Name                    ips        average  deviation         median         99th %
:zlib.gunzip         1.07 M        0.94 μs  ±2580.31%        0.67 μs        1.29 μs
Zlibrs.gunzip        0.34 M        2.92 μs   ±459.74%        2.83 μs        3.75 μs

Comparison: 
:zlib.gunzip         1.07 M
Zlibrs.gunzip        0.34 M - 3.12x slower +1.98 μs

The code is here:

use flate2::read::GzDecoder;
use std::io::Read;

#[rustler::nif]
fn gunzip(payload: Vec<u8>) -> String {
    let mut gz = GzDecoder::new(&payload[..]);
    let mut s = String::new();
    gz.read_to_string(&mut s).unwrap();
    s
}

rustler::init!("Elixir.Zlibrs");

I see I can specify different backends. When using "zlib-ng" I can get to just below one million ips, but it is still 1.16x slower than the Elixir version. If this is the best I can get then fine, but the default "miniz_oxide" backend is written in Rust, so I would have thought it would be quicker.
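For reference, this is roughly how a backend is selected, assuming a recent flate2 release that exposes the `zlib-ng` feature (the crate's feature names are the part to double-check against its docs):

```toml
# Cargo.toml — use the zlib-ng backend instead of the default miniz_oxide
[dependencies]
flate2 = { version = "1", features = ["zlib-ng"], default-features = false }
```

Disabling default features matters here; otherwise the default Rust backend stays enabled alongside the C one.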

Is there anything I could do to improve the performance? Any suggestions are appreciated.

Two things:

  1. Are you compiling with optimizations? cargo run compiles in debug mode by default; you want cargo run --release for benchmarking.
  2. FFI (de-)serialization takes time. You should pass very large strings so that the actual decompression time dominates the total elapsed time. Better yet, only count the difference between decompressing a large payload and decompressing an empty one (the latter will consist almost entirely of the FFI conversion overhead).

Yes. It looks to be compiling in release mode.

Compiling crate zlibrs in release mode (native/zlibrs)
Finished `release` profile [optimized] target(s) in 6.84s

I am passing a short string, but it is the same for both versions being compared. I could try a larger string, as that might show more of a difference the other way, but I would have expected the same outcome, perhaps a bit slower for both versions.

Try using flate2::bufread::GzDecoder. The read version wraps the compressed slice in a BufReader, causing an additional copy of the data.

Generally you should prefer the bufread version if your input data implements BufRead.


I managed to get the Rust version slightly quicker than Elixir using bufread::GzDecoder. When increasing the content length to 100 kB of random text, both versions are slower, as expected, but the Rust version is then nearly 4 times slower.

Name                    ips        average  deviation         median         99th %
:zlib.gunzip         6.12 K      163.33 μs     ±6.55%      160.13 μs      197.79 μs
Zlibrs.gunzip        1.65 K      607.41 μs     ±2.27%      602.42 μs      654.75 μs

Comparison: 
:zlib.gunzip         6.12 K
Zlibrs.gunzip        1.65 K - 3.72x slower +444.09 μs

You should not expect the flate2 crate to be faster than a C implementation of zlib. Rust is not magic. The C zlib implementations are very mature, and many of them include fine-tuned assembly.

The flate2 crate uses miniz_oxide by default, which is a basic implementation, and isn't performance-oriented. It has known inefficiencies caused by its naive safe-Rust implementation.

The flate2 crate allows picking other zlib backends, which are just different flavors of C+asm zlib implementations, so at best you get the same speed as if you had linked to the same C library directly.

Also, --release is not enough for maximum performance here. You should set -C target-cpu (via RUSTFLAGS) to get at least some autovectorized SIMD.


Thank you for the pointers. I was thinking "zlib-ng" or "zlib" was equivalent to the system implementation and would have the same performance, maybe not better but at least comparable.

I will look at the flags for target-cpu to see if that improves things. I was thinking the release build would default to using all the CPUs, but I will take a look at the options.

The target-cpu flag has nothing to do with the number of cores, and the compiler can't force programs to leverage multiple cores.

It tells the compiler to use newer CPU-specific instructions that are not compatible with older CPUs (such as AVX extensions on x86-64) and tunes selection of instructions based on their performance on the particular CPU architecture (some instructions are fast on some CPU models and very slow on other models).

By default, compilers maximise compatibility so that programs work on the oldest CPUs, and therefore avoid newer instructions that could be faster.

In the special case of target-cpu=native, it will compile the program to work only on the specific CPU model in your computer.
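One way to apply this (assuming the binary only ever runs on the machine that builds it, as in a local benchmark) is a Cargo config entry rather than setting RUSTFLAGS by hand each time:

```toml
# .cargo/config.toml — allow instructions specific to the build machine's CPU
[build]
rustflags = ["-C", "target-cpu=native"]
```

The equivalent one-off form is RUSTFLAGS="-C target-cpu=native" cargo build --release.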

