Reading binary files in Rust is slower than Go on Windows?

Hey there folks, I hope you're doing well. While developing a little console application to pull some data out of some binary files, I noticed that reading binary data into memory was quite a lot faster in Go than in Rust on my Windows system.

I managed to reproduce the problem with a simple application in both languages using the 20 MB test file found at https://testfiledownload.com/ (although you may use any similarly sized file you wish). I used hyperfine to benchmark the various versions.

Here's the Go version:

package main

import (
	"fmt"
	"os"
)

func main() {
	for i := 0; i < 500; i++ {
		data, err := os.ReadFile("20MB.zip")
		if err != nil {
			panic(err)
		}
		fmt.Println(len(data))
	}
}

And the Rust version:

fn main() {
    for _ in 0..500 {
        let data = std::fs::read("20MB.zip").unwrap();
        println!("{}", data.len());
    }
}

And here are the benchmark results on Windows 11 Pro (times in seconds, mean over 10 hyperfine runs, everything running from the same drive):

4.022 s  Rust, GNU toolchain (--release)
3.979 s  Rust, MSVC toolchain (--release)
2.868 s  Go

Just for fun, I also tried with C++ using the following code:

#include <cstdio>
#include <fstream>
#include <vector>

int main()
{
    for (int i = 0; i < 500; i++)
    {
        std::ifstream stream("20MB.zip", std::ios::binary | std::ios::ate);

        std::streampos size = stream.tellg();
        stream.seekg(0, std::ios::beg);
        if (size == -1)
        {
            stream.close();
            return 1;
        }

        std::vector<char> data(size);
        stream.read(data.data(), size);
        stream.close();
        printf("%d\n", data.size());
    }
    return 0;
}

And using MinGW, it took 4.571 seconds, which confused me further, as I would have expected at least one of Rust or C++ to beat Go on this benchmark. :blush:

Would you happen to know why there's such a difference in performance for this task?

Thanks heaps
Fotis

Can you try it without the printing? Maybe add all the lengths together?

Sure, the following benchmarks don't print anything per iteration; they just total up the number of bytes read across the 500 iterations:

2.889 s  Go
3.969 s  Rust, MSVC toolchain (--release)

Sadly it appears to be the actual reading itself which is the cause. I am hoping to try and implement this directly using windows-sys just to see if that yields a different result.

All other thoughts are welcome of course :blush:

Cheers
Fotis

There does seem to be something seriously interesting going on in Go land, because even using the native Win32 APIs via windows-sys, I still end up with a similar result in Rust. Please forgive the panic calls; I just wanted to get something going quickly.

use std::ptr;

use widestring::u16cstr;
use windows_sys::Win32::{
    Foundation::{CloseHandle, GENERIC_READ, INVALID_HANDLE_VALUE},
    Storage::FileSystem::{
        CreateFileW, GetFileSize, ReadFile, FILE_SHARE_READ, INVALID_FILE_SIZE, OPEN_EXISTING,
    },
};

fn main() {
    let mut total = 0;
    for _ in 0..500 {
        let data = unsafe { read_file() };
        total += data.len();
    }
    println!("{}", total);
}

unsafe fn read_file() -> Vec<u8> {
    let filename = u16cstr!("20MB.zip");
    let handle = CreateFileW(
        filename.as_ptr(),
        GENERIC_READ,
        FILE_SHARE_READ,
        ptr::null(),
        OPEN_EXISTING,
        0,
        0,
    );
    if handle == INVALID_HANDLE_VALUE {
        panic!("unable to open file");
    }

    let size = GetFileSize(handle, ptr::null_mut());
    if size == INVALID_FILE_SIZE {
        panic!("unable to get file size");
    }

    let mut buffer = vec![0u8; size as usize];
    let mut num_bytes_read = 0u32;
    let result = ReadFile(
        handle,
        buffer.as_mut_ptr(),
        size,
        &mut num_bytes_read,
        ptr::null_mut(),
    );
    if result == 0 {
        panic!("unable to read bytes");
    }

    let result = CloseHandle(handle);
    if result == 0 {
        panic!("unable to close file");
    }

    buffer
}

And the resulting time taken was 3.914 seconds.

Out of interest, I created a very similar low-level implementation in Go as follows:

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	total := 0
	for i := 0; i < 500; i++ {
		data, err := ReadFile()
		if err != nil {
			panic(err)
		}
		total += len(data)
	}
	fmt.Println(total)
}

func ReadFile() ([]byte, error) {
	filename := "20MB.zip"

	filenameUTF16Ptr, err := syscall.UTF16PtrFromString(filename)
	if err != nil {
		panic(err)
	}

	handle, err := syscall.CreateFile(
		filenameUTF16Ptr,
		syscall.GENERIC_READ,
		syscall.FILE_SHARE_READ,
		nil,
		syscall.OPEN_EXISTING,
		0,
		0,
	)
	if err != nil {
		panic(err)
	}
	defer syscall.CloseHandle(handle)

	stat, err := os.Stat(filename)
	if err != nil {
		panic(err)
	}

	buffer := make([]byte, stat.Size())

	var done uint32
	err = syscall.ReadFile(handle, buffer, &done, nil)
	if err != nil {
		panic(err)
	}

	return buffer, nil
}

And again, this resulted in a faster runtime at 2.903 seconds.

Cheers
Fotis

This might have something to do with Go's garbage collection: the read-buffer deallocation could be deferred to after the loop (or may not even happen at all). But there's no way to tell for sure without looking at the machine code or profiling.
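
One crude way to see how much the per-iteration deallocation actually costs on the Rust side, without a full profiler, would be to time the read and the drop separately (an untested sketch, reusing the same 20MB.zip as above):

use std::time::{Duration, Instant};

fn main() {
    let mut reading = Duration::ZERO;
    let mut freeing = Duration::ZERO;
    for _ in 0..500 {
        let start = Instant::now();
        let data = std::fs::read("20MB.zip").unwrap();
        reading += start.elapsed();

        let start = Instant::now();
        drop(data); // the Vec is deallocated here instead of at the end of the scope
        freeing += start.elapsed();
    }
    println!("reading: {:?}, freeing: {:?}", reading, freeing);
}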

4 Likes

That does cut it in half for me:

use std::io::Read;

fn main() {
    let mut buf = Vec::new();
    for _ in 0..500 {
        let data = std::fs::File::open("20MB.zip").unwrap().read_to_end(&mut buf).unwrap();
        println!("{}", data);
        buf.clear();
    }
}

4 Likes

Ah that would make sense actually!

Amazing, I can confirm the same result here also. With this approach, the run completes in 2.236 seconds (on average over 10 runs).

And assuming I'm doing this correctly (I have never forced the GC to run in Go before), forcing garbage collection in Go shows the real result here.

package main

import (
	"fmt"
	"os"
	"runtime/debug"
)

func main() {
	for i := 0; i < 500; i++ {
		data, err := os.ReadFile("20MB.zip")
		if err != nil {
			panic(err)
		}
		fmt.Println(len(data))
		debug.FreeOSMemory()
	}
}

This results in a runtime of around 7.122 seconds.

I wouldn't recommend thinking of that as “real”. Garbage collectors are generally designed to run only as needed and do their work in batches; forcing GC every time in your loop, after only one allocation has been made, is therefore going to make things slower. It might also be running in an “ensure absolutely no garbage is left uncollected” mode that is even less efficient than the normal mode.

(I've heard that some games run GC every frame, because they care about “no long pauses” more than they care about throughput of non-GC work — but that's a different requirement.)

Of course, in this particular case, the approach of reusing a single buffer instead of reallocating is going to be superior, regardless of whether that allocation is managed by GC or an end-of-scope destructor (Drop implementation).
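
To make the buffer-reuse point concrete, here's a tiny sketch of the relevant standard Vec behaviour: clear() drops the contents but keeps the allocation, so with the reused buffer only the first read has to allocate the ~20 MB.

fn main() {
    let mut buf: Vec<u8> = Vec::new();
    buf.resize(20 * 1024 * 1024, 0); // stand-in for the first ~20 MB read
    let cap = buf.capacity();

    buf.clear(); // contents dropped, allocation kept
    assert_eq!(buf.len(), 0);
    assert_eq!(buf.capacity(), cap); // still ~20 MB, so the next read won't reallocate

    println!("capacity retained: {} bytes", buf.capacity());
}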

8 Likes

Thanks so much for the explanation @kpreid. And just wanted to say a huge thanks to everyone for their help. I'm extremely happy to put this mystery to bed. :blush:

Another thing you might try for fun is

fn main() {
    for _ in 0..500 {
        let data = std::fs::read("20MB.zip").unwrap();
        println!("{}", data.len());
        std::mem::forget(data); // <-- added
    }
}

This still allocates every time, but never deallocates.

(TBH, for short-lifetime CLI apps, never deallocating can be a legitimate strategy, since the OS will clean up after you anyway. It's a classic approach in competitive programming, for example, and trivially avoids double-free in C(++).)
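
If you want the "never free" intent to be more explicit than mem::forget, roughly the same thing can be written with Vec::leak (a small sketch along the same lines; like the forget version, it keeps every buffer alive until the process exits):

fn main() {
    for _ in 0..500 {
        // leak() consumes the Vec and never runs its destructor
        let data: &'static mut [u8] = std::fs::read("20MB.zip").unwrap().leak();
        println!("{}", data.len());
    }
}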

6 Likes

It's funny that you mention this. I was watching a talk just recently (I'm pretty sure it was about Zig) where one of the core devs said that this is a strategy they use to speed up short-running CLIs. :blush:

Edit: It was this talk https://www.youtube.com/watch?v=5_oqWE9otaE&t=3153s (timestamped link).

Cheers
Fotis