Std::fs::read slow?

I was benchmarking some code and noticed that std::fs::read was 2x slower than read() in Julia.

I wrote a terrible alternative that seems to work (besides leaking memory, and maybe being broken in other unexpected ways), but it's approximately 2x faster:

#![feature(allocator_api)]
#![feature(ptr_as_uninit)]

use std::{alloc::{Layout, Allocator}, time::Instant};

use nix::{fcntl::OFlag, sys::stat::Mode};
use rand::{Rng, distributions::Uniform};

fn main() {
    let mut rng = rand::thread_rng();
    let mut s = Vec::<u8>::new();
    for _ in 0..400_000_000 {
        s.push(rng.sample(Uniform::new(0, 26)) + 'a' as u8);
    }
    let path = "testfile";
    std::fs::write(path, s).unwrap();

    let tic = Instant::now();
    let f = std::fs::File::open(path).unwrap();
    let n = f.metadata().unwrap().len() as usize;
    let a = std::alloc::System;
    let layout = Layout::from_size_align(n, 32).unwrap();
    let v = a.allocate(layout).unwrap();
    let fd = nix::fcntl::open(path, OFlag::O_RDONLY, Mode::empty()).unwrap();
    let v1 = unsafe { std::mem::transmute::<_, &mut [u8]>(v.as_uninit_slice_mut()) };
    let x = nix::unistd::read(fd, v1).unwrap(); 
    assert_eq!(x, n); // I know this can fail
    println!("Terrible hack: {:10}µs", tic.elapsed().as_micros());

    let tic = Instant::now();
    let v2 = std::fs::read(path).unwrap();
    println!("std::fs::read: {:10}µs", tic.elapsed().as_micros());

    assert_eq!(v1, v2);
}

Output:

Terrible hack:      67359µs
std::fs::read:     142929µs

It simply allocates the necessary memory and then does a read syscall.

Where does all the additional overhead come from in std::fs::read? Could it be improved?
Maybe the buffer extension in the default_read_to_end used by std::fs::read is too conservative?

2 Likes

Interesting question! Let me start by posting the numbers I get on my machine for the code you provided:

$ rustc +nightly --version
rustc 1.67.0-nightly (1eb62b123 2022-11-27)
$ cargo +nightly run --release
   Compiling read_test v0.1.0 (/tmp/read_test)
    Finished release [optimized + debuginfo] target(s) in 0.42s
     Running `target/release/read_test`
Terrible hack:      57107µs
std::fs::read:     102822µs

Let's see what happens if we switch the order of the test cases:

#![feature(allocator_api)]
#![feature(ptr_as_uninit)]

use std::{alloc::{Layout, Allocator}, time::Instant};

use nix::{fcntl::OFlag, sys::stat::Mode};
use rand::{Rng, distributions::Uniform};

fn main() {
    let mut rng = rand::thread_rng();
    let mut s = Vec::<u8>::new();
    for _ in 0..400_000_000 {
        s.push(rng.sample(Uniform::new(0, 26)) + 'a' as u8);
    }
    let path = "testfile";
    std::fs::write(path, s).unwrap();

    let tic = Instant::now();
    let v2 = std::fs::read(path).unwrap();
    println!("std::fs::read: {:10}µs", tic.elapsed().as_micros());

    let tic = Instant::now();
    let f = std::fs::File::open(path).unwrap();
    let n = f.metadata().unwrap().len() as usize;
    let a = std::alloc::System;
    let layout = Layout::from_size_align(n, 32).unwrap();
    let v = a.allocate(layout).unwrap();
    let fd = nix::fcntl::open(path, OFlag::O_RDONLY, Mode::empty()).unwrap();
    let v1 = unsafe { std::mem::transmute::<_, &mut [u8]>(v.as_uninit_slice_mut()) };
    let x = nix::unistd::read(fd, v1).unwrap();
    assert_eq!(x, n); // I know this can fail
    println!("Terrible hack: {:10}µs", tic.elapsed().as_micros());

    assert_eq!(v1, v2);
}
$ cargo +nightly run --release
   Compiling read_test v0.1.0 (/tmp/read_test)
    Finished release [optimized + debuginfo] target(s) in 0.41s
     Running `target/release/read_test`
std::fs::read:      59099µs
Terrible hack:      96395µs

Now they've switched positions and std::fs::read is significantly faster than the "terrible hack" implementation! This is a strong sign that we're not actually measuring what we think we're measuring.

Let's use strace to see what syscalls we actually make:

std::fs::read

clock_gettime(CLOCK_MONOTONIC, {tv_sec=97469, tv_nsec=90079100}) = 0
openat(AT_FDCWD, "testfile", O_RDONLY|O_CLOEXEC) = 3
statx(0, NULL, AT_STATX_SYNC_AS_STAT, STATX_ALL, NULL) = -1 EFAULT (Bad address)
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_ALL|0x1000, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=400000000, ...}) = 0
lseek(3, 0, SEEK_CUR)                   = 0
mmap(NULL, 400003072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f2331def000
read(3, "oktktftsgibrigcysbdamxxylnyfbqhr"..., 400000000) = 400000000
read(3, "", 32)                         = 0
close(3)                                = 0
clock_gettime(CLOCK_MONOTONIC, {tv_sec=97469, tv_nsec=153213700}) = 0
write(1, "std::fs::read:      63134\302\265s\n", 29std::fs::read:      63134µs

Terrible hack

clock_gettime(CLOCK_MONOTONIC, {tv_sec=97469, tv_nsec=153860400}) = 0
openat(AT_FDCWD, "testfile", O_RDONLY|O_CLOEXEC) = 3
statx(3, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_ALL|0x1000, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=400000000, ...}) = 0
mmap(NULL, 400003072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f231a076000
openat(AT_FDCWD, "testfile", O_RDONLY)  = 4
read(4, "oktktftsgibrigcysbdamxxylnyfbqhr"..., 400000000) = 400000000
clock_gettime(CLOCK_MONOTONIC, {tv_sec=97469, tv_nsec=252478000}) = 0
write(1, "Terrible hack:      98617\302\265s\n", 29Terrible hack:      98617µs

While there are some slight differences in the exact syscalls made (for a variety of reasons), both basically just boil down to:

allocate enough memory for the file content (via mmap)
read all of the file content into memory

Given that whichever test case runs first is considerably faster, and that both consist of the same core syscalls, it's reasonable to conclude that the same operations simply take different amounts of time (the first set being faster for some reason, the second set slower), and that difference is what's actually being measured. The next step would be to confirm this with perf, but I'm currently on WSL and perf isn't available on my machine.


Note: I believe the "terrible hack" implementation is essentially equivalent to:

use std::{io::Read, time::Instant};

let tic = Instant::now();
let mut f = std::fs::File::open(path).unwrap();
let n = f.metadata().unwrap().len() as usize;
let mut v3 = vec![0u8; n];
let x = f.read(&mut v3).unwrap();
assert_eq!(x, n);
println!("std::io::Read::read: {:10}µs", tic.elapsed().as_micros());

This version doesn't require third-party crates, unsafe code, or unstable features. It produces the following syscalls:

clock_gettime(CLOCK_MONOTONIC, {tv_sec=98081, tv_nsec=272918100}) = 0
openat(AT_FDCWD, "testfile", O_RDONLY|O_CLOEXEC) = 5
statx(5, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_ALL|0x1000, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=400000000, ...}) = 0
mmap(NULL, 400003072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fddf6880000
read(5, "oktktftsgibrigcysbdamxxylnyfbqhr"..., 400000000) = 400000000
clock_gettime(CLOCK_MONOTONIC, {tv_sec=98081, tv_nsec=399405200}) = 0
write(1, "std::io::Read::read_to_end:     "..., 42std::io::Read::read_to_end:     126487µs

It has performance characteristics similar to the others, depending on which implementation runs first in the program.

13 Likes

That call is essentially saying "I need at least 32 more bytes, but add whatever you believe is best." Eventually execution ends up here, which essentially doubles the size.
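
For reference, a minimal sketch (not the actual std source) of the growth behaviour being described, where the loop asks for at least a small amount of extra space and Vec's amortized growth rounds that up to roughly a doubling:

use std::io::Read;

// Simplified read_to_end-style loop: each time the Vec is full we ask for
// "at least 32 more bytes" and Vec::reserve grows the capacity amortized
// (typically doubling it). Unlike std, this sketch zero-fills the new space.
fn read_to_end_sketch<R: Read>(mut r: R, buf: &mut Vec<u8>) -> std::io::Result<usize> {
    let start = buf.len();
    loop {
        if buf.len() == buf.capacity() {
            buf.reserve(32); // "I need at least 32 more bytes"
        }
        let old_len = buf.len();
        let new_len = buf.capacity();
        buf.resize(new_len, 0); // std reads into uninitialized memory instead
        let n = r.read(&mut buf[old_len..])?;
        buf.truncate(old_len + n);
        if n == 0 {
            return Ok(buf.len() - start); // a read of 0 bytes signals EOF
        }
    }
}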

1 Like

I strongly believe this is a fluke (have you tried running it again and checking if you consistently get this result?), but I will provide a criterion benchmark to prove it:

It doesn't matter what order you arrange the benchmarks in, the terrible hack is consistently more than twice as fast as std::fs::read.

I've also added the version that the terrible hack is supposedly equivalent to and it is as slow as std::fs::read.

Maybe the speed difference comes from not initializing the memory to zero?

Benchmark report:


3 Likes

It's probably not called at all - both versions just perform a single read syscall to read the entire file.

1 Like

I suspect that the overhead removed in the "terrible hack" version amounts to the extra work needed to ensure both soundness and completeness of the read operation. I'm not on a *nix system, so I'd likely get vastly different results from the Windows filesystem; everything here is pure conjecture.

Note that File uses its own read_to_end implementation, not the default provided one; it adds a reserve for file.metadata()?.len() - file.stream_position()?, and it doesn't zero the memory before reading into it.
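
As a rough illustration of that size hint (a sketch of the idea, not the real std code):

use std::fs::File;
use std::io::{Read, Seek};

// Reserve (file length - current position) up front so read_to_end doesn't
// have to grow the buffer repeatedly; the loop still finishes with a short
// read returning 0 to detect EOF.
fn read_with_size_hint(path: &str) -> std::io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    let size = file.metadata()?.len();
    let pos = file.stream_position()?;
    let mut bytes = Vec::new();
    bytes.reserve(size.saturating_sub(pos) as usize);
    file.read_to_end(&mut bytes)?;
    Ok(bytes)
}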

Given the 2x performance difference, I suspect it comes down to some combination of the extra read(3, _, 32) [source] done to ensure the entire file was in fact read, and the close(3) that releases the file handle, neither of which the hack version does.

The impact of the file close can be tested by open-coding fs::read a bit to get

// fs::read
let start = Instant::now();
let end;
let bytes = {
    let mut file = File::open(path)?;
    let mut bytes = Vec::new();
    file.read_to_end(&mut bytes)?;
    Ok(bytes)
};
end = Instant::now();

// stop timing before closing the file
let start = Instant::now();
let end;
let bytes = {
    let mut file = File::open(path)?;
    let mut bytes = Vec::new();
    file.read_to_end(&mut bytes)?;
    end = Instant::now();
    Ok(bytes)
};

The second read can't really be avoided; the way that you determine that you've read the entire file is to get read(..) == 0. However, if you're willing to assume that the file contains the exact number of bytes the metadata says (not actually fully guaranteed), you can use something along the lines of

let start = Instant::now();
let end;
let bytes = {
    let mut file = File::open(path)?;
    let mut bytes = Vec::new();
    bytes.resize(file.metadata()?.len() as usize, 0);
    file.read_exact(&mut bytes)?;
    Ok(bytes)
};
end = Instant::now();

to not do the extra EOF check. On nightly, we can use read_buf_exact to avoid the need to zero-fill the vector first, but then you also need to unsafely communicate that size increase back to the vec.
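
A rough sketch of what that could look like (the nightly feature name and exact API shapes here are my best reading of the current docs and may shift):

#![feature(read_buf)]

use std::fs::File;
use std::io::{BorrowedBuf, Read};

fn read_exact_uninit(path: &str) -> std::io::Result<Vec<u8>> {
    let mut file = File::open(path)?;
    let len = file.metadata()?.len() as usize;

    let mut bytes: Vec<u8> = Vec::with_capacity(len);
    // Wrap the uninitialized spare capacity without zero-filling it first.
    let mut buf: BorrowedBuf<'_> = bytes.spare_capacity_mut().into();
    file.read_buf_exact(buf.unfilled())?;
    let filled = buf.len();
    // "Unsafely communicate that size increase back to the vec."
    unsafe { bytes.set_len(filled) };
    Ok(bytes)
}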

The reason for the "whichever runs first" behavior observed by @wesleywiser may be that the file handles used in the single-file test are never actually released. This means the first test runs on a file handle that is the only one using the file, whereas the second test runs with two file handles open to the same file. That may impact OS read performance.

4 Likes

Yes, I ran the tests multiple times and consistently see the first test case being the faster one for the program you gave.


Here's my criterion setup:

#![feature(allocator_api)]
#![feature(ptr_as_uninit)]

use std::{alloc::{Layout, Allocator}};

use nix::{fcntl::OFlag, sys::stat::Mode};
use rand::{Rng, distributions::Uniform};
use criterion::{black_box, criterion_group, criterion_main, Criterion};

pub fn criterion_benchmark(c: &mut Criterion) {
    let mut rng = rand::thread_rng();
    let mut s = Vec::<u8>::new();
    for _ in 0..40_000_000 {
        s.push(rng.sample(Uniform::new(0, 26)) + 'a' as u8);
    }
    let path = "testfile";
    std::fs::write(path, s).unwrap();

    c.bench_function("std::fs::read", |b| b.iter(|| {
        std::fs::read(path).unwrap();
    }));

    c.bench_function("terrible hack", |b| b.iter(|| {
        let f = std::fs::File::open(path).unwrap();
        let n = f.metadata().unwrap().len() as usize;
        let a = std::alloc::System;
        let layout = Layout::from_size_align(n, 32).unwrap();
        let v = a.allocate(layout).unwrap();
        let fd = nix::fcntl::open(path, OFlag::O_RDONLY, Mode::empty()).unwrap();
        let v1 = unsafe { std::mem::transmute::<_, &mut [u8]>(v.as_uninit_slice_mut()) };
        let x = nix::unistd::read(fd, v1).unwrap();
        assert_eq!(x, n); // I know this can fail
    }));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

I had to lower the amount of data substantially in order for the benchmark to not run out of memory.

The results I have are


Showing that the "terrible hack" approach is 50% slower. Interestingly, if I flip the order this time, the results remain constant.

I finished the implementation so that it closes the file handle and frees the allocated memory, and this brings both implementations to essentially the same performance:

diff --git a/benches/my_benchmark.rs b/benches/my_benchmark.rs
index 7d6d60e..fac33b2 100644
--- a/benches/my_benchmark.rs
+++ b/benches/my_benchmark.rs
@@ -27,6 +27,8 @@ pub fn criterion_benchmark(c: &mut Criterion) {
         let v1 = unsafe { std::mem::transmute::<_, &mut [u8]>(v.as_uninit_slice_mut()) };
         let x = nix::unistd::read(fd, v1).unwrap();
         assert_eq!(x, n); // I know this can fail
+        nix::unistd::close(fd).unwrap();
+        unsafe { a.deallocate(v.as_non_null_ptr(), layout); }
     }));


2 Likes

Interesting! I wonder if Windows Defender could have an impact on this? Are you using WSL1 or 2?

In the benchmarked version I constructed a Vec from the pointer so it would deallocate it (but the optimizer also shouldn't get too trigger happy). See here.

I ran your exact benchmark code with those two changes applied (and a #![feature(slice_ptr_get)] that was required) and the difference persists:

std::fs::read           time:   [13.659 ms 13.687 ms 13.719 ms]                          
                        change: [-2.0217% -1.7758% -1.5177%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

terrible hack           time:   [4.7017 ms 4.7245 ms 4.7487 ms]                           
                        change: [+12.606% +13.292% +13.918%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

I've implemented the probing read done by read_to_end and closed the file descriptor, but see no major change in performance.

Benchmarking std::fs::read: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.1s, or reduce sample count to 40.
std::fs::read           time:   [124.72 ms 125.37 ms 126.03 ms]                          
                        change: [-1.0061% -0.3735% +0.2542%] (p = 0.23 > 0.05)
                        No change in performance detected.

terrible hack           time:   [44.982 ms 45.112 ms 45.242 ms]                          
                        change: [-1.7500% -1.4790% -1.1734%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Benchmarking safe hack: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 12.2s, or reduce sample count to 40.
safe hack               time:   [121.75 ms 121.92 ms 122.10 ms]                      
                        change: [-5.5431% -5.2293% -4.9523%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  7 (7.00%) high mild

Don't mix System and Global (used by default by Vec).

This is at best relying on nobody using #[global_allocator], and likely actually UB.


I've reproduced your nix implementation and provided a std implementation that should be identical; it would be nice if you could compare the two for us, and even better if you can get some sort of flamegraph/profiler on it.

Note that I've put these into functions; keep them as such to ensure that the harness around them treats them as similarly as possible.

nix
fn nix_read(path: &str) -> Vec<u8> {
    let len = {
        let f = File::open(path).unwrap();
        f.metadata().unwrap().len() as usize
    };

    let layout = Layout::from_size_align(len, 1).unwrap();
    let ptr = Global.allocate(layout).unwrap();

    let fd = nix::fcntl::open(path, OFlag::O_RDONLY, Mode::empty()).unwrap();

    let did_read = nix::unistd::read(fd, unsafe { ptr.as_mut() }).unwrap();
    assert_eq!(did_read, len); // could fail

    let mut probe_buf = [0u8; 32];
    let did_read = nix::unistd::read(fd, &mut probe_buf).unwrap();
    assert_eq!(did_read, 0);

    nix::unistd::close(fd).unwrap();
    
    Box::from_raw(ptr).into()
}

std
fn std_read(path: &str) -> Vec<u8> {
    let mut f = File::open(path).unwrap();
    let len = f.metadata().unwrap().len() as usize;
    
    let mut buf = Vec::with_capacity(len);
    
    let p = ptr::slice_from_raw_parts_mut(buf.as_mut_ptr(), len);
    
    let did_read = f.read(unsafe { &mut *p }).unwrap();
    assert_eq!(did_read, len); // could fail
    
    let mut probe_buf = [0u8; 32];
    let did_read = f.read(&mut probe_buf).unwrap();
    assert_eq!(did_read, 0);

    drop(f);    

    unsafe { buf.set_len(len) };
    buf
}

These should be doing the exact same thing in both cases[1]. If they don't, investigating that is a good avenue to figure out what's going on.


  1. Caveat: the nix version actually opens the file twice, since we open it once to get the metadata and again to do the read. ↩︎

4 Likes

What an excellent and concise dive into how File::read_to_end works. I couldn't have done it better, and I wrote it (1, 2).

6 Likes

Yup, though you could pass the used allocator (System) as the A generic parameter of Vec.
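
In code, that suggestion looks roughly like this (nightly allocator_api; just a minimal sketch):

#![feature(allocator_api)]

use std::alloc::System;

fn main() {
    // Vec<u8, System>: the buffer is allocated from and returned to System,
    // instead of allocating with System and freeing via the global allocator.
    let buf: Vec<u8, System> = Vec::with_capacity_in(400_000_000, System);
    assert!(buf.capacity() >= 400_000_000);
}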

I've had to make minor adjustments to the nix version to make it compile:

8c8
<     let ptr = Global.allocate(layout).unwrap();
---
>     let mut ptr = Global.allocate(layout).unwrap();
20,21c20,21
<     
<     Box::from_raw(ptr).into()
---
> 
>     unsafe { Box::from_raw(ptr.as_ptr()).into() }

The version you provided performs similarly to the std version:

nix_read                time:   [118.09 ms 118.20 ms 118.32 ms]                     
std_read                time:   [120.75 ms 121.20 ms 121.65 ms] 

But! When I change the alignment of the allocation back to 32 as in my original code I get the speed difference again!

<     let layout = Layout::from_size_align(len, 1).unwrap();
---
>     let layout = Layout::from_size_align(len, 32).unwrap();
nix_read                time:   [42.812 ms 42.884 ms 42.964 ms]                     
std_read                time:   [118.98 ms 119.07 ms 119.18 ms]                     

Alternatively, I can augment the std_read with a custom allocator that always aligns to at least 32 bytes:

pub struct MyAlloc;

unsafe impl Allocator for MyAlloc {
    fn allocate(&self, layout: Layout) -> Result<std::ptr::NonNull<[u8]>, std::alloc::AllocError> {
        Global.allocate(unsafe { Layout::from_size_align_unchecked(layout.size(), layout.align().max(32)) })
    }

    unsafe fn deallocate(&self, ptr: std::ptr::NonNull<u8>, layout: Layout) {
        Global.deallocate(ptr, unsafe { Layout::from_size_align_unchecked(layout.size(), layout.align().max(32)) })
    }
}

pub fn std_read(path: &str) -> Vec<u8, MyAlloc> {
    let mut f = File::open(path).unwrap();
    let len = f.metadata().unwrap().len() as usize;

    let mut buf = Vec::with_capacity_in(len, MyAlloc{});

    let p = std::ptr::slice_from_raw_parts_mut(buf.as_mut_ptr(), len);

    let did_read = f.read(unsafe { &mut *p }).unwrap();
    assert_eq!(did_read, len); // could fail

    let mut probe_buf = [0u8; 32];
    let did_read = f.read(&mut probe_buf).unwrap();
    assert_eq!(did_read, 0);

    drop(f);

    unsafe { buf.set_len(len) };
    buf
}

This yields the same improvement:

nix_read                time:   [42.189 ms 42.221 ms 42.255 ms]                     
                        change: [-1.3588% -1.1494% -0.9741%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

std_read                time:   [42.265 ms 42.317 ms 42.370 ms]                     
                        change: [-65.695% -65.606% -65.511%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

An alignment of 16 bytes is not sufficient:

std_read                time:   [121.59 ms 122.08 ms 122.57 ms]                     
                        change: [+187.19% +188.48% +189.64%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
1 Like

So I tested the std_read 1 byte aligned vs 32 byte aligned version on my desktop and laptop (commit), and on my laptop I cannot reproduce the difference in performance.

Desktop (AMD Ryzen 9 5900X)

std_read  1 byte aligned time: [133.72 ms 134.03 ms 134.37 ms]              
std_read 32 byte aligned time: [42.969 ms 43.137 ms 43.306 ms]

Laptop: (AMD Ryzen 7 PRO 3700U)

std_read  1 byte aligned time: [101.60 ms 101.86 ms 102.16 ms]
std_read 32 byte aligned time: [101.98 ms 102.41 ms 102.88 ms]

The systems are very similar OS-wise, and both use amd-ucode.

Interestingly, the 1 byte aligned version is slower on the desktop than on the laptop!

I've also tested it on a Raspberry Pi and an Intel VPS; both show no major differences:

Raspberry Pi 3 Model B Rev 1.2

std_read  1 byte aligned time: [1.6450 s 1.6722 s 1.7206 s]                                     
std_read 32 byte aligned time: [1.6606 s 1.6759 s 1.6939 s]

VPS (Intel Xeon)

std_read  1 byte aligned time: [278.47 ms 291.78 ms 307.45 ms]                                    
std_read 32 byte aligned time: [276.78 ms 286.73 ms 297.45 ms]

Summary

It seems like only the system with the AMD Ryzen 9 5900X is affected - maybe others with a Ryzen 5000 series CPU could try the benchmark on their system and see if they get similar results?

The code is here: GitHub - ambiso/read_slow

1 Like

I tried it on Windows (native, no WSL) with the Ryzen 9 5900X and got no significant difference:

std_read  1 byte aligned time: [75.739 ms 77.823 ms 80.110 ms]
std_read 32 byte aligned time: [76.191 ms 78.025 ms 79.979 ms]

Interestingly, it's not close to any of the times seen before.

This made me suspect it might be related to the filesystem (btrfs) or encryption layer (cryptsetup).

I moved the project to a ramdisk:

mkdir /mnt/tmp
sudo mount -t tmpfs -o size=10g tmpfs /mnt/tmp
cp -r read_slow /mnt/tmp
cd /mnt/tmp/read_slow
cargo bench

But the performance difference persists:

std_read  1 byte aligned time: [121.24 ms 121.34 ms 121.45 ms]
std_read 32 byte aligned time: [46.605 ms 46.737 ms 46.877 ms]

Note that the ~45 ms (~8.6 GB/s) I keep getting with the 32 byte aligned version already far surpasses the read throughput of my NVMe SSD (3.5 GB/s; PCIe 3.0 x4 can't deliver much more than that either), so the file is most likely completely cached in any case.

The ~120 ms corresponds to about 3.3 GB/s, so I thought maybe the file is actually being re-read from disk in that case, but I watched the output of sudo smartctl -A /dev/nvme0 and the value of Data Units Read stays constant throughout the benchmark.

Here's a flamegraph of both versions using cargo flamegraph -F 50000:

(edit: I have to download them and open them from the filesystem to get interactivity in firefox/chromium)

std_read

std_read_32_byte_aligned

This is profiling the following main:

use read_slow::{std_read, std_read_32_byte_aligned};

fn main() {
    let path = "testfile";

    for _ in 0..100 {
        // depending on the version...
        // std_read(path);
        std_read_32_byte_aligned(path);
    }
}

You can also grab the raw perf data here:

std_read perf.data

std_read_32_byte_aligned perf.data

1 Like

Okay so this is becoming quite the rabbit hole...

Alignment doesn't matter but offset into an allocation (or page?) does

This C code measures how long the read syscall takes, based on how far the buf pointer is offset.

I got the idea because Julia's read doesn't 32-byte align its allocations/pointers either, and still gets the faster 45ms reads.

#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

int64_t millis()
{
    struct timespec now;
    timespec_get(&now, TIME_UTC);
    return ((int64_t) now.tv_sec) * 1000 + ((int64_t) now.tv_nsec) / 1000000;
}

int main() {
    puts("offset milliseconds");
    for (int off = 0; off < 100; ++off) {
        int fd = open("testfile", O_RDONLY);
        size_t n = 400000000;
        char *buf = malloc(n + 4096);
        int64_t a = millis();
        if (n != read(fd, buf+off, n)) {
            puts("oops");
        }
        int64_t b = millis();
        free(buf);
        close(fd);
        printf("% 3d % 5d\n", off, b - a);
    }
}

Output:

offset milliseconds
  0   133
  1   132
  2   132
  3   133
  4   131
  5   131
  6   133
  7   134
  8   134
  9   133
 10   133
 11   132
 12   132
 13   131
 14   130
 15   130
 16    46
 17    48
 18    47
 19    47
 20    47
 21    48
 22    48
 23    47
 24    47
 25    47
 26    48

So starting with an offset of 16 into the allocation, the read takes ~45ms instead of ~130ms.

What?!

Well that's not the best part yet! This pattern stays constant until shortly before the next page boundary!

offset milliseconds
 4071    45
 4072    45
 4073    45
 4074    45
 4075    45
 4076    46
 4077    46
 4078    46
 4079    46
 4080    46
 4081   129 //!<-- suddenly slow again!
 4082   127
 4083   125
 4084   125
 4085   125
 4086   125
 4087   125
 4088   125
 4089   126
 4090   126
 4091   128
 4092   126
 4093   126
 4094   125
 4095   125
 4096   126
 4097   126
 4098   126
 4099   129
 4100   129
 4101   130
 4102   131
 4103   131
 4104   132
 4105   131
 4106   131
 4107   133
 4108   133
 4109   134
 4110   133
 4111   131
 4112    45 //!<-- and fast again!
 4113    45
 4114    46
 4115    45
 4116    46
 4117    46
 4118    46
 4119    46
 4120    46
 4121    45
 4122    45
 4123    46

This is crazy to me. I guess it might be time to try to debug the Linux kernel?

4080 == 4096 - 16, and 4112 == 4096 + 16. Looks suspicious. Is the allocator maybe using the first/last 16 bytes of each 4k (?) page for bookkeeping? Smells like two pointers in a doubly-linked freelist…

And if that's indeed the case, perhaps your page is only 16-but-not-32-byte-aligned, and after all, it's the 32-byte alignment that matters (for some other, even deeper reason)? Or perhaps the prefetcher learned the pattern that it should keep all but the last 16 bytes in the cache (since allocations are cold relative to actual use of the allocated buffer), or maybe its exact opposite (it learned to keep the first 16 bytes hot, since it's the allocator's)?

3 Likes

Yes, 4KiB pages.

The page should always be "aligned" to 4096 bytes I think? But I'm also not sure. I think all pages should in principle at least have the same alignment.

Assuming that pages are 4096 byte aligned, the pointer I get from malloc is already offset by 16 bytes from the page boundary (probably two 8 byte pointers preceding the allocation)

So the read seems to become fast once the destination address is at least 32 bytes into a page.
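
A quick way to check that in Rust (assuming 4096-byte pages, as on this system):

fn main() {
    let buf: Vec<u8> = Vec::with_capacity(400_000_000);
    // How far into its page does the buffer start?
    let offset_in_page = buf.as_ptr() as usize % 4096;
    println!("buffer starts {offset_in_page} bytes into its page");
}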

I profiled the kernel with sudo perf record -g ./a.out.

The output of perf report can be seen here:

Left: fast version with a 32 byte offset into the page

Right: slow version with a 16 byte offset into the page

It seems copy_user_enhanced_fast_string is responsible for the slowdown (it has a much higher Self percentage on the right).

That's this function: https://github.com/torvalds/linux/blob/master/arch/x86/lib/copy_user_64.S#L149-L176

2 Likes

Jemalloc can be used as a workaround, since it doesn't typically yield pointers that are close to a page boundary:

Left: malloc, being slow without adding an additional offset >= 16 to the buffer pointer before calling read()

Right: jemalloc being always fast
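
For reference, a minimal sketch of swapping in jemalloc as the global allocator (here via the tikv-jemallocator crate; other jemalloc bindings work the same way):

use tikv_jemallocator::Jemalloc;

#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    // std::fs::read now gets its Vec buffer from jemalloc instead of glibc malloc.
    let bytes = std::fs::read("testfile").unwrap();
    println!("read {} bytes", bytes.len());
}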

1 Like

I was able to reproduce the behavior on another Linux system using a Ryzen 9 5900X, so it's not just me!

I wasn't able to reproduce it on a 7900X or an EPYC 7513.

3 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.