Help understanding why io_uring IO performs worse than stdlib in a thread pool

Hi - I hope this is a good forum for this question; please let me know if there's a more appropriate channel. (And, apologies for lack of more links in the message - as a new user I am limited to 2 in this post.)

I'm trying to implement a program that scans a directory, and all its transitive nested directories, hashing all normal files along the way (repo here). Three versions are implemented under src/:

  • rayon_thread_pool: this version spawns a scoped job on the rayon thread pool for each directory that's visited and uses the stdlib's File/io copy implementations. There is a semaphore to limit the number of open file descriptors but it’s irrelevant since there are a fixed small number of threads. (This implementation is due to pkolaczk/fclones; thank you!)

  • io_uring: spawns a thread pool (of the same size as the rayon version) that accepts jobs to read and hash files. Each thread holds an io_uring ring of a parametrizable size (default 32). File read requests are done through the ring. Each thread only processes one file at a time. The io_uring implementation is tokio-rs/io-uring.

  • io_uring_batched: like io_uring, but each thread tries to process multiple files at a time when it can.

I'm sure that my use of io_uring is naive (and especially the batched implementation), but I'm struggling to figure out exactly why. In the experiments I've tried, rayon_thread_pool consistently performs the best, and io_uring_batched consistently performs the worst. These experiments, instructions on reproduction, and flamegraphs of perf profiles, are available here. The summaries are:

  • large-directory-with-symlinks-1-thread: A directory with 328155 total files. The parameters were
threads in thread pool: 1
io_uring buffer_size: 8192
io_uring ring_size: 32

And the results were

rayon_thread_pool: 328155 files in 38295ms

io_uring: 328155 files in 39651ms

io_uring_batched: 328155 files in 45088ms
  • large-directory-with-symlinks-2-thread: Same as above, but with a thread pool of size 2. The results were
rayon_thread_pool: 328155 files in 19548ms

io_uring: 328155 files in 21022ms

io_uring_batched: 328155 files in 23976ms

I am struggling to gleam anything meaningful from the perf flamegraphs of these experiments (available in the links above), as they seem to be very similar. The most notable difference I can see thus far is that the io_uring version has a deeper call stack in the kernel than the rayon_thread_pool version, because it must pass through the io_uring event loop. It's not clear to me that the io-uring event loop overhead is material here, but maybe that's incorrect?

If anyone has suggestions on how to understand what is going on here, or things that should be corrected or tried in the implementation, I would appreciate your help. Thank you!

Some notes:

  • All of this was tested on a 2 CPU, 2GB machine running a 5.15 Linux kernel
  • Earlier versions of this code elided the branches that traversed symlinks, and exhibited similar runtime characteristics to the descriptions above. I added the symlink version back because I didn't want to create large directory traversals that used only hard links.
  • I ran each experiment a few times and chose the best instance to try to alleviate variance. Though, this was all on an otherwise quiet machine, and I would expect the page cache to be continuously thrashed here.

io_uring buffer_size: 8192

std::io::copy may be performing larger reads than that if the writer side is a BufWriter. If your hasher is wrapped in that then that might be an uneven comparison. strace the rayon case to check the read sizes.

File read requests are done through the ring. Each thread only processes one file at a time.

Do you share a ring across threads, use one ring per thread or have a central IO thread that distributes hashing work across a threadpool?

Do you keep multiple read requests in flight (it should be at least double- or triple-buffered. 1 block in flight, 1 block currently being hashed and optionally one block sitting ready in the completion queue)? Do you batch the SQE refill + io_uring_submit() to avoid syscalls every time an item is drained? Refilling when half the in-flight requests have been fulfilled is probably a good strategy.

1 Like

The read sizes of the stdlib implementation look to also be 8192 bytes.

Do you share a ring across threads, use one ring per thread or have a central IO thread that distributes hashing work across a threadpool?

I'm using one ring per thread. Tried to have one IO thread that dispatches the hashing work, but that wasn't any better, and I suspect suboptimal because the central IO thread needs to keep the entire file contents in memory, or else add another level of message passing. Do you think I should try something else here?

Do you keep multiple read requests in flight?
Do you batch the SQE refill + io_uring_submit() to avoid syscalls every time an item is drained?

Yes, I believe so. My loop looks like

let hashing_states = ... // array mapping fd => Hasher
let ring = ... // IoUring
let receiver = ... // crossbeam unbounded channel receiver of files walked in the directory

while !no_more_files || hashing_states.len > 0 {
    // push new files to hash as SQEs while we have space
    while hashing_states.len < ring_size {
        let (entry, meta) = if hashing_states.len > 0 {
            match receiver.try_recv() {
                Ok(p) => p,
                // break or continue if nothing on the receiver
        } else {
            // do a blocking read since there's nothing on the uring
            match receiver.recv() {
                // same as above
        let file = File::open(entry.path()).unwrap();
        let fd = file.into_raw_fd();
        let state = ...;             
        hashing_states.insert(fd, state);
        let read_e = opcode::Read::new(
            hashing_states.get_buffer(fd).as_mut_ptr() as _,
        ).offset(0).build().user_data(fd as _);
        unsafe {
        wanted += 1;
    // before executing the submission, re-enter any partially read files done on the CQ                                                                                
    while !ring.completion().is_empty() {
        let cqe = ring.completion().next().unwrap();
        let fd = cqe.user_data() as i32;
        let ret = cqe.result();
        let mut state = hashing_states.remove(fd);
        state.offset += ret as u64;
        if state.offset == state.len {
            // EOF
            let _file = unsafe { File::from_raw_fd(fd) };
            let hash = state.hasher.finalize().into();              
            all_results.push(FileInfo { hash });
        } else {
            // Need to read again
            state.hasher.update(&state.buffer[..(ret as usize)]);
            let read_e = opcode::Read::new(
                state.buffer.as_mut_ptr() as _,
            ).offset(state.offset).build().user_data(fd as _);
            hashing_states.insert(fd, state);
            unsafe {
            wanted += 1;
    // perform (possibly batched) submit                                                                                
    if wanted != 0 {
    wanted = 0;

I also tried changing the submit condition to be more batched, to

    if wanted >= hashing_states.len / 2 || no_more_files {
        wanted = 0;

but that didn't seem to help the runtime.

Try a current kernel. I think they did some optimization work for async file access in the latest releases.

I tried a 6.2 kernel this morning, but sadly no great luck. The io uring version now performs 2% better on average, but that is well within the margin of error, and I would expect it to perform much better.

Well, 6.2 isn't even current. But the numbers aren't so bad if you reached parity? At this point I would try establishing rooflines to see if you're maxed out. Is it even possible to hash faster? How much faster could you even read from the disk?
And you should also figure out what the bottleneck is. Is it syscall overhead, the CPU cycles spent on the hashing, the IO, ...

I also took a quick look at your source code, it seems like you're opening files in the loop. That's a blocking operation. You might want to put that on the uring too.

What stands out noticeably is that you're opening the files synchronously. That's one cost that you would ideally avoid by submitting a linked OP where you open & read the first batch as one go.

The other thing to note is that for sequential reading, io_uring won't magically beat the read ahead & whatnot that the kernel will do and the cost of the I/O syscalls can easily be not that significant in your application.

One possible reason the times will be similar is that you're performance is likely dominated by your hashing speed rather than I/O speed (you've posted flamecharts for the kernel but not your process). blake3 according to smasher is ~1.2 GiB/s which is easily slower than modern NVME drives. Not sure what CPU & disk you're running this on but a cryptographic hash is going to be expensive (even AES-NI SHA2 is 1.5 GiB/s according to smhasher but all of these numbers are very workload & machine config dependent so don't rely too much on the absolute numbers from smhasher & read the relative performance with a grain of salt). If you can use something faster like xxh3 (I recommend xxhash-rust & make sure to use xxh3 64-bit not xxh64), then you would definitely outpace your I/O and maybe then syscalls become relatively more expensive enough to matter (but would still dwarf if you're processing decently sized files).

The final thing I'll note is that you can see a very large gap between two syscalls to submit the sqe. You'd ideally be setting up a maximal length sqe so that you don't have any gaps. I don't have any specific solutions here as I haven't thought deep about it, but it would look something like separating your I/O and CPU threads where your I/O threads keep your sqes full and you dispatch to background CPU threads to do the hashing. Probably want to make sure here that you're hashing a direct pointer to the read data rather than copying the buffer across threads - for maximal performance you'll need to control lifetimes carefully to avoid wrapping the buffer with Arc.

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.