Fast IO for sparse files

Is there a fast way to do sparse IO within files (including kernel-level, unsafe, etc.)?

Specifically, I need to write small amounts of data to non-sequential parts of a file. An example use case is writing the transpose of one matrix-file to another target file (or doing it in place, I don't think that matters).

Both the default file writer and the buffered file writer get killed by syscall overhead when doing something like this. The buffered writer doesn't provide any benefit either, since every call to seek flushes the buffer to disk. Linux offers a write_at function, and while that cuts the number of syscalls in half, the timing is still dominated by syscalls. By "dominated" I mean roughly a 20-100x slowdown compared to doing the operation in memory and then writing the entire buffered contents to disk.

For reference, my intended use case is an SSD, where random writes shouldn't be as big of an issue as on a spinning HDD. I've also tried running the same process on a ramdisk, but the benefit was minimal and even that was significantly slower than transposing in memory and then writing. This leads me to believe the syscall overhead is the biggest slowdown.
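For concreteness, the naive per-element version looks roughly like this (a trimmed-down sketch, not my exact code; the f64 element type and the function name are just placeholders):

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // provides write_all_at on Unix

/// Naive transpose: one tiny pwrite per element, i.e. one syscall per element.
/// `row` is one row of an n x n matrix, already read into memory.
fn write_row_transposed(out: &File, row: &[f64], row_idx: usize, n: usize) -> io::Result<()> {
    for (col_idx, value) in row.iter().enumerate() {
        // Element (row_idx, col_idx) lands at (col_idx, row_idx) in the output.
        let offset = ((col_idx * n + row_idx) * std::mem::size_of::<f64>()) as u64;
        out.write_all_at(&value.to_le_bytes(), offset)?;
    }
    Ok(())
}
```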

Other alternatives are memmaps, io_uring, or rolling my own custom transposed buffer.

  • The custom buffer does help a lot compared to the naive write_at implementation. In this case, I read in a row of the matrix, slice it up and append each element to a column buffer, and every so many rows I flush the columns to their respective locations in the output file (see the sketch after this list). Ultimately this gives a ~2x speedup over the naive approach.
  • memmap gives roughly the same performance benefit as the buffered column implementation. I expected this to be much faster, since I assumed it could bypass syscalls and map directly to the file itself.
  • I haven't yet tried an io_uring since the implementation is more involved than the previous options.
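For reference, the column-buffer approach from the first bullet looks roughly like this (a simplified sketch; the real code also handles partial blocks, element types, and so on):

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

/// Buffer a block of rows per column in memory, then flush each column buffer
/// with a single write_all_at instead of one syscall per element.
/// `rows` holds a block of consecutive rows starting at row index `first_row`.
fn flush_block(out: &File, rows: &[Vec<f64>], first_row: usize, n: usize) -> io::Result<()> {
    for col in 0..n {
        // Gather this column's slice of the block into one contiguous buffer.
        let mut buf = Vec::with_capacity(rows.len() * 8);
        for row in rows {
            buf.extend_from_slice(&row[col].to_le_bytes());
        }
        // Column `col` starts at element col * n in the output; skip the rows
        // that earlier blocks already wrote.
        let offset = ((col * n + first_row) * 8) as u64;
        out.write_all_at(&buf, offset)?;
    }
    Ok(())
}
```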

Additional evidence

Transposing a matrix is a simplified version of my actual use case. For my actual program I have a flamegraph of the sparse-writing, which shows the computation is dwarfed by the file writes.


This operates over a 1GB file. When I do this in-memory the write barely even shows up on the flamegraph. I can't buffer the entire file to memory since I eventually have to work on files too large to be buffered.

I wrote a similar, dummy test using mmaps and saw similar issues with writing linearly (row-major) and column-major.


The linear_writes and non_linear_writes functions are the two implementations. In linear_writes I also calculate the would-be offset, to make sure that computation doesn't affect the timings.

Also, this is my first post; let me know if I can make it better or provide more evidence/explanation :)

I'm surprised you didn't see much of an improvement with mmap. I guess one explanation is that the kernel only loads a page of data from disk into memory when you trigger a page fault by touching that address. That means if each of your operations touches a part of the file for the first time, essentially every write will trigger a page fault and jump into the kernel.

Maybe you could "warm up" the page cache by taking the &[u8] you get from mmapping the file and doing some trivial operation that forces the kernel to access the file. That's essentially the equivalent of reading the entire file into memory though, so on second thoughts I'm not sure if it'll help much.

Alternatively, std::fs::File should have an efficient implementation of Write::write_vectored() which lets you write multiple buffers into a file with a single syscall. I'm not sure how you are doing the writes, but maybe that could help?
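Something like this, for illustration (a rough sketch; note that the vectored write still puts the buffers back-to-back at the current file position, and a robust version would loop on short writes):

```rust
use std::fs::File;
use std::io::{self, IoSlice, Write};

/// Write several in-memory buffers with (ideally) a single writev syscall.
fn write_pieces(file: &mut File, pieces: &[&[u8]]) -> io::Result<()> {
    let slices: Vec<IoSlice<'_>> = pieces.iter().map(|p| IoSlice::new(*p)).collect();
    // write_vectored may write fewer bytes than requested; this sketch
    // ignores short writes for brevity.
    file.write_vectored(&slices)?;
    Ok(())
}
```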

Depending on your use case, you might be able to buffer up some of your writes in memory, then do some magic to condense nearby writes into one big write. So instead of doing writes like

XXXX-----XX---

(where X is a write with new data and - is a gap)

You might do

XXXXoooooXX---

(where o is rewriting the same data over top of itself)
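A rough sketch of the idea, assuming you have a batch of (offset, data) pairs sorted by offset and can afford to read the gap bytes back from the file (all names here are made up):

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

const MAX_GAP: u64 = 4096; // gaps smaller than this get filled and merged

/// Merge nearby writes: fill small gaps with the file's existing bytes so
/// several small pwrites become one larger one.
/// `writes` must be sorted by offset and non-overlapping.
fn coalesced_write(file: &File, writes: &[(u64, Vec<u8>)]) -> io::Result<()> {
    let mut iter = writes.iter();
    let Some((first_off, first_data)) = iter.next() else { return Ok(()) };
    let (mut start, mut buf) = (*first_off, first_data.clone());

    for (off, data) in iter {
        let end = start + buf.len() as u64;
        if *off - end <= MAX_GAP {
            // Re-read the gap and rewrite it unchanged (the "o" region above).
            let mut gap = vec![0u8; (*off - end) as usize];
            file.read_exact_at(&mut gap, end)?;
            buf.extend_from_slice(&gap);
            buf.extend_from_slice(data);
        } else {
            file.write_all_at(&buf, start)?;
            start = *off;
            buf = data.clone();
        }
    }
    file.write_all_at(&buf, start)
}
```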

It sounds like you're currently using a single thread; is that right? If so, you may be able to speed things up by doing the writes in parallel across multiple threads. An SSD can handle huge numbers of parallel writes, and with a thread pool you can be issuing a large number of writes (and therefore system calls) in parallel.

Each thread would need its own file handle to allow seeking separately and to avoid any synchronization in the file handle. Opening a file many times is cheap -- only the open of the first handle is expensive.

You would have to figure out how to divide up the work between threads of course, which may or may not be practical with your algorithm.
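A minimal sketch of that with plain scoped threads (no pool crate), splitting whole columns across workers; the names and the column-per-thread split here are just one possible way to divide the work:

```rust
use std::fs::OpenOptions;
use std::io;
use std::os::unix::fs::FileExt;
use std::thread;

/// Each worker opens its own handle to the output file and writes its share
/// of the columns with write_all_at, so there are no seeks and no contention
/// on a shared file handle.
fn parallel_column_writes(path: &str, columns: Vec<Vec<u8>>, col_bytes: u64) -> io::Result<()> {
    let n_threads = 8;
    let mut chunks: Vec<Vec<(usize, Vec<u8>)>> = vec![Vec::new(); n_threads];
    for (i, col) in columns.into_iter().enumerate() {
        chunks[i % n_threads].push((i, col));
    }

    thread::scope(|s| {
        let handles: Vec<_> = chunks
            .iter()
            .map(|chunk| {
                s.spawn(move || -> io::Result<()> {
                    // Opening the file again per thread is cheap after the first open.
                    let file = OpenOptions::new().write(true).open(path)?;
                    for (col_idx, data) in chunk {
                        file.write_all_at(data, *col_idx as u64 * col_bytes)?;
                    }
                    Ok(())
                })
            })
            .collect();
        for h in handles {
            h.join().unwrap()?;
        }
        Ok(())
    })
}
```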


One other note is to make sure you're not doing an fsync on each write. This would apply to any approach you take that does multiple writes. Just do one fsync at the end if you need durability at that point.

Yeah, these are good suggestions, thanks for the input!

I do already warm up the file by writing 0s to the entire file when it's first opened. This definitely speeds up file IO on small files that fit entirely in memory, but once I start touching many pages the improvement quickly shrinks. I think it's feasible I'm also getting bottlenecked by the CPU's page table and page walks. In general, I'm worried that seeking around and caching so many pages is unavoidably slow.

Currently I'm doing the writes using the write_at call, which avoids the two syscalls of a separate seek followed by a write, and also allows the writes to be parallelized. I don't think write_vectored would work, since I don't have the data in between the locations I'm writing to (the `o`s in the example) in memory. For sufficiently large files, the data I'm seeking past could be on the order of a gigabyte or arbitrarily large, which would be too much to keep around and probably too much to read back in just to buffer.

I mention that it's parallelizable because I previously tried doing all my work in one async thread and sending the data to be written to a different thread in a tokio runtime, but the subsequent awaits on the file writes added overhead I didn't need, and a single thread ended up faster. So, currently, I'm using just one thread for both the transpose work and all the writes around the file. However, I haven't tried having one dedicated thread writing each transposed column with its own syscalls (or at least one thread per group of transposed columns, so I end up with maybe 2^10 threads rather than 2^30).

One other alternative is to have a separate file for each column that I could decompose rows into, then concatenate them all together at the end of the process. This would still involve writing small bits of data across a large number of pages, but the OS might be able to cache each file better since I'd be writing to each one sequentially.

Will report back with any findings. Thanks again for the input!

It shouldn't if you use write_at, though I don't know if using separate files would avoid some kernel-side locking.

You shouldn't be using tokio for this. Tokio's file IO has additional overhead due to thread offloading, and using std file IO inside tokio doesn't play well with the runtime since it blocks.

Yeah, you should be using a thread pool; threads have their own startup costs.

All those [unknown] entries suggest the profiles are probably missing kernel symbols. Maybe you need to install some package on your system or install debuginfod or something to get better symbolication.

  • Use /bin/time to see whether time is spent in userspace or kernel space, and how many page faults you take.
  • Watch /proc/pressure/* to see if you're hitting IO saturation.
  • Use perf stat to get a high-level overview of what your CPU is bottlenecked on.

The pages still have to be faulted in on first access. If the size of your files exceeds your available RAM, then pure random access will stress that. Ideally you should optimize your IO patterns to deal in larger blocks where possible.
Since you also have to read the data, that creates IO pressure too, not just the writes, so you have to balance your read and write access patterns. I don't know what the optimal read and write IO patterns for giant matrix transposition are... blockwise? z-order curve? There should be some prior art on that.
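For what it's worth, the z-order index itself is just bit interleaving, so mapping a (row, col) pair to a tile-friendly linear position is cheap. A small illustrative sketch for a square matrix whose side is a power of two:

```rust
/// Interleave the bits of `row` and `col` to get a z-order (Morton) index.
/// Nearby (row, col) pairs map to nearby indices, so square tiles of the
/// matrix end up close together in the file.
fn morton_index(row: u32, col: u32) -> u64 {
    // Spread the low 32 bits of x so there is a zero bit between each bit.
    fn spread(mut x: u64) -> u64 {
        x &= 0xFFFF_FFFF;
        x = (x | (x << 16)) & 0x0000_FFFF_0000_FFFF;
        x = (x | (x << 8)) & 0x00FF_00FF_00FF_00FF;
        x = (x | (x << 4)) & 0x0F0F_0F0F_0F0F_0F0F;
        x = (x | (x << 2)) & 0x3333_3333_3333_3333;
        x = (x | (x << 1)) & 0x5555_5555_5555_5555;
        x
    }
    (spread(row as u64) << 1) | spread(col as u64)
}
```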


A few small tweaks you can try, though I would only expect them to provide modest speedups, if any:

  • update to the most recent kernel version, sometimes they contain significant perf optimizations
  • applying POSIX_FADV_RANDOM to the file or MADV_RANDOM to the mmap (see the sketch after this list)
  • check if filesystems make a difference, ext4 and xfs should be faster than btrfs for example. though you said you tried a ramdisk, so that's probably not it
  • use MADV_POPULATE_WRITE to prefault a chunk of pages instead of writing 0s to them, perhaps on a separate thread
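For the advice flags specifically, something like this via the libc crate (just a sketch: return values unchecked, and MADV_POPULATE_WRITE needs Linux 5.14+):

```rust
use std::fs::File;
use std::os::unix::io::AsRawFd;

/// Tell the kernel that access to `file` will be random, which disables readahead.
fn advise_random(file: &File) {
    unsafe {
        libc::posix_fadvise(file.as_raw_fd(), 0, 0, libc::POSIX_FADV_RANDOM);
    }
}

/// Prefault the pages backing an mmapped region (e.g. the &mut [u8] from a
/// writable mmap) without writing zeros into it.
fn prefault(region: &mut [u8]) {
    unsafe {
        let addr = region.as_mut_ptr() as *mut libc::c_void;
        libc::madvise(addr, region.len(), libc::MADV_RANDOM);
        libc::madvise(addr, region.len(), libc::MADV_POPULATE_WRITE);
    }
}
```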

I made a tool that just transposes a very large matrix with a bunch of different methods to see which method was the fastest. You can find the repo here:

There are definitely some optimizations and improvements I could still make; feel free to PR if interested.
Notably, I have not yet implemented io_uring.

It seems like buffering the writes with some parallel column buffers is by far the fastest method, with mmap a close second.
Other approaches, like buffering to temp files, start to lag behind pretty quickly.

There's also a pretty big discrepancy in speed between when the file is small enough to be cached in RAM and when it's too large to fit.
So, if you try to replicate the results, make sure to give it a file size larger than your available RAM, but be aware it starts to chew up a lot of disk space quite quickly.


@the8472

  • I did try to update, but I'm already on the most recent kernel available for my Ubuntu release.
  • I should try different methods of touching the files as well, but timing the page-fault behavior seems pretty difficult when the CPU usage is so low. Managing the thread communication between reading, writing, and page lookup might be more trouble than I'm willing to put in at the moment, unfortunately.

Oh wow, it's reading and/or writing one byte at a time. You need to chunk your data. E.g. read a 4096x4096 block (or larger), transpose that in memory, and write 4096-byte slices into their destination locations. Then move on to the next block. That way you balance the number and size of IO operations.
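A sketch of what I mean, assuming f64 elements and a square matrix whose side n is a multiple of the block size (names and sizes made up; shrink B if two blocks don't fit comfortably in RAM):

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

const B: usize = 4096; // block side length, in elements
const ELEM: usize = std::mem::size_of::<f64>();

/// Transpose one B x B block: read it with B sequential reads, flip it in
/// memory, then write B contiguous runs of B elements to the output.
fn transpose_block(input: &File, output: &File, n: usize, bi: usize, bj: usize) -> io::Result<()> {
    let mut block = vec![0u8; B * B * ELEM];

    // Read rows bi*B .. bi*B+B, columns bj*B .. bj*B+B from the row-major input.
    for r in 0..B {
        let src = ((bi * B + r) * n + bj * B) * ELEM;
        input.read_exact_at(&mut block[r * B * ELEM..(r + 1) * B * ELEM], src as u64)?;
    }

    // Transpose the block in memory (element by element here; a real version
    // would use a cache-friendly in-memory transpose).
    let mut out = vec![0u8; B * B * ELEM];
    for r in 0..B {
        for c in 0..B {
            let s = (r * B + c) * ELEM;
            let d = (c * B + r) * ELEM;
            out[d..d + ELEM].copy_from_slice(&block[s..s + ELEM]);
        }
    }

    // Write B contiguous runs of B elements into their transposed positions.
    for c in 0..B {
        let dst = ((bj * B + c) * n + bi * B) * ELEM;
        output.write_all_at(&out[c * B * ELEM..(c + 1) * B * ELEM], dst as u64)?;
    }
    Ok(())
}
```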

This is probably a great time to read up about Cache-oblivious algorithm - Wikipedia -- matrix transpose is one of the go-to examples for why this structure helps.


In which spot? Most of the reads/writes are buffered within the function and/or have a BufReader/BufWriter attached to the file to mitigate most of that issue. Otherwise, it's intentionally a naive benchmark to show the discrepancy between the disk_io function and the buffered_disk_io function.

Correct me if I'm wrong, but this doesn't seem to address the issue that the matrix is stored in row-major form on disk for both the input and the transpose.

Can you change that? I think you're fundamentally going to be stuck with bad perf otherwise, even if you improve it a bit by avoiding syscalls.

(One of the wonderful things about cache-oblivious structures is that you can treat the disk as just another level of caching.)

I can change it from row-major to column-major by transposing it, which is exactly my bottleneck.

I can't meaningfully treat the disk as another level of cache, because disks fundamentally store and access data linearly, and that linearity is the bottleneck. Thankfully I'm using an SSD and not an HDD or magnetic tape; in those cases the performance would be even worse. Cache-oblivious transposes rely on accessing sub-matrices in a matrix-like way, but that's not what happens on a disk: I can lay the data out row-by-row, column-by-column, or some other way (like how JPEG stores smaller squares of the image), but I can't use more than one access pattern on the same stored matrix without incurring the costs of the underlying storage layout. If I could preprocess the input matrix up front to be more JPEG-like it would be faster, but unfortunately in this specific case I have to count that preprocessing towards the total time.

I don't mean to come off as rude, and I do appreciate the suggestion; I just wanted to clarify that optimizing random access to the disk is really what I wanted to know about. Transposing a row-major matrix was just a good example of the kind of disk access I need, and I was wondering whether there's a better general way to access very large files that is bound by the disk's IO rather than by syscalls or caches.

If redesigning the output file format is on the table, and you're doing a row-major to column-major transformation, also make sure to check out "columnar data formats" like Apache Arrow (preferred) and Parquet, which have extra features like built-in compression.

Quick overview: you're expected to output one or more RowGroups, each of which should hold on the order of a million or more records. Inside a RowGroup there is a separate chunk of storage for each column, containing that column's values for those rows.

Typically you write one RowGroup per file, and multi-rowgroup files are produced with the cat utility.
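To give a flavor of what writing such a file looks like, here's a rough sketch using the arrow and parquet crates (with parquet's arrow feature enabled); the exact signatures have shifted between crate versions, so treat this as illustrative rather than copy-paste:

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::{ArrayRef, Float64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn write_columns(path: &str, columns: Vec<Vec<f64>>) -> Result<(), Box<dyn std::error::Error>> {
    // One Arrow field per matrix column.
    let fields: Vec<Field> = (0..columns.len())
        .map(|i| Field::new(format!("c{i}"), DataType::Float64, false))
        .collect();
    let schema = Arc::new(Schema::new(fields));

    let arrays: Vec<ArrayRef> = columns
        .into_iter()
        .map(|c| Arc::new(Float64Array::from(c)) as ArrayRef)
        .collect();
    let batch = RecordBatch::try_new(schema.clone(), arrays)?;

    // ArrowWriter buffers the batch and writes it out as a row group on close.
    let mut writer = ArrowWriter::try_new(File::create(path)?, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```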

I suppose I should more clearly state the assumptions I'm making of the program:

  • The process is run as a user process (i.e. not root)
  • It can use multiple threads
  • The matrix file it's receiving is written in row-major format, any preprocessing that is done on the input file is counted towards the total time
  • The input file might be arbitrarily large while the computer's memory is a fixed size (i.e. a 1TB input matrix with only 32 GB of RAM); it's also a dense, random matrix (i.e. compression won't help)
  • The output file has to be a column-major matrix file in spirit, but the specifics of how it's stored can be altered
  • The program can make and manipulate any number of additional files as helper files (within reason)

I did try a version where I wrote all the would-be columns to separate /tmp/ files and then concatenated them together, and the performance was significantly worse than the other methods I prototyped. That being said, I'm sure Arrow has much better optimizations and is better thought out than my thrown-together prototype; I'll check it out.

Consider this: if your limiting factor is disk read/write speed, any compression at all gives you a speedup proportional to the compression ratio.

If the matrix is made of random elements, though, compression won't reduce the size at all.

My point is that SSD block sizes are on the order of 512 KiB. So yes, you can read those blocks randomly, but if you're only reading or writing 4 bytes -- or even a whole classic 512-byte sector -- out of the half-a-meg block, it's going to perform poorly no matter how low you make the software overheads, since you can only use a tiny fraction of the disk throughput.

So the usual answer is that you just don't store the whole thing in one giant row-major or column-major thing ever, but in blocks with more useful locality. Like really large images are usually stored not as one huge image, but as a whole bunch of tens-of-megabytes-each square chunks. (Sometimes slightly overlapping, depending on the expected needs of processing to be done.)

Yes, this is why https://en.wikipedia.org/wiki/Z-order_curve#Linear_algebra is a classic thing to do as matrices get large. The https://dl.acm.org/doi/10.1145/1583991.1584053 paper it links looks particularly relevant, even if your case is dense.

Well, what I'd probably do then is run a pre-pass to convert the row-major input file into a bunch of fits-easily-in-RAM chunks. That's easy, because each chunk fits in RAM: you can do linear reads of the row-major file where you use all of the data you read, and then write out each whole chunk at once, which is also disk-efficient.

Then if you need to convert those chunks into a column-major output format, that's easy too: you read a bunch of chunks at once (a different set from the ones you built while reading row-major), all of which fit in RAM, and write the columns from those chunks into exactly the place they go in the final file.
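A sketch of that first pass, assuming f64 elements, a square n x n matrix with n divisible by the tile side, and a scratch file for the tiles (all names invented):

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

const ELEM: usize = std::mem::size_of::<f64>();
const T: usize = 8192; // tile side; pick it so T rows (T * n elements) fit comfortably in RAM

/// Pass 1: sequentially read T rows at a time and write each T x T tile of
/// those rows contiguously into the scratch file (tiles stored in row-major
/// tile order).
fn pass1(input: &File, tiles: &File, n: usize) -> io::Result<()> {
    let mut rows = vec![0u8; T * n * ELEM];
    for tr in 0..n / T {
        // One big sequential read of rows tr*T .. tr*T + T.
        input.read_exact_at(&mut rows, (tr * T * n * ELEM) as u64)?;
        for tc in 0..n / T {
            // Gather tile (tr, tc) into one contiguous buffer.
            let mut tile = Vec::with_capacity(T * T * ELEM);
            for r in 0..T {
                let start = (r * n + tc * T) * ELEM;
                tile.extend_from_slice(&rows[start..start + T * ELEM]);
            }
            let tile_idx = tr * (n / T) + tc;
            tiles.write_all_at(&tile, (tile_idx * T * T * ELEM) as u64)?;
        }
    }
    Ok(())
}
```

The second pass is the mirror image: read the n/T tiles of one tile-column from the scratch file (each is a single large read and together they fit in RAM), transpose them in memory, and write T complete output rows sequentially.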

Each of those passes should fairly easily be able to saturate the disk, so with the two steps you'd get roughly 45% of the disk throughput for the whole operation, but that's way better than the 1% of disk throughput you get from trying to do really small writes into random blocks.

(And then I'd go try to convince the consumer of the columnar file that they actually want it chunked instead, and try to convince the producer to just give it to me chunked in the first place.)
