I have a binary file which records information in chunks. For example, each chunk is 128 byte, and I can read the file at any position that is the multiple of 128.
When parsing a large file (up to 10GB+) which is organized as above, I think reading in parallel is a good choice since each two chunks are not related. My question is, what is the best way to do such things?
The most intuitive way is like, if I have 8 threads available and 8GB file, each thread shall read one GB file, and in each thread, we can utilize BufReader, or some other techniques, which reads an appropriate size of data into memory (instead of 128 bytes each time) to minimize syscall count and not to OOM.
The first thing came into my mind is using rayon and itertools, while in my understanding, read -> chunks -> parallel may lead to unoptimized memory usage, while read -> parallel -> chunks may lead to unoptimized syscall count.
You could try memory mapping the file, which would prevent running out of memory, and leave the problem of reading and caching to the OS. But consider that mmap makes I/O error handling nearly impossible, and in Rust it can't safely be used with slices if any other process can modify the file.
You probably should read data in smaller chunks, something between 64KB and 1MB. Benchmark which size works best. It will vary between OSes, file systems, and disks. It's a balance between having too many syscalls, too much overhead in distributing work to a thread pool, and copying buffers larger than CPU caches.
If you care about HDDs, you'll probably want to stick to sequential read (read chunks on one thread, and send them via channel to a thread pool for processing).
On Unix systems you can use read_at or read_exact_at. The best choice for your application is going to depend on your hardware, the page size on the disk, and the balance of I/O to computation.
That should also be able to saturate SSD bandwidth if the read requests are large enough and IO is all the thread does and the CPU work is handed off to other threads. So yeah, this should be the default strategy unless there are special considerations.
You could use vectored IO to have a large read split into a bunch of smaller allocations if you want individual allocations to be released sooner.
But considering FIFO work queues I don't think that should be necessary. You can probably just do a big allocation for each read and then split that into smaller chunks (backed by a shared Arc<Vec<[u8]>>). In the end you should only need to have ~2-3 large allocations in flight at any give time.