Hello!
I am looking for a concurrency construct that allows multiple writers to append to a file, while also enabling readers to read from it concurrently, with minimal contention and synchronization overhead. So, it is something like a (maybe lockless) queue but is actually a file. The only thing that a writer can do is append. And I don't want to take a Read-Write lock on it as I want to allow readers to read. What concurrency solution in rust can be used to support these requirements? I feel like lockless constructs, maybe, are required here.
These requirements are from the Google File System paper. The extracts from the paper that motivate my question are:
Third, most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent. Once written, the files are only read, and often only sequentially. A variety of data share these characteristics. Some may constitute large repositories that data analysis programs scan through. Some may be data streams continuously generated by running applications. Some may be archival data. Some may be intermediate results produced on one machine and processed on another, whether simultaneously or later in time. Given this access pattern on huge files, appending becomes the focus of performance optimization and atomicity guarantees, while caching data blocks in the client loses its appeal.
The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer
queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the
file simultaneously.
I feel that this question is not rust specific, but I don't know any other place which in my opinion is more suitable than here, to ask the question (Please do point me to a better place for the question if you have any in mind).
For me this problem has some similarity to sequential memory allocation and cache coherency protocols. The major difficulties for the parallel append operations seem to be
the thread-safe space-reservation mechanism, which allocates the logical file block for a write and records the length (and maybe start) of the allocation in a shared ring buffer for use during the following reader-EOF update phase, and
the thread-safe completed-write update mechanism, which advances the current end-of-file (EOF) for readers, in which the updating thread has to advance the logical reader-EOF not only for itself but also iteratively advance that EOF point past completed writes from other threads.
My initial approach for each would be to use atomic CAS updates in spinlock loops. (See std/sync/atomic/ for the basics.) The two obvious complications are
some systems (e.g., some architecture variants of ARM, and of RISC-V and other higher-tier architectures) don't support atomics, and
when atomics are supported, the largest atomic may be too small to update the file position in a single CAS operation.
I'm not sure all of your requirements are possible to satisfy in a portable way, but barc achieves part of this:
Single-writer sessions are guaranteed safe with N concurrent readers (in or out of process).
So it supports more concurrency than a RwLock by allowing readers in parallel with a single writer. The writer only appends. Records include a state-aware header and the readers have additional logic to ignore partial records at the end of a file, and later continue when those records are complete/flushed.
One way to extend this might be to allow multiple writer threads (if not handles) would be to queue up writes via a MPSC channel and have a single background writer thread.
I've thought some more about this problem. The critical decision seems to be one of specifying the ordering of subsequent reads for each set of concurrent writes. The naïve approach is to specify that reads occur in the order in which file space was reserved, which would allow sequential reading of the released portion of the file by each reading thread. However, that approach is susceptible to starvation if a writing thread errors out unrecoverably after successfully making the reservation but before completing and releasing the target write region (e.g., due to a cosmic ray hit that puts the underlying thread/core hardware into an unrecoverable state).
The alternative approach is to order subsequent reads in the order that the completed writes are posted to some thread-safe list. That approach avoids reader starvation, but at the cost of a separately-maintained list of completely-written file regions that may not be sequential. In this latter case there are a number of obvious optimizations that reduce the size of that auxiliary list, but I don't see any way to eliminate the list completely.
For this latter approach, if you are willing to
impose just a partial order, but not a complete ordering, on how different threads read the records, so that different reading threads sometimes read the records with mutually-inconsistent orderings, and
allocate some space at the beginning of each record and write that part of the record twice
then the thread-safe list can be intrinsic to the file. Otherwise it appears that you need to use an mpmc channel to convey record completion information from writers to readers, or implement a separate parallel thread-safe write-buffer-status list whose span is more extensive than the set of allocated-but-not-yet-released write buffers.
Usually these designs has a max cap on the file size, after which new data goes in new files.
So, if you restrict the file size to say 1 GB, you can pre-reserve that disk space.
After that any writer, need to increment the offset counter and start writing to that file offset.
You will need multiple file streams for each writer to write to different offset of same files.
Readers should always be reading data in order in which it is written. This maintains the read order for future readers as well.
This means a slow writer will stop read progress on the readers.
Which thread does this marking, and where is the mark recorded? Remember the problem: the writing thread that has claimed this part of the file is about to have an unexpected stroke (e.g., unrecoverable hardware error). If a separate monitoring thread is required, then each write transaction will need to have a maximum delay from claiming a part of the file to completing the write, and will need to use an mpsc channel to notify the monitoring thread 1) of its claim of part of the file, and 2) of its completion of the write. Such a mechanism leads to obvious scaling problems – what if the hardware is a later, larger version of the new AMD Ryzen Threadripper 3990X 64-core 128-thread 40G-transistor single-package CPU.
Addendum: Even the simple monitoring-thread approach does not work, because the writing thread can have its "stroke" immediately after the atomic operation that initiates the claim of its file region, before signaling the monitoring thread. The same timing problem applies if the writing thread protocol is 1) claim a region, 2) write an "incomplete" marker to the region, 3) write the rest of the claimed region, 4) update the initially-written marker to "complete"; a "stroke" between steps 1) and 2) leaves the region claimed and unmarked.
One obvious solution is to use a single write-region allocator shared across all threads that also serves as the monitoring thread. However, that likely becomes the concurency bottleneck.
What happens at unrecoverable hardware error ? Is writing thread stuck or exits unexpectedly or returns with an error code ?
Some thoughts on handling these situations:
Expectation:
If a write fails, we want readers to ignore that part of the file and move to next block.
Error handling:
if a write fails in good manner with error code, we can write ignore block in that file block. If this also fails, we should just stop everything [assert / fail fast] because system is already in some bad state.
if a write is just stuck.. then eventually waiting readers / time - monitoring thread will assert after timeout. Again fail fast.
That works if 1) the file regions being written are all fixed-size blocks, and 2) there is a maximum time limit for any thread to write a block. I've been sketching a state-machine approach for the writer that works even when claimed file regions are different sizes (because they are writing different-sized records to the file). The key underlying realization is that any thread that fails to complete has claimed [Edit: or was in the process of claiming] only a single region of the file (of whatever size) when it hangs.
Whatever approach is chosen, I come back to the issue I raised in my prior post about
If reads of file regions have to be sequential in the order the file space was claimed (which ordering is implied in your immediately-prior post), and a writing thread hangs or simply takes too long to complete the write, then either all readers wait forever (are starved) or the readers have to 1) know an upper limit on how long a write should take, and 2) know the size of the incomplete record and skip over it.
We don't need this if we write header information about the record in file as well. Readers will be reading in chunks of 4k page size for efficiency and then deserialize the records.
[Edited]
I think what i have proposed now looks like an append only database, with header info. The original requirement was append only file.
Thinking about it a little more, since underlying filesystem write in block size, if our writes are not aligned to that, we will probably corrupt data. So, fixed block write is more of a requirement than a choice.