Okay, so after reading the SO answer a few times, I think I understand. I'll detail my findings. I'm going to just assume `x86_64-unknown-linux-gnu` for the architecture, libc, and operating system.
## write(2)

`write(2)` accepts three arguments:

- a file descriptor
- a pointer to the data
- a count of bytes

`write(2)` is perhaps the simplest way of writing data. As with any syscall, we set up the arguments and the syscall ID, then trigger a switch into the kernel (via the `syscall` instruction on x86_64). Once the kernel is running, it attempts to read the data from userspace memory using the pointer and the byte count, and to insert that data into the file-like object that the file descriptor refers to.
What it does beyond that is implementation detail that isn't worth going into, but when `write(2)` returns, it yields the number of bytes that were successfully written, or -1 with `errno` set if there was an error.

No guarantees about concurrency are made: if other processes or threads hold a file descriptor to the same file-like object, bad things™ can happen when multiple writes occur.
Due to the API surface, the data we'd like to write into the file-like object has to be contiguous. If we want to write multiple different areas of our memory into the file-like object, we need multiple calls to `write(2)` with different pointers, or we need to copy the data into one contiguous block of memory to pass to `write(2)`.
As with any syscall, there's overhead involved in the context switch. Userspace registers need to be saved and kernel state loaded; the kernel does its thing, then does the inverse: it saves its own state and restores the userspace registers before switching back to userspace.

For this reason, it's best to invoke any given syscall as few times as possible. Calling `write(2)` 1024 times with one byte of data involves 1023 more syscalls than calling `write(2)` once with 1024 bytes of data.
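To make that concrete, here's a minimal sketch of the batched approach. The function name is mine, and an in-memory `Vec<u8>` stands in for a real `File`; with an actual file, each `write_all` call costs at least one syscall, so handing over the whole buffer at once is the cheap path:

```rust
use std::io::Write;

/// Hand the writer all of `data` in one call, rather than one call per byte.
fn write_once(out: &mut impl Write, data: &[u8]) -> std::io::Result<()> {
    out.write_all(data)
}

fn main() -> std::io::Result<()> {
    let data = [0u8; 1024];
    let mut out: Vec<u8> = Vec::new(); // in-memory stand-in for a File

    // One logical write: with a File, roughly one syscall.
    write_once(&mut out, &data)?;

    // The slow alternative would be 1024 separate one-byte writes:
    // for byte in &data {
    //     write_once(&mut out, std::slice::from_ref(byte))?;
    // }
    Ok(())
}
```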
## writev(2)

`writev(2)` is a bit more complicated than `write(2)`, and allows additional flexibility. It accepts the following arguments:

- a file descriptor
- a pointer to an array of `iovec` structs
- a count of how many `iovec` structs to process

Each `iovec` struct has two fields:

- a pointer to the beginning of some data
- a count of bytes to read from that memory

It returns the number of bytes written if successful, and sets `errno` if not.
Just based on the API alone, it's clear that instead of issuing multiple calls to `write(2)`, one per area of memory, `writev(2)` can write many different chunks of memory to the file-like object in one syscall. Additionally, `writev(2)` is atomic, meaning that its write operation does not intermingle with writes from other processes or threads.
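In Rust, this maps to `Write::write_vectored` with `std::io::IoSlice`. A minimal sketch, with a function name of my own choosing and an in-memory `Vec<u8>` standing in for a real `File`:

```rust
use std::io::{IoSlice, Write};

/// Gather two non-contiguous buffers into a single vectored write.
fn write_parts(out: &mut impl Write, header: &[u8], body: &[u8]) -> std::io::Result<usize> {
    // Each IoSlice corresponds to one iovec: a pointer plus a length.
    let slices = [IoSlice::new(header), IoSlice::new(body)];
    // With a real File on Linux, this is one writev(2) syscall.
    // Like write(2), a partial write is possible, so the caller
    // must check the returned byte count.
    out.write_vectored(&slices)
}

fn main() -> std::io::Result<()> {
    let mut out: Vec<u8> = Vec::new(); // in-memory stand-in for a File
    let written = write_parts(&mut out, &[1, 2, 3, 4], &[5, 6, 7, 8])?;
    assert_eq!(written, 8);
    assert_eq!(out, vec![1, 2, 3, 4, 5, 6, 7, 8]);
    Ok(())
}
```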
Now, let's use this info to make an informed decision.

## When should I use writev(2)?

You might want to use `writev(2)` (or in Rust, `Write::write_vectored`) if either of the following conditions is met:

- you need the write operation to be atomic
- you have multiple, non-contiguous areas of memory to be written to the file-like object
In my specific use-case, I might have benefited from vectored writes, but performance profiling should have been used to make this determination.
In my case, I had to write a `u32` and then a `&[u32]` to a file, in little-endian. `Write::write_all_vectored` is not stable yet, so that wasn't an option. `Write::write_vectored` is a complicated method to use, with unique errors to handle, so I deferred to `Write::write_all`, allocating a `Vec<u8>` to hold my data:
```rust
let mut buffer = Vec::with_capacity(size_of::<u32>() + (size_of::<u32>() * self.0.len()));
// insert the count
buffer.extend_from_slice(&(self.0.len() as u32).to_le_bytes());
// insert the offsets
for offset in &self.0 {
    buffer.extend_from_slice(&offset.to_le_bytes());
}
// dump
output.write_all(buffer.as_slice())
```
The specific `Vec<u32>` in question will generally hold ~6,000 `u32`s, which is around 24 kilobytes (base 10, not base 2), so the memory overhead isn't a big deal.
## Other Solutions
An improvement on my solution would be to always allocate a fixed-size buffer, e.g. 4096 bytes, loop through all the values, and call `Write::write_all` every time the buffer fills and once at the end of the method. This way the method would only use around 4096 bytes of memory over baseline.
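The fixed-buffer approach could be sketched like this. The function name is hypothetical, and I use `write_all` rather than `write` so that partial writes are handled; the 4096-byte capacity matches the figure above:

```rust
use std::io::Write;

/// Stream a count followed by little-endian u32 values through a
/// small fixed-size buffer, flushing whenever the buffer fills.
fn write_offsets_chunked(out: &mut impl Write, values: &[u32]) -> std::io::Result<()> {
    const CAP: usize = 4096;
    let mut buffer: Vec<u8> = Vec::with_capacity(CAP);

    // insert the count
    buffer.extend_from_slice(&(values.len() as u32).to_le_bytes());
    // insert the offsets, flushing whenever the next value wouldn't fit
    for value in values {
        if buffer.len() + 4 > CAP {
            out.write_all(&buffer)?;
            buffer.clear();
        }
        buffer.extend_from_slice(&value.to_le_bytes());
    }
    // flush whatever is left at the end of the method
    out.write_all(&buffer)
}

fn main() -> std::io::Result<()> {
    let mut out: Vec<u8> = Vec::new(); // in-memory stand-in for a File
    write_offsets_chunked(&mut out, &[1, 2, 3])?;
    assert_eq!(out.len(), 16); // 4-byte count + three 4-byte offsets
    Ok(())
}
```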
A further improvement would involve doing some unsafe work with vectored writes, provided we're on little-endian hardware. I could construct two `std::io::IoSlice` instances, one pointing to 4 bytes on the stack where I'd coerce the `usize` into a `u32`, and one pointing to the first element in the `Vec`; this would only add a constant amount of memory overhead.
How much overhead? Well, not much. For the first `IoSlice`, we'd need two pointer-sized variables, the first for the actual pointer and the second to track the count of bytes, which would be four. This first `IoSlice` would simply point to the `u32` representation of the vector's length and have a length of four bytes. Thus, for the first `IoSlice`, we'd need a total of 16-20 bytes in the best-case scenario, which requires little-endian hardware.
The second `IoSlice` would be much bigger: it would contain a pointer to the first `u32` in the underlying vector storage, and a `usize` set to the number of elements in that vector times four, as each element is four bytes long. Again, this would only be possible on little-endian hardware.
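A sketch of that zero-copy variant, with a hypothetical function name. The `unsafe` block reinterprets the `u32` storage as raw bytes, which only matches the on-disk little-endian format on little-endian targets, so this is gated accordingly:

```rust
use std::io::{IoSlice, Write};

/// Write a u32 count followed by the values themselves, without
/// copying the values into an intermediate buffer.
/// Only correct on little-endian targets, where a u32's in-memory
/// bytes already match the little-endian on-disk format.
fn write_offsets_vectored(out: &mut impl Write, values: &[u32]) -> std::io::Result<()> {
    // First IoSlice: 4 bytes on the stack holding the coerced count.
    let count_bytes = (values.len() as u32).to_le_bytes();

    // Second IoSlice: view the u32 storage as raw bytes, no copy.
    // SAFETY: `values` is valid for `len * 4` bytes, and u8 has no
    // alignment requirement.
    let data_bytes: &[u8] = unsafe {
        std::slice::from_raw_parts(values.as_ptr().cast::<u8>(), values.len() * 4)
    };

    let slices = [IoSlice::new(&count_bytes), IoSlice::new(data_bytes)];
    // Note: a real implementation must handle partial writes;
    // Write::write_all_vectored would do that once stabilized.
    out.write_vectored(&slices).map(|_| ())
}

fn main() -> std::io::Result<()> {
    assert!(cfg!(target_endian = "little"), "sketch assumes little-endian");
    let mut out: Vec<u8> = Vec::new(); // in-memory stand-in for a File
    write_offsets_vectored(&mut out, &[7, 8])?;
    assert_eq!(out, vec![2, 0, 0, 0, 7, 0, 0, 0, 8, 0, 0, 0]);
    Ok(())
}
```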
## Conclusion

For my use-case, I wouldn't benefit very much from the optimizations that `Write::write_vectored` would provide, and the amount of work it would take to get it right would not be trivial. Hopefully this has been illuminating; it's been fascinating for me to learn how all of this works.