Write::write vs. Write::write_vectored

I'm reading the man pages and the Rust docs now to try to understand the differences between Write::write and Write::write_vectored and I'm having trouble discerning why one would want Write::write_vectored. The docs:

My definitely-limited understanding is that with write(2), we syscall the kernel with a pointer to the first byte and a size_t and the kernel reads our userspace memory and does the internal write dance, e.g. write(my_fd, &my_buffer, my_buffer_length).

The difference I'm seeing with writev(2) (and Write::write_vectored) is that instead of a pointer and a length, it's essentially a variable number of buffers, each having their own length, and a count of how many buffers there are, e.g. writev(my_fd, &my_iovecs, iovec_count). Each entry in the &my_iovecs is an iovec struct with a pointer to the beginning of the data and a size_t count of how many bytes are there.
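To make the shape of the two calls concrete, here is a minimal sketch of the Rust side. The helper name `write_two_parts` is made up for illustration, and a `Vec<u8>` stands in for a real file descriptor; `IoSlice` plays the role of the `iovec` struct.

```rust
use std::io::{self, IoSlice, Write};

/// Hypothetical helper: hands `header` and `body` to the writer in one
/// `write_vectored` call. Returns how many bytes the writer accepted,
/// which may be fewer than the total, just as with a plain `write`.
fn write_two_parts<W: Write>(out: &mut W, header: &[u8], body: &[u8]) -> io::Result<usize> {
    // Each IoSlice is a (pointer, length) pair, like a C `iovec`.
    let bufs = [IoSlice::new(header), IoSlice::new(body)];
    out.write_vectored(&bufs)
}

fn main() -> io::Result<()> {
    // Vec<u8> implements Write, so it can stand in for a file here.
    let mut sink: Vec<u8> = Vec::new();
    let n = write_two_parts(&mut sink, b"len=5;", b"hello")?;
    println!("accepted {} bytes: {}", n, String::from_utf8_lossy(&sink));
    Ok(())
}
```

The two slices never have to be copied next to each other in memory; the writer sees them as one logical run of bytes.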

The main difference I can find:

The data transfers performed by readv() and writev() are atomic: the data written by writev() is written as a single block that is not intermingled with output from writes in other processes

If there's atomicity, I'd imagine that performance for writev(2) would be worse than write(2) due to requisite locking mechanisms.

Other than atomicity, when should I choose to use Write::write_vectored? In a previous question, it was recommended to me that a potential performance increase could be obtained by using Write::write_vectored and I'm struggling to understand how it would help.

I believe the primary way it helps is by limiting the number of syscalls you need to make.

Every call into the kernel has a non-trivial amount of overhead (you need to save registers, switch stacks, etc.) and by using write_vectored() you can avoid multiple syscalls when you already have all the data available (imagine you only need to write bits and pieces of a file that's been read into memory).

This ability to improve performance by writing in bulk is similar to why you might use a std::io::BufWriter when writing to a file.
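As a rough illustration of the batching effect, the sketch below counts how often the underlying writer is called. `CountingSink` is a made-up type whose `write` calls stand in for write(2) syscalls; the 8192-byte capacity is an arbitrary choice.

```rust
use std::io::{self, BufWriter, Write};

/// Hypothetical sink that counts how often `write` is invoked,
/// standing in for the number of write(2) syscalls.
struct CountingSink {
    calls: usize,
    bytes: usize,
}

impl Write for CountingSink {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.calls += 1;
        self.bytes += buf.len();
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Writes `n` single bytes straight to the sink.
fn unbuffered(n: usize) -> io::Result<CountingSink> {
    let mut sink = CountingSink { calls: 0, bytes: 0 };
    for _ in 0..n {
        sink.write_all(&[0u8])?;
    }
    Ok(sink)
}

/// Writes `n` single bytes through a BufWriter wrapped around the sink.
fn buffered(n: usize) -> io::Result<CountingSink> {
    let mut out = BufWriter::with_capacity(8192, CountingSink { calls: 0, bytes: 0 });
    for _ in 0..n {
        out.write_all(&[0u8])?;
    }
    // into_inner writes out the buffered bytes before returning the sink.
    out.into_inner().map_err(|e| e.into_error())
}

fn main() -> io::Result<()> {
    println!("unbuffered calls: {}", unbuffered(1024)?.calls);
    println!("buffered calls:   {}", buffered(1024)?.calls);
    Ok(())
}
```

Both versions deliver the same 1024 bytes; only the number of calls into the underlying writer differs.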

The atomicity guarantee is also useful for correctness. Although I'd argue you've got bigger problems if multiple threads/processes are trying to write to the same thing at the same time.

You may find this StackOverflow answer useful:

https://stackoverflow.com/a/10520793/7149940

Okay so after reading the SO answer a few times, I think I understand. I'll detail my findings. I'm going to just assume x86_64-unknown-linux-gnu for architecture, libc, and operating system.

write(2)

write(2) accepts three arguments:

  • a file descriptor
  • a pointer
  • a count of bytes

write(2) is perhaps the simplest way of writing data. As with any syscall, we set up the arguments and the syscall number and then trap into the kernel (on x86_64 Linux this is done with the syscall instruction rather than a software interrupt). Once the kernel is running, it reads the data from userspace memory using the pointer and the byte count, and writes that data to the file-like object that the file descriptor refers to.

What it does beyond that is an implementation detail that isn't worth going into, but when write(2) returns, it returns the number of bytes that were actually written, which can be fewer than requested. On error it returns -1 and sets errno.
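Because the return value can be smaller than the requested count, callers usually loop until everything has been accepted. In Rust that is what Write::write_all does for you. The sketch below hand-rolls the same idea for illustration; `write_fully` and the `Trickle` writer (which accepts at most three bytes per call) are made-up names.

```rust
use std::io::{self, ErrorKind, Write};

/// Hypothetical writer that accepts at most 3 bytes per call,
/// to mimic short writes.
struct Trickle(Vec<u8>);

impl Write for Trickle {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        let n = buf.len().min(3);
        self.0.extend_from_slice(&buf[..n]);
        Ok(n)
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Keeps calling `write` until all of `data` has been accepted.
fn write_fully<W: Write>(out: &mut W, mut data: &[u8]) -> io::Result<()> {
    while !data.is_empty() {
        match out.write(data) {
            // Zero bytes accepted: give up instead of spinning forever.
            Ok(0) => return Err(ErrorKind::WriteZero.into()),
            // Short write: drop the accepted prefix and try again.
            Ok(n) => data = &data[n..],
            // Interrupted before anything was written: retry.
            Err(e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut out = Trickle(Vec::new());
    write_fully(&mut out, b"hello, world")?;
    println!("{}", String::from_utf8_lossy(&out.0));
    Ok(())
}
```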

No guarantees about concurrency are made. If other processes or threads have a file descriptor to the same file-like object, bad things™ can happen if multiple writes occur.

Due to the API surface, the data we'd like to write into the file-like object has to be contiguous. If we want to write multiple different areas of our memory into the file-like object, we need multiple calls to write(2) with different pointers, or we need to copy data into one contiguous block of memory to pass to write(2).

As with any syscall, there's overhead involved in the context switch. Userspace registers need to be saved, kernel state needs to be loaded, the kernel does its thing, and then it does the inverse: it saves the kernel register state, restores the userspace register state, and switches back to userspace.

For this reason, it's best to call any syscall as few times as possible. Calling write(2) 1024 times with one byte of data involves 1023 more syscalls than calling write(2) with 1024 bytes of data.

writev(2)

writev(2) is a bit more complicated than write(2), and allows additional flexibility. It accepts the following arguments:

  • a file descriptor
  • a pointer to an array of iovec structs
  • a count of how many iovec structs to process

Each iovec struct has two fields:

  • a pointer to the beginning of some data
  • a count of bytes to read from that memory

It returns the number of bytes written if successful, and returns -1 and sets errno if not.

Just based on the API alone, it's clear that instead of making one call to write(2) per area of memory, writev(2) can write many different chunks of memory to the file-like object in one syscall. Additionally, according to the man page, writev(2) is atomic, meaning that its write operation does not intermingle with writes from other processes.


Now, let's use this info to make an informed decision.

When should I use writev(2)?

You might want to use writev(2) (or in Rust, Write::write_vectored) if either of the following conditions is met:

  • you need the write operation to be atomic
  • you have multiple, non-contiguous areas of memory to be written to the file-like object
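One caveat on the Rust side: the default implementation of Write::write_vectored simply calls write with the first non-empty buffer, so a writer only benefits if its type overrides the method. The sketch below contrasts a made-up `Plain` writer, which inherits the default, with `Vec<u8>`, which provides its own implementation; `vectored_len` is also a hypothetical helper.

```rust
use std::io::{self, IoSlice, Write};

/// Hypothetical writer that implements only `write` and `flush`,
/// so it inherits the default `write_vectored`.
struct Plain(Vec<u8>);

impl Write for Plain {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.0.extend_from_slice(buf);
        Ok(buf.len())
    }

    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

/// Offers two slices (3 + 5 bytes) in a single `write_vectored` call
/// and reports how many bytes the writer accepted.
fn vectored_len<W: Write>(out: &mut W) -> io::Result<usize> {
    let bufs = [IoSlice::new(b"abc"), IoSlice::new(b"defgh")];
    out.write_vectored(&bufs)
}

fn main() -> io::Result<()> {
    // The default implementation only forwards the first slice.
    println!("Plain accepted   {} bytes", vectored_len(&mut Plain(Vec::new()))?);
    // Vec<u8> overrides write_vectored and takes every slice.
    println!("Vec<u8> accepted {} bytes", vectored_len(&mut Vec::<u8>::new())?);
    Ok(())
}
```

Either way, the return value has to be checked: a vectored write can be partial just like a plain one.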

In my specific use-case, I might have benefited from vectored writes, but performance profiling should have been used to make this determination.

In my case, I had to write a u32 and then a &[u32] to a file, in little-endian. Write::write_all_vectored is not stable yet, so that wasn't an option. Write::write_vectored is a complicated method to use and there are unique errors to handle, so I fell back to Write::write_all, allocating a Vec<u8> to hold my data:

let mut buffer = Vec::with_capacity(size_of::<u32>() + (size_of::<u32>() * self.0.len()));

// insert the count
buffer.extend_from_slice(&(self.0.len() as u32).to_le_bytes());

// insert the offsets
for offset in &self.0 {
    buffer.extend_from_slice(&offset.to_le_bytes());
}

// dump
output.write_all(buffer.as_slice())

The specific Vec<u32> in question will generally hold ~6,000 u32s, which is around 24 kilobytes (base 10, not base 2), so the memory overhead isn't a big deal.

Other Solutions

An improvement on my solution would be to always allocate a fixed-size buffer, e.g. 4096 bytes, and loop through all the values, calling Write::write_all every time the buffer is full and once more at the end of the method for whatever remains. (Write::write_all rather than Write::write, because a single write may accept only part of the buffer.) This would mean that the method would only use around 4096 bytes of memory over baseline.
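A minimal sketch of that fixed-buffer approach, assuming a free function named `write_offsets` (a made-up name) that takes the offsets as a slice rather than the `self.0` field used in my snippet:

```rust
use std::io::{self, Write};

/// Hypothetical helper: encodes the count and then each offset as
/// little-endian u32s through a fixed 4096-byte stack buffer, writing
/// the buffer out whenever it fills up.
fn write_offsets<W: Write>(out: &mut W, offsets: &[u32]) -> io::Result<()> {
    if offsets.len() > u32::MAX as usize {
        return Err(io::Error::new(io::ErrorKind::InvalidInput, "too many offsets"));
    }
    let count = offsets.len() as u32;

    let mut buf = [0u8; 4096];
    let mut used = 0; // number of valid bytes currently in `buf`

    for value in std::iter::once(count).chain(offsets.iter().copied()) {
        // 4096 is a multiple of 4, so the buffer fills up exactly.
        if used == buf.len() {
            out.write_all(&buf)?;
            used = 0;
        }
        buf[used..used + 4].copy_from_slice(&value.to_le_bytes());
        used += 4;
    }

    // Write whatever is left over at the end.
    out.write_all(&buf[..used])
}

fn main() -> io::Result<()> {
    let mut out: Vec<u8> = Vec::new();
    write_offsets(&mut out, &[1, 2, 3])?;
    println!("{:?}", out);
    Ok(())
}
```

Wrapping the output in a std::io::BufWriter and calling write_all once per value would give much the same batching without the manual bookkeeping.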

A further improvement would involve doing some unsafe work with vectored writes if we're on little-endian hardware. I could construct two std::io::IoSlice instances, one pointing to 4 bytes on the stack where I'd cast the usize length to a u32 and one pointing to the first element in the Vec, and this would only add a constant amount of memory overhead.

How much overhead? Well, not much.

For the first IoSlice, we'd need two pointer-sized fields, the first for the actual pointer and the second to track the count of bytes, which would be four. This first IoSlice would simply point to the u32 representation of the vector length. Thus, on a 64-bit target, we'd need 16 bytes for the IoSlice itself plus 4 bytes for the u32 it points to, 20 bytes in total, in the best case scenario, which would be on little-endian hardware.

The second IoSlice would refer to much more data, but the IoSlice itself would be the same size: it would contain a pointer to the first u32 in the underlying vector storage and a usize set to the number of elements in that vector, times four, as each element is four bytes long. Again, this would only be possible on little-endian hardware.
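Here is a rough sketch of what that could look like. The function names are made up, the byte reinterpretation relies on the target being little-endian (hence the cfg attribute), and the loop handles partial writes by hand because Write::write_all_vectored is not stable.

```rust
use std::io::{self, IoSlice, Write};

/// Views a `&[u32]` as its raw bytes in native byte order.
fn as_native_bytes(values: &[u32]) -> &[u8] {
    // SAFETY: u32 has no padding bytes, u8 has alignment 1, and the
    // returned slice covers exactly the memory borrowed by `values`.
    unsafe { std::slice::from_raw_parts(values.as_ptr().cast::<u8>(), values.len() * 4) }
}

/// Hypothetical helper: writes the count and the offsets without
/// copying the offsets. Only compiled on little-endian targets, where
/// native byte order already is the desired on-disk order.
#[cfg(target_endian = "little")]
fn write_offsets_vectored<W: Write>(out: &mut W, offsets: &[u32]) -> io::Result<()> {
    let count = (offsets.len() as u32).to_le_bytes();
    let parts: [&[u8]; 2] = [&count, as_native_bytes(offsets)];
    let total = parts[0].len() + parts[1].len();
    let mut done = 0; // bytes accepted so far, across both parts

    while done < total {
        // Rebuild both slices, skipping what has already been written.
        let head = &parts[0][done.min(parts[0].len())..];
        let tail = &parts[1][done.saturating_sub(parts[0].len())..];
        let bufs = [IoSlice::new(head), IoSlice::new(tail)];
        match out.write_vectored(&bufs) {
            Ok(0) => return Err(io::ErrorKind::WriteZero.into()),
            Ok(n) => done += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => {}
            Err(e) => return Err(e),
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Size of one IoSlice on this target (pointer plus length).
    println!("IoSlice is {} bytes", std::mem::size_of::<IoSlice<'static>>());

    let mut out: Vec<u8> = Vec::new();
    write_offsets_vectored(&mut out, &[1, 2, 3])?;
    println!("{:?}", out);
    Ok(())
}
```

On a big-endian target the function does not exist, so the program fails to compile instead of silently writing the wrong byte order; a portable version would fall back to the copying approach.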

Conclusion

For my use-case, I wouldn't benefit very much from the optimizations that Write::write_vectored would provide, and the amount of work it would take to get it right would not be trivial. Hopefully this has been illuminating. It's been fascinating for me to learn about how this all works :tada:

There are 2 hard problems in computer science: cache invalidation, naming things, and off-by-1 errors. :grin:

@TomP d'oh!
