Joining fixed-size arrays into a flattened slice?

I have a legacy system which I'm writing a new Rust utility around. One of the many items to support is a binary database format consisting of little-endian unsigned 32-bit integers.

The file format in a simple table:

+-------+-----------+
| count | offset... |
+-------+-----------+

Here's some example code where I'm trying to write the serialization logic:

use std::io;
use std::mem::size_of;

type Offset = u32;

const OFFSET_SIZE: usize = size_of::<u32>();

struct Offsets(Vec<Offset>);


impl Offsets {
    fn serialize(&self, output: &mut impl io::Write) -> io::Result<usize> {
        // write out the length as u32 in little-endian, then each u32 in the
        // vector in little-endian
        output.write([self.0.len() as u32]
            .iter()
            .chain(self.0.as_slice())
            .map(|i| i.to_le_bytes())
            .flatten()) // FIXME can I get a &[u8] out of this?
    }
}

fn main() {
    let offsets = Offsets(vec![0, 8, 16]);
    
    let mut output: Vec<u8> = Vec::new();
    let bytes_written = offsets.serialize(&mut output).expect("certain death");
    
    assert_eq!(16, bytes_written);
    assert_eq!(16, output.len());
    
    assert_eq!(3u32.to_le_bytes(), output.as_slice()[0..4]);
    assert_eq!(0u32.to_le_bytes(), output.as_slice()[4..8]);
    assert_eq!(8u32.to_le_bytes(), output.as_slice()[8..12]);
    assert_eq!(16u32.to_le_bytes(), output.as_slice()[12..16]);
}

playground link

Essentially, I'd like to take the little-endian bytes of the length and each entry of the internal vector and write them to the output (&mut impl io::Write) with as little overhead as possible.

This fails to compile with the following error message:

error[E0277]: `[u8; 4]` is not an iterator
  --> src/main.rs:19:14
   |
19 |             .flatten()) // FIXME can I get a &[u8] out of this?
   |              ^^^^^^^ borrow the array with `&` or call `.iter()` on it to iterate over it
   |
   = help: the trait `std::iter::Iterator` is not implemented for `[u8; 4]`
   = note: arrays are not iterators, but slices like the following are: `&[1, 2, 3]`
   = note: required because of the requirements on the impl of `std::iter::IntoIterator` for `[u8; 4]`

error[E0308]: mismatched types
  --> src/main.rs:15:22
   |
15 |           output.write([self.0.len() as u32]
   |  ______________________^
16 | |             .iter()
17 | |             .chain(self.0.as_slice())
18 | |             .map(|i| i.to_le_bytes())
19 | |             .flatten()) // FIXME can I get a &[u8] out of this?
   | |______________________^ expected `&[u8]`, found struct `std::iter::Flatten`
   |
   = note: expected reference `&[u8]`
                 found struct `std::iter::Flatten<std::iter::Map<std::iter::Chain<std::slice::Iter<'_, u32>, std::slice::Iter<'_, u32>>, [closure@src/main.rs:18:18: 18:37]>>`

error: aborting due to 2 previous errors

Some errors have detailed explanations: E0277, E0308.
For more information about an error, try `rustc --explain E0277`.
error: could not compile `playground`.

To learn more, run the command again with --verbose.

How can I get a Vec<u8> out of these values or, better yet, an &[u8] that I can pass to the write method of the output? How can I convert u32::to_le_bytes's array results into slices that can be joined into a single &[u8]?

Bonus Points: I've been thinking through it, and I'm not sure if there's a way to do this without allocating.

If I'm on a little-endian system, in theory I could do some unsafe sorcery to create a slice that references the Vec's underlying data, probably using Vec::as_ptr, but I would still need to prefix it with the length.

At the very least, if I have to allocate a temporary vector, I could use Vec::with_capacity to allocate exactly the size I need.

Can anyone see a way to optimize this? I'm assuming that zero-copy optimization isn't possible.

You can't get a slice out of an iterator without allocating.

In this case it's even hard to get an iterator over u8, because to_le_bytes returns a fixed-size array, and arrays don't implement IntoIterator. If they had a working IntoIterator, then flat_map(|i| i.to_le_bytes()) would give you an iterator.

A simpler solution is to make many write_all calls, one for each number. To avoid overhead, you can also use a buffered writer.
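For example, a minimal sketch of that approach, reusing the Offsets type from your post:

use std::io;

type Offset = u32;

struct Offsets(Vec<Offset>);

impl Offsets {
    fn serialize(&self, output: &mut impl io::Write) -> io::Result<()> {
        // One write_all per number; each call loops internally until the
        // whole 4-byte slice has been accepted by the writer.
        output.write_all(&(self.0.len() as u32).to_le_bytes())?;
        for offset in &self.0 {
            output.write_all(&offset.to_le_bytes())?;
        }
        Ok(())
    }
}

Wrapping the writer in io::BufWriter at the call site then keeps the number of actual system calls low without changing this signature.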

BTW: were you aware that .write() may not write the data you give it, and using it without a retry loop is pretty much a bug?
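For illustration, the retry loop looks roughly like what write_all does for you internally (a simplified sketch of the std behavior):

use std::io::{self, Write};

// Keep calling write until every byte is accepted, retrying on
// Interrupted and treating a zero-length write as an error.
fn write_fully(output: &mut impl Write, mut buf: &[u8]) -> io::Result<()> {
    while !buf.is_empty() {
        match output.write(buf) {
            Ok(0) => return Err(io::ErrorKind::WriteZero.into()),
            Ok(n) => buf = &buf[n..],
            Err(e) if e.kind() == io::ErrorKind::Interrupted => {}
            Err(e) => return Err(e),
        }
    }
    Ok(())
}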


&[u8] must be contiguous, so you can't get it out of chained iterators without copying.

To write your data with the least copying and the fewest system calls, you would build an array of IoSlices and call write_vectored with them, over and over again until all the data has been written, with correct handling of ErrorKind::Interrupted. Until something like IoSlice::advance is stable, this will be horrible to get working right.

Or as kornel said, you can use a buffered writer with write_all and only copy each slice once.


You can't get a slice out of an iterator without allocating.

Gotcha. Yes, this makes logical sense: an Iterator doesn't allocate anything internally; it's just something that produces a stream of values.

In this case it's even hard to get an iterator over u8, because to_le_bytes returns a fixed-size array, and arrays don't implement IntoIterator. If they had a working IntoIterator, then flat_map(|i| i.to_le_bytes()) would give you an iterator.

Interesting. I see that IntoIterator is implemented for &'a [T; N] and &'a mut [T; N], does that mean that it's explicitly not implemented for [T; N] directly?
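For reference, iterating a borrowed array does work fine, which is what the &'a [T; N] impl provides (a quick sketch):

fn main() {
    let arr: [u8; 4] = [1, 2, 3, 4];

    // IntoIterator is implemented for &[T; N], so a for loop over a
    // borrowed array compiles and yields &u8 items.
    let mut sum = 0u32;
    for byte in &arr {
        sum += u32::from(*byte);
    }
    assert_eq!(sum, 10);
}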

A simpler solution is to make many write_all calls, one for each number. To avoid overhead, you can also use a buffered writer.

If I'm going to have to allocate anyway, should I just allocate a mutable vector, fill it with data, and then pass as_slice() to the write_all method, which should result in as few writes as possible, right?

If I accept BufWriter<impl Write> as a parameter, that would work, I guess, but then I'd be forcing callers to use a buffered writer, when in reality I don't care what the caller's underlying implementation is.

BTW: were you aware that .write() may not write the data you give it, and using it without a retry loop is pretty much a bug?

Yikes, I'm glad you caught this!

Based on feedback, it seems that the main viable option is to allocate a Vec, write values into it, and then pass it to Write::write_all as a slice:

use std::io;
use std::mem::size_of;

type Offset = u32;

struct Offsets(Vec<Offset>);

impl Offsets {
    fn serialize(&self, output: &mut impl io::Write) -> io::Result<()> {
        let mut buffer = Vec::with_capacity(size_of::<u32>() + (size_of::<u32>() * self.0.len()));

        // insert the count
        buffer.extend_from_slice(&(self.0.len() as u32).to_le_bytes());

        // insert the offsets
        for offset in &self.0 {
            buffer.extend_from_slice(&offset.to_le_bytes());
        }

        // dump
        output.write_all(buffer.as_slice())
    }
}

fn main() {
    let offsets = Offsets(vec![0, 8, 16]);

    let mut output: Vec<u8> = Vec::new();
    offsets.serialize(&mut output).expect("certain death");

    assert_eq!(16, output.len());

    assert_eq!(3u32.to_le_bytes(), output.as_slice()[0..4]);
    assert_eq!(0u32.to_le_bytes(), output.as_slice()[4..8]);
    assert_eq!(8u32.to_le_bytes(), output.as_slice()[8..12]);
    assert_eq!(16u32.to_le_bytes(), output.as_slice()[12..16]);
}

Performance characteristics:

  • Internally, u32::to_le_bytes is optimized so that on little-endian machines, no endian conversion happens (see the sketch after this list).
  • Within what we can control, only one allocation happens and then the reference to its underlying slice is passed to Write::write_all.
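A quick way to check the first point (a sketch; to_ne_bytes exposes the native in-memory byte order):

fn main() {
    let x: u32 = 0x0403_0201;

    // On a little-endian target, the little-endian encoding is already
    // the native in-memory byte order, so no conversion work is needed.
    #[cfg(target_endian = "little")]
    assert_eq!(x.to_le_bytes(), x.to_ne_bytes());

    assert_eq!(x.to_le_bytes(), [0x01, 0x02, 0x03, 0x04]);
}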

Are there any other optimizations that can be made within reason, or perhaps changes to make it more idiomatic?

To write your data with the least copying and the fewest system calls, you would build an array of IoSlices and call write_vectored with them, over and over again until all the data has been written, with correct handling of ErrorKind::Interrupted.

I'm interested; I'll start reading the docs and try to understand how to do this. Is the performance profile somehow better than the solution I just implemented, wherein I allocate a properly sized Vec, fill it, and then pass it as a slice to Write::write_all?

Using write_vectored will mean no copying of bytes from the data, and usually only one system call, which would be faster in theory. In practice, if you're writing [[3, 0, 0, 0], [65, 66, 67], [2, 0, 0, 0], [68, 69]] then write_vectored might return with any number of bytes actually written, in which case you need to figure out which IoSlice was partially written, and recreate the array.

For example, if write_vectored returns Ok(8) for the data above, you need to call it again with [[0, 0, 0], [68, 69]], and repeat that until all bytes have been written completely.
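A sketch of that resume logic, with write_all_chunks as a hypothetical helper (this is not a std API; it retries on Interrupted and treats a zero-length write as an error):

use std::io::{self, IoSlice, Write};

// Retry write_vectored until everything is written, rebuilding the
// IoSlice array each time so it starts at the first unwritten byte.
fn write_all_chunks(output: &mut impl Write, chunks: &[&[u8]]) -> io::Result<()> {
    let total: usize = chunks.iter().map(|c| c.len()).sum();
    let mut written = 0usize;
    while written < total {
        // Skip fully written chunks and trim the partially written one.
        let mut remaining = written;
        let mut slices = Vec::new();
        for chunk in chunks {
            if remaining >= chunk.len() {
                remaining -= chunk.len();
            } else {
                slices.push(IoSlice::new(&chunk[remaining..]));
                remaining = 0;
            }
        }
        match output.write_vectored(&slices) {
            Ok(0) => return Err(io::ErrorKind::WriteZero.into()),
            Ok(n) => written += n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => {}
            Err(e) => return Err(e),
        }
    }
    Ok(())
}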

I wouldn't want to write that code until I knew that copying into a Vec or using BufWriter was too slow.


The primary use case of write_vectored is when you have multiple buffers that you want to write and you don't want to copy the data into a single buffer. For example, if you are on a little-endian machine, you can use the memory of the Vec<Offset> directly as the data to write, but you also want the four-byte length to be written first.

Consider the following example, which uses the zerocopy crate to safely reinterpret the vector's underlying slice as a byte slice. The example also uses conditional compilation so that this strategy is only applied on little-endian machines.

use std::io;
use std::mem::size_of;

type Offset = u32;

struct Offsets(Vec<Offset>);

impl Offsets {
    #[cfg(target_endian = "little")]
    fn serialize(&self, mut output: impl io::Write) -> io::Result<()> {
        let len = (self.0.len() as u32).to_le_bytes();

        let offsets: &[u8] = zerocopy::AsBytes::as_bytes(self.0.as_slice());

        let to_write = len.len() + offsets.len();
        let mut written = 0;
        while written < to_write {
            if written < 4 {
                // Part of the length prefix is still pending: write the
                // rest of it and all of the offsets in one vectored call.
                let io_slices = [
                    io::IoSlice::new(&len[written..]),
                    io::IoSlice::new(offsets),
                ];

                written += output.write_vectored(&io_slices)?;
            } else {
                // The length prefix is done; write the remaining offset bytes.
                written += output.write(&offsets[(written - 4)..])?;
            }
            }
        }

        Ok(())
    }
    #[cfg(target_endian = "big")]
    fn serialize(&self, mut output: impl io::Write) -> io::Result<()> {
        let mut buffer = Vec::with_capacity(size_of::<u32>() + (size_of::<u32>() * self.0.len()));

        // insert the count
        buffer.extend_from_slice(&(self.0.len() as u32).to_le_bytes());

        // insert the offsets
        for offset in &self.0 {
            buffer.extend_from_slice(&offset.to_le_bytes());
        }

        // dump
        output.write_all(buffer.as_slice())
    }
}

I believe that this is a decent example of where write_vectored makes sense to use.

I also removed the mutable reference on the output argument. It isn't necessary, because a mutable reference to a Write implementation itself implements Write.
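A small illustration of that blanket impl (Vec<u8> standing in for a real writer):

use std::io::{self, Write};

// Takes the writer by value; because std provides
// `impl<W: Write + ?Sized> Write for &mut W`, a mutable borrow
// satisfies this just as well as an owned writer.
fn emit(mut output: impl Write) -> io::Result<()> {
    output.write_all(b"hi")
}

fn main() -> io::Result<()> {
    let mut buffer: Vec<u8> = Vec::new();
    emit(&mut buffer)?; // the caller keeps ownership of the buffer
    assert_eq!(buffer, b"hi");
    Ok(())
}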


Can't you just do:

.chain(self.0.as_slice()).map(|i| i.to_le_bytes().iter().cloned()).flatten()

Bytes can be cloned easily, so you don't need to into_iter it.

There's no IntoIterator impl on [T; N] directly, and this is a problem. It's tricky to add now: because of the deref to slices, it would be a breaking change.
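For the record, one way to flatten the 4-byte arrays without allocating is to bind each array and move it into an inner closure, so that no reference to a temporary escapes (a sketch):

fn main() {
    let offsets: Vec<u32> = vec![0, 8, 16];

    let bytes: Vec<u8> = [offsets.len() as u32]
        .iter()
        .chain(offsets.as_slice())
        .flat_map(|i| {
            // The array is moved into the inner closure, so the iterator
            // it returns owns its data instead of borrowing a temporary.
            let b = i.to_le_bytes();
            (0..4).map(move |j| b[j])
        })
        .collect();

    assert_eq!(16, bytes.len());
    assert_eq!(3u32.to_le_bytes(), bytes[0..4]);
}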


I'd use write_all directly instead of extend_from_slice. Buffering could be done with BufWriter instead.

Thank you so much for your reply! This is super helpful!

I'll probably just create an extension trait for Write and copy the write_all_vectored implementation from the merged PR and use that.

I just wanted to add that since BufWriter<W>: Write when W: Write, you could stick with the &mut impl io::Write signature and still pass in a BufWriter without the inner implementation knowing about the buffering. From a composability perspective, calling write_all separately for each offset and letting the outer layers take care of buffering (if system-call costs are an issue) seems nicer IMHO. It would also reuse the buffer allocation across the serialization process instead of allocating a one-off temporary buffer for the offsets.
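A sketch of that composition, with the caller choosing to buffer (Vec<u8> standing in for a real file):

use std::io::{self, BufWriter, Write};

// The serializer only sees `impl Write` and issues one write_all per value.
fn serialize(output: &mut impl Write, offsets: &[u32]) -> io::Result<()> {
    output.write_all(&(offsets.len() as u32).to_le_bytes())?;
    for offset in offsets {
        output.write_all(&offset.to_le_bytes())?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // The caller decides to buffer; the serializer never knows.
    let mut writer = BufWriter::new(Vec::new());
    serialize(&mut writer, &[0, 8, 16])?;
    writer.flush()?;
    Ok(())
}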

EDIT: Depending on the numbers, buffered I/O could even be more effective than vectored I/O, since a single system call can write out multiple instances of Offset, whereas even vectored I/O needs at least one system call per instance.

Yes, this is why I accept impl Write as a parameter: so I don't have to care about the implementation. However, I designed the internal usage not to assume that the writer is buffered, and to try to ship everything in one syscall if possible, at the expense of memory overhead. Other solutions are possible, but one Offsets will be written to exactly one file, and we're talking about roughly 24 KB of data :stuck_out_tongue:

The options I considered:

  • one big, linear allocation and then one call to write_all
  • a fixed-size allocation and a number of calls to write_all
  • one vectored write with no allocations

If you want more info, I collected a lot of stuff into this gist.
