Signaling partial read/write of a caller-supplied buffer


#1

I am trying to design an API for implementation in Rust such that the API is convenient to call both from C and Rust. I know what I want the C API to look like, but I don’t have enough Rust background to design the Rust mapping properly.

I want an API function that reads from and writes to caller-allocated buffers. The function may read only the beginning of a supplied input buffer or write only to the beginning of a supplied output buffer, so it needs to be able to signal how much was read or written. Additionally, after such a partial read or write it should be convenient for the caller to pass the unread and unwritten tails of the buffers to the API again.

On the C side, I’d represent the input as two arguments: const uint8_t** src, const uint8_t* src_end.

On entry, src points to a pointer to the first byte of the input buffer; on return, that pointer points to the first byte that was not read. src_end points to the first byte past the end of the input buffer.

Likewise, I’d use uint8_t** dst, uint8_t* dst_end for output.

Therefore:
void foo(const uint8_t** src, const uint8_t* src_end, uint8_t** dst, uint8_t* dst_end)

That is, logically the API call partitions the input buffer into a read head and an unread tail and the output buffer into a written head and an unwritten tail. It is easy to call the API again with the tails as the new buffers.

Looking at std::io::Read, it seems that a caller-supplied output buffer isn’t a foreign concept in Rust. However, returning the number of bytes read leaves it to the caller to split the slice if the caller wants to read some more data into the part of the buffer that wasn’t already filled.

Is the answer for idiomatic Rust that the function foo in Rust should return the number of bytes read and written and leave it to the caller to readjust the buffers? That is:
fn foo(src: &[u8], dst: &mut [u8]) -> (usize, usize);
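For illustration, the caller-side bookkeeping with this signature might look like the sketch below. The body of foo (a bounded copy), the CHUNK limit, and the helper name drain are all hypothetical stand-ins, not part of any real API.

```rust
/// Hypothetical per-call limit, so that partial reads/writes actually occur.
const CHUNK: usize = 4;

/// Hypothetical stand-in for `foo`: copies at most `CHUNK` bytes per call
/// and reports how many bytes it read and how many it wrote.
fn foo(src: &[u8], dst: &mut [u8]) -> (usize, usize) {
    let n = src.len().min(dst.len()).min(CHUNK);
    dst[..n].copy_from_slice(&src[..n]);
    (n, n)
}

/// Calls `foo` until the input is exhausted or the output is full,
/// re-slicing both buffers by the running totals.
fn drain(src: &[u8], dst: &mut [u8]) -> (usize, usize) {
    let (mut read, mut written) = (0, 0);
    while read < src.len() && written < dst.len() {
        let (r, w) = foo(&src[read..], &mut dst[written..]);
        read += r;
        written += w;
    }
    (read, written)
}
```

The index math lives in the caller, but the written head is always available as &dst[..written] without any pointer comparisons.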

Or is it appropriate to have function foo split the slices? Maybe something like:
fn foo<'a, 'b>(src: &'a [u8], dst: &'b mut [u8]) -> (&'a [u8], &'a [u8], &'b mut [u8], &'b mut [u8]);

The problem I see with this is that after some number of calls to foo, where the slices passed in are views into the same underlying buffer, one eventually wants to identify the slice that is the original full buffer with the last tail slice returned by foo removed. As far as I can tell, you can’t perform an operation like that with slices, since slices don’t know that they are views into the same larger buffer. Is that correct?

I’d appreciate any advice regarding how to map the C function foo above into Rust in an idiomatic way.


#2

As you say, it’s the approach the standard library uses, so it’s not unidiomatic, at the very least.

Another approach that’s essentially isomorphic would be fn foo(src: &mut &[u8], dst: &mut &mut [u8]), which writes the modifications back through those pointers (closer to the C version). This alternate version is possibly a bit nicer to use in loops, since it doesn’t require manually reassigning variables.
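A minimal sketch of how such a function could advance both slices, assuming a hypothetical body that just copies up to 4 bytes per call. The shared slice can be re-sliced directly, while the mutable one has to be moved out with std::mem::take before it can be split.

```rust
use std::mem;

/// Hypothetical stand-in for `foo`: copies at most 4 bytes per call, then
/// advances both slices so they cover only the unread/unwritten tails.
fn foo(src: &mut &[u8], dst: &mut &mut [u8]) {
    let n = src.len().min(dst.len()).min(4);

    // A shared slice is `Copy`, so the input can simply be re-sliced.
    let (head, tail) = src.split_at(n);
    *src = tail;

    // A mutable slice cannot be re-sliced in place through `&mut &mut`,
    // so move it out first, split it, and store the tail back.
    let (out, rest) = mem::take(dst).split_at_mut(n);
    out.copy_from_slice(head);
    *dst = rest;
}
```

After each call, src and dst are exactly the tails to pass to the next call.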

It is correct that slices don’t know the extent of the memory they come from, but one can implement it manually with slicing (it may require looking at the raw memory locations, although only in a read-only way, i.e. it should be possible without unsafe), although this probably only works with &[u8].

One could make a slice type that knows the size of the buffer it came from, exposing the operations you want.
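As a rough sketch of that idea, a wrapper could keep the whole buffer plus a fill position, so both the written head and the unwritten tail stay reachable. The name OutBuf and its methods are hypothetical.

```rust
/// Hypothetical wrapper that keeps the whole buffer and tracks how much
/// of it has been filled.
struct OutBuf<'a> {
    buf: &'a mut [u8],
    filled: usize,
}

impl<'a> OutBuf<'a> {
    fn new(buf: &'a mut [u8]) -> Self {
        OutBuf { buf, filled: 0 }
    }

    /// The part written so far.
    fn written(&self) -> &[u8] {
        &self.buf[..self.filled]
    }

    /// The part still available for writing.
    fn unwritten(&mut self) -> &mut [u8] {
        &mut self.buf[self.filled..]
    }

    /// Records that `n` more bytes of the tail were written.
    fn advance(&mut self, n: usize) {
        assert!(n <= self.buf.len() - self.filled);
        self.filled += n;
    }
}
```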


#3

You can count how many bytes are left, and slice those out of the original buffer:

fn foo(src: &mut &[u8], dst: &mut &mut [u8]) { ... }

fn bar(src: &[u8], dst: &mut [u8]) {
    let remaining_dst_len;
    {  // Limit the scope of remaining_* borrows
        let mut remaining_src = &src[..];
        let mut remaining_dst = &mut dst[..];
        while something {
            // Advance the remaining_* views, not the original buffers.
            foo(&mut remaining_src, &mut remaining_dst);
        }
        remaining_dst_len = remaining_dst.len();
    }
    let written = &mut dst[..dst.len() - remaining_dst_len];
    ...
}

I don’t see how it’s necessary here, but you can write a variation of the subslice_offset method of the core::str::StrExt trait:
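A minimal sketch of such a variation for slices, assuming a hypothetical free function named subslice_offset. It compares the addresses cast to usize and divides by the element size, so the result is in elements rather than bytes.

```rust
/// Hypothetical helper: returns the offset (in elements) of `inner` within
/// `outer`, or `None` if `inner` does not lie inside `outer`. Only addresses
/// are compared, so no `unsafe` is needed.
fn subslice_offset<T>(outer: &[T], inner: &[T]) -> Option<usize> {
    let size = std::mem::size_of::<T>();
    if size == 0 {
        return None; // Addresses carry no information for zero-sized types.
    }
    let outer_start = outer.as_ptr() as usize;
    let inner_start = inner.as_ptr() as usize;
    // `None` if `inner` starts before `outer`.
    let byte_offset = inner_start.checked_sub(outer_start)?;
    let offset = byte_offset / size;
    if byte_offset % size == 0 && offset + inner.len() <= outer.len() {
        Some(offset)
    } else {
        None
    }
}
```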


#4

I hadn’t realized that Rust had reference references just like C has pointer pointers. Thank you.

It looks like the

fn foo(src: &mut &[u8], dst: &mut &mut [u8]) { ... }

option makes the loop look nice in the sense that there is no visible index math in the body of the loop, but then you end up having that sort of math later, because the function won’t directly tell you the part that was written–only the part that was not.

Now I’m not sure if making the loop look nice is the right thing. Also, I guess I’ll need to consider how much FFI boilerplate it takes to turn C const uint8_t** src, const uint8_t* src_end into Rust src: &mut &[u8]. Superficially to a newbie like me, it seems that just dealing with a pointer and a length on the C side of FFI makes the FFI stuff more obvious. :-/

Also, there’s an aspect that I didn’t mention in the question: my use case will need a variant where the destination is a UTF-8 buffer. It seems to me that the options are either to design for a non-slice destination or to use unsafe to turn a &mut [u8] into what safe code considers UTF-8, since it seems that there’s no way to write to a string slice in safe code.

As for pointer math and @huon suggesting that it would only work with u8, is casting to usize as in @SimonSapin’s example the only way to do pointer comparisons and distances in Rust? I.e. will the size of what the pointer points to always be lost for the purpose of computing the distance between two pointers?


#5

The function could also return the relevant usizes (or even slices pointing to the parts of the buffers that were just read/written), if they’re often used.

let src: *mut *const u8 = ...;
let src_end: *const u8 = ...;

// Dereferencing the raw pointers requires unsafe.
let len = unsafe { src_end as usize - *src as usize };
let slice = unsafe { std::slice::from_raw_parts(*src, len) };

or, the reverse,

let slice: &[u8] = ...;

let mut ptr = slice.as_ptr();
let src_end = unsafe { ptr.offset(slice.len() as isize) };
let src = &mut ptr;
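Put together, an FFI wrapper around the slice-based function might look like the sketch below. The names foo and foo_c and the copy-only body are hypothetical. The sketch assumes all pointers are non-null and that each start/end pair describes one valid buffer; a real export would also need the no_mangle attribute.

```rust
use std::{mem, slice};

/// Hypothetical safe core: copies as many bytes as fit and advances
/// both slices past the copied part.
fn foo(src: &mut &[u8], dst: &mut &mut [u8]) {
    let n = src.len().min(dst.len());
    let (head, tail) = src.split_at(n);
    *src = tail;
    let (out, rest) = mem::take(dst).split_at_mut(n);
    out.copy_from_slice(head);
    *dst = rest;
}

/// C-facing wrapper matching
/// `void foo(const uint8_t** src, const uint8_t* src_end,
///           uint8_t** dst, uint8_t* dst_end)`.
///
/// Safety: assumes all pointers are non-null and that `*src..src_end` and
/// `*dst..dst_end` are valid, non-overlapping buffers.
pub unsafe extern "C" fn foo_c(
    src: *mut *const u8,
    src_end: *const u8,
    dst: *mut *mut u8,
    dst_end: *mut u8,
) {
    unsafe {
        let src_len = src_end as usize - *src as usize;
        let dst_len = dst_end as usize - *dst as usize;
        let mut src_slice: &[u8] = slice::from_raw_parts(*src, src_len);
        let mut dst_slice: &mut [u8] = slice::from_raw_parts_mut(*dst, dst_len);
        foo(&mut src_slice, &mut dst_slice);
        // Write the advanced positions back for the C caller.
        *src = src_slice.as_ptr();
        *dst = dst_slice.as_mut_ptr();
    }
}
```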

Yeah, operating on a byte slice and then using from_utf8 and/or ..._unchecked is probably the best approach.

(Various parts of std::str assume valid UTF-8, and so it is undefined behaviour to have a str that contains invalid data, which is why writing raw bytes is disallowed in safe code.)

Sorry, I was quite unclear: it should work with a slice of any type &[T], I was just talking about the mutability (the u8 was only because that was the element type you were using). That is, it probably won’t work with &mut [T].


#6

Unfortunately, due to the variable-width nature of UTF-8 and str having a strong UTF-8 invariant, writing to a user-allocated str buffer is more difficult than with bytes. &mut str is a valid type, but there is not much you can do with it in safe code. The standard library provides slicing, AsciiExt::make_ascii_lowercase / make_ascii_uppercase, and I think that’s it.

&mut str can be transmuted to &mut [u8], but then you have to preserve UTF-8 well-formedness. So after* writing at the beginning you need to write up to 3 more bytes (e.g. with zeros) in case you partially erased a code point.

(*) Or even before, to be exception-safe if you call during writing user code that can panic.

Even for the user getting a UTF-8 buffer is not trivial. Something like std::iter::repeat('\0').take(SIZE).collect::<String>().

Another option might be to ask the user for a &'a mut [u8] buffer, use str::from_utf8_unchecked on a subslice you know is UTF-8 by construction, and return that &'a str.
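A minimal sketch of that option, assuming a hypothetical function write_str that encodes whole characters into the caller’s buffer and returns the written prefix as a string slice:

```rust
/// Hypothetical writer: encodes `text` into the caller's byte buffer,
/// stopping before the first character that does not fit, and returns
/// the written part as `&str`.
fn write_str<'a>(text: &str, buf: &'a mut [u8]) -> &'a str {
    let mut written = 0;
    for ch in text.chars() {
        let len = ch.len_utf8();
        if written + len > buf.len() {
            break;
        }
        ch.encode_utf8(&mut buf[written..written + len]);
        written += len;
    }
    // Sound only because every byte in `buf[..written]` was produced by
    // `encode_utf8` above, so the prefix is UTF-8 by construction.
    unsafe { std::str::from_utf8_unchecked(&buf[..written]) }
}
```

The bytes after the returned prefix are never exposed as a str, so they do not need to be zeroed.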

This was discussed in my proposal for Unicode stream similar to the std::io::Read and Write byte streams. Read has exactly this problem, and I don’t know of a great solution yet.


#7

I realize that string slices have to preserve UTF-8 validity, but as you note, it can be achieved by zeroing out at most 3 bytes after the sequence that was written.

As for the original question, since Rust wants to work with array indices and lengths instead of incremented pointers and sentinel pointers, I think I should just go with returning the number of code units read and written and leave it to C++ callers to deal with that.

My current sketch is:

enum DecoderResult {
   Overflow,
   Underflow,
   Malformed,
}

#[no_mangle]
pub extern fn Decoder_decode_to_utf16(decoder: &mut Decoder, src: *const u8, src_len: *mut usize, dst: *mut u16, dst_len: *mut usize, last: bool) -> DecoderResult {
    let src_slice = unsafe { std::slice::from_raw_parts(src, *src_len) };
    let dst_slice = unsafe { std::slice::from_raw_parts_mut(dst, *dst_len) };
    let (result, read, written) = decoder.decode_to_utf16(src_slice, dst_slice, last);
    unsafe {
        *src_len = read;
        *dst_len = written;
    }
    result
}

trait UtfUnit {}

impl UtfUnit for u8 {}

impl UtfUnit for u16 {}

trait Decoder {
    fn decode_to_utf16(&mut self, src: &[u8], dst: &mut [u16], last: bool) -> (DecoderResult, usize, usize) {
        self.decode(src, dst, last)
    }

    fn decode_to_utf8(&mut self, src: &[u8], dst: &mut [u8], last: bool) -> (DecoderResult, usize, usize) {
        self.decode(src, dst, last)
    }

    fn decode_to_str(&mut self, src: &[u8], dst: &mut str, last: bool) -> (DecoderResult, usize, usize) {
        let bytes: &mut [u8] = unsafe { std::mem::transmute(dst) };
        let (result, read, written) = self.decode_to_utf8(src, bytes, last);
        let len = bytes.len();
        let mut trail = written;
        while trail < len && ((bytes[trail] & 0xC0) == 0x80) {
            bytes[trail] = 0;
            trail += 1;
        }
        (result, read, written)
    }

    fn decode_to_string(&mut self, src: &[u8], dst: &mut String, last: bool) -> (DecoderResult, usize) {
        unsafe {
            let vec = dst.as_mut_vec();
            let old_len = vec.len();
            let capacity = vec.capacity();
            vec.set_len(capacity);
            let (result, read, written) = self.decode_to_utf8(src, &mut vec[old_len..], last);
            vec.set_len(old_len + written);
            (result, read)
        }
    }

    fn decode(&mut self, src: &[u8], dst: &mut [T], last: bool) -> (DecoderResult, usize, usize);
}

This doesn’t actually compile due to the combination of type erasure and generic parameters, but I guess resolving that is a new question topic.

Also, I’m not sure if all the output variants have use cases.

Thank you.


#8

(The new follow-up topic.)


#9

Note that slices are mostly defined in the standard library (src/libcore/slice.rs, src/libcollections/slice.rs). If you want to work with pointer pairs, you can define a type (or two, with a *Mut flavor) to hold them and implement Index and other traits. slice::Iter does some of this already.
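A rough sketch of such a pointer-pair type, read-only flavor only. The name PtrPair and its methods are hypothetical, and the sketch assumes T is not zero-sized.

```rust
use std::marker::PhantomData;
use std::ops::Index;

/// Hypothetical read-only view stored as a start/end pointer pair.
struct PtrPair<'a, T> {
    start: *const T,
    end: *const T,
    // Ties the raw pointers to the lifetime of the borrowed slice.
    _marker: PhantomData<&'a [T]>,
}

impl<'a, T> PtrPair<'a, T> {
    fn new(slice: &'a [T]) -> Self {
        let range = slice.as_ptr_range();
        PtrPair { start: range.start, end: range.end, _marker: PhantomData }
    }

    fn len(&self) -> usize {
        // Assumes `T` is not zero-sized; the pair cannot count such elements.
        (self.end as usize - self.start as usize) / std::mem::size_of::<T>()
    }

    /// Drops the first `n` elements, like advancing a C pointer.
    fn advance(&mut self, n: usize) {
        assert!(n <= self.len());
        // In bounds (or one past the end) thanks to the assertion above.
        self.start = unsafe { self.start.add(n) };
    }
}

impl<'a, T> Index<usize> for PtrPair<'a, T> {
    type Output = T;

    fn index(&self, i: usize) -> &T {
        assert!(i < self.len());
        // In bounds thanks to the assertion; the data lives for `'a`.
        unsafe { &*self.start.add(i) }
    }
}
```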