Idiomatic way of reading lines in a safe manner

I'm writing a network server (with a text-based protocol for now). I want to retrieve the next line from an io::Read or io::BufRead. Unfortunately, BufRead::read_line potentially allocates an unlimited amount of memory if there is a lot of data retrieved without any \n. I found Read::take, which, in combination with Read::by_ref (or prior borrowing), can be used to limit the maximum number of bytes during a read operation from a reader implementing Read. But combining it with read_line won't work, of course:

use std::io;
fn serve<R: io::BufRead, W: io::Write>(
    &self,
    mut reader: R,
    mut writer: W,
) -> Result<(), Box<dyn std::error::Error>> {
    writeln!(writer, "Welcome.")?;
    let mut input = String::new();
    loop {
        input.clear();
        // the following line does not compile
        reader.by_ref().take(1024).read_line(&mut input)?;
        let line = input.trim();
        writeln!(writer, "You entered: {}", line)?;
    }
    Ok(())
}

That is because Read::take will return an io::Take, which implements io::Read but not io::BufRead. Also, take might cut the data off in the middle of a Unicode code point.

I cannot convert the Read into another (nested, temporary) BufRead, as then bytes might remain in that (temporary) buffer.

So my question is: What is the idiomatic way to read a line from a reader that implements io::Read or io::BufRead when it is not ensured that a new-line character occurs after a certain amount of data has been read?

Writing my own implementation of read_line with a maximum limit seems possible but requires managing buffers. I would prefer a way to safely read lines without having to worry about handling buffers manually (preferably by using only functions available in the standard library).

Many thanks for your help.

AFAICT, io::Take does implement BufRead, so your code actually compiles. However, it’s not Unicode-aware, so it may chop up a code point, with the effect that a call to read_line will fail. If that weren’t a problem, you could just use read_line on the Take as you do. You can even distinguish whether the 1024-byte limit was reached by checking for the presence of \n at the end of the resulting string.
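For illustration, a minimal sketch of that check might look like this (read_checked_line is a made-up helper name, not something from std or from this thread):

use std::io::{BufRead, Read};

// Read one line through a 1024-byte Take; a missing trailing '\n' means the
// limit was hit (or the stream ended) before the line was complete. Note that
// read_line still fails with InvalidData if the cut splits a code point.
fn read_checked_line<R: BufRead>(reader: &mut R) -> Result<String, Box<dyn std::error::Error>> {
    let mut input = String::new();
    reader.take(1024).read_line(&mut input)?;
    if !input.ends_with('\n') {
        return Err("Line length exceeded".into());
    }
    Ok(input)
}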

My first instinct here is that it should be somewhat straightforward to implement a (BufRead-only) UTF-8-char-boundary-aware Take-like wrapper that shortens its size by up to 3 bytes, trying to find a valid char boundary.
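For what it’s worth, the boundary-trimming part of such a wrapper could look roughly like this (an untested sketch; floor_char_boundary is a hypothetical helper, and it leaves full UTF-8 validation to whoever consumes the bytes):

/// Largest `end` such that `buf[..end]` does not stop in the middle of a
/// UTF-8 sequence, shortening by at most 3 bytes.
fn floor_char_boundary(buf: &[u8]) -> usize {
    let len = buf.len();
    // Scan the last (at most) four bytes for the start of the final character.
    for i in (len.saturating_sub(4)..len).rev() {
        let width = match buf[i] {
            0x00..=0x7F => 1, // ASCII
            0xC0..=0xDF => 2, // 2-byte sequence lead
            0xE0..=0xEF => 3, // 3-byte sequence lead
            0xF0..=0xF7 => 4, // 4-byte sequence lead
            _ => continue,    // continuation (or invalid) byte
        };
        // Keep the final character if it is complete, otherwise cut before it.
        return if i + width <= len { len } else { i };
    }
    len
}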

Alternatively, you could use BufRead::read_until to read into a Vec<u8>, and let the caller decide how to handle partial code points, either by trimming them, reassembling them across multiple reads, or skipping UTF-8 validation completely.

(Playground)
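
A rough sketch of that approach (the linked Playground code is not reproduced here; read_line_limited is just an illustrative name):

use std::io::{BufRead, Read};

// Read at most `limit` bytes, stopping after the first b'\n', and hand the
// raw bytes back to the caller.
fn read_line_limited<R: BufRead>(reader: &mut R, limit: u64) -> std::io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    // Take<&mut R> still implements BufRead, so read_until is available.
    reader.take(limit).read_until(b'\n', &mut buf)?;
    Ok(buf)
}

The caller can then check for a trailing b'\n', run std::str::from_utf8, or fall back to String::from_utf8_lossy, depending on how a split code point should be handled.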

Oh, I wasn't aware of that. You are right. Adding "use std::io::{self, Read, BufRead};" did the trick for me; I should have read the compiler error more closely. Sorry for the noise.

Yes, that should work too, as Take<T> indeed implements BufRead (at least where T: BufRead). Thank you for this alternative approach; I might use that.

Eh, “somewhat straightforward” turns out to be somewhat non-straightforward after all. I might try to continue my attempt at implementing such a wrapper tomorrow or later.

When my line exceeds the maximum number of bytes, I do not want to return two different errors depending on circumstances. Instead, I want to return the same error ("Line length exceeded") in either case, whether it happened on a code-point boundary, or not. So this is what I came up with in my second attempt:

fn serve<R: io::BufRead, W: io::Write>(
    &self,
    mut reader: R,
    mut writer: W,
) -> Result<(), Box<dyn std::error::Error>> {
    writeln!(writer, "Welcome.")?;
    let mut input: Vec<u8> = Vec::new();
    loop {
        input.clear();
        reader
            .by_ref()
            .take(16)
            .read_until(b'\n' as u8, &mut input)?;
        let line: &str = (if input.ends_with(b"\n") {
            std::str::from_utf8(&input).ok()
        } else {
            None
        })
        .ok_or("Line length exceeded")?;
        let line = line.trim();
        writeln!(writer, "You entered: {}", line)?;
    }
}

Here, whenever the line (including '\n') exceeds 16 bytes, the error "Line length exceeded" is returned (which could be replaced with any other error implementing std::error::Error). This also holds when there is no valid code-point boundary after the maximum number of bytes has been read. In addition, the check for Unicode code-point boundaries is only performed if a newline byte was received within the first 16 bytes.

However, that is a lot of code ("Vec::new", ".clear()", ".by_ref()", ".take(…)", ".read_until(…)", ".ends_with(…)", "from_utf8(…)", ".ok_or(…)") for the pretty common task of reading a line from an untrusted data stream in a safe manner. Maybe it would be worth adding something to the standard library to simplify this task, or am I missing another approach that reaches the same goal (i.e. a unified error on exceeded line length) with less custom code?

Just for comparison, here's a zero-allocation version I wrote for some real-world code:

https://github.com/mbrubeck/agate/blob/v3.1.0/src/main.rs#L354-L380
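
(Not the linked code, just a rough illustration of the zero-allocation idea: scan the reader’s own buffer via fill_buf()/consume() and copy into a caller-provided fixed buffer; read_line_into is a made-up name and this sketch is untested.)

use std::io::{self, BufRead};

// Read one '\n'-terminated line into a caller-provided fixed-size buffer,
// without any heap allocation, failing once the buffer would overflow.
fn read_line_into<'a, R: BufRead>(reader: &mut R, buf: &'a mut [u8]) -> io::Result<&'a str> {
    let mut len = 0;
    loop {
        let available = reader.fill_buf()?;
        if available.is_empty() {
            break; // EOF before a newline arrived
        }
        // Copy up to and including the newline, or the whole chunk if there is none.
        let newline = available.iter().position(|&b| b == b'\n');
        let take = newline.map_or(available.len(), |i| i + 1);
        if take > buf.len() - len {
            return Err(io::Error::new(io::ErrorKind::InvalidData, "line length exceeded"));
        }
        buf[len..len + take].copy_from_slice(&available[..take]);
        reader.consume(take);
        len += take;
        if newline.is_some() {
            break;
        }
    }
    std::str::from_utf8(&buf[..len])
        .map(str::trim_end)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}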

I’m seeing the b'\n' as u8, so I cannot help but mention that b'\n' already is of type u8.

It seems pretty weird to need an allocated Vec only to store no more than 16 bytes anyway. Thinking about how to solve this problem, I suppose a TakeUntil adapter (for a BufRead reader) might be useful. This one really is somewhat straightforward (unlike the Unicode-aware Take that I still haven’t fully implemented). The following code might work (untested):

use std::io::{self, BufRead, ErrorKind, Read, Write};

use memchr::memchr; // external crate for fast searching for a byte

pub struct TakeUntil<T> {
    inner: T,
    delim: u8,
    // Once the delimiter has been found: number of bytes (including the
    // delimiter itself) that may still be handed out before signalling EOF.
    found_delim_before: Option<usize>,
}
impl<T> TakeUntil<T> {
    pub fn into_inner(self) -> T {
        self.inner
    }
    pub fn found_delim(&self) -> bool {
        self.found_delim_before.is_some()
    }
}

pub trait BufReadExt: BufRead + Sized {
    fn take_until(self, delim: u8) -> TakeUntil<Self> {
        TakeUntil {
            inner: self,
            delim,
            found_delim_before: None,
        }
    }
}
impl<T: BufRead> BufReadExt for T {}
impl<T: BufRead> Read for TakeUntil<T> {
    fn read(&mut self, dest: &mut [u8]) -> io::Result<usize> {
        let src = self.fill_buf()?;
        let consumed = { dest }.write(src).unwrap(); // writing to &mut [u8] never fails
        self.consume(consumed);
        Ok(consumed)
    }
}
impl<T: BufRead> BufRead for TakeUntil<T> {
    fn fill_buf(&mut self) -> io::Result<&[u8]> {
        match self.found_delim_before {
            Some(0) => Ok(&[]),
            Some(delim_before) => {
                let buf = self.inner.fill_buf()?;
                Ok(&buf[..delim_before])
            }
            None => {
                let buf = self.inner.fill_buf()?;
                if let Some(ix) = memchr(self.delim, buf) {
                    let delim_before = ix + 1;
                    self.found_delim_before = Some(delim_before);
                    Ok(&buf[..delim_before])
                } else {
                    Ok(buf)
                }
            }
        }
    }

    fn consume(&mut self, amt: usize) {
        match &mut self.found_delim_before {
            Some(delim_before) => {
                assert!(
                    amt <= *delim_before,
                    "amount of bytes to be consumed too large"
                );
                self.inner.consume(amt);
                *delim_before -= amt;
            }
            None => self.inner.consume(amt),
        }
    }
}

Then your use case can just use a [u8; 16] buffer and write to it. Initially I thought about using io::copy to do that, but that one discards data from the reader when the writer is full, which we probably don’t want. I suppose we need a BufRead-to-Write version of copy that doesn’t do that (also untested):

pub fn buf_copy<R: ?Sized, W: ?Sized>(reader: &mut R, writer: &mut W) -> io::Result<u64>
where
    R: BufRead,
    W: Write,
{
    let mut read = 0;
    loop {
        match reader.fill_buf() {
            Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
            Ok([]) => return Ok(read),
            Ok(buf) => match writer.write(buf) {
                Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
                Err(e) => return Err(e),
                Ok(0) => {
                    return Err(ErrorKind::WriteZero.into());
                }
                Ok(n) => {
                    reader.consume(n);
                    read += n as u64;
                }
            },
        };
    }
}

Putting things together (also untested):


fn serve<R: io::BufRead, W: io::Write>(
    mut reader: R,
    mut writer: W,
) -> Result<(), Box<dyn std::error::Error>> {
    writeln!(writer, "Welcome.")?;
    let mut input = [0_u8; 16];
    loop {
        match buf_copy(&mut reader.by_ref().take_until(b'\n'), &mut &mut input[..]) {
            // ErrorKind::WriteZero happens if `input` is full while `reader` has bytes left
            Err(ref e) if e.kind() == ErrorKind::WriteZero => {
                return Err("Line length exceeded".into())
            }
            Err(e) => return Err(e.into()),
            Ok(len) => {
                let line = std::str::from_utf8(&input[..len as usize])?.trim();
                writeln!(writer, "You entered: {}", line)?;
            }
        }
    }
}

I’m curious if there are any existing crates that offer additional Reader and/or Writer adapters and extension traits, since it seems like std provides surprisingly little help.

It seems like your real-world code assumes that no data comes after the CRLF.

Yes, that's correct. This is parsing a Gemini request, which must be a single line ending with CRLF. The server stops reading at the end of the request, because checking for more bytes would simply wait indefinitely in the expected case where there are none.

Yes, my point is that the code you linked doesn't work if there are any bytes after the CRLF, because the CRLF could land in the middle of a read and be missed, so it doesn't directly port to other use cases.
