Writing to Stdout/Stderr on Windows

I don't have a Windows machine to test on, and I'm worried that my code may cause an unexpected error when writing to StdoutLock<'_>/StderrLock<'_> on Windows. According to the docs:

When operating in a console, the Windows implementation of this stream does not support non-UTF-8 byte sequences. Attempting to write bytes that are not valid UTF-8 will return an error.

When I only write strs, do I have to worry about the internal buffer of StdoutLock<'_> or BufWriter<StderrLock<'_>> getting flushed to the underlying stream in a way that splits a multi-byte Unicode scalar value (USV) and causes an error? Even if I were to use LineWriter for StderrLock<'_>, the docs seem to suggest that the internal buffer may be flushed before a newline, which seemingly opens up the possibility of splitting a USV (emphasis added):

Like BufWriter, a LineWriter’s buffer will also be flushed when the LineWriter goes out of scope or when its internal buffer is full.

Must I use my own buffer that guarantees any flushes that occur do so on a USV boundary?

It looks like no. I tried flushing half a character at a time, and this didn't cause any problems. Only flushing an invalid byte caused issues.
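For anyone who wants to repeat that experiment, here is a minimal sketch of the kind of test I mean. The helper name write_in_byte_chunks is made up for this post and is not part of std; it writes a string a few bytes at a time and flushes after every chunk, so multi-byte characters are deliberately split across flushes.

```rust
use std::io::{self, Write};

/// Hypothetical helper: writes `text` to `out` in chunks of `chunk_len` bytes,
/// flushing after each chunk so multi-byte characters get split across flushes.
/// `chunk_len` must be non-zero, because `chunks(0)` panics.
fn write_in_byte_chunks<W: Write>(out: &mut W, text: &str, chunk_len: usize) -> io::Result<()> {
    for chunk in text.as_bytes().chunks(chunk_len) {
        out.write_all(chunk)?;
        out.flush()?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut stdout = io::stdout().lock();
    // "é" and "ö" are two bytes each in UTF-8, so a chunk length of 1
    // flushes each of them half a character at a time.
    write_in_byte_chunks(&mut stdout, "héllo wörld\n", 1)
}
```

The helper is generic over Write, so the same function can be pointed at a Vec<u8> to check that the bytes arrive intact regardless of where the chunk boundaries fall.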

The source for writing to stdout on Windows is here: rust/library/std/src/sys/stdio/windows.rs at commit 3d8c1c1fc077d04658de63261d8ce2903546db13 in rust-lang/rust on GitHub

fn write_console_utf16(
    data: &[u8],
    incomplete_utf8: &mut IncompleteUtf8,
    handle: c::HANDLE,
) -> io::Result<usize> {
    if incomplete_utf8.len > 0 {
        assert!(
            incomplete_utf8.len < 4,
            "Unexpected number of bytes for incomplete UTF-8 codepoint."
        );
        if data[0] >> 6 != 0b10 {
            // not a continuation byte - reject
            incomplete_utf8.len = 0;
            return Err(io::const_error!(
                io::ErrorKind::InvalidData,
                "Windows stdio in console mode does not support writing non-UTF-8 byte sequences",
            ));
        }
        incomplete_utf8.bytes[incomplete_utf8.len as usize] = data[0];
        incomplete_utf8.len += 1;
        let char_width = utf8_char_width(incomplete_utf8.bytes[0]);
        if (incomplete_utf8.len as usize) < char_width {
            // more bytes needed
            return Ok(1);
        }
        let s = str::from_utf8(&incomplete_utf8.bytes[0..incomplete_utf8.len as usize]);
        incomplete_utf8.len = 0;
        match s {
            Ok(s) => {
                assert_eq!(char_width, s.len());
                let written = write_valid_utf8_to_console(handle, s)?;
                assert_eq!(written, s.len()); // guaranteed by write_valid_utf8_to_console() for single codepoint writes
                return Ok(1);
            }
            Err(_) => {
                return Err(io::const_error!(
                    io::ErrorKind::InvalidData,
                    "Windows stdio in console mode does not support writing non-UTF-8 byte sequences",
                ));
            }
        }
    }

    // As the console is meant for presenting text, we assume bytes of `data` are encoded as UTF-8,
    // which needs to be encoded as UTF-16.
    //
    // If the data is not valid UTF-8 we write out as many bytes as are valid.
    // If the first byte is invalid it is either first byte of a multi-byte sequence but the
    // provided byte slice is too short or it is the first byte of an invalid multi-byte sequence.
    let len = cmp::min(data.len(), MAX_BUFFER_SIZE / 2);
    let utf8 = match str::from_utf8(&data[..len]) {
        Ok(s) => s,
        Err(ref e) if e.valid_up_to() == 0 => {
            let first_byte_char_width = utf8_char_width(data[0]);
            if first_byte_char_width > 1 && data.len() < first_byte_char_width {
                incomplete_utf8.bytes[0] = data[0];
                incomplete_utf8.len = 1;
                return Ok(1);
            } else {
                return Err(io::const_error!(
                    io::ErrorKind::InvalidData,
                    "Windows stdio in console mode does not support writing non-UTF-8 byte sequences",
                ));
            }
        }
        Err(e) => str::from_utf8(&data[..e.valid_up_to()]).unwrap(),
    };

    write_valid_utf8_to_console(handle, utf8)
}

There is a process-global IncompleteUtf8 that stores any partial UTF-8 character. Any complete characters are written normally (lines 170, 171, and 185 of the linked file: the start of the match on str::from_utf8, its Ok(s) arm, and its final Err(e) arm), which leaves any trailing partial character at the beginning of the next write. If the slice starts with a partial character and is too short to complete it, only its first byte is stored in the IncompleteUtf8 and the function returns Ok(1) without writing anything (lines 172-184, the Err arm guarded by e.valid_up_to() == 0). When you write again, this function appends bytes from the new write into the IncompleteUtf8 one at a time until either the character is complete or an invalid byte is found. If the character is complete, it is written to the real stdout as UTF-16. If an invalid byte is found, an error is returned. Either way, IncompleteUtf8 is reset.
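To make that state machine easier to play with on any OS, here is a simplified, portable model of it. This is only a sketch, not the std implementation: Pending, char_width, model_write, and model_write_all are names invented for this post, the "console" is just a String, and the MAX_BUFFER_SIZE cap and the UTF-16 conversion are left out.

```rust
/// Hypothetical stand-in for std's private `IncompleteUtf8`.
#[derive(Default)]
struct Pending {
    bytes: [u8; 4],
    len: usize,
}

/// Sequence length implied by a leading byte (0 if it cannot start a sequence).
fn char_width(first: u8) -> usize {
    match first {
        0x00..=0x7F => 1,
        0xC2..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF4 => 4,
        _ => 0,
    }
}

/// Models one `write` call: returns how many bytes were consumed and appends
/// any completed text to `console` (a String standing in for the real console).
fn model_write(data: &[u8], pending: &mut Pending, console: &mut String) -> Result<usize, &'static str> {
    if data.is_empty() {
        return Ok(0);
    }
    if pending.len > 0 {
        if data[0] >> 6 != 0b10 {
            // Not a continuation byte: reject and reset the stored prefix.
            pending.len = 0;
            return Err("non-UTF-8 byte sequence");
        }
        pending.bytes[pending.len] = data[0];
        pending.len += 1;
        if pending.len < char_width(pending.bytes[0]) {
            return Ok(1); // still incomplete, take one byte and wait for more
        }
        let len = pending.len;
        pending.len = 0;
        return match std::str::from_utf8(&pending.bytes[..len]) {
            Ok(s) => {
                console.push_str(s);
                Ok(1)
            }
            Err(_) => Err("non-UTF-8 byte sequence"),
        };
    }
    match std::str::from_utf8(data) {
        Ok(s) => {
            console.push_str(s);
            Ok(s.len())
        }
        Err(e) if e.valid_up_to() == 0 => {
            let width = char_width(data[0]);
            if width > 1 && data.len() < width {
                // The slice starts with a truncated character: stash its first byte.
                pending.bytes[0] = data[0];
                pending.len = 1;
                Ok(1)
            } else {
                Err("non-UTF-8 byte sequence")
            }
        }
        Err(e) => {
            // Write the valid prefix; the caller retries with the remainder.
            let valid = std::str::from_utf8(&data[..e.valid_up_to()]).unwrap();
            console.push_str(valid);
            Ok(valid.len())
        }
    }
}

/// Repeats `model_write` until all of `data` is consumed, like `write_all`.
fn model_write_all(mut data: &[u8], pending: &mut Pending, console: &mut String) -> Result<(), &'static str> {
    while !data.is_empty() {
        let n = model_write(data, pending, console)?;
        data = &data[n..];
    }
    Ok(())
}

fn main() {
    let mut pending = Pending::default();
    let mut console = String::new();
    // "€" is E2 82 AC in UTF-8; deliver it across two separate writes.
    model_write_all(&[0xE2], &mut pending, &mut console).unwrap();
    model_write_all(&[0x82, 0xAC, b'!'], &mut pending, &mut console).unwrap();
    println!("{console}");
}
```

In this model a split character only becomes visible once its last byte arrives, and a non-continuation byte that shows up while a prefix is stored produces an error and clears the stored prefix, which matches the behavior described above.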

I found one way to make this happen somewhat accidentally: If you're writing to unlocked stdout from multiple threads, they may race flushes, causing an error if the flush is in the middle of a character. But this is going to look bad on any OS.
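A deterministic way to see why such a race is a problem, without actually spawning threads: interleave the bytes of two multi-byte characters by hand and check whether the result is still valid UTF-8. The interleave function below is a made-up helper for illustration and only models the worst case where the two writers alternate byte by byte.

```rust
/// Hypothetical helper: interleaves two byte sequences one byte at a time,
/// a worst-case stand-in for two unsynchronized writers alternating.
fn interleave(a: &[u8], b: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(a.len() + b.len());
    let mut a_iter = a.iter();
    let mut b_iter = b.iter();
    loop {
        match (a_iter.next(), b_iter.next()) {
            (None, None) => break,
            (x, y) => {
                // Each Option contributes zero or one byte.
                out.extend(x);
                out.extend(y);
            }
        }
    }
    out
}

fn main() {
    // "é" is C3 A9 and "ü" is C3 BC, so the mix is C3 C3 A9 BC:
    // a second lead byte arrives where a continuation byte is expected.
    let mixed = interleave("é".as_bytes(), "ü".as_bytes());
    println!("still valid UTF-8: {}", String::from_utf8(mixed).is_ok());
}
```

Holding the lock (io::stdout().lock()) for the duration of each complete write is the usual way to keep whole strings from being interleaved like this.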


This is such a great answer filled with the actual source code, an explanation of the code, and confirmation of the behavior when actually run on a machine. Thank you very much. I appreciate it.
