Stdin, stdout, stderr and encoding


#1

What are the best practices to handle the encoding of the terminal?

In the docs, writing to stdout is done by calling its write method with a byte-slice argument. Therefore I guess that Rust is unaware of the terminal encoding.

But I have seen a lot of code using the write! macro to write to stdout/stderr. Then I guess this is assuming a UTF-8 terminal. Is that the case?

How about reading from stdin?


#2

Yes, the String/str-based I/O in libstd assumes UTF-8. That includes using write!() with strings on any Write value.
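A minimal illustration of that: write!() on any Write sink emits the string's UTF-8 bytes unchanged, with no terminal-dependent conversion (a Vec<u8> stands in for stdout here).

```rust
use std::io::Write;

fn main() {
    // write!() works on any Write implementor; a Vec<u8> stands in for stdout.
    let mut buf: Vec<u8> = Vec::new();
    write!(buf, "héllo").unwrap();

    // The sink receives the string's UTF-8 bytes as-is:
    // 'é' arrives as the two-byte sequence 0xC3 0xA9.
    assert_eq!(buf, "héllo".as_bytes());
    assert_eq!(buf[1..3], [0xC3, 0xA9]);
}
```

If the terminal at the other end is not interpreting bytes as UTF-8, those two bytes for 'é' will be displayed as two wrong characters.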


#3

Anything that implements Read or Write, including stdin and stdout, takes byte slices/byte vecs to read into or write from. I/O is considered to be binary, and any higher level encoding or decoding needs to happen at a level above the I/O routines. For convenience, and because I/O on many platforms and protocols is now defined to be UTF-8, strings can be written out or read into directly as UTF-8, which doesn’t require any conversion as UTF-8 is the native internal string encoding.
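As a minimal sketch of that layering (with in-memory buffers standing in for stdin/stdout): the I/O layer moves raw bytes, and UTF-8 decoding is an explicit, fallible step on top of it.

```rust
use std::io::{Read, Write};

fn main() {
    // The I/O layer sees only bytes; these happen to be the UTF-8 for "café".
    let mut src: &[u8] = b"caf\xC3\xA9";
    let mut raw = Vec::new();
    src.read_to_end(&mut raw).unwrap();

    // Decoding happens above the I/O layer, and can fail on non-UTF-8 input.
    let text = String::from_utf8(raw).expect("input was not valid UTF-8");
    assert_eq!(text, "café");

    // Writing a &str back out needs no conversion: it is already UTF-8.
    let mut dst = Vec::new();
    dst.write_all(text.as_bytes()).unwrap();
    assert_eq!(dst, "café".as_bytes());
}
```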

The rust-encoding crate offers methods to encode and decode various different encodings, if you need to do any encoding or decoding of something other than UTF-8.
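To illustrate what such a conversion involves without pulling in the crate, here is a hand-rolled ISO-8859-1 (Latin-1) encoder; encode_latin1 is a made-up helper for this sketch, not part of rust-encoding's API.

```rust
// ISO-8859-1 maps Unicode scalar values 0x00..=0xFF directly to single bytes,
// so encoding is just a per-character range check; anything else is unmappable.
fn encode_latin1(s: &str) -> Result<Vec<u8>, char> {
    s.chars()
        .map(|c| if (c as u32) <= 0xFF { Ok(c as u32 as u8) } else { Err(c) })
        .collect()
}

fn main() {
    // "café" is five bytes in UTF-8 but four in Latin-1: 'é' shrinks to 0xE9.
    assert_eq!(encode_latin1("café").unwrap(), [0x63, 0x61, 0x66, 0xE9]);
    // Characters outside Latin-1 have no mapping and must be reported.
    assert_eq!(encode_latin1("日"), Err('日'));
}
```

The crate does the same kind of work for many encodings, including multi-byte ones where the mapping is table-driven rather than a range check.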

I don’t know of any libraries for querying the current locale to find out whether you need an encoding other than UTF-8. On Unix-like platforms, you can obtain a string naming the current character set using setlocale and nl_langinfo (both exposed by the libc crate), and then you would need to map that name to the appropriate encoding in the rust-encoding crate.

Actually, it would probably be better to use newlocale and nl_langinfo_l, as they don’t set or query process-global state the way setlocale and nl_langinfo do. They were standardized in POSIX.1-2008 and seem to be available on most platforms.
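Short of calling those C functions, a rough approximation on Unix is to parse the codeset suffix out of the locale environment variables yourself; codeset_of below is a hypothetical helper, and it deliberately ignores the locale aliases that nl_langinfo would resolve.

```rust
// POSIX locale names look like "lang_TERRITORY.CODESET@modifier",
// e.g. "en_US.UTF-8" or "ko_KR.EUC-KR"; the codeset part is optional.
fn codeset_of(locale: &str) -> Option<&str> {
    let after_dot = locale.split('.').nth(1)?;
    after_dot.split('@').next()
}

fn main() {
    // In a real program the input string would come from LC_ALL, LC_CTYPE,
    // or LANG, checked in that order of precedence.
    assert_eq!(codeset_of("en_US.UTF-8"), Some("UTF-8"));
    assert_eq!(codeset_of("ko_KR.EUC-KR@dict"), Some("EUC-KR"));
    assert_eq!(codeset_of("C"), None); // no codeset given
}
```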

These functions are apparently available on Windows as well, though there are probably other native Windows equivalents that would be better to use there. I don’t know much about properly detecting or interacting with locales on Windows, and in particular the intricacies of determining or changing character sets supported by the Windows terminal.

Besides I/O, the other place where encoding comes into play is in other OS and environment dependent places like environment variables and filenames. Rust provides OsString and OsStr to represent strings encoded in a platform dependent manner that may not be UTF-8 compatible, while providing convenience methods for treating them as UTF-8 if they are entirely UTF-8 compatible. If your locale is not using UTF-8, however, on Unix you can extract the underlying byte vector with OsStringExt::into_vec, and then decode it using the rust-encoding crate.
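A minimal Unix-only sketch of that escape hatch (the byte values are arbitrary, chosen to be invalid UTF-8):

```rust
// Unix-only: OsStringExt exposes the raw bytes behind an OsString.
use std::ffi::OsString;
use std::os::unix::ffi::OsStringExt;

fn main() {
    // Unix filenames are arbitrary bytes; 0xFF makes this one invalid UTF-8.
    let name = OsString::from_vec(vec![b'f', b'o', b'o', 0xFF]);

    // The UTF-8 convenience view fails, as it should.
    assert!(name.to_str().is_none());

    // into_vec() recovers the raw bytes, which a decoder for the locale's
    // actual charset (e.g. via rust-encoding) could then interpret.
    assert_eq!(name.into_vec(), [b'f', b'o', b'o', 0xFF]);
}
```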


#4

stdin and stdout are byte streams because it’s important to support use cases like gzip(1) in filter mode where raw binary data is being piped in or out of a program. Where Rust needs to assume an encoding on POSIX bytes (write! macro, filenames, etc) it always picks UTF-8 (but see OsStr for a way to bypass the decoding if that’s a problem).
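A sketch of such a filter, with in-memory buffers standing in for the real stdin/stdout locks so it can run anywhere:

```rust
use std::io::{self, Read, Write};

// A gzip(1)-style filter: shovel bytes from reader to writer without ever
// interpreting them as text in any encoding.
fn filter<R: Read, W: Write>(mut src: R, mut dst: W) -> io::Result<u64> {
    io::copy(&mut src, &mut dst)
}

fn main() {
    // Arbitrary binary data, including bytes that are not valid UTF-8.
    let data: &[u8] = &[0x1F, 0x8B, 0x08, 0x00, 0xFF];
    let mut out = Vec::new();

    let n = filter(data, &mut out).unwrap();
    assert_eq!(n, 5);
    assert_eq!(out, data);
    // A real filter would pass io::stdin().lock() and io::stdout().lock() instead.
}
```

If stdin/stdout forced a text encoding, a program like this could not round-trip binary data at all.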


#5

Thanks everybody for the info.

In summary: when run on a terminal whose encoding isn’t UTF-8, an application using write!() to print messages to stdout will produce garbled output, since the UTF-8 bytes are misinterpreted. It might be a good idea to add this to the docs, since most (all?) code that I have seen in the wild does this.

Additionally, wouldn’t it be good to have something like an EncodeWriter?

let stdout = io::stdout();
let sout = stdout.lock();
let mut ew = EncodeWriter::new(sout, encoding::all::WINDOWS_949);
// Transforms "Hello" from UTF-8 to WINDOWS_949 bytes and writes them to sout
ew.write_str("Hello");
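For what it’s worth, a std-only sketch of the idea follows; EncodeWriter and the latin1 stand-in encoder are both made up here (a real version would presumably take one of rust-encoding’s Encoding trait objects rather than a function pointer).

```rust
use std::io::{self, Write};

// Hypothetical EncodeWriter: wraps any Write sink and transcodes &str input
// through an encoder before forwarding the resulting bytes.
struct EncodeWriter<W: Write> {
    inner: W,
    encode: fn(&str) -> Result<Vec<u8>, io::Error>,
}

impl<W: Write> EncodeWriter<W> {
    fn new(inner: W, encode: fn(&str) -> Result<Vec<u8>, io::Error>) -> Self {
        EncodeWriter { inner, encode }
    }

    fn write_str(&mut self, s: &str) -> io::Result<()> {
        let bytes = (self.encode)(s)?;
        self.inner.write_all(&bytes)
    }
}

// Stand-in encoder: ISO-8859-1, where each char must fit in a single byte.
fn latin1(s: &str) -> Result<Vec<u8>, io::Error> {
    s.chars()
        .map(|c| {
            u8::try_from(c as u32)
                .map_err(|_| io::Error::new(io::ErrorKind::InvalidData, "unmappable char"))
        })
        .collect()
}

fn main() {
    let mut sink = Vec::new();
    let mut ew = EncodeWriter::new(&mut sink, latin1);
    ew.write_str("café").unwrap(); // 'é' becomes the single byte 0xE9
    assert_eq!(sink, vec![0x63, 0x61, 0x66, 0xE9]);
}
```

In real use the sink would be a locked stdout and the encoder would come from the locale detection discussed above.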