Is there a canonical way to read a unicode file one 'char' at a time, respecting BOM

I'm very new to Rust so I may be thinking about this wrong or missing something simple.

Does Rust have a facility such that I can open a file as (Unicode) 'text', have it read and respect the BOM (if present), and let me iterate over the contents of that file, returning a Rust char for each 'character' (Unicode Scalar Value) found?

I'm imagining something like:

```rust
let mut chr: char;
// 'AsUnicode()' would indicate the dev expects the file to be a Unicode encoding,
// but may not know the number of bytes per scalar value or the endianness.
let f = File::open(filename).AsUnicode();
let mut rdr = BufReader::new(f);
loop {
    let mut x: [char; 1] = ['\0'];
    let n = rdr.read(&mut x);
    chr = x[0];
    // ...
}
```

I know the above is not valid Rust, but hopefully it's close enough to get the idea across. Most examples I see read the input/file as a line. This makes sense given that Rust strings are vectors of u8 which can be enumerated as bytes or chars, but what happens if I have a file with three 200 GB lines (i.e. lines longer than can reasonably fit in memory)?

I think as I understand Rust better I may be able to extend File and Read to do what I want, but I want to make sure that there isn't already a way to do this.

Well, the standard library includes the functionality to parse UTF-16, but it doesn't do the BOM handling part itself: it just takes an array of u16, and how you build each u16 from the individual bytes depends on the BOM.

You can certainly do this without reading the file line-by-line. Just read the data into a buffer and start decoding.
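For example, here is a minimal sketch of the UTF-16 case (a hypothetical input.txt, whole-file reads for simplicity, and the byte order checked by hand from the BOM; the no-BOM fallback is an arbitrary assumption):

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // Read the whole file for simplicity; a real reader would work in chunks.
    let bytes = fs::read("input.txt")?;

    // Check the BOM by hand: FF FE = little-endian, FE FF = big-endian.
    let (le, body) = match bytes.get(..2) {
        Some([0xFF, 0xFE]) => (true, &bytes[2..]),
        Some([0xFE, 0xFF]) => (false, &bytes[2..]),
        _ => (true, &bytes[..]), // no BOM: assume little-endian here
    };

    // Build u16 code units according to the detected byte order
    // (an odd trailing byte is silently ignored here).
    let units = body.chunks_exact(2).map(|pair| {
        let pair = [pair[0], pair[1]];
        if le { u16::from_le_bytes(pair) } else { u16::from_be_bytes(pair) }
    });

    // ...and let the standard library turn the code units into chars.
    for c in char::decode_utf16(units) {
        match c {
            Ok(ch) => print!("{ch}"),
            Err(e) => eprintln!("invalid surrogate: {e}"),
        }
    }
    Ok(())
}
```

char::decode_utf16 is the standard-library piece referred to above; everything before it is the by-hand BOM and byte-order handling.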

I'm not aware of a crate that does this, but it could definitely exist.

1 Like

There's no such thing as a Unicode file. Unicode is an abstract concept that has many different byte representations, with UTF-8 and UTF-16 being the most common.

For UTF-8 files you don't need anything special. read_to_string will read UTF-8 data.
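For instance (assuming a hypothetical input.txt that fits in memory):

```rust
use std::fs;

fn main() -> std::io::Result<()> {
    // read_to_string validates that the bytes are well-formed UTF-8
    // (it errors out otherwise). Note that a leading UTF-8 BOM (EF BB BF)
    // is not stripped; it shows up as U+FEFF.
    let text = fs::read_to_string("input.txt")?;

    // chars() then yields one Unicode Scalar Value (a Rust char) at a time.
    for c in text.chars() {
        print!("{c}");
    }
    Ok(())
}
```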

Reading the raw bytes directly into char will almost always give invalid results. It would de facto interpret the data as UTF-32, and nobody writes files using that encoding.

Generally, processing by char is a bad idea, because a char is not really a character. If you care about reading a limited amount of data, then work with bytes. If you care about displaying or editing human-recognizable characters, then in Unicode that's a big, complex, and messy subject, and you will at least need grapheme clusters. Do not use char if you can avoid it.
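To illustrate the difference, here is a small sketch using the third-party unicode-segmentation crate (my choice here; it is not mentioned above):

```rust
// Requires the unicode-segmentation crate in Cargo.toml.
use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let text = "e\u{301}🇦🇺"; // 'e' + combining accent, then a flag emoji
    // Four Unicode Scalar Values (chars)...
    println!("chars: {}", text.chars().count());
    // ...but only two user-perceived characters (extended grapheme clusters).
    println!("graphemes: {}", text.graphemes(true).count());
}
```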

2 Likes

I'm afraid you might be reading a bit too literally here. While there may be no such thing as a "Unicode file", I'm not aware of a term meaning "a binary file of bytes that contains Unicode-encoded data". US English can be loose in this regard, with phrases like "party pack" or "family bucket" indicating something other than the literal interpretation. By "Unicode file" I mean a file that may or may not contain a BOM, and until the file is opened and read, you don't know whether it's UTF-8/16/32 (this would be determined by the BOM or the lack thereof).

The intent of the invalid sample code was to imply that 'read' would return a char, not by reading in 4 bytes, but by reading one byte at a time until a valid Unicode Scalar Value was formed.

I'm sure that you are correct that it would be better to work in grapheme clusters, but that still leaves the basic problem of knowing how many bytes to read in to form a complete 'user-perceived character'.

Unicode is certainly big and complex and often hard to get right. The less I have to code around specific idiosyncrasies, the fewer mistakes I'll make.

You can't unambiguously differentiate UTF-16LE from UTF-32LE based on BOM: FF FE 00 00 is valid in both and encodes different texts.

That is true and not ideal, but that's how the BOM is defined, and to be fair it's only really supposed to indicate the byte order and not the encoding. Under the assumption that starting a text file with a null is probably rare, that could be an accepted limitation. A more explicit AsUTF16()/AsUTF32() could be provided.

In any case I think the answer to my question is "no, there is no canonical way of doing this". Hopefully, I can play with Rust this weekend and perhaps learn how to implement this in a 'rusty' way.

Maybe you are looking for the encoding crate? It is used by Servo to handle HTML pages in any encoding a browser needs to support according to the WHATWG standard. This includes UTF-8 and UTF-16, but seemingly not UTF-32. It has support for autodetecting which character encoding is used.
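For example, here is a minimal sketch using the closely related encoding_rs crate, which implements the same WHATWG Encoding Standard (the crate choice and file name are my assumptions, not something suggested above):

```rust
// Requires the encoding_rs crate in Cargo.toml.
use std::fs;

fn main() -> std::io::Result<()> {
    let bytes = fs::read("input.txt")?;

    // decode() sniffs a UTF-8 or UTF-16 BOM; if none is found it falls back
    // to the encoding it was called on (UTF-8 here). Malformed sequences are
    // replaced with U+FFFD, and `had_errors` reports whether that happened.
    let (text, used, had_errors) = encoding_rs::UTF_8.decode(&bytes);
    println!("decoded as {} (lossy: {had_errors})", used.name());

    for c in text.chars() {
        print!("{c}");
    }
    Ok(())
}
```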

1 Like

So, to answer your question directly: no, there's no such feature in Rust's standard library. You would have to implement the entirety of this logic yourself, the hard way, by loading the file byte by byte and applying the detection and conversion yourself.

But what you expect to be able to do is also not a good idea. UTF-8/16/32-encoded files are not required to start with a BOM. You could try to guess which one it is based on the bytes in the file, but there are cases where it's ambiguous, and poor guesses can lead to hilarious errors like the classic "Bush hid the facts" bug in Windows Notepad's encoding detection.

If you can dictate what encoding the files should have, then require UTF-8. You can then fill up an arbitrarily long buffer (as long as it's at least 4 bytes), and use str::from_utf8 to parse it as UTF-8. It may fail on an invalid sequence or an incomplete character at the end of the buffer, but when it fails, it will give you a Utf8Error that has error_len, which you can use to spot encoding errors, and valid_up_to, which you can use to create UTF-8 strings (by taking a slice of that length) and advance the buffer.
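A minimal sketch of that approach, assuming the input is required to be UTF-8 (the file name, buffer size, and error reporting are arbitrary choices):

```rust
use std::fs::File;
use std::io::Read;
use std::str;

fn main() -> std::io::Result<()> {
    let mut file = File::open("input.txt")?;
    let mut buf = [0u8; 8192]; // any size of at least 4 bytes works
    let mut len = 0;           // number of buffered bytes not yet decoded

    loop {
        let n = file.read(&mut buf[len..])?;
        let eof = n == 0;
        len += n;

        // Decode everything we can from the buffered bytes.
        let mut start = 0;
        while start < len {
            match str::from_utf8(&buf[start..len]) {
                Ok(valid) => {
                    valid.chars().for_each(|c| print!("{c}"));
                    start = len;
                }
                Err(e) => {
                    // The prefix up to valid_up_to() is well-formed UTF-8.
                    let valid = str::from_utf8(&buf[start..start + e.valid_up_to()]).unwrap();
                    valid.chars().for_each(|c| print!("{c}"));
                    start += e.valid_up_to();

                    match e.error_len() {
                        // Some(n): a genuine encoding error spanning n bytes.
                        Some(bad) => {
                            eprintln!("skipping {bad} invalid byte(s)");
                            start += bad;
                        }
                        // None: the buffer ends in the middle of a character;
                        // wait for more input (or report it if we hit EOF).
                        None => {
                            if eof {
                                eprintln!("file ends mid-character: {:?}", &buf[start..len]);
                                start = len;
                            }
                            break;
                        }
                    }
                }
            }
        }

        // Move any undecoded tail to the front of the buffer.
        buf.copy_within(start..len, 0);
        len -= start;

        if eof {
            return Ok(());
        }
    }
}
```

The important detail is that error_len() returning None means the chunk simply ended in the middle of a character, so the tail is kept and more bytes are read, whereas Some(n) marks a genuine encoding error of n bytes.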

2 Likes

I liked the "Bush hid the facts" story (reminds me a little of this epic SO answer html - RegEx match open tags except XHTML self-contained tags - Stack Overflow)

That last paragraph is excellent and solves one of my immediate problems. I had previously looked at utf8_char_width, but it's marked unstable. This thread has certainly given me things to consider.

Thanks!