Reading a larger-than-memory file as chars

Is there a nice, idiomatic Rust way to read a file larger than memory as characters (char, UTF-32)?

I've written Rust code to do this but it's ~ 100 lines and I feel like there should be something simpler (I'm very new to Rust).

That code uses BufReader to read the file in blocks of UTF-8 bytes, then uses str::from_utf8 to check whether each block is valid (because a block might end in the middle of a multi-byte Unicode character). If there is a problem, it adjusts the slice (moving the incomplete bytes to the head of the next buffer), then uses the chars() iterator to get the data as char.

It seems like there should be some sort of slick pipeline to do this. I'm afraid I'm not really leveraging "the way of Rust", and I'd rather not develop bad habits from the get-go.

I don't think there is a super simple way, but maybe you can take inspiration from this code.

Since you have a buffered reader, you can write a somewhat cleaner solution: don't read in blocks, but read the exact number of bytes each character needs. If you read the definition section of the UTF-8 RFC, you'll see that the number of bytes per char is determined by the first byte. So you can do something along the following lines:

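use std::{error::Error, fs::File, io::{self, BufReader, Read}, path::Path, str};

fn read_file<P: AsRef<Path>>(path: P) -> Result<(), Box<dyn Error>> {
    let file = File::open(path)?;
    let mut reader = BufReader::new(file);
    let mut buf = [0_u8; 4]; // Max 4 bytes per char in UTF-8
    loop {
        // Read the leading byte; hitting EOF here means the file is done
        match reader.read_exact(&mut buf[..1]) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(()),
            Err(e) => return Err(e.into()),
        }
        // This is a simple lookup (not shown); it must return a value in 1..=4
        let char_len = find_char_len(buf[0]);
        // Read the remaining bytes of this char, if any
        reader.read_exact(&mut buf[1..char_len])?;
        // Now decode the char from exactly the bytes that belong to it
        let ch = str::from_utf8(&buf[..char_len])?.chars().next().unwrap();
        // Now do something with this char
    }
}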

Edit: I have changed the decoding logic in response to the comments from @bradleyharden and @H2CO3.
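
The find_char_len lookup used above is not shown in the post. A minimal sketch of what it could look like follows, assuming the standard UTF-8 leading-byte ranges. Returning Option is a choice made here so that a bad leading byte cannot turn into a bogus length; the snippet above expects a bare usize, so the caller would need to handle the None case.

```rust
/// Hypothetical helper (not from the thread): the number of bytes in the
/// UTF-8 sequence that starts with `first`, or `None` if `first` cannot
/// start a valid sequence.
fn find_char_len(first: u8) -> Option<usize> {
    match first {
        0x00..=0x7F => Some(1), // ASCII
        0xC2..=0xDF => Some(2), // 110xxxxx (0xC0 and 0xC1 would be overlong)
        0xE0..=0xEF => Some(3), // 1110xxxx
        0xF0..=0xF4 => Some(4), // 11110xxx, capped at U+10FFFF
        // Continuation bytes (0x80..=0xBF) and bytes that never occur in UTF-8
        _ => None,
    }
}
```

This only inspects the leading byte, so the full sequence still has to go through str::from_utf8 to catch bad continuation bytes, overlong forms, and surrogates.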

The utf8-chars crate looks like it might do the trick here.

@RedDocMD, that code is incorrect. You can't just pad UTF-8 bytes with zeros and get UTF-32. You converted the bytes to u32, but you didn't extract the code point.

For example, the following code gives [233, 0, 0, 0] for the char but [195, 169] for the UTF-8 bytes.

fn main() {
    dbg!(('é' as u32).to_le_bytes());
    dbg!("é".as_bytes());
}
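
To make "extract the code point" concrete, here is a small sketch that decodes a two-byte sequence by hand. The function name is made up for illustration, and in real code str::from_utf8 should do this work; the point is only that the payload bits have to be masked out and recombined rather than zero-padded.

```rust
/// Illustrative only: decode a 2-byte UTF-8 sequence (110xxxxx 10xxxxxx)
/// by masking out the payload bits and recombining them.
fn decode_two_byte(b0: u8, b1: u8) -> Option<char> {
    // Check the marker bits of the leading and continuation bytes.
    if b0 & 0xE0 != 0xC0 || b1 & 0xC0 != 0x80 {
        return None;
    }
    // 5 payload bits from the first byte, 6 from the second.
    let code_point = (u32::from(b0 & 0x1F) << 6) | u32::from(b1 & 0x3F);
    // Reject overlong encodings of code points that fit in one byte.
    if code_point < 0x80 {
        return None;
    }
    char::from_u32(code_point)
}
```

For the bytes [195, 169] this gives (3 << 6) | 41 = 233, which is the code point of 'é'.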

Hmm, looks like I misunderstood what the from_u32 method does.
Could you please tell me what you think about the following two statements:

  1. Any char can be encoded in UTF-8 with at most 4 bytes
  2. If so, how does one convert 4 bytes to a char using the std-lib

That would be more like str::from_utf8(bytes)?.chars().next(). Read more about UTF-8 encoding here. 4 bytes are only sufficient because code points are actually only 21 bits wide; the original UTF-8 design allowed sequences of up to 6 bytes, which covered a 31-bit space.
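
As a rough sketch of that suggestion (the wrapper name first_char is just for illustration), using only std:

```rust
use std::str;

/// Decode the first char of a UTF-8 byte slice.
/// Returns Ok(None) for empty input and Err for invalid UTF-8.
fn first_char(bytes: &[u8]) -> Result<Option<char>, str::Utf8Error> {
    Ok(str::from_utf8(bytes)?.chars().next())
}

/// The largest char is U+10FFFF, which fits in 21 bits and needs
/// 4 bytes in UTF-8.
fn max_char_facts() -> (u32, usize) {
    (char::MAX as u32, char::MAX.len_utf8())
}
```

Note that the slice passed in should contain only the bytes of the sequence; padding it with zeros adds NUL characters after the one you want.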

@RedDocMD, interestingly I just recently saw an article with pretty graphics explaining UTF-8. It's a really clever encoding that is backwards compatible with ASCII. I won't try to explain here, as this or other articles would do far better.

Thanks, this is a lot more readable than the RFC I had previously linked!

If the file won't fit into your computer's available memory then you obviously need to process it as you are reading.

If so, why don't you just read/process one line at a time? You shouldn't need to do the nitty gritty UTF-8 decoding by hand.

let f = File::open("...")?;
let mut reader = BufReader::new(f); // `mut` is needed for `read_line` further down

for line in reader.lines() {
  ...
}

// or if you want to avoid the extra allocations

let mut buffer = String::new();

loop {
  buffer.clear();
  let bytes_read = reader.read_line(&mut buffer)?;

  if bytes_read == 0 { break; }

  do_something_with_text(&buffer);
}
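
For reference, a self-contained sketch of the second variant, written against any BufRead so it can be tried on an in-memory buffer. The function name and the choice to count chars are only for illustration. read_line validates UTF-8 for you and returns an InvalidData error if a line is not valid.

```rust
use std::io::{self, BufRead};

/// Count the chars in `reader` one line at a time, reusing a single
/// String so that only the current line is held in memory.
fn count_chars_by_line<R: BufRead>(mut reader: R) -> io::Result<usize> {
    let mut buffer = String::new();
    let mut total = 0;
    loop {
        buffer.clear();
        // read_line keeps the trailing newline and returns 0 at EOF.
        if reader.read_line(&mut buffer)? == 0 {
            break;
        }
        total += buffer.chars().count();
    }
    Ok(total)
}
```

This still holds one whole line in memory at a time, so it only helps if the lines themselves are reasonably short.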


In the case of such big files, I wouldn't necessarily be brave enough to assume that it is at all composed of lines, or lines small enough to fit into memory. I'm not sure what OP was doing, but the original approach seems quite workable to me:

However:

It certainly doesn't require that much code. Here is my solution for the exact same logic; the body of the main loop is around 20 lines long:

use std::{io, str};

fn for_each_utf8_char<R, F>(mut reader: R, mut f: F) -> io::Result<()>
    where
        R: io::Read,
        F: FnMut(char)
{
    // 0x10000 would be more realistic; this is artificially
    // small, so as to achieve more than a single iteration.
    const BUFSIZE: usize = 0x10;
    
    let mut buf = vec![0x00_u8; BUFSIZE];
    let mut stub = [0x00_u8; 4];
    let mut stub_len = 0;
    
    loop {
        let slice = match reader.read(&mut buf[stub_len..]) {
            Ok(0) if stub_len > 0 => return Err(io::Error::new(io::ErrorKind::Other, "end of file is invalid UTF-8")),
            Ok(0) => break Ok(()),
            Ok(n) => &mut buf[..stub_len + n],
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue, // retry
            Err(e) => return Err(e),
        };

        slice[..stub_len].copy_from_slice(&stub[..stub_len]);
        stub_len = 0; // in case `from_utf8()` succeeds at the first try
        
        let s = str::from_utf8(slice).or_else(|e| {
            stub_len = slice.len() - e.valid_up_to();
            stub[..stub_len].copy_from_slice(&slice[e.valid_up_to()..]);
            str::from_utf8(&slice[..e.valid_up_to()])
        }).unwrap();
        
        dbg!(s.len(), stub_len);
        
        s.chars().for_each(&mut f);
    }
}

Thank you, all.

@H2CO3 I can already see places where I was doing more work than needed. I am going to study your code to learn "Tao Rust" from it.

This works for valid UTF-8 input, where the only invalid data you can hit is an incomplete sequence of 1-3 bytes at the end of your buffer, but it can panic here on invalid UTF-8:

            stub_len = slice.len() - e.valid_up_to();
            stub[..stub_len].copy_from_slice(&slice[e.valid_up_to()..]);

E.g. if I pass in b"\xffaaaa", it will try to copy 5 bytes into the 4-byte stub. I believe this covers it:

+           if stub_len < 4 {
                stub[..stub_len].copy_from_slice(&slice[e.valid_up_to()..]);
+           }
         s.chars().for_each(&mut f);

+        if stub_len >= 4 {
+            return Err(io::Error::new(io::ErrorKind::Other, "middle of file contains invalid UTF-8"));
+        }
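
Another way to tell the two cases apart, sketched here as an alternative rather than as the code from the posts above, is Utf8Error::error_len(): it returns None when the input merely ends in the middle of a sequence and Some(n) when the bytes are actually invalid. The function name and buffer size are assumptions for illustration, and the leftover bytes are kept at the front of the same buffer instead of in a separate stub.

```rust
use std::io::{self, Read};
use std::str;

/// Hypothetical variant: call `f` for every char in `reader`, reading in
/// fixed-size blocks and returning an error on invalid UTF-8.
fn for_each_char_checked<R, F>(mut reader: R, mut f: F) -> io::Result<()>
where
    R: Read,
    F: FnMut(char),
{
    const BUFSIZE: usize = 0x10000; // assumed block size
    let mut buf = vec![0_u8; BUFSIZE];
    // Bytes of an incomplete char carried over at the front of `buf` (0..=3).
    let mut pending = 0;

    loop {
        let n = match reader.read(&mut buf[pending..]) {
            Ok(0) if pending > 0 => {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    "input ends with an incomplete UTF-8 sequence",
                ))
            }
            Ok(0) => return Ok(()),
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue, // retry
            Err(e) => return Err(e),
        };
        let filled = pending + n;

        // Find the valid prefix and whether the rest is invalid or just incomplete.
        let (valid, invalid) = match str::from_utf8(&buf[..filled]) {
            Ok(_) => (filled, false),
            Err(e) => (e.valid_up_to(), e.error_len().is_some()),
        };

        let s = str::from_utf8(&buf[..valid]).expect("prefix was just validated");
        s.chars().for_each(&mut f);

        if invalid {
            return Err(io::Error::new(
                io::ErrorKind::InvalidData,
                "input contains invalid UTF-8",
            ));
        }

        // Move the incomplete tail to the front for the next read.
        buf.copy_within(valid..filled, 0);
        pending = filled - valid;
    }
}
```

With this shape, b"\xffaaaa" produces an InvalidData error instead of a panic, and a char split across two reads is decoded once the second read completes it.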