Checking for UTF-8 Parse-ability


#1

I am trying to load a large text file and scan it line by line. I’ve run into a problem where one of the lines sometimes contains invalid UTF-8. This causes the program to panic with:

thread 'main' panicked at 'Oops! Can not read the file...: Custom { kind: InvalidData, error: StringError("stream did not contain valid UTF-8") }

The way I am currently tackling this is shown below:

let mut data = String::new();

file.read_to_string(&mut data)
    .expect("Oops! Can not read the file...");

// The idea here is to remove duplicate lines then re-collect the input
let mut data_vec: Vec<&str> = data.split("\n").collect();

This was my “first attempt” and I’m sure this is not the best way to load and work with a giant text file 15MB-5GB in size, which is what I need.

The program panics at the file.read_to_string() step on certain files because read_to_string() requires the whole file to be valid UTF-8; not surprising. Files which do not contain invalid UTF-8 work fine.

My initial thought is to use read_to_end() instead, since this just places the data in a byte (u8) vector as-is, so there is no UTF-8 interpretation step that could fail.

But at some point, I do need to try and parse these “lines” as strings otherwise there will be no concept of a “line” and my program won’t work. Do you have advice on tackling the problem of loading the entire file as bytes (either as a byte vector or even a mmap if I need to go that way later on) and then “trying each line” to parse as a string? TL;DR: The roadblock for me is, the definition of a “line” assumes already that the data is a string, so if it’s not a valid string, how will we know where the newline char is?

Thanks.


#2

If you’re working with 5 GB sized files, you’ll have to pay attention to your allocations because it’s generally not possible to load the entire file into memory.

You can use BufReader to read a large file in chunks. Its read_until function (from the BufRead trait) gives you a Vec<u8> containing a single line, including the trailing delimiter, when used with the b'\n' delimiter. Note that you can reuse the same Vec for all read_until calls to avoid too many allocations. read_until appends to the vector rather than clearing it, so call clear() on it before each read; clear() keeps the allocated buffer, which therefore only grows to accommodate the longest line.
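A minimal sketch of the buffer-reuse pattern, assuming any reader that implements BufRead (for a file, that would be BufReader::new(File::open(path)?)). The function name longest_line_len is made up for illustration. The sketch clears the vector explicitly before each call, since read_until appends to whatever the buffer already holds.

```rust
use std::io::{self, BufRead};

/// Hypothetical helper: returns (number of lines, length in bytes of the
/// longest line), reusing one buffer for every `read_until` call.
fn longest_line_len<R: BufRead>(mut reader: R) -> io::Result<(usize, usize)> {
    let mut buf: Vec<u8> = Vec::new();
    let mut lines = 0usize;
    let mut longest = 0usize;
    loop {
        // `read_until` appends, so empty the buffer first.
        // `clear` keeps the allocation, so no reallocation per line.
        buf.clear();
        let n = reader.read_until(b'\n', &mut buf)?;
        if n == 0 {
            break; // end of input
        }
        // The delimiter is part of the buffer; drop it before measuring.
        if buf.last() == Some(&b'\n') {
            buf.pop();
        }
        lines += 1;
        longest = longest.max(buf.len());
    }
    Ok((lines, longest))
}
```

No UTF-8 decoding happens here, so invalid bytes cannot cause an error at this stage.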

Next, you can use the std::str::from_utf8 function on the vector and check its result. If the line is valid UTF-8, you’ll get a &str (note that this doesn’t do an allocation), and you can then process the string as you want. If it is not, you’ll get an Err and can skip or otherwise handle that line.
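Splitting on the raw byte b'\n' before validating is safe because, in UTF-8, every byte of a multi-byte sequence is 0x80 or above, so the byte 0x0A can only ever be an actual newline. Here is a rough sketch of the per-line check; the function name count_utf8_lines is hypothetical, and it only counts valid and invalid lines where a real program would process them.

```rust
use std::io::{self, BufRead};
use std::str;

/// Hypothetical helper: returns (valid, invalid) line counts.
fn count_utf8_lines<R: BufRead>(mut reader: R) -> io::Result<(usize, usize)> {
    let mut buf = Vec::new();
    let (mut valid, mut invalid) = (0, 0);
    loop {
        buf.clear();
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break; // end of input
        }
        // `from_utf8` borrows `buf`; it validates without copying.
        match str::from_utf8(&buf) {
            Ok(line) => {
                // Strip the line ending; `_text` is where processing would go.
                let _text = line.trim_end_matches(|c: char| c == '\n' || c == '\r');
                valid += 1;
            }
            Err(_) => invalid += 1, // this line is not valid UTF-8
        }
    }
    Ok((valid, invalid))
}
```

A single bad line now costs only that line rather than aborting the whole read.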

When you’re done with a line, you’ll have to let go of the &str because it borrows the buffer, which is going to be reused for the next line. You can use to_string to get an owned copy of the string if you want to store it for later.
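As an illustration of keeping owned copies, here is one possible way to do the de-duplication mentioned in the question. The function name unique_valid_lines and the choice of a HashSet are assumptions for this sketch, not something from the posts above. It silently skips lines that are not valid UTF-8.

```rust
use std::collections::HashSet;
use std::io::{self, BufRead};
use std::str;

/// Hypothetical helper: collects each distinct valid-UTF-8 line once,
/// in order of first appearance.
fn unique_valid_lines<R: BufRead>(mut reader: R) -> io::Result<Vec<String>> {
    let mut buf = Vec::new();
    let mut seen: HashSet<String> = HashSet::new();
    let mut unique = Vec::new();
    loop {
        buf.clear();
        if reader.read_until(b'\n', &mut buf)? == 0 {
            break; // end of input
        }
        if let Ok(line) = str::from_utf8(&buf) {
            let line = line.trim_end_matches(|c: char| c == '\n' || c == '\r');
            if !seen.contains(line) {
                // `line` borrows `buf`, so make owned copies to keep it.
                seen.insert(line.to_string());
                unique.push(line.to_string());
            }
        }
    }
    Ok(unique)
}
```

This keeps every distinct line in memory (twice, in this simple form), so for multi-gigabyte inputs with few duplicates it may still use a lot of RAM; storing a hash of each line instead of the line itself is one option to consider.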