I am trying to load a large text file and scan it line by line. I've run into a problem where sometimes there appears to be invalid UTF-8 in one of the lines, which causes the program to panic with:
thread 'main' panicked at 'Oops! Can not read the file...: Custom { kind: InvalidData, error: StringError("stream did not contain valid UTF-8") }
The way I am currently tackling this is shown below:
use std::io::Read; // brings read_to_string() into scope

// `file` is an already-opened std::fs::File
let mut data = String::new();
file.read_to_string(&mut data)
    .expect("Oops! Can not read the file...");
// The idea here is to remove duplicate lines then re-collect the input
let mut data_vec: Vec<&str> = data.split('\n').collect();
This was my "first attempt", and I'm sure it is not the best way to load and work with a giant text file, anywhere from 15 MB to 5 GB in size, which is what I need to handle.
The program panics at the file.read_to_string() step on certain files because not every byte of the large file parses as UTF-8; not surprising. Files that contain no invalid UTF-8 work fine.
My initial thought is to use read_to_end() instead, since that just places the data into a byte (u8) vector as-is, so there is no real way for it to "fail" to interpret UTF-8.
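For concreteness, a minimal sketch of that step (the "input.txt" path is just a placeholder):

use std::fs::File;
use std::io::Read;

// Read the raw bytes with no UTF-8 validation; this can only fail
// on an actual I/O error, never on malformed text.
let mut file = File::open("input.txt").expect("Oops! Can not open the file...");
let mut bytes: Vec<u8> = Vec::new();
file.read_to_end(&mut bytes)
    .expect("Oops! Can not read the file...");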
But at some point I do need to try to parse these "lines" as strings, otherwise there is no concept of a "line" and my program won't work. Do you have advice on loading the entire file as bytes (either into a byte vector, or even an mmap if I need to go that way later on) and then "trying each line" to parse it as a string? TL;DR: the roadblock for me is that the definition of a "line" already assumes the data is a string, so if the data is not a valid string, how will we know where the newline character is?
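My (possibly wrong) understanding is that b'\n' is the single byte 0x0A, which never occurs inside a multi-byte UTF-8 sequence, so splitting the raw bytes on it should still find the line boundaries. Something like this is what I imagine (the printing is just illustrative):

// Split the raw bytes on the newline byte, then attempt a UTF-8
// conversion for each chunk individually.
for (i, raw_line) in bytes.split(|&b| b == b'\n').enumerate() {
    match std::str::from_utf8(raw_line) {
        // Valid UTF-8: work with the chunk as a normal &str.
        Ok(line) => println!("line {}: {}", i, line),
        // Invalid UTF-8: skip it, or use String::from_utf8_lossy()
        // to replace the bad bytes with U+FFFD instead.
        Err(_) => eprintln!("line {}: not valid UTF-8", i),
    }
}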
Thanks.