How to read an ANSI or UTF-8 text file?

Hi,

I have written a function that reads a file and converts its lines to Vec<String>. It works fine for UTF-8 but not for ANSI. My text files are user-generated, so I can't guarantee the encoding. From a quick search, I understand that String is always UTF-8. So, is it possible to convert this function to support both ANSI and UTF-8 in an efficient way?

use std::fs;
use std::io::{BufRead, BufReader};

pub fn read_file(path: &String) -> Vec<String> {
    let f = fs::File::open(path).expect("no file found");
    let br = BufReader::new(f);
    let lines: Vec<String> = br
        .lines()
        .collect::<Result<_, _>>()
        .unwrap_or_else(|_| panic!("Failed converting file into lines. Path: {}", path));
    return lines;
}

You can't use String or String-based methods then; you'll have to work with Vec<u8>, like so:

use std::fs;
use std::io::{BufRead, BufReader};

pub fn read_file(path: &String) -> Vec<Vec<u8>> {
    let f = fs::File::open(path).expect("no file found");
    let br = BufReader::new(f);
    let lines: Vec<Vec<u8>> = br
        .split(b'\n')
        .map(|line| {
            // Each item is an io::Result<Vec<u8>>, so edit the bytes inside it
            line.map(|mut bytes| {
                // Remove the CR from CRLF Windows-style line breaks
                if bytes.ends_with(b"\r") {
                    bytes.pop();
                }
                bytes
            })
        })
        .collect::<Result<_, _>>()
        .unwrap_or_else(|_| panic!("Failed converting file into lines. Path: {}", path));
    return lines;
}

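Small nit: You should never take &String as a parameter; take &str instead. &String is far more restrictive (the caller must have an owned String, whereas &str also accepts string literals and slices) and incurs a double indirection. Also, since it's the last statement, you can replace return lines; with just lines.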
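
To make the nit concrete, here is a minimal sketch. The function `describe` is a hypothetical helper invented for this illustration; it is not part of the code above.

```rust
// Hypothetical helper, used only to illustrate the signature change.
// Taking `&str` lets callers pass string literals, slices, or `&String`
// (the latter works through deref coercion).
pub fn describe(path: &str) -> String {
    // No `return` keyword: the final expression is the function's value.
    format!("reading {}", path)
}
```

Both `describe("notes.txt")` and `describe(&owned)` (with `owned: String`) compile against this signature, whereas a `&String` parameter would reject the literal.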

Expanded nit: Paths don't have to be UTF-8 either. I suggest:

pub fn read_file<P: AsRef<Path>>(path: P) -> Vec<Vec<u8>> {
  ...
}
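
For what it's worth, one way the elided body could be filled in is sketched below. This is only an illustration under my own assumptions: returning io::Result instead of panicking is a choice made for this example, not something the signature above requires.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

// Sketch only. `P: AsRef<Path>` accepts &str, String, &Path, PathBuf, etc.
// ASSUMPTION: errors are returned to the caller rather than causing a panic.
pub fn read_file<P: AsRef<Path>>(path: P) -> io::Result<Vec<Vec<u8>>> {
    let reader = BufReader::new(File::open(path)?);
    reader
        .split(b'\n')
        .map(|line| {
            // Each item is an io::Result<Vec<u8>>; trim a trailing CR if present.
            line.map(|mut bytes| {
                if bytes.ends_with(b"\r") {
                    bytes.pop();
                }
                bytes
            })
        })
        .collect()
}
```

Callers can then pass a `PathBuf` obtained from the OS directly, without first converting it to a string.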

Thanks. I am using a lot of String-based operations on this file's contents and would like to have the lines as String (or &str). Can we convert the encoding?

encoding_rs can convert to String from Windows code pages (I'm guessing that's what you mean by ANSI here).
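
As a rough illustration of the "try UTF-8, then fall back" idea, here is a std-only sketch. It does not use encoding_rs. Note the loud assumption: the fallback treats the bytes as ISO-8859-1 (Latin-1), which is not the same as Windows-1252 (they differ in the 0x80 to 0x9F range), so for real Windows code pages a proper decoder such as encoding_rs would be the better choice. The function name is made up for this example.

```rust
/// Decode `bytes` as UTF-8 when valid; otherwise fall back to a
/// byte-per-character decode.
/// ASSUMPTION: the fallback is ISO-8859-1 (Latin-1), where every byte maps
/// to the Unicode code point with the same number. This is NOT Windows-1252.
pub fn decode_utf8_or_latin1(bytes: Vec<u8>) -> String {
    match String::from_utf8(bytes) {
        Ok(text) => text,
        // `into_bytes` hands back the original buffer without copying.
        Err(err) => err.into_bytes().into_iter().map(char::from).collect(),
    }
}
```

Applied to each Vec<u8> line from the byte-based read_file, this yields a Vec<String> that the existing String-based code can keep using.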

There's a set of string functions that ignores Unicode requirements, so it will work for ANSI:

BTW: you can use std::fs::read(path) to read the whole file into a Vec<u8> without splitting and joining lines needlessly.
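
A small sketch of that approach, in case it helps: read everything once, then borrow line slices out of the buffer only if you actually need lines. The helper names `read_whole_file` and `split_lines` are made up for this example.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Read the entire file into memory as raw bytes.
pub fn read_whole_file<P: AsRef<Path>>(path: P) -> io::Result<Vec<u8>> {
    fs::read(path)
}

/// Borrow each line from the buffer without copying.
/// A single trailing newline does not produce an extra empty line,
/// and a trailing CR (from CRLF line endings) is trimmed.
pub fn split_lines(bytes: &[u8]) -> Vec<&[u8]> {
    if bytes.is_empty() {
        return Vec::new();
    }
    let body = if bytes.ends_with(b"\n") {
        &bytes[..bytes.len() - 1]
    } else {
        bytes
    };
    body.split(|&b| b == b'\n')
        .map(|line| match line.split_last() {
            Some((&b'\r', rest)) => rest,
            _ => line,
        })
        .collect()
}
```

Typical use would be `let data = read_whole_file(path)?;` followed by `split_lines(&data)`, so the line slices borrow from `data` instead of allocating one Vec per line.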

With the caveat that you can't assume anything about what the individual bytes represent if the file uses an arbitrary Windows code page. You can't even assume it is an ASCII superset or that it is a byte-based encoding.

Decoding to Unicode, if possible, makes it easier to handle these strings consistently, unless what you're doing really is encoding-agnostic.
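
To illustrate why byte-level assumptions can break, here is a small sketch using UTF-16LE. That encoding is my own choice of example for a non-byte-based encoding; the helper names are hypothetical. Splitting UTF-16LE data on the byte b'\n' cuts a two-byte code unit in half, while decoding first keeps the text intact.

```rust
/// Encode text as UTF-16LE bytes (hypothetical helper for this demo).
pub fn utf16le_bytes(text: &str) -> Vec<u8> {
    text.encode_utf16()
        .flat_map(|unit| unit.to_le_bytes())
        .collect()
}

/// Decode UTF-16LE bytes; returns None for odd lengths or invalid surrogates.
pub fn decode_utf16le(bytes: &[u8]) -> Option<String> {
    if bytes.len() % 2 != 0 {
        return None;
    }
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]))
        .collect();
    String::from_utf16(&units).ok()
}
```

For "a\nb" the encoded bytes are 61 00 0A 00 62 00, so a naive split on 0x0A leaves a stray 0x00 at the start of the second "line".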
