csv + serde: How to ignore a non-UTF-8 field

Currently, I'm using csv and serde to process a CSV file on Windows. The structure of that file looks like this:

|    A   |    B   |   C    | <- headers
| data_A | data_B | data_C |

I only use data from A and C, with code that looks like this:

use serde::Deserialize;

#[derive(Deserialize)]
struct Record {
    #[serde(rename="A")]
    field_a: String,

    #[serde(rename="C")]
    field_c: String,
}

But some rows of B (not the header) contain non-UTF-8 characters, and the program throws an error at runtime saying it won't accept non-UTF-8 data.

It seems strange to me because I only use A and C. Normally Deserialize should leave B untouched, right? Even deserializing the data from a ByteRecord does not fix the error.

Right now I'm using encoding_rs to decode the whole file before feeding it to csv, but I still feel this approach isn't the right one. I mean, why do I need to fix something I don't use?
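
For reference, the whole-file workaround looks roughly like this. It's only a minimal sketch, not my real program: the file name data.csv and the guess that Excel wrote the file as Windows-1252 are placeholders.

    use std::fs;

    use csv::ReaderBuilder;
    use encoding_rs::WINDOWS_1252;
    use serde::Deserialize;

    #[derive(Debug, Deserialize)]
    struct Record {
        #[serde(rename = "A")]
        field_a: String,
        #[serde(rename = "C")]
        field_c: String,
    }

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Read the raw bytes and transcode the entire file to UTF-8 up front
        // (Windows-1252 is assumed here).
        let raw = fs::read("data.csv")?;
        let (decoded, _, _) = WINDOWS_1252.decode(&raw);

        // Hand the now-valid UTF-8 text to csv.
        let mut rdr = ReaderBuilder::new()
            .has_headers(true)
            .from_reader(decoded.as_bytes());
        for result in rdr.deserialize() {
            let record: Record = result?;
            println!("{:?}", record);
        }
        Ok(())
    }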

Try explicitly setting the header during deserialization if you haven't already:

    let headers = csv::ByteRecord::from(vec!["A", "B", "C"]);
    let record: Record = byte_record.deserialize(Some(&headers)).unwrap();

The additional pipes on both sides of each row might also be an issue. Because of the leading and trailing pipes, you have two additional always-empty fields you have to take into account in the header:

    let headers = csv::ByteRecord::from(vec!["", "A", "B", "C", ""]);
    let record: Record = byte_record.deserialize(Some(&headers)).unwrap();

Please provide a complete program along with sample input that reproduces your problem. When you don't do this, it's harder for folks to help.


Since this is a private file, I can't share it. But here is a mockup file: https://gofile.io/d/imnAK8
And here is some simple mockup code:

use csv::ReaderBuilder;
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Record {
    id: String,
    bar: String,
}

fn main() {
    let mut rdr = ReaderBuilder::new()
        .has_headers(true)
        .from_path("bar.csv")
        .expect("file not found");
    for result in rdr.deserialize() {
        let record: Record = result.expect("Something wrong");
    }
}

I've tried ByteRecord with headers and all, but to no avail.
Please note that the problem only happens if you save the file with Excel. The moment you edit it and save it with, say, VS Code, the problem disappears.

The ByteRecord technique works, but only as of the 1.1.5 release that came out a few days ago. Here's a program that works with the input file you've given and the 1.1.5 release of the csv crate:

use csv::ReaderBuilder;
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Record {
    id: String,
    bar: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut rdr = ReaderBuilder::new()
        .has_headers(true)
        .from_path("bar.csv")?;
    let mut raw = csv::ByteRecord::new();
    while rdr.read_byte_record(&mut raw)? {
        let record: Record = raw.deserialize(rdr.byte_headers().ok())?;
        println!("{:?}", record);
    }
    Ok(())
}

Non-UTF-8 csv data is actually what's strange. :slight_smile: It's an uncommon (but not unheard-of) case. The csv crate prioritizes working well with UTF-8 data, but does make it possible to deal with non-UTF-8 data. As for why your code didn't work out of the box, it comes down to performance. See this PR where someone did try to make it work: https://github.com/BurntSushi/rust-csv/pull/198

If you use the encoding_rs_io crate, then it will do this for you in a streaming fashion so that you don't need to load the entire file into memory first.
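
For example, something along these lines should work with the mockup program above. This is just a sketch: the WINDOWS_1252 choice is a guess about what Excel actually wrote, so substitute whatever encoding your file really uses.

    use std::fs::File;

    use csv::ReaderBuilder;
    use encoding_rs_io::DecodeReaderBytesBuilder;
    use serde::Deserialize;

    #[derive(Debug, Deserialize)]
    struct Record {
        id: String,
        bar: String,
    }

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let file = File::open("bar.csv")?;
        // Wrap the file in a reader that transcodes to UTF-8 as it streams,
        // so the whole file never has to be decoded into memory at once.
        let transcoded = DecodeReaderBytesBuilder::new()
            .encoding(Some(encoding_rs::WINDOWS_1252))
            .build(file);
        let mut rdr = ReaderBuilder::new()
            .has_headers(true)
            .from_reader(transcoded);
        for result in rdr.deserialize() {
            let record: Record = result?;
            println!("{:?}", record);
        }
        Ok(())
    }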

