Csv + serde vs non-utf8 (easily)


#1

I’m using the csv crate to parse a bunch of log formats into structs with serde_derive. It works great, and is super convenient. But a small number of records (which I’ve just been ignoring until now) contain non-UTF-8 data in one of the fields, and of course those fail to parse into a String.

Looking at the docs, I can easily see how to use the ByteRecord struct for lower-level handling, but there’s also this advice: “If you are using the Serde (de)serialization APIs, then you probably never need to interact with a ByteRecord or a StringRecord.” That sounds great, but I can’t figure out how to get the non-UTF-8 data into a struct field, i.e. what type to make the field.

  • I thought I’d use a Vec<u8>, but that runs afoul of the (slightly odd, but well-described) CSV field-flattening rules: https://docs.rs/csv/1.0.2/csv/struct.Reader.html#rules
  • I thought I’d use an OsString, but that wants to parse “Unix” or “Windows” enum variants, so it’s not quite as simple as the brief description of that type suggests.
  • I thought I’d use a [u8; n], but that only implements Deserialize for rather small values of n (and n would need to be large for only a very few records, wasting lots of memory for most of them).
  • I know I could use a &[u8], but that seems to mean using ByteRecord and switching to the zero-allocation borrowing pattern described near the end of the tutorial: https://docs.rs/csv/1.0.2/csv/tutorial/index.html#serde-and-zero-allocation

The latter is probably what I will end up doing (and probably would have done eventually, for optimisation), but I feel like I’m missing something simple. I “shouldn’t need” to use ByteRecord with serde, and since it’s only one field that can contain non-UTF-8 string data, I just want a field type to capture it so I can deal with fixing the record afterwards.
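For reference, my reading of the tutorial is that the borrowing version would look something like this (field names invented):

use serde_derive::Deserialize;

#[derive(Debug, Deserialize)]
struct Row<'a> {
    timestamp: &'a str,
    // the one field that can contain non-UTF-8 bytes
    message: &'a [u8],
}

fn process<R: std::io::Read>(rdr: &mut csv::Reader<R>) -> Result<(), csv::Error> {
    let headers = rdr.byte_headers()?.clone();
    let mut record = csv::ByteRecord::new();
    while rdr.read_byte_record(&mut record)? {
        // `row` borrows from `record`, so it must be used before the next read
        let row: Row = record.deserialize(Some(&headers))?;
        println!("{:?}", row);
    }
    Ok(())
}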

What is the type I want?


#2

Great question! Here’s code that should work:

use serde_derive::Deserialize;

#[derive(Debug, Deserialize)]
struct Row {
    h1: String,
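    // h2 can contain arbitrary bytes, so don't decode it into a String: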
    #[serde(with = "serde_bytes")]
    h2: Vec<u8>,
    h3: String,
}

fn main() -> Result<(), csv::Error> {
    let data = b"\
h1,h2,h3
baz,foo\xFFbar,quux
";

    let mut rdr = csv::Reader::from_reader(&data[..]);
    let mut raw_record = csv::ByteRecord::new();
    let headers = rdr.byte_headers()?.clone();

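    // The convenience deserializer iterators require valid UTF-8, so read raw
    // byte records and run Serde deserialization on each one ourselves.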
    while rdr.read_byte_record(&mut raw_record)? {
        let row: Row = raw_record.deserialize(Some(&headers))?;
        println!("{:?}", row);
    }

    Ok(())
}

where

[dependencies]
csv = "1"
serde = "1"
serde_bytes = "0.10.4"
serde_derive = "1"

Specifically, this uses serde_bytes to treat &[u8]/Vec<u8> specially: the field is deserialized as a single raw-bytes field rather than as a sequence (which is what triggers the field-flattening rules you found).

Unfortunately, due to a bug in the csv deserializer, this still confusingly produced an invalid UTF-8 error. I’ve fixed it on master: https://github.com/BurntSushi/rust-csv/commit/9e644e66db0aa0b931758de1c2b7da555fb632b7

Some notes:

  • This could use better coverage in the tutorial, or at least, a cookbook entry.
  • You can’t use the standard deserializer iterators, because they internally require UTF-8 records. That is, by far, the common case. In particular, it is faster to read into a StringRecord than it is to read into a ByteRecord and individually extract UTF-8 fields, because you can run UTF-8 validation on all of the fields of a StringRecord at once. This isn’t a fundamental limitation, but an artifact of the implementation; fixing it would require either API complications or internal contortions.
  • The docs are indeed correct that “If you are using the Serde (de)serialization APIs, then you probably never need to interact with a ByteRecord or a StringRecord”, but the weasel word “probably” is relevant in your case. Since you can’t use the deserializer iterators, you will indeed need to read into a ByteRecord and then run Serde deserialization from there. That is how the deserializer iterators work internally; in fact, they are just convenience APIs for this exact process, as the snippet below shows. I’ve updated the docs there.
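For contrast, on input that is entirely valid UTF-8, the deserializer iterator reduces the read loop above to just:

for result in rdr.deserialize() {
    let row: Row = result?;
    println!("{:?}", row);
}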

Thanks for the good question!


#3

Brilliant, thank you!

Right, that’s the “simple trick” I was missing. Aside: it sort of feels more like “stop csv treating Vec<u8> specially”, since in other formats I’d use #[serde(flatten)] to request that the remaining fields be slurped up like this.

Great, and thanks also for the other one related to comments.

That’s about what I suspected; I was just holding on to that convenience until it became necessary to deal with the few noisy records (and with performance and memory use; I’ll be adding a string interner at the same time as I start doing the clones from the borrowed input). I might still just pipe things through iconv on the way in until then 🙂


#4

Nice. Note that encoding_rs_io is designed for exactly that use case. You can have it decode to UTF-8 lossily, and it is very fast thanks to encoding_rs. (ripgrep uses it to guarantee valid UTF-8 when searching with PCRE2’s JIT with Unicode mode enabled.)
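A minimal sketch of that approach, assuming the Row shape from the example above (h2 becomes a plain String, since the decoder guarantees csv only ever sees valid UTF-8):

use encoding_rs_io::DecodeReaderBytes;
use serde_derive::Deserialize;

#[derive(Debug, Deserialize)]
struct Row {
    h1: String,
    h2: String, // invalid bytes have already been replaced with U+FFFD upstream
    h3: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // With the default configuration, the output of DecodeReaderBytes is
    // guaranteed to be valid UTF-8; undecodable bytes become U+FFFD.
    let decoded = DecodeReaderBytes::new(std::io::stdin());
    let mut rdr = csv::Reader::from_reader(decoded);
    for result in rdr.deserialize() {
        let row: Row = result?;
        println!("{:?}", row);
    }
    Ok(())
}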


#5

Coming from Perl, where layering encodings on top of file handles is the norm, this is more of the same convenience, and will do nicely until I can no longer put off the more detailed field-by-field handling. Thanks again!