What is best for sparse data: 200+ fields in struct or a HashMap?

I current have a couple of structs that are parsed from some XML data. These all have the same 200+ fields coming from the XML attributes, but mostly the majority of the fields are going to be None. I don't have an exact number, but let's assume that max 10% of the fields are going to be Some() at any time time.

Currently the the have my data in the following form:

pub struct SparseData {
    field_1: Option<i32>
    field_2: Option<f64>
    field_3: Option<String>
    field_4: Option<MyEnum>
    ...
    field_199: Option<f64>
    field_200: Option<i32>
}

My concern is that having many of these structs will take up a lot of memory and fill the cache with large amounts of None fields that are not going to be used.

I have considered having a struct with a single hashmap of enums wrapping the data:

enum ValueWrapper {
    I32<i32>,
    F64<f64>,
    String<String>,
    MyEnum<MyEnum>,
    ...
}

pub struct SparseData {
    fields_map: HashMap<String, ValueWrapper>
}

or maybe just struct with a single vec, where i will have to implement some logic preventing duplicate enum variants:

enum Wrapper {
    Field1(i32),
    Field2(f64),
    Field3(String),
    ...
    Field199(i32),
    Field200(MyEnum),
}

pub struct SparseData {
    fields_map: Vec<Wrapper>
}

What would be the most idomatic and/or performant method?

I'd go with the hash map in this case, except that I'd make an enum for the keys, I wouldn't use a string. This ensures that you can't typo the "field name", because enum variant names are also checked at compile-time.

6 Likes

Keep in mind that your enum wrapper will pad every value to the size of the largest variant. For example, String is 24 bytes on 64-bit targets, and when you add the variant tag that means it has to be at least 32 bytes. Your smaller integer variants will also be padded to match.

If your data is sufficiently sparse, that's probably still okay, but you should be aware of that overhead.

1 Like

The HashMap version is redundant, because you have to check whether the value's variant is consistent with the key, and you store the variant id in the value, while it can be deduced from the key. It is also possible to insert the wrong variant to a key that expects another variant.

Is there a way to avoid that? A bit like using the key as a variant id for the value. But I don't know how the interface would be, probably a lot of macros...

1 Like

If you really want to do that, you could write Eq and Hash for the enum to only consider the determinant, and then store them in a HashSet. I don’t recommend it, though, as the confusion it’s likely to cause is probably not worth the storage savings.

Edit: Apparently, this does satisfy all the requirements for Eq

I'd suggest rather that you box any field that's likely to be missing (all fields?), and is at least 8 bytes, so that the Option only takes 8 bytes of space in the case where it's missing. If you're likely to see lots of duplicated values, I'd also consider interning your data, especially strings, but anything that might be large. Either way, box or intern, a None field will take only 8 bytes (assuming a 64 bit system), which is going to be hard to beat if you've got anything close to 10% of the girls occupied.

If you went with your HashMap you'd want to do the same with your ValueWrapper (boxing larger types), to avoid the enum taking more than 16 bytes.

There always is a way with some fancy macro.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=cc28ec31626c83d5572a7ee97c79710e

sparse! {
    pub struct SparseData {
        field1: i32,
        field2: String,
        field3: std::net::IpAddr,
    }
}

fn main() {
    let mut data = SparseData::new();
    
    dbg!(data.replace_field1(Some(42))); // returns Option<i32>
    dbg!(data.field1());                 // returns Option<&i32>
    dbg!(data.field2_mut());             // returns Option<&mut String>
}
2 Likes