How to represent missing dataset value in Rust?

I would like to create a DataFrame library in Rust, and dataframes, as we know, can contain missing values. What is the best way of representing the missing values in Rust? I've been thinking about Option, but Option consumes more memory than the usual T and this can be a problem if the dataset is large. Is there a less memory-consuming way of representing ,issing values?

Well, "value is missing" by necessity consumes at least one additional bit of information, there's no way around it. Note that in some cases Option can actually exploit so-called niche values in its contents: bit patterns that represent no valid value of the wrapped type. Thus, for instance, sizeof::<Option<&T>>() == sizeof::<&T>() because the zero value can be used to denote None.

One space optimization that might be useful for records of several optional values would be bit-packing the discriminant values into a common header. At least currently rustc cannot make such an optimization for std::Options but there might well be a crate available that offers a "multi-option" type.

2 Likes

The problem is that any kind of rust arrays should contain a single datatype and if I have missing value then all the non-missing values will have to be Option instead of just T. If every single value in column is wrapped in Option, the column will take way more memory

Yeah. In that case it might be useful to bitpack the "missing" info in a separate array, even a sparse array if it's known that missing values are rare. Another way would be to simply decide that all your column types must have a value that represents "missing", restricting the range of valid values. For example, it might be a reasonable decision that 0xFFFF_FFFF means "missing" for u32, as long as you're careful to validate your data before insertion. (Newtypes are your friend here. The std's NonZero* wrapper types are a good example, and implement the "niche optimization" I mentioned, but obviously zero is not the most convenient integer value to represent "missing".)

1 Like

The optional crate is an old crate made mostly irrelevant by new niche optimizations, but illustrates the approach well. Optional<T> is like Option<T>, but instead of an "external null", it uses a reserved value for an "internal null."

But what you actually want is probably something more like

struct Frame {
    data: Vec<MaybeUninit<Data>>,
    init: BitVec,
}

where you have a vector with holes and a packed bit vector of external initialization flags.

4 Likes

But you have to represent missingness somehow. And since the values can change dynamically, there's no possibility to represent the missingness statically. I'm not sure what you are expecting, but it seems to me you are confusing values with types.

Anyway, I wouldn't worry about memory consumption of Option unless it is proved by a benchmark or by a real use case that it is an actual problem. There are way more important optimizations that you could perform instead.

1 Like

Sorry for the late reply, but I would like to ask you why in the second example you used MaybeUninit if we still have to use vector where it is said wether this element is init. Wouldn't it be easier to do the same thing, but with the usual Data? It looks very overcomplicated.
And where can I read about how the optional crate is implemented (or something like this) and general idea?

If a value is uninitialized, its fields may contain invalid values (e.g. a bool set to value 17), and that would be undefined behavior. Moreover, if the memory is truly uninitialized (hasn't been written to), reading the memory multiple times may even give back varying values. Read more here.

If every missing value is initialized, including the missing ones [1], then you wouldn't need MaybeUninit. But you would have to do that initialization for the missing values.


  1. e.g. they're initialized to some default set of values ↩︎

2 Likes

You might find it helpful to separate "how will my library express absent values in its interfaces," and "how will my library represent absent values in its data structures."

The natural choice, I think, for the interface is to use Option<T>. There's a large body of existing Rust practice around these, and you will surprise approximately nobody by doing so.

For storage, you have three major options:

  • Use the interface representation (Option<T>, in this case) directly. For values with no usable niche, this will add between a byte and a whole qword to the size of each value, in line, to store the state of the Option. For values with a niche (pointers, for example, where 0 is never a valid value), Option will optimize out to using the niche value to represent None, adding no overhead. Using the interface type for the implementation can simplify the implementation.
  • Use some property of the dataset to find your own niche value to use to signal "no value for this point," and store them in line. This may be necessary if Rust isn't able to determine a niche value - f64, for example, has no niches, but if you know that your specific values will never contain NaNs, you can use NaNs as a makeshift niche value and convert to Option at the interface of your library. This complicates the implementation with conversion logic, in order to conserve memory.
  • Use a separate occupancy bitmap to track which cells in your dataframe are occupied and which aren't. This adds storage overhead, but stores it separately, which may be more cache-friendly than using Option in place. Again, you'll spend complexity on converting between the occupancy bitmap and the interface type, and in managing that occupancy bitmap over the life of a dataframe.

There isn't a single right answer. Of the three, I'd probably start with the one that saves me (the implementor) time, and use Option<T> initially, but I'd profile the use cases I care about to determine if it's worth investigating the other approaches or not.

6 Likes

It also has the possibility of being more space efficient if the stored values have no niche: In this case, Option will waste 7-31 bits of per value as padding, to maintain the value’s memory alignment. Storing the flag separately lets you avoid this overhead.

2 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.