Standard approach for handling data volumes in Rust


#1

When you need to manipulate volumes of data (bytes, kB, KiB, MB, etc…), is there a standard abstraction or crate which you usually go for? Or does everyone roll their own struct/enum?


#2

I guess the more general question is: are there recommended units of measure crates?


#3

I found this for formatting: number_prefix

Searching Cargo for KB and KiB yields a few more: unbytify, bytesize, pretty-bytes, and human-size.


#4

I guess the validity of this generalization depends on the extent to which one can build a good general-purpose library for all kinds of units of measurement.

I’m torn on this. On one hand, when it comes to IO, I would expect the logic for parsing or formatting dates (chrono) or durations (humantime) aimed at human consumption to be rather different from that used for parsing/formatting data volumes. On the other hand, when it comes to actually building incompatible measurement types to avoid accidentally mixing these up in code, I would expect generic libraries like uom to do a good job at that…
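
For reference, this is the kind of compile-time unit safety I have in mind, using uom’s documented length/time example rather than data volumes (the mechanism would be the same):

```rust
// uom assigns each quantity its own type, so mixing incompatible
// dimensions is rejected by the compiler rather than at runtime.
use uom::si::f32::*;
use uom::si::length::kilometer;
use uom::si::time::second;

fn main() {
    let length = Length::new::<kilometer>(5.0);
    let time = Time::new::<second>(15.0);
    let _velocity = length / time; // fine: dividing Length by Time yields a Velocity
    // let nonsense = length + time; // does not compile: incompatible dimensions
}
```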


#5

Right - that’s a good point: do you want something that can just format a byte size into a more human-friendly output, or do you actually want something that enforces type safety?


#6

The ideal crate would let me…

  • Parse data volumes initially intended for human consumption (with possibly misused or ambiguous kB/KB/KiB-like units)
  • Keep these data volumes around in code (including in containers and such)
  • Do basic arithmetic on them (add/sub as a minimum; multiplication and division by integers, and division of two volumes down to an integer, would also be nice)
  • Reason about them in a generic way (i.e. without writing code in terms of one specific unit, which would need very careful review if that unit ever changed)
  • And at the end pretty-print the processed data volumes again for user consumption

Of course, this “ideal crate” could also be multiple crates which interoperate nicely with one another :slight_smile:
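
For concreteness, here is a minimal sketch of the kind of API I have in mind. Everything below (the DataVolume type, its methods, the parsing policy) is hypothetical rather than taken from any existing crate:

```rust
use std::ops::{Add, Sub};

/// Hypothetical data-volume type storing an exact byte count.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct DataVolume {
    bytes: u64,
}

impl DataVolume {
    /// Parse human-oriented input such as "1.5 MiB" or "512 kB".
    /// A real implementation would also need a policy for ambiguous
    /// units ("KB", bare "K"...), which is the hard part.
    fn parse(s: &str) -> Option<DataVolume> {
        let s = s.trim();
        let split = s
            .find(|c: char| !c.is_ascii_digit() && c != '.')
            .unwrap_or(s.len());
        let (number, unit) = s.split_at(split);
        let value: f64 = number.parse().ok()?;
        let multiplier = match unit.trim() {
            "" | "B" => 1.0,
            "kB" | "KB" => 1e3,
            "MB" => 1e6,
            "GB" => 1e9,
            "KiB" => (1u64 << 10) as f64,
            "MiB" => (1u64 << 20) as f64,
            "GiB" => (1u64 << 30) as f64,
            _ => return None,
        };
        Some(DataVolume { bytes: (value * multiplier) as u64 })
    }

    /// Pretty-print with binary prefixes for user consumption.
    fn display(self) -> String {
        const UNITS: [&str; 5] = ["B", "KiB", "MiB", "GiB", "TiB"];
        let mut value = self.bytes as f64;
        let mut unit = 0;
        while value >= 1024.0 && unit < UNITS.len() - 1 {
            value /= 1024.0;
            unit += 1;
        }
        format!("{:.1} {}", value, UNITS[unit])
    }
}

// Basic arithmetic: volumes can be added and subtracted directly.
impl Add for DataVolume {
    type Output = DataVolume;
    fn add(self, rhs: DataVolume) -> DataVolume {
        DataVolume { bytes: self.bytes + rhs.bytes }
    }
}

impl Sub for DataVolume {
    type Output = DataVolume;
    fn sub(self, rhs: DataVolume) -> DataVolume {
        DataVolume { bytes: self.bytes - rhs.bytes }
    }
}

fn main() {
    let a = DataVolume::parse("1.5 MiB").unwrap();
    let b = DataVolume::parse("512 KiB").unwrap();
    assert_eq!((a + b).display(), "2.0 MiB");
}
```

As the comments hint, the tricky part is not the arithmetic but deciding how to interpret ambiguous units at parse time.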


#7

I feel like @paholg might have some insightful thoughts on this matter :slight_smile:.


#8

So, it’s been a couple of days without a reply, and hence I think it is time to review what has been said so far.

@G2P has helpfully proposed a set of crates (thanks!), so let’s take a tour of them:

  • number_prefix is a formatter for floating-point data volumes expressed in bytes (a usage sketch follows this list).
  • unbytify provides both a parser and a formatter for integer data volumes in bytes.
  • bytesize provides dedicated types and constructors for data volumes, and formatting facilities.
  • pretty-bytes seems to do something similar to number_prefix, but is much less well documented. It also provides a CLI utility, which I personally do not need but others may find useful.
  • human-size provides a dedicated type and constructors for data volumes, and parsing and formatting facilities. Interestingly enough, its internal representation uses a u32, which means that it cannot represent sizes larger than 4 GiB with byte-level accuracy.
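
Regarding number_prefix specifically, its usage looks roughly like this. A caveat: I am writing this from memory of its README, so the exact names (binary_prefix, Standalone, Prefixed) should be treated as assumptions that may differ between versions:

```rust
// Assumed API, quoted from memory of number_prefix's README; not verified.
use number_prefix::{binary_prefix, Prefixed, Standalone};

fn main() {
    match binary_prefix(8542.0_f32) {
        Standalone(bytes) => println!("The file is {} bytes in size", bytes),
        Prefixed(prefix, n) => println!("The file is {:.1} {}B in size", n, prefix),
    }
}
```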

For his part, @vitalyd mentioned that @paholg (author of the dimensioned dimensional analysis library) might have something to say on this matter, but so far we have not heard from him. So at this point, I’ll review the previous crate suggestions in a bit more depth.

I dislike the idea of using floats to represent quantities which are known exactly and representable using a 64-bit integer (which most machines support natively, and which the rest can emulate quite efficiently). For the same reason, I’m not a big fan of human-size’s decision to use a u32-and-multiple fixed-point representation, especially as that also makes size arithmetic (which human-size does not currently implement) quite a bit more complex.

In contrast, both unbytify and bytesize represent sizes as a 64-bit counter of bytes [1], which is perfectly precise, easy to process on current hardware, and good enough for any storage size available today (although the ZFS authors would probably contend that it might not be enough for the “big data” of tomorrow; I don’t have those use cases in mind). To me, that’s the right starting point.
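
To make this concrete, here is a quick self-contained check of the three representations discussed above:

```rust
fn main() {
    // f64 has a 53-bit mantissa: byte counts above 2^53 (8 PiB) lose exactness.
    let exact: u64 = (1 << 53) + 1;
    assert_ne!(exact, exact as f64 as u64); // the odd byte is silently rounded away

    // u32 tops out just below 4 GiB, hence human-size's accuracy limit.
    assert_eq!(u32::MAX as u64 + 1, 1u64 << 32); // 2^32 bytes = 4 GiB

    // u64 counts bytes exactly up to just under 16 EiB, ample for today's storage.
    println!("u64 covers up to ~{:.0} EiB", u64::MAX as f64 / (1u64 << 60) as f64);
}
```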

On top of this, unbytify provides a parser for human-readable sizes, and bytesize provides a dedicated ByteSize type which can be used to write more self-documenting code. Since both libraries use the same internal representation, they should be able to interoperate efficiently, in the sense that I could use unbytify’s parser alongside bytesize’s data type, or even propose the former for integration into the latter.
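
The interop I have in mind would look roughly like the following. I have not double-checked either crate’s exact signatures, so treat the names below (unbytify::unbytify, ByteSize::b) as assumptions based on their READMEs:

```rust
// Assumed APIs, not verified against the crates' documentation.
use bytesize::ByteSize;

fn main() {
    // Parse a human-readable size into a raw byte count with unbytify...
    let bytes: u64 = unbytify::unbytify("1.5 MiB").expect("parse error");
    // ...then wrap it into bytesize's self-documenting type for further use.
    let size = ByteSize::b(bytes);
    println!("{}", size); // reuses bytesize's human-friendly formatting
}
```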

So at this point, without further information, I would probably go for bytesize with either unbytify’s parser or a parser of my own. I’ll probably need the latter anyhow, as the source which I want to parse (/proc/meminfo) abuses SI units by using “kB” (= 1000 bytes) when it means “KiB” (= 1024 bytes), which is not something I can reasonably expect a general-purpose library to account for.
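
For illustration, such a meminfo-specific parser could be as small as this sketch (hypothetical code, hard-wiring the knowledge that /proc/meminfo’s “kB” actually means KiB):

```rust
/// Parse one /proc/meminfo line, e.g. "MemTotal:       16384256 kB",
/// into a (field name, byte count) pair. Hypothetical sketch: the "kB"
/// suffix is deliberately interpreted as KiB, as /proc/meminfo intends.
fn parse_meminfo_line(line: &str) -> Option<(&str, u64)> {
    let mut parts = line.split_whitespace();
    let key = parts.next()?.trim_end_matches(':');
    let value: u64 = parts.next()?.parse().ok()?;
    let bytes = match parts.next() {
        Some("kB") => value * 1024, // "kB" here really means KiB
        None => value,              // some fields carry no unit at all
        Some(_) => return None,     // anything else is unexpected
    };
    Some((key, bytes))
}

fn main() {
    let (key, bytes) = parse_meminfo_line("MemTotal:       16384256 kB").unwrap();
    assert_eq!((key, bytes), ("MemTotal", 16384256 * 1024));
    println!("{} = {} bytes", key, bytes);
}
```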


[1] Technically, ByteSize uses a usize, not a u64, so on 32-bit platforms or under the Linux x32 ABI it will use 32-bit integers. I think this is a mistake, as the set of representable sizes should not vary with the host architecture, especially since 32 bits are too few to represent modern storage sizes. I will thus propose that this be changed.