I am currently working on a data acquisition system that receives a stream of textual data samples with some internal structure, parses them, and accumulates the results in a container using a structure-of-arrays (SoA) data layout.
For those unfamiliar with this terminology, it means that instead of using the obvious data layout...
struct Sample {
    x: i32,
    y: f64,
}

type Container = Vec<Sample>;
...I use instead the following less obvious layout...
struct Container {
    x: Vec<i32>,
    y: Vec<f64>,
}
...which has many performance benefits including:
- Better cache locality when only a subset of the sampled data is later accessed
- Less memory wasted on alignment padding
- Faster vectorized data post-processing when the sampled quantities are manipulated independently
- Going from O(Nsamples) to amortized O(1) heap allocations when the number of sampled quantities is only known at initialization time (and remains constant throughout the data acquisition process)
- Going from O(Nsamples) to O(1) run-time checks on readout when the sampled data has some properties that are only known at initialization time (e.g. an optional member which, if present in the initial data sample, is guaranteed to always remain there)
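To make the allocation point concrete, here is a minimal sketch of how samples are appended in the SoA layout (the `with_capacity` pre-sizing and `push` method are illustrative additions, not part of my actual code): each append is just one amortized-O(1) `Vec` push per column.

```rust
struct Container {
    x: Vec<i32>,
    y: Vec<f64>,
}

impl Container {
    // Pre-sizing both columns turns per-sample heap allocations
    // into one up-front allocation per column.
    fn with_capacity(n: usize) -> Self {
        Container {
            x: Vec::with_capacity(n),
            y: Vec::with_capacity(n),
        }
    }

    // Appending one sample is two amortized-O(1) pushes.
    fn push(&mut self, x: i32, y: f64) {
        self.x.push(x);
        self.y.push(y);
    }
}
```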
One drawback of the SoA layout, however, is that as far as I can see it intrinsically introduces tighter coupling between the parser and the data container. In the code snippet above, you may notice that there is no longer a notion of an isolated sample. That is because manipulating isolated samples may not be practical from a performance point of view: it can, for example, inadvertently cause expensive memory transposes, or re-introduce the O(Nsamples) heap allocations that I tried to avoid in the first place.
Instead, I reached the conclusion that the parser should probably be responsible for defining the container's initial data schema, pushing data into it, and ensuring its integrity (e.g. failed pushes should be rolled back so that at any point in time, all vectors have the same length and each "row" of vector elements represents one data sample).
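The integrity invariant can be maintained with a simple rollback. A sketch, assuming a hypothetical whitespace-separated "x y" line format (the `Container` definition is repeated so the snippet is self-contained): if the second field fails to parse, the already-pushed first field is popped again, so both columns always keep equal length.

```rust
struct Container {
    x: Vec<i32>,
    y: Vec<f64>,
}

// Hypothetical line format: whitespace-separated "x y", e.g. "42 3.5".
fn parse_and_push(container: &mut Container, line: &str) -> Result<(), &'static str> {
    let mut fields = line.split_whitespace();
    let x: i32 = fields
        .next()
        .and_then(|f| f.parse().ok())
        .ok_or("bad x field")?;
    container.x.push(x);
    match fields.next().and_then(|f| f.parse::<f64>().ok()) {
        Some(y) => {
            container.y.push(y);
            Ok(())
        }
        None => {
            // Roll back the partial sample so x.len() == y.len() again
            container.x.pop();
            Err("bad y field")
        }
    }
}
```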
Which finally leads me to the architectural question: how would you express this relationship between the parser and the data container that it manages in Rust?
Would you keep the parser and the data container as two entirely separate types, which feels more satisfying from a separation-of-concerns point of view, but causes a lot of interface coupling and redundancy in practice?
// Note that this interface is missing error handling as a simplification
trait Parser {
    // A parser is associated with a certain kind of data (container)
    type Container;

    // Only the parser can tell what the container's data schema is,
    // by parsing an initial sample of data (that may not be recorded).
    // The Sized bound is needed because Self is returned inside a tuple.
    fn new(initial_data: &str) -> (Self, Self::Container)
    where
        Self: Sized;

    // The parser will push parsed data samples into the container
    fn parse_and_push(&mut self, new_data: &str, target: &mut Self::Container);

    // Note that although the parser and container are created
    // together and likely to remain associated throughout the data
    // acquisition process, the user must refer to both in the interface
}
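To show what the coupling looks like in practice, here is a sketch of an implementation of this design for a hypothetical whitespace-separated "x y" sample format (the trait and container are repeated so the snippet compiles on its own; error handling is still elided via `unwrap`):

```rust
struct Container {
    x: Vec<i32>,
    y: Vec<f64>,
}

trait Parser {
    type Container;
    fn new(initial_data: &str) -> (Self, Self::Container)
    where
        Self: Sized;
    fn parse_and_push(&mut self, new_data: &str, target: &mut Self::Container);
}

// Hypothetical parser for whitespace-separated "x y" lines
struct SampleParser;

impl Parser for SampleParser {
    type Container = Container;

    // The initial sample defines the schema; here it is also recorded
    fn new(initial_data: &str) -> (Self, Container) {
        let mut container = Container { x: Vec::new(), y: Vec::new() };
        let mut parser = SampleParser;
        parser.parse_and_push(initial_data, &mut container);
        (parser, container)
    }

    fn parse_and_push(&mut self, new_data: &str, target: &mut Container) {
        let mut fields = new_data.split_whitespace();
        target.x.push(fields.next().unwrap().parse().unwrap());
        target.y.push(fields.next().unwrap().parse().unwrap());
    }
}
```

Notice how every call site must carry both the parser and the container around, even though they were created together and are never useful apart.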
Would you bite the bullet and make the parser own the container it's writing into, only sharing it via borrows (and perhaps a destructive move)?
trait ParserAndContainer {
    // As before, we do fill a certain kind of data container
    type Container;

    // But this time, we own it, so it does not appear explicitly
    // in method signatures
    fn new(initial_data: &str) -> Self;
    fn parse_and_push(&mut self, new_data: &str);

    // We do need, however, to provide clients with a way to access it.
    // We could do this via either borrow- or move-based interfaces.
    fn borrow_data(&self) -> &Self::Container;
    fn extract_data(self) -> Self::Container;
}
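For comparison, a sketch of the owning variant for the same hypothetical "x y" line format (again with the trait and container repeated for self-containment, and `unwrap` standing in for error handling):

```rust
struct Container {
    x: Vec<i32>,
    y: Vec<f64>,
}

trait ParserAndContainer {
    type Container;
    fn new(initial_data: &str) -> Self;
    fn parse_and_push(&mut self, new_data: &str);
    fn borrow_data(&self) -> &Self::Container;
    fn extract_data(self) -> Self::Container;
}

// Hypothetical parser that owns the container it fills
struct SampleParser {
    data: Container,
}

impl ParserAndContainer for SampleParser {
    type Container = Container;

    fn new(initial_data: &str) -> Self {
        let mut parser = SampleParser {
            data: Container { x: Vec::new(), y: Vec::new() },
        };
        parser.parse_and_push(initial_data);
        parser
    }

    fn parse_and_push(&mut self, new_data: &str) {
        let mut fields = new_data.split_whitespace();
        self.data.x.push(fields.next().unwrap().parse().unwrap());
        self.data.y.push(fields.next().unwrap().parse().unwrap());
    }

    fn borrow_data(&self) -> &Container {
        &self.data
    }

    fn extract_data(self) -> Container {
        self.data
    }
}
```

Call sites now only juggle one object, at the cost of the parser type doing double duty as a container handle.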
Would you go for another design entirely?
Thanks for your thoughts!