Looking for some architectural advice

HadrienG · August 17, 2017, 8:52am

I am currently working a data acquisition system that receives a stream of (textual) data samples with some internal structure, parses them, and accumulates the results in a container using a structure-of-array data layout.

For those unfamiliar with this terminology, it means that instead of using the obvious data layout...

struct Sample {
    x: i32,
    y: f64,
}

type Container = Vec<Sample>;

...I use instead the following less obvious layout...

struct Container {
    x: Vec<i32>,
    y: Vec<f64>,
}

...which has many performance benefits including:

Better cache locality when only a subset of the sampled data is later accessed
Less memory wasted on data alignment issues
Faster vectorized data post-processing when the sampled quantities are manipulated independently
Going from O(Nsamples) to amortized O(1) heap allocations when the amount of sampled quantities is only known at initialization time (and remains constant throughout the data acquisition process).
Going from O(Nsamples) to O(1) run-time checks on readout when the sampled data has some properties that are only known at initialization time (e.g. an optional member, which if present in the initial data sample is guarantee to always remain there).

One drawback of SoA data layout, however, is that as far as I can see it intrinsically introduces tighter coupling between the parser and the data container. In the code snippet above, you may notice that I do not have a notion of isolated sample anymore. That is because manipulating them may not be practical from a performance point of view. It can, for example, inadvertently cause expensive memory transposes, or re-introduce the O(N) heap allocations that I tried to avoid to begin with.

Instead, I reached the conclusion that the parser should probably be responsible for defining the container's initial data schema, pushing data into it, and ensuring its integrity (e.g. failed pushes should be rolled back so that at any point in times, all vectors have the same length and each "line" of vector elements represents one data sample).

Which finally leads me to the architectural question: how would you express this relationship between the parser and the data container that it manages in Rust?

Would you keep the parser and the data container as two entirely separate types, which feels more satisfying from the point of view of separation and concerns, but causes a lot of interface coupling and redundancy in practice?

// Note that this interface is missing error handling as a simplification
trait Parser {
    // A parser is associated with a certain kind of data (container)
    type Container;

    // Only the parser can tell what the container's data schema is,
    // by parsing an initial sample of data (that may not be recorded)
    fn new(initial_data: &str) -> (Self, Self::Container);

    // The parser will push parsed data samples into the container
    fn parse_and_push(&mut self, new_data: &str, target: &mut Self::Container);

    // Note that although the parser and container are created
    // together and likely to remain associated throughout the data
    // acquisition process, the user must refer to both in the interface
}

Would you bite the bullet and make the parser own the container it's writing into, only sharing it via borrows (and perhaps a destructive move)?

trait ParserAndContainer {
    // As before, we do fill a certain kind of data container
    type Container;

    // But this time, we own it, so it does not appear explicitly
    // in method signatures
    fn new(initial_data: &str) -> Self;
    fn parse_and_push(&mut self, new_data: &str);

    // We do need, however, to provide clients with a way to access it.
    // We could do this via either borrow- or move-based interfaces.
    fn borrow_data(&self) -> &Self::Container;
    fn extract_data(self) -> Self::Container;
}

Would you go for another design entirely?

Thanks for your thoughts!

matklad · August 17, 2017, 10:09am

I think it is possible (not necessary necessary) to keep parser and container absolutely orthogonal, if the parser emits fine-grained events. One approach is to treat the parser like Iterator<Item=Event> and construct a SOA (or AOS) by pulling from a parser:

enum Event {
    Config { dim: usize },
    StartSample,
    Value(f32),
    EndSample
}


struct Samples {
    dim: usize,
    n_samples: usize,
    /// `dim` vectors of `n_samples` length
    data: Vec<Vec<f32>>
}

impl Samples {
    fn read<P>(mut parser: P) -> Option<Samples>
        where P: Iterator<Item=Event>
    {
        let dim = match parser.next() {
            Some(Event::Config { dim }) => dim,
            _ => return None,
        };

        let mut result = Samples {
            dim: dim,
            n_samples: 0,
            data: vec![Vec::new(); dim]
        };

        while let Some(event) = parser.next() {
            result.n_samples += 1;
            match event {
                Event::StartSample => (),
                _ => return None
            }

            for i in 0..dim {
                match parser.next() {
                    Some(Event::Value(v)) => { result.data[i].push(v) }
                    _ => return None
                }
            }

            match parser.next() {
                Some(Event::EndSample) => (),
                _ => return None
            }
        }
        Some(result)
    }
}

Another approach is to push events from a parser into FnMut(Event).

vitalyd · August 17, 2017, 10:12am

I'd keep the parser and the container as distinct types. The container should hide its internal representation details - whether it's SoA or AoS or something else should ideally not leak out.

The parser can have a "schema" which can be represented as an associated type. The container trait should be able to accept (add) such a type. The parser can provide a size hint to the container to indicate how many elements will be produced. When the container receives the type, it can destructure it into scalar values and populate the individual arrays/vecs. Or another container impl, using AoS layout, can add the aggregate object to an array/vec.

If you squint, the parser now starts looking like an Iterator and the container like a FromIterator impl (so populating the container is essentially calling collect on your parser).

Once parsing is done, the parser is consumed and you get back the collection - I'm assuming parsing is a one shot deal and there's no good reason to keep the parser around anymore?

matklad · August 17, 2017, 10:15am

Hm, I think for parsing you can have a notion of an isolated sample and avoid O(N) allocations, if you reuse allocation, like this: fn read_next_sample(parser: &mut Parser, buff: &mut Sample).

HadrienG · August 17, 2017, 9:02pm

Many thanks for your fresh look and ideas to decouple the parser and container more. It's a track which I had more or less given up on, for reasons partially outlined above, but if there is a way to get back on it, it would be an interesting development!

@matklad: This approach based on fine-grained events looks very nice. As presented, it does look like a good fit for N-dimensional homogeneous datasets, where each value has the same type (f32 in your case).

On the other hand, I am less convinced that such a parser-container communication protocol could scale very well to samples that have an arbitrarily complex heterogeneous internal structure, such as this:

use std::time::Duration;

struct Sample {
    x: bool,
    y: Duration,
    z: usize,
}

It seems to me that for each parser/container communication scenario, you would then need to either come with an arbitrarily complex and application-specific type erasure layer...

enum ErasedValue {
    EBool(bool),
    EDuration(Duration),
    EUSize(usize),
}

...or just give up and go for the Box<Any> sledgehammer, which is generic enough for every parsing use case, but comes with a hefty price (dynamic allocation, memory overhead and receiver-side checks from RTTI...) that would then have to be paid for every single fine-grained value...

Am I missing a clever way to take a third option?

@vitalyd: I am not sure if I fully understood your proposal, so let me try to describe it in my own words.

If I understood correctly, what you would be proposing here is to keep the idea of an isolated sample type, such as this one...

struct NontrivialSample {
    boot_timestamp_ms: u64,
    counters: HashMap<String, u64>,
    extension: Option<[f64; 2]>,
}

...and put that at the heart of the communication protocol between the parser and the container, for example by making the parser an iterator of Sample. The sample type would then be the data schema which a compatible container must accept as input, and it would be the only thing which a parser and a container must agree on in order to be compatible with one another.

The goal of this design would be to have a parser that can feed any target data container. It would not be to also have downstream clients of the data be also oblivious to the container layout, which is generally a bad idea as far as data analysis performance is concerned.

Am I correct so far?

If so, I think there is some fine print in there that would warrant further discussion.

First of all, a type that can carry an individual sample is not a complete characterization of the data stream. For example, if you are only allowed to take into consideration the NontrivialSample type above, there is no way for you to know about the following performance-critical metadata of the underlying data stream:

Can I assume that NontrivialSample::boot_timestamp_ms will have the same value whenever I parse the data, and thus only parse this field once and skip it afterwards?
Can I assume that the NontrivialSample::counters hashmap will feature the same keys for every sample?
Can I assume that if the optional NontrivialSample::extension field is present/absent for one sample, it will be present/absent for every other sample?

There is a possible container data layout optimization hiding behind each of these questions. If the answer to all of these questions was a resounding "no", it means that the best SoA container layout that I can come up is likely to look like this:

struct MinimalSoA {
    boot_timestamp_ms: Vec<Duration>,  // One timestamp per sample
    counters:  Vec<HashMap<String, u64>>,  // Set of counters for each sample
    extension: Vec<Option<[f64; 2]>>,  // Vector of optional data
}

Whereas if the answer to all of these questions is "yes" for the duration of the sampling period, I could greatly benefit from using the following container layout instead:

struct FullSoA {
    boot_timestamp_ms: Duration,  // Record the boot timestamp once and keep it forever
    counters:  HashMap<String, Vec<u64>>,  // Keep a vector of samples for each counter
    extension: Option<Vec<[f64; 2]>>,  // Optional vector of data
}

I think we will easily agree that when applicable, the second container layout is be likely to much more efficient from the point of view of memory usage, cache locality, and CPU efficiency of both sampling and analysis, showing that storage layout optimization based on the characteristics of the whole data stream is worthwhile. But how could I exchange this kind of metadata about the data stream between the parser and the container without building something which is effectively more complex than a container-specific parser?

A related concern that I have is that by striving to build self-contained data samples, one can easily make the communication between the parser and the container unnecessarily redundant. For example, if samples were treated as values that are moved from the parser to the container, a parser would need to repeat, for every NontrivialSample...

The (constant) boot timestamp
The full list of counters, with the associated HashMap keys (requiring plenty of memory allocation and copies)
The boolean tag of the optional "extension" field

As @matklad correctly pointed out, most of this overhead can be avoided by reusing samples on the parser side, and only letting the container borrow the sample. However, this would preclude any use of a value-oriented design, including the Iterator-based ones that you had in mind.

And this is where I get stuck, and would gladly welcome further ideas!

vitalyd · August 17, 2017, 10:16pm

It was this latter style that I had in mind for the SoA layout. But, and I'm probably missing something, I thought it makes sense to decouple the parser and the container, certainly in terms of storage impl details. In effect, the parser and the container just need to agree on what the parsed data records look like - the container receives these records either via values or in a streaming manner, reusing a mutable "record buffer" as @matklad had. The record the container receives is merely a typed representation (ie schema) of what the parser is pulling.

When the container is given a record to "accept", it can destructure it into scalar values and place them into internal vecs (or whatever "vector" representation). It can also use an AoS internally, that's its choice. Since the container now has vector storage, it can conceivably do nice linear walks over some of the fields. In database terms, it's like a column-oriented database table - each vec stores a particular column of a record (ie a field of a type). It can also present a "flyweight" view of an entire record by returning a struct that holds an index of the record in the vecs, and "assemble" a view of the record that way, if that's needed.

So, that's sort of what I had in mind. It doesn't look like it'll support your case of "single boot timestamp across all records" cleanly because your container obviously can't know that on its own. I suppose you can make the parser and container extend their "protocol" a bit. For example, the parser can be allowed to provide a "header" record or something that would indicate data shared across all records. The container could then keep those values out of the vecs since they just repeat. Then the real population starts where the parser feeds unique records.

Let me stop here alas I'm going way into the weeds due to misunderstanding your requirements/question.

HadrienG · August 18, 2017, 9:24am

@vitalyd: Thanks for these clarifications, and sorry for expressing myself in an unclear way. Let me try to clarify the obscure parts.

As my use case seems a bit obscure, I'll start by expanding on it a bit. As both a Rust learning exercise and a preparation towards a more ambitious longer-term project (namely building a pleasant GUI for Linux perf), I'm trying to build a Linux system monitor on steroids that can sample system activity from procfs files with very low overhead. It should ideally be fully bottlenecked by the intrinsic inefficiency of procfs as a (questionable) operating system API.

Surprisingly enough, this is not the case of some system monitors that I know of. For example, gnome-system-monitor will merrily hit ~20% CPU load on three CPU cores when taking system activity samples at 4 Hz, which is quite ridiculous when you consider that it takes only tens of microseconds to load data from each of its information sources (namely /proc/stat, /proc/meminfo, and /proc/net/dev), and that drawing three simple graphs cannot possibly consume that much CPU cycles (especially when a fraction of the drawing work is GPU-accelerated).

Hints of what's going and how to improve upon this situation on can be found when looking at what the procfs API is actually like. As it turns out, that quirky OS interface is actually full of performance pitfalls for the uninitiated. Here are some examples, that map quite directly to my NontrivialSample example above (collapsed text, click to expand):

/proc/stat

/proc/stat is the main source of data on system-wide CPU activity provided by the public Linux kernel API. For legacy reasons, it also provides other useful data that is more rarely exposed by system monitors, such as the activity of various hardware IRQ sources or some statistics on system-wide process activity (number of forks since system boot, number of running/blocked processes...). It is, at this point, unclear what the future of this data is: will it be moved elsewhere in a future kernel version, like block device statistics were moved to /proc/diskstats from Linux 2.4 to Linux 2.6? No one can tell. At the moment, this is AFAIK the only place where these statistics can be found.

In what is likely an attempt to prepare clients for such future schema evolutions, the procfs documentation provides exactly zero guarantees about which fields of /proc/stat will be present. To quote man 5 proc, the content of this pseudo-file "Varies with architecture. Common entries include: [...]". This basically means that a cautious interface to /proc/stat (i.e. one that focuses on public documentation rather than looking up the kernel source code for current implementation behaviour) must make every field optional. The only assumption that can safely be made is that if a field was present in a measurement, it will also be present on further measurements on the same hardware running the same kernel version.

As a result, a naive /proc/stat sampler would produce this kind of output on every readout:

struct StatSample {
    cpu_info: Option<CPUInfo>,
    paging: Option<PagingInfo>,
    swapping: Option<SwappingInfo>,
    /* ... */
}

...whereas a clever /proc/stat sampler can instead leverage the above assumption and avoid duplication of the Option tag on every sample by aggregating the optional nature of the data in the final container:

struct StatSamples {
    cpu_info: Option<Vec<CPUInfo>>,
    paging: Option<Vec<PagingInfo>>,
    swapping: Option<Vec<SwappingInfo>>,
    /* ... */
}

This is not the only crazy thing that /proc/stat does. It also mixes together fluctuating data (such as the amount of time that each CPU core spent in each Linux scheduler state) and invariant data (such as the boot time of the host). Hence my boot_timestamp_ms example above: I actually found that one in /proc/stat as well

/proc/meminfo

/proc/meminfo, which is the main source of data about memory usage in the public Linux API, is basically a dictionary (in Python terms) mapping string keys to various data volumes and counters.

There are plenty of keys in there (46 on my machine), and the man 5 proc documentation is pretty clear that which keys are present will depend on the kernel version and configuration. So it probably makes most sense to interface /proc/meminfo as a HashMap<String, Data> rather than a struct with one member per possible meminfo key. That being said, you definitely don't want to allocate one HashMap per meminfo readout like this...

struct NaiveMeminfoSamples {
    samples: Vec<HashMap<String, Data>>,
}

...when you can use the assumption that the kernel configuration is unlikely to change under your feet during a system monitoring run, and store your meminfo data samples like this instead:

struct OptimizedMeminfoSamples {
    samples: HashMap<String, Vec<Data>>,
}

As you can see, this maps quite directly to the "counters" member of the example NontrivialSample that I provided in my previous post.

Hopefully, that should clarify my motivation for building SoA containers that can account for acquisition-wide metadata. Now, back to the core topic of that discussion: can I take this metadata into account without coupling the procfs pseudo-file parser too tightly with its target container?

I like your idea of adding a header to the parser-container communication protocol, which contains all the acquisition-wide metadata. In the case of /proc/stat, that header could indicate which fields are present and what is the system boot time, whereas in the case of /proc/meminfo that header could indicate which keys are present and what kind of data they hold (either data volumes or raw integer counters, according to the contents of that file on Linux 4.11.8 running on x86_64).

struct ProcStatHeader {
    fields_present: Vec<SomeKindOfEnum>,
    boot_time_ms: Option<u64>, // Should actually be in the enum above
    /* Other data like the amount of CPUs could be there as well */
}

struct MemInfoHeader {
    structure: HashMap<String, TypeOfValue>,
}

The parser would initiate the commucation with the container by providing it with that kind of header somehow. In further communication, the parser would then only transmit the "new" data that changes from one sample to another. A naive way to do this for /proc/meminfo would be:

struct MemInfoSample {
    new_data: Vec<TypeErasedValue>,
}

However, doing it that way in a design where the parser sends one MemInfoSample value to the container on each transaction would still cause one dynamic memory allocation per sample. We have discussed two general directions to avoid this so far. Either iterate over (type-erased) individual values, like @matklad initially proposed, or go for a streaming design that reuses the sample struct from one measurement to the next, like @matklad proposed later and you concurred.

Does this clarify things so far? If so, we can probably move on and discuss the specifics of each of these designs (iterating over fine-grained values or reusing a single sample struct).

vitalyd · August 18, 2017, 10:54am

That does clarify things @HadrienG, thanks. In particular, it didn't dawn on me initially that parsing performance was quite important - I thought you were parsing once and then doing analysis over the collected data many times. But now that you've clarified, I understand the scenario better.

So the iteration over individual values proposal is very similar to streaming xml readers. They're readers though, not really parsers although the line is a bit blurry. The benefit of this approach is flexibility and laziness - the reader doesn't have to yield any values unless the caller requests them. It also doesn't need to know about the "big picture" schema - it just walks the underlying structure via the most primitive elements.

I think the streaming reader is worth further exploration for your usecase. It might actually be a better fit for what you're doing. Since it's lazy in terms of yielding values, there's less concern about it allocating anything unnecessarily. It also allows you to skip stuff entirely - an iterator, streaming or not, still has to fill out some structured record for you on each iteration.

Need to think more about this.

HadrienG · August 18, 2017, 2:40pm

The streaming reader concept that you are discussing reminds me of something. I'm pretty sure I already have something like it in my current parser implementation. Namely a shared parser building block which implements the common procfs pseudo-file parsing task of splitting the contents of a file into lines made of spaces-separated columns.

If you give it a multi-line input string that looks like this...

line1  col1.1  col1.2  col1.3
line2  col2.1  col2.2

...it will behave like a streaming iterator which successively outputs...

An Iterator<Item=&str> that, it iterated through, produces the strings "line1", "col1.1", "col1.2", "col1.3" and a terminating None, representing the first line.
An Iterator<Item=&str> that, if iterated through, produces the strings "line2", "col2.1", "col2.2", and a terminating None, representing the second line.
A terminating None signalling the end of the file.

I started by implementing a more classic file iteration strategy based on composing the Lines and SplitWhitespace iterators from the standard library, but later ended up implementing this optimized version while trying to get the speed up.

Is this the kind of thing which you have in mind, but for the final parser output (which has heterogeneous type, and hence runs into type erasure issues)?

vitalyd · August 18, 2017, 3:13pm

Sort of, but not quite what I had in mind for the reader. The reader, as opposed to an iterator, doesn't yield anything automatically. Instead, you call something like reader.move_next() on it. This advances the reader to the next basic element in the file. In this case, it would either advance to the next column value of the current line or, if st end of line, move to the next line. You can then ask the reader what type of element it's on. That type might be "value" or it might be "label" (e.g. proc/meminfo has a key:value line, where the key is the "label") or whatever. Once you know the type of element you're on, you can ask to parse that value as some type. The type will be caller provided since it'll know the type to expect (string or integer or float).

As mentioned, this is somewhat similar to streaming xml readers - they walk the xml structure, entering and exiting elements (recursively), their values, etc and allow pulling a value out of the current element you're on. So the abstraction/utility they provide is just a structured walk over the basic primitives of the underlying representation - they don't actually know how to parse it into any semantic object, and thus don't attempt to materialize some object on each move (as an iterator does). In some ways, it maps to what you'd like to do.

In your case, however, you're probably going to parse an entire line of every line right? So the individual element walk isn't actually adding any benefit? It seems like basic tokenization/split of a line is enough and then you parse and write the values into your collection. So the interface is a line reader with each split line? It seems you started with that but gave it up for efficiency reasons?

HadrienG · August 18, 2017, 4:11pm

Just a small note, which I don't think is related to your main point. I have learned through previous implementation work that when parsing procfs pseudo-files, you really want to mark a clear distinction between moving to the next line and moving to the next column, and should avoid moving to the next line automagically in the column iterator.

Here are some reasons why:

Debugging the parser is easier. If, due to a bug, you have a mistaken view of how many columns a certain line of the file should have, silently moving to the next line in the column iterator will needlessly obscure that bug.
Compatibility with future kernel revisions. Suppose that linux 5.13 suddenly decides to add a new line to /proc/stat, following a new schema that my parser does not yet support. In release builds, what I really want to do is to ignore that new line, no matter how many columns it has. This requires a distinct "next line" operation alongside my "next column" operation.
Ignoring the end of a line is often useful. This is partly related to my previous point (e.g. the CPU stats in /proc/stat have often been extended by adding a new counter in the past, increasing the amount of columns), and partly related to information which is only there for schema disambiguation and is not useful after the first sample (like the "kB" suffix for data volumes in /proc/meminfo).

Again, I don't think this is a showstopper for the design that you are discussing. We can use an iterator-of-iterator strategy similar to that followed by my file splitter for mapping the hierarchical line -> columns structure of the file into code.

Do I understand correctly that the main goals of this design are to...

Avoid using type-erased values (the reader is queried for data of the correct type)
Avoid parsing values which may not be needed in the end

...in which case the client would only need to check the value type for consistency checking (e.g. as an assertion that the parsed file contents are as expected), since it should already know what to expect from the reader. Am I correct?

Note that there also is some schema-aware parsing work to be done at some point. For example, at a low level, /proc/stat is currently just an array of (named) arrays of integers, but from a high-level view, these integers represent different quantities and should be parsed as data of different types (e.g. Duration for CPU stats), or even just as different integer types for the sake of memory bandwidth efficiency.

I did not understand this part very well, but here is why I gave up on the Lines + SplitWhitespace combo in favor of a custom variant (since I understand this is what you are wondering about in the end):

By design, such a combo must parse each line of text twice, once in Lines to extract a line from the input string, and once again in SplitWhitespace in order to separate it into space-separated columns. A more careful streaming design can halve the memory bandwidth usage by doing all the work in a single pass through the input text.
Lines and SplitWhitespace are pretty generic and designed for handling different line endings, different whitespace characters, etc. I don't need this generality because I know in advance what my separator characters are, and can reduce the amount of branching in my code by only looking for these specific separators.
Lines and SplitWhitespace and their internals are Unicode-aware, which in general is a good thing, but in my case causes more unnecessary run-time overhead since I know that my input is ASCII. Being able to assume that one byte of input is one character, in particular, turned out to greatly increase my file-splitting performance.

vitalyd · August 18, 2017, 4:28pm

Yeah, pretty much. The reader doesn't know the semantic schema, only allows traversing the lines and columns of the text you're processing. The caller would need to know the schema, but I think that's unavoidable if structure is going to be put on top of the data. So there would be a user of the reader (parser?) that knows it's reading, e.g., /proc/meminfo, and thus knows the schema of its output.

Yeah, fair enough. I didn't think of finding line endings and whitespace separators as "parsing", but it is processing.

Honestly, and I know this is not what you're trying to do, I would instead look at using syscalls/libc to gather most of the info that the pseudos return, and maybe only enhance it with textual output from there for things that aren't easily gotten from the native interfaces . But it's a digression, and I'm sure you have a good reason for doing this the way you're planning.

HadrienG · August 18, 2017, 8:26pm

The main reason, really, is that I could not find any other stable and public API for doing this on Linux:

Online sources on building Linux system monitors start from procfs and sysfs pseudo-files
Existing system monitors that I know of fetch their input from there (checked via strace)
"Official" Linux syscall lists do not seem to feature any interface for system performance tracking (I did not check glibc, but AFAIK it just builds on top of Linux syscalls)
IIRC, during my search, I even found a person or two bragging about this state of affairs because use of the procfs filesystem as a kernel API is true to the Unix design (everything is a text file) and avoids a proliferation of binary system call interfaces.
The only suggested alternatives involved instrumenting the kernel with tools like Systemtap, which seems unnecessarily dangerous, is usually privileged, and involves relying on kernel implementation specifics rather than stable public APIs, all of which I would rather avoid if I can afford to.

Maybe I did not look hard enough, though So if someone knows more, please make yourself known!

In the meantime, I will try to think some more about how practical it would be to refactor my parsers towards the interface style that we just discussed.

Topic		Replies	Views
Structuring CSV Data in Rust	3	377	January 30, 2024
"nested" lifetimes in structs? Is this possible to express? help	7	683	October 18, 2020
Best way to work with structured "view" on a slice of u8 help	4	1948	August 19, 2020
Datastructure: two vectors vs one vector of two elements help	4	288	May 6, 2023
Cache line optimization methodology for structs	1	1743	November 5, 2019

Looking for some architectural advice

Related Topics