How should I handle dynamic data?

Let's say my program listens on a socket and receives two payloads. How it demarcates the first from the second is not important; we could say the first x bytes are payload #1.

What is important is that the first payload contains data of some sequence of types: integer, String, bool, etc. For example, the payload could be a tuple of (u32, String, bool), or (String, u32), or (bool, bool, bool, bool, bool, bool, i32), etc.

The second payload describes the actual type of each data item in the first payload. The second payload is understood by the program based on an agreed-upon protocol; that is, it can be deserialized safely because its type is statically known. A dumb example of a second payload could be a Vec where each String item names one of the first payload's types, and the index in the Vec corresponds to the "index" in the first payload's tuple.

["isString", "isi32", "isBool"] -> could be looped through and used to "convert" the first payload's sequence of tuples to (String, i32, bool).

The first payload can vary with any number of permutations, so statically covering all possible permutations is tediously out of the question.

My question is: how should I go about turning the first payload into a Rust type (maybe a tuple), given that I have to figure out its type dynamically from the second payload? I don't think any special type tricks make this possible statically at all. Maybe variadic generics? In any case, is the only way forward to transmute memory or use unsafe Rust, or to thwart the type system altogether? Is there any prior art?

[EDIT] Oh, I forgot to mention: no enums. The resulting data from payload one should not be wrapped in enums, at least not at the time I need to use the data. Hundreds of millions of tuples, each needing to be pattern matched, is a performance hit I'd like to avoid. They are either of the type described by payload two or the program panics. No compromise.

It sounds like the language element you're looking for is the enum, which allows you to work with data whose type is only known at runtime.

Nothing unsafe should be required. The process of turning a sequence of bytes into a sequence of values is called deserialization, and the widely used crate for that is serde. You could of course write the deserialization code yourself, but either way the details depend on the format the data arrives in.
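For example, a minimal sketch assuming payload two arrives as JSON and is read with serde_json (the actual wire format is entirely up to you):

fn main() -> Result<(), serde_json::Error> {
    // Payload two has a statically known shape, so it deserializes directly.
    let payload_two = br#"["isString", "isi32", "isBool"]"#;
    let descriptor: Vec<String> = serde_json::from_slice(payload_two)?;
    assert_eq!(descriptor, ["isString", "isi32", "isBool"]);
    Ok(())
}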

Oh, I forgot to mention: no enums. The resulting data from payload one should not be wrapped in enums, at least not at the time I need to use the data. Hundreds of millions of tuples, each needing to be pattern matched, is a performance hit I'd like to avoid. They are either of the type described by payload two or the program panics. No compromise on the individual elements.

Ah I think you omitted to mention that payload 2 does not have one entry per data item in payload 1.

Payload two guarantees that each tuple in the data stream will be of the type it describes. So think of it as a one-to-many description.

Payload#2 -> "isString", "isi32"

Means that each "tuple" in payload#1 will be of that type. So I can guarantee that much.

Can you describe what you want your code to be able to do with this data after it's been converted? Based on your generic description it sounds like your code couldn't do anything with it, since it couldn't know what the type is, but presumably there is some sense in which you wish to be able to manipulate the resulting data?


I would like to manipulate the data: test equalities, compare strings, format them, maybe display them, maybe some mutations. But I would like to eventually normalize each of the tuple elements into some type, so that each data manipulation isn't having to pattern match its type against an enum of possible types.

If you have dynamic types, there is no way around checking their types. You can artificially "avoid enums" but any substitute you find will eventually contain a type check at some point. So you might as well use enums in the first place. You can sort all the messages into separate containers based on their type, and then process those – homogeneous, statically-typed – containers in one go, reducing the number of type checks to the absolute minimum of 1 per message.
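A minimal sketch of what I mean, using just two placeholder types:

enum Msg {
    S(String),
    I(i32),
}

struct Buckets {
    strings: Vec<String>,
    ints: Vec<i32>,
}

// Exactly one type check per message; everything downstream of this is
// statically typed and needs no further matching.
fn bucket(msgs: Vec<Msg>) -> Buckets {
    let mut b = Buckets { strings: Vec::new(), ints: Vec::new() };
    for m in msgs {
        match m {
            Msg::S(s) => b.strings.push(s),
            Msg::I(i) => b.ints.push(i),
        }
    }
    b
}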


By the way, if you are comparing string contents, then checking the discriminant of an enum (which is a single integer comparison) likely isn't going to be significant at all, because reading the buffer of each string will put great pressure on the L1 cache, so dereferencing all the Strings will make your CPU busy-wait for RAM most of the time anyway.


I'm looking for something more specific: do you want to write a single function that is going to be able to operate on every possible type of input? If so, that function is going to need to have some restriction on those inputs, which will presumably be expressed in terms of a trait.

You describe the data as a sequence of tuples, but the number of elements in each tuple is only known at runtime. It sounds like a hypothetical function that manipulates a sequence of data consisting of an indeterminate number of elements of indeterminate types isn't going to be able to do any of those operations on it.

Ultimately you can't express the conversion you want until you have expressed what you want to convert it into, and that's where I'm struggling. Is that your question, how to express in Rust a data type that is only known at runtime? Again, we need to know what you can do with that type; otherwise the answer is that it's a type you can't do anything with, and it may as well be a zero-sized type to save space.

If you have dynamic types, there is no way around checking their types. You can artificially "avoid enums" but any substitute you find will eventually contain a type check at some point. So you might as well use enums in the first place.

Well, of course there is going to have to be some kind of type conversion at some point. I'm asking how I should go about it in a way that:

  1. Front-loads the type conversion for all data as much as possible, so subsequent operations on the data aren't repeatedly checking the types for each and every single operation. I want the type to be "settled" at some point.
  2. Doesn't require the conversion to be tediously written for every possible permutation of types.

I was hoping I could use some unsafe type conversion in combination with the frunk crate to handle the variadic nature of the types, maybe?

@droundy The number of elements is variadic, yes. And they are dynamic in the sense that they are only known at runtime. It won't be useful to have all the possible types implement a trait and operate on a trait object, if that's what you're thinking; they don't really have common shared trait behavior. I used ints and strings in the example, but the real types might include dates and other things. Some types might need to be checked for equality, and others simply can't be checked for equality. Some might need to be incremented because they are ints, and some not. No totally shared behavior.

Maybe you're thinking to optimize

enum Any {
  S(String),
  I(i32),
  ...
}
let data: Vec<Any> = ...

where each element could be a different type, for the case where all elements must be the same type, into

enum AnyVec {
  S(Vec<String>),
  I(Vec<i32>),
  ...
}
let data: AnyVec = ...

Except that you are looking at tuples, so you're wanting something like

enum AnyTupleVec {
  SI(Vec<(String, i32)>),
  ISS(Vec<(i32, String, String)>),
  ...
}
let data: AnyTupleVec = ...

Except that you don't want to treat every possible tuple separately, so you'd rather have something more like

enum AnyVec {
  S(Vec<String>),
  I(Vec<i32>),
  ...
}
let data: Vec<AnyVec> = ...

so you can have just one discriminant per type but many elements. Except that this would mean the data type allows different numbers of elements for each element of the "tuple", which is not ideal. Also, if you're operating on one "tuple" at a time, you're lacking in spatial locality.

In theory you could create a data type that interleaves the different elements, so you have the data layout of AnyTupleVec without the inconvenience of treating separately all possible tuples of interest.

Is this describing what you're asking?

Do you need to maintain the sequence or can you use something like

struct Something {
    nums: Vec<u32>,
    strings: Vec<String>,
}

? Even if you do, maybe storing an index with the item like strings: Vec<(usize, String)> would be acceptable?

@droundy @Heliozoa That's an interesting direction. What I describe as "tuples" just means a sequence of data, and the key thing is that the order of the data isn't important. So maybe I could just change how I'm operating on it: split up all the data by their individual types and place them into a series of homogeneous vecs like you showed, stuff those vecs into an enum, and go from there. If ordering were important, I could even include a vec of labels/indexes representing the original ordering.
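Something like this, maybe, if the original ordering ever needs to be reconstructed (placeholder types again):

enum Item {
    Num(u32),
    Str(String),
}

struct Split {
    nums: Vec<(usize, u32)>,       // (original position, value)
    strings: Vec<(usize, String)>,
}

fn split(items: Vec<Item>) -> Split {
    let mut out = Split { nums: Vec::new(), strings: Vec::new() };
    for (i, item) in items.into_iter().enumerate() {
        match item {
            Item::Num(n) => out.nums.push((i, n)),
            Item::Str(s) => out.strings.push((i, s)),
        }
    }
    out
}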

It's a solution, but I will have to redesign all the operations to work on this less natural, somewhat awkward representation; still, maybe a workable one. Is there really no way to settle on a type for all the "tuples" pulled off the socket at runtime and generically operate on them? Crates like tuple_list and frunk seemed promising for maybe allowing some static guarantee. I know I can't do this 100% generically, as these are runtime-determined types; that's logically impossible. I'm just looking to get away with as much as I can.

// example op (pseudocode, not valid Rust; the tuple is meant to be variadic)
fn operation_one<Data: (T, T2, ...), T: Eq, T2: Eq>(d: Data) {
     // knows about T and T2 and can test T == T2; the rest of the tuple is variadic, so it can be ignored
}
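(For what it's worth, the closest compiling thing I can picture uses frunk's HLists, and it only works because the head types are spelled out statically, and both heads have to be the same type, so it doesn't actually solve the runtime part:)

use frunk::hlist::HCons;

// Compares the first two elements; generic over whatever the tail is.
fn operation_one<T: PartialEq, Tail>(l: &HCons<T, HCons<T, Tail>>) -> bool {
    l.head == l.tail.head
}

fn main() {
    let l = frunk::hlist![1u32, 1u32, "ignored", true];
    assert!(operation_one(&l));
}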

That's exactly what I was answering here:


I'm not sure what unsafe has to do with any of this. There's nothing inherently unsafe here, unless you are trying to directly transmute a bunch of bytes into a String, for example; however, that is highly unlikely to be what you want, because a String's in-memory representation can't be transmitted between different processes/computers/etc. through a socket anyway. So you probably need to use some sort of serialization/deserialization format, but that is orthogonal to optimizing type checking, and ultimately you have to decide on the format you send the data in.

You're right, unsafe has nothing to do with this. I meant serialize/deserialize.

No, there is no way to resolve types at runtime. When the compiler compiles generic functions and types, it finds every possible instantiation and produces only the versions for the types it found at compile time (monomorphization). Also, you say you'd like to avoid pattern matching against types, but at some point you'll have to pattern match against your payload 2 to determine what you're working with.
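A tiny illustration of that (my own sketch, assuming little-endian i32 bytes and UTF-8 strings just for the example): the match on the payload-2 tag happens once, and inside each arm the code is fully statically typed:

fn handle(tag: &str, bytes: &[u8]) {
    match tag {
        "isi32" => {
            // Inside this arm the type is fixed at compile time.
            let value = i32::from_le_bytes(bytes[..4].try_into().unwrap());
            println!("got an i32: {value}");
        }
        "isString" => {
            let value = String::from_utf8(bytes.to_vec()).unwrap();
            println!("got a String: {value}");
        }
        _ => panic!("unknown type tag"),
    }
}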

If you are looking for a fast and/or space-efficient serialization format, you could have a look at binary formats such as the ones below (a minimal Bincode sketch follows the list):

  • Bincode
  • MessagePack
  • BSON (Binary JSON)
  • Protobuf and derived formats.
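As a minimal sketch of the first one, assuming Bincode 1.x with serde (Bincode 2 has a different API):

fn main() -> bincode::Result<()> {
    // The descriptor (payload two) has a statically known type, so it
    // round-trips without any custom code.
    let descriptor = vec!["isString".to_string(), "isi32".to_string()];
    let bytes: Vec<u8> = bincode::serialize(&descriptor)?;
    let decoded: Vec<String> = bincode::deserialize(&bytes)?;
    assert_eq!(descriptor, decoded);
    Ok(())
}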

I haven't come up with an idea for your tuple-like type with static guarantees, but here's a sketch if you're looking at data like a set of typed columns:

struct Table {
  strings: HashMap<Key, Vec<String>>,
  i32s: HashMap<Key, Vec<i32>>,
  ...
}
impl Table {
  // Look up one whole column by its key.
  fn get(&self, key: Key) -> AnySlice {
    if let Some(v) = self.strings.get(&key) {
        AnySlice::S(v)
    } else if ... {
        ...
    }
  }
  // Gather the values at one row index across all columns.
  fn index(&self, row: usize) -> Row { ... }
}
struct Row {
  strings: HashMap<Key, String>,
  i32s: HashMap<Key, i32>,
  ...
}

where Key is whatever type you want to use to label your columns, and AnySlice is like AnyVec above but not owning its data.

The Row type is a little silly, as it's quite heap-allocation heavy to construct, but it illustrates one way you could access a single row at a time if needed.


Thank you and thanks to all for helping me brainstorm. This is an alternative path for me to think about and it's promising.


And thanks for the interesting problem to think about!
