Laying out tuples at runtime

Suppose, at compile time, we have types
T1: Copy, T2: Copy, T3: Copy, ..., Tn: Copy

Suppose, at runtime, we pick some subset, say (T2, T3, T5), and construct a Vec<(T2, T3, T5)> (except we don't know how many elements, or which types, until runtime).

Suppose now we want to compute the size of the tuple as well as the offsets.

If we do this manually, we need to take care of the issue of alignment. Is there some crate designed for this problem? (or even a name for this problem?)

The goal is to lay out tuples / structs at runtime from pre-defined fields.

You could use std::alloc::Layout; with Layout::extend you also get the offset of the second layout within the combined one. Applying it repeatedly lets you lay out all your fields accordingly.
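For example, a sketch of that repeated-extend approach (runtime_layout is a hypothetical helper; the padding at the end mimics what #[repr(C)] would do):

```rust
use std::alloc::Layout;

// Lay out a runtime-chosen sequence of fields the way a #[repr(C)]
// struct would, returning the overall layout and each field's offset.
fn runtime_layout(fields: &[Layout]) -> Option<(Layout, Vec<usize>)> {
    let mut layout = Layout::from_size_align(0, 1).ok()?;
    let mut offsets = Vec::with_capacity(fields.len());
    for &field in fields {
        let (extended, offset) = layout.extend(field).ok()?;
        layout = extended;
        offsets.push(offset);
    }
    // Round the total size up to a multiple of the alignment.
    Some((layout.pad_to_align(), offsets))
}
```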

2 Likes

What do you mean by "construct" here? If you haven't monomorphized that type into the binary, you can't literally construct it. And if you have monomorphized it, then you have all the layout stuff dealt with already.

What's the interface to this that you're attempting to provide?

Note also that tuples are repr(Rust), so you cannot predict their layout, and thus there's no way to compute at runtime what the layout would be in order to hand out references to instances you've constructed yourself...

2 Likes

This is a great point. Let's take a step back.

Suppose we are building a mini-SQL database.

Suppose we are storing data in memory, row oriented, so a single record is (ignoring alignment gaps) contiguous in memory.

For simplicity, let us further limit all columns to only fixed-size types.

For a given table, we support many types of columns:

i32, f32, i64, 
struct Point { x: f32, y: f32 }
struct Rect { p0: Point, p1: Point }
etc ...

so each column is either a Rust primitive, a Rust enum, or a Rust struct.
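Since plain Rust structs are repr(Rust), column types like these would presumably need #[repr(C)] for their layout to be predictable at runtime; a sketch:

```rust
// Fixed-size, Copy column types. #[repr(C)] pins down the field order
// and offsets, so the layout can be recomputed outside the compiler.
#[repr(C)]
#[derive(Copy, Clone, PartialEq, Debug)]
pub struct Point { pub x: f32, pub y: f32 }

#[repr(C)]
#[derive(Copy, Clone, PartialEq, Debug)]
pub struct Rect { pub p0: Point, pub p1: Point }
```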

Suppose now, we want to store each table as Vec<Record> where Record is some 'type' that we create by concatenating column types, at run time.

How would you solve this problem?

I hope this better clarifies what I mean by "we don't know the fields until runtime".

I was just fighting with the issue of how to combine N permutations of a set number of Sized types into a single ?Sized tuple in one box (i.e. reduce the heap pointers of a Vec<Box<dyn Trait>>). I started building from a desired state and went backwards from there until I reached this monstrosity [playground]

Also, the big issue I came across was that as soon as something is put in a Box all associated types need to be scrubbed from it or else it still carries a concrete type with it, so Vec<Box<dyn Trait>> can only ever be monomorphized to not allow reconstruction. What you describe sounds like you're on your way to a giant enum...

1 Like
  1. Our problems are definitely very similar in nature.

  2. I was thinking of something a bit lower level. I can deal with the constraint of all the column types being Copy -- so I'm trying to think of solutions that involve:

let X = arbitrary list of types

X => u8 byte array
X.get(selector, u8 byte array) -> right type
X.set(selector, u8 byte array, value) -> ()

where, behind the scenes, we use transmute to convert between a properly aligned byte slice and the Rust struct/enum/...
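A minimal sketch of that get / set surface over a raw byte buffer (RecordLayout, get_field, and set_field are made-up names; read_unaligned / write_unaligned sidestep the alignment question at the cost of unaligned loads):

```rust
// A runtime "tuple" descriptor: byte offsets of the selected columns
// within one record.
pub struct RecordLayout {
    pub offsets: Vec<usize>,
}

// Read field `i` of the record in `bytes` as a `T`.
// Safety: caller must guarantee T is the type laid out at offsets[i].
pub unsafe fn get_field<T: Copy>(layout: &RecordLayout, bytes: &[u8], i: usize) -> T {
    let off = layout.offsets[i];
    std::ptr::read_unaligned(bytes[off..].as_ptr() as *const T)
}

// Write field `i` of the record in `bytes`.
pub unsafe fn set_field<T: Copy>(layout: &RecordLayout, bytes: &mut [u8], i: usize, val: T) {
    let off = layout.offsets[i];
    std::ptr::write_unaligned(bytes[off..].as_mut_ptr() as *mut T, val);
}
```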

Also, if all the sets / permutations you care about are determined at compile time (mine are not), you might be able to get away with the lhlist crate (agnes-rs/lhlist on GitHub: labeled heterogeneous lists for Rust). The way you are building the selectors reminded me of the Nil / Cons list and lookup-by-label in its lookup.rs.

Well, the alternative option I started working towards was to make my own block allocation and divide it up to use as I see fit. That had the nice benefit that only "viewer" structs (defining locations of the data) needed to be put into an 'enum'. Also, the match to create would only require knowing N items taking up S size. It's still a very rough work in progress, but [playground]

1 Like

You're writing a rigid body physics engine right?

Where does your need for permutations come from?

It seems you have:


#[derive(Copy, Clone)]
#[derive(PartialEq)]
pub struct Euler<T, const DOF: usize> {
    current_position: *const [T; DOF],
    current_velocity: *const [T; DOF],
    solved_acceleration: *const [T; DOF],
    solved_position: *const [T; DOF],
    solved_velocity: *const [T; DOF],
}

and that is the only set/order that matters.

3-6 DOF Euler, Verlet, Leapfrog, RK2, RK4 (+other RK tableaus?) in any flavor of float type. The range of sizes for my enum solution was well over 10x, and seemed wasteful when 3DOF Euler will cover most common use cases.

1 Like

Actually, I think on this point, our problems differ quite a bit. It seems like you have a bunch of different types, of possibly different sizes, all implementing the same trait, that you want to stuff into one contiguous area of memory instead of fragmenting the heap.

My problem is a bit different in that, after we fix a table, all records of a given table have the exact same size and layout, so within a single Vec<Record> all the Records share precisely the same size / layout (whereas it seems you want a Vec of objects of possibly different sizes).

In my case, two different tables will likely have different Record sizes / layouts, but within a single table, all the Records are precisely the same layout / size.

1 Like

Ahh, interesting... Well, unless I'm still missing something, it all comes down to not being able to return different types from match arms. Have you seen the tuple_list crate? As long as you can declaratively write what you want to put in it, it works (but that's a critical limitation).

1 Like

I think there's still quite a lot of similarity; I just happen to have the escape route of being able to fall back on traits. Eventually I'd like to reach a similar point as you - a runtime-selectable, fixed-layout tuple of structs (with all required compile-time guarantees) that can be further collected (as a uniform type) into an independent runtime manager.

I'm now wondering if procedural macros might offer some help in generating boilerplate code for building such variadic collections. I've been meaning to get around to learning those...

1 Like

There are three "structures" I am trying to exploit to solve this problem. I do not understand your problem well enough to know if your problem also has these structures:

  1. Everything satisfies trait Copy. This is useful because I can view everything as "dumb bytes"

  2. If I switched from a "row first" to a "column first" representation, the problem has a fairly straightforward solution of:

struct Table {
  data: HashMap<String, Vec<u8>>,
  schema: HashMap<String, (usize, u64)>, // (type_size, type_type)
}

Here, in data: HashMap<String, Vec<u8>>, the String is the column_name; everything in a column has the same size (type_size), and its type is indicated by type_type.

Because the problem has such a "clean" representation in a "column first" representation, I think there are certain structures that can be exploited in a "row first" / record representation.

I do not know if your problem satisfies this condition.

  3. Because of (1) & (2), I think everything I care about can be represented using the Cap'n Proto schema language. I don't intend to use capnproto, but the fact that it serializes so nicely again implies certain structures.

Again, I don't understand your problem well enough to comment on your case. In my case, I don't think it matters: whatever the fixed O(1) cost is, I pay it once per column. After that, all "dynamic ways to combine columns" happen at runtime.

1 Like

This probably just comes from seeing so many graph/tree structures based on indices to avoid self-referential structs, but is there a way to wrap the column-first solution so that it appears/behaves as if it were row-first? Or are you specifically looking to work out a new format that is conducive to all uses?

I'd suggest that your type_size and type_type might need to be generics associated with the table: struct Table<TypeType, const TYPE_SIZE: usize>, or at least const values in some way so they remain compile-time guaranteed. Thus, you may end up needing to encode your entire top-level "row" type in the table's type anyway, or as some form of attached constants. Also, is the u64 for type_type meant to take advantage of TypeId?

1 Like

I am not a database expert, so take this with a huge grain of salt: Cache locality.

Common ops are: delete a row, insert a row. Uncommon ops are: delete a column, add a new column. So if we have a table with 10 columns: in row-first, we modify in one place; in column-first, we modify 10 Vecs.

On the read side, if we are searching for "everyone where name == 'John'", then, sure, column oriented wins as we only have to scan one column. However, if we are mapping through the records, and for every record we need to read multiple columns, then, once again, row-first has better cache locality.

I agree that the pseudo code I posted is trash. I don't know what the right solution is. It seems it needs to somehow encode the schema of the table, which involves encoding the type of each column, which may or may not involve enumerating each field: its name, its offset, its type. This is actually why I was looking at capnproto a bit earlier; the "schema" capnproto defines is very similar to the "schema" of a column.

I think this also implies you are right about using procedural macros. This does sound like a situation involving "running through the struct/enum" and outputting lots of stats. Possibly something like:

pub trait CanBeColumnT {
  // per field: (name, offset, type); how to represent "type" is still ???
  fn enumerate_fields() -> Vec<(String, usize, ???)>;
}

and then the procedural macro auto generates impl CanBeColumnT for ... { ... }

dunno, not sure
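For what it's worth, here is a hand-written version of what such a derive could expand to for a Point column, using std::any::TypeId as the field-type tag and std::mem::offset_of! (stable since Rust 1.77) for the offsets; both choices are assumptions, since the "type: ???" slot is left open above:

```rust
use std::any::TypeId;
use std::mem::offset_of;

#[repr(C)]
#[derive(Copy, Clone)]
pub struct Point { pub x: f32, pub y: f32 }

pub trait CanBeColumnT {
    // per field: (name, byte offset, type tag)
    fn enumerate_fields() -> Vec<(String, usize, TypeId)>;
}

// What a hypothetical #[derive(CanBeColumnT)] could expand to for Point.
impl CanBeColumnT for Point {
    fn enumerate_fields() -> Vec<(String, usize, TypeId)> {
        vec![
            ("x".to_string(), offset_of!(Point, x), TypeId::of::<f32>()),
            ("y".to_string(), offset_of!(Point, y), TypeId::of::<f32>()),
        ]
    }
}
```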

====

What do you think? We're like two blind people looking for a needle in a haystack, in the dark, without a magnet. But I think we're making progress.

1 Like

Ahhh! I'm sure I'm even less of an expert with databases, but that definitely seems important. If you care about rows, but data exists as columns, grabbing row 10,517 would mean making that index call to each column. Not great, especially if those are of a size where they may need to be cached to disk!

Totally! At the very least I've found that making assumptions about an ideal concept that could exist at runtime and trying to work backwards from there has helped me to learn more about Rust.

Are you looking to retain mutable ownership of the table at all times? If not, immutable structs with multiple overlapping localized views into segments of data might give some flexibility. Though that gives the added complexity of having to shadow variables into mutated versions of themselves any time you do want to add/remove columns...

This is the issue I was running into about not being able to monomorphize traits. As far as I was able to work out, if you want to be able to return a type from raw data, it needs to be either a fixed return type or as an associated type with the trait. So, I could do this:

trait Append {
   type Next;
   fn append<T: Copy>(self, val: T) -> Self::Next;
}

But you can never break out of the fact that that type now must be enumerated with the trait, and thus:
dyn Append<Next = (E0, A)> is different than dyn Append<Next = (E0, B)>

I'm not sure I understand the question properly. Here is the answer I have in mind:

pub struct Table {
  data: HashMap<String, Vec<u8>>,
  schema: HashMap<String, (usize, u64)>, // (width, type_id)
}

impl Table {
  fn get<T>(&self, column_name: &str, idx: usize) -> Option<&T> {
    let &(width, type_id) = self.schema.get(column_name)?;
    if type_id != T::some_assigned_type_id { return None; } // placeholder check
    let bytes = &self.data.get(column_name)?[idx * width..(idx + 1) * width];
    // transmute the (properly aligned) bytes into a &T
    Some(unsafe { &*(bytes.as_ptr() as *const T) })
  }
}

The schema check can be optimized to once per column instead of once per get, at the cost of a bit more complexity.
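That once-per-column optimization could look like a typed column handle that performs the schema check when it is created (TypedColumn and the hard-coded type-id values are assumptions of this sketch):

```rust
use std::collections::HashMap;
use std::marker::PhantomData;

pub struct Table {
    pub data: HashMap<String, Vec<u8>>,
    pub schema: HashMap<String, (usize, u64)>, // (width, type_id)
}

// Checked once at creation; get() afterwards skips the schema lookup.
pub struct TypedColumn<'a, T> {
    bytes: &'a [u8],
    width: usize,
    _marker: PhantomData<T>,
}

impl Table {
    pub fn column<T: Copy>(&self, name: &str, expected_type_id: u64) -> Option<TypedColumn<'_, T>> {
        let &(width, type_id) = self.schema.get(name)?;
        if type_id != expected_type_id || width != std::mem::size_of::<T>() {
            return None;
        }
        Some(TypedColumn { bytes: self.data.get(name)?, width, _marker: PhantomData })
    }
}

impl<'a, T: Copy> TypedColumn<'a, T> {
    pub fn get(&self, idx: usize) -> Option<T> {
        let start = idx.checked_mul(self.width)?;
        let bytes = self.bytes.get(start..start + self.width)?;
        // Sketch-level caveat: assumes type_id uniquely identifies T.
        Some(unsafe { std::ptr::read_unaligned(bytes.as_ptr() as *const T) })
    }
}
```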