2d table dropping of objects

Thanks to everyone for discussion in MagicRc, MagicWeak, MagicHashSet , it greatly clarified the question, which I am reposting here.

Suppose we have a 2D table.

  • In the top row, we have obj-id. In the left colunmn, we have snapshot-id.
  • Some (not all) cells are filled in. snapshots only contain some objects; objects only appear in some snapshots
  • when an snapshot-id is dropped, we want to drop everything in the corresponding row
  • when an obj-id is dropped, we want to drop everything in the corresponding column
  • EDIT: objects can have different types; all objects within a single column have the same type, but different columns can have different types

How do we do this?

How sparse is the array? 1%? 10%? 25%? 50%? 90%?

Since all objects within a column have the same type, to me that implies that a virtual array (or sparse array) of virtual column vectors (or sparse column vectors) probably would be optimal. The only serious inefficiency with such an approach is if the column vectors are implemented as sparse vectors (i.e., as hash sets or btree sets, etc) and you have to drop a row. Different solutions come to mind, depending on how sparse the column vectors are, as well as the size of the stored objects.

If those stored objects are large and vary significantly in size I probably would store each in a box, probably using an arena allocator. If they're all the same size I might use an ECS-like array and reclaim dropped entries for potential reuse. Without knowing the statistics of your input, nor whether you get it all at once or intermittently over a long period of time, it's difficult to design a high-performance solution.

2 Likes

I do not understand the focus on sparsity / high performance.

IMHO, the core of the issue is we need a 'table cell' that drops when EITHER snapshot-id drops or obj-id drops.

Usually for this sort of thing, you’ll have a main storage that’s organized in row-major or column-major order depending in your primary access pattern which contains owned copies of all the cells. The following assumes you chose row-major; column-major works similarly.

Deleting a row is simply removing it from the top-level of main storage. If the cells implement Drop, the compiler will need to scan the removed row to call its destructors. To remove a column, you need to scan through all of the rows and remove the value individually from each.

If there are Drop handlers, either case will take time proportional to the number of cells being removed, but removing a row will be much faster if there are no destructors.

There are more exotic memory layouts that make different tradeoffs between memory and insert, update, and delete performance. They tend to be more complicated, and are usually best saved for after the simple scheme proves insufficient.

1 Like

I think I have an implementation working along these same techniques. I was hoping for something cleaner, but I think a certain amoung of messiness is unavoidable here.