In memory databases as first class types

John_Nagle · October 30, 2021, 5:57am

Some thoughts on generalizing the concept of hash tables in Rust.

Rust has a lot of hash-type crates:

HashSet - one key field, zero data fields, no duplicates.
HashMap - one key field, one data field, no duplicates.
multi-map::MultiMap - two keys, one data field, no duplicates.
multimap::MultiMap - one key, one data field, duplicates.

They're all from different authors with different APIs. I'm using all four of these in a client for a 3D virtual world. There's a lot of state to track.

These are really all subsets of a relational database. But full-scale relational databases are overkill for many applications. These are really just multiple hash maps kept in sync. Suppose we had a generic that could do all of these.

Conceptually, this is simple. A database row is a struct. A database table is a vector of structs. Operations on the database look like the things you can do by iterating over a vector. That's slow, but it's a model for thinking about how to express this.

Suppose we had a macro that took a struct and a list of fields which are to be indexed, and cranked out the appropriate data structure and functions to create a lightweight database. Indices could be equality only, represented as a hash. Or allowing greater and less tests, with a tree type index. Indices could be unique, or allow duplicates. That's it for the table definition.

The usual select, insert, update, delete functions would be provided. Select by one or more keys. That's it for the lookup functionality.

This should never be worse than rolling your own from multiple HashMaps. It's really just a scheme to handle all the boilerplate of updating multiple hash maps that are supposed to stay in sync and boil it down to a familiar database-like form.

quinedot · October 30, 2021, 6:16am

CC @2e71828 (maybe).

scottmcm · October 30, 2021, 7:15am

Can I interest you in an ECS? Or maybe in sqlite, depending what you're doing?

(Terminology nitpick: "first class" means things like "str", which is not what you mean here.)

John_Nagle · October 30, 2021, 7:28am

No, I don't want an entity-component system. Or all the overhead of SQLite. Just coordinated hash tables.

I'm using hashes for things such as "which textures are already loaded", and "which faces need to be updated when texture X is loaded". Also for objects which have to be looked up by multiple keys.

2e71828 · October 30, 2021, 7:45am

My master's thesis is essentially a proof-by-prototype that what you're describing is possible, even in the most general case: It describes a Relation trait for tables that can be implemented for all kinds of storage and indexing strategies so that they're drop-in replacements for each other (as long as the set of fields in the record matches). Query planning is done entirely at compile time, and produces runtime performance similar to hand-written implementations.

There are a few downsides to my implementation, though: All of the compile-time computation is done in the type system, which is incredibly slow¹, and the process of writing a new collection type is quite daunting. If I were to do this again, I'd look into an external code-generation tool that only produced code for the specific queries and structures used in the program instead of the generic soup I ended up with.

¹ And occasionally taxes the trait solver to the point of failure.

droundy · October 30, 2021, 1:09pm

I started something like this polygraph and then other things for busy and I dropped it. The main design choice that distinguishes polygraph (which mostly sort of works?) is that it used my tinyset to store the indices between struct types, which makes it pretty compact. I also made no attempt at speeding Ord queries.

geebee22 · October 30, 2021, 5:36pm

I have been working on a database written in Rust for some time now. It allows direct access to the data when iterating through a table ( so no copying overhead ). Data is stored in fixed-size pages ( typically 16kb ), with 3 bytes of overhead used to implement a binary tree ( two bits for the balance, 2 x 11 bits for left/right node ids ).

It's not finished though, I am continuing to work on it. Probably not relevant, but thought I would mention it.

John_Nagle · October 30, 2021, 5:48pm

So others recognize the need.

Gebee2's Memquery system is impressive. That goes much further than I'm talking about. I'm not asking for inter-table joins, where relational databases get serious. I just want something to handle the case where you need several HashMaps indexing the same data, and need to update them all in sync. It's just bookkeeping boilerplate, an appropriate thing to do with generics.

I could write this, but not elegantly. It needs someone who spends more time writing generics and macros than I do.

geebee22 · October 30, 2021, 6:00pm

I am not planning on supporting joins (although I guess I could), just indexed tables. There is an SQL-like language, with the ability to add your own Rust functions, but data can also be accessed directly from Rust code. Records have a stable address ( an Rc+RefCell and an offset into the byte data), although if you need to update an indexed field, you cannot easily do that directly. The things I haven't implemented yet are things like DROP TABLE, which probably wouldn't be needed for an in memory application. I am still deciding on the design of how to shrink the underlying file on DROP operations, and generally how to manage raw file storage after DELETE and DROP type operations.

2e71828 · October 30, 2021, 7:37pm

There's an uncanny valley in this problem space: It's relatively simple to make something that works for a hyper-specific problem, like the modules you mentioned in your first post. Problems tend to grow in scope over time, though, so those specific solutions start to become an issue. You can look for a new turnkey solution that solves your new, more complicated, problem, but you'll inevitably hit the same issue further down the line when your problem grows again. Eventually, your requirements will be so unique that there won't be a turnkey solution available.

In the relational database world, they had all of these problems 50 years ago and "solved" them with a unified theory, relational algebra. My thesis was an attempt to show that the same formalism is useful for managing data organization within a program. Any given program will probably only use 20% of what's available, but every program will use a different 20%; there's not much that can be cut out and still be left with something that's useful for more than a handful of programs. I think I made this point successfully, but didn't end up with a production-ready system; hopefully, someone will make another run at it someday.

So, you'll find lots of prepackaged simple data management tools and a few complex, comprehensive ones. But in between there is, and probably will always be, a complexity level where the simple tools aren't enough but the complex ones are too heavy for one reason or another. It sounds like you're venturing into that middle space where writing a bespoke solution tailored to your particular problem is the only practical option.

As an aside, I've found that people get the wrong impression as soon as you say the words "database," or "relational algebra"— Those words call up images of slow, giant systems rather than basic information management.

2e71828 · November 6, 2021, 11:18pm

I sketched out what the skeleton of this might look like. It's missing some key features that are hard to add in a generic fashion but probably aren't too hard for a specific problem. In particular, making it generic over the different data types you might want to index on is quite difficult.

I also haven't tested the code at all beyond making sure it compiles, so there may be some significant logic mistakes.

_{Apologies for the lack of commentary or examples, but I've spent too long on this already}

anon80458984 · November 6, 2021, 11:42pm

This is a very interesting topic. I want to see if I understand what you want. You want a Rust data structure that:

is like an ACID database, minus the durability (no write ahead log, no serializing to disk, no recovery, may or may not involve b-trees)
you do want to keep the concept of indexes
instead of SQL, you want to write transactions in Rust

Is that correct ?

So something like GitHub - tonsky/datascript: Immutable database and Datalog query engine for Clojure, ClojureScript and JS but Clojure -> Rust ?
A shallow dive into DataScript internals @ tonsky.me

There is the now unmaintained: GitHub - mozilla/mentat: UNMAINTAINED A persistent, relational store inspired by Datomic and DataScript.

John_Nagle · November 7, 2021, 12:12am

Generally, yes. But something that's simple enough that it doesn't need a query language, 87 releases, 5 talks, a Patreon, a book and/or a thesis, a web server, a REPL, and a maintenance team. Just a crate, please.

Work in this area tends to be way overdesigned. Stop before adding joins. That's when it gets complicated. You should be able to look up by any one index. If you need lookup by multiple keys, use a real database. Or pre-create an index that covers more than one column. Then you can use that for lookups by two keys. But only for cases where that index was defined in advance.

This is really about keeping all the hash and tree inserts and removals straight. That's just bookkeeping, and bookkeeping that's often done wrong. If you're inserting into multiple indices, and one of them reports an unallowed dup, something has to back out the other inserts. That's precompilable boilerplate. The lookup part is trivial.

system · February 5, 2022, 12:12am

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.

Topic		Replies	Views
Is there a crate where you can look up items in a Vec in O(logn) via multiple struct properties? help	22	1109	September 18, 2020
HashMap performance	26	28041	January 22, 2020
Rust data structure that is indexed twice; ECS? help	8	967	June 1, 2022
Database recommendation for my use case help	3	828	January 24, 2022
Multiple Implementations of a Data Structure	7	1166	September 28, 2020

In memory databases as first class types

Related topics