So basically I am creating a server-sent events library and I have a Vec of structs with this format:
```rust
struct Subscription {
    sse_id: String,
    user_id: Option<i32>,
    room: String, // or Vec<String> for the list of rooms they're in?
    sender: mpsc::Sender<Command>,
}
```
I need to provide support for the following operations:
- Iterating over the entire list of subscriptions to send an empty heartbeat to each user
- Letting a user join a room via their SSE ID
- Letting a user leave a room via their SSE ID
- Sending a message to an individual user via their User ID
- Sending a message to an individual user via their SSE ID
- Sending a message to everyone in a single room (all users that have the same room String)
- Sending a message to everyone from a list of User IDs
- Deleting a user from the list when sending a message to their mpsc fails
This is all achievable with a Vec, but the problem is that I need to iterate the entire Vec for many operations. That can be slightly alleviated by sorting the Vec on a single one of the struct's properties (like sse_id, room, or user_id) and then binary searching, but many of the above operations require querying based on different properties.
What I am looking for, I think, is something similar to a database table, where you can index multiple columns, and then query for rows based on that. I would like to be able to index my Vec's items by user id, sse id, and room, but I'm not sure if that can be done using the standard library. I was wondering, is there a crate that provides this sort of functionality perhaps? Or is there another way to do this that I am not considering?
I’m actively developing a crate that will be able to address this kind of problem, but it’s nowhere near functional yet, so I’d also be very interested in hearing about any solutions that are already out there.
I’ve mostly been spending my time figuring out what the right API would be and don’t have much in the way of implementation. This is the sort of solution that I hope to enable, but it’ll likely take all summer to finish even a decent prototype:
Proposed Syntax
```rust
// Define newtypes for columns / fields
def_fields! {
    SseId: String,
    UserId: Option<i32>,
    RoomId: String,
    Channel: Arc<mpsc::Sender<Command>>,
}

let connections = Table::<(SseId, UserId, Channel)>::new();
let subscriptions = Table::<(RoomId, SseId)>::new();
// Set up indices; I haven't thought about this step enough

fn send_to_all(msg) {
    for c in connections.iter_as::<Channel>() {
        c.send(msg);
    }
}

fn send_to_room(room: RoomId, msg) {
    for c in subscriptions.select::<RoomId>(room)
        .join::<SseId>(connections)
        .iter_as::<Channel>() {
        c.send(msg);
    }
}

fn send_to_users(users: &[UserId], msg) {
    for (sse, c) in users.join::<UserId>(connections)
        .iter_as::<(SseId, Channel)>() {
        c.send(format!("{}: {}", sse, msg));
    }
}
```
Oh cool! Well, I'm excited to see what you come up with. I was actually pacing around for the past hour and thought of a way to slightly alleviate my problem (although not completely).
I think that the number of queryables can be reduced from 3 to 2. My idea is to prefix the User ID to the randomly generated SSE ID. So if the User ID was 362, then the SSE ID would be something like "362-Qb07If1oO1aDfavuH8iDfyBFohHZwSK7fzHEIDFj"
Then if that were stored in a sorted Vec, you could query by either SSE ID or User ID (since you could just binary search for something like 362-000000000000000000000000000000 if you only had the User ID for a lookup, or you could binary search by the entire SSE ID if you had that.)
The only issue is rooms. I suppose you would still need a HashMap from SSE ID -> Vec<Room>, and a HashMap from Room -> Vec<SSE ID>. I'm trying to think of a way around that so only a single Vec is needed, but I'm not sure.
As far as I know, there are only two real ways to do a query: you either have a precomputed index where the key starts with your search field, or you do a sequential scan over the table. For multifaceted searches, you can combine the two approaches if you're clever.
To some extent, this could be like the indexmap crate, where the implementation is roughly like a pair of collections, (Vec<(K, V)>, HashTable<usize>), where the table stores indexes into the vector. This idea could be extended to multiple tables:
So you can have multiple lookup tables, but the actual data is only stored in one place.
If you want to implement this, you might look at indexmap#131 where I'm porting to hashbrown::raw::RawTable, because that API is pretty straightforward.
Hmmm... to be super quick on lookup when sending messages, why not:
Sending a message to an individual user via their User ID
Have a hash map of users keyed by User ID. A simple get returns whatever is needed to send a message to that user.
Sending a message to an individual user via their SSE ID
Another hash map of users keyed by SSE ID. As above.
Sending a message to everyone in a single room (all users that have the same room String)
A hashmap of rooms, keyed by room string. Each element of this is another hashmap of users keyed by whatever identity.
Sending a message to everyone from a list of User IDs
Iterate through that list of User IDs and use the User ID hash map above.
Essentially we now have keys, User ID, SSE ID, Room name, to enable fast lookup of user(s) by whatever criteria when required. The problem then is:
Let a user join a room via their SSE ID
Let a user leave a room via their SSE ID
Deleting a user from the list when sending a message to their mpsc fails
All those hash maps described above have to be updated appropriately as users arrive and depart, join and leave rooms etc.
You're not overthinking it, but it can be a bit more complex than that. For example, you would also need a HashMap from UserID -> Room and SSE ID -> Room. The only reason for their existence would be so that when a user disconnects (sending a message to their mpsc fails), you need to find which rooms they are in so that you can remove them from that room's hash set. If you are in the context where you only have the User ID, then you need the former, and if you only have the SSE ID, then the latter. Keeping 4-5 HashMaps in sync is a bit of a pain.
Thankfully this post was a bit of a rubber duck, as I have now realized only two collections are needed. This is what I ended up with and have now gotten working:
The first is a Vec of entries sorted by User key (a string of the user id, then a hyphen, then the SSE ID). Each entry contains the MPSC sender and also a HashSet of the rooms the user belongs to. Now we can query a user by either their SSE ID or their User ID (by doing a prefix search with binary search: e.g. searching for where the prefix "362-" would sort if their user id is 362).
The second collection is a HashMap that maps each room to the set of User keys (USERID + "-" + SSEID) in it. This lets us quickly broadcast messages to everyone in a room. Ideally only one collection would be needed, but I'm not sure how I could quickly query by User ID, SSE ID, and room in a single collection. This seems to be working in any case, so I'm happy now.
A couple people have mentioned BTreeMaps and that they might be superior here instead of a manually sorted Vec with binary_search, but I don't know anything about them. I'm tired but am going to look into them tomorrow and see if they might be a better fit.
I put some time into a crate like this and then got too busy. I'd love to collaborate (keeping in mind I don't have much time for this) or have you use my ideas and/or code.
My main thinking is to have a proc-macro convert a set of struct (maybe also enum) definitions into linked tables in an in-memory database. There would be a Key type to create a variety of connections between different types. You can see an example in the repository. Accessors and setters are automatically created which maintain links between elements. I used my tinyset to compactly store one-to-many links between types.
Yours looks quite interesting, but it looks like we're taking fundamentally different approaches to the same end: I'm trying (and mostly succeeding) to shoehorn relational algebra primitives onto Rust's type system. My real goal isn't to write a database, but to provide a common interface for all of the collection types with the same expressivity as SQL.
That would allow library authors to hide the details of their data layout so that it can be changed without breaking downstream users.
That depends on your definition of database, I guess. At least for the reading side, it’s possible to do everything backed by a single, denormalized Vec via sequential scans.
The goal is to provide a smooth enhancement path from that easy solution to more complicated and efficient ones, one that doesn't require reengineering outside of the datastore code: if your main program logic uses semantic instead of structural queries, it shouldn't need to change when one denormalized table is split into two normalized ones (or vice versa).
My idea of a "database" is a big pile of data stored as zeros and ones on some persistent media that outlives any instance of any program that uses it. A huge array some place.
SQL and whatever are just somebody's idea of an abstraction to interface to that external array.
In that case, I don’t think a database is necessary to gain a benefit from the concepts of relational algebra. (I recognize that this is likely a minority opinion.)
It was invented to alleviate the problems caused by database users writing their queries against a particular database structure, preventing the DBAs from restructuring things for better performance. As programmers, we face similar problems when trying to future-proof internal APIs.
Take the original question asked by @rustlanguser as an example: this is all in-memory, live data. A large portion of the program will be designed around the particular design of this data structure, so performance-critical decisions have to be made now, before there’s any data. Wouldn’t it be nice if the datastore was abstracted enough to add an index later, when we know which operations need to be faster?
Aside
@rustlanguser: I seem to have accidentally hijacked your thread; did you get the answers you were looking for?
That sounds problematic to me. If you hide the implementation but don't make it fast, then your users are going to be trapped by poor scaling sooner or later, right? I'm not seeing the benefit of this. If I use relational primitives, I'd like to do so knowing that I'll be getting nice scaling.
If you use a slow implementation, your users are going to be trapped by poor scaling whether or not the internal structure is exposed in the interface. Having replaceable parts means that when the bug report about poor scaling comes in, you have the ability to address the problem.
Minority or not, it's an interesting idea you have there.
If we skip the idea of a database as some big external, persistent store of data, we can certainly keep the ideas of database organization and querying that have been developed for them: relational algebra or whatever.
So now we are talking about using said relational algebra, call it SQL, within our programs for their own fleeting memory content.
I'm not sure what to think of that.
It sounds like a lot of abstraction for a systems programming language like Rust.
Or perhaps you have the seeds of an idea for a high level language based on relational algebra.
Yes, but you're telling me that your goal is not to use or implement a fast implementation, right? And you're saying that the difference between my goal and yours is precisely that I'm planning to have a fast implementation.
My first goal is to provide an interchangeable interface for std::collections, which seem fast enough for most purposes when intelligently organized. If I do my job right, a faster, more complicated engine should be developable as a drop-in replacement that implements the same APIs.
I want something that can be as easily used for a one-off memoization dictionary as a central clearinghouse for all the application’s data.
I’m hoping to have most (or all) of the query planning done at compile time, so the generated code is more or less the same as if you’d written it by hand. With the added benefit that, if an index gets added, it’ll automatically get used everywhere relevant.