Most efficient way to convert relational data to nested Rust structs

rajesh-rahul · April 13, 2024, 10:21pm

Suppose you have the following SQL query:

SELECT 
	Artist.ArtistId as "artist.artist_id", Artist.Name as "artist.name", 
	Album.AlbumId as "artist.albums.album_id", Album.Title as "artist.albums.title", 
	Track.TrackId as "artist.albums.track.track_id", Track.Name  as "artist.albums.tracks.name"
FROM Artist 
	JOIN Album USING(ArtistId) 
	JOIN Track USING(AlbumId)

You are trying to load the result of that query into a vector of rust struct: Vec<Artist>. Note that result column names contain information on how to load the data. For example, "artist.albums.title" tells us that the data for that particular column should go into Artist -> Optional<Vec<Album>> -> title

struct Artist {
    name: String,
    artist_id: i32,
    albums: Option<Vec<Album>>,
}

struct Album {
    title: String,
    album_id: i32,
    tracks: Vec<Track>,
}

struct Track {
    track_id: i32,
    name: String,
}

I was trying to figure out what is the most performant way of doing this in Rust.

My Current Solution

My current solution involves loading the query result into a tree like structure(Node) and a serde Deserializer will convert that tree like structure into Vec<Artist>. It works but not very fast.

pub enum Node {
    Leaf(Value),
    Internal(BTreeMap<Vec<Value>, HashMap<String, Node>>),
    Root(HashMap<String, Node>), // I am not fully convinced yet if I need this variant yet
}

pub enum Value {
    I64(i64),
    F64(u64),
    Str(String),
    Blob(Vec<u8>),
    Null,
}

Please uncollapse this to see how the relational data will be loaded into Node enum above

Internal(
    {
        [
            I64(
                1,
            ),
            Str(
                "AC/DC",
            ),
        ]: {
            "artist.artist_id": Leaf(
                I64(
                    1,
                ),
            ),
            "artist.name": Leaf(
                Str(
                    "AC/DC",
                ),
            ),
            "artist.albums": Internal(
                {
                    [
                        I64(
                            1,
                        ),
                        Str(
                            "For Those About To Rock We Salute You",
                        ),
                    ]: {
                        "artist.albums.tracks": Internal(
                            {
                                [
                                    I64(
                                        1,
                                    ),
                                    Str(
                                        "For Those About To Rock (We Salute You)",
                                    ),
                                ]: {
                                    "artist.albums.track.track_id": Leaf(
                                        I64(
                                            1,
                                        ),
                                    ),
                                    "artist.albums.tracks.name": Leaf(
                                        Str(
                                            "For Those About To Rock (We Salute You)",
                                        ),
                                    ),
                                },
                            },
                        ),
                        "artist.albums.album_id": Leaf(
                            I64(
                                1,
                            ),
                        ),
                        "artist.albums.title": Leaf(
                            Str(
                                "For Those About To Rock We Salute You",
                            ),
                        ),
                        "artist.albums.track": Internal(
                            {
                                [
                                    I64(
                                        1,
                                    ),
                                    Str(
                                        "For Those About To Rock (We Salute You)",
                                    ),
                                ]: {
                                    "artist.albums.tracks.name": Leaf(
                                        Str(
                                            "For Those About To Rock (We Salute You)",
                                        ),
                                    ),
                                    "artist.albums.track.track_id": Leaf(
                                        I64(
                                            1,
                                        ),
                                    ),
                                },
                            },
                        ),
                    },
                },
            ),
        },
        [
            I64(
                2,
            ),
            Str(
                "Accept",
            ),
        ]: {
            "artist.albums": Internal(
                {
                    [
                        I64(
                            2,
                        ),
                        Str(
                            "Balls to the Wall",
                        ),
                    ]: {
                        "artist.albums.album_id": Leaf(
                            I64(
                                2,
                            ),
                        ),
                        "artist.albums.track": Internal(
                            {
                                [
                                    I64(
                                        2,
                                    ),
                                    Str(
                                        "Balls to the Wall",
                                    ),
                                ]: {
                                    "artist.albums.tracks.name": Leaf(
                                        Str(
                                            "Balls to the Wall",
                                        ),
                                    ),
                                    "artist.albums.track.track_id": Leaf(
                                        I64(
                                            2,
                                        ),
                                    ),
                                },
                            },
                        ),
                        "artist.albums.tracks": Internal(
                            {
                                [
                                    I64(
                                        2,
                                    ),
                                    Str(
                                        "Balls to the Wall",
                                    ),
                                ]: {
                                    "artist.albums.track.track_id": Leaf(
                                        I64(
                                            2,
                                        ),
                                    ),
                                    "artist.albums.tracks.name": Leaf(
                                        Str(
                                            "Balls to the Wall",
                                        ),
                                    ),
                                },
                            },
                        ),
                        "artist.albums.title": Leaf(
                            Str(
                                "Balls to the Wall",
                            ),
                        ),
                    },
                },
            ),
            "artist.name": Leaf(
                Str(
                    "Accept",
                ),
            ),
            "artist.artist_id": Leaf(
                I64(
                    2,
                ),
            ),
        },
    },
)

Is the above the best way of going about this? I think there are some optimizations I can do to make loading the relational data into the Node enum more efficient but I just wanted to make sure if this is even the right approach before I did anything like that.

Sidenote: I am aware Diesel (and maybe even other ORMs that I haven't tried) is capable of loading queries like this into Rust structs very fast but I am interested in the general problem of loading relational query results into arbitrarily nested Rust structures.

kornel · April 14, 2024, 11:08am

The HashMap in the standard library is slow. Use a fast one, e.g. ahash.

Do you need serde to do this? It could be faster if you iterated over the rows yourself, while maintaining lookup tables from id to concrete value.

chrefr · April 14, 2024, 11:22am

Nit: it's not the hashmap implementation is slow (ahash use the same), it's the hashing algorithm.

weiznich · April 15, 2024, 6:37am

Diesel does perform a few optimizations to make this kind of operation really fast. The most important one is likely that it does not build an internal structure like your Node/Value structure at all. It directly deserializes the values from the database representation (mostly bytes for postgres) to the target struct. That's done on a field by field basis and allows us to skip both copying/cloning data and intermediate operations like hashing field names etc. The order of the mapping is determined at compile time via proc-macros based on type information and field order. This again let us skip more work (looking up field names at all).

2e71828 · April 15, 2024, 7:55am

If you’re able to modify the query slightly, consider adding ORDER BY ArtistId, AlbumId so that the results come in the order that you’ll need them. This should enable you to write a streaming solution that doesn’t require intermediate maps.

Edit: For example:

_{This can be cleaned up a lot, but I don’t have the time right now…}

pub fn artist_query(q: QueryResult<impl Iterator<Item=Vec<Value>>>, mut f:impl FnMut(Artist)) {
    use itertools::Itertools;

    let mut col_artist_id: usize=0;
    let mut col_artist_name: usize=0;
    let mut col_album_id: usize=0;
    let mut col_album_title: usize=0;
    let mut col_track_id: usize=0;
    let mut col_track_name: usize=0;

    for (i, label) in q.cols.iter().enumerate() {
        match label.as_str() {
            "artist.artist_id" => { col_artist_id = i; }
            "artist.name" => { col_artist_name = i; }
            "artist.albums.album_id" => { col_album_id = i; }
            "artist.albums.title" => { col_album_title = i; }
            "artist.albums.tracks.track_id" => { col_track_id = i; }
            "artist.albums.tracks.track_name" => { col_track_name = i; }
            _ => ()
        }
    }

    for (artist_id, rows) in &q.rows.group_by(|r| r[col_artist_id].read_i64().unwrap()) {
        let mut rows=rows.into_iter().peekable();
        let first_row = rows.peek_mut().unwrap();
        let name = first_row[col_artist_name].take_str().unwrap();

        let albums = rows.group_by(|r| r[col_album_id].read_i64().unwrap()).into_iter().map(|(album_id, rows)| {
            let mut rows=rows.into_iter().peekable();
            let first_row = rows.peek_mut().unwrap();
            let title = first_row[col_album_title].take_str().unwrap();
            let tracks = rows.map(|mut row| Track {
                name: row[col_track_name].take_str().unwrap(),
                track_id: row[col_track_id].read_i64().unwrap()
            }).collect();
            Album { album_id, title, tracks }
        }).collect();
        f(Artist { artist_id, name, albums })
    }
}

rajesh-rahul · April 20, 2024, 3:51am

This was a great suggestion! Although, I changed the intermediate structure (to keep the original ordering of rows) in a way that the HashMap is no longer necessary, I can see that even in the new intermediate structure, for small keys, ahash is faster than std hash algorithm (and much faster than BTreeMap)

// Previous
pub enum Node {
    Leaf(Value),
    Internal(BTreeMap<Vec<Value>, HashMap<String, Node>>),
    Root(HashMap<String, Node>), 
}

//New
pub struct MapList {
    pub rows: Vec<Vec<Node>>,
    pub rows_lookup: AHashMap<Vec<Value>, usize>
}
pub enum Node {
    Leaf(Value),
    MapList(MapList),
    Map(Vec<Node>), // index is the key(another vector is used to retrieve the column name)
    Unoccupied
}

Above, if rows_lookup use ahash instead of std hash algorithm, there seems to be around ~3% improvement for the following code:
The key for the map in this case was "Track.track_id"

let mut stmt = db.prepare(r#"SELECT Track.Name as "Track.name", Track.TrackId as "Track.track_id" FROM  Track"#).unwrap();
    
let _: Vec<Track> = execute(&mut stmt).unwrap();

Naive benchmark result for the above code:

loading tracks          time:   [1.4710 ms 1.4800 ms 1.4895 ms]
                        change: [-3.9194% -3.2188% -2.5368%] (p = 0.00 < 0.05)
                        Performance has improved.

Do you need serde to do this?

I realize now I needed to mention this in the original post but I was working on a SQL query builder and one of my primary goals was to use serde to deserialize the result. It is still an experiment (no idea how types like timestamps, uuid etc. can be handled).

rajesh-rahul · April 20, 2024, 4:24am

Thank you so much for the example! Unfortunately, I was trying to implement a query builder(with serde to deserialize results) so I wouldn't be able to modify the query. However, would this matter because in this case, shouldn't it be possible to modify the query result in a such a way that the rows are in the order that we need for the streaming solution but still obey the ordering specified by the user in the SQL query?

For example, for the following query:

SELECT 
	Artist.ArtistId as "artist.artist_id", Artist.Name as "artist.name", 
	Album.AlbumId as "artist.albums.album_id", Album.Title as "artist.albums.title", 
	Track.TrackId as "artist.albums.track.track_id", Track.Name  as "artist.albums.tracks.name"
FROM Artist 
	JOIN Album USING(ArtistId) 
	JOIN Track USING(AlbumId)
ORDER BY Track."Name";

One possible query result is the following:

We can modify the query result such that the "groupings" appear consecutively (but we are not sorting the query result):

For both query results the resulting Vec<Artist> should be exactly the same.

I should note that I am not sure if this will work in all cases and if this modification of original query result can be done efficiently.

I did try to do a naive benchmark of the example code with rusqlite and this solution was about 1.7 times as fast compared to a semi-optimized version of my solution in the original post. The difference is likely larger since I added a few clones in your solution to test it out quickly.

It seems that like @weiznich said, the biggest optimization is to get the rid of the intermediate structure altogether. I am not sure if this can be integrated with serde(this is one of the goals that I failed to mention in the original post) but I am working on this atm.

loading artists with intermediate tree structure
                        time:   [5.2117 ms 5.2307 ms 5.2506 ms]
                        change: [+0.3689% +0.8266% +1.3048%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

loading artists with 2e71828's solution
                        time:   [2.9873 ms 2.9936 ms 3.0002 ms]
                        change: [-0.1332% +0.1592% +0.4271%] (p = 0.25 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

simonbuchan · April 20, 2024, 7:01am

Serde would be mostly useful for the Deserialize derive giving you a more succinct way to apply the rows to fields. I dunno how well it maps though since you need to go through column indices; might work if you fill a DeserializeSeed with the indices?

2e71828 · April 20, 2024, 7:46am

This is equivalent to modifying the ordering specified in the original query— You’re just using the original position of each artist and album as its sort key. I expect the database engine to be more efficient at changing the order of query results than client code is, because it can take advantage of indices and a mature query planner.

Another reason to let the database do all of the ordering work is to avoid the necessity of loading the entire result into local memory in some cases: Processing all of the Artists in the database one at a time should only require enough free memory to load the largest Artist (plus some small overhead). If you want to do the reordering on the client, however, you’ll need enough free memory to store all the Artists in the database instead.

If you’re writing a query builder, then you can modify the original query while you’re building it; you just need to ensure that your modifications still deliver whatever the user asked for in the end. As long as each of the user’s order clauses refer to only one table each, you can insert some additional clauses when you build the query to achieve this.

In general, you want to insert some kind of ordering clause on the parent table’s primary key just before dealing with each join to force the grouping to happen. So, the sort order you want to use in this case is:

User-provided artist ordering
artist_id, to force grouping where (1) is only a partial order
User-provided album ordering
album_id, to force grouping where (3) is only a partial order
User-provided track ordering

Topic		Replies	Views
Ease of manipulating sqlite3, json, other formats outside of Rust	4	578	January 12, 2023
Rust structs <-> SQL Tables	4	2584	January 19, 2021
Rust how to convert vector of raw string into the Json or convert complete vector into something which suits the project help	2	662	July 30, 2023
How to serialize this tree like data structure? help	4	788	February 18, 2023
Serde from skyscanner help	6	358	October 12, 2023

Most efficient way to convert relational data to nested Rust structs

My Current Solution

Related Topics