Suggestions for graph API

I'm developing (in my spare spare time) a little crate called polygraph that creates data types for multiple interconnected graph structures. This was somewhat motivated by my development of a build system fac, which needs to track relationships between files (both input, output, and both) and build rules. It's got two kinds of node and multiple kinds of relationships, so you can't easily use something like petgraph.

I'd appreciate suggestions on the API for querying my graph data structure (see below after I've explained an example graph).

Anyhow, the basic idea of polygraph is to use a macro to declare both a set of interrelated data types, and a graph type (or in-memory database type) that will hold all the data, including relationships between the objects. This system will store both forward and reverse linkages, so iterating over links in any direction will cost O(1) per element returned. Accessing any data requires a reference to the graph, and a Key<T> which is a Copy type that internally is just an index into an array, but its type indicates which array it is an index into. Conceptually I think of a Key as like a kind of pointer, but it requires a reference to the graph in order to be dereferenced. If this were C or C++, I probably would have used a pointer for this (and had all the accompanying unsafety). Requiring that reference, however, makes exclusive access via &mut very clear, so I think it's a good thing, it's just a little hard to figure out how best to arrange that access.

Here is an example (see also the documentation) which generates a family tree data structure with pets (well, only dogs).

polygraph_macro::schema! {
    /// The database (or graph) itself will be a type called FamilyTree.
    type FamilyTree;
    /// By defining a `Surname` type, we can query which `Person` have a
    /// given surname.
    pub struct Surname(pub String);
    pub struct Person {
        pub last_name: Key<Surname>,
        /// `father` is a one-to-many relationship
        pub father: Option<Key<Person>>,
        /// `mother` is a one-to-many relationship
        pub mother: Option<Key<Person>>,
        /// We use a raw String for given name, since we don't care to
        /// iterate over all "Davids".  Otherwise we'd do the same as with [`Surname`].
        pub name: String,
        /// `dog` is a many-to-many relationship
        pub dog: KeySet<Dog>,
    }
    pub struct Dog {
        pub name: String,
    }
}

I can construct a simple graph with the following:

let mut db = Tree::new();

let mickey = db.insert_dog(Dog {
    name: "Mickey".to_string(),
});
let minnie = db.insert_dog(Dog {
    name: "Minnie".to_string(),
});
let my_dogs: tinyset::Set64<_> = [mickey, minnie].iter().cloned().collect();

let roundy = db.insert_surname(Surname("Roundy".to_string()));
let maiden_name = db.insert_surname(Surname("Maiden".to_string()));
let me = db.insert_person(Person {
    last_name: roundy,
    father: None,
    mother: None,
    name: "David".to_string(),
    dog: my_dogs.clone(),
});
let wife = db.insert_person(Person {
    last_name: maiden_name,
    father: None,
    mother: None,
    name: "Monica".to_string(),
    dog: my_dogs.clone(),
});
let kid = db.insert_person(Person {
    last_name: roundy,
    father: Some(me),
    mother: Some(wife),
    name: "Kid".to_string(),
    dog: my_dogs.clone(),
});

What I'd like help with is in writing a query to this graph. I've implemented a couple of APIs, but I'm not really happy with any of them.

The first option would be to simply index the db using a key in order to query that element. When querying one deep this looks nice:

assert!(db[me].father_of.contains(kid));

However, when you start wanting to query something that requires two accesses to the database (like the text of my last name) it gets harder to read because the db access is nested rather than sequential:

assert_eq!(db[db[me].last_name].0, "Roundy");

Or as a third possible query, you could check if two people have at least one dog in common:

assert_eq!(db[me].dog.iter().any(|dog| db[dog].dog_of.contains(kid));

or I suppose we could have looked at the intersection of my dogs with those of my kid. This doesn't seem too bad, but I really dislike the nested square brackets on the last_name assertion above.

A second option would be to create a method on Key that takes a &FamilyTree as an argument, so that chaining works. For now I'll call this method d(&db), but giving it a decent name (short, but clear) is one thing I'd like advice on, if this seems the best way. The three queries above become:

assert!(me.d(&db).father_of.contains(kid));
assert_eq!(me.d(&db).last_name.d(&db).0, "Roundy");
assert_eq!(me.d(&db).dog.iter().any(|dog| dog.(&db).dog_of.contains(kid));

I like that you can read this from left to right in a more natural way (for those of us accustomed to methods, anyhow), but I don't like my name d for the method, and am having trouble inventing a short name that is decent.

Other options that I have considered and rejected (but could reconsider) are:

  1. To have a global variable for the graph data structure, and implement Deref on Key. This would be extremely ergonomic, but would preclude having two different graphs, which could be a problem. It would also make any sort of mutation challenging, since we wouldn't every be able to ensure there is exclusive access. If we used some sort of a lock-free data structure we might be able to mitigate this, but on the whole it doesn't seem like a great idea (unless one of you has a brilliant suggestion to revive it).

  2. Embed a reference to the FamilyTree in each Key and then implement Deref. This gives the same ergonomics benefit, but now we can't hang onto a Key without preventing any mutation of the graph, which seems like a pain. Also, keys become larger, and are presumably no longer Copy?

  3. Perhaps a hybrid system where you have both Key<T> and KeyRef<'a,T> with the latter like #2 above, holding a reference to the graph and implementing Deref, and a method to convert a Key into a KeyRef (which could either be indexing or a method that would look like my .d(&db) above).

As I said, I'm not satisfied by any of my ideas, and am hoping that y'all can come up with either something better, or a fix for one of the issues I've identified. Or at least you can tell me that you like one of these ideas, and I can move forward with fewer doubts that I'm making something that will be incomprehensible to use.

I've been taking notes on some ideas for a graph query API, and one of my ideas was to build a 'prototype' graph in terms of some relations consisting of empty nodes.

As an addendum to that, I classified the different kinds of query structures I'd like to account for:

Boolean predicates conditions (And, Or, Not)
Node relation conditions (IsRoot, IsLeaf, IsAncestorOf, IsDescendentOf)
Relative depth conditions (WithinDepthNOf)
Counting conditions (ExactlyN, AtMostN, All, None)
State conditions (functions on the node that return bool, indicating that the node satisfies some constraint)

So you would construct a protpotype graph out of these primitives, hand it to a graph processor, which would hand back a corresponding object populated with references to any nodes satisfying the expressed conditions. It gets a bit hairy, complexity-wise, because you could easily use these to create a query that has to process the whole graph many times, but it might be more reasonable with a subset of features.

Unfortunately, I haven't had any time to implement it so I won't vouch for it being a good idea.

I looked at the github repo- my head is having a bit of a hard time following code that is more procedural macros than anything else? (unless I am missing something).

That being said, I did a similar project in the past with Python, with a minimal re-write in Rust below. The fundamental tension I found is between what is owned by the graph, and what is owned by an object in the graph. Does a person own their pet? Does the pet own its person? Or does the graph own that relationship? My two cents is you should let the graph win.

In the example below, notice the Person has no dog property. People exist and Dogs exist, but an edge is required to show ownership. Getting all the dogs I own is then as simple as graph.origins.get(“OWNS_PET”).get((“person”, “me”)).

Minimal Example:

use std::collections::HashMap;
use std::vec::Vec;


#[derive(Clone)]
pub enum Vertex {
    Dog(String),
    Surname(String),
    Person {
        surname: String,
        name: String,
        // notice there is no father, mother, or dog- these are nodes connected by edges, not properties of the node
    }
}

pub struct Graph {
    vertexes: HashMap<&'static str, HashMap<String,Vertex>>, // first key groups by variant, second key is for id
    origins: HashMap<&'static str, HashMap<(&'static str, String), (&'static str, String)>>, // first key is for edge "type", second is for (variant, id)
    destinations: HashMap<&'static str, HashMap<(&'static str, String), (&'static str, String)>>,
}

impl Graph {

    pub fn new() -> Self {
        Graph{
            vertexes: HashMap::new(),
            origins: HashMap::new(),
            destinations: HashMap::new(),
        }
    }

    pub fn add_vertex(&mut self, v: &Vertex) {
        let variant_vertexes = self.vertexes.entry(v.variant()).or_insert(HashMap::new());
        variant_vertexes.insert(v.id(), v.clone());
    }

    pub fn add_edge(&mut self,  origin: &Vertex, edge_type: &'static str, destination: &Vertex) {
        self.add_vertex(origin);
        self.add_vertex(destination);
        let edge_type_outbound = self.origins.entry(edge_type).or_insert(HashMap::new());
        let edge_type_inbound = self.destinations.entry(edge_type).or_insert(HashMap::new());
        edge_type_outbound.insert((origin.variant(), origin.id()), (destination.variant(), destination.id()));
        edge_type_inbound.insert((destination.variant(), destination.id()), (origin.variant(), origin.id()));
    }

    pub fn add_person(&mut self, name: &str, surname: &str, father: Option<(String, String)>, dogs:Vec<String>) {
        let person = Vertex::Person{name: name.to_string(), surname: surname.to_string()};
        self.add_vertex(&person);
        match father {
            Some((f_name, f_surname)) => {
                let father_vertex = Vertex::Person{name: f_name, surname: f_surname.to_string()};
                self.add_edge(&father_vertex, "FATHER_OF", &person);
            },
            None => {},
        }
        for dog in dogs.iter() {
            let dog_vertex = Vertex::Dog(dog.clone());
            self.add_edge(&person, "OWNS_PET", &dog_vertex);
        }

    }
}
pub trait Graphable {

    fn variant(&self) -> &'static str; // variant
    fn id(&self) -> String; // id
}

impl Graphable for Vertex {
    fn variant(&self) -> &'static str {
        match self {
            Vertex::Dog(_) => "dog",
            Vertex::Surname(_) => "surname",
            Vertex::Person{..} => "person",
        }
    }

    fn id(&self) -> String {
        match self {
            Vertex::Dog(name) => name.clone(),
            Vertex::Surname(surname) => surname.clone(),
            Vertex::Person{name, surname} => format!("{},{}", surname, name),
        }
    }
}

Yes, all the code is generated by procedural macros, which gives you types that express all the relationships that you request. The generated documentation for the Tree example gives an easier to read picture than the proc-macro code.

Yes, in my crate everything is owned by the graph, but there are specific types that are used to populate the graph. The data type does store both forward and backward references to edges in tinyset::Set64, so accessing edges is always O(1) and is cache friendly.

I guess you assume that names are unique, and that's why there's no Key? Presumably you could change this by changing vertexes to

vertexes: HashMap<&'static str, Vec<Vertex>>,

so your id is just an integer.

It also looks like in your picture I can't own more than one pet? Because your origins is a HashMap? Presumably we need to stick a Vec in there, if we want a given person to own multiple pets or have multiple children? So it would look like:

origins: HashMap<&'static str, HashMap<(&'static str, String), Vec<(&'static str, String)>>>

With this change, it sounds like finding out if kid has a dog in common with me would require something like

let me = (“person”, “me”); // or a typed Key<Person>?
let kid = ("person", "kid"); // or a Key
graph.origins.get(“OWNS_PET”).get(me).unwrap_or(Vec::new())
  .iter().any(|pet| 
                   graph.destinations.get("OWNS_PET").get(pet).unwrap_or(Vec::new())
                        .contains(kid))

Presumably for actual use you'd define an actual API rather than simply exposing the data structures, which definitely looks worse to use than any API I had imagined. I'd be curious what this might look like?

My biggest concern with this untyped interface is that a simple bug like swapping origins for destinaions would give you no error, and would just result in getting either no results (because no dog owns a person as a pet), or incorrect logic (because you're looking for fathers rather than children. I'm also noticing that your system doesn't have a way to encode a one-to-many relationship versus a many-to-many relationship. Presumably you could have a separate set of origins_many_to_one that does not have my modification to store a Vec of destinations?

As far as the data structure itself goes, I'm not a huge fan because it spreads the information out into three different HashMaps, which will reduce your memory locality. With my implementation (which stores the origin/destination sets alongside the other vertex data itself in a Vec), you've got considerably greater memory locality. I do pay 8 bytes per vertex per kind of possible connection (since I always store a set of vertices that I connect to), but if there are many edges the tinyset::Set64 will pay off hugely, since those 8 bytes can hold multiple keys, or if there are more edges the maximum storage is about one bit per vertex present in the tree.

That looks sort of polar-opposite to my thinking. My hope is to create an API in which all operations that are easy to express are fast, and any operations with higher complexity have that complexity apparent in their code.

I love that with ordinary std::collections you generally don't have to worry about complexity (much) because anything that is implemented is fast, and where things are less fast it's apparent in the code. So e.g. BTreeMap exposes range but HashMap does not and with a HashMap you'd need to code

   map.iter().filter(|(_,v)| v > min && v < max)

which makes it abundantly clear that this goes through all possible elements. The result is that although your code is dependent on the representation of your data structures, it is hard to accidentally implement an algorithm in a terribly sub-optimal way.

(Which is to say, the hairy complexity is something I am aiming to avoid.)

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.