Graphs in Memory

I am building a API with a graph data structure inside. You can think
of the data structure like a piece of XML (actually, it serialises as
a piece of XML!). Mostly it's a tree, but with cross links which turn
it into a generalized graph. In the XML representation, URIs are used
to make the cross-links.

Now, I've been around the houses with representing this in rust. The
obvious way to represent this is as a graph data structure, but this
gets into lots of Rcs and Boxes with Rust. This made the types
long and resulted in much fighting with the borrow checker. So, I
tried the alternative, when I add an URI to a graph, I swap the URI
for a usize, which I can now copy around freely into multiple
places.

However, this causes it's own problem. URI is now a new type for a
usize. So I have add to add quite substantial code for checking when I
add a subgraph that any URIs are already in the graph -- because they
could have come from another graph. In which case, I will get a
dangling usize, unless I check.

More, comparing two graphs now becomes painful -- since the usize that
represents each URI differs between graphs, I have to switch them all
back to String to do the comparision.

So I have started to revisit my original design. What not make URI a
Rc<String>? I can make a URICache which takes a String and returns
a Rc<String>. So I have tried this. I could even share the
URICache between graphs. Except that I can't because that needs a
mutable reference, and rust's rules do not allow this.

To my question: should I be using usize to represent URIs, accepting
the pain that goes with it? Or is the Rc<String> approach more
sane. And if I do this, can I share a single cache between multiple
graphs. Or have I missed something else entirely?

As you might guess, this is my first Rust project. I'm enjoying many
parts of it, the docs, cargo and so forth. But the borrowing and
sharing rules are driving me a bit nuts.

Have you seen the petgraph crate? It uses tricks like you're describing to reference nodes by index. If you don't want to use that as a dependency, then perhaps you can at least learn from how they do it.

2 Likes

Yes, I thought about petgraph. But this is a crate for representing graphs, in the nodes and edges sense. My data only happens to be incidentally a graph. I'd still need to convert edges and graphs into URIs at some point. In the end, I decided petgraph was too far away from my needs that I'd spend

Petgraph, afaict, takes the nodes as usize approach.

This sounds like you want to do interning. There are some crates that you may want to look at, such as @droundy's https://crates.io/crates/internment

That aside, to share a URICache across graphs and allow mutation, you'd need to allow interior mutability. So in a nutshell, your structures would be something like:

#[derive(Clone)]
struct URICache {
   inner: Rc<RefCell<InnerCache>>,
}

impl URICache {
     fn get_or_create(&self, uri: SomeType) -> URI {
         let inner = self.inner.borrow_mut();
         // can now use &mut self methods on `InnerCache`
     }
}

struct InnerCache {
   // actual URI storage here
}

Yes, you are right, that has really cleaned up my code.

I will investigate internment and the like. Yes, this is what I am doing. A reference counted, shared String cache would probably do the job. I will check that also.

There has been a cool blog post by @matklad about handling graph indexes. Maybe It'll give you some inspiration :). https://matklad.github.io/2018/06/03/newtype-index-pattern.html

1 Like

It's an interesting post, but it misses three things for me. AFAICT, with this pattern you can add a index created in one arena to another, which will surely break. Second, if the graphs represent real world entities, then having indexes will be harder to debug, since this is how graphs will appear as a set of indexes. And, finally, graph comparison is difficult, because the indexes will be different in different arenas, even if they reference the same real world entity.

1 Like

I agree. Unfortunately, there’s no particularly great way to model this - there are tradeoffs to be made (but hey, that’s the case with lots of things right?). I’ve played around with using indices, and although they make the immediate borrowck problems go away, they bring about their own problems or unergonomics. Since they’re essentially references, when you need to get some data out of the entity they point to, you end up needing to thread this index through all the abstraction layers to a point where you can access the actual storage. This makes, eg, simple logging or error display challenging because logging just the numeric index means almost nothing to whoever is looking at the log or error message. Instead, you need to return this index to a place where you can use it to do the actual indexing to get more useful data out of the entity.

Then you need to carefully arrange entity removal or any other mutation of the storage because you need to make sure there are no outstanding index handles. This can get non trivial as the application size and complexity grows, particularly if this aspect isn’t carefully considered upfront. References enforce this statically, but are also problematic because of this reason.

Let me know if you'd like an RcIntern that is reference counted but localized to a thread. It would not be at all hard to make, I just haven't had need, so I stopped at ArcIntern. The RcIntern could avoid the extra cost of a mutex on allocation.

Yes, all agreed. I have moved backward and forward with this. I gave up with references because the borrow checker gave me a headache. Now I am back with them, but I have re-thought the overall data structure since and now I need to use Rc in only one place.

We shall see whether this turns out to be a good decision. As always with a new language, guess work and luck plays a big part in early attempts. Rust has been enjoyable but also more frustrating than many languages I have tried. But, my early bench marks suggest it delivers on its promise of being fast which is the whole point of this library.

Will do. At the moment, my application is single threaded, but I am not using internment, but it does look like a good option.

I have recently started using this pattern, and based on that I can say the following about the points you make:

  1. This can be avoided through newtyping (specifically one newtype per graph struct type), and perhaps even making the graph struct type generic on the index newtype. I don't do the latter part myself, because there's only 1 such struct I need. Otherwise I'd probably do it though.
  2. With #[derive(Debug)] I can easily see what you mean, having run up against it myself.
    But what would the alternative be? a println of a potantially giant struct instance? That can be done with e.g. a DFS traversal, which is O(n) in the number of nodes in the graph and thus is cheap enough in the context of printing the entire struct. It'll take a couple extra lines of code, but it's not hard to do.
    As for using debuggers with such a graph, most debuggers allow you to define custom expressions to be (re)evaluated as the IP passes over particular code regions. One of those could be a lookup of the actual node using the index.
  3. Indeed, this is a trickier issue. I've never had to compare graphs using this data structure, but if I did I would think about using some kind of graph traversal (with node marking to avoid cycle traversal, as those can easily lead to infinite recursion) algorithm on 2 graphs in parallel; as long as there is a consistent root / source node of each graph to start comparing with, that could work I think.

@jjpe Point 1) I am not sure about this. Newtyping will stop me adding NodeA where a NodeB is expected, for sure. But, I cannot see how it will prevent me from adding NodeA from one graph to another. To achieve this, you have to check, when adding a Node.

Point 2) Yes, I agree you can overcome it (although, because of point 1, you have to know for sure which graph a node is from). But you have to overcome it in any tools that you use. I've been here in other languages, and not having an easy way to produce an easy-to-read string representation of an object is painful.

Point 3) Exact graph comparison is actually very easy if each node has some real world relationship. For instance, you just break them down in Node-Edge-Node triples, then sort, then compare these.

  1. I was assuming that each graph effectively has its own type by means of a type parameter. If you have multiple graphs of identical type this won't work, no. But if you parameterize the graph type on the index (which could just be newtypes) using a PhantomData field, then that just might. For example:
use std::marker::PhantomData;

trait Idx { 
  // probably empty
}

struct Graph<I: Idx> { 
  /// graph data here
  phantom: PhantomData<I>, 
}

impl<I: Idx> Graph<I> {
    pub fn new() -> Self {
        Graph {
            // ..
            phantom: PhantomData,
        }
    }
    
    fn add_node(&mut self, idx: I, node: (/* wouldn't be unit IRL */)) {
        
    }
}

struct GraphIdx0(usize);
impl Idx for GraphIdx0 {}

struct GraphIdx1(usize);
impl Idx for GraphIdx1 {}

fn main() {
  let mut graph0 = Graph::<GraphIdx0>::new();
  let mut graph1 = Graph::<GraphIdx1>::new();
  
  graph0.add_node(GraphIdx0(0), ()); // This is fine
  graph1.add_node(GraphIdx1(0), ()); // also fine
  graph1.add_node(GraphIdx0(0), ()); // *BOOM* goes the compiler
}

If necessary you can pull a similar newtype trick with the type of the graph nodes.

  1. In general I do not think easy printing for arbitrary graphs is possible because graphs are hard to get right, and even harder to make ergonomic. For example, how should cycles be printed? as a stub, similar to how Rcs and Arcs Weak refs are printed? What traversal should be used, since for graphs in general DFS might be far from optimal? What happens when the graph is sparsely connected? Or the converse, e.g. when it is fully connected? And what happens when the graph turns out to be a forest (i.e. multiple disconnected node clusters)? etc

  2. While this would work, it would also make for a rather expensive comparison: rather than being O(n) in the total number of bytes for the struct (stack + heap), the sorting turns it into O(E log E) in the number of edges E. For smallish graphs this won't matter, but for graphs with > 1M edges, when the log factor becomes significant, I'd expect some serious comparative slowdown.

1 Like