Exploring Rust / Developing a Custom Graph

Hello,

I am new to Rust -- exploring whether to use it. My aim is to create a mini in-memory custom graph database with some augmented low level features.

I am reading that graph structures are in particular difficult to program in Rust [1], given its ownership model -- I am curious how steep the leaning curve would be to get this right.

It seems that I can't use existing libraries (e.g. petgraph), given some custom features i need -- e.g. allowance for multiple links between same two nodes.

Given that pointers are a key capability in Rust -- i wonder why a simple linked structure such as a graph can be so difficult to program, and how to easy to learning curve.

much appreciating your response,

Dan

[1] " 1.8.1 Cyclic data structures

It is difficult to model cyclic data like an arbitrary graph structure in Rust. Implementing a doubly-linked list is an undergraduate-level computer science problem. Yet Rust’s safety checks do hamper progress here. If you’re learning the language, avoid implementing these sorts of data structures until you’re more familiar with Rust."

1 Like

Rust’s ownership model mostly requires object instances to form a tree whose root is a local variable in some function. Cross-linking between different branches of the same tree is tricky at best, impossible at worst. To represent general graph structures, there’s two main approaches:

  • Use shared-ownership references (Rc or Arc), which can be arbitrarily interlinked, but incur a runtime cost.
  • Store the graph elements in one or more flat collections; use IDs instead of pointers to refer to other elements.
1 Like

Thank you.

Can you elaborate a bit on the runtime cost -- why its there and, in principle, what it is ?

I guess, this would replace following pointers with calculating indices in collections (arrays) -- and, it would require collections that can grow.

Can this be done efficiently -- or as efficiently than pointers?

thank you,

Dan

In my experience, index based graphs are usually more efficient than pointer based ones. If you need easy support for removal of nodes, you may want to use a slab for the collection.

7 Likes

Every object managed by Rc (or Arc) is allocated on the heap with a reference count. When an Rc is cloned, producing an additional reference to the same underlying object, that reference count needs to be incremented. Similarly, when an Rc is dropped it has to decrement the reference count; if there are no more references, the managed object needs to be destroyed.

If you drop a cycle of Rc-managed objects, they will never be freed because the refcount never reaches zero. To mitigate this, you can store some of the links as Weak references, which must check the refcount before every access to ensure the pointee still exists.

1 Like

thank you Alice,

I just noticed a great overview with several examples of index based representations [1].

I think key requirements would be:

  1. quick access to any node by ID (perhaps via a hash)
  2. identification of outgoing links
  3. identification of incoming links
  4. attributes (key-value pairs) on nodes and links
    edit:
  5. acceptable response time for addition of nodes and links

I would likely have little need for removing links -- addition, lookup and traversal are key operations.

Dan

[1] https://payasr.github.io/Are%20Graphs%20hard%20in%20Rust.pdf

1 Like

I am wondering if an unsafe approach for a graph can co-exist / still have the Rust benefits -- trading off performance at the very core representation with safety -- if the core is kept very small -- just to overcome this limitation (if i were to go the pointer route)

Or, would I then loose most every benefit, including saftey during parallelism.

Given those requirements, I'd probably do something along these lines:

struct NodeDesc<N> {
    payload: N,
    inbound: Vec<usize>, // Index into Graph.edges where dest = Self
    outbound: Vec<usize>, // Index into Graph.edges where src = Self
}

struct EdgeDesc<E> {
    payload: E,
    src: usize,  // index into Graph.nodes
    dest: usize  // index into Graph.nodes
}

struct Graph<N,E> {
    nodes: Vec<NodeDesc<N>>,
    edges: Vec<EdgeDesc<E>>
}

Possibly, but it's easy to produce unsafe code which appears to work most of the time but is, in reality, broken. At minimum, you'd need to somehow ensure that the graph elements can't move or be deallocated while there's a reference to them somewhere. The mechanisms to make that happen look a lot like the interfaces you'd have for an index-based implementation. I suspect the performance improvement would quickly get lost in the noise of whatever other computation you're trying to perform.

Thank you. That is great.

Looking at the implementation and the Vec documentation, it does prompt me to look at non-functional requirements as well.

For example, what the upper bound response time requirements for adding nodes and links are -- a vec preallocates capacity, but when its full, does a more expensive (time consuming) operation to increase capacity (and what if we are now having a 1M nodes and link graph -- this should not be copied into a larger structure -- if that's the increase capacity mechanism) -- if this is a problem, perhaps this can be mitigated by anticipating the need for capacity increase and do it in parallel or during small "downtimes", if existing.

A pointer implementation would have the same "constant" addition time ... and be simpler in this respect for this performance requirement.

Dan

That is indeed the mechanism used to increase capacity of a Vec. In practice, though, large block copies like this tend to be quite efficient: Instead of actually copying the data to another place in physical memory, the OS will usually just update the virtual memory page table to change the virtual address of the physical memory page.

Thank you.

I find our discussion very helpful to hash out the issues i may deal with.

I think I would rather not want to depend on the OS for this - given that the graph database might run on different OSs.

I am wondering if this calls for a BigVec -- which is implemented as interlinked Vec

Perhaps, this approach could be a happy medium between a pointers and a vec approach.

I guess the tradeoff here would be access time -- which now is not a simple indexed lookup anymore, but requires identifying a vec and getting to it ...

It would not be particularly hard to write a collection that uses something like Vec<Vec<T>> for its storage, where the upper bits of the index refer to the outer vector and the lower bits refer to the inner one. This would slow down accesses somewhat (though still constant time), but make appending significantly cheaper.

It's also useful to note that while Vec's resizing strategy may take linear time for any given insert, it only requires constant time when amortized over all inserts.

Yea, i thought of a vector of vectors -- but this does introduce some unpredictability when the first vector's capacity needs to be increased.

I think connecting vec's via pointers, somehow, is more complex -- and, eventually, the complexity does slow down access -- perhaps vecs could be hashed per upper bits.

I think these are then the two choices i might have:

a) some kind of vec of vec approach
b) unsafe pointer based graph carefully developed at the core

thanks!

Dan

Something like a BTreeMap could also be an interesting choice. Its operations are logarithmic-time instead of constant-time, but the time for individual operations will be more consistent. Note that a HashMap will have similar problems as Vec does: once its table gets full, it will need to allocate a larger one and move the elements (I think; haven't looked at the implementation).

1 Like

Indeed:

I am actually coming from working on a Prolog based implementation -- and in Prolog all access of such structures (Facts in prolog lingo) is a Hash with logarithmic access time.

I am hoping to improve on this significantly, and my first thought were pointers but with safety in mind, and because i do want to offer concurrency -- hence my interest in Rust.

Graph and StableGraph both permit duplicate edges. Furthermore, you can always model a multigraph as a graph where the edge weights are (e.g.) Vecs, if you really wanted to use one of the other graph types (but from your other posts, it sounds like an adjacency list would be fine).

1 Like

Thank you for pointing me to these -- based on the discussion here, I would be very interested in learning how they deal with increasing capacity.

I'm curious as to what other language(s) you might be considering to use if not Rust?

If one were to do this in C or C++ the old school way then ones entire program is the equivalent of "unsafe" blocks in Rust. One would have to take care of all those issues of ownership and lifetimes oneself. Things that have traditionally been very error prone, exactly the things Rust takes care of so that you don't have to. The difference being with Rust it's all taken care of at compile time rather than after hours of debugging and head scratching.

In modern C++ one would be expected to do this with a "smart" pointer. shared_ptr or whatever they have now a days. This, as far as I can tell is the equivalent of Rust's Rc with presumably similar runtime overheads of reference counting.

This problems shared data of course get far worse if you ever decide to parallelise your program some how. At which point Rust will shine in the advantages it has.

I like the idea of keeping everything in arrays and using indices instead of pointers. Although it seems to me that bypasses some checking Rust can do with pointers. For example, what happens if one creates an index that indexes an element that no longer contains a node?

It's a long time since I had to deal with anything resembling large and complex graphs, in electronic/printed board CAD, so I only have vague notions to share. Personally I would stay away from "unsafe" unless there were really no other way to get the job done.

1 Like

If you use a large branching factor, like 216, you can store a lot of items before needing to increase the capacity of the outer Vec. There's a very real possibility that you'll run out of RAM before ever needing to reallocate it.