Ppar: Persistent Array using a modified Rope

Did some looking around for a persistent array that can be used in IPLD's data-model. Found the im package and its Vector type, which is based on the RRB-Tree.

But I couldn't find an insert() and delete() that return the modified tree while keeping the older version persistent. Also, I think that due to RRB-Tree limitations the mutable versions of insert() and remove() are costly operations. Please correct me if I am wrong here.

ppar is a persistent array implementation using a modified variant of the rope data structure. Fundamentally, it can be viewed as a binary tree of array blocks, where each leaf node is a block of contiguous items of type T, while intermediate nodes only hold references to their left and right child nodes. To be more precise, intermediate nodes in the tree are organised similarly to a rope, as a tuple of (weight, left, right), where weight is the count of all items present in the leaf nodes under the left branch.
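
Here is a minimal sketch of that layout (my own illustration, not ppar's actual definitions), along with the index lookup that the weight field makes possible:

```rust
use std::sync::Arc;

// Leaf nodes hold contiguous blocks of T; intermediate nodes hold
// (weight, left, right), where `weight` counts the items under `left`.
enum Node<T> {
    Leaf(Vec<T>),
    Inner {
        weight: usize,
        left: Arc<Node<T>>,
        right: Arc<Node<T>>,
    },
}

impl<T> Node<T> {
    // Walk left when the index falls under the left branch, otherwise
    // walk right with the index shifted down by `weight`.
    fn get(&self, index: usize) -> Option<&T> {
        match self {
            Node::Leaf(block) => block.get(index),
            Node::Inner { weight, left, right } => {
                if index < *weight {
                    left.get(index)
                } else {
                    right.get(index - *weight)
                }
            }
        }
    }
}
```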

Both unit testing and fuzzy testing (using im::Vector as the reference) are passing for this algorithm. Here are some performance characteristics of ppar::Vector compared with im::Vector:

ppar::Vector performance characterization
-----------------------------------------
append-load(1000000 items)     : 9.094956ms
random-load(1000000 items)     : 6.128µs
get(1000000 ops)               : 127ns
insert(1000000 ops)            : 8.936µs
set(1000000 ops)               : 6.093µs
delete-insert(1000000 ops)     : 11.742µs
overhead                       : 33.87%
overhead after 90% delete      : 33.91%
  • It takes around 9ms to convert a Vec of 1 million items into a ppar::Vector.
  • A random insert operation takes around 9µs (see the path-copying sketch after this list).
  • A random get operation takes around 130ns.
  • A random set operation takes around 6µs. Note that these set operations update an existing offset.
  • A random delete operation takes around 4µs. Note that in the perf data above, an insert is performed at the same offset after each delete.
  • For the 8-byte u64 type, the overhead is around 33%.
  • I believe the worst-case overhead can go up to 100%, that is 2x the actual data, compared to std::vec::Vec<T>.
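
The reason a random insert stays in microseconds while remaining persistent is path copying: only the nodes along the root-to-leaf path are rebuilt, and everything else is shared between the old and new versions. A sketch of the technique, reusing the Node type from the earlier snippet (again an illustration, not ppar's actual code, which also has to manage leaf-block sizes):

```rust
// Persistent insert via path copying. Nodes off the insertion path are
// shared with the previous version through Arc::clone, so the old tree
// remains valid and unchanged.
fn insert<T: Clone>(node: &Arc<Node<T>>, off: usize, value: T) -> Arc<Node<T>> {
    match node.as_ref() {
        Node::Leaf(block) => {
            // Copy just this one leaf block and splice the value in.
            let mut block = block.clone();
            block.insert(off, value);
            Arc::new(Node::Leaf(block))
        }
        Node::Inner { weight, left, right } => {
            if off <= *weight {
                Arc::new(Node::Inner {
                    weight: weight + 1,       // left subtree grows by one
                    left: insert(left, off, value),
                    right: Arc::clone(right), // shared, not copied
                })
            } else {
                Arc::new(Node::Inner {
                    weight: *weight,
                    left: Arc::clone(left),   // shared, not copied
                    right: insert(right, off - *weight, value),
                })
            }
        }
    }
}
```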

On the same machine, the same set of operations on an im::Vector<T> gives the following numbers:

append-load(1000000 items)     : 37.651219ms
get(1000000 ops)               : 123ns
update(1000000 ops)            : 10.344µs
insert(1000000 ops)            : 735.307µs
delete-insert(1000000 ops)     : 1.426266ms

While append-load is a bulk operation, the other rows are per-operation metrics.

Ref:

crate: https://crates.io/crates/ppar
repo: https://github.com/iprs-dev/ppar
fuzzy-testing: https://github.com/iprs-dev/ppar/blob/master/src/bin/check.rs
perf-testing: https://github.com/iprs-dev/ppar/blob/master/src/bin/perf.rs

As far as I understand it, im's design philosophy is to provide traditional mutable containers that have a cheap clone implementation based on shared internal state.

If you want to keep the old version around, you should clone it before you make your modifications just like any of the std collections; the two copies will share essentially all of their internal state. When you modify either of the copies, the relevant internal nodes are duplicated so that your change doesn't affect the other one.
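
For example, something along these lines (a minimal sketch against the im crate; I'm assuming the im 15.x API, where Vector::set takes an index and a value):

```rust
use im::Vector;

fn main() {
    let mut v1: Vector<u64> = (0..5).collect();
    let v2 = v1.clone(); // cheap: both copies share internal nodes

    // Mutating v1 duplicates only the internal nodes on the touched
    // path, leaving v2's view of the data intact.
    v1.set(2, 99);

    assert_eq!(v1[2], 99);
    assert_eq!(v2[2], 2); // the clone still sees the old value
}
```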

[Benchmark charts comparing ppar (rc and arc variants), im, and std vec across the operations below, plotted on a log scale; lower numbers are better.]

We load the array with an initial data set of 100_000 u64 numbers, then apply each operation in a tight loop, measuring the elapsed time for the entire loop and finally computing the per-op latency. insert, insert_mut, remove, remove_mut, update, update_mut, and get use a random offset into the array. For iteration, we compute the elapsed time per item.
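
A minimal sketch of that measurement loop, shown against std::vec::Vec for illustration (the actual harness is in the perf.rs linked above; the rand calls assume the rand 0.8 API):

```rust
use rand::Rng;
use std::time::Instant;

fn main() {
    let mut rng = rand::thread_rng();

    // Initial data set, as in the benchmark description above.
    let mut arr: Vec<u64> = (0..100_000).collect();

    // Tight loop of random-offset inserts; per-op latency is the
    // elapsed time for the whole loop divided by the op count.
    let n: u32 = 10_000;
    let start = Instant::now();
    for _ in 0..n {
        let off = rng.gen_range(0..arr.len());
        arr.insert(off, rng.gen());
    }
    println!("insert: {:?} per op", start.elapsed() / n);
}
```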

  • The logarithmic scale means each step implies an order-of-magnitude difference in performance.
  • Between the arc and rc variants of ppar, the performance difference is small, though it might matter for cache-optimized algorithms.
  • ppar gains an order-of-magnitude improvement over vec and im for the insert, remove, load, split, append, and clone operations.
  • It falls behind on the update, get, and iter operations, but wherever it falls behind it is still at the lower end of the log scale; the difference is well within 100ns.

It is also worth noting that the maximum size of the operated data set is less than 2MB, which means the entire array footprint can be held inside the CPU cache. As far as my understanding goes, the performance differences will only become more pronounced as the array grows: fast operations may look faster and slow operations slower.