Yes, what I'm doing is evil and dangerous. I'm also open to alternative ways to accomplish my goal that manage to avoid doing borderline abusive things to hashmaps.
Here's my current progress: https://github.com/CAD97/sorbus/pull/53
I have an immutable `Arc`-managed tree. Its shape is roughly the following (but mangled by confounding implementation details):
```rust
pub struct Node {
    data: NodeData,
    children: Box<[Element]>,
}

pub struct Token {
    data: TokenData,
}

pub enum Element {
    Node(Arc<Node>),
    Token(Arc<Token>),
}
```
Construction of a tree is always proxied through a builder cache, such that equivalent nodes can be deduplicated and shared. The cached nodes (and leaf tokens) are stored in (the equivalent of) a `hashbrown::HashSet` in the builder cache.
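For concreteness, the deduplication works like interning: look up an equivalent entry, and if one exists, hand back a clone of the existing `Arc`. A minimal sketch, using `Arc<str>` as a stand-in for the real node/token types (the `Cache` name and shape here are hypothetical, not sorbus's actual API):

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Hypothetical simplified builder cache: interns values as shared Arcs.
struct Cache {
    set: HashSet<Arc<str>>,
}

impl Cache {
    /// Return a shared handle to `s`, reusing an existing entry if present.
    fn intern(&mut self, s: &str) -> Arc<str> {
        // HashSet::get takes any Q where Arc<str>: Borrow<Q>, so we can
        // look up by &str without allocating.
        if let Some(existing) = self.set.get(s) {
            return Arc::clone(existing);
        }
        let arc: Arc<str> = Arc::from(s);
        self.set.insert(Arc::clone(&arc));
        arc
    }
}
```

Two equal inputs yield pointer-identical `Arc`s, which is what makes the `strong_count`-based collection below meaningful.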
My current goal is to allow garbage collection of any tree elements that are only being kept alive by the cache. The simple version is `map.retain(|node| Arc::strong_count(node) > 1)`. This collects some elements, but not all of the elements that could be collected.
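Spelled out on a plain std `HashSet` (with `Arc<i32>` standing in for the real element type), one such pass looks like:

```rust
use std::collections::HashSet;
use std::sync::Arc;

// One GC pass over the cache: keep an entry only if someone besides the
// cache itself holds a reference to it (strong count > 1 means the cache's
// own Arc is not the last one).
fn collect_unused(cache: &mut HashSet<Arc<i32>>) {
    cache.retain(|item| Arc::strong_count(item) > 1);
}
```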
The reason is that the nodes still own their children. Because hashmap iteration order is (effectively) random, we can visit a child node before the parent that owns it; at that point the child's strong count is still elevated by the parent's reference, so the child survives the pass, even though the parent is about to be collected and was the child's last owner outside the cache.
There are a few possible ways of handling this that I see. The first, which I'm currently using, is to simply repeat the pass until no more nodes are deleted. This works, and is simple. However, it is (probabilistically) O(n log n): each pass checks all n cached nodes, and will probably collect about one and a half "layers" of the tree. The tree is probably around log n layers deep, giving log n passes over n elements, for a total complexity of O(n log n).
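A sketch of that repeat-until-fixpoint loop, with a toy recursive type standing in for the real node/token types (`Elem` and its shape are illustrative only, not the actual sorbus types):

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Toy stand-in for the tree: a node optionally owns one child.
#[derive(Hash, PartialEq, Eq)]
enum Elem {
    Leaf(u32),
    Node(Arc<Elem>),
}

// Run retain passes until a pass frees nothing. Each pass can free a node
// whose parent was freed in an earlier pass (dropping the parent's Arc to
// it), so the loop converges after roughly one pass per tree layer.
fn collect_to_fixpoint(cache: &mut HashSet<Arc<Elem>>) {
    loop {
        let before = cache.len();
        cache.retain(|e| Arc::strong_count(e) > 1);
        if cache.len() == before {
            break;
        }
    }
}
```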
However, it's theoretically possible to do this in a single linear O(n) pass, using a mechanism similar to a tracing garbage collector. When we find a root node that we can collect, also (transitively) try to collect its children. However, this runs into the titular problem: mutating the hashset while iterating over it.
I honestly can't tell what kind of concurrent modification hashbrown's `RawTable` iterator allows while the iterator is held. It certainly allows deleting the element the iterator is currently on, as this is how `retain` is implemented. Deleting elements yet to be iterated, however, is definitely at least a logic error: `RawIter` keeps track of how many items it has yet to iterate, and checks that it iterated exactly that many when exhausted.
... and there, while writing this, I got a Miri run with a seed that showed use-after-free UB. I had suspected it might be outright UB to delete elements "in front of" the iterator head, because of the group-wise reads of control bytes that hashbrown does, and that appears to be the case.
Based on my reading of the implementation, it should be sound to erase buckets "behind" the iterator head, and doing only that would still allow O(n) cache garbage collection. But I don't see hashbrown currently providing any guarantees about what kinds of modification of the `RawTable` are allowed while a `RawIter` is held.
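For what it's worth, one alternative I can imagine that sidesteps the iterator-invalidation question entirely: do a single read-only pass that clones handles to the cache-only roots into a worklist, then remove transitively via point lookups, never holding an iterator across a mutation. It costs a worklist allocation, but stays O(n). A sketch under those assumptions (same toy `Elem` type as above; this is not the PR's actual code):

```rust
use std::collections::HashSet;
use std::sync::Arc;

// Toy stand-in for the tree: a node optionally owns one child.
#[derive(Hash, PartialEq, Eq)]
enum Elem {
    Leaf(u32),
    Node(Arc<Elem>),
}

fn collect_tracing(cache: &mut HashSet<Arc<Elem>>) {
    // Phase 1: one read-only pass. strong_count == 1 means the cache's
    // Arc is the only owner; cloning into the worklist bumps it to 2.
    let mut worklist: Vec<Arc<Elem>> = cache
        .iter()
        .filter(|e| Arc::strong_count(e) == 1)
        .cloned()
        .collect();
    // Phase 2: drain the worklist. No iterator is held while mutating.
    while let Some(root) = worklist.pop() {
        // Grab a handle to the child before the node is dropped.
        let child = match &*root {
            Elem::Node(c) => Some(Arc::clone(c)),
            Elem::Leaf(_) => None,
        };
        cache.remove(&root); // drop the cache's Arc...
        drop(root); // ...and ours, freeing the node and its child edge.
        if let Some(c) = child {
            // count == 2 here means cache + our temporary clone, i.e. the
            // cache is now the child's only real owner: collect it too.
            if Arc::strong_count(&c) == 2 && cache.contains(&c) {
                worklist.push(c);
            }
        }
    }
}
```

Each entry is visited once in phase 1 and removed at most once in phase 2, so this is the single linear pass described above, just split into a scan and a drain so the table is never mutated under a live iterator.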