A good starting point is to look at the APIs that lxml and Nokogiri provide, though a better approach would be to find potential users and ask what they need. Beyond the parsing, serialization, and querying that were already discussed, this probably includes:
- Traversing the tree: going from one node to its parent, first/last child, previous/next sibling, … and iterators for ancestors, children, descendants, previous/next siblings.
- Modifying the tree: adding and removing nodes at any point
- Accessing and modifying attributes on elements and other data on specific types of node
Also, I’d probably go with a tree of nodes of various kinds (element, text, comment, …) as in the DOM (the one in web browsers), and not a tree of elements with text attached to .tail attributes of elements, as in Python’s ElementTree or lxml. This is for a couple of reasons:
- Dealing with .tail can be a pain
- You have more than elements and text in the tree anyway (starting with comments)
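As a rough sketch of what "nodes of various kinds" could look like in Rust (every name here is made up for illustration and is not an existing html5ever API), text and comments become ordinary children instead of being attached to an element:

```rust
// Hypothetical node kinds; not taken from any existing library.
#[derive(Debug, Clone, PartialEq)]
pub enum NodeData {
    Element { name: String, attrs: Vec<(String, String)> },
    Text(String),
    Comment(String),
}

#[derive(Debug, Clone, PartialEq)]
pub struct Node {
    pub data: NodeData,
    pub children: Vec<Node>,
}

impl Node {
    pub fn element(name: &str, children: Vec<Node>) -> Node {
        let data = NodeData::Element { name: name.to_string(), attrs: Vec::new() };
        Node { data, children }
    }

    pub fn text(s: &str) -> Node {
        Node { data: NodeData::Text(s.to_string()), children: Vec::new() }
    }

    pub fn comment(s: &str) -> Node {
        Node { data: NodeData::Comment(s.to_string()), children: Vec::new() }
    }

    /// Concatenate the text of this node and all its descendants, in document order.
    /// Comments are skipped because they are their own node kind.
    pub fn text_content(&self) -> String {
        let mut out = String::new();
        self.collect_text(&mut out);
        out
    }

    fn collect_text(&self, out: &mut String) {
        if let NodeData::Text(s) = &self.data {
            out.push_str(s);
        }
        for child in &self.children {
            child.collect_text(out);
        }
    }
}
```

With this shape, the "!" in `<p>Hello <b>world</b>!</p>` is simply the third child of the p element, where ElementTree would store it as the .tail of the b element.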
That’s a good question, and I don’t have an obvious answer.
ElementTree has a list of children in each node, with no parent pointer. This means no access to ancestors or siblings from a given node, which is severely limiting. We probably don’t want to do that. This was done at a time when Python didn’t have a cycle collector, so adding a parent pointer would have created reference cycles that Python’s reference counting could not free, leaking entire trees. Today ElementTree could add a parent pointer and rely on the cycle collector.
src/sink/owned_dom.rs is similar in that it has a Vec of boxed (owned) children. Because children are owned, there cannot be a parent pointer, and so this has the same limitations as ElementTree.
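To illustrate the limitation, here is a minimal sketch of an owned tree. The struct and function names are invented for this example and are not the actual definitions in owned_dom.rs. Since a node knows nothing about its parent, the only way to get its ancestors is to search from the root and record the path on the way back up:

```rust
// Illustrative only: a node that owns its children through Box.
pub struct OwnedNode {
    pub name: String,
    pub children: Vec<Box<OwnedNode>>,
}

/// Return the ancestors of the first node named `target`, nearest first.
/// There is no parent pointer, so this has to walk down from `root`.
pub fn ancestors_of<'a>(root: &'a OwnedNode, target: &str) -> Option<Vec<&'a str>> {
    if root.name == target {
        // Found the node itself; its ancestors get pushed by the callers.
        return Some(Vec::new());
    }
    for child in &root.children {
        if let Some(mut path) = ancestors_of(child, target) {
            path.push(root.name.as_str());
            return Some(path);
        }
    }
    None
}
```

Given only a reference to a node somewhere in the middle of the tree, there is no equivalent of this function: the information is just not there.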
src/sink/rcdom.rs has a Vec of reference-counted pointers for children and a weak reference for the parent.
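A minimal sketch of that shape, using only std::rc and std::cell. The names RcNode, new_node, append and parent_of are assumptions for this example, not the actual rcdom.rs definitions:

```rust
use std::cell::RefCell;
use std::rc::{Rc, Weak};

// Illustrative only: strong references go down, a weak reference goes up.
pub struct RcNode {
    pub name: String,
    pub parent: RefCell<Weak<RcNode>>,
    pub children: RefCell<Vec<Rc<RcNode>>>,
}

pub fn new_node(name: &str) -> Rc<RcNode> {
    Rc::new(RcNode {
        name: name.to_string(),
        parent: RefCell::new(Weak::new()),
        children: RefCell::new(Vec::new()),
    })
}

/// Append `child` to `parent` and point the child's weak parent link back at it.
pub fn append(parent: &Rc<RcNode>, child: Rc<RcNode>) {
    *child.parent.borrow_mut() = Rc::downgrade(parent);
    parent.children.borrow_mut().push(child);
}

/// Returns None for a root node, or if the parent has already been dropped.
pub fn parent_of(node: &RcNode) -> Option<Rc<RcNode>> {
    node.parent.borrow().upgrade()
}
```

Because the upward link is weak, it does not keep the parent alive and there is no strong reference cycle. The flip side is that a child you still hold can outlive its parent, in which case parent_of returns None.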
lxml and Nokogiri are based on libxml2, which I think uses the same model as web browsers: each node has raw pointers to its parent, first child, last child, previous sibling, and next sibling. So it’s kind of a doubly-linked list generalized to a tree. I don’t know how memory management works there (who is responsible for freeing what), and it may be hard to do in safe Rust.
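One way to get the same five links while staying in safe Rust is to replace the raw pointers with indices into a Vec that owns every node. To be clear, this is a different technique from what libxml2 does, and Tree, Slot and NodeId below are hypothetical names for this sketch only. It also sidesteps the question of freeing individual nodes: nothing is freed until the whole Tree is dropped.

```rust
// Illustrative only: an index into Tree::slots stands in for a raw pointer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId(usize);

pub struct Slot {
    pub name: String,
    pub parent: Option<NodeId>,
    pub first_child: Option<NodeId>,
    pub last_child: Option<NodeId>,
    pub prev_sibling: Option<NodeId>,
    pub next_sibling: Option<NodeId>,
}

#[derive(Default)]
pub struct Tree {
    slots: Vec<Slot>,
}

impl Tree {
    /// Create a detached node and return its id.
    pub fn new_node(&mut self, name: &str) -> NodeId {
        self.slots.push(Slot {
            name: name.to_string(),
            parent: None,
            first_child: None,
            last_child: None,
            prev_sibling: None,
            next_sibling: None,
        });
        NodeId(self.slots.len() - 1)
    }

    pub fn get(&self, id: NodeId) -> &Slot {
        &self.slots[id.0]
    }

    /// Append a detached node as the last child of `parent`.
    pub fn append_child(&mut self, parent: NodeId, child: NodeId) {
        let old_last = self.slots[parent.0].last_child;
        {
            let c = &mut self.slots[child.0];
            c.parent = Some(parent);
            c.prev_sibling = old_last;
            c.next_sibling = None;
        }
        match old_last {
            // There was already a last child: link it forward to the new one.
            Some(last) => self.slots[last.0].next_sibling = Some(child),
            // The parent had no children: the new node is also the first child.
            None => self.slots[parent.0].first_child = Some(child),
        }
        self.slots[parent.0].last_child = Some(child);
    }

    /// Walk the sibling chain starting from the first child.
    pub fn children(&self, parent: NodeId) -> Vec<NodeId> {
        let mut out = Vec::new();
        let mut cur = self.get(parent).first_child;
        while let Some(id) = cur {
            out.push(id);
            cur = self.get(id).next_sibling;
        }
        out
    }
}
```

Parent, sibling, and child access are all constant-time lookups here, which covers the traversal needs listed earlier. The cost is that every access goes through the Tree, and a NodeId is only meaningful for the Tree that created it.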