Kuchiki: a (Edit: no longer vaporware) HTML/XML tree manipulation library


#1

Over time, I’ve heard multiple people in the Rust ecosystem ask for a library like Nokogiri in Ruby or lxml in Python, for parsing and serializing HTML and XML documents, traversing, manipulating, and querying the tree, etc. This can be useful for scraping web pages, testing web apps, pre-processing documents, …

This library does not exist. Yet.

The name

Nokogiri (鋸) is the Japanese work for saw, the wood-cutting tool.

It may be premature since there is no code to attach to it, but I’ve had some fun looking in my Japanese dictionary. I’d like this Rust library to be called Kuchiki (朽木). It is the Japanese word for decayed tree or rotted tree, which seems appropriate for a Rust library for tree manipulation.

According to our rustacean friend Tetsuharu / @saneyuki, the word is much more interesting than I initially thought!

朽木 has been used as the family name, and the word has some “nihilistic”, “seasoned” nuances in Japanese.
The word is felt as a mossy garden like Koke-dera and others http://en.wikipedia.org/wiki/Saihō-ji_(Kyoto)http://tabijikan.jp/2014/06/12/2579/
I don’t think it’s incongruity/taboo word in Japanese, and feel it is compatible with the style of “Rust” :slight_smile:

Big pieces

Kuchiki does not exist yet, but some of the more important components do.

  • html5ever is an HTML parser written in Rust. We use it in Servo and we maintain it, but it was designed as an independent library that other projects can use. It’s good.
  • Some people might be interested in XML in addition to HTML. I’m not familiar with either of them, but both xml-rs and RustyXML seem to be active. xml-rs is on crates.io.
  • We’ve just exctracted Servo’s CSS Selector matching code into an independent library, rust-selectors, that I will be maintaining. The API needs a lot of work to be made nicer (it was not designed for use outside of Servo), but it exists.

What’s left to do

  • Improve the rust-selectors API. I will work on this… at some point in time.
  • Define a good tree representation. html5ever has two of them in its src/sink directory which can be used as a starting point, but Kuchiki will probably want many more convenience methods for traversing and modifying the tree.
  • Glue everything together
  • A million other things I’m not thinking of right now
  • Lead and maintain the project

Who’s in?

I can help if some specific areas, but unfortunately Servo does not leave me enough time to lead this project. Who’s interested?


#2

Thank you Simon.
So, what’s there is the selector matching code, but not the tree manipulation?

Are there plans to add documentation to rust-selectors? If not, can you explain in a few lines the overall structure of the library?


#3

Yes, in rust-selectors.

Right, that would be Kuchiki (and doesn’t exist yet).

Yes, eventually. But the API will need to change, so I don’t know that it’s worth documenting the current one. But roughly:

  • The tree module contains traits to abstract the tree representation. This needs to be refactored to use associated types.
  • The parser module contains the parsing code and the data structures representing a parsed selector. Most of this should be private.
  • The matching module implements the actual matching. The must fundamental operation is the matches function, which takes a selector and an element and returns a boolean.
  • The other modules are utilities that should move into their own crates and be published separately.

#4

Also, there’s quite a bit to do for Kuchiki other than selector matching, so it doesn’t have to wait on rust-selectors being refactored.


#5

I’ll throw in a link to my work, the much less-awesome named “sxd”. An XML DOM and XPath implementation. I’ve been trying to upload a 0.1.0 version to crates.io, but some of my dependencies are not compiling at the moment.


#6

Interesting. Have you chatted with the authors of xml-rs or RustyXML to see if they have different goals or if efforts could be unified?

CC @tomaka


#7

An orthodox DOM implementation might be useful, but I’d rather have access to the XML infoset tree with convenient Rust idioms such as iterators and borrowed references.


#8

By “orthodox”, do you mean trying to replicate in Rust an API that looks like https://dom.spec.whatwg.org/ ? Let’s… not do that.


#9

I’m currently working on adding xml5to html5ever library, although the project has been derailed due to work and sickness[1].

Most of the big parts are done (tokenizer, tree builder, actions), but the character tokenizer is still undone and there are no tests, so I’m not sure the thing works (although it should in theory). I’m hoping to finish it all and test it all at start of March, but things can get messy, not to mention the huge time one would need to review the bloody thing ([2]).

Hopefully, once html5ever and xml5ever are fully united, that and rust-selectors could be use to build Kuchiki thingy. Although having generic interface would be quite beneficial, so any parser could theoretically be plugged in.

[1] I’ve been hilariously sick. I got flu week after week and then a stomach infection to top things off.
[2] 1398 lines added (and probably more, since namespaces aren’t done and some things I’ve avoided to ease the frequent mergers). I do not envy kmc :wink: https://github.com/Ygg01/html5ever/compare/servo:master...xml5ever


#10

If your tree builder can work with html5ever::tree_builder::TreeSink or something that can be mapped to it, it should be easy to make things work together.


#12

@SimonSapin What kind of tree manipulation will Kuchiki need? What would be the preferred structure backing it in your experience?


#13

A good starting point is to look at the APIs that lxml and Nokogiri provide, though a better approach would be to find potential users and ask what they need. Other than parsing/serialization/querying that were already discussed, this probably includes:

  • Traversing the tree: going from one node to its parent, first/last child, previous/next sibling, … and iterators for ancestors, children, descendants, previous/next siblings.
  • Modifying the tree: adding and removing nodes at any point
  • Accessing and modifying attributes on elements and other data on specific types of node

Also, I’d probably go with a tree of nodes of various kinds (element, text, comment, …) as in the DOM (the one in web browsers), and not a tree of elements with text attached to .text and .tail attributes of elements like Python’s ElementTree or lxml. This for a couple reason:

  • Dealing with .text and .tail can be a pain
  • You have more than elements and text in the tree anyway (starting with comments)

That’s a good question, and I don’t have an obvious answer.

ElementTree has a list of children in each node, with no parent pointer. This means no access to ancestors or siblings from a given node, which is severely limiting. We probably don’t want to do that. This was done at a time where Python didn’t have a cycle collector and adding a parent pointer would create cycles in Python’s reference counting and make entire trees leak in memory, but today it could add it and rely on the cycle/garbage collector.

html5ever’s src/sink/owned_dom.rs is similar in that it has a Vec of boxed (owned) children. Because children are owned, there can not be a parent pointer and so this has the same limitations as ElementTree.

html5ever’s src/sink/rcdom.rs has a Vec or reference-counted pointers for children and a weak reference for the parent.

lxml and Nokogiri are based on libxml2, which I think uses the same model as web browsers: each node has raw pointers to its parent, first child, last child, previous sibling, and next sibling. So it’s kind of a doubly-linked list generalized to a tree. I don’t know how memory management works there (who is responsible for freeing what), and it may be hard to do it safe Rust.


#14

I was thinking of getting access to nodes with a lifetime-bound accessor type that has a parent reference (meaning that the parent chain of node accessor values is instantiated for that lifetime). The only public owning values are a document or a document fragment, and the internal tree data structure is simple and unidirectional.


#15

I was wondering if in case of an element tree a cursor like structure would be better, perhaps? Cursor are kinda like Iterators, except they are reversible and possible multiple ranges(see Pseudo RFC Cursors). In theory a custom Cursor could traverse tree structure parent/child, prev/next sibling, giving you option to mutate the tree, appending things before or after an element.


Category suggestions
#16

Lifetime-bound accessors and cursors sound interesting and definitely worth looking into. https://github.com/pcwalton/multilist might be relevant too.


#17

I started experimenting with something like this, but with reference-counted pointers (strong and weak to avoid cycles), and RefCell for interior mutability. Does this design seem sane?

https://github.com/SimonSapin/rust-rctree/blob/master/src/lib.rs


#18

… and with an html5ever tree sink: https://github.com/SimonSapin/kuchiki


#19

Looks cool. Although I thought it was going to use multilist as data structure?


#20

That was an idea, but to be honest I had forgotten about it when I started rctree!


#21

Multilist seemed kinda related when I first heard of it, but I just gave it a closer look and I think it is either not applicable to a tree or too general compared to what we can do with a tree-specific data structure.

It’s still early stages (it basically hasn’t gotten any use) but I’m fairly happy with rctree so far. It has no unsafe code and takes O(1) time for all basic operations.