Kuchiki: a (Edit: no longer vaporware) HTML/XML tree manipulation library

Hey, I have a DOM-like tree all wired up with HTML parsing/serialization (html5ever) and CSS Selector matching (rust-selectors, extracted from Servo). But the API at this point is ugly: kuchiki/tests.rs at master · kuchiki-rs/kuchiki · GitHub

Would it be better if it looked like Html::parse() or Html.parse() like in Nokogiri?

What are the API improvements you'd like to see?

I don't know! What do you think would be a good API?

Maybe at this point it'd be good to have a real project that tries to use all this and provide feedback.

Well, personally, if I wanted to use it, I'd want to use it in the following ways:

  1. Quick and dirty parse of string.
  2. Parse stuff from a file
  3. Parse from the internet (e.g. RSS on some link, or some XML Rest response)

Ideally something along the lines of this:

 #1  XML.parse("<book>
                   <name>Count of Monte Cristo</name>
                   <author>Alexandre Dumas</author>
                </book>")
 #2  HTML.parse("./book.html");
 #3  HTML.parse("www.mozilla.org");

After I've parsed I'd like to select parsed stuff, perhaps select first h1 or second p.foo, etc.
So for example.

 let x = HTML.parse("<h1><p class=\"foo\">Paragraph</p></h1>").css("p.foo"); // Returns a list of nodes

Now, I'm aware this isn't fully possible without some macro hackery, but if parse is split into from_str, from_file, from_url, etc., that seems fine by me. I know I'm not being super original, because this is pretty much Nokogiri :wink:
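
To make the split concrete, here is a minimal sketch of what separate constructors could look like. `Html`, `from_str`, `from_file` and `source` are hypothetical names for illustration, not Kuchiki's actual API, and the type only stores the raw markup instead of parsing it. A `from_url` would need an HTTP client, so it is left out.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Hypothetical document type. A real one would hold a parsed tree;
/// this sketch only keeps the raw markup.
pub struct Html {
    source: String,
}

impl Html {
    /// Quick and dirty: take the markup from a string already in memory.
    pub fn from_str(markup: &str) -> Html {
        Html { source: markup.to_owned() }
    }

    /// Read the markup from a file, reporting I/O errors to the caller.
    pub fn from_file<P: AsRef<Path>>(path: P) -> io::Result<Html> {
        let markup = fs::read_to_string(path)?;
        Ok(Html::from_str(&markup))
    }

    /// The markup this document was built from.
    pub fn source(&self) -> &str {
        &self.source
    }
}
```

Keeping the string and file cases as separate functions means the file case can return an `io::Result` without forcing the in-memory case to do the same.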

I made some preliminary changes to the interface, so parsing a document now looks like this.

The changes are mostly in

https://github.com/Ygg01/kuchiki/blob/new_api/src/parser.rs#L14-L50

I've been trying to remove the <IgnoreParseErrors> from Html::<IgnoreParseErrors> (without changing ParseOpts too much), but it's getting late and I'm a newb, so I'm conceding for now.

Is this an improvement, in your opinion?

Here are some additional UI ideas for Kuchiki; let me know what you guys think.

Accessing content

Nokogiri and eTree love to access text content of a node using indexes.

Nokogiri:

#doc
#<body>
# <h1>Three's Company</h1>
#  <div>A love triangle.</div>
#</body>

h1 = doc[0]
h1.content = "Snap, Crackle & Pop"

@doc.to_html
# => "<body>
#   <h1>Snap, Crackle &amp; Pop</h1>
#   <div>A love triangle.</div>
# 
# </body>"

ElementTree:

country_data.xml

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

import xml.etree.ElementTree as ET

root = ET.parse("country_data.xml").getroot()
root.tag          # 'data'
root[0].attrib    # {'name': 'Liechtenstein'}
root[0][1].text   # '2008'

So the question is: do we overload indexed access to allow things like node[0][2] or node.attrib["class"]? I admit I'm not a great fan of node.as_document().unwrap(). In this regard, though, it seems practical to look at DOM1 and see what it can do.

I'm not proposing following it to the letter, but it makes innate sense to me to be able to say node.value or node.name or even node.attributes and have it return an option. It's well defined, it makes sense and there is existing precedent.

Additionally I propose to adopt the node[x] as a way to access nodes. It'll probably just be a shorthand for Iterator access. It's used across DOM, as well as Nokogiri and ElementTree.

What remains a mystery is whether to adopt more controversial things like node.content = "foo", which would let people change text nodes inside an Element node and (since HTML allows mixed content) means going down the infoset route. I prefer the more explicit design here, i.e. create a text node iterator and then replace as much of the content as you want.

Writing to files

Allow DocumentNode to serialize itself to things other than a string, for example files.

let doc = parsed_stuff.as_document().unwrap();
doc.write_file("hello.xml"); // write_file is the proposed method

Or maybe something more generic, perhaps? Maybe combine with rust serde?

Regarding indexing like node[2], Rust sort of has a convention that it should be fast (O(1) time), but with Kuchiki’s current internal representation (where siblings are linked rather than in a vector) accessing the n-th child of a node takes O(n) time. So code like for i in 0..n { do_something(node[i]) } would be a classic case of accidentally O(n²).
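
The cost difference can be seen in a toy model. This is a sketch under assumptions, not Kuchiki's representation: `Children`, `nth` and `links_followed_by_index_loop` are made-up names, and the sibling chain is stored as indices into a vector to keep the example short.

```rust
/// The children of one parent, stored as a sibling chain:
/// `next_sibling[i]` is the id of the sibling that follows child `i`.
pub struct Children {
    first: Option<usize>,
    next_sibling: Vec<Option<usize>>,
}

impl Children {
    /// Build a chain of `n` children with ids 0, 1, ..., n - 1.
    pub fn with_len(n: usize) -> Children {
        let next_sibling = (0..n)
            .map(|i| if i + 1 < n { Some(i + 1) } else { None })
            .collect();
        Children { first: if n > 0 { Some(0) } else { None }, next_sibling }
    }

    /// Indexed access walks the chain from the start, so it costs O(index).
    /// Returns the child id and the number of links that were followed.
    pub fn nth(&self, index: usize) -> Option<(usize, usize)> {
        let mut current = self.first?;
        let mut steps = 0;
        for _ in 0..index {
            current = self.next_sibling[current]?;
            steps += 1;
        }
        Some((current, steps))
    }

    /// Iteration follows each link once: O(n) for the whole list.
    pub fn iter(&self) -> impl Iterator<Item = usize> + '_ {
        let mut current = self.first;
        std::iter::from_fn(move || {
            let id = current?;
            current = self.next_sibling[id];
            Some(id)
        })
    }
}

/// Total links followed by `for i in 0..n { children.nth(i) }`.
pub fn links_followed_by_index_loop(children: &Children, n: usize) -> usize {
    (0..n)
        .filter_map(|i| children.nth(i))
        .map(|(_, steps)| steps)
        .sum()
}
```

For n children the index loop follows n(n-1)/2 links in total, while the iterator visits each child once.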

Given how RefCell is currently used for attributes, a convenience API for accessing one attribute would probably need to return a std::cell::Ref (and use Ref::filter_map internally) which unfortunately can’t be done with the Index trait as it requires returning &T.
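
A minimal sketch of such a convenience method, assuming attributes live in a `RefCell<HashMap<String, String>>`. The `Element` struct and the `attr` method here are illustrative only and may not match Kuchiki's real types. `Ref::filter_map` is the standard library function (stable since Rust 1.63).

```rust
use std::cell::{Ref, RefCell};
use std::collections::HashMap;

/// Illustrative element type; the real attribute storage may differ.
pub struct Element {
    pub attributes: RefCell<HashMap<String, String>>,
}

impl Element {
    /// Borrow one attribute value. The returned `Ref` keeps the RefCell
    /// borrowed for as long as the caller holds it, which is why this
    /// cannot be expressed through `Index` (that trait must return `&T`).
    pub fn attr(&self, name: &str) -> Option<Ref<'_, str>> {
        Ref::filter_map(self.attributes.borrow(), |map| {
            map.get(name).map(|value| value.as_str())
        })
        .ok()
    }
}
```

While a `Ref` returned by `attr` is alive, calling `borrow_mut()` on the same attributes would panic, so callers should drop it before mutating.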

Regarding infosets, that was kinda my understanding of how Kuchiki works already. Or do you mean what effbot.org calls the simplified infoset model, with .text and .tail strings on elements instead of text nodes? It was a deliberate choice not to do this in Kuchiki. That model pretends that nothing but elements and text exist, and has no way to represent other nodes like comments or doctypes. And as effbot.org itself notes, it’s easy for client code to forget to look at .tail.

Writing to files with AsRef<Path> sounds good, but the lowest-level abstraction is std::io::Write or maybe some Unicode equivalent. (See a postponed RFC, though std::fmt::Write can also be (ab)used.)
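
As a rough sketch of that layering, assuming a toy `Node` type with no attributes and no escaping (all names here are hypothetical): the generic `serialize` takes any `std::io::Write`, and `write_file` is a thin convenience wrapper on top of it.

```rust
use std::fs::File;
use std::io::{self, BufWriter, Write};
use std::path::Path;

/// Toy tree for illustration only.
pub enum Node {
    Element { name: String, children: Vec<Node> },
    Text(String),
}

impl Node {
    /// Lowest-level entry point: write markup to any byte sink.
    /// No escaping is done here; a real serializer must escape text.
    pub fn serialize<W: Write>(&self, writer: &mut W) -> io::Result<()> {
        match self {
            Node::Text(text) => writer.write_all(text.as_bytes()),
            Node::Element { name, children } => {
                write!(writer, "<{}>", name)?;
                for child in children {
                    child.serialize(writer)?;
                }
                write!(writer, "</{}>", name)
            }
        }
    }

    /// Convenience wrapper: create the file and serialize into it.
    pub fn write_file<P: AsRef<Path>>(&self, path: P) -> io::Result<()> {
        let mut writer = BufWriter::new(File::create(path)?);
        self.serialize(&mut writer)?;
        writer.flush()
    }
}
```

Because `Vec<u8>` also implements `Write`, the same `serialize` can produce an in-memory buffer, so a string-returning helper needs no separate code path.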

I don’t think using serde makes much sense here. As far as I understand, serde mostly serializes arbitrary structs based on their fields.

I’ve wanted a data structure that represents a node that is known to be e.g. an element (so you don’t have to write .as_element().unwrap() so much), but still has all the methods of nodes. I’m not sure if that would work. I’ve been told that abusing Deref for this is a terrible, terrible idea.
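
One possible shape for this, sketched without Deref: a small wrapper that is only constructible from a node already checked to be an element, with an explicit `as_node()` to get back to the general node methods. `ElementRef`, `NodeData` and the field layout are assumptions made for the example, not Kuchiki's types.

```rust
pub enum NodeData {
    Element { name: String },
    Text(String),
}

pub struct Node {
    pub data: NodeData,
    pub children: Vec<Node>,
}

/// A reference to a node that has already been checked to be an element.
#[derive(Clone, Copy)]
pub struct ElementRef<'a> {
    node: &'a Node,
    name: &'a str,
}

impl Node {
    /// The check happens once, here; afterwards no unwrap is needed.
    pub fn as_element(&self) -> Option<ElementRef<'_>> {
        match &self.data {
            NodeData::Element { name } => Some(ElementRef { node: self, name }),
            NodeData::Text(_) => None,
        }
    }
}

impl<'a> ElementRef<'a> {
    /// Element-only accessor: infallible because of the earlier check.
    pub fn name(&self) -> &'a str {
        self.name
    }

    /// Explicit way back to the general node API (instead of Deref).
    pub fn as_node(&self) -> &'a Node {
        self.node
    }
}
```

The cost of this design is the extra `.as_node()` call whenever a general node method is needed on an element.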

I’ve also considered using vector indices (applied to trees) instead of &T references to an arena. This would remove all the awkwardness of RefCell in APIs (mutability would be handled in a way that’s more idiomatic to Rust) and would even allow accessing a tree from multiple threads. But then API awkwardness appears elsewhere: you have this arena-like thing that holds the vec, and not only does it need to be kept alive like an arena, you also have to access it all the time: node.parent() becomes tree.parent(node_id) or node_id.parent(tree). Maybe a (&tree, node_id) wrapper could make the API nicer, but then mutability becomes tricky again.
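
A minimal sketch of the index-based design, to show where the awkwardness moves. `Tree`, `NodeId` and the method names are hypothetical, and `append` assumes the child is not yet attached anywhere.

```rust
/// An index into `Tree::nodes`. It is only meaningful together with its tree.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId(usize);

struct NodeData {
    name: String,
    parent: Option<NodeId>,
    children: Vec<NodeId>,
}

/// Owns every node; all access goes through it.
#[derive(Default)]
pub struct Tree {
    nodes: Vec<NodeData>,
}

impl Tree {
    pub fn new_node(&mut self, name: &str) -> NodeId {
        self.nodes.push(NodeData {
            name: name.to_owned(),
            parent: None,
            children: Vec::new(),
        });
        NodeId(self.nodes.len() - 1)
    }

    /// Attach `child` (assumed to be detached) as the last child of `parent`.
    pub fn append(&mut self, parent: NodeId, child: NodeId) {
        self.nodes[child.0].parent = Some(parent);
        self.nodes[parent.0].children.push(child);
    }

    /// `node.parent()` becomes `tree.parent(node_id)`.
    pub fn parent(&self, node: NodeId) -> Option<NodeId> {
        self.nodes[node.0].parent
    }

    pub fn children(&self, node: NodeId) -> &[NodeId] {
        &self.nodes[node.0].children
    }

    pub fn name(&self, node: NodeId) -> &str {
        &self.nodes[node.0].name
    }

    /// Mutation needs `&mut Tree` and no RefCell.
    pub fn set_name(&mut self, node: NodeId, name: &str) {
        self.nodes[node.0].name = name.to_owned();
    }
}
```

Mutation here goes through plain `&mut self`, but every call site has to carry the tree along with the id.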

Dang. Still thanks for a detailed summary :smile:

I meant the simplified infoset. Also I agree that .tail and .text would be a mistake. However we could possibly have iterator over text nodes on a node, like we have descendants/ancestors/etc. Basically it's children iterator that skips non-text nodes.
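
Roughly what I have in mind, as a sketch over a toy tree. The `Node` enum and `text_children` are made up for the example, not the current Kuchiki types.

```rust
pub enum Node {
    Element { name: String, children: Vec<Node> },
    Text(String),
    Comment(String),
}

impl Node {
    /// Direct children; non-element nodes have none.
    pub fn children(&self) -> &[Node] {
        match self {
            Node::Element { children, .. } => children.as_slice(),
            _ => &[],
        }
    }

    /// A children iterator that skips everything except text nodes.
    /// It does not descend into child elements.
    pub fn text_children(&self) -> impl Iterator<Item = &str> + '_ {
        self.children().iter().filter_map(|child| match child {
            Node::Text(text) => Some(text.as_str()),
            _ => None,
        })
    }
}
```

Unlike .text and .tail, comments and nested elements still exist in the tree; the iterator just doesn't yield them.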

Let me quickly summarize the additions to the Node interface so you can weigh in. Just keep in mind the return types are more of a sketch than the real deal:

impl<'a> Node<'a> {
    fn name(&self) -> Option<&str>
    fn value(&self) -> Option<&str>
    fn attribs(&self) -> Option<HashMap>
    fn text(&self) -> TextIterator
    fn attr(&mut self, key: &str) -> Option<Ref<String>> 
} 

From the looks of it, it seems it will be difficult to allow mutations like node.value() = 3. What would be best way to mutate the tree?

For some reason I thought serde would allow more generic serialization options. Doh!

I see. Oh, well, then it's either wait for inheritance or use macros.

@SimonSapin I wanted to ask, where is the best place to discuss/propose new functionality for Kuchiki? Here? Github Issues? IRC?

I was thinking about allowing modification to tree nodes in Kuchiki and I wanted to hear some feedback on this topic. This only applies to HTML. In XML without a schema there is no such restriction.

What if someone decides to change all <a> nodes into <br>? That should be illegal because certain tags in HTML have special semantics. So perhaps forbidding tag modification altogether is for the best when it comes to HTML? But on the other hand, changing h1->h2 is a completely legit transformation :frowning: and not allowing such changes would strike me as silly.

One possible way to solve this issue would be to:

  • take node being changed
  • convert node into string
  • change string representation
  • parse transformed string, returning either errors or transformed_node
  • if it parses without errors, replace node with transformed_node

Perhaps fragment parsing could be used for this kind of transformation?

That’s a good question. I like IRC for informal discussions, but I’m not sure what channel to use. Kuchiki isn’t really a big enough project to have its own channel. #rust is appropriate, but probably high-traffic. #servo is only tangentially relevant.

GitHub definitely for bugs and pull requests, but also works for proposals.

I don’t think that tree manipulation APIs need to be aware of that. It’s up to the HTML serializer to deal with it (or not).

For what it’s worth, an element’s name is read-only in the DOM. What you can do is:

  • Create a new element node with a new name
  • Copy attributes
  • Move child nodes
  • Replace the old node with the new one.

Maybe Kuchiki can provide helpers to do the replacing.
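
Such a helper could look something like the sketch below, which follows the four steps on a toy owned tree. `Element`, `renamed` and `rename_child` are hypothetical names; Kuchiki's reference-counted nodes would need a different implementation.

```rust
/// Toy element with owned children, for illustration only.
pub struct Element {
    pub name: String,
    pub attributes: Vec<(String, String)>,
    pub children: Vec<Element>,
}

/// Build a replacement for `old` under a new name.
pub fn renamed(old: Element, new_name: &str) -> Element {
    // 1. Create a new element node with the new name.
    let mut new = Element {
        name: new_name.to_owned(),
        attributes: Vec::new(),
        children: Vec::new(),
    };
    // 2. Copy attributes.
    new.attributes = old.attributes.clone();
    // 3. Move child nodes.
    new.children = old.children;
    new
}

/// 4. Replace the old node with the new one, at the same position.
/// Panics if `index` is out of bounds.
pub fn rename_child(parent: &mut Element, index: usize, new_name: &str) {
    let old = parent.children.remove(index);
    parent.children.insert(index, renamed(old, new_name));
}
```

Note that nothing here checks whether the new name makes sense for the content; per the point above, that is left to the serializer.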

That said, we could still diverge from the DOM and make element names mutable. But that probably means more RefCells and their ergonomic hit.

No. There’s a million things that can go wrong when doing string manipulation of XML/HTML data like this. The whole point of Kuchiki is to provide a proper tree API so you don’t have to do stuff at the string level.

Hi Simon. I'm very interested in this project.

I'm working on a project that needs to parse millions of HTML pages and search them by XPath. I'm from the .NET world, where I use HtmlAgilityPack to parse HTML pages and run XPath searches on them. Right now I'm looking for an alternative in the unmanaged world, and I think Rust is the best choice.

What do you think about adding XPath support to Kuchiki? I saw the SXD-XPath project, which supports XPath very well, but it only works with XML documents and depends on the SXD-Document library, which cannot be used with html5ever. Could we create a library like rust-selectors (but for XPath) that does not depend on any particular DOM library? Kuchiki would then need to connect that library's API to html5ever to search HTML.

I'm very new to Rust, but I want to spend some of my free time learning the language and implementing some useful things. Could you point me to where I can start? Maybe it would be right to stabilize Kuchiki's API with the existing rust-selectors first, before starting work on an XPath library? Maybe some things from rust-selectors could help in creating an XPath library?

Yes. Sadly, the way Kuchiki works requires a dom_sink into which you can toss your elements. SXD doesn't use dom_sink, so Kuchiki can't use SXD-XPath.

Anyway, I was also looking into XPath and it seems like overkill for Kuchiki's needs. XPath includes not just selecting but doing other weird transformations with it. See eTree XPath.

One alternative that came to mind was to convert eTree's subset of XPath expressions into selectors and find the matching elements, but I think it was infeasible because a parent selector was missing.

I’m not planning to write an XPath matcher myself, but I’m in favor of integrating one in Kuchiki if someone writes it :)

html5ever only provides a parser and serializer that are fairly orthogonal from the tree representation. XPath only cares about the tree representation. Kuchiki provides one tree representation and "connects" html5ever and rust-selectors to it.

So I imagine two possible approaches for this matcher:

  • Hard-coded to Kuchiki’s tree representation. This may be easier.
  • Use a trait or a set of traits to make the tree representation abstract, like rust-selectors does. That way, maybe the same matcher (an evolution of SXD-XPath?) could be used with Kuchiki and SXD-Document. I recommend chatting with the author of SXD-XPath about this. (I don’t know if they read this thread, filing a github issue might be the easiest way to contact them.)
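
A very rough sketch of the second approach. The `TreeNode` trait and `select_path` are invented for illustration and are neither rust-selectors' trait nor anything in SXD-XPath, and the "matcher" only handles child-only paths like country/rank. The point is that the matcher is written once against the trait and each tree representation supplies an impl.

```rust
/// What a matcher needs to know about a tree, and nothing more.
pub trait TreeNode: Sized {
    fn name(&self) -> Option<&str>;
    fn children(&self) -> Vec<Self>;
}

/// Evaluate a child-only path (e.g. ["country", "rank"]) relative to `root`.
pub fn select_path<N: TreeNode>(root: N, steps: &[&str]) -> Vec<N> {
    let mut current = vec![root];
    for step in steps {
        // Replace the context set with the matching children of each node.
        current = current
            .iter()
            .flat_map(|node| node.children())
            .filter(|child| child.name() == Some(*step))
            .collect();
    }
    current
}

/// One possible tree representation; others would add their own impl.
pub struct Toy {
    pub name: String,
    pub children: Vec<Toy>,
}

impl<'a> TreeNode for &'a Toy {
    fn name(&self) -> Option<&str> {
        Some(self.name.as_str())
    }

    fn children(&self) -> Vec<Self> {
        let toy: &'a Toy = *self;
        toy.children.iter().collect()
    }
}
```

A real XPath matcher would also need parents, attributes, text and the other axes on the trait, which is where most of the design work would be.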

I don’t think that’s accurate. SXD-Document, Kuchiki, and html5ever_dom_sink::rcdom are three different tree representations (data structures). The latter two implement html5ever’s TreeSink trait so that such a tree can be created from html5ever’s HTML parser. SXD-XPath is (currently) hard-coded to use SXD-Document’s tree representation. (But maybe making it generic is the way forward.)

I don’t understand what you mean. XPath only queries stuff from the tree. Maybe you’re thinking of XSLT, which uses XPath to do tree transformations? (“eTree XPath” is a subset of XPath that does not support going “up” the tree, because ElementTree does not have parent pointers because Python 1.5.x did not have a cycle collector.)

Nope, nope, nope. I maintain cssselect (scrapy/cssselect: CSS Selectors for Python), which does the opposite. Down that road is madness.

[quote="SimonSapin, post:36, topic:435"]
I don’t understand what you mean. XPath only queries stuff from the tree. Maybe you’re thinking of XSLT, which uses XPath to do tree transformations?
[/quote]

I mean, you can do queries in XPath that don't return a set of nodes. You can, for example, count the nodes in one set, divide that by the count of another set, floor the result, multiply it, etc. I don't see that being useful in Kuchiki, do you?

I mean, in Kuchiki, you can select the nodes you want and do a count of them, to get number of nodes and further process it in Rust.

Strange. It has the parent element selector, which implies it's possible to go up the tree. Could you elaborate, what exactly does going up the tree entail and how it's different than parent selector in eTree?

Welp, it seemed like a good way to avoid code duplication.

You're right. But we could simply not implement those features of XPath. We could implement just node selection, like rust-selectors, and that would be enough. XPath search is more advanced than CSS selectors, which is why our project uses XPath rather than CSS selectors.

I think this is the right choice for now.

I will try to contact the author of that project. Maybe I can convince him to support an abstract tree in SXD-XPath. If that succeeds, Kuchiki can just add support for that tree and everything should work.

But for now, @SimonSapin, do you have some minor tasks for me to try coding for Kuchiki? Or is it too early to code, with many things still to discuss?

@Ygg01, sorry if this came off more strongly than I intended. Mapping Selectors to XPath seems easy at first, but the two languages are different enough that you quickly get very tricky edge cases. See https://github.com/SimonSapin/cssselect/issues/12 for example.

@Settler, I don’t really have tasks to give in Kuchiki. Rather, what would be beneficial at this point is someone using it in a real project and giving feedback on what are the pain points, what can be improved, etc.

No need to apologize. I can get quite incomprehensible at times. Anyway, I learned a valuable lesson. Never attempt to replicate XPath with CSS.

Ever :wink:

I'm interested though whether you want to support all of XPath, including expressions that evaluate to float or bool or just a subset that only selects NodeSets? Or do you think it depends on user demand?

Never ever might be strong. All I’m saying is that I’ve dealt with something like this before and I don’t recommend it. The apparent simplicity is deceptive.

I’m gonna let @Settler reply about what they want. I personally have no desire to use XPath at all, I just don’t mind having it in Kuchiki if other people find it useful and are willing to do most of the work. (Including returning non-nodes if people find that useful.)