Kuchiki: a (Edit: no longer vaporware) HTML/XML tree manipulation library

Our use-case is just searching/selecting nodes, nothing more, nothing less. HtmlAgilityPack can only search, and that's fine for us. That's why I found this project very interesting: it is very similar to .NET's HtmlAgilityPack, and the ideas in Kuchiki cover most of our use-cases (except XPath).

I will try to implement some of our use-cases using the current code of Kuchiki and rust-selectors and post the results here. Ultimately, no matter which approach is used for selecting nodes - CSS or XPath - the API should look very similar for both.
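
To illustrate what I mean by "very similar", here is a rough, purely hypothetical sketch (none of these types exist in Kuchiki or rust-selectors; all names are made up): the query language is hidden behind one trait, so user code calls the same select regardless of whether the query was compiled from CSS or from XPath.

    /// Hypothetical: hides whether matching is driven by CSS or XPath.
    trait Query {
        /// Returns true if `node` matches this query.
        fn matches(&self, node: &Node) -> bool;
    }

    struct CssQuery;   // would wrap a compiled selector list
    struct XPathQuery; // would wrap a compiled path expression

    impl Query for CssQuery {
        fn matches(&self, _node: &Node) -> bool {
            false // would delegate to rust-selectors matching here
        }
    }

    impl Query for XPathQuery {
        fn matches(&self, _node: &Node) -> bool {
            false // would evaluate the path expression here
        }
    }

    /// Stand-in for a tree node.
    struct Node;

    impl Node {
        /// Stand-in for a depth-first traversal of the subtree.
        fn descendants(&self) -> impl Iterator<Item = &Node> {
            std::iter::empty()
        }

        /// The same entry point serves both query languages.
        fn select<Q: Query>(&self, query: &Q) -> Vec<&Node> {
            self.descendants().filter(|&n| query.matches(n)).collect()
        }
    }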

You might want to avoid that. As SimonSapin said - it's a hack and it's bound to backfire in really subtle ways. Maybe basing it on Kuchiki's various iterators might prove prudent? I dunno shrug

If you have code that uses specific XPath queries and rewrite it to use specific Selectors (or the reverse), that’s fine IMO. What I’m recommending against is trying to write a library that takes arbitrary Selectors and rewrites them as XPath. The reverse probably has similar issues.

@SimonSapin: Couple of questions.

  1. How (and when) should we add docs to kuchiki? Do we wait for the API to stabilize, or add docs now and change them later?

  2. I'm thinking about when I'd use HTML parsing, and it occurs to me that I only ever use it to get HTML from a site, for example in a crawler. Which raises the question: do we add hyper as an optional dependency? I.e. should we add an HTML.parseUri that relies on some HTTP client?

  1. I’ve configured Travis-CI to push rustdoc output to https://simonsapin.github.io/kuchiki/, but many things are still missing a doc-comment. I don’t think there is a reason to wait, the work just hasn’t been done yet. Eventually I’d like to add #![deny(missing_docs)] to the crate.

  2. I’ve added Html::from_stream, which is generic and takes anything that implements std::io::Read, including hyper::client::Response. (from_file is now a thin wrapper around it; there is a rough sketch of that below.) This is much more general than mandating Hyper specifically, and makes user code only a little more verbose:

    try!(Html::from_stream(try!(hyper::Client::get(url).send())).parse())
    

    Instead of:

    try!(Html::from_url(url).parse())
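
    Here is that rough sketch of the "thin wrapper" idea, under assumed names (Parser is a stand-in type, and Kuchiki’s real signatures may differ): from_file just opens the file and delegates everything else to the generic from_stream.

        use std::fs::File;
        use std::io::{self, Read};
        use std::path::Path;

        /// Stand-in for the parser/builder type; not Kuchiki's real type.
        struct Parser {
            bytes: Vec<u8>,
        }

        impl Parser {
            /// Works for any reader: a File, an HTTP response body, a &[u8], ...
            fn from_stream<R: Read>(mut stream: R) -> io::Result<Parser> {
                let mut bytes = Vec::new();
                stream.read_to_end(&mut bytes)?;
                Ok(Parser { bytes })
            }

            /// The "thin wrapper": open the file, then delegate to from_stream.
            fn from_file<P: AsRef<Path>>(path: P) -> io::Result<Parser> {
                Self::from_stream(File::open(path)?)
            }
        }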
    

Keep up the good work! I'm loving it! :heart_eyes:

To be honest, I made Kuchiki as a proof of concept since I had seen people ask multiple times how to do this, but I don’t use it myself. I’m not really spending any time on developing it further. If someone is motivated to take over maintenance of the project, I’d be happy to hand it over.

I'll maintain it if there is no one else. I promise in advance never to delete it :wink: (I once deleted my own xml5ever repo by mistake).

Haha. @Ygg01 I’ll give you access if you want to do stuff with it, but don’t feel "forced" into it if there’s no one else. Ideally someone will magically show up who is interested in pushing the project forward, developing new features, etc., not just bearing the burden of maintenance. Oh well, maybe it’ll happen in time as the ecosystem grows.

One thing I want to experiment with is indexing the XML/HTML tree for query purposes. Are there any good books/blogs on implementing such indexes?

Indexing as in full-text, Google-like fuzzy search? There are lots of resources on this; I don’t know if it’s in any way specific to XML/HTML, though.

If not, I don’t understand what you mean.

I meant like this: https://developer.marklogic.com/blog/how-indexing-makes-xpath-fast

What does Firefox use to speed up its XPath queries?

Aw, too bad, since you advised me to use this. I thought/hoped it would be further improved and maintained, since you put some effort into creating this crate. Let's hope somebody can take it over.

Well, for what it’s worth I recommend it because I still think that it’s better than the alternatives.

There are alternatives? SXD?

I don’t know much about XPath, but Firefox goes to great lengths to optimize CSS selectors in stylesheets. There’s a lot to say about this, and it’s probably out of scope for this thread :slight_smile:

Kuchiki supports CSS selector matching through the same library as Servo. Servo’s style system is also going into Firefox as part of the Stylo a.k.a. Quantum CSS project. So Kuchiki ends up profiting from some of these optimizations. (Not all, since some require more scaffolding than exists in the selectors crate.)
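
For reference, this is roughly what that selector matching looks like from user code. This assumes a recent published kuchiki release with the parse_html()/select() API, which is not necessarily the exact API discussed earlier in this thread:

    extern crate kuchiki;

    use kuchiki::traits::*;

    fn main() {
        let html = r#"<ul><li class="item">one</li><li class="item">two</li></ul>"#;

        // Parse the fragment into a NodeRef tree.
        let document = kuchiki::parse_html().one(html);

        // `select` compiles the CSS selector with the same `selectors`
        // crate that Servo uses, then yields every matching element.
        for matching in document.select("li.item").unwrap() {
            println!("{}", matching.as_node().text_contents());
        }
    }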

Does Firefox translate XPath queries into CSS? I remember there was talk about it, and it was frowned upon.

From what I've gathered, CSS selectors aren't a natural fit for XPath-like queries that eTree uses, so I assume they should use different data structures/algorithms.

Once upon a time I inherited maintenance of a project that translates CSS Selectors to XPath: https://github.com/scrapy/cssselect. The common cases seem easy to translate at first, but I believe it is not possible to support everything correctly. (For example, consider div ~ :nth-of-type(4): the 4 is counted from the start of the parent, not from the div.) I assume going the other way would hit similar mismatches.
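
To make the mismatch concrete, here is a small sketch. It assumes kuchiki’s select() handles tree-structural pseudo-classes such as :nth-of-type through the selectors crate; the XPath mentioned in the comments is only the "obvious" naive translation and is not executed by this code:

    extern crate kuchiki;

    use kuchiki::traits::*;

    fn main() {
        // The last <span> is the 4th span among its parent's children and has
        // a <div> somewhere before it, so `div ~ :nth-of-type(4)` matches it.
        // A naive XPath translation such as `div/following-sibling::*[4]`
        // counts siblings starting after the div instead; there are only two
        // of those, so it would match nothing.
        let html = r#"<section>
            <span>one</span><span>two</span>
            <div>marker</div>
            <span>three</span><span>four</span>
        </section>"#;

        let document = kuchiki::parse_html().one(html);
        for m in document.select("div ~ :nth-of-type(4)").unwrap() {
            println!("{}", m.as_node().text_contents()); // prints "four"
        }
    }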