Kuchiki: a (Edit: no longer vaporware) HTML/XML tree manipulation library

Our use-case is just searching/selecting nodes, nothing more, nothing less. HtmlAgilityPack can only search, and that's fine for us. That's why I found this project very interesting: it is very similar to .NET's HtmlAgilityPack, and the ideas in Kuchiki cover most of our use-cases (except XPath).

I will try to implement some of our use-cases using the current code of Kuchiki and rust-selectors and post the results here. Ultimately, no matter which approach is used for selecting nodes - CSS or XPath - the API should look very similar for both.
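
To illustrate what I mean by "very similar", here is a rough, purely hypothetical sketch (none of these types exist in Kuchiki or rust-selectors; all names are made up): the query language is hidden behind one trait, so user code calls the same select regardless of whether the query was compiled from CSS or from XPath.

    /// Hypothetical: hides whether matching is driven by CSS or XPath.
    trait Query {
        /// Returns true if `node` matches this query.
        fn matches(&self, node: &Node) -> bool;
    }

    struct CssQuery;   // would wrap a compiled selector list
    struct XPathQuery; // would wrap a compiled path expression

    impl Query for CssQuery {
        fn matches(&self, _node: &Node) -> bool {
            false // would delegate to rust-selectors matching here
        }
    }

    impl Query for XPathQuery {
        fn matches(&self, _node: &Node) -> bool {
            false // would evaluate the path expression here
        }
    }

    /// Stand-in for a tree node.
    struct Node;

    impl Node {
        /// Stand-in for a depth-first traversal of the subtree.
        fn descendants(&self) -> impl Iterator<Item = &Node> {
            std::iter::empty()
        }

        /// The same entry point serves both query languages.
        fn select<Q: Query>(&self, query: &Q) -> Vec<&Node> {
            self.descendants().filter(|&n| query.matches(n)).collect()
        }
    }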

You might want to avoid that. As SimonSapin said - it's a hack and it's bound to backfire in really subtle ways. Maybe basing it on Kuchiki's various iterators might prove prudent? I dunno shrug

If you have code that uses specific XPath queries and rewrite it to use specific Selectors (or the reverse), that’s fine IMO. What I’m recommending against is trying to write a library that takes arbitrary Selectors and rewrites them as XPath. The reverse probably has similar issues.

@SimonSapin: Couple of questions.

  1. How (and when) should we add docs to kuchiki? Do we wait for the API to stabilize, or add docs now and change them later?

  2. I'm thinking about when I'd use HTML parsing, and it occurs to me that I only ever use it to get HTML from a site, for example in a crawler. Which raises the question: do we add hyper as an optional dependency? I.e. should we add an HTML.parseUri that relies on some HTTP client?

  1. I’ve configured Travis-CI to push rustdoc output to https://simonsapin.github.io/kuchiki/, but many things are still missing a doc-comment. I don’t think there is a reason to wait, the work just hasn’t been done yet. Eventually I’d like to add #![deny(missing_docs)] to the crate.

  2. I’ve added Html::from_stream, which is generic and takes anything that implements std::io::Read, including hyper::client::Response. (from_file is now a thin wrapper around it; there is a rough sketch of that below.) This is much more general than mandating Hyper specifically, and makes user code only a little more verbose:

    try!(Html::from_stream(try!(hyper::Client::get(url).send())).parse())
    

    Instead of:

    try!(Html::from_url(url).parse())
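
    Here is that rough sketch of the "thin wrapper" idea, under assumed names (Parser is a stand-in type, and Kuchiki’s real signatures may differ): from_file just opens the file and delegates everything else to the generic from_stream.

        use std::fs::File;
        use std::io::{self, Read};
        use std::path::Path;

        /// Stand-in for the parser/builder type; not Kuchiki's real type.
        struct Parser {
            bytes: Vec<u8>,
        }

        impl Parser {
            /// Works for any reader: a File, an HTTP response body, a &[u8], ...
            fn from_stream<R: Read>(mut stream: R) -> io::Result<Parser> {
                let mut bytes = Vec::new();
                stream.read_to_end(&mut bytes)?;
                Ok(Parser { bytes })
            }

            /// The "thin wrapper": open the file, then delegate to from_stream.
            fn from_file<P: AsRef<Path>>(path: P) -> io::Result<Parser> {
                Self::from_stream(File::open(path)?)
            }
        }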
    

Keep up the good work! I'm loving it! :heart_eyes:

To be honest, I made Kuchiki as a proof of concept since I had seen people ask multiple times how to do this, but I don’t use it myself. I’m not really spending any time on developing it further. If someone is motivated to take over maintenance of the project, I’d be happy to hand it over.

I'll maintain it if there is no one else. I promise in advance never to delete it :wink: (I once deleted my own xml5ever repo by mistake).

Haha. @Ygg01 I’ll give you access if you want to do stuff with it, but don’t feel "forced" into it if there’s no one else. Ideally someone will magically show up who is interested in pushing the project forward, developing new features, etc., not just bearing the burden of maintenance. Oh well, maybe it’ll happen in time as the ecosystem grows.

One thing I want to experiment with is indexing the XML/HTML tree for query purposes. Are there any good books/blogs on implementing such indexes?

Indexing as in full-text, Google-like fuzzy search? There are lots of resources on this; I don’t know if it’s in any way specific to XML/HTML, though.

If not, I don’t understand what you mean.

I meant like this: https://developer.marklogic.com/blog/how-indexing-makes-xpath-fast

What does Firefox use to speed up its XPath queries?

Aw, too bad, since you advised me to use this. I thought/hoped it would be further improved and maintained, since you put some effort into creating this crate. Let's hope somebody can take it over.

Well, for what it’s worth I recommend it because I still think that it’s better than the alternatives.

There are alternatives? SXD?

I don’t know much about XPath, but Firefox goes to great lengths to optimize CSS selectors in stylesheets. There’s a lot to say about this, and it’s probably out of scope for this thread :slight_smile:

Kuchiki supports CSS selector matching through the same library as Servo. Servo’s style system is also going into Firefox as part of the Stylo a.k.a. Quantum CSS project. So Kuchiki ends up profiting from some of these optimizations. (Not all, since some require more scaffolding than exists in the selectors crate.)
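
For reference, this is roughly what that selector matching looks like from user code. This assumes a recent published kuchiki release with the parse_html()/select() API, which is not necessarily the exact API discussed earlier in this thread:

    extern crate kuchiki;

    use kuchiki::traits::*;

    fn main() {
        let html = r#"<ul><li class="item">one</li><li class="item">two</li></ul>"#;

        // Parse the fragment into a NodeRef tree.
        let document = kuchiki::parse_html().one(html);

        // `select` compiles the CSS selector with the same `selectors`
        // crate that Servo uses, then yields every matching element.
        for matching in document.select("li.item").unwrap() {
            println!("{}", matching.as_node().text_contents());
        }
    }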

Does Firefox translate XPath queries into CSS? I remember there was talk about it, and it was frowned upon.

From what I've gathered, CSS selectors aren't a natural fit for XPath-like queries that eTree uses, so I assume they should use different data structures/algorithms.

Once upon a time I inherited maintenance of a project that translates CSS Selectors to XPath: https://github.com/scrapy/cssselect. The common cases seem easy to translate at first, but I believe it is not possible to support everything correctly. (For example, consider div ~ :nth-of-type(4): the 4 is counted from the start of the parent, not from the div.) I assume going the other way would hit similar mismatches.
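
To make the mismatch concrete, here is a small sketch. It assumes kuchiki’s select() handles tree-structural pseudo-classes such as :nth-of-type through the selectors crate; the XPath mentioned in the comments is only the "obvious" naive translation and is not executed by this code:

    extern crate kuchiki;

    use kuchiki::traits::*;

    fn main() {
        // The last <span> is the 4th span among its parent's children and has
        // a <div> somewhere before it, so `div ~ :nth-of-type(4)` matches it.
        // A naive XPath translation such as `div/following-sibling::*[4]`
        // counts siblings starting after the div instead; there are only two
        // of those, so it would match nothing.
        let html = r#"<section>
            <span>one</span><span>two</span>
            <div>marker</div>
            <span>three</span><span>four</span>
        </section>"#;

        let document = kuchiki::parse_html().one(html);
        for m in document.select("div ~ :nth-of-type(4)").unwrap() {
            println!("{}", m.as_node().text_contents()); // prints "four"
        }
    }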