Märkəd (new crate release)

dekellum · March 16, 2020, 7:07pm

I originally came to rust, primarily out of the need to replace an admittedly jankie, poorly aging, and RAM-bloating Ruby+Java HTTP, HTML and text processing stack with cleaner, safer, latest spec compliant, and efficient rust libs. Obviously the existence of the Servo project was an early encouragement for this journey, but it seems that many things associated with Servo tend to languish in a somewhat incomplete state, missing just-a-few pieces required for production use.

For example, html5ever is missing legacy encoding detection and recently jettisoned its default RcDom in the latest MINOR release. For DOM, kuchiki (same authors) is rumored to have a better maintained RcDom but even issues there suggest a victor-like (same authors, "No Maintenance Intended") alternative of a Vec<Node> with index-based parent/child associations.

So I've forked victor and further optimized a vector-based DOM in marked. To märkəd (alt spelling) I've also added:

Heuristic based legacy HTML character encoding hints and parser buffered restart. An estimated 5% of the web remains in encodings other than UTF-8. One (1) in 20 is far too common to treat as errors.
A rust-idiomatic "selectors" API (similar to what "current stack" has in ruby)
A vistor-pattern, bulk mutating filter system (the real subject of my prior request for help: Save me from going unsafe here)

Its now released under a compatible, MIT/Apache dual license. The above linked marked README gives a more complete feature overview.

As a new ammonia crate backend or drop-in compatible replacement

Ammonia has a seriously high download count, so I want to highlight this use case…

Initially just for high level comparative testing, the source tree ./ammonia-compare includes an example and benchmarks using a combination of built-in and custom filters to achieve the same byte-output as ammonia::Builder::default settings. The märkəd implementation is slightly faster (most time is in the same html5ever parse, see benchmarks below), small in LoC and easily customized.

As a next step I'm CCing the Ammonia developers here, nomiminally as a heads-up and in case they might be interested in Ammonia switching to the marked::Document (off of a questionablly-maintained RcDom) or using märkəd as a backend. Otherwise I think it would be fairly easy if someone wanted to contribute a fully ammonia compatible Builder API to the märkəd project, which could be released as a marked-sanitizer (now reserved) crate. Between all the crates linked in this post, many by the same authors, it would seem there should be some consolidation toward single, vetted and trusted implementations. I know I could certainly use the help!

Benchmarks

I'm providing these benchmarks from an oldish Intel i7-5600U laptop that may or may not have also been decoding youtube full albums via firefox concurrently. Also the content being tested is samples of one. So definitely not scientific, but perhaps enticing readers to test more and contribute:

rustc 1.43.0-nightly (564758c4c 2020-03-08)

test b00_round_trip_rcdom                ... bench:  17,444,427 ns/iter (+/- 10,292,071)
test b01_round_trip_marked               ... bench:  16,550,426 ns/iter (+/- 13,706,491)
test b11_decode_eucjp_parse_marked       ... bench:   2,892,920 ns/iter (+/- 140,045)
test b12_decode_windows1251_parse_marked ... bench:   2,294,980 ns/iter (+/- 114,444)
test b13_utf8_parse_marked               ... bench:  10,147,583 ns/iter (+/- 6,388,081)
test b20_text_content                    ... bench:      60,622 ns/iter (+/- 2,442)
test b30_text_normalize_content          ... bench:   1,172,002 ns/iter (+/- 405,902)
test b31_text_normalize_content_identity ... bench:     253,619 ns/iter (+/- 35,972)

test b40_marked_parse_only               ... bench:  10,798,601 ns/iter (+/- 10,945,186)
test b41_marked_clean                    ... bench:  10,985,886 ns/iter (+/- 803,655)
test b42_ammonia_clean                   ... bench:  13,111,580 ns/iter (+/- 2,677,844)

I'd appreciate any and all constructive feedback here or via github!

CC: @SimonSapin @dont_tolerate_bigots @notriddle @lnicola @Ygg01

notriddle · March 16, 2020, 7:20pm

I'm not against this, on the condition that the marked repository is moved to an organization, and I get write access to it. I don't want such an important dependency to potentially wind up unmaintained. Yes, that's kind of rich considering that rcdom itself is unmaintained, but the biggest reason for making this jump is that we're switching off a legacy dependency onto a new one, so I'd like to focus on sustainability this time around.

cuviper · March 16, 2020, 7:23pm

FWIW, cargo treats different 0.x releases as semver-incompatible, so breaking changes are allowed. So don't think of 0.25 as minor, it's effectively a pre-1.0 major release.

dekellum · March 16, 2020, 7:27pm

Great, that would certainly match my stated goals at consolidation and vetting. Your requirements seem reasonable to me.

dekellum · March 16, 2020, 7:38pm

As I think you rightly assumed I was implying something by "jettisoned", IMO and at the very least, dropping RcDom was a maintainer-centric change as opposed to a user-centric change. Yes, according to SemVer, that it very fair thing to do in a pre-1.0.0 release. (An aside: what worries me more with servo and other places in rust, is the same willingness for breaking changes in post-1.0.0 PATCH releases!)

SimonSapin · March 16, 2020, 8:37pm

Thank you for taking this initiative!

There’s an open project for adding <meta charset> detection to html5ever: Implement HTML charset parsing project · servo/servo Wiki · GitHub Maybe efforts can be joined? CC @jdm

To clarify:

html5ever is part of the Servo project. I help co-maintain it as part of being in Mozilla’s Servo team but I wouldn’t call myself author.
I made kuchiki as a proof-of-concept after multiple people asked how html5ever and cssselect could be used together, but never intended to use it myself. I asked and a couple other people offered to help maintain it, but it’s never been a very active project after its initial implementation. Kuchiki is not directly affiliated to Servo, other than me being involved with both.
Victor is my personal side-project where I experiment with PDF generation. I only very occasionally spend time on it. I considered not publishing it at all, but I see the "No maintenance intended" badge worked as intended and I’m glad you found that code useful to fork as the basis for märkəd.

With my Servo team member hat on:

That is not what’s happening.

What we found to matter in practice is not what semver.org says but what Cargo does. If you have a dependency specified as foo = "1.3.7", running cargo update (or any Cargo command if there’s no Cargo.lock file or foo is not in it) will pick up the latest available version whose number is 1.x.y with (x, y) >= (3, 7). That is, 1.4.3 for example is considered to be compatible with 1.3.7. This much is compatible with semver.org. For bar = "0.3.7" however, Cargo will accept any 0.3.x version with x >= 7. So a 0.3.12 version would be considered compatible with 0.3.7. This is not what semver.org says. It says:

Major version zero (0.y.z) is for initial development. Anything MAY change at any time. The public API SHOULD NOT be considered stable.

From Cargo’s point of view, that would mean instead that no version at all would be compatible with 0.3.7, other than itself.

html5ever = { version=">=0.25.1, <2.0" }

Speaking of, I see that märkəd currently specifies a dependency as above. I recommend against this pattern. You’re basically stating that this version märkəd is compatible with future versions of html5ever that don’t exist yet, including 0.26.0, 0.73.5, and 1.0.0. However all of those (if they ever exist) have version numbers that express some breaking change from 0.25.1. Maybe you’ll be lucky and that breaking change is a part of the API that you don’t use, maybe not.

If you specify a dependency as html5ever = ">=0.25.1" then Cargo will do the right thing for html5ever’s version numbering policy, which I believe is widely accepted in the Rust ecosystem.

Compiling Servo involves around 600 crates. We very much feel the pain of managing updates from upstream dependencies. We take it seriously that breaking changes come with a version number that Cargo (not semver.org) considers incompatible (and consider avoiding them in the first place when reasonable).

That’s fair. But this move also reflected a reality that resources are finite. The Servo team is small and Servo · GitHub lists 170 repositories (though many of them are historical and not used anymore). I think officially dropping something is sometimes be better than letting PRs and issues go unanswered.

dekellum · March 16, 2020, 10:33pm

Would this include what the HTML5 Standard calls the "prescan algorithm"? The BOM handling in marked 0.1.0 is a subset of that, recommended by the Encoding Standard. Its possible that marked goals and servo's might be well enough aligned. @jdm ?

Thanks for all your work on html5ever and associated source, @SimonSapin! I'm wondering with your experience gained with victor and RcDom, would you be able to comment on strengths/weaknesses of these two DOM designs, at a high level? Do you see the RcDom in kuchiki continuing to be maintained with the same Rc<RefCell<_>> design?

Thanks for pointing this out. That snuck through from my early development tree (where I want all updates) and it was not my intention to release marked 0.1.0 that way. Ode to a rush to secure the crate name! Its a bug that I can now fix as soon as marked 0.1.1, this way:

-html5ever       = { version=">=0.25.1, <2.0" }
+html5ever       = { version=">=0.25.1, <0.26" }

And yes I know this is the same as "0.25.1", "^0.25.1" and "~0.25.1", but I'm allergic to magic operators with rules I can't entirely remember without consulting the reference! Also I reserve the right to make it ">=0.25.1, <0.27", if you release an html5ever 0.26.0 that is fully compatible with marked. The approach seems to work much better then the common rust wisdom would suggest.

SimonSapin · March 17, 2020, 12:08am

Yes that’s the idea.

RefCell has some downsides:

It can cause panics when a std::cell::RefMut or std::cell::Ref is kept on the stack longer than strictly necessary and you end up trying to borrow the same refcell again (perhaps in some deeply nested call).
It has space and time overhead for storing and manipulating the borrow flag. (Maybe it’s negligible, but that’s hard to tell without a benchmark and likely workflow-specific.) Cell does not have this overhead (and works with !Copy types these days!) but is less flexible.

To help with reduce such panics (by making refcell borrows shorter-lived) and because it felt more convenient Kuchiki does not literally use Rc<RefCell<Node>> but has individual fields in their respective cells. To reduce the multiplication of the space overhead it prefers Cell where possible/practical. There are some unsafe-using helpers to avoid the option dance (which would be 1. call Cell::<Option<T>>::take, 2. do something with &mut Option<T>, 3. put it back with Cell::set).

Downsides of indexing in a shared Vec:

Removing nodes. Victor never does it until it drops a whole document. With the naïve way to remove nodes from the tree they become "unreachable" but are not dropped and still use memory. You can fix the former by keeping Vec<Option<Node>> instead of Vec<Node>. But then you may want to find removed entries and reuse them, so a "empty" entries could form a free list like Vec<Slot> with enum Slot { Full(Node), Empty { next_empty_slot_index: Option<NodeId> }}. To reduce space overhead you could replace the enum with a union + a BitVec on the side, but then you run into unions with !Copy fields not being stable yet. And you risk "use after free" outdated indices somewhere that still point to a slot that has been emptied and then reused for a different node. To avoid that you can add a generation number to both slots and indices, which has its own space overhead…
(I feel there’s another one I wanted to mention, but I forget.)

Advantages of indexing:

Unlike Rc’s shared ownership sort of fighting against Rust’s ownership model, this takes full advantage of it: taking a &Document shared reference trivially makes the whole tree read-only, allowing safe parallel processing from multiple threads.
A single Vec is usually more compact in memory than separate heap allocations for each node, and reportedly has better performance through cache locality. But then maybe an allocator like jemalloc is already good at bucketing you still end up with good cache locality.

Both borrowing refcells and indexing a vec both have some ergonomic/convenience hit, I feel there’s no clear winner there.

Well no-one is actively working on Kuchiki at the moment, so it’s not likely to move to a different design soon.

dekellum · March 17, 2020, 1:27am

Thanks @SimonSapin! I've been going off of your prototype-code and the tea leaves of various github issues and commits across 3 repositories to arrive at the general-if-vague sense that the Vec<Node> approach has advantages. Your response is very helpful to understanding the progression and tradeoffs. Yes, marked::Document shares the same removal disadvantage. The marked::Node size on x86_64 is down from 120 bytes to 88 bytes so some savings there. I added a deep_clone to partially mitigate this, taking advantage of StrTendril being efficient to clone. It would probably be possible to offer a more elaborate in-place compaction strategy.

For märkəd I also looked into inlining attributes in the same or a single companion Vec<Attribute> but found that ultimately html5ever (or associated crate) allocates this Vec per-Element and moves it to the marked::html::Sink impl, so inlining is at best a copy.

dekellum · March 18, 2020, 1:00pm

Dependency fix:

dekellum · April 12, 2020, 7:02pm

As part of the just released marked 0.2.0, and as suggested above, I've added an in-place Document::compact method which is faster and more convenient for the common use case than the existing deep_clone. See benchmarks below, but here marked saves 2ms (22,862-20,816) on a round-trip parse+serialize vs rcdom. A compact on the same document after aggressive cleaning only costs 61µs (468,847-407,901) wall time. Thus I feel the previously discussed design downside ("Removing nodes. Victor never does it until it drops a whole document") is fully addressed by marked 0.2.0.

test b00_round_trip_rcdom                ... bench:  22,862,635 ns/iter (+/- 3,423,975)
test b01_round_trip_marked               ... bench:  20,816,274 ns/iter (+/- 2,614,777)
test b11_decode_eucjp_parse_marked       ... bench:   3,967,514 ns/iter (+/- 793,660)
test b12_decode_windows1251_parse_marked ... bench:   3,061,867 ns/iter (+/- 403,506)
test b13_utf8_parse_marked               ... bench:  12,794,733 ns/iter (+/- 1,884,035)
test b20_text_content                    ... bench:      76,949 ns/iter (+/- 13,433)
test b30_text_normalize_content          ... bench:   1,572,767 ns/iter (+/- 265,441)
test b31_text_normalize_content_identity ... bench:     330,224 ns/iter (+/- 60,921)
test b50_sparse_bulk_clone               ... bench:     407,901 ns/iter (+/- 55,255)
test b51_sparse_compact                  ... bench:     468,847 ns/iter (+/- 116,664)
test b52_sparse_deep_clone               ... bench:     525,031 ns/iter (+/- 80,292)

(video playing in the background concurrent with testing on same laptop)

system · July 11, 2020, 7:02pm

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
html5ever is on crates.io! announcements	8	2151	January 12, 2023
Html DOM and web apps: what is possible? help	1	391	July 7, 2022
Call for proposals for next Rust Doc Days crates!	17	4418	January 12, 2023
New CommonMark parser announcements	27	9902	January 12, 2023
What's everyone working on this week (29/2021)? community	9	1006	October 30, 2021

Märkəd (new crate release)

As a new ammonia crate backend or drop-in compatible replacement

Benchmarks

Related Topics