I originally came to Rust primarily out of the need to replace an admittedly janky, poorly aging, and RAM-bloating Ruby+Java HTTP, HTML, and text processing stack with cleaner, safer, latest-spec-compliant, and efficient Rust libraries. The existence of the Servo project was an early encouragement for this journey, but many things associated with Servo seem to languish in a somewhat incomplete state, missing just a few pieces required for production use.
For example, html5ever is missing legacy encoding detection and recently jettisoned its default RcDom in the latest MINOR release. For a DOM, kuchiki (same authors) is rumored to have a better-maintained RcDom, but even issues there suggest a victor-like (same authors, "No Maintenance Intended") alternative: a Vec<Node> with index-based parent/child associations.
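For readers who haven't seen the approach, here is a minimal sketch of what a vector-backed DOM with index-based links can look like. This is not the victor or märkəd code; the names (NodeId, NodeData, Document, push, append) are made up for the illustration and the real crates differ in detail. The point is that nodes live in one Vec and refer to each other by index, so there is no Rc/RefCell bookkeeping.

```rust
/// Index of a node within `Document::nodes` (hypothetical name).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct NodeId(usize);

#[derive(Debug)]
pub enum NodeData {
    /// An element, identified here only by its tag name.
    Element(String),
    Text(String),
}

#[derive(Debug)]
pub struct Node {
    pub data: NodeData,
    // Links are plain indexes into the owning Vec, not pointers.
    pub parent: Option<NodeId>,
    pub first_child: Option<NodeId>,
    pub last_child: Option<NodeId>,
    pub next_sibling: Option<NodeId>,
}

#[derive(Debug, Default)]
pub struct Document {
    nodes: Vec<Node>,
}

impl Document {
    pub fn new() -> Self {
        Self::default()
    }

    /// Push a detached node and return its index.
    pub fn push(&mut self, data: NodeData) -> NodeId {
        self.nodes.push(Node {
            data,
            parent: None,
            first_child: None,
            last_child: None,
            next_sibling: None,
        });
        NodeId(self.nodes.len() - 1)
    }

    /// Append `child` as the last child of `parent`.
    pub fn append(&mut self, parent: NodeId, child: NodeId) {
        self.nodes[child.0].parent = Some(parent);
        let last = self.nodes[parent.0].last_child;
        match last {
            Some(last) => self.nodes[last.0].next_sibling = Some(child),
            None => self.nodes[parent.0].first_child = Some(child),
        }
        self.nodes[parent.0].last_child = Some(child);
    }

    /// Direct children of `parent`, in document order.
    pub fn children(&self, parent: NodeId) -> Vec<NodeId> {
        let mut out = Vec::new();
        let mut next = self.nodes[parent.0].first_child;
        while let Some(id) = next {
            out.push(id);
            next = self.nodes[id.0].next_sibling;
        }
        out
    }

    /// Concatenated text of the subtree rooted at `id`, depth first.
    pub fn text(&self, id: NodeId) -> String {
        let mut out = String::new();
        self.collect_text(id, &mut out);
        out
    }

    fn collect_text(&self, id: NodeId, out: &mut String) {
        if let NodeData::Text(t) = &self.nodes[id.0].data {
            out.push_str(t);
        }
        for child in self.children(id) {
            self.collect_text(child, out);
        }
    }
}
```

Building a tree is then a matter of calling push for each node and append to link it under a parent; all nodes are freed together when the Document is dropped.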
So I've forked victor and further optimized a vector-based DOM in marked. To märkəd (alt spelling) I've also added:
Heuristic-based legacy HTML character encoding hints and parser buffered restart. An estimated 5% of the web remains in encodings other than UTF-8. One (1) in 20 is far too common to treat as errors.
A Rust-idiomatic "selectors" API (similar to what the "current stack" has in Ruby)
A visitor-pattern, bulk mutating filter system (the real subject of my prior request for help: Save me from going
It's now released under a compatible MIT/Apache dual license. The above linked marked README gives a more complete feature overview.
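To make the "encoding hint plus buffered restart" idea concrete, here is a rough standard-library-only sketch. It is not how märkəd implements it: the Enc enum and the charset_hint, decode, and decode_with_restart functions are hypothetical names, only two encodings are handled, and the Latin-1 decode is strict ISO-8859-1 (a real implementation would use a proper encoding library and follow the WHATWG label mapping). The input bytes are kept in a buffer, decoded with the assumed encoding, and decoded again from the start if the document itself declares something else.

```rust
/// Encodings this sketch can decode (hypothetical, for illustration only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Enc {
    Utf8,
    Latin1,
}

/// Look for a `charset=` declaration near the start of already decoded text.
pub fn charset_hint(text: &str) -> Option<Enc> {
    // Only inspect the head of the document, case-insensitively.
    let head = text.chars().take(1024).collect::<String>().to_ascii_lowercase();
    let start = head.find("charset=")? + "charset=".len();
    let label: String = head[start..]
        .trim_start_matches(|c: char| c == '"' || c == '\'')
        .chars()
        .take_while(|c| c.is_ascii_alphanumeric() || *c == '-' || *c == '_')
        .collect();
    match label.as_str() {
        "utf-8" | "utf8" => Some(Enc::Utf8),
        "iso-8859-1" | "latin1" => Some(Enc::Latin1),
        _ => None, // unknown label: no usable hint
    }
}

/// Decode the whole buffer with the given encoding.
pub fn decode(buf: &[u8], enc: Enc) -> String {
    match enc {
        // Invalid sequences become U+FFFD rather than errors.
        Enc::Utf8 => String::from_utf8_lossy(buf).into_owned(),
        // ISO-8859-1 bytes map one-to-one onto U+0000..=U+00FF.
        Enc::Latin1 => buf.iter().map(|&b| b as char).collect(),
    }
}

/// Decode with `assumed`; if the document declares a different known
/// encoding, restart from the buffered bytes using that encoding instead.
pub fn decode_with_restart(buf: &[u8], assumed: Enc) -> (String, Enc) {
    let first_pass = decode(buf, assumed);
    match charset_hint(&first_pass) {
        Some(hint) if hint != assumed => (decode(buf, hint), hint),
        _ => (first_pass, assumed),
    }
}
```

The restart is only possible because the original bytes are still buffered; a streaming parser would need to hold on to the bytes it has consumed up to the point where the hint can appear.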
As a new ammonia crate backend or drop-in compatible replacement
Ammonia has a seriously high download count, so I want to highlight this use case…
Initially just for high-level comparative testing, the source tree's ./ammonia-compare directory includes an example and benchmarks that use a combination of built-in and custom filters to achieve the same byte output as ammonia::Builder::default settings. The märkəd implementation is slightly faster (most of the time is spent in the same html5ever parse; see the benchmarks below), small in lines of code, and easily customized.
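As a rough picture of what "a combination of filters" means for sanitizing, here is a toy sketch. It uses neither the märkəd filter API nor ammonia's actual default rules: the Node, Action, and Filter types, the apply function, and the two example filters are all assumptions made up for this illustration, and it uses a simple owned tree instead of a vector-based DOM to stay short. Each filter looks at one node and answers with an action, and a single depth-first pass applies all filters to all nodes.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Text(String),
    Element { tag: String, children: Vec<Node> },
}

/// What a filter wants done with the node it was shown (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Action {
    /// Keep the node as is.
    Continue,
    /// Remove the node and its whole subtree.
    Detach,
    /// Remove the node but keep its children in its place.
    Fold,
}

pub type Filter = fn(&Node) -> Action;

/// Apply all filters to all nodes in one pass, children before parents.
/// The first filter that returns something other than `Continue` wins.
pub fn apply(nodes: Vec<Node>, filters: &[Filter]) -> Vec<Node> {
    let mut out = Vec::new();
    for node in nodes {
        // Filter the subtree first so a folded parent yields clean children.
        let node = match node {
            Node::Element { tag, children } => Node::Element {
                tag,
                children: apply(children, filters),
            },
            text => text,
        };
        let action = filters
            .iter()
            .map(|f| f(&node))
            .find(|a| *a != Action::Continue)
            .unwrap_or(Action::Continue);
        match (action, node) {
            (Action::Detach, _) => {}
            (Action::Fold, Node::Element { children, .. }) => out.extend(children),
            // `Continue`, or `Fold` on a text node: keep the node.
            (_, node) => out.push(node),
        }
    }
    out
}

/// Example filter: drop script and style elements entirely.
pub fn detach_scripts(n: &Node) -> Action {
    match n {
        Node::Element { tag, .. } if tag == "script" || tag == "style" => Action::Detach,
        _ => Action::Continue,
    }
}

/// Example filter: unwrap any element that is not on a small allow list.
pub fn fold_unknown(n: &Node) -> Action {
    const ALLOWED: [&str; 4] = ["p", "em", "strong", "a"];
    match n {
        Node::Element { tag, .. } if !ALLOWED.contains(&tag.as_str()) => Action::Fold,
        _ => Action::Continue,
    }
}

/// Serialize back to markup. Text is assumed to be already escaped.
pub fn to_html(nodes: &[Node]) -> String {
    let mut s = String::new();
    for n in nodes {
        match n {
            Node::Text(t) => s.push_str(t),
            Node::Element { tag, children } => {
                s.push_str(&format!("<{}>{}</{}>", tag, to_html(children), tag));
            }
        }
    }
    s
}
```

Customizing the sanitizer in this model means adding or swapping a function in the filter list, rather than configuring a large builder.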
As a next step I'm CCing the Ammonia developers here, nominally as a heads-up and in case they might be interested in Ammonia switching to the marked::Document (off of a questionably maintained RcDom) or using märkəd as a backend. Otherwise, I think it would be fairly easy for someone to contribute a fully ammonia-compatible Builder API to the märkəd project, which could be released as a marked-sanitizer (now reserved) crate. Between all the crates linked in this post, many by the same authors, it would seem there should be some consolidation toward single, vetted, and trusted implementations. I know I could certainly use the help!
I'm providing these benchmarks from an oldish Intel i7-5600U laptop that may or may not have also been decoding full albums from YouTube in Firefox at the same time. Also, each benchmark runs against a single sample document. So this is definitely not scientific, but perhaps it will entice readers to test more and contribute:
rustc 1.43.0-nightly (564758c4c 2020-03-08)
test b00_round_trip_rcdom                 ... bench:  17,444,427 ns/iter (+/- 10,292,071)
test b01_round_trip_marked                ... bench:  16,550,426 ns/iter (+/- 13,706,491)
test b11_decode_eucjp_parse_marked        ... bench:   2,892,920 ns/iter (+/- 140,045)
test b12_decode_windows1251_parse_marked  ... bench:   2,294,980 ns/iter (+/- 114,444)
test b13_utf8_parse_marked                ... bench:  10,147,583 ns/iter (+/- 6,388,081)
test b20_text_content                     ... bench:      60,622 ns/iter (+/- 2,442)
test b30_text_normalize_content           ... bench:   1,172,002 ns/iter (+/- 405,902)
test b31_text_normalize_content_identity  ... bench:     253,619 ns/iter (+/- 35,972)
test b40_marked_parse_only                ... bench:  10,798,601 ns/iter (+/- 10,945,186)
test b41_marked_clean                     ... bench:  10,985,886 ns/iter (+/- 803,655)
test b42_ammonia_clean                    ... bench:  13,111,580 ns/iter (+/- 2,677,844)
I'd appreciate any and all constructive feedback, here or via GitHub!