Introducing encoding_rs: A browser-oriented character encoding conversion library



encoding_rs is an implementation of the Encoding Standard written to replace the old character encoding library of Firefox (uconv) with something that is more correct, safer, faster (for browser-relevant workloads) and smaller, and that supports conversion to and from UTF-8 without pivoting through UTF-16 and without adding a second set of data tables.

encoding_rs has replaced uconv in Firefox Nightly (56 train). crates.io shows calamine, quick-xml and ripgrep as dependent crates.

API docs are on docs.rs.

UTF-8 and UTF-16

Supporting conversion to and from UTF-8 is primarily meant for use from Rust code. This makes the library usable by non-Firefox Rust programs.

Since Firefox is a UTF-16-based legacy C++ codebase, conversion to and from UTF-16 is supported, too, and the API has been designed to be FFI-friendly. Beyond documentation clutter, these aspects shouldn’t be a problem for pure-Rust programs that only need the UTF-8 facet.

Performance

The performance profile of encoding_rs is biased towards browser use cases. This means that

  1. ASCII performance is favored at the expense of non-ASCII when this makes HTML faster (ASCII for markup, potentially non-ASCII for natural language) even if this makes plain text performance for some languages slower than the competition.
  2. ISO-2022-JP, UTF-16BE and UTF-16LE (i.e. UTF-16 as an interchange encoding as opposed to an in-RAM representation) are at the bottom of the optimization TODO list and haven’t been optimized.
  3. Footprint is favored over speed for the non-ASCII parts of legacy encodings on the encoder side.
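
To illustrate what favoring ASCII can mean in practice, here is a minimal sketch of an ASCII fast path that checks one machine word at a time before falling back to a per-byte loop. It is not taken from encoding_rs (which can also use SIMD); the function name ascii_prefix_len is hypothetical, and only the standard library is assumed.

```rust
/// Returns the length of the leading run of ASCII bytes in `bytes`.
/// Hypothetical helper for illustration; not part of encoding_rs.
fn ascii_prefix_len(bytes: &[u8]) -> usize {
    const WORD: usize = std::mem::size_of::<usize>();
    // A word with the high bit of every byte set (0x8080...80).
    const HIGH_BITS: usize = usize::MAX / 255 * 0x80;

    let mut i = 0;
    // Fast path: examine a whole machine word per iteration.
    while i + WORD <= bytes.len() {
        let mut buf = [0u8; WORD];
        buf.copy_from_slice(&bytes[i..i + WORD]);
        let word = usize::from_ne_bytes(buf);
        if word & HIGH_BITS != 0 {
            // Some byte in this word is non-ASCII; let the slow path find it.
            break;
        }
        i += WORD;
    }
    // Slow path: finish byte by byte.
    while i < bytes.len() && bytes[i] < 0x80 {
        i += 1;
    }
    i
}
```

A decoder built this way can copy the ASCII prefix (typically markup) to the output unchanged and run the more expensive per-encoding logic only from the first non-ASCII byte onward.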

If you have use cases that involve encoding non-ASCII content to legacy encodings, the performance profile of encoding_rs is probably disappointing or very disappointing, but software should be using UTF-8 for output these days.

Correctness and Scope

encoding_rs is scoped strictly to the Encoding Standard. Non-Encoding Standard legacy encodings are explicitly out of scope.

As far as I am currently aware, encoding_rs implements the Encoding Standard correctly. (Please file bugs if this isn’t true!)

Name

It has been pointed out to me that the name doesn’t conform to preferred conventions. So far, it has seemed that renaming the crate would be more disruptive than leaving the less-than-ideal name in place.

Compared to rust-encoding

“How does this differ from rust-encoding?” is an obvious enough question that I should probably address it here.

The key differences are:

  • encoding_rs supports conversion to and from UTF-16 in addition to supporting conversion to and from UTF-8.
  • The streaming API in encoding_rs uses a slice for output instead of using a trait object for output. This is more efficient and more FFI-friendly.
  • Given enough output space and no errors, the decoders in encoding_rs always consume all input even if the input ends with an incomplete multi-byte sequence. (This makes things nicer for callers.)
  • For Web workloads at least, encoding_rs performs better than rust-encoding for decode and for encode to UTF-8. For non-ASCII encode to legacy encodings, rust-encoding performs better.
  • encoding_rs targets a newer snapshot of the Encoding Standard. (Some spec changes made after rust-encoding was created would be API-breaking changes for rust-encoding: The case of the encoding names has changed and error reporting for ISO-2022-JP encode changed in a way that’s not fully representable in the rust-encoding API.)
  • encoding_rs only supports one error replacement mode for decode and one unmappable replacement mode for encode: the modes defined in the Encoding Standard today. (It is possible to implement other modes on top of the lower-level API, though.)
  • rust-encoding supports non-Encoding Standard encodings and allows other crates to supply additional encodings using the same API. encoding_rs only supports encodings that are in the Encoding Standard and is intentionally not extensible (i.e. "&'static encoding_rs::Encoding doesn’t have characteristics that the encodings defined in the Encoding Standard don’t have" is part of the type-safety provided by encoding_rs).
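
The two streaming points above (slice output, and consuming all input even when it ends mid-sequence) can be sketched with a toy decoder. This is not the encoding_rs API; see the API docs for the real types. ToyDecoder, Status and decode_chunk are hypothetical names, and the toy format (big-endian 16-bit code units, with surrogates simply replaced by U+FFFD) is a simplification chosen to keep the example short.

```rust
#[derive(Debug, PartialEq)]
enum Status {
    InputEmpty, // all of `src` was consumed
    OutputFull, // `dst` lacks room for the next character
}

/// Toy streaming decoder: big-endian 16-bit units to UTF-8.
/// Hypothetical; surrogates are replaced with U+FFFD for brevity.
struct ToyDecoder {
    pending: Option<u8>, // first byte of a unit split across calls
}

impl ToyDecoder {
    fn new() -> Self {
        ToyDecoder { pending: None }
    }

    /// Writes into the caller's slice; returns (status, read, written).
    fn decode_chunk(&mut self, src: &[u8], dst: &mut [u8]) -> (Status, usize, usize) {
        let (mut read, mut written) = (0, 0);
        while read < src.len() {
            let hi = match self.pending {
                Some(b) => b,
                None => {
                    // Stash the first byte; this is how a trailing
                    // incomplete unit is still fully consumed.
                    self.pending = Some(src[read]);
                    read += 1;
                    continue;
                }
            };
            // One 16-bit unit needs at most 3 bytes of UTF-8.
            if dst.len() - written < 3 {
                return (Status::OutputFull, read, written);
            }
            let unit = u16::from_be_bytes([hi, src[read]]);
            let c = char::from_u32(unit as u32).unwrap_or('\u{FFFD}');
            written += c.encode_utf8(&mut dst[written..]).len();
            self.pending = None;
            read += 1;
        }
        (Status::InputEmpty, read, written)
    }
}
```

Because the output is a plain byte slice and the return value is a tuple of plain integers plus a status, a function of this shape is straightforward to expose over FFI, and the caller never has to carry leftover input bytes between calls.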

(When I thought that vendored crates in Firefox would need to see the rust-encoding API, I wrote a compatibility wrapper by forking rust-encoding and replacing the internals. At present, it’s not actually needed by Firefox, so I haven’t kept it up-to-date with encoding_rs changes. Its source code may still provide insight into the API differences.)

TODO

  • SIMD acceleration for Aarch64 is coming up. Some SSE2 improvements aren’t yet on crates.io pending that work. (SIMD acceleration requires nightly Rust.)
  • rustc and the simd crate need a bit of tweaking before SIMD support for 32-bit ARM can be added.
  • There is probably useless use of unsafe, so I should benchmark various unsafe removal options to see which uses of unsafe aren’t actually needed for performance.
  • The C++ standard library-based C++ wrapper and the C and C++ code samples aren’t quite up-to-date.
  • Parallel UTF-8 validation using rayon hasn’t been calibrated beyond one computer, so it’s still unclear if that feature is going to stay (and become default) or go away.
  • The UTF-8 validation code was forked from the standard library in order to add SIMD acceleration. Since then, there have been a couple of improvements to the standard library. Things around this area need to be synced.
  • Make UTF-16BE and UTF-16LE less slow.
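
For readers curious what parallel UTF-8 validation involves, here is a minimal sketch. It is not the crate's code: it uses scoped threads from the standard library in place of rayon, and parallel_is_utf8 and its parts parameter are hypothetical. The idea is to split the buffer only at bytes that are not UTF-8 continuation bytes, so that valid input is never cut in the middle of a character, and then to validate each chunk independently.

```rust
use std::thread;

/// True for bytes of the form 0b10xxxxxx.
fn is_continuation(b: u8) -> bool {
    (b & 0xC0) == 0x80
}

/// Validates `bytes` as UTF-8 using up to `parts` threads.
/// Hypothetical sketch; std::thread::scope stands in for rayon.
fn parallel_is_utf8(bytes: &[u8], parts: usize) -> bool {
    let parts = parts.max(1);
    let mut chunks: Vec<&[u8]> = Vec::with_capacity(parts);
    let mut start = 0;
    for i in 1..parts {
        // Start from an even split, then move forward past any
        // continuation bytes so the split is not mid-character.
        let mut split = (bytes.len() * i / parts).max(start);
        while split < bytes.len() && is_continuation(bytes[split]) {
            split += 1;
        }
        chunks.push(&bytes[start..split]);
        start = split;
    }
    chunks.push(&bytes[start..]);

    // The whole buffer is valid exactly when every chunk is valid.
    thread::scope(|s| {
        let handles: Vec<_> = chunks
            .iter()
            .map(|&chunk| s.spawn(move || std::str::from_utf8(chunk).is_ok()))
            .collect();
        handles.into_iter().all(|h| h.join().unwrap())
    })
}
```

Whether this beats single-threaded validation depends on input size and on the machine, which is the calibration question raised in the list above.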