Introducing encoding_rs: A browser-oriented character encoding conversion library

hsivonen · June 21, 2017, 1:05pm

encoding_rs is an implementation of the Encoding Standard for the purpose of replacing the old character encoding library of Firefox (uconv) with something that is more correct, safer, faster (for browser-relevant workloads) and smaller and that supports conversion to and from UTF-8 without pivoting through UTF-16 and without adding a second set of data tables.

encoding_rs has replaced uconv in Firefox Nightly (56 train). crates.io shows calamine, quick-xml and ripgrep as dependent crates.

API docs are on docs.rs.

UTF-8 and UTF-16

Supporting conversion to and from UTF-8 is primarily meant for use from Rust code. This makes the library usable by non-Firefox Rust programs.

Since Firefox is a UTF-16-based legacy C++ codebase, conversion to and from UTF-16 is supported, too, and the API has been designed to be FFI-friendly. Beyond documentation clutter, these aspects shouldn't be a problem for pure-Rust programs that only need the UTF-8 facet.

Performance

The performance profile of encoding_rs is biased towards browser use cases. This means that

ASCII performance is favored at the expense of non-ASCII when this makes HTML faster (ASCII for markup, potentially non-ASCII for natural language) even if this makes plain text performance for some languages slower than competition.
ISO-2022-JP, UTF-16BE and UTF-16LE (i.e. UTF-16 as interchange encoding as opposed to in-RAM representation) are at the bottom of the optimization TODO list and haven't been optimized.
Footprint is favored over speed for the non-ASCII parts of legacy encodings on the encoder side.

If you have use cases of encoding non-ASCII content to legacy encodings, the performance profile of encoding_rs is probably disappointing or very disappointing, but software should be using UTF-8 for output these days.
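
To make the ASCII bias concrete, here is a minimal sketch of an ASCII fast path using only the standard library. It is not the code encoding_rs uses (which relies on SIMD where available); the function names `ascii_prefix_len` and `split_ascii_prefix` are hypothetical and exist only for this illustration. The idea is to test eight bytes at a time for a set high bit, so that markup-heavy input is handled in bulk and only the non-ASCII remainder goes to a slower per-character path.

```rust
/// Mask with the high bit of every byte set. A byte is non-ASCII
/// exactly when its high bit is set.
const HIGH_BITS: u64 = 0x8080_8080_8080_8080;

/// Returns the length of the longest all-ASCII prefix of `bytes`.
/// (Hypothetical helper for illustration, not an encoding_rs API.)
fn ascii_prefix_len(bytes: &[u8]) -> usize {
    let mut offset = 0;
    // Bulk step: examine eight bytes per iteration.
    for chunk in bytes.chunks_exact(8) {
        let mut buf = [0u8; 8];
        buf.copy_from_slice(chunk);
        if u64::from_ne_bytes(buf) & HIGH_BITS != 0 {
            break; // some byte in this chunk is non-ASCII
        }
        offset += 8;
    }
    // Tail step: finish byte by byte. This also pinpoints the first
    // non-ASCII byte inside a chunk that failed the bulk check.
    while offset < bytes.len() && bytes[offset] < 0x80 {
        offset += 1;
    }
    offset
}

/// Splits input into its ASCII prefix (usable as `&str` without
/// conversion) and the remainder that needs a real decoder.
fn split_ascii_prefix(bytes: &[u8]) -> (&str, &[u8]) {
    let (ascii, rest) = bytes.split_at(ascii_prefix_len(bytes));
    // ASCII bytes are valid UTF-8, so this cannot fail.
    (std::str::from_utf8(ascii).expect("ASCII is valid UTF-8"), rest)
}
```

For input such as `b"<b>caf\xE9</b>"`, `split_ascii_prefix` hands back `"<b>caf"` as the bulk-copied part and leaves the bytes starting at `0xE9` for the encoding-specific decoder.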

Correctness and Scope

encoding_rs is scoped strictly to the Encoding Standard. Non-Encoding Standard legacy encodings are explicitly out of scope.

As far as I am currently aware, encoding_rs implements the Encoding Standard correctly. (Please file bugs if this isn't true!)

Name

It has been pointed out to me that the name doesn't conform to preferred conventions. So far, it has seemed that renaming the crate would be more disruptive than letting the unideal name be.

Compared to rust-encoding

"How does this differ from rust-encoding?" is obvious enough a question that I should probably address it here.

The key differences are:

encoding_rs supports conversion to and from UTF-16 in addition to supporting conversion to and from UTF-8.
The streaming API in encoding_rs uses a slice for output instead of using a trait object for output. This is more efficient and more FFI-friendly.
Given enough output space and no errors, the decoders in encoding_rs always consume all input even if the input ends with an incomplete multi-byte sequence. (This makes things nicer for callers.)
For Web workloads at least, encoding_rs performs better than rust-encoding for decode and for encode to UTF-8. For non-ASCII encode to legacy encodings, rust-encoding performs better.
encoding_rs targets a newer snapshot of the Encoding Standard. (Some spec changes made after rust-encoding was created would be API-breaking changes for rust-encoding: The case of the encoding names has changed and error reporting for ISO-2022-JP encode changed in a way that's not fully representable in the rust-encoding API.)
encoding_rs only supports one error replacement mode for decode and one unmappable replacement mode for encode: the modes defined in the Encoding Standard today. (It is possible to implement other modes on top of the lower-level API, though.)
rust-encoding supports non-Encoding Standard encodings and allows other crates to supply additional encodings using the same API. encoding_rs only supports encodings that are in the Encoding Standard and is intentionally not extensible (i.e. "&'static encoding_rs::Encoding doesn't have characteristics that the encodings defined in the Encoding Standard don't have" is part of the type-safety provided by encoding_rs).
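
The points above about slice-based output and consuming all input can be sketched with a toy decoder. This is not encoding_rs code: `ToyUcs2BeDecoder` and `Status` are hypothetical names, the encoding is a simplified big-endian 16-bit format (BMP only, surrogates become U+FFFD), and end-of-stream handling is left out. The return shape of (status, bytes read, bytes written) is loosely modeled on the streaming decoder methods in encoding_rs.

```rust
/// Why the decoder returned. (Hypothetical type for this sketch.)
#[derive(Debug, PartialEq)]
enum Status {
    InputEmpty,
    OutputFull,
}

/// Toy streaming decoder for big-endian 16-bit code units.
struct ToyUcs2BeDecoder {
    /// First byte of a code unit that was split across two input buffers.
    pending: Option<u8>,
}

impl ToyUcs2BeDecoder {
    fn new() -> Self {
        Self { pending: None }
    }

    /// Decodes `src` into the caller-provided slice `dst` as UTF-8.
    /// Returns (status, bytes read from `src`, bytes written to `dst`).
    fn decode_to_utf8(&mut self, src: &[u8], dst: &mut [u8]) -> (Status, usize, usize) {
        let mut read = 0;
        let mut written = 0;
        while read < src.len() {
            let lead = match self.pending {
                Some(b) => b,
                None => {
                    // Stash the first half of a unit inside the decoder, so
                    // a trailing partial unit still counts as consumed.
                    self.pending = Some(src[read]);
                    read += 1;
                    continue;
                }
            };
            // A BMP character needs at most three UTF-8 bytes.
            if dst.len() - written < 3 {
                return (Status::OutputFull, read, written);
            }
            let unit = u16::from_be_bytes([lead, src[read]]);
            read += 1;
            self.pending = None;
            // Surrogates are not valid scalar values: emit U+FFFD instead.
            let c = char::from_u32(u32::from(unit)).unwrap_or('\u{FFFD}');
            written += c.encode_utf8(&mut dst[written..]).len();
        }
        (Status::InputEmpty, read, written)
    }
}
```

Because the half-finished unit lives in the decoder rather than in the caller's buffer, feeding `[0x00, 0xE9, 0x00]` reports all three bytes as read, and a following call with `[0x41]` completes the second character. The caller never has to carry unconsumed input over to the next call. Writing into a plain `&mut [u8]` also means the same function shape can be exposed over FFI as a pointer and a length.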

(When I thought that vendored crates in Firefox would need to see the rust-encoding API, I wrote a compatibility wrapper by forking rust-encoding and replacing the internals. At present, it's not actually needed by Firefox, so I haven't kept it up-to-date with encoding_rs changes. Its source code may still provide insight into the API differences.)

TODO

SIMD acceleration for Aarch64 is coming up. Some SSE2 improvements aren't yet on crates.io pending that work. (SIMD acceleration requires nightly Rust.)
rustc and the simd crate need a bit of tweaking before SIMD support for 32-bit ARM can be added.
There is probably useless use of unsafe, so I should benchmark various unsafe removal options to see which uses of unsafe aren't actually needed for performance.
The C++ standard library-based C++ wrapper and the C and C++ code samples aren't quite up-to-date.
Parallel UTF-8 validation using rayon hasn't been calibrated beyond one computer, so it's still unclear if that feature is going to stay (and become default) or go away.
The UTF-8 validation code was forked from the standard library in order to add SIMD acceleration. Since then, there have been a couple of improvements to the standard library. Things around this area need to be synced.
Make UTF-16BE and UTF-16LE less slow.

hsivonen · December 3, 2018, 1:29pm

Some things have happened since the original announcement that seem worth mentioning.

Articles

A long article covering background, API design, internals, and performance (but not C or C++ integration)
An article about the C++ integration (Most of the content was already covered in my talk at RustFest Paris back in May.)
A brief article about using model-based testing and cargo-fuzz to gain confidence in the correctness of the less reviewable code in encoding_rs::mem

Performance Improvements

There is now SIMD acceleration (still using nightly Rust features) support for NEON (both on ARMv7 and aarch64) in addition to SSE2 acceleration.
UTF-8 validation and UTF-8 to UTF-16 decoding are faster than before.
UTF-16LE and UTF-16BE decode is fast.
rayon code has been removed, because it was a pessimization for ASCII. (ASCII validation is memory-bandwidth-bound and not compute-bound.)
x-user-defined to UTF-16 decode is faster.
Encode to Latin1-like encodings and non-Latin single-byte legacy encodings is faster in exchange for a minimal (32 bits per single-byte encoding) footprint increase. (Encode performance to non-Latin1-like Latin single-byte legacy encodings is unchanged.)
There are compile-time options to make encode to CJK legacy encodings faster (tradeoff between footprint and speed).

Companion Crates

encoding_rs_io (by @BurntSushi) provides decoder integration with the standard-library Read trait.
charset provides Thunderbird-compatible (non-streaming) character encoding decoding for email by adding UTF-7 decoding support.
codepage maps between Windows code page identifiers and encoding_rs Encodings.
encoding_rs::mem provides conversions between in-RAM text representations for applications that, in addition to Rust code, need to deal with the 1990s C or C++ legacy of using UTF-16 and Latin1 as in-RAM Unicode representations. It should logically be a crate but is a module due to implementation details.
encoding_c provides C bindings and sample C++ bindings on top of the C bindings for (the non-::mem part of) encoding_rs.
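
As a rough illustration of the kind of in-RAM conversion that encoding_rs::mem covers, the sketch below converts between Latin1 and UTF-8 or UTF-16 using only the standard library. It assumes "Latin1" means that each byte value is the Unicode code point of the same value (U+0000 to U+00FF), which is not the same thing as windows-1252. The function names are hypothetical and chosen for this example; the real module is optimized and these one-liners are not.

```rust
/// Latin1 to UTF-16: every byte widens to one 16-bit code unit.
fn latin1_to_utf16(src: &[u8]) -> Vec<u16> {
    src.iter().map(|&b| u16::from(b)).collect()
}

/// Latin1 to UTF-8: every byte becomes the `char` with the same value.
/// Bytes 0x80..=0xFF take two bytes each in the output.
fn latin1_to_utf8(src: &[u8]) -> String {
    src.iter().map(|&b| char::from(b)).collect()
}

/// UTF-8 to Latin1: returns `None` if any character is above U+00FF
/// and therefore has no Latin1 representation.
fn utf8_to_latin1(src: &str) -> Option<Vec<u8>> {
    src.chars()
        .map(|c| {
            let cp = c as u32;
            if cp <= 0xFF {
                Some(cp as u8)
            } else {
                None
            }
        })
        .collect()
}
```

With these definitions, `latin1_to_utf8(b"caf\xE9")` yields `"café"`, and `utf8_to_latin1("€")` yields `None` because U+20AC is outside the Latin1 range.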

TODO

Moving from the simd crate to the packed_simd crate and then to std::simd. The main problem at present is that unless the standard library for 32-bit ARM is compiled with NEON enabled, this regresses performance on 32-bit ARM.
Reducing the use of unsafe by migrating to packed_simd, align_to and chunks_exact.
