Introducing UNIC: Unicode and Internationalization Crates for Rust

A few months ago, I started learning Rust while working on a new project around text rendering, and soon realized that not only is working with strings one of the not-so-obvious parts of Rust, but there are also few libraries available for text processing. As the Bidi algorithm was one of the very first things I needed, I went to the implementation by the Servo project (https://github.com/servo/unicode-bidi/), which had some limitations. As I read through the Rust book and other learning materials, I started improving the unicode-bidi package: fixing little bugs and doing cleanups first, then adding conformance tests from the reference, and then improving conformance from about 70% to 99%. So, that's how I learned Rust.

Then, having become a bit more comfortable with Rust, I wanted to look at other Unicode libraries, which were scattered all over the place, each with its own limitations and customized test methods, or not much testing at all. In one case, fixing conformance-test bugs in the rust-url/idna package led me to a bug in the data-generation script in the unicode-normalization package, and finishing the original task cleanly easily took a few weeks. Doing all this, I noticed some difficulties in expanding existing functionality, especially:

  • all the development steps that slow down the work, even when you have a clear set of conformance tests, because the code sits in different repositories with different timelines,

  • the tooling being customized, with plenty of copy-pasted code, and sometimes no testing of the auto-generated outputs.

With that, I ended up setting up a new project, UNIC: Unicode and Internationalization Crates for Rust, which addresses most of these issues, and hopefully some others. (More on the project below.)

The project has just been born, but most of the code comes from existing projects (see the links in the README file of each module), with some refactoring, more modularization, and added tests. So, from a code-quality perspective, it's fairly stable and ready to use. But for the package structure and API, expect a faster pace of change, as expanding the functionality here is the top priority.

Earlier today, I released version 0.1 with the existing version of the Unicode data. This afternoon the new Unicode 10.0.0 was released, and thanks to the improved tooling and the integration tests between the components, I was able to upgrade to the new release in just a few short steps.

I would love to hear your comments about the project, especially the new things here, like:

  • having a super-crate with not only dependencies on all the components, but also pub extern crate re-exports of the major ones, which in turn do the same for their child components (see the first sketch after this list),

  • splitting existing crates into much smaller ones, to enable more control over which data tables get pulled in,

  • auto-generating .rsv files from Python, which are expected to be Rust expressions, and using them in mod.rs files to bring in the generated data, therefore limiting generation tasks to only dumping an int/tuple/string value, or a table (and no need to turn off fmt on .rs files anymore; see the second sketch after this list!)
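To make the super-crate idea concrete, here is a minimal sketch of the re-export pattern, assuming illustrative crate and module names (the actual component names are listed in the project README):

```rust
// lib.rs of the `unic` super-crate: depend on the major components and
// re-export them, so users can pull in one crate and reach everything
// through a single namespace. Crate names below are illustrative.
pub extern crate unic_bidi as bidi;
pub extern crate unic_normal as normal;
pub extern crate unic_ucd as ucd;

// A major component does the same for its own children, e.g. in the
// `unic-ucd` crate's lib.rs:
// pub extern crate unic_ucd_bidi as bidi;
// pub extern crate unic_ucd_normal as normal;
```

With this, a downstream crate can depend on the single super-crate and write paths like `unic::ucd::...`, or depend directly on the smaller component crates when it only needs a few data tables.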
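And a sketch of the .rsv idea: the Python generator dumps a single Rust expression per file, and a mod.rs pulls it in with the standard include! macro. The file names and table shapes here are placeholders, not the exact generated data:

```rust
// mod.rs of a data crate: each `.rsv` file contains exactly one Rust
// expression (a value or a table), so the generator only has to dump a
// literal, and fmt never needs to be turned off on generated sources.

// e.g. unicode_version.rsv contains: (10, 0, 0)
pub const UNICODE_VERSION: (u16, u16, u16) = include!("unicode_version.rsv");

// e.g. alphabetic.rsv contains: &[('\u{aa}', '\u{aa}'), ('\u{b5}', '\u{b5}'), /* ... */]
pub const ALPHABETIC_RANGES: &'static [(char, char)] = include!("alphabetic.rsv");
```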

I'll be busy with this project for the next few months (since I need some of the stuff I'm developing here for my other project), and I'm hoping that we can expand the community investment in this area (Unicode and Internationalization) and that this project can help along that path.


UNIC Logo

About UNIC

UNIC is a project to develop components for the Rust programming language that provide high-quality and easy-to-use crates for Unicode and Internationalization data and algorithms. In other words, it's like ICU for Rust, written completely in Rust, mostly in safe code, but also benefiting from the performance gains of unsafe code when possible.

Project Goal

The goal for UNIC is to provide access to all levels of Unicode and Internationalization functionality: from Unicode character properties, to Unicode algorithms for processing text, to more advanced (locale-based) processes based on the Unicode Common Locale Data Repository (CLDR).

Other standards and best practices, like IETF RFCs, are also implemented as needed by the Unicode/CLDR components, or by common demand.

Design Goals

  1. The primary goal of UNIC is to provide reliable functionality by way of an easy-to-use API. Therefore, newly added components may not be well optimized for performance, but will have enough tests to show conformance to the standard, and examples to show users how they can address common needs.

  2. The next major goal for UNIC components is performance and a low binary and memory footprint. Especially, optimizing the runtime for ASCII and other common cases will encourage adoption without fear of slowing down regular development processes. (A sketch of such an ASCII fast path follows this list.)

  3. Components are guaranteed, to the extent possible, to provide consistent data and algorithms. Cross-component tests are used to catch any inconsistency between implementations, without slowing down the development processes.
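As an illustration of the second goal, here is a minimal sketch of the kind of ASCII fast path meant above; the function and table names are placeholders, not the actual UNIC API:

```rust
/// Illustrative sketch (not the actual UNIC API): a property lookup that
/// short-circuits for ASCII before falling back to a binary search over a
/// generated range table.
fn is_alphabetic(ch: char) -> bool {
    if ch.is_ascii() {
        // Hot path: answer for ASCII without touching the data tables.
        return ch.is_ascii_alphabetic();
    }
    // Cold path: binary search in (start, end) ranges generated from the
    // UCD; `ALPHABETIC_TABLE` is a placeholder with only sample entries.
    ALPHABETIC_TABLE
        .binary_search_by(|&(start, end)| {
            if ch < start {
                std::cmp::Ordering::Greater
            } else if ch > end {
                std::cmp::Ordering::Less
            } else {
                std::cmp::Ordering::Equal
            }
        })
        .is_ok()
}

const ALPHABETIC_TABLE: &'static [(char, char)] = &[
    // ... generated ranges would go here ...
    ('\u{aa}', '\u{aa}'),
    ('\u{b5}', '\u{b5}'),
];
```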

Read more on the project homepage: https://github.com/behnam/rust-unic

