Introducing UNIC: Unicode and Internationalization Crates for Rust

A few months ago, I started learning Rust while working on a new project around text rendering, and soon realized not only that working with strings is one of the not-so-obvious parts of Rust, but also that few libraries are available for text processing. As the Bidi algorithm was one of the very first things I needed, I went to the implementation by the Servo project (https://github.com/servo/unicode-bidi/), which had some limits. As I read along the Rust book and other learning materials, I started improving the unicode-bidi package: fixing little bugs and doing cleanups first, then adding conformance tests from the reference, and then improving conformance from about 70% to 99%. So, that's how I learned Rust.

Then, I got a bit comfortable with Rust and wanted to look at the other Unicode libraries, which were scattered all over the place, each with its own limits and customized test methods, or not many tests at all. In one case, fixing conformance-test bugs in the rust-url/idna package led me to a bug in the data generation script of the unicode-normalization package, and finishing the original task cleanly took easily a few weeks. Doing all this, I noticed some difficulties in expanding existing functionality, especially:

  • all the development steps that slow down the work, even when you have a clear set of conformance tests, because the code sits in different repositories with different timelines,

  • the tooling being customized per repository, with plenty of copy-pasted code, and sometimes no testing of the auto-generated outputs.

With that, I ended up setting up a new project, UNIC: Unicode and Internationalization Crates for Rust, which addresses most of these issues, and hopefully some others. (More on the project below.)

The project is newly born, but most of the code comes from existing projects (links in the README file of each module) with some refactoring, more modularization, and added tests. So, from the code-quality perspective, it's fairly stable and ready to use. But for the package structure and API, expect a faster pace of change, as expanding the functionality here is the top priority.

Earlier today, I released the 0.1 version with the existing version of Unicode data. This afternoon the new Unicode 10.0.0 was released, and because of the improved tooling, and the integration tests between the components, I was able to upgrade to the new release in just a few short steps.

I would love to hear back your comments about the project, especially about the new things here, like:

  • having a super-crate that not only depends on all the components, but also pub extern crates the major ones, which in turn do the same for their child components (see the sketch after this list).

  • splitting existing crates into much smaller ones, to enable more control over which data tables get pulled in.

  • auto-generating .rsv files from Python, which are expected to be Rust expressions, and using them in mod.rs files to bring in generated data, therefore limiting generation tasks to only dumping an int/tuple/string value, or a table (and no need to turn off rustfmt on .rs files anymore!).
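To make the first and last points concrete, here is a minimal sketch of both patterns; the crate, constant, and file names here are illustrative, not UNIC's actual identifiers:

    // lib.rs of the `unic` super-crate: not just depending on the major
    // components, but re-exporting them, so paths like unic::bidi::... work.
    pub extern crate unic_bidi as bidi;
    pub extern crate unic_ucd as ucd;

    // mod.rs of a data component: the generated .rsv file contains a single
    // Rust expression, so the Python tooling only ever dumps a value or a
    // table. Example contents of bidi_class_ranges.rsv:
    //     &[('\u{0}', '\u{8}', "BN"), ('\u{9}', '\u{9}', "S")]
    pub const BIDI_CLASS_RANGES: &'static [(char, char, &'static str)] =
        include!("bidi_class_ranges.rsv");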

I'll be busy with this project for the next few months (since I need some of the stuff I'm developing here for my other project), and I'm hoping that we can expand the community investment in this area, Unicode and Internationalization, and that this project can be of help on that path.


[UNIC logo]

About UNIC

UNIC is a project to develop components for the Rust programming language that provide high-quality and easy-to-use crates for Unicode and Internationalization data and algorithms. In other words, it's like ICU for Rust, written completely in Rust, mostly in safe code, but also benefiting from the performance gains of unsafe code when possible.

Project Goal

The goal for UNIC is to provide access to all levels of Unicode and Internationalization functionality, starting from Unicode character properties, to Unicode algorithms for processing text, and up to more advanced (locale-based) processes based on the Unicode Common Locale Data Repository (CLDR).

Other standards and best practices, like IETF RFCs, are also implemented as needed by the Unicode/CLDR components, or by common demand.

Design Goals

  1. The primary goal of UNIC is to provide reliable functionality by way of an easy-to-use API. Therefore, newly added components may not be well optimized for performance, but they will have enough tests to show conformance to the standard, and examples to show users how they can be used to address common needs.

  2. The next major goal for UNIC components is performance and low binary and memory footprints. In particular, optimizing runtime for ASCII and other common cases will encourage adoption without fear of slowing down regular development processes.

  3. Components are guaranteed, to the extent possible, to provide consistent data and algorithms. Cross-component tests are used to catch any inconsistency between implementations, without slowing down development processes.

Read more on the project homepage: https://github.com/behnam/rust-unic



Thank you for taking the lead on this. I'd been thinking for a while that it would be good to centralize the Rust crates for Unicode data. The Python script that parses text files from Unicode and generates Rust tables originated in Rust's standard library, was copied into several repositories, and each copy grew organically. Some improvements and fixes were made to some copies but not others. As you pointed out, some crates have been updated to new Unicode versions but not others.

However, it looks like UNIC includes not only data tables, but also algorithms that use these tables (and even some, like Punycode, that do not), copied from existing crates. None of that code was especially in need of being taken over.

So you’re forking multiple crates, for no apparent reason, without first discussing it with their respective maintainers. While you are legally allowed to do so by the respective open-source licenses, forking is often considered a hostile move, to be kept as a last resort.

I still think it would be good to discuss how UNIC and all these pre-existing projects can work together, but duplicating everything unilaterally is not a great start.


I am working on this haphazardly: https://github.com/BurntSushi/rucd. It's very much a WIP, but the intent is to build a ucd-parse crate that parses the UCD with no additional smarts, and a ucd-generate crate that can produce Rust source code containing various data tables. I intend the format of the tables to be configurable based on the data, e.g., the normal sequences that one uses binary search over, @raphlinus's trie approach, or even FSTs (which are useful for representing the set of all Unicode names compactly). I don't have any plans to implement the various algorithms, but would instead like to see the other existing Unicode crates adopt these tools over the various Python scripts. But it's not ready yet. :) (This entire endeavor was motivated by me wanting to provide more principled Unicode support in the regex crate, which in turn means a more principled solution for representing a good portion of all Unicode data.)

I do think Rust's story on Unicode needs more work, and I would like to see something more cohesive than what we have now. :) But I'm hoping we can all collaborate towards that!


In short, because upgrading to newer versions of Unicode (data and algorithms) should be an intentional choice by developers.

To get more philosophical, the main reason is the fact that many systems allow unassigned Unicode code points to enter, get processed, stored, and later returned. With every new version of Unicode, some of those unassigned code points change status and become new characters, with character properties different from the defaults assigned to them before. This instantly invalidates all the processing done in the system prior to storage. So, it's basically up to the system to decide when to take this step, and whether it should happen without any re-processing of stored data. Or, in other words, it's up to every system to decide what Unicode version it works with.

If a system blocks/drops unassigned characters at its boundaries (as suggested by many specs, like the "Network Unicode" RFC), then it's most probably safe, with respect to security, to auto-upgrade, because of the promises provided by Unicode's Stability Policies. But, even then, those policies don't (and can't) cover all the algorithms, and it's, again, up to the system designers to evaluate whether they are storing anything that needs to be updated, even if it's not a security matter.
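For illustration, a minimal sketch of that boundary policy, assuming some is_assigned predicate backed by the Unicode version the system was built with (not an actual UNIC API):

    // Drop code points that are unassigned in "our" Unicode version, so
    // nothing stored can silently change meaning on a later Unicode upgrade.
    fn sanitize<F: Fn(char) -> bool>(input: &str, is_assigned: F) -> String {
        input.chars().filter(|&c| is_assigned(c)).collect()
    }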

I should add that what I'm thinking right now is that when CLDR data and algorithms are added to the repo, the same rule will apply to them as well. So, if you get UNIC version X, it's explicitly Unicode version Y and CLDR version Z, and Y and Z won't change until you change X.

What do you think?

One of the great things about the Rust crates ecosystem is that it's very modular, so having many different crates is great. The problem with the Unicode crates is that they sometimes need to extract the same data, they all pin a specific version of the Unicode data, and they need to be updated regularly.

@BurntSushi’s rucd, or some library to manipulate this data, might be a solution to the first problem. I’m not sure what to do about the second: a Unicode organization on GitHub, or setting up a way to ping maintainers to update to the latest Unicode version?

We could theoretically have unicode-<something>-data crates and then crates that depend on them, but it might be limiting (for example, I’m working on a reimplementation of to_uppercase(), and generating a match with hardcoded values is slightly faster than using binary_search() on a static slice; both approaches are sketched below).
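For example, a minimal sketch of the two approaches over a toy three-entry mapping (real generated tables would cover the full case-mapping data):

    // (a) Binary search over a sorted static table.
    static UPPERCASE_TABLE: &'static [(char, char)] =
        &[('a', 'A'), ('b', 'B'), ('\u{B5}', '\u{39C}')]; // µ -> Μ

    fn to_upper_by_search(c: char) -> Option<char> {
        UPPERCASE_TABLE
            .binary_search_by_key(&c, |&(lower, _)| lower)
            .ok()
            .map(|i| UPPERCASE_TABLE[i].1)
    }

    // (b) A generated match with hardcoded values, which the compiler can
    // lower to a decision tree or jump table.
    fn to_upper_by_match(c: char) -> Option<char> {
        match c {
            'a' => Some('A'),
            'b' => Some('B'),
            '\u{B5}' => Some('\u{39C}'),
            _ => None,
        }
    }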

Isn’t that what the private-use area is designed for? I don’t think we should take into account such wrong use of Unicode.

Thank you, @SimonSapin, for your work in this area over the past many years! :) And for the comments here. Good points and valid concerns.

Right, I've given the Python scripts a full rewrite here, to improve maintainability and allow reuse across all the new components that I want to add to the repo. I wouldn't have been able to get this out this fast without reusing the existing code.

Although the current code doesn't yet show why it's needed, I think there's good reason to keep data and algorithms in one repository.

Let's look at the Bidi example. The original unicode-bidi crate has had both data and algorithm in one crate, which totally makes sense for a crate providing functionality in a clear, kind-of-standalone area. Since the data and algorithm sit next to each other, it's easier to do many common development tasks, like measuring and improving performance, revising the API, etc.

The Bidi character data has a lot in common with other Unicode data, and the Bidi algorithm shares a lot with the provider of the Bidi character data, and it's all this interconnection that made me start experimenting with this new code structure.

Yes, we can keep the algorithms in separate repos, but then we get a cross-dependency: something like improving performance in the Bidi algorithm will depend on changes in the Unicode data crate being shipped first, then used in the algorithm crate. This is still possible, but misses good things, like being able to actually see what happens when you add caching to the Bidi property accessor functions from the scope of the Unicode data crate. In practice, there will be two separate PRs: the first one making the changes, and the second one, which is only a dependency version bump, actually showing the perf improvements.

IMHO, that's an obstacle to improving the current state of Unicode support, and I think this experiment, putting them together in one repo, can show the benefits of doing so. The part that makes me very happy is that it makes almost no difference from the perspective of component users, thanks to the abstraction provided by Rust and Cargo. So, in my view, it's a lot gained with almost nothing lost.

Yes, it's kind of a fork, but not a hostile one. Please let me explain, in two areas:

  1. Code base. I had some ideas that I wanted to see in these libraries, but wanted to have a prototype first, and UNIC is that prototype. This is the reason for duplicating a lot of code, but I think we need to experiment with these ideas.
    My current plan is to keep the code in sync, as I'm leading maintenance on the standalone repos as well. The modularization is a bit different and the data generation scripts are rewritten, which makes syncing a bit expensive for a while, but I think we can merge the efforts soon.

  2. Project and community organization. This is the tricky one. I actually didn't intend to start a new project/community. What is referred to as "the UNIC Project", in its organizational sense, is just a placeholder. I didn't want to put my own name or my firm's on every source file, and couldn't use any of the existing orgs in the Rust community, so I just created one.
    @dont_tolerate_bigots already suggested putting UNIC under unicode-rs, which sounds like a great idea to me! I think we should make this decision together, so I'm going to post a separate comment about that.

Definitely! In my view, this post is the start of that conversation with you, all the other maintainers/contributors/users of the standalone Unicode-related repos, and the community at large. Now I realize that I forgot to point this out in my original post. My bad.


I'm not talking about private use here, but about a legitimate Unicode character assigned in version X being given as input to a system working with version Y, where X > Y.

For example, :star_struck: (U+1F929 GRINNING FACE WITH STAR EYES), which just became a character yesterday (literally), is now being posted to users.rust-lang.org, which I'm sure runs Unicode < 10.0.0, and getting into your browser, which again runs Unicode < 10.0.0. I'm 99% sure it's making it there: you see a tofu (the outline square) or something similar for it now, and you will see it as a face in just a few weeks/months, when your browser/system is updated to Unicode 10.

The whole point of allowing unassigned code points to pass through systems is exactly this: forward compatibility, at some cost.

And I should point out that, with big version bumps on Unicode/CLDR upgrades, library users can still get auto-updates (if desired) with a "unic: >=V" version requirement. They only need to make sure they catch breaks caused by API changes fast enough, and update to "unic: >=V, <W" when that happens.

You were talking about using unassigned code points for special actions, which is clearly broken by design. Anyway, it’s probably good to review changes in Unicode when upgrading.

On finding a home for UNIC...

@dont_tolerate_bigots suggested on Twitter putting the UNIC repo under unicode-rs. I have seen various posts here and on GitHub about various Unicode-related efforts, but one of the things I haven't figured out yet is what unicode-rs is (just a GitHub org, or kind of an official group of repos by the core team?) and what the plan is. The GitHub org page and the public https://unicode-rs.github.io/ page don't say much. Can someone share more about it?

Regarding devops, it would be great to be able to use existing infra from the Rust and Servo projects. UNIC is mostly self-sustained, so it's flexible in that respect. (I suppose it's a given that we don't want the development goals to be limited by any such larger project, though.)

So, @dont_tolerate_bigots, @SimonSapin, everyone else, what do you think?


unicode-rs is a GitHub org: https://github.com/unicode-rs. Today it mostly hosts crates that were extracted from the standard library when we removed a bunch of things there in the months before Rust 1.0. The current maintainers (or at least the people with write access on GitHub) are those listed below, Huon Wilson, and myself.

@alexcrichton, @kwantam, @Manishearth, what do you think? What should be the relationship of UNIC and unicode-rs?

Personally I don’t care much whether things are under github.com/unicode-rs, github.com/unic, or something else. We can move repos easily enough; GitHub is good at adding redirects.


I hadn't seen rucd, @BurntSushi! Like the idea!

About ucd-parse and ucd-generate: although having the generators in Rust is more verbose (compared to Python) and can slow down that part of development, they can speed up the conformance-testing step. And it looks like your parser code has some unit tests, which is plenty compared to the zero tests in the Python scripts (and in UNIC's tools/pylib).

About having command-line tools for accessing the UCD and more: I totally agree with the need for these tools (in a systems programming language), and they're already in the plans for UNIC.

And, the most interesting part: the data structures! That's where there's a lot of room for work. I think almost every data structure used for UCD (and other Unicode/CLDR data) to store and access character properties can be, and eventually should be, abstracted into standalone components/crates/repos, to be shared. There's a lot of room for optimization there, which can be done better with abstract data-structure APIs and a separation of data from the DS implementation.
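As a rough sketch of what I mean by an abstract DS API (all names hypothetical), something like a single lookup trait with swappable backing layouts:

    use std::cmp::Ordering;

    /// One lookup API; the table layout behind it (range tables, tries,
    /// FSTs, ...) can vary per property and per optimization target.
    pub trait CharDataTable {
        type Value;
        fn find(&self, ch: char) -> Option<Self::Value>;
    }

    /// One possible layout: sorted, inclusive (start, end, value) ranges.
    pub struct RangeTable<V: Copy + 'static> {
        pub ranges: &'static [(char, char, V)],
    }

    impl<V: Copy + 'static> CharDataTable for RangeTable<V> {
        type Value = V;
        fn find(&self, ch: char) -> Option<V> {
            self.ranges
                .binary_search_by(|&(start, end, _)| {
                    if ch < start {
                        Ordering::Greater // this range sorts after `ch`
                    } else if ch > end {
                        Ordering::Less // this range sorts before `ch`
                    } else {
                        Ordering::Equal
                    }
                })
                .ok()
                .map(|idx| self.ranges[idx].2)
        }
    }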

Indeed! Looking forward to working on these together!

I also agree with @dont_tolerate_bigots that updating to a new Unicode version should not be a breaking change as far as SemVer is concerned. Unicode is committed to compatibility; your own README links to Character Encoding Stability.

Or, put another way, they learned the hard way. From RFC 3629 (UTF-8, a transformation format of ISO 10646):

In 1996, Amendment 5 to
the 1993 edition of ISO/IEC 10646 and Unicode 2.0 moved and expanded
the Korean Hangul block, thereby making any previous data containing
Hangul characters invalid under the new version. Unicode 2.0 has the
same difference from Unicode 1.1. The justification for allowing
such an incompatible change was that there were no major
implementations and no significant amounts of data containing Hangul.
The incident has been dubbed the "Korean mess", and the relevant
committees have pledged to never, ever again make such an
incompatible change (see Unicode Consortium Policies [1]).

I hope the same, but in my analysis you need a very careful design to achieve that compatibility.

Take the Unicode Character Database as an example. It is tempting (and in fact reasonable) to make, for example, the Bidi_Class property a plain C-like enum. But this breaks down because an enumerated property in the UCD can, while it's rare, be expanded (there is only a guarantee that existing values cannot be removed). In the past, the set of values for Bidi_Class was expanded between Unicode 6.2.0 and 6.3.0, and in fact the only non-boolean property whose values are guaranteed to be fixed is General_Category. [1] Otherwise you cannot have a simple, exhaustive enum, or you have to be prepared to possibly break compatibility on new Unicode releases (see the sketch after the footnote).

[1] The Unicode Standard refers to this kind of property as a "closed enumeration" (section 3.5, definition D28) and lists two (currently the only) examples: General_Category and Bidi_Class. Unfortunately this is quite misleading, because Bidi_Class can be extended upon new bidirectional control characters (it is fixed for all other characters, however).
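To illustrate with current Rust (using an illustrative subset of Bidi_Class values, not UNIC's actual type): the #[non_exhaustive] attribute is one way to leave room for expansion, at the cost of forcing downstream matches to carry a wildcard arm:

    // Adding a variant to a non-exhaustive enum is not a breaking change,
    // because matches in other crates must already have a wildcard arm.
    #[non_exhaustive]
    pub enum BidiClass {
        LeftToRight,
        RightToLeft,
        ArabicLetter,
        LeftToRightIsolate, // one of the values added in Unicode 6.3.0
    }

    pub fn is_rtl(class: &BidiClass) -> bool {
        match *class {
            BidiClass::RightToLeft | BidiClass::ArabicLetter => true,
            _ => false, // absorbs any variants added by future Unicode versions
        }
    }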


I just added the first new component, the UCD Age property: https://crates.io/crates/unic-ucd-age

    assert_eq!(Age::of('A'), Age::V1_1);
    assert_eq!(Age::of('\u{A0000}'), Age::Unassigned);
    assert_eq!(Age::of('\u{10FFFF}'), Age::V2_0);
    assert_eq!('🦊'.age(), Age::V9_0);

As follow-ups, it can get performance and memory improvements, traits for working with strings, and iterators for running over the characters of a specific Unicode version.

Update on Versioning

And, interestingly, this change actually resolved the open question about versioning that we've been discussing here. Here's what happens:

  • The Age property is of type Catalog, meaning that exactly one new value will be added to pub enum Age with every (non-micro) Unicode update. This means unic::ucd::age will have a breaking API change, needing a major version bump, as will unic::ucd and unic.

  • All components already need an update anyway, because of changes to their UNICODE_VERSION constant and (possibly) their data tables.

  • As mentioned earlier as another versioning policy, in any version update we want every updated component to use the exact same new version of unic. Therefore, every single component would get a major version bump on any (non-micro) Unicode version update.


Good point, @dont_tolerate_bigots. I think Rust's match syntax can basically cover all the use cases I know of with Option<(u8, u8)>, and it would make the API more stable. I'll update it. Filed an issue: https://github.com/open-i18n/rust-unic/issues/23 (Use Option<(u8, u8)> for ucd::Age).
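As a rough sketch of why that shape stays stable (hypothetical code, not the final API): new Unicode versions only introduce new tuple values, never new variants, so downstream matches keep compiling:

    // None = unassigned; Some((major, minor)) = first assigned in Unicode
    // version major.minor.
    type Age = Option<(u8, u8)>;

    fn describe(age: Age) -> String {
        match age {
            // A future Unicode 11.0 just yields Some((11, 0)); no new enum
            // variant appears, so this match needs no changes.
            Some((major, minor)) => format!("assigned in Unicode {}.{}", major, minor),
            None => "unassigned".to_string(),
        }
    }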

I think I had a more rigid API in mind, and that's why I assumed a breaking bump would be needed for the majority of Unicode updates. ICU receives a major update every year, after each Unicode release. Of course, they usually have API changes in those releases as well, and the slow release cycle makes it very different from what we have here. It's a similar situation with programming languages that have more Unicode/i18n libraries built in, like Java.

Since release cycles are much shorter in the Rust world, and based on the feedback here, I think you're right and we'd better keep the API more flexible and not jump versions too much (after the API gets stabilized a bit more, say in a few months).

One question remains to be answered: how can we make it possible for a dependent library to receive bug fixes, but not Unicode version updates? I think that's something that gets easier once we pass 1.0.0, as we can keep bug fixes as micro updates, and any Unicode version update as a minor one.

So, how about this plan, then:

  • Micro Unicode updates give UNIC a micro version bump. (There should never be a need for API changes in a Unicode micro update.)
  • Minor Unicode updates (which are not expected to happen that often anymore) give UNIC a minor version bump, unless there's API breakage.
  • Major Unicode updates (expected to happen once a year, on schedule) give UNIC a minor version bump, unless there's API breakage.

Then, if we want to branch on a specific Unicode version and maintain that branch for a while, it can only receive micro updates, which sounds reasonable for a side branch.
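For example (assuming a post-1.0 scheme where Unicode updates are minor bumps and bug fixes are micro bumps), a dependent could opt into bug fixes only with a tilde requirement in its Cargo.toml:

    [dependencies]
    # ~1.2 means >=1.2.0, <1.3.0: micro (bug-fix) releases flow in
    # automatically, while Unicode updates (1.3.0, 1.4.0, ...) wait until
    # this requirement is changed explicitly.
    unic = "~1.2"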

I'm hoping to reach UNIC 1.0.0 before the release of Unicode 11, so I don't think we need to make any decision about Unicode updates during the 0.x period.

What do you think, @dont_tolerate_bigots, @SimonSapin, @lifthrasiir?

I don’t understand why this would be a goal. Sticking with old Unicode versions sounds to me like a recipe for bugs, due to different components of the same system accidentally using different versions.

Hi, I just made an account on this forum to thank you for the time you've put into this project. I am making a toy programming language and wanted it to support Unicode, and I think you just saved my day with this crate. I was musing over the idea of creating my own Unicode crate, but this is way better.

Thanks!
