Comparing UTF-8 characters

Hi there!

I'm currently writing a library that at one point needs to sort a list of paths. These paths must be sorted in a somewhat "natural" order, meaning that [ 'a', 'e', 'E', 'é', 'ä' ] should come out as something like [ 'a', 'ä', 'e', 'E', 'é' ] (the result may not be exactly that, but that's the idea). That is not what you get from the usual comparison methods, which only compare code point values.
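
To make the problem concrete, here is a minimal sketch of what the default ordering does with that list. It relies only on the standard library; `sort_by_code_point` is just an illustrative helper name, and the letters are assumed to be precomposed (single code point) characters.

```rust
/// Sorts strings with Rust's default `Ord` for `str`, which compares the
/// UTF-8 bytes and therefore orders strings by code point.
fn sort_by_code_point(mut items: Vec<&str>) -> Vec<&str> {
    items.sort();
    items
}

fn main() {
    let sorted = sort_by_code_point(vec!["a", "e", "E", "é", "ä"]);
    // Code points: 'E' U+0045, 'a' U+0061, 'e' U+0065, 'ä' U+00E4, 'é' U+00E9,
    // so the uppercase letter comes first and both accented letters come last.
    println!("{:?}", sorted);
}
```

The accented letters end up after every unaccented one, and 'E' ends up before 'a', which is exactly the ordering I want to avoid.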

I've spent a few hours looking for a library that handles this. It's already really complicated with just the accents and case of the Latin alphabet, so I can't imagine what it's like with other languages.

My question is: is there any library that can do this? I've seen several mentions of the Unicode Collation Algorithm, but no concrete implementation of it. ICU seems to solve the problem too, but I couldn't find a good bindings library (either the bindings are not developed enough, or they don't expose simple character comparisons). There also doesn't seem to be a way to get a simple comparison from the system.

I've also found the unicode-normalization library, but that one is pretty complicated, as it performs (de)composition on graphemes, which is a much more involved job.

Something like JavaScript's .localeCompare() would be absolutely perfect but I can't find anything like that :confused:

So, does anyone have an idea of what library I could use?

NOTE: I'm working on a crate where I can't use any unsafe code, and I would really like to avoid calling a command-line tool.

Thanks in advance for your answers :slight_smile:

Try unicode-collation.

Unfortunately unicode-collation was yanked, and its repo no longer exists.

There is a chance that UNIC has what you want (or something sufficiently similar), but according to this post, collation isn't implemented yet.

You might want to look for a safe wrapper around rust-icu if you're averse to having to manage unsafe code personally.

I've thought about this one, but it appears to be unmaintained (the last commit is from 2015). Also, after reading the source code, I don't see any function that would allow such a comparison :thinking:

Whose "natural" order are you talking about?

In the Finnish alphabet 'ä' comes after 'z'. And there is 'å' before that. Then there is 'ö' at the end.

So stupid it just might work: use a handwriting recognition NN to decide which English character it looks like the most, sort by that.

In seriousness, this seems like a very difficult problem. The two approaches I might try would be:

  • Look at the implementation of JavaScript's localeCompare.
  • Manually compile a list of common accented characters and the base letter each one should be treated as (ä -> a), and sort the rest by code point.
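
A rough sketch of that second approach, assuming a small hand-written table (the `base_letter` mapping, `primary_key`, and `sort_folded` are made up for illustration, far from complete, and ignore locale entirely):

```rust
/// Hypothetical hand-compiled table: maps a few precomposed accented Latin
/// letters to the base letter they should sort next to. Anything not listed
/// is returned unchanged and so falls back to its code point.
fn base_letter(c: char) -> char {
    match c {
        'à' | 'á' | 'â' | 'ä' | 'å' => 'a',
        'è' | 'é' | 'ê' | 'ë' => 'e',
        'ì' | 'í' | 'î' | 'ï' => 'i',
        'ò' | 'ó' | 'ô' | 'ö' => 'o',
        'ù' | 'ú' | 'û' | 'ü' => 'u',
        'ç' => 'c',
        'ñ' => 'n',
        other => other,
    }
}

/// Sort key with case and the listed accents folded away.
fn primary_key(s: &str) -> String {
    s.chars()
        .flat_map(char::to_lowercase)
        .map(base_letter)
        .collect()
}

/// Sorts by the folded key first, then by code point to break ties so the
/// result is deterministic.
fn sort_folded(items: &mut [&str]) {
    items.sort_by(|a, b| {
        primary_key(a)
            .cmp(&primary_key(b))
            .then_with(|| a.cmp(b))
    });
}

fn main() {
    let mut items = vec!["a", "e", "E", "é", "ä"];
    sort_folded(&mut items);
    println!("{:?}", items);
}
```

This groups 'ä' with 'a' and 'E', 'é' with 'e', but it only knows the characters in the table and treats every language the same way.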

I created the icu-sys crate and would be happy to help if anyone wants to add bindings to the ICU collation service APIs.

@ZiCog

Whose "natural" order are you talking about?
In the Finnish alphabet 'ä' comes after 'z'. And there is 'å' before that. Then there is 'ö' at the end.

That's why I said "(the result may not be exactly like that, but it's the idea)". I know the ordering differs from one language to another, which is why I'd like to be able to sort the characters based on the user's locale :slight_smile:
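
To illustrate what I mean by locale-dependent ordering, here is a toy sketch where the "locale" is nothing more than an alphabet string. The two constants and the `rank` / `compare_in` helpers are hypothetical and heavily simplified (lowercase only, no case handling, no real locale data); an actual solution would need proper collation rules.

```rust
use std::cmp::Ordering;

/// Simplified Finnish-style order: å, ä, ö come after z.
const FINNISH_ORDER: &str = "abcdefghijklmnopqrstuvwxyzåäö";
/// Simplified order where umlauts sit right after their base letter.
const GERMAN_STYLE_ORDER: &str = "aäbcdefghijklmnoöpqrstuüvwxyz";

/// Rank of `c` in `alphabet`. Characters missing from the alphabet sort
/// after all known ones, ordered among themselves by code point.
fn rank(alphabet: &str, c: char) -> (usize, char) {
    match alphabet.chars().position(|a| a == c) {
        Some(i) => (i, '\0'),
        None => (usize::MAX, c),
    }
}

/// Compares two strings character by character using the given alphabet.
fn compare_in(alphabet: &str, a: &str, b: &str) -> Ordering {
    let ka = a.chars().map(|c| rank(alphabet, c));
    let kb = b.chars().map(|c| rank(alphabet, c));
    ka.cmp(kb)
}

fn main() {
    let mut words = vec!["ä", "z", "a", "ö"];
    words.sort_by(|a, b| compare_in(FINNISH_ORDER, a, b));
    println!("Finnish-style: {:?}", words);
    words.sort_by(|a, b| compare_in(GERMAN_STYLE_ORDER, a, b));
    println!("German-style:  {:?}", words);
}
```

The same input sorts differently depending on which alphabet is passed in, which is the behaviour I'd like to get from a real library based on the user's locale.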

@gretchenfrage

So stupid it just might work: use a handwriting recognition NN to decide which English character it looks like the most, sort by that.
In seriousness, this seems like a very difficult problem. The two approaches I might try would be:

  • Look at the implementation of JavaScript's localeCompare.
  • Manually compile a list of common accented characters and the base letter each one should be treated as (ä -> a), and sort the rest by code point.

The thing is, C++'s standard library has a std::collate::compare function that can compare strings based on the provided locale. The problem is that there is a HUGE amount of code behind it, because this function relies on many other complicated ones, which would be really hard to port to Rust.

@mbrubeck

I created the icu-sys crate and would be happy to help if anyone wants to add bindings to the ICU collation service APIs.

Hmmm, I don't know much about how ICU or the Unicode Collation Algorithm works, but that could be nice :slight_smile:
I saw there is also a more complete version of this crate that binds most of the functions from ICU: https://github.com/fullcontact/icu-sys (which is partially derived from https://github.com/servo/rust-icu/tree/master/icu-sys), but it's still an unsafe wrapper around C functions.