non-English string shows as two different byte slices!

In my CLI application, I search for a user-input string in an array of strings. A Korean user of the app reported a bug, and after some digging it seems that two seemingly identical strings are encoded as two different byte slices.
The string is the haystack, and I am trying to find the query in it, which is typed by the user and read by my application.

    haystack: "진흙속의연꽃 속 다 지구"
    query: "지구"

(note that query seems to be the last two characters in haystack)

Using the above values, haystack.contains(query) returns false. However, if I hardcode the value of query into my source code as embed and do haystack.contains(embed), I get true!
And looking at their byte representations, I get:

    query: "지구" [225, 132, 140, 225, 133, 181, 225, 132, 128, 225, 133, 174]
    embed: "지구" [236, 167, 128, 234, 181, 172]

Is this related to UTF-8 vs UTF-16?

My CLI application is invoked by a 3rd-party GUI, and within my app I can confirm that on my computer LC_CTYPE is set to en_US.UTF-8.

In the above examples, the haystack is fetched from the web, so it's possible it uses a different encoding than what my Korean user produces when they type in the query! How can I address this issue to prevent future bugs?
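
For what it's worth, the mismatch is reproducible in isolation. Here is a minimal sketch; the query literal is written as explicit escapes, on the assumption that the user's input arrives in the decomposed form shown above while the web-sourced haystack is precomposed:

    fn main() {
        // What the user appears to type: conjoining jamo (decomposed form).
        let query = "\u{110C}\u{1175}\u{1100}\u{116E}";
        // What the haystack ends with: precomposed Hangul syllables.
        let haystack = "진흙속의연꽃 속 다 지구";

        println!("{:?}", query.as_bytes());        // 12 bytes
        println!("{}", haystack.contains(query));  // false
        println!("{}", haystack.contains("지구")); // true (precomposed literal)
    }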

This may be related to "Unicode normalization".

I think they are different strings, even though they "look" the same. The following link is on the Go playground. It shows the same behaviour - str1 has 12 bytes, str2 has 6 bytes.

I literally copied the strings from this code.
What I think is happening is like the famous case of the Greek capital alpha (Α) and the English capital A. They look the same, but the Greek one takes 2 bytes in UTF-8 while the English one takes only 1. [You can try copying the characters into a playground and see it for yourself.]

No. UTF-8 and UTF-16 are two low-level encoding formats for the same underlying abstract numerical values ("code points"). Rust strings are always UTF-8, and so it's not possible to get UTF-16 by inspecting the underlying bytes of a str. If the bytes of the two strings are different, then they were specified using different code points to begin with.

In particular, copying and pasting the two strings into the Playground (which uses a fixed-width font) in fact reveals that the first string is specified by combining marks, while the second string is precomposed.
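
The playground output isn't reproduced above, but printing each string's code points shows the same thing (a sketch; the two forms are written as explicit escapes so that copy-pasting can't re-normalize them):

    fn main() {
        let decomposed = "\u{110C}\u{1175}\u{1100}\u{116E}"; // conjoining jamo
        let precomposed = "\u{C9C0}\u{AD6C}";                // precomposed syllables

        for c in decomposed.chars() {
            print!("U+{:04X} ", c as u32); // U+110C U+1175 U+1100 U+116E
        }
        println!();
        for c in precomposed.chars() {
            print!("U+{:04X} ", c as u32); // U+C9C0 U+AD6C
        }
        println!();
    }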

Yes, the strings are the same text to a human, but normalized in different ways. The query consists of the four code points [U+110C, U+1175, U+1100, U+116E] while the embedded string is [U+C9C0, U+AD6C].

Here's a playground with a possible solution.
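
The playground link isn't shown here, but judging by the follow-up posts, the suggestion was roughly along these lines: normalize both sides with the unicode-normalization crate before searching (a sketch, not the exact playground code):

    use unicode_normalization::UnicodeNormalization;

    fn contains_normalized(haystack: &str, query: &str) -> bool {
        // Bring both strings to the same (composed) form before searching.
        let haystack: String = haystack.nfc().collect();
        let query: String = query.nfc().collect();
        haystack.contains(&query)
    }

    fn main() {
        let haystack = "진흙속의연꽃 속 다 지구";
        let query = "\u{110C}\u{1175}\u{1100}\u{116E}"; // decomposed "지구"
        assert!(contains_normalized(haystack, query));
    }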

Thanks for your reply. But the more I test, the more it seems the 3rd-party app is causing the issue.

If I run my CLI app directly from the terminal and search for 흙, it behaves correctly and finds it.
If I run my app through the 3rd-party app and paste 흙 into its GUI window, it's not found, and my debug!() output shows different lengths!

So the question is: how do I deal with this? Is there a way of working around it?

Oh, I just saw this after my last post. Let me test it out and report back.

As far as I can tell it is a closed-source app. So the only things you can do are:

  • Open a ticket
  • Contact someone on the team
  • Or maybe there is a documented option which controls this behaviour

First of all you need to decide if these are even bugs at all.

Korean includes two ways of writing text and Unicode includes both. There are special rules related to these. Your query uses ᄌ followed by ᅵ, while the haystack has 지. Just type “ᄌ ᅵ” in a browser input window, remove the space between them, and observe how they magically combine into “지”!

Of course std doesn't include the multi-megabyte conversion tables needed to perform that trick in all languages. There is a crate which deals specifically with Korean, but I haven't tested it.
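
As an aside, NFC normalization performs that same combining step; a small sketch using the general-purpose unicode-normalization crate (rather than a Korean-specific one):

    use unicode_normalization::UnicodeNormalization;

    fn main() {
        let jamo = "\u{110C}\u{1175}"; // ᄌ followed by ᅵ
        let composed: String = jamo.nfc().collect();
        assert_eq!(composed, "\u{C9C0}"); // 지 as a single precomposed syllable
        println!("{jamo} -> {composed}");
    }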

@Yandros @kaj Thanks for your suggestions and playground code.
I used the nfc() method you suggested and my preliminary tests are working.
I am completely unfamiliar with Unicode's normalization concepts, so I'm gonna need to do some reading. If you have any resources for learning the basics, please share them.

Thanks for the example and links @VorfeedCanal. It seems similar to Arabic/Farsi in terms of characters joining to make new shapes. However, adding a crate for each language's special cases and quirks does not seem ideal for my use case.

It's a big topic, and I'm no expert, but the unicode_normalization crate documentation links to the Unicode documentation, where there is a section describing the different normalization forms: UAX #15: Unicode Normalization Forms.

In short, there are four forms: NFD, NFC, NFKD, and NFKC. The "D" variants are "Decomposed" (longer representation) while the "C" variants are "Composed" (shorter representation). If your strings are only used for searching, it may be best to use the "K" (for compatibility) forms, which simplify some strings, but if the strings are used for display, that might simplify too much.

Whether you select the "C" or "D" variants doesn't really matter*, but you should apply the same normalization to the haystack as you do to the needle.

*) I'd use the "NFC" variant, partly because it's shorter so it should be more efficient for storage and searching, and partly because in my native language, Swedish, the characters Å, Ä, and Ö are distinct characters, and decomposing them to an A or O with diacritics feels wrong. Of course, someone with a background in another language might instead feel that "NFD" is the "correct" form for similar reasons.
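
To make the four forms concrete, here is a small sketch comparing them on a couple of characters (the Å matches the footnote above; the ligature shows what the "K" forms additionally fold):

    use unicode_normalization::UnicodeNormalization;

    fn show(label: &str, s: &str) {
        let codes: Vec<String> = s.chars().map(|c| format!("U+{:04X}", c as u32)).collect();
        println!("{label}: {s} -> {}", codes.join(" "));
    }

    fn main() {
        let a_ring = "\u{C5}"; // Å as a single precomposed code point
        show("NFC ", &a_ring.nfc().collect::<String>()); // U+00C5
        show("NFD ", &a_ring.nfd().collect::<String>()); // U+0041 U+030A (A + combining ring)

        // The "K" (compatibility) forms also fold characters like the "fi" ligature:
        let ligature = "\u{FB01}";
        show("NFC ", &ligature.nfc().collect::<String>());  // U+FB01 (unchanged)
        show("NFKC", &ligature.nfkc().collect::<String>()); // U+0066 U+0069 ("fi")
    }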

Thanks for the explanation. So it seems that if I were to write a search engine for the web, I had better apply NFC normalization to both the crawled webpages and the user-input queries.

Yes, you'd need to apply the same normalization for both the data to search in and the keys to search for.
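
A hypothetical end-to-end shape of that (the names here are made up for illustration): normalize documents once when they are ingested, and normalize each query the same way before matching.

    use unicode_normalization::UnicodeNormalization;

    // Hypothetical in-memory "index" that NFC-normalizes everything it stores.
    struct Index {
        docs: Vec<String>,
    }

    impl Index {
        fn new() -> Self {
            Self { docs: Vec::new() }
        }

        fn add(&mut self, doc: &str) {
            // Normalize once, at ingestion time.
            self.docs.push(doc.nfc().collect());
        }

        fn search(&self, query: &str) -> Vec<&str> {
            // Apply the same normalization to the key before matching.
            let query: String = query.nfc().collect();
            self.docs
                .iter()
                .filter(|doc| doc.contains(&query))
                .map(|doc| doc.as_str())
                .collect()
        }
    }

    fn main() {
        let mut index = Index::new();
        index.add("진흙속의연꽃 속 다 지구");
        let hits = index.search("\u{110C}\u{1175}\u{1100}\u{116E}"); // decomposed "지구"
        assert_eq!(hits.len(), 1);
    }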

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.