to_lowercase is weird

Edit: Never mind, I figured it out. I overlooked a single 'unconditional mapping' in SpecialCasing.txt that causes the behavior.

I'm trying to understand the behavior of the to_lowercase method for char.

It returns an iterator because lowercasing isn't always one-to-one in Unicode, but the only char for which it returns more than one character is 0x130 (capital I with dot above). Every other char yields a single character when lowercased.
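For anyone who wants to check this themselves, here is a minimal sketch that scans every char and collects those whose lowercase iterator yields more than one item. The helper name multi_char_lowercase is made up for illustration, and the result depends on the Unicode version shipped with your Rust toolchain.

```rust
/// Collect every char whose lowercase mapping is longer than one char.
/// (Hypothetical helper, not part of std.)
fn multi_char_lowercase() -> Vec<char> {
    // The inclusive char range covers U+0000..=U+10FFFF and skips surrogates.
    ('\0'..=char::MAX)
        .filter(|c| c.to_lowercase().count() > 1)
        .collect()
}

fn main() {
    // 0x130 lowercases to 'i' followed by U+0307 COMBINING DOT ABOVE.
    let lowered: String = '\u{130}'.to_lowercase().collect();
    println!("U+0130 -> {}", lowered.escape_unicode());

    // List every char with a multi-char lowercase mapping.
    for c in multi_char_lowercase() {
        let mapped: String = c.to_lowercase().collect();
        println!("U+{:04X} -> {}", c as u32, mapped.escape_unicode());
    }
}
```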

According to the documentation:

If this char has a one-to-one lowercase mapping given by the Unicode Character Database UnicodeData.txt, the iterator yields that char.

If this char requires special considerations (e.g. multiple chars) the iterator yields the char(s) given by SpecialCasing.txt.

This operation performs an unconditional mapping without tailoring. That is, the conversion is independent of context and language.

But the behavior for 0x130 seems to match the entry in UnicodeData.txt, not the one in SpecialCasing.txt, despite there being a matching entry in the latter. Many other characters in UnicodeData.txt appear to have one-to-many lowercase rules similar to 0x130, and yet they yield only a single character when lowercased.

The Unicode FAQ seems somewhat inconsistent here too:

Q: Is all of the Unicode case mapping information in UnicodeData.txt?

A: No. The UnicodeData.txt file includes all of the one-to-one case mappings. Since many parsers were built with the expectation that UnicodeData.txt would have at most a single character in each case mapping field, the file SpecialCasing.txt was added to provide the one-to-many mappings, such as the one needed for uppercasing ß (U+00DF LATIN SMALL LETTER SHARP S). In addition, CaseFolding.txt contains additional mappings used in case folding and caseless matching. For more information, see Section 5.18, Case Mappings in The Unicode Standard.
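The one-to-many uppercase mapping for ß that the FAQ answer mentions is easy to observe from Rust. A small sketch, where upper is just an illustrative helper and not a std function:

```rust
/// Collect the uppercase mapping of a char into a String.
/// (Hypothetical helper for illustration.)
fn upper(c: char) -> String {
    c.to_uppercase().collect()
}

fn main() {
    // U+00DF has no single-char uppercase form, so the iterator yields two chars.
    println!("{} -> {}", 'ß', upper('ß'));
    // An ordinary letter yields exactly one char.
    println!("{} -> {}", 'a', upper('a'));
}
```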

There are clearly one-to-many mappings in UnicodeData.txt, and Rust seems to use one of them but none of the others. I can find no rhyme or reason for this behavior. For comparison, look at the mapping for 0x120 in UnicodeData.txt.
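To make the comparison concrete, here is a rough sketch that pulls two fields out of the UnicodeData.txt records for 0x120 and 0x130. It assumes the field layout described in UAX #44, where field 5 is the decomposition mapping and field 13 is the simple lowercase mapping. The two records are reproduced here as I understand them, so check them against your own copy of the file. The helper name is made up. If that layout is right, the two-code-point value in each record is a decomposition rather than a case mapping, and the two-char result for 0x130 would come from the unconditional entry in SpecialCasing.txt.

```rust
/// Return (decomposition, simple lowercase) from one UnicodeData.txt record.
/// Assumes field 5 is the decomposition and field 13 the simple lowercase mapping.
/// (Hypothetical helper for illustration.)
fn decomposition_and_lowercase(record: &str) -> Option<(&str, &str)> {
    let fields: Vec<&str> = record.split(';').collect();
    Some((*fields.get(5)?, *fields.get(13)?))
}

// Records as assumed above; verify against the real UnicodeData.txt.
const U0120: &str =
    "0120;LATIN CAPITAL LETTER G WITH DOT ABOVE;Lu;0;L;0047 0307;;;;N;LATIN CAPITAL LETTER G DOT;;;0121;";
const U0130: &str =
    "0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;";

fn main() {
    for record in [U0120, U0130] {
        if let Some((decomp, lower)) = decomposition_and_lowercase(record) {
            println!("decomposition = {:?}, simple lowercase = {:?}", decomp, lower);
        }
    }

    // What std actually yields for the same two chars.
    for c in ['\u{120}', '\u{130}'] {
        let mapped: String = c.to_lowercase().collect();
        println!("U+{:04X} -> {}", c as u32, mapped.escape_unicode());
    }
}
```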
