Just on a technical note since it's been brought up a few times, Windows paths are UTF-16. They are documented as UTF-16 and if you use any of the "old" locale-specific APIs they'll do conversion to/from UTF-16. Windows is now UTF-16.
It's true that (partly for the historical reasons noted) Windows does not validate UTF-16 strings, so unpaired surrogates are technically possible (and there are some other places where history asserts itself). However, this is an implementation detail and does not make it UCS-2. I don't say this just to quibble: UCS-2 does not care about surrogates, so there wouldn't be any problem with joining WTF-8 strings if all we had to worry about were UCS-2. If it were UCS-2, it would also add the restriction that we couldn't represent anything beyond the basic plane.
My point is that UCS-2 is an actual encoding and doesn't just mean UTF-16 with potentially unpaired surrogates. If we do need to express the idea that the implementation allows for more than the documentation does, let's be specific about that.
And on a broader note, it's fine for most Windows software to follow the documentation and be Unicode only. See VS Code as an example. As I said, many Windows functions will themselves do conversions (lossy at that) from/to UTF-16. Low-level software, e.g. that which you may use to inspect or remove malicious software, does however need to be aware of invalid UTF-16 names.
I have long asked how I should match regexes or globs on file paths (a very common operation in a host of programs) without doing some additional validation or transcoding or allocation step on Windows. I'm not aware of any accepted RFC that would let me do that. An iterator of Either<&str, &OsStr> wouldn't either.
It is quite all right to do a glob match on the WTF-8 representation though. At least, that's what I plan to do Real Soon Now.
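For what it's worth, the byte-level version of that idea can be sketched with nothing but the standard library. This is a toy matcher for a literal-suffix glob like `*.txt` (the helper name and semantics are mine, not any existing crate's API); the point is that matching works even when the rest of the name is WTF-8 that isn't valid UTF-8:

```rust
// Toy matcher for a "*suffix" glob, run directly over WTF-8 bytes.
// Hypothetical helper, standing in for a real glob engine.
fn glob_star_suffix(name: &[u8], suffix: &[u8]) -> bool {
    name.ends_with(suffix)
}

fn main() {
    // "<lone surrogate U+D83D>.txt": the surrogate encodes to ED A0 BD
    // in WTF-8, so the whole name is not valid UTF-8...
    let name: &[u8] = &[0xED, 0xA0, 0xBD, b'.', b't', b'x', b't'];
    assert!(std::str::from_utf8(name).is_err());
    // ...but the byte-level suffix match still works fine.
    assert!(glob_star_suffix(name, b".txt"));
    println!("glob matched on WTF-8 bytes");
}
```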
Citation needed. I couldn't find either UTF-16 or UCS-2 mentioned in the relevant part of the documentation; the intentionally vague term “Unicode” is used instead. And simple experiments show that you can easily pass an invalid UTF-16 string and nothing will stop you.
We may debate definitions till the second coming, but it's a fact of life: filenames which are invalid UTF-16 can be created and have to be processed, somehow. Especially if you are scanning directories made by other apps which you can't control.
The problem here is that UCS-2 has no notion of a basic plane, and surrogate code points are not disallowed characters there. That's why writing code that correctly works with both UCS-2 and UTF-16 is hard. As long as you keep filenames in that ambiguous “maybe UCS-2, maybe UTF-16” Schrödinger state and use them as opaque handles to open/close files, you can process them reliably enough. But once you start processing arbitrary prefixes and suffixes, compatibility rears its ugly head. One example: take the filename 😊.txt and a prefix consisting of the single UCS-2 character 0xD83D. Should that prefix match or not? If you look at the string as a UCS-2 string then it definitely matches, but if you convert both the string and the prefix into WTF-8 they no longer match…
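The two answers can be checked mechanically. Here is a std-only sketch (the helper names are mine): it shows the emoji as UTF-16 code units, where 0xD83D is indeed a code-unit prefix, and hand-encodes the lone surrogate in WTF-8, where it is not a byte prefix:

```rust
/// UTF-16 code units of a string.
fn utf16_units(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

/// WTF-8 encoding of a lone surrogate code unit (generalized UTF-8,
/// three bytes: 1110xxxx 10xxxxxx 10xxxxxx).
fn wtf8_lone_surrogate(cu: u16) -> [u8; 3] {
    assert!((0xD800..=0xDFFF).contains(&cu));
    let c = cu as u32;
    [
        0xE0 | ((c >> 12) as u8),
        0x80 | (((c >> 6) & 0x3F) as u8),
        0x80 | ((c & 0x3F) as u8),
    ]
}

fn main() {
    let emoji = "😊"; // U+1F60A
    // In UTF-16 the emoji is the surrogate pair [0xD83D, 0xDE0A],
    // so the lone unit 0xD83D *is* a code-unit prefix.
    let units = utf16_units(emoji);
    assert_eq!(units, [0xD83D, 0xDE0A]);
    assert!(units.starts_with(&[0xD83D]));

    // In WTF-8 the lone surrogate encodes to [0xED, 0xA0, 0xBD], which is
    // *not* a byte prefix of the emoji's UTF-8 [0xF0, 0x9F, 0x98, 0x8A].
    let lone = wtf8_lone_surrogate(0xD83D);
    assert_eq!(lone, [0xED, 0xA0, 0xBD]);
    assert!(!emoji.as_bytes().starts_with(&lone));
}
```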
What's the difference?
AFAICS the current implementation matches the documentation; I'm not sure what you are talking about.
And it should be possible to write such software in Rust (if you can't write such software in a low-level systems language, then how can you call it a low-level systems language?), and so we have come full circle and returned to where we started.
And how would these regexes work with emoji? Would half of 😊 match the full 😊 or not?
Unicode-enabled functions are described in Conventions for Function Prototypes. These functions use UTF-16 (wide character) encoding, which is the most common encoding of Unicode and the one used for native Unicode encoding on Windows operating systems. Each code value is 16 bits wide, in contrast to the older code page approach to character and string data, which uses 8-bit code values. The use of 16 bits allows the direct encoding of 65,536 characters. In fact, the universe of symbols used to transcribe human languages is even larger than that, and UTF-16 code points in the range U+D800 through U+DFFF are used to form surrogate pairs, which constitute 32-bit encodings of supplementary characters. See Surrogates and Supplementary Characters for further discussion.
You seem to have misunderstood me. I'm not arguing otherwise. I am arguing against calling it UCS-2, which the Rust documentation studiously avoids.
I would say it's not defined there. If you follow that pointer (see Surrogates and Supplementary Characters for further discussion) you will find a “perfect” explanation of how Windows supports Unicode: The Uniscribe API supports supplementary characters.
What happens with unpaired “supplementary characters”, whether they are allowed or not, whether they are accepted or rejected (and if they are rejected, which error codes are returned)… all that remains a mystery. Especially if you note this large red text:
Okay. An OS that was introduced (glancing at the calendar) about a quarter-century ago, but not all system components are compatible with supplementary characters… What about today? A mystery.
The most amusing note is this one: Standalone surrogate code points have either a high surrogate without an adjacent low surrogate, or vice versa. These code points are invalid and are not supported. Their behavior is undefined.
What does that even mean? When we are talking about a language, “undefined behavior” means “anything at all may happen”, but when we are talking about an operating system one may be excused for expecting that there are some limitations and that a single incorrect character wouldn't bring the whole OS to its knees. Although who knows: Apple has had repeated troubles there, maybe Microsoft is the same?
For my use case, it doesn't matter where the regex or glob matches. Just whether it matches or not. But being able to pluck out matches (or captures) is undeniably useful in other circumstances.
Separately, emoji seems like a red herring. If you're just trying to match a literal sequence of bytes, then it will match the full sequence, WTF-8 or not, emoji or not. Maybe you have a more complicated regex in mind though. I would need to see a concrete example.
But in WTF-8 that's 0xED 0xA0 0xBD. And now the question: is that sequence a prefix of 0xF0 0x9F 0x98 0x8A? C++ on Windows claims it should be, but I'm not exactly sure whether you plan to declare it a prefix or not.
Is it a two-character string or a one-character one?… Microsoft terminology seems to imply one high-surrogate character plus one low-surrogate character, and thus a two-character string. ↩︎
That's how it was called in Windows NT, where Unicode support for Windows was developed. And the NT kernel, NTFS in particular, still treats it as UCS-2, while higher-level code treats it as UTF-16.
And since we are talking about filenames, the NT kernel is kinda more relevant than OLE, DDE, XAML and all that crap built on top of it.
I will likely start by implementing a semantic that globs must be valid UTF-8 because globs aren't file paths, they are their own separate pattern language. I could even impose that restriction on Windows only. I could also ensure that wildcards like * match arbitrary bytes so that even if a glob pattern itself cannot literally include any invalid UTF-8, parts of it can still match through the WTF-8 parts of an &OsStr that are not valid UTF-8.
For regexes, patterns must be valid UTF-8, but if you use the &[u8] APIs and disable Unicode mode, then you can match arbitrary bytes. So you could write the WTF-8 encoding of an unpaired surrogate, but you're writing the explicit WTF-8 byte sequence for it, not the unpaired 16-bit surrogate value. Those bytes would only literally match those bytes. There isn't going to be any kind of magic that knows how to map unpaired WTF-8 surrogates into prefixes of the UTF-8 encoding of all paired surrogates beginning with the unpaired surrogate. I'm totally 100% fine with that and I see no reason why it should prevent me from running regex or glob matches on the representation of &OsStr, WTF-8 or not.
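A minimal illustration of that "no magic" semantics, with a naive std-only byte search standing in for a bytes-mode regex engine (the helper is hypothetical, not the regex crate's API): the WTF-8 bytes of a lone surrogate match only themselves, and are not found inside the UTF-8 encoding of the paired character.

```rust
// Naive substring search over raw bytes, standing in for a bytes-mode
// regex/glob engine matching a literal pattern.
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    haystack.windows(needle.len()).any(|w| w == needle)
}

fn main() {
    let lone_d83d = [0xEDu8, 0xA0, 0xBD]; // WTF-8 of unpaired U+D83D
    let emoji = "😊".as_bytes();          // UTF-8: F0 9F 98 8A

    // A haystack that literally contains the unpaired surrogate's bytes
    // is matched by a pattern spelling out those same bytes.
    let mut haystack = b"prefix-".to_vec();
    haystack.extend_from_slice(&lone_d83d);
    assert!(contains_bytes(&haystack, &lone_d83d));

    // But the emoji's UTF-8 bytes do NOT contain the lone surrogate's
    // bytes, so there is no implicit surrogate-to-pair matching.
    assert!(!contains_bytes(emoji, &lone_d83d));
}
```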
(There are alternative designs in this space, but I'm implicitly requiring the constraint of "must work with &OsStr while minimizing costs and exposing a practical and easy to understand semantics to end users.")
If you don't believe the Microsoft documentation then I don't know what else to say. The documentation is not talking about "OLE, DDE, XAML and all that crap built on top of it".
Let me put it this way: from my experience working for a company that wrote low-level filesystem drivers for Windows, I have learned not to trust Microsoft documentation too much.
Even when I wasn't writing said drivers, I talked to people who did. NTFS doesn't have any means of supporting anything but UCS-2. Maybe Microsoft planned to support UTF-16 in WinFS or ReFS and just went with minimal kludges for Windows 2000/XP, but as we know those upgrades stalled and we are still stuck with the 20+ year old NTFS 3.1 from Windows XP. That's why the documentation still talks about Windows 2000.
If you believe this to be wrong then I would need something more than “look ma, the documentation”.
Windows is a hybrid kernel; a driver can do whatever it wants in terms of validating paths. And yes, I have been involved with maintaining filesystem drivers before.
Why would I want to play these power games? We know what happens in there, but we also know Rust plays games to not reveal what happens “under the hood” with its “unspecified, platform-specific, self-synchronizing superset of UTF-8” language.
I'm pretty sure Microsoft does something like that because it believes that in some unspecified future it will finally be able to stop using the UCS-2-based NTFS and switch fully to UTF-16.
But Rust has to deal with reality, not with the imaginary world of product managers. And in reality we have a UCS-2-based NTFS that deals with filenames at the low level and a UTF-16-based GUI (including the GUI that deals with the filesystem).
What precisely that strange combo should be called is of little importance if everyone agrees on what actually goes on under the hood. Which seems to be the case, by your own admission:
You seem to agree with the description of the mess that happens in Windows in reality, so what are we discussing here?
As long as programs may create and use files with byte sequences that don't conform to UTF-16, it doesn't matter whether it's called UCS-2, UCS-2/UTF-16 or Microsoft Unicode, but calling it UTF-16 is wrong.
There is no need to perform any Unicode normalization on path and file name strings for use by the Windows file I/O API functions because the file system treats path and file names as an opaque sequence of WCHARs. Any normalization that your application requires should be performed with this in mind, external of any calls to related Windows file I/O API functions.
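A quick std-only illustration of what "opaque sequence of WCHARs" implies in practice (the helper name is mine): two canonically equivalent spellings of "é" produce different UTF-16 code-unit sequences, so as far as the file system is concerned they would be different names.

```rust
// The file system compares names as raw code units, so canonically
// equivalent strings with different code-unit sequences are different names.
fn utf16(s: &str) -> Vec<u16> {
    s.encode_utf16().collect()
}

fn main() {
    let precomposed = "\u{e9}";   // "é" as U+00E9
    let decomposed = "e\u{301}";  // "é" as U+0065 + combining U+0301

    assert_eq!(utf16(precomposed), [0x00E9]);
    assert_eq!(utf16(decomposed), [0x0065, 0x0301]);
    // Unicode-equivalent, but distinct WCHAR sequences: distinct names.
    assert_ne!(utf16(precomposed), utf16(decomposed));
}
```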