UTF-8 weirdness with remote f/s

I have run into a strange issue when traversing f/s hierarchies on either SMB-mounted Synology or Dropbox.

I have a sample programme to illustrate it:
github

Executive summary: the NAS is BTRFS-based and the file names are correctly encoded on the NAS.
The SMB-mounted view from OSX uses some strange encoding.
The UTF-8 strings presented to rust when traversing the directory have the weird encoding.
If I convert to "normal" UTF-8, I end up with the Finder view and the OSX command-line view: no weird \u{xxxx} extras.
The two strings differ, but when I pass them to the OS to perform f/s operations, both appear to name the same file. For example (using the programme supplied at the github link):

% ~-/target/debug/fs -d Dropbox/funny
rename: "Dropbox/funny/カ\u{3099}キ\u{3099}ク\u{3099}ケ\u{3099}コ\u{3099}" => ガギグゲゴ?  [yYnNqQ!] y
"ガギグゲゴ" does exist.
renamed to ガギグゲゴ

The filename is ガギグゲゴ but rust returns: カ\u{3099}キ\u{3099}ク\u{3099}ケ\u{3099}コ\u{3099}
which is apparently what OSX is returning.
It appears that both strings name the file.
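
A minimal sketch of how one might check this directly, independent of my sample programme. The function name `probe_normalization` and the `.probe` file names are made up for illustration. It creates a file under the composed spelling and then asks whether each spelling resolves. The second result depends entirely on the filesystem or mount being probed.

```rust
use std::fs::{self, File};
use std::io;
use std::path::Path;

/// Hypothetical helper: creates a file in `dir` under the composed (NFC)
/// name, then reports whether the composed and the decomposed (NFD)
/// spellings each resolve to an existing file.
fn probe_normalization(dir: &Path) -> io::Result<(bool, bool)> {
    let nfc = dir.join("\u{30AC}.probe"); // ガ as one code point
    let nfd = dir.join("\u{30AB}\u{3099}.probe"); // カ + combining U+3099

    File::create(&nfc)?;
    let result = (nfc.exists(), nfd.exists());
    fs::remove_file(&nfc)?; // clean up the probe file

    Ok(result)
}
```

Calling `probe_normalization(Path::new("/Volumes/utf-test"))` should return `(true, true)` on a mount where both strings name the file, and `(true, false)` on a filesystem that treats the names as plain byte strings.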

It seems to me that I should just always convert every directory entry (dent) returned to "normal" UTF-8 and get on with my life, but that seems too expensive. I am going to write something which I want to be blazingly fast at traversing f/s, and I would be paying the additional conversion burden on every name for the sake of the .001% of files that might need it...

I guess I could restrict this kluge to remote mounts. I could also write a sniffer to quickly scan the chars to see if there are any funny ones before doing the conversion.
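
A rough sketch of what such a sniffer could look like, using only the standard library. The name `might_need_conversion` is made up, and the set of "funny" characters is an assumption that is deliberately incomplete: it covers only the basic combining diacritics block and the two kana sound marks, not every combining character in Unicode.

```rust
/// Hypothetical pre-check: returns true only if `name` contains a code point
/// from a small, incomplete set of combining marks, meaning the name might
/// be in decomposed form and worth converting.
fn might_need_conversion(name: &str) -> bool {
    // Fast path: a pure-ASCII name cannot contain any combining marks.
    if name.is_ascii() {
        return false;
    }
    name.chars().any(|c| {
        matches!(
            c,
            '\u{0300}'..='\u{036F}'       // Combining Diacritical Marks block
                | '\u{3099}' | '\u{309A}' // kana voiced / semi-voiced sound marks
        )
    })
}
```

Only names for which this returns true would then go through the real conversion. For that step the `unicode-normalization` crate looks like the usual choice, though I have not measured its cost here.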

Anyway -- hello rust community!

U+3099 is Combining Katakana-Hiragana Voiced Sound Mark. It seems to me that

"カ\u{3099}キ\u{3099}ク\u{3099}ケ\u{3099}コ\u{3099}"

and

"ガギグゲゴ"

are the same string, just rendered differently for some reason. Maybe the difference is whether it's in Unicode composed or decomposed normal form? In any case both seem to be valid UTF-8.
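
To make the difference visible, here is a small sketch that lists the code points of a string. The helper name `code_points` is made up for illustration.

```rust
/// Hypothetical helper: renders a string as a space-separated list of its
/// code points, e.g. "U+30AB U+3099".
fn code_points(s: &str) -> String {
    s.chars()
        .map(|c| format!("U+{:04X}", c as u32))
        .collect::<Vec<_>>()
        .join(" ")
}
```

Applied to the first character of each name, `code_points("カ\u{3099}")` gives `U+30AB U+3099` (two code points, six bytes of UTF-8), while `code_points("ガ")` gives `U+30AC` (one code point, three bytes). So `==` on the two `&str` values is false even though they render identically.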

You could be running into a normalization issue at the OS/FS level (ugh). I'm not an expert, but from the convmv man page:

Darwin, the base of the Macintosh OS enforces normalization form D (NFD), where a few characters are encoded in a different way. On OS X it's not possible to create NFC UTF-8 filenames because this is prevented at filesystem layer. On HFS+ filenames are internally stored in UTF-16 and when converted back to UTF-8, for the underlying BSD system to be handable, NFD is created. See Technical Q&A QA1173: Text Encodings in VFS for details. I think it was a very bad idea and breaks many things under OS X which expect a normal POSIX conforming system.

(NFD is the decomposed form, using combining code points.)

I've installed convmv both on OSX and on the NAS. Per convmv, the names on the NAS are correct, but on OSX the names on the SMB-mounted volume are incorrect. convmv claims to correct them (as does my sample programme) but the correction does nothing.

The Technical Q&A referenced is outdated: it refers to HFS+, not APFS. APFS apparently stores the filenames as NFC, not NFD, so native APFS now uses the composed, "correct" encoding. Here's a link I just found which seems to shed some light on it.

For some reason, the SMB VFS implementation on OSX seems to be converting the NFC filename to NFD. Dropbox also. I'll check out some other VFS to see how they handle things.

I might be totally off in the weeds on this. My brain hurts enough just from rust without having to absorb UTF weirdness.

Cryptomator using the macFUSE VFS has the same problem.

     Running `target/debug/fs -d /Volumes/utf-test --funny`
rename: "/Volumes/utf-test/か\u{3099}き\u{3099}く\u{3099}け\u{3099}こ\u{3099}" => がぎぐげご?  [yYnNqQ!] y
"がぎぐげご" does exist.
renamed to がぎぐげご
rename: "/Volumes/utf-test/は\u{3099}ひ\u{3099}ふ\u{3099}へ\u{3099}ほ\u{3099}" => ばびぶべぼ?  [yYnNqQ!] y
"ばびぶべぼ" does exist.
renamed to ばびぶべぼ
