Missing functions on Path/PathBuf/OsString?

Hi,

I have some code that is manipulating paths. Specifically, I need to take one file name and derive a related file name from it. Unfortunately it seems Path/PathBuf/OsStr/OsString are missing most of the useful string-manipulation methods that normal UTF-8 strings have. My current code ignores this and just converts to str, but I'd like a solution that doesn't assume UTF-8 paths (on Linux; I'm rather ignorant of how this works on Windows).

Here are some example code fragments that I'm wondering how to rewrite to work without assumptions about the encoding and without making the code massively more complex (i.e. doing the thing myself by hand as I would have done in C back in the bad old days):

    let data_file = modify_script
        .file_name()
        .ok_or(anyhow!("Failed to get filename"))?
        .to_string_lossy()
        .strip_prefix("modify_")
        .and_then(|s| s.strip_suffix(".tmpl").or(Some(s)))
        .ok_or(anyhow!("This should never happen"))?
        .to_owned()
        + ".src.ini";

and (using the glob crate)

        let mut candidates: Vec<_> = glob::glob_with(
            base_path
                .to_str()
                .ok_or_else(|| anyhow!("Invalid path {base_path:?} for chezmoi source directory: not convertible to UTF-8."))?,
            glob::MatchOptions {
                case_sensitive: true,
                require_literal_separator: true,
                require_literal_leading_dot: true,
            },
        )?.collect();

and:

    let src_name = src_path
        .file_name()
        .ok_or(anyhow!("File has no filename"))?
        .to_string_lossy();
    let data_path = src_path.with_file_name(format!("{src_name}.src.ini"));
    let script_path = src_path.with_file_name(format!("modify_{src_name}.tmpl"));

and:

            let src_filename = existing_file
                .file_name()
                .ok_or(anyhow!("No file name?"))?
                .to_string_lossy();
            let is_mod_script = src_filename.starts_with("modify_");

It seems to me that Rust's support for working with paths is severely lacking. Is there any crate that helps improve the situation?

The other option would be to go all the way in the other direction and enforce UTF-8 paths (using e.g. camino). That too would somewhat simplify the code, but for my application it would be better to not make assumptions about valid Unicode.

For those interested, the full module is here: https://github.com/VorpalBlade/chezmoi_modify_manager/blob/main/src/add.rs It is rather messy and I'm looking into various things I can do to clean it up and simplify it. Saner path manipulation is just one of many steps.

An OS string is conceptually a blob of bytes. Use .as_encoded_bytes() to get a byte slice view, then continue operating on that slice (e.g. <[_]>::strip_suffix()).

Unrelated, but your code also contains a number of unnecessary allocations and weird handling of Option. Since you are already using anyhow, you can just do this:

    let data_file = modify_script
        .file_name()
        .context("Failed to get filename")?
        .as_encoded_bytes()
        .strip_prefix(b"modify_")
        .context("invalid prefix")?;

    let mut data_file = data_file
        .strip_suffix(b".tmpl")
        .unwrap_or(data_file)
        .to_owned();

    data_file.extend_from_slice(b".src.ini");
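    // Note: `data_file` is now a `Vec<u8>`. To get back an `OsString` you
    // could use the unsafe `OsString::from_encoded_bytes_unchecked`, or
    // `std::os::unix::ffi::OsStringExt::from_vec` on Unix.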

You cannot define truly portable APIs on OsStr; you have to make a choice: either handle them "the right way" (not imposing the UTF-8 limitation) for each platform, or handle them "the easy way" (only accept valid UTF-8 and return an error otherwise).

Actually, conversion from &OsString to String is not about UTF-8 per se, but about valid Unicode strings. For Unix-like systems it amounts to the same thing, though.

I would personally just require valid Unicode anyway and blame the user for bad input. Unless you are making utilities like data backup programs, it's really not worth it.

You can have non-UTF-8 encoded filenames on Unix systems as well; in fact I encounter them every once in a while - archives with untranslated JRPGs come to mind. They will work fine with Wine, but some of the file names are not going to be valid UTF-8.


Indeed, AFAIK the only byte pattern disallowed in a POSIX path is b'\0' (and in path components additionally b'/').
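
(As an illustrative sketch, on Unix you can construct such a name directly via the OsStrExt extension trait; the function name here is made up:)

    #[cfg(unix)]
    fn non_utf8_path() -> std::path::PathBuf {
        use std::ffi::OsStr;
        use std::os::unix::ffi::OsStrExt;

        // 0xFF can never appear in valid UTF-8, yet this is a perfectly
        // legal POSIX file name.
        std::path::PathBuf::from(OsStr::from_bytes(b"caf\xFF.txt"))
    }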


Thanks! I had missed as_encoded_bytes. That helps. However I'm a bit confused by the documentation:

The byte encoding is an unspecified, platform-specific, self-synchronizing superset of UTF-8. By being a self-synchronizing superset of UTF-8, this encoding is also a superset of 7-bit ASCII.

How does this work on *nix, where the data can be arbitrary bytes? How is it turned into a "superset of UTF-8"?

Thanks, yes, that is simpler. I blame being somewhat new to Rust (only started using it in spring this year) and that it is difficult to keep track of all the handy helpers.

I believe that is what I was saying: I either handle it properly (don't assume Unicode) or I go fully for the easy way. The current middle ground is just a suboptimal mess.

That doesn't change that I believe it is weird that path components (which are OsStr) don't have support for prefix/suffix checks/stripping. I would of course expect these operations to also take OsStr (e.g. the signature would be something like strip_suffix(&'a self, suffix: &Self) -> Option<&'a Self> (or some suitable pattern and/or conversion trait to generalise the type of the suffix)).

That this is missing is just bizarre.

(Arguably it could also at least accept an ASCII string; the days of needing to care about EBCDIC or other non-ASCII-superset encodings are, as far as I know, long gone.)
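
(In fact, since the encoded-bytes API landed in Rust 1.74, such an ASCII-only helper can be built on stable. A sketch, with a made-up function name, relying on the documented rule that the encoded bytes may be split immediately before or after a valid non-empty UTF-8 substring:)

    use std::ffi::OsStr;

    /// Illustrative helper: strip an ASCII-only suffix from an OsStr.
    fn strip_ascii_suffix<'a>(s: &'a OsStr, suffix: &str) -> Option<&'a OsStr> {
        assert!(suffix.is_ascii() && !suffix.is_empty());
        let rest = s.as_encoded_bytes().strip_suffix(suffix.as_bytes())?;
        // SAFETY: `rest` comes from `as_encoded_bytes` and is only cut
        // immediately before a valid non-empty UTF-8 (ASCII) substring,
        // which the documentation of `from_encoded_bytes_unchecked` allows.
        Some(unsafe { OsStr::from_encoded_bytes_unchecked(rest) })
    }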

That is a valid point. And while I deal with arbitrary settings files that may be in the user's home directory, very few of those are going to be non-ASCII, and of those that are, all are probably going to be Unicode.

However, the aforementioned utility functions should be available on OsStr for those who do need them. And if it is easier to work with raw paths, then more tools can just support them (shifting the weight in the trade-off between requiring Unicode to keep the code easy to write, and making the tool more versatile but harder to write).


Because these operations are not properly defined on arbitrary OsStrs; that's the reason the data is intended to be accessed using OS-specific extension traits. On Unix, if you want to treat the encoded bytes as an opaque blob of bytes, you can just use the methods on slice, as @H2CO3 suggested above. On Windows, on the other hand, an OsStr is not arbitrary bytes; it's actually a (possibly invalid [1]) UTF-16 string encoded using the WTF-8 [2] scheme, so even a seemingly simple operation like stripping a suffix cannot be properly defined on OsStr: you either convert it to str, or you decode it and re-encode it in a platform-specific way (e.g. on Windows: OsStrExt::encode_wide).
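
(For illustration, here is roughly what the per-platform route looks like; the helper name is invented:)

    use std::ffi::OsStr;

    // Invented helper: ASCII prefix check, done natively on each platform.
    #[cfg(unix)]
    fn has_modify_prefix(name: &OsStr) -> bool {
        use std::os::unix::ffi::OsStrExt;
        name.as_bytes().starts_with(b"modify_")
    }

    #[cfg(windows)]
    fn has_modify_prefix(name: &OsStr) -> bool {
        use std::os::windows::ffi::OsStrExt;
        // Compare in the native UTF-16 code-unit space.
        let prefix: Vec<u16> = "modify_".encode_utf16().collect();
        name.encode_wide().take(prefix.len()).eq(prefix.iter().copied())
    }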

The relevant code in std for the Windows OsStr type can be found here:

Yes, I totally agree. The trade-off is everywhere when it comes to systems programming. For me, the fact of having UTF-8 as the single encoding for the built-in string types is itself a big reason to choose the Rust language. Even when I wrote in other languages, say C++, I mostly applied the principle of using UTF-8 internally and only converting at system-call boundaries; this principle [3] helps my sanity when dealing with Unicode.


  1. That's why it cannot be encoded as UTF-8 ↩︎

  2. Yes, that's the formally published name ↩︎

  3. Published at https://utf8everywhere.org ↩︎

If the suffix you are checking for/removing is a valid OsStr as well, surely it will be fine to remove it? On Unix it is obviously okay (the result may not be valid UTF-8, but neither, necessarily, was the input, nor does the output have to be).

On Windows it should also be fine, unless I'm misunderstanding the way in which the UTF-16 can be invalid (that is, an unpaired surrogate, which to my understanding means we are referring to a character that needs more than 16 bits to represent, but the second 16-bit code unit for it is missing)?

Taking strip_suffix as an example:

  • Removing a valid UTF-16 suffix from a valid UTF-16 string (neither cut off in the middle of a code point) is fine (obviously).
  • Removing an invalid UTF-16 suffix from an invalid UTF-16 string may produce either valid or invalid UTF-16. But since the OS already permits invalid UTF-16, we are just back where we started.
  • Removing an invalid suffix from a valid string will either not change the string (suffix not matched) or produce an invalid string (again, fine as per argument above).
  • Removing a valid suffix from an invalid string is similar to the above case.

As long as I let the standard library convert my suffix or prefix string (e.g. "modify_") to a suitable platform specific representation (i.e. OsString) first, I don't see the issue.

Is the problem that we are not working directly with UTF-16 but have it encoded as this WTF-8? I had a look at the WTF-8 encoding, but I'm not an expert in this field and I'm in over my head. I'm guessing here: is it maybe that there are different kinds of invalid (and we want to avoid producing some that are even more screwed up than unpaired surrogates)?
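
(To make the failure mode concrete, a small Windows-only sketch using std's OsStringExt; the function name and the particular code unit are chosen arbitrarily:)

    #[cfg(windows)]
    fn unpaired_surrogate_demo() {
        use std::ffi::OsString;
        use std::os::windows::ffi::OsStringExt;

        // 0xD800 is a lead surrogate with no trailing half: a legal
        // Windows OS string, but not valid Unicode.
        let weird = OsString::from_wide(&[0xD800, b'a' as u16]);
        assert!(weird.to_str().is_none());
    }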


You are correct. The best kind of “correct”: “technically correct”.

Yes, it's possible to safely remove prefix or suffix from WTF-8.

But it's not, safely, possible to concatenate them!

And I have yet to see any program which wants to split a Path into arbitrary pieces but then never concatenates those pieces.

Maybe if you explained what you are doing that needs the capability to split, but not the capability to concatenate, it would be easier to reason about.

The problem is that we are not dealing with UTF-16. On Windows we are dealing with UCS-2, which is sometimes interpreted as UTF-16.

As in: the low-level APIs work with UCS-2 and don't bother with such stupid things as surrogate pairs and/or emoji, while the high-level APIs treat the same sequences as [potentially invalid] UTF-16.

And WTF-8 makes the whole thing representable as UTF-8 if we are lucky and the original string was valid UTF-16. Otherwise it's not UTF-8 but something else, yet said “something else” is safe to use with the raw functions which open (remove, rename, and so on) files.


To my understanding, you are correct, since the subject and the suffix are both in the same encoding. (I think the reason std doesn't include this functionality is that these operations are not commonly used, or maybe they want to encourage users to use UTF-8? Just guessing.)

Not to my knowledge; I think WTF-8 only differs from UTF-8 in supporting unpaired UTF-16 surrogate code units. I might be wrong though.


It also mandates that encoded surrogate halves which would form a valid pair must not be adjacent (they have to be combined into a single four-byte sequence instead), so there are no problems going from WTF-8 to UCS-2 and back.

And while that, technically, doesn't prevent functions that strip a suffix and/or prefix, you lose the ability to concatenate such strings.

One could probably implement the things that are safe to do on these strings in some kind of crate and see whether it would be usable in practice, before asking to add these facilities to std.

My guess would be that you would need something like ASCII-only strings and then these would be usable for some common manipulations on filenames.


Why is that, though? Since the string doesn't have to be valid UTF-8 (*nix) or UTF-16 (Windows), the worst you can end up with is creating another invalid such string, right? I don't believe we can ever have UTF-16 so invalid that it is not a multiple of 2 bytes (so getting "out of alignment" when concatenating should be impossible).

I do have some operations that don't concatenate:

  • I want to know if a file is of the form modify_{arbitrary name}.tmpl and if so get the arbitrary name in the middle. That involves prefix/suffix stripping without concatenation.

That said, I also need to go the other way: given "arbitrary name", construct modify_{arbitrary name}.tmpl, which I thought would be OK regardless of what the arbitrary name is, as long as the OS is okay with that "arbitrary name" to begin with.

Basically my problem involves working with filenames following certain patterns and finding related files with filenames derived from the main file. I don't particularly care what encoding the OS uses for those file names. And I know that my prefixes/suffixes are in the ASCII subset, which can on all platforms be encoded into the OS native encoding (a no-op on *nix, conversion to UTF-16 on Windows).
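
(For what it's worth, the constructing direction already works safely in plain std, since OsString::push re-encodes per platform. A minimal sketch, with an invented helper name:)

    use std::ffi::{OsStr, OsString};

    // Build `modify_{name}.tmpl` from an arbitrary OS-native `name`.
    // `OsString::push` handles the platform encoding, so no UTF-8
    // assumption is needed for `name` itself.
    fn script_name_for(name: &OsStr) -> OsString {
        let mut out = OsString::from("modify_");
        out.push(name);
        out.push(".tmpl");
        out
    }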

Ah okay, now it finally makes sense why it is a problem. I don't believe it would be an issue if OsStr worked directly in UTF-16 though (at most combining two strings would produce a valid "ghost" character in the middle: garbage in, garbage out)?

That said, why does Rust use WTF-8 instead of the native Windows "not-quite-UTF-16" for OsStr on Windows? I imagine there is a lot of overhead in converting back and forth. And most operations will stay in the domain of OsStr rather than needing to convert to String/str.

You may know that, but the Rust type system wouldn't know that. And that's why it's better to create your own crate for this, and then discuss what can be done safely and how, instead of asking std to provide that facility.

Maybe it may eventually even be adopted into std, but not before it has been sufficiently battle-tested and reviewed.

You forgot some important words. I'll fix it for you:

And most operations in some very rare programs will stay in the domain of OsStr rather than needing to convert to String/str while the majority of programs will suffer.

Just read utf8everywhere again.

This makes everyone aware of that difference, which is wrong: in many cases libraries don't even need to know what a filename is; it's something that they may use to open a file, period. Such libraries work fine right now and would stop working fine in your C++-like world (C++-like because C++ does what you are proposing, and that's one of the reasons why there are so many C++ libraries which only work on Windows or only work on Linux).

Yes, there is some overhead, but not too much compared with the speed of file operations themselves. They are awfully slow on Windows.

If you want that, then the documentation very explicitly makes that possible.

But only for programs and libraries that do care about these things.

For 90% (if not 99%) of use cases, “try to convert OsStr to str and panic if that doesn't work” is a much better approach.
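
(A minimal sketch of that approach; the helper name is made up:)

    use std::path::Path;

    // Convert-or-panic: fine for the 90% case described above.
    fn utf8_name(path: &Path) -> &str {
        path.to_str()
            .unwrap_or_else(|| panic!("non-UTF-8 path: {path:?}"))
    }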


Instead of responding to individual posts in this thread, I'm going to throw out some Rust history and explanation for the lack of API.

Pre-history: why not use a native (wide) OsStr encoding on Windows? Probably due to wanting cheap conversion to UTF8/str and "UTF8 everywhere" buy-in among the lib/lang developers more generally. So you need a UTF8-esque encoding that can handle unpaired surrogates. Hence WTF8. OsStr being some sort of str superset has been effectively guaranteed by the existence of pointer conversion traits to str since before Rust 1.0, so that's not going away.[1]

But the WTF8 encoding was never meant to be exposed. This put a lot of restrictions on what you can do with OsStrs for most of Rust's history to date. Like if you want to parse a CLI arg --some-arg=a-filesystem-path while still allowing non-unicode,[2] you'll probably be dealing with bytes on Unix and the like, and on Windows maybe having a wide encoding round-trip (...but more likely just punting on non-unicode paths on Windows and other no-byte-access platforms).

This was recognized as a pain point, but for a long time, the determination not to expose WTF8 prevailed. Instead, the idea was, the standard library will provide a more generic framework for doing stringy operations via traits. There was the needle API RFC... but that was eventually withdrawn. There's also the OsStr pattern RFC... which hasn't been withdrawn but has languished for 5 years. One of the challenges here is that slicing WTF8 is tricky, which is what OMGWTF8 is about (details in the RFC[3]).

So better solutions have been Coming Soon :tm: for quite some time. In the meanwhile, you're not the only one to be frustrated by the lack of API.

20 months ago or so, this thread opened to discuss exposing the WTF8 bytes anyway. This still wouldn't solve everything, because there are still splitting issues. In fact, as far as I could tell from the conversation, if the APIs above had just landed, this probably wouldn't be considered at all; exposing the bytes is simultaneously more and less than what is needed. Things moved forward on the "expose bytes" front anyway, this PR created the bytes API, and it landed as the encoded bytes API... in Rust 1.74, two weeks ago. They've tried to maintain their flexibility with their "unspecified, platform-specific, self-synchronizing superset of UTF-8" verbiage,[4] but probably the cat is out of the bag and WTF8 will start being serialized by Rust apps etc.

But we still don't have non-manual splitting and the like, and we can't get splitting on OsStr (versus str or char or "non-empty UTF8 substrings") until the splitting issues discussed in RFC 2295 are settled. There are still open ACPs, etc, about these issues with tons more discussion and link-chasing opportunities for the curious.


In summary, doing the right thing to support all paths has sucked for quite some time, and still sucks, but there has been some recent incremental improvement with hopefully more to come (stripping UTF8 prefixes say). But fully realized OsStr splitting is probably still far away, if ever (in std anyway).


  1. Technically they could store multiple encodings or such, I think, but nothing remotely likely to land. ↩︎

  2. specifically non-UTF8 on unices ↩︎

  3. a solution is proposed, but it has some oddities around unexpected slice lengths ↩︎

  4. I'm partially to blame for that wording I guess, due to pointing out that the promises that have been made are stronger than "superset" ↩︎


Thank you for your detailed answer.

A proper API that "just works" does indeed seem very difficult. Ideally basic splitting and merging (and possibly even formatting) should "just work" (somehow). On Unix it doesn't seem difficult (it is just bytes after all). As usual Windows throws a wrench into the gears though.

For now I decided to drop support for non-UTF-8 paths and go with camino. It is less than ideal, but I doubt any users would use anything except Unicode paths. And trying to do it properly is just too painful currently. I don't want to have to write cfg directives based on platform (especially since I can't even test on Windows or Mac OS X except in GitHub CI, though based on bug reports I know I have at least one user on each of those).
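
(For reference, the conversion at the camino boundary is tiny; a sketch assuming camino and anyhow as dependencies, with an invented helper name:)

    use camino::Utf8PathBuf;

    // Reject non-UTF-8 paths up front, at the boundary.
    fn to_utf8_path(p: std::path::PathBuf) -> anyhow::Result<Utf8PathBuf> {
        Utf8PathBuf::from_path_buf(p)
            .map_err(|p| anyhow::anyhow!("Path is not valid UTF-8: {p:?}"))
    }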

Should the question of non-Unicode paths ever come up, I'll just point here for the reason it is the way it is, I guess :person_shrugging:

Come to think of it: that would only work on Windows, wouldn't it? Since on *nix the superset is not WTF8 but just raw bytes (the native OS strings are raw bytes, while WTF8 restricts us to UTF-8 plus encodings of unpaired surrogates). Or am I missing something?

In fact, on *nix it may not even be a superset of UTF-8. What if I have a name that looks like valid UTF-8 but is actually valid (with a different interpretation) in some weird encoding such as Shift-JIS, MacRoman or some old DOS code page? I might even have names that look like invalid UTF-8 (e.g. a byte sequence that starts a multi-byte UTF-8 character with leading 1 bits but never completes it) yet are perfectly valid in some other encoding.

Why are “raw bytes” not an “unspecified, platform-specific, self-synchronizing superset of UTF-8”?

Any UTF-8 sequence is valid, and there is no need to self-synchronize with anything since there are no forbidden subsequences.

Of course it's a superset! If you allow any random sequence of bytes, then you allow all kinds of UTF-8 strings, too!

I think you are mixing up what a superset is and what a subset is.

You're correct that it's still platform specific, in that you can't deserialize any sequence of bytes to WTF8. But you could already get the OsStr bytes on unix, because that matches what the paths/etc are on the OS -- bytes. The WTF8 encoding wasn't exposed[1] until two weeks ago, because it doesn't match what the paths are on that OS (sequences of 16-bit integers). But now you no longer have to use cfg blocks or WTF8-reimplementing crates or whatever to grab the bytes and dump them to a file; you can easily get the bytes on stable on any platform. So it's trivial to use them with io::Write/write! in safe code.[2]
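
(For example, a sketch of a portable "dump the raw name" helper, no cfg required; the function name is made up:)

    use std::io::Write;
    use std::path::Path;

    // Write a path's encoded bytes (WTF8 on Windows, raw bytes on Unix)
    // to any writer, with no platform-specific code.
    fn write_path(mut out: impl Write, path: &Path) -> std::io::Result<()> {
        out.write_all(path.as_os_str().as_encoded_bytes())
    }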

AFAIK they're trying to hold on to the "don't de/serialize WTF8 or otherwise count on it being some stable encoding forever" dream, but IMO that's just not realistic[3] now that they've released the bytes in a simple, stable, and blessed fashion. You can go the other way (deserialize bytes to WTF8 on Windows) with std too now, albeit with unsafe. The point of the unsafe is "don't do that", but my bet is people just will,[4] because the alternative is so awful to code.[5] But possibly I'm wrong. Or maybe the number of care-about-non-unicode-on-non-unix programs/programmers is so small it doesn't really matter in practice.

I'm not sure I even mind, really; I don't have any stake in keeping WTF8 sealed up, and it's probably going to improve my life with respect to the pains discussed in this topic. That said, it was a bit frustrating from a technical or design standpoint to see this land when all that was really required for the interested parties, AFAICT, was some subset of an accepted RFC -- or even just an iterator/collector over Either<&str, &OsStr>. They didn't have to give up on their WTF8 dream, they just had to make progress elsewhere.

Side note: I'm usually right there with you on the "as usual, Windows" front, but in their defense, Windows (and Java and Javascript and others) did get screwed by the "16 bits per character [ought to be enough for anybody]" vision of Unicode version 1. Under that vision, "any sequence of 16-bit values" plays the same role as "any sequence of bytes" does on Unix: a representation that doesn't require validation and can just be treated as a sequence of values at the lower levels.[6] With the added benefit of being Unicode/UCS compatible without the whole mess of locales! But when Unicode expanded past 16 bits and the surrogate pair requirement was introduced (for UTF-16), some sequences of 16-bit values can no longer be mapped to Unicode. That put Windows and the other platforms back in the same place as Unices in the sense that not all valid OS strings are valid Unicode.[7] Meanwhile UTF-8[8] took it in stride.

Or in short: early Unicode adopters got screwed by buying into UCS2 when the official recommendation was to use 16 bits with the promise of being universal; UTF-8 and the admission that 16 bits is not enough came later.[9]

The (unix) OS doesn't care about your locale, it's all just bytes.[10] And "just bytes" is a superset of UTF8. Interpretation of what the exact encoding (locale) is happens in higher-level things like your shell or pager or desktop software or whatever; users may see junk at those higher levels, but all you have to do at a lower level is preserve the bytes.

This is at least somewhat true of any Unicode or mostly-Unicode platform like Windows too, because inline language tagging was deprecated in Unicode -- there's no way to know the proper way to display text without an external language tag. The consortium downplays this, but they downplay most negative aspects of Han unification; so I just don't know how big of an issue it actually is, outside of being politically important in some venues.


  1. in any sanctioned way ↩︎

  2. I imagine my cfg-laden StdOut-bound filename writers will eventually use this instead, say; this may error on Windows, but I'm currently definitely erroring on non-unicode Windows paths, so it's actually an improvement. ↩︎

  3. https://www.hyrumslaw.com/ ↩︎

  4. definitely they will if all they need is serialization/output -- no unsafe required ↩︎

  5. Or a crate will supply the safe/"safe", WTF8-assuming version. ↩︎

  6. and is in some sense thus also robust to cross-language filename sharing and some types of corruption, in that having a "garbage" file name (as interpreted by any given locale) is still usable... assuming your tools support all valid filenames as defined by the OS. ↩︎

  7. In practice no-one on those platforms seems to care about unpaired surrogates -- i.e. non-unicode OS strings. (At least until there's another decoding CVE.) Hence in part why it's common to just punt on non-unicode on Windows. The other major reason in Rust being the required re-encoding roundtrip, IMO, which also means it just can't be done in place -- i.e. by converting a reference. ↩︎

  8. which had been invented in the interim, is inherently mixed-width and validation-requiring, and has the benefit of being ASCII compatible ↩︎

  9. And the "encoding where all bit patterns are valid unicode" vision is mostly dead. ↩︎

  10. Fiddly corner case: paths and args and so on can't have NULs, but if you pass a path with an embedded NUL as a C string it will "just" get truncated or chopped into two on the other side typically. Anyway this applies equally to "bytes" or "UTF-8" since NUL is ASCII. ↩︎
