Split a file name at the first dot

I'm trying to figure out how to implement the following transformation on a std::path::Path: given foo/bar/old_name.x.y.z, produce foo/bar/new_name.x.y.z. I can accomplish something similar using Path::file_stem and Path::extension, but these methods split a file name at the last . instead of the first. I imagine there are functions on &str that do what I want, but I'd like to avoid coercing from Path if possible to deal cleanly with the invalid-Unicode case. Is there a good way to solve this problem using std, or a third-party crate that would be helpful? Thanks!
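For reference, here is a minimal sketch of the behavior described in the question. The helper name `stem_and_extension` is made up for illustration; it only wraps the two std methods to show that both split at the last dot.

```rust
use std::ffi::OsStr;
use std::path::Path;

/// Hypothetical helper: returns (file_stem, extension) exactly as std computes them.
fn stem_and_extension(path: &Path) -> (Option<&OsStr>, Option<&OsStr>) {
    // Both methods look for the LAST '.' in the file name.
    (path.file_stem(), path.extension())
}
```

For foo/bar/old_name.x.y.z this gives a stem of old_name.x.y and an extension of z, so the two methods cannot be combined directly to replace only old_name.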

std::path::{Path, PathBuf} use std::ffi::{OsStr, OsString} as their backend.
The internal representations of those types are platform-specific, so you need to write platform-specific code to manipulate such data.

Thanks, that makes sense. I was hoping someone else had already written the necessary platform-specific code for me to call, but it shouldn't be too hard to implement myself if not.

Interestingly the Path implementation uses unsafe code to do this:

I suppose you could reproduce that, substituting rsplitn with splitn.

Ah, yeah, that does seem like the way to go. Thanks for the pointer.

Transmutation between [u8] and OsStr depends heavily on knowledge of libstd internals.
That layout is not guaranteed, and downstream users (including us) should not copy libstd's unsafe code: it could change silently in the future.

Right. The marked solution here is not the solution. Transmuting an OsStr to a [u8] depends on the internal representation that is not a part of the public API, and could potentially break or otherwise lead to UB.

Personally, if this were me, I'd convert the path to a &[u8] using the Unix-specific API (which is free). On Windows, I'd attempt to convert the path to a &str (and then a &[u8]) via OsStr::to_str and then handle the rest from there. If the Windows conversion fails, then I'd log an error and move on to the next path. You might feel justified in doing this because Windows paths with invalid UTF-16 are quite uncommon. bstr provides the aforementioned conversion routine for you, which would absolve you from having to write platform-specific code.
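A rough sketch of that approach using only std, not bstr. The function names `replace_prefix` and `rename` are hypothetical, and returning None stands in for the "log an error and move on" step. It also assumes that a leading dot (as in .bashrc) counts as the first dot, which may not be what you want for hidden files.

```rust
use std::ffi::{OsStr, OsString};
use std::path::{Path, PathBuf};

/// Hypothetical helper: replace everything before the first '.' of the
/// file name with `new_stem`. Returns None if the path has no file name or,
/// on non-Unix platforms, if the file name is not valid Unicode.
fn replace_prefix(path: &Path, new_stem: &str) -> Option<PathBuf> {
    let name = path.file_name()?;
    let new_name = rename(name, new_stem)?;
    Some(path.with_file_name(new_name))
}

#[cfg(unix)]
fn rename(name: &OsStr, new_stem: &str) -> Option<OsString> {
    use std::os::unix::ffi::{OsStrExt, OsStringExt};
    // On Unix an OsStr is arbitrary bytes, so this view is free and lossless.
    let bytes = name.as_bytes();
    // Keep everything from the first '.' onward (nothing if there is no dot).
    let tail: &[u8] = match bytes.iter().position(|&b| b == b'.') {
        Some(i) => &bytes[i..],
        None => &[],
    };
    let mut out = new_stem.as_bytes().to_vec();
    out.extend_from_slice(tail);
    Some(OsString::from_vec(out))
}

#[cfg(not(unix))]
fn rename(name: &OsStr, new_stem: &str) -> Option<OsString> {
    // Gives up (None) on names that are not valid Unicode.
    let s = name.to_str()?;
    let tail = s.find('.').map_or("", |i| &s[i..]);
    Some(OsString::from(format!("{new_stem}{tail}")))
}
```

With this sketch, replace_prefix(Path::new("foo/bar/old_name.x.y.z"), "new_name") yields foo/bar/new_name.x.y.z, and a name without any dot is replaced wholesale.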

I think the ideal solution to this is to wait for string APIs to come to OsStr. An RFC has already been merged: RFC 2295 (2295-os-str-pattern.md in rust-lang/rfcs on GitHub).

Otherwise, if you need 100% correctness today, then the only way to do it as far as I can see is to use platform specific APIs. That means manipulating &[u16] on Windows (and paying for an allocation+copy or two).
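A sketch of what the &[u16] route could look like. The names `replace_prefix_wide` and `replace_prefix_windows` are hypothetical. The splitting logic is written against plain slices so it can be exercised on any platform; only the wrapper touches the Windows-specific encode_wide and from_wide APIs.

```rust
/// Hypothetical platform-independent core: keep everything from the first
/// '.' code unit onward and put `new_stem` in front of it.
fn replace_prefix_wide(name: &[u16], new_stem: &[u16]) -> Vec<u16> {
    // '.' is a single UTF-16 code unit (0x002E) and can never be half of a
    // surrogate pair, so a plain search is safe even on ill-formed input.
    let dot = u16::from(b'.');
    let tail = match name.iter().position(|&u| u == dot) {
        Some(i) => &name[i..],
        None => &name[name.len()..],
    };
    let mut out = new_stem.to_vec();
    out.extend_from_slice(tail);
    out
}

#[cfg(windows)]
fn replace_prefix_windows(
    path: &std::path::Path,
    new_stem: &std::ffi::OsStr,
) -> Option<std::path::PathBuf> {
    use std::ffi::OsString;
    use std::os::windows::ffi::{OsStrExt, OsStringExt};
    // Each collect() here is one of the allocation+copy steps mentioned above.
    let name: Vec<u16> = path.file_name()?.encode_wide().collect();
    let stem: Vec<u16> = new_stem.encode_wide().collect();
    let new_name = OsString::from_wide(&replace_prefix_wide(&name, &stem));
    Some(path.with_file_name(new_name))
}
```

Unpaired surrogates pass through untouched, so this handles file names that to_str would reject.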

To expand on this, when the standard library gets a string from Windows (e.g. by listing files from a directory) it arrives as a u16 array and gets immediately re-encoded to WTF-8 (a superset of UTF-8 encoding). But WTF-8 is hidden from anything outside the standard library so the only (stable) way to operate on all possible Windows strings is to re-encode it again so as to recreate the original u16 array.

Or to put it another way, the conversion goes:

[u16] → WTF-8 → [u16]

This is totally redundant in this case, but it's unavoidable at the moment unless you bypass std and call Windows APIs directly (which I wouldn't particularly recommend). As BurnSushi says, the alternative is to give up on full correctness and simply use to_str, accepting that you won't be able to operate on any file names that aren't valid Unicode (on Windows, names containing unpaired surrogates).

Thanks, this is very helpful! (And I've updated the marked solution accordingly.)

Thanks for pointing that out!
