Does `OsString`'s `From<String>` allocate?

I was reading through the source code starting from PathBuf's From<String> implementation. The documentation says

This conversion does not allocate or copy memory.

It looks like it delegates to PathBuf::from::<OsString> and OsString::from::<String>.

The documentation for the former says

This conversion does not allocate or copy memory.

Since this method simply sets OsString as a PathBuf field, that makes sense.

However, the documentation for OsString::from::<String> says

The conversion copies the data, and includes an allocation on the heap.

Looking further into Buf::from_string (which OsString::from::<String> calls), it looks like there's a version for Windows and a version for everything else. The version for everything else doesn't have any clear allocations (it just stores the bytes of the string directly). The version for Windows delegates to Wtf8Buf::from_string, but that doesn't seem to allocate either.

So does OsString::from::<String> actually allocate? Is the documentation for PathBuf::from::<String> and PathBuf::from::<OsString> wrong? Did I follow the chain of delegations incorrectly somewhere?

I agree: there don't seem to be any allocations in the code, unless I too am missing something.

I see no reason why there should be because:

  • A Rust String is a UTF-8 string.
  • An OsString is defined to hold a superset of UTF-8 on all platforms. It's UTF-8-like data that may or may not contain sequences that are illegal in strict UTF-8.

So every valid String is a valid OsString. Conversion just needs to move ownership of the underlying buffer.
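One way to check this empirically is to compare the buffer address before and after each conversion. This is only a minimal sketch: `buffer_addresses` is a hypothetical helper name, and matching addresses are what you would expect to observe if the conversions merely move ownership. It is an observation about the current implementation, not a documented guarantee.

```rust
use std::ffi::OsString;
use std::path::PathBuf;

/// Returns the buffer address as seen through the `String`,
/// then through the `OsString`, then through the `PathBuf`.
fn buffer_addresses(s: String) -> (usize, usize, usize) {
    let before = s.as_ptr() as usize;

    // String -> OsString (the conversion the question is about).
    let os = OsString::from(s);
    let as_os = os.to_str().expect("built from valid UTF-8").as_ptr() as usize;

    // OsString -> PathBuf.
    let path = PathBuf::from(os);
    let as_path = path.to_str().expect("built from valid UTF-8").as_ptr() as usize;

    (before, as_os, as_path)
}

fn main() {
    let (before, as_os, as_path) = buffer_addresses(String::from("some/long/path/name.txt"));
    println!("{before:#x} {as_os:#x} {as_path:#x}");
    // If no copy happened, all three views point at the same heap buffer.
    assert_eq!(before, as_os);
    assert_eq!(as_os, as_path);
}
```

If either conversion allocated a fresh buffer and copied the bytes, the addresses would differ.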


Is that true for Windows OsString?

Yes. An OsString is always a superset of UTF-8. On Windows the standard library will convert the OsString from/to a UTF-16 vector on every call to the Windows API.

If you call the Windows API yourself you have to manually do the conversion (or use a crate that does it for you).


To be specific, on Windows it is a Vec<u8> like any other string, but encoded as WTF-8 (due to Windows being quite a nasty and stupid platform when it comes to Unicode).

That last part is incorrect. Windows has a very strong convention of using Unicode and has done so for far longer than Linux has.

The difference is that Linux applications now (mostly) use UTF-8 by default, whereas the Windows kernel uses UTF-16. I wouldn't call UTF-16 "nasty and stupid" even if UTF-8 would be preferable nowadays.

Specifically, Rust provides encode_wide() to do this conversion on Windows. There is also a from_wide() for converting the other way.


Windows uses a modified form of UTF-16 which allows unpaired surrogates. This is why OsString uses the WTF-8 encoding on Windows instead of simply wrapping a UTF-8 string: there are strings that Windows considers valid which are invalid Unicode strings.
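To make the unpaired-surrogate case concrete, here is a small sketch. The helper name `is_well_formed_utf16` is made up for illustration. The portable part shows that strict UTF-16 decoding rejects a lone surrogate; the Windows-only part assumes the `from_wide` and `encode_wide` extension methods from `std::os::windows::ffi` and is compiled only on that platform.

```rust
/// Returns true if `units` is well-formed UTF-16 (every surrogate is paired).
fn is_well_formed_utf16(units: &[u16]) -> bool {
    String::from_utf16(units).is_ok()
}

fn main() {
    // 'a' followed by one correctly paired surrogate (U+1F600).
    let paired: Vec<u16> = "a\u{1F600}".encode_utf16().collect();
    // 'a' followed by a high surrogate with no partner.
    let lone_high_surrogate: [u16; 2] = [0x0061, 0xD800];

    assert!(is_well_formed_utf16(&paired));
    assert!(!is_well_formed_utf16(&lone_high_surrogate));

    // On Windows, the ill-formed input is still accepted as an OsString.
    #[cfg(windows)]
    {
        use std::ffi::OsString;
        use std::os::windows::ffi::{OsStrExt, OsStringExt};

        let os = OsString::from_wide(&lone_high_surrogate);
        assert!(os.to_str().is_none()); // not representable as &str
        let back: Vec<u16> = os.encode_wide().collect();
        assert_eq!(back, lone_high_surrogate); // but it round-trips losslessly
    }
}
```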

To be fair, Linux uses a modified form of UTF-8 which allows arbitrary byte sequences, as long as the only nul character ('\0') is the one at the end.


That's why I called it a "strong convention". WTF-8 is only "necessary" because Rust really wants an OsString to be a sequence of u8s, not u16s.

The name is a constant source of confusion on Windows.

Linux does not use any form of Unicode; it just uses arbitrary byte sequences without internal 0 bytes, which a lot of userspace utilities then attempt to interpret as UTF-8.

No, if Rust's str was UTF-16 it still wouldn't be compatible with Windows APIs; there would still have to be a separate OsString that allows unpaired surrogates. There would, though, be a zero-copy conversion from String to OsString, since OsString would be a superset of UTF-16.

This is basically the same as what happens with str and OsStr on Linux today: str can contain a subset of what OsStr can, so there's a zero-copy conversion from str to OsStr, but converting back is fallible.
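The asymmetry can be sketched as follows. `os_str_view_shares_buffer` is a hypothetical helper name used only for this illustration. The pointer comparison shows the borrowed `OsStr` looking at the same bytes as the `str`, and the Unix-only part (which assumes `OsStrExt::from_bytes` from `std::os::unix::ffi`) shows the conversion back failing on non-UTF-8 data.

```rust
use std::ffi::OsStr;

/// Borrows `s` as an `OsStr` and reports whether that view
/// points at exactly the same bytes as the original `str`.
fn os_str_view_shares_buffer(s: &str) -> bool {
    // str -> OsStr: infallible, borrows in place.
    let os: &OsStr = OsStr::new(s);
    // OsStr -> str: fallible in general, so it returns an Option.
    match os.to_str() {
        Some(back) => back.as_ptr() == s.as_ptr() && back.len() == s.len(),
        None => false,
    }
}

fn main() {
    assert!(os_str_view_shares_buffer("hello"));

    // Unix-only: an OsStr may hold bytes that are not valid UTF-8,
    // in which case converting back to &str fails.
    #[cfg(unix)]
    {
        use std::os::unix::ffi::OsStrExt;
        let raw = OsStr::from_bytes(&[b'f', b'o', 0xFF]);
        assert!(raw.to_str().is_none());
    }
}
```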


I think I've not made myself clear. Rust's requirement that OsStr be zero-copy from a str is what makes WTF-8 necessary. If that wasn't true then an OsStr could live up to its name and be a more OS native string type.

As it is, an OsStr, despite its name, is not an OS native string type.

That finally brings us back to the topic of this post. Do you have any evidence that the conversion from String to OsString isn't zero-copy (on Windows)? Here's the source of Wtf8Buf::from_string (linked in the OP).

#[inline]
pub fn from_string(string: String) -> Wtf8Buf {
    Wtf8Buf { bytes: string.into_bytes() }
}

string.into_bytes() is zero-copy, and the function isn't doing anything with those bytes: it's just setting a field value. In case you aren't convinced, OsString is internally just a Buf:

pub struct OsString {
    inner: Buf,
}

and a Buf (on Windows) is just a Wtf8Buf:

#[derive(Clone, Hash)]
pub struct Buf {
    pub inner: Wtf8Buf,
}
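The claim that `string.into_bytes()` is zero-copy can be checked with a stand-in type, since `Wtf8Buf` itself is private to the standard library. In this sketch `BytesBuf` and `same_allocation` are hypothetical names; only the body of `from_string` mirrors the source quoted above.

```rust
/// Hypothetical stand-in for the standard library's private `Wtf8Buf`.
struct BytesBuf {
    bytes: Vec<u8>,
}

impl BytesBuf {
    /// Same shape as the quoted `Wtf8Buf::from_string`: take the String's bytes as-is.
    fn from_string(string: String) -> BytesBuf {
        BytesBuf { bytes: string.into_bytes() }
    }
}

/// Reports whether the buffer after conversion is the very same
/// allocation (same address, same capacity) as the input String's.
fn same_allocation(s: String) -> bool {
    let (ptr, cap) = (s.as_ptr(), s.capacity());
    let buf = BytesBuf::from_string(s);
    buf.bytes.as_ptr() == ptr && buf.bytes.capacity() == cap
}

fn main() {
    // Over-allocate on purpose so that a copy into a fresh, tightly
    // sized buffer would show up as a capacity change.
    let mut s = String::with_capacity(64);
    s.push_str("C:\\Users\\example");
    assert!(same_allocation(s));
}
```

The capacity check matters because a copying conversion would usually produce a buffer sized to the content rather than preserving the original spare capacity.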

No, I just managed to misread what I quoted :frowning:

Nope, the purpose of using the WTF-8 encoding instead of something more OS-native like [u16] is to guarantee that this conversion is zero-copy.
