Does `OsString`'s `From<String>` allocate?

zeta12ti · May 19, 2020, 11:15am

I was reading through the source code starting from PathBuf's From<String> implementation. The documentation says

This conversion does not allocate or copy memory.

It looks like it delegates to PathBuf::from::<OsString> and OsString::from::<String>.

The documentation for the former says

This conversion does not allocate or copy memory.

Since this method simply sets OsString as a PathBuf field, that makes sense.

However, the documentation for OsString::from::<String> says

The conversion copies the data, and includes an allocation on the heap.

Looking further into Buf::from_string (which OsString::from::<String> calls), it looks like there's a version for Windows and a version for everything else. The version for everything else doesn't have any clear allocations (it just stores the bytes of the string directly). The version for Windows delegates to Wtf8Buf::from_string, but that doesn't seem to allocate either.

So does OsString::from::<String> actually allocate? Is the documentation for PathBuf::from::<String> and PathBuf::from::<OsString> wrong? Did I follow the chain of delegations incorrectly somewhere?

chrisd · May 19, 2020, 11:55am

I agree: there don't seem to be any allocations in the code, unless I too am missing something.

I see no reason why there should be, because:

A Rust String is a UTF-8 string.
An OsString is defined as a superset of a UTF-8 string on all platforms. It's a UTF-8 string that may or may not contain illegal values.

So every valid String is a valid OsString. Conversion just needs to move ownership of the underlying buffer.
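One way to check this from the outside is to compare the data pointer before and after the conversions. The following is only a sketch: the helper name `conversion_reuses_buffer` is made up for illustration, and it observes what the current implementation does rather than what the documentation guarantees.

```rust
use std::ffi::OsString;
use std::path::PathBuf;

/// Hypothetical helper: returns true if String -> OsString -> PathBuf
/// kept the same heap buffer (i.e. the data pointer never changed).
fn conversion_reuses_buffer(s: String) -> bool {
    let before = s.as_ptr();

    let os = OsString::from(s);
    // `to_str` only borrows the existing bytes, so its pointer is the buffer's pointer.
    let after_os = os
        .to_str()
        .expect("built from a String, so valid UTF-8")
        .as_ptr();

    let path = PathBuf::from(os);
    let after_path = path
        .to_str()
        .expect("built from a String, so valid UTF-8")
        .as_ptr();

    before == after_os && before == after_path
}

fn main() {
    let s = String::from("/tmp/example.txt");
    println!("buffer reused: {}", conversion_reuses_buffer(s));
}
```

If the conversions really do just move ownership of the buffer, this should report that the buffer was reused on both Windows and Unix-like targets.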

tesuji · May 19, 2020, 12:31pm

Is that true for Windows OsString?

chrisd · May 19, 2020, 12:34pm

Yes. An OsString is always a superset of UTF-8. On Windows the standard library will convert the OsString from/to a UTF-16 vector on every call to the Windows API.

If you call the Windows API yourself you have to manually do the conversion (or use a crate that does it for you).

DoumanAsh · May 19, 2020, 2:04pm

To be specific, on Windows it is a Vec<u8> like any other string, but encoded as WTF-8 (due to Windows being quite a nasty and stupid platform when it comes to Unicode).

chrisd · May 19, 2020, 2:14pm

That last part is incorrect. Windows has a very strong convention of using Unicode and has done so for far longer than Linux has.

The difference is that Linux applications now (mostly) use UTF-8 by default, whereas the Windows kernel uses UTF-16. I wouldn't call UTF-16 "nasty and stupid" even if UTF-8 would be preferable nowadays.

jameseb7 · May 19, 2020, 2:14pm

Specifically, Rust provides encode_wide() to do this conversion on Windows. There is also a from_wide() for converting the other way.

Nemo157 · May 20, 2020, 6:28am

Windows uses a modified form of UTF-16 which allows unpaired surrogates. This is why OsString uses the WTF-8 encoding on Windows instead of simply wrapping a UTF-8 string: there are strings that Windows considers valid which are invalid Unicode strings.
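To make the surrogate problem concrete, here is a small portable sketch using only `String::from_utf16` and `String::from_utf16_lossy`. The helper name `is_well_formed_utf16` is invented for this example. A lone high surrogate is a perfectly storable sequence of `u16`s, but it cannot become a Rust `String`.

```rust
/// Hypothetical helper: true if `units` is well-formed UTF-16,
/// i.e. every surrogate is part of a proper high/low pair.
fn is_well_formed_utf16(units: &[u16]) -> bool {
    String::from_utf16(units).is_ok()
}

fn main() {
    // A proper surrogate pair encoding U+1F600.
    let paired: [u16; 2] = [0xD83D, 0xDE00];
    // 'A' followed by a lone high surrogate with no partner.
    let lone: [u16; 2] = [0x0041, 0xD800];

    println!("paired is well-formed: {}", is_well_formed_utf16(&paired));
    println!("lone is well-formed: {}", is_well_formed_utf16(&lone));

    // The lossy conversion substitutes U+FFFD for the unpaired surrogate.
    println!("lossy: {:?}", String::from_utf16_lossy(&lone));
}
```

WTF-8 exists so that sequences like `lone` can still be represented in a byte buffer that is otherwise laid out like UTF-8.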

Hyeonu · May 20, 2020, 6:51am

To be fair, Linux uses a modified form of UTF-8 which allows arbitrary byte sequences, as long as they contain only one nul character ('\0'), at the end.

chrisd · May 20, 2020, 9:15am

That's why I called it a "strong convention". WTF-8 is only "necessary" because Rust really wants an OsString to be a sequence of u8s, not u16s.

The name is a constant source of confusion on Windows.

Nemo157 · May 20, 2020, 9:38am

Linux does not use any form of Unicode, it just uses arbitrary byte sequences without internal 0 bytes, which a lot of userspace utilities then attempt to interpret as UTF-8.

No, if Rust's str was UTF-16 it still wouldn't be compatible with Windows APIs, there would have to be a separate OsString that allows unpaired surrogates still. Though there would be a zero-copy conversion from String to OsString since it would be a superset of UTF-16.

This is basically the same as what happens with str and OsStr on Linux today, str can contain a subset of what OsStr can, so there's a zero-copy conversion from str to OsStr but converting back is fallible.
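The asymmetry can be sketched with the public API. The function names below are hypothetical, and the non-UTF-8 value is built with the Unix-only `OsStringExt::from_vec`, so on other targets that part of the example is simply skipped.

```rust
use std::ffi::{OsStr, OsString};

/// Hypothetical helper: view a `str` as an `OsStr` (a borrow, no copy)
/// and then try to view it as a `str` again.
fn view_as_os_str_and_back(s: &str) -> Option<&str> {
    let os: &OsStr = OsStr::new(s);
    os.to_str()
}

/// Hypothetical helper: an `OsString` that is not valid UTF-8 (Unix only).
#[cfg(unix)]
fn non_utf8_os_string() -> Option<OsString> {
    use std::os::unix::ffi::OsStringExt;
    // 0x80 on its own is a stray continuation byte, so this is not UTF-8.
    Some(OsString::from_vec(vec![b'f', b'o', 0x80]))
}

/// On other targets this sketch does not construct such a value.
#[cfg(not(unix))]
fn non_utf8_os_string() -> Option<OsString> {
    None
}

fn main() {
    let s = "hello";
    let back = view_as_os_str_and_back(s).expect("valid UTF-8 round-trips");
    println!("same pointer: {}", back.as_ptr() == s.as_ptr());

    if let Some(os) = non_utf8_os_string() {
        // Going back to String is fallible; on failure the OsString is returned.
        match os.into_string() {
            Ok(text) => println!("unexpectedly valid: {text}"),
            Err(original) => println!("not UTF-8, got the OsString back: {:?}", original),
        }
    }
}
```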

chrisd · May 20, 2020, 9:50am

I think I've not made myself clear. Rust's requirement that OsStr be zero-copy from a str is what makes WTF-8 necessary. If that wasn't true then an OsStr could live up to its name and be a more OS-native string type.

As it is, an OsStr, despite its name, is not an OS-native string type.

zeta12ti · May 20, 2020, 9:52am

That finally brings us back to the topic of this post. Do you have any evidence that the conversion from String to OsString isn't zero-copy (on Windows)? Here's the source of Wtf8Buf::from_string (linked in the OP).

```rust
#[inline]
pub fn from_string(string: String) -> Wtf8Buf {
    Wtf8Buf { bytes: string.into_bytes() }
}
```

string.into_bytes() is zero-copy, and the function isn't doing anything with those bytes: it's just setting a field value. In case you aren't convinced, OsString is internally just a Buf

```rust
pub struct OsString {
    inner: Buf,
}
```

and a Buf (on Windows) is just a Wtf8Buf.

```rust
#[derive(Clone, Hash)]
pub struct Buf {
    pub inner: Wtf8Buf,
}
```
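Since `Wtf8Buf` is internal to the standard library, the claim about `into_bytes()` can be checked with a stand-in. The `ByteBuf` type and the `same_allocation` helper below are hypothetical and only mirror the shape of the code quoted above; they are not the real types.

```rust
/// Hypothetical stand-in for the standard library's internal `Wtf8Buf`.
struct ByteBuf {
    bytes: Vec<u8>,
}

impl ByteBuf {
    /// Same shape as the quoted `from_string`: store the String's bytes as-is.
    fn from_string(string: String) -> ByteBuf {
        ByteBuf { bytes: string.into_bytes() }
    }
}

/// Hypothetical helper: true if the pointer and capacity survive the conversion.
fn same_allocation(s: String) -> bool {
    let (ptr, cap) = (s.as_ptr(), s.capacity());
    let buf = ByteBuf::from_string(s);
    buf.bytes.as_ptr() == ptr && buf.bytes.capacity() == cap
}

fn main() {
    println!("same allocation: {}", same_allocation(String::from("C:\\temp\\file.txt")));
}
```

An unchanged pointer and capacity mean the `Vec<u8>` inside the wrapper is the very buffer the `String` owned, which is what "zero-copy" means here.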

Nemo157 · May 20, 2020, 9:58am

No, I just managed to misread what I quoted

Nope, the purpose of using the WTF-8 encoding instead of something more OS-native like [u16] is to guarantee that this conversion is zero-copy.
