Why does Rust have so many string representations?

As I started to use Rust more often (mainly because I think it's an interesting language and I wanted to get more used to it), I ran into a few things that seemed a little strange at first glance, like the relatively large number of string representations Rust has:

The first two are pretty easy to understand (both usage and raison d'etre):

  • std::str
  • std::string::String

The following are the ones where the story gets more convoluted:

  • std::ffi::CString
  • std::ffi::OsString
  • std::os::unix::ffi::OsStringExt
  • std::os::windows::ffi::OsStringExt
  • and their borrowed representation e.g: OsStr, CStr...etc

So it seems like the last list is comprised of specialized types for special use cases, but I started to encounter these types in some very simple cases like getting the filename of a Path:

// I just want a String !    
let name = String::from(path.file_stem().unwrap().to_str().unwrap());

It seems like many functions from Path make use of OsStr instead of &str or String, making it quite cumbersome to work with, at least with my current expertise in Rust, which to be frank is quite limited !

How could I refactor some code like the one in the example above to make it more readable and have some proper error handling, and why is it necessary to have all those different types ?

2 Likes

The String and &str types hold UTF-8 strings. Filenames on most platforms are not required to be valid UTF-8, so if Rust would use the standard string types for filenames, you could not access some files, and you couldn't even list them.
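Because a file name may not be valid UTF-8, the conversion to &str is fallible, which is why to_str() returns an Option. As a rough sketch, the snippet from the question could handle both failure cases instead of unwrapping them. The function name file_stem_utf8 and the StemError enum are made up for this illustration and are not part of std:

```rust
use std::path::Path;

/// Hypothetical error type for this sketch; not part of std.
#[derive(Debug, PartialEq)]
enum StemError {
    /// The path has no file name component (for example "/" or "..").
    NoFileName,
    /// The file stem exists but is not valid UTF-8.
    NotUtf8,
}

/// Returns the file stem as an owned String, or the reason it could not.
fn file_stem_utf8(path: &Path) -> Result<String, StemError> {
    let stem = path.file_stem().ok_or(StemError::NoFileName)?;
    let stem = stem.to_str().ok_or(StemError::NotUtf8)?;
    Ok(stem.to_owned())
}
```

The caller can then decide whether a non-UTF-8 name should be skipped, reported, or handled as an OsStr instead.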

The two OsStringExt variants are not types. They are traits with platform-specific extensions for the OsString type.
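As a rough illustration of what such an extension trait adds: on Unix, OsStringExt lets you build an OsString from arbitrary bytes and get the same bytes back. This sketch only compiles on Unix targets, and the helper names are hypothetical:

```rust
#[cfg(unix)]
use std::ffi::OsString;
#[cfg(unix)]
use std::os::unix::ffi::OsStringExt; // Unix-only extension trait

/// Builds an OsString from raw bytes; the bytes need not be UTF-8.
#[cfg(unix)]
fn os_string_from_bytes(bytes: Vec<u8>) -> OsString {
    OsString::from_vec(bytes)
}

/// Recovers the raw bytes without any conversion or loss.
#[cfg(unix)]
fn os_string_into_bytes(s: OsString) -> Vec<u8> {
    s.into_vec()
}
```

The Windows trait of the same name offers from_wide, which takes 16-bit units instead of bytes.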

The CString type is used for interacting with C code. C strings have different guarantees than Rust strings. Specifically, they are nul-terminated and don't have any nul bytes in the interior of the string, but they are not required to be valid UTF-8.
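A small sketch of those guarantees, using only std::ffi (the helper names to_c_string and c_len are made up for this example):

```rust
use std::ffi::{CStr, CString, NulError};

/// Converts a Rust string to an owned C string.
/// Fails if the input contains an interior nul byte.
fn to_c_string(s: &str) -> Result<CString, NulError> {
    CString::new(s)
}

/// Length in bytes as C's strlen would see it (terminator excluded).
fn c_len(s: &CStr) -> usize {
    s.to_bytes().len()
}
```

CString::new appends the terminating nul itself, so the Rust side never has to write it by hand.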

On Unix platforms, CString and OsString could be combined into a single type, but on other platforms they need to be separate due to incompatible requirements (e.g. on Windows, but don't ask me about the details).

8 Likes

Have you read the docs on why OsStrings are needed? If so, do you have any specific questions about them?

10 Likes

Consider the alternative. If there was a single universal String type, it would have to be the lowest common denominator for all these usages. It would have to have an undefined encoding, because the encoding of filenames on Unix is a bit of a mess. It would have to be NUL-terminated, because that's what C APIs expect. Borrowing a substring couldn't be done without an allocation, because there has to be room to insert the NUL terminator.
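To make the substring point concrete: a &str slice just points into the existing buffer, whereas a NUL-terminated copy of the same substring needs its own allocation for the extra byte. A minimal sketch, with hypothetical helper names:

```rust
use std::ffi::{CString, NulError};

/// Borrows the first word of `s` without allocating: the result
/// points into the same buffer as `s`.
fn first_word(s: &str) -> &str {
    s.split_whitespace().next().unwrap_or("")
}

/// A NUL-terminated version of the same substring has to be a fresh
/// allocation, because the terminator cannot be written into `s`.
fn first_word_c(s: &str) -> Result<CString, NulError> {
    CString::new(first_word(s))
}
```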

Some languages do that, where the one-true-string-type is actually an equivalent of Box<dyn StringInterface> or enum String {ASCII, UCS2, UCS4, Binary}.

Rust is guilty of doubling types for owned and borrowed variants. But apart from that, there simply are many different, incompatible string-like things in the OS and libraries Rust interacts with, and Rust chose to expose that instead of hiding the differences.

19 Likes

For a bit of history and explanation of strings in Windows (and JavaScript), see WTF-8 encoding. Essentially they use a sequence of arbitrary 16-bit values, which may not be valid Unicode.

For file paths this could cause a problem if you expect Unicode strings and get back something that isn't. You can use to_string_lossy for display, but the conversion may have to replace non-Unicode sequences. This makes the result unsuitable if you need to use that name to, say, open the file.
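A sketch of what that replacement looks like in practice. The function names are hypothetical, and path_from_bytes is Unix-only because it relies on std::os::unix::ffi::OsStrExt to construct a non-UTF-8 name:

```rust
use std::path::Path;

/// Display-only name for a path. Invalid sequences are replaced with
/// U+FFFD, so the result must not be used to reopen the file.
fn display_name(path: &Path) -> String {
    match path.file_name() {
        Some(name) => name.to_string_lossy().into_owned(),
        None => String::new(),
    }
}

/// Unix-only: wraps raw bytes as a path so a non-UTF-8 name can be tried.
#[cfg(unix)]
fn path_from_bytes(bytes: &[u8]) -> &Path {
    use std::ffi::OsStr;
    use std::os::unix::ffi::OsStrExt;
    Path::new(OsStr::from_bytes(bytes))
}
```

For a name containing the byte 0xFF, the returned string has U+FFFD in its place, so the original bytes can no longer be recovered from it.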

So ideally you'd use OsStr internally and only convert to something else when you need to display or log the name. Also remember that you can turn a string literal into a Path using Path::new or into an OsStr using OsStr::new, if that helps for comparisons and such like.
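For instance, comparisons can stay entirely in OsStr terms, with no conversion to String at all. A minimal sketch (has_extension and is_named are made-up helper names):

```rust
use std::ffi::OsStr;
use std::path::Path;

/// True if the path's extension is exactly `ext` (no leading dot).
/// The comparison is exact, so "RS" does not match "rs".
fn has_extension(path: &Path, ext: &str) -> bool {
    path.extension() == Some(OsStr::new(ext))
}

/// True if the final component of the path is the given name.
fn is_named(path: &Path, name: &str) -> bool {
    path.file_name() == Some(OsStr::new(name))
}
```

Neither helper can fail on a non-UTF-8 path; such a path simply won't compare equal to the literal.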

Otherwise you could use a struct to handle whatever it is you're doing with paths (for example, it could simply store the real file path and then a Unicode display name).
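One possible shape for such a struct, as a sketch only (the NamedFile type and its methods are hypothetical):

```rust
use std::path::{Path, PathBuf};

/// Hypothetical wrapper: keeps the exact path for file operations and a
/// lossy UTF-8 name that is only meant for display or logging.
struct NamedFile {
    path: PathBuf,
    display_name: String,
}

impl NamedFile {
    fn new(path: impl Into<PathBuf>) -> Self {
        let path = path.into();
        // Computed once; invalid sequences become U+FFFD.
        let display_name = path
            .file_name()
            .map(|n| n.to_string_lossy().into_owned())
            .unwrap_or_default();
        NamedFile { path, display_name }
    }

    /// The real path: this is what should be passed to File::open.
    fn path(&self) -> &Path {
        &self.path
    }

    /// Human-readable name for messages and logs.
    fn display_name(&self) -> &str {
        &self.display_name
    }
}
```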

The possible solutions really depend on what it is you're doing. The important point is to be aware that file paths may not be valid Unicode strings and aren't necessarily portable across platforms.

3 Likes

I think it is a good thing that Rust has so many string types. It means a user can use Unicode when they need to, interface with C and other languages that use null-terminated strings through the standard library instead of writing that functionality independently, and do the same for interfacing with the OS. It does make Rust's strings a bit more complex, but IMHO the pros of Rust's handling of text far outweigh the cons.

BTW, @RobertBerglund, did you read these pages yet (in order of usefulness and complexity)?

https://doc.rust-lang.org/book/ch08-02-strings.html

https://doc.rust-lang.org/std/ffi/struct.CString.html
https://doc.rust-lang.org/std/ffi/struct.CStr.html

If you just read up to the ffi docs, you should already have a good idea of how String and &str work. The ffi stuff explains the rationale behind CString / CStr and OsString / OsStr, and the alloc and core stuff is for when you really want to dive deep into the details of memory allocation, representation and such things.

5 Likes

@L0uisc, Thank you for the answer and especially for the links. I have read through the book and some of the docs apart from the ffi docs. I would read them again since I have forgotten many of the details, and then go through the other links you provided to gain a better understanding of how Rust handles strings.
Thank you again for your help !

1 Like

@chrisd, I wasn't aware of the fact that Windows uses UTF-16 internally. Actually, I had never thought about it and had no idea WTF-8 was even a thing. Thank you for this explanation and little piece of interesting history !

@BurntSushi, thanks for pointing that out, I should read the documentation more carefully, I actually scrolled down past that part.

@smarnach, thank you for your brief and to the point explanation, especially for mentioning the limitations of using Rust standard string types for filenames.

@kornel, thanks for the concise example provided of an alternative string implementation and its drawbacks.

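To be clear, Windows effectively treats strings as UCS-2-style sequences of 16-bit values rather than strict UTF-16. If it enforced well-formed UTF-16 then that wouldn't be so bad, because it'd at least be valid Unicode. Instead, any 16-bit value is allowed, including unpaired surrogates, which are not valid UTF-16.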
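The difference is easy to see from Rust on any platform, since String can only be built from well-formed UTF-16. A small sketch with hypothetical helper names; the sample values are just an illustration of a lone surrogate:

```rust
/// Strict decoding: fails if the input contains an unpaired surrogate.
fn decode_strict(units: &[u16]) -> Option<String> {
    String::from_utf16(units).ok()
}

/// Lossy decoding: each unpaired surrogate is replaced with U+FFFD.
fn decode_lossy(units: &[u16]) -> String {
    String::from_utf16_lossy(units)
}
```

A sequence such as [0x0068, 0xD800, 0x0069] is a perfectly legal Windows file name but is rejected by decode_strict, which is why OsString on Windows cannot simply be a String.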

2 Likes

Right, thanks for the clarification.

This is a bit of a pattern with Rust with respect to beginners. There's a complexity inherent in the system you're programming against. You never knew about it, or were conscious of it, when using other languages. Things "just worked" there. Then you do the same thing in Rust and you have to jump through ten hoops. You are surprised and hate Rust for it, even though it just brought to the surface what was there all along. When you finally see the reality, you come to appreciate, and perhaps feel thankful for, all of Rust's finickiness.

I know this because I went through the same cycle myself :grinning:.

14 Likes

An informative and language-agnostic read on the topic by Joel Spolsky:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

2 Likes

There is yet another option :slight_smile: @BurntSushi has a crate called bstr, whose BString type is a byte string with String-like methods. The idea is that it is conventionally UTF-8, but not necessarily (rather like Go strings).

1 Like
