Over the last few days I have worked a bit on the chapter on option types, see Option types - Rust for C-Programmers. One of the unresolved questions is whether we should use Option types for types that have a natural way to indicate the absence of a value. The most prominent example is the string type, where the empty string exists. Of course we could use Option, but does it actually make sense? The books I read did not discuss that question, and with Google I did not find much about the topic. And I myself have rarely used Option types at all -- while Nim has option types, I think I never used them there.
Is the empty string "absence of a value" or "presence of an empty value"?
Is the bottle half full or half empty?
I just remembered a funny fact: Rust's option types are the reason that I refused for nearly 10 years to even try Rust. When I started to create the GTK bindings for Nim in 2015, I sometimes studied gtk-rs example code and thought: No, a language with an unwrap() call in every second line of code is a language I will never use. But of course other Rust code needs far less code to handle option values, and I am slowly beginning to recognize the benefits of option types. It took me at least one year.
Like many other questions, it depends on context: In your application, do you generally want to be handling the empty string differently from a populated one? If so, you probably want Option<String>. If, on the other hand, the empty string isn't going to require special handling very often, you'll probably want just String instead.
Yes, that seems to make some sense, thanks. I continued thinking about this topic myself, and I thought I had found a good example: If I have a struct Person, then the name field should be an option, because all persons have names, so a missing name entry is an error. But then, what about newborn babies and those "John Doe" dead people from TV, found somewhere but impossible to identify? At least now I understand why this topic is rarely discussed: it is a bit complicated and depends strongly on the actual use case.
There are two answers to this:
1. Is missing and empty different?
I would say that this depends on whether there is a meaningful distinction in your domain between "the value is missing" and "there is a value, and it happens to be an empty one". (Similarly, you might have a difference between "no value" and 0 for an integer.)
2. Do we want the compiler to enforce handling the empty case across the code base?
Another thing is that an Option forces you to think about and handle the empty case separately. You may or may not want this.
A case (not for strings) where we do want this is pointers/references. Implicit null pointers turn out to be error-prone. It is better to have Option<&T> than to let pointers be implicitly null and forget to handle that case (this is quite a common error in C and C++).
The analog for that would be Option<NonEmptyString>, though I don't know if there exists such a string type that guarantees in the type system that it is always non-empty. You could make your own wrapper type that does this though. EDIT: With a quick search I found several crates providing non-empty strings.
This is all about making the compiler catch errors at compile time instead of runtime and making invalid states unrepresentable (this is a common design pattern in Rust, see for example this blog: Make Illegal States Unrepresentable | corrode Rust Consulting for an intro). You have to decide for your specific domain and problem space what properties are useful and worth it to check in the type system.
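The wrapper type mentioned above could look roughly like this. It is only a minimal sketch: the name NonEmptyString and its methods are made up for illustration and are not taken from any of the existing crates.

```rust
/// Hypothetical wrapper type: a `String` that is guaranteed to be non-empty.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct NonEmptyString(String);

impl NonEmptyString {
    /// The only way to construct the type, so the emptiness check
    /// happens in exactly one place.
    pub fn new(s: String) -> Option<Self> {
        if s.is_empty() {
            None
        } else {
            Some(Self(s))
        }
    }

    pub fn as_str(&self) -> &str {
        &self.0
    }
}

fn main() {
    // An empty input cannot become a `NonEmptyString`.
    let nickname: Option<NonEmptyString> = NonEmptyString::new(String::new());
    assert!(nickname.is_none());

    // A populated input can, and the rest of the code may rely on that.
    let nickname = NonEmptyString::new(String::from("ferris"));
    assert_eq!(nickname.as_ref().map(NonEmptyString::as_str), Some("ferris"));
}
```

With such a type, Option<NonEmptyString> has exactly two meanings, "missing" and "present with content", and the ambiguous third state (present but empty) cannot be constructed.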
Maybe. Or maybe you want creating a Person with a missing name to be an error instead, using just String for the field. In that case, the rest of your program can safely assume that every Person has a present-and-correct name because the error case got handled elsewhere.
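A minimal sketch of that idea, assuming a fallible constructor. The PersonError type and the decision to trim whitespace before checking are my own assumptions for illustration, not something the thread prescribes.

```rust
#[derive(Debug)]
struct Person {
    name: String,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum PersonError {
    MissingName,
}

impl Person {
    /// Rejects a missing name once, at construction time.
    /// Assumption: a whitespace-only name also counts as missing.
    fn new(name: &str) -> Result<Self, PersonError> {
        let name = name.trim();
        if name.is_empty() {
            return Err(PersonError::MissingName);
        }
        Ok(Person { name: name.to_string() })
    }

    /// Callers get a plain `&str`; there is no empty case left to handle.
    fn name(&self) -> &str {
        &self.name
    }
}

fn main() {
    match Person::new("Ada") {
        Ok(person) => println!("created {}", person.name()),
        Err(err) => println!("rejected: {err:?}"),
    }
    assert_eq!(Person::new("   ").unwrap_err(), PersonError::MissingName);
}
```

Whether the newborn or the unidentified person should be an Err here or a None in an Option<String> field is exactly the domain decision discussed above.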
One notable detail:
use std::mem::size_of;
fn main() {
assert_eq!(size_of::<Option<String>>(), size_of::<String>());
}
So from a memory perspective there is no difference, thanks to the fantastic optimizing Rust compiler.
Thanks for reminding me. I was only aware that Option does not consume additional memory for reference types and for some simple enum types. The core of these optimizations is always that there exists at least one unused bit pattern that the compiler can use as None. For references, this is binary zero, as Rust references always point to an actual entity. Thinking about it again, this implies that Rust never stores an entity at memory address zero, a point most books just assume but do not really discuss, and I am not even sure if this is guaranteed for all embedded systems. For strings, I assume that Rust can use the case in which the pointer to the actual heap-allocated text is zero for None. But the case when capacity or length is zero might be used as well. I am not sure if the Rust compiler does these optimizations alone, or if the authors of Rust's standard library gave some help.
An interesting case is where we would use subranges, e.g. a weekday with int range 0 to 6. Rust does not support subrange types, so can the compiler optimize Option<Weekday>?
For the niche optimization, it needs to know that some bit pattern is illegal for the type. So:
// No niche optimizations available
struct Weekday(u8);
// Has one niche; `Option` optimization available
struct Weekday(NonZero<u8>);
// Has lots of niches
// `Option` optimization available, as well as for `enums` with multiple unit variants
enum Weekday { Mon, Tue, Wed, Thu, Fri, Sat, Sun }
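To see the three variants side by side, one can compare sizes directly. This is just a sketch: the type names PlainDay and NonZeroDay are invented so that all three definitions fit into one file, and the exact sizes are what current compilers produce rather than a formal layout guarantee for user-defined types.

```rust
use std::mem::size_of;
use std::num::NonZero;

// No niche: every bit pattern of the u8 is a valid value.
#[allow(dead_code)]
struct PlainDay(u8);

// One niche: the all-zero bit pattern is never a valid value.
#[allow(dead_code)]
struct NonZeroDay(NonZero<u8>);

// Many niches: only the discriminants 0..=6 are valid.
#[allow(dead_code)]
enum Weekday { Mon, Tue, Wed, Thu, Fri, Sat, Sun }

fn main() {
    // Without a niche, `Option` needs extra space for its discriminant.
    assert!(size_of::<Option<PlainDay>>() > size_of::<PlainDay>());

    // With a niche, `None` can reuse an otherwise invalid bit pattern.
    assert_eq!(size_of::<Option<NonZeroDay>>(), size_of::<NonZeroDay>());
    assert_eq!(size_of::<Option<Weekday>>(), size_of::<Weekday>());

    println!("Option<PlainDay>:   {} byte(s)", size_of::<Option<PlainDay>>());
    println!("Option<NonZeroDay>: {} byte(s)", size_of::<Option<NonZeroDay>>());
    println!("Option<Weekday>:    {} byte(s)", size_of::<Option<Weekday>>());
}
```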
That’s completely true.
However, there’s a catch: A plain String has no restrictions (beyond UTF-8) on its content. It can be empty, consist of billions of spaces, or contain only control characters below ASCII code 32.
So the semantic benefit of an Option is somewhat limited compared to the potential content of a String.
From a usage perspective:
if let Some(s) = &some_option_string {
...
}
...is only a minor difference from:
if some_normal_string.len() > 0 {
...
}
(Side note: Personally, I don’t prefer !s.is_empty() over s.len() > 0, as Clippy suggests. The small ! character can be easily overlooked, and it goes against the widely accepted recommendation to avoid negative conditions in conditional expressions.)
However, something like this:
option_string.get_or_insert_with(String::new).push_str("appending to string");
...is, in my opinion, somewhat annoying.
In most cases, where there’s no need to differentiate between an empty string and a missing one, I personally just use a String for the sake of simpler code.
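If some API still hands you an Option<String>, converting between the two representations is short. A small sketch, where the helper names collapse and expand are made up for this example:

```rust
/// Treats "missing" as "empty": `None` becomes the empty string.
fn collapse(value: Option<String>) -> String {
    value.unwrap_or_default()
}

/// Treats "empty" as "missing": the empty string becomes `None`.
fn expand(value: String) -> Option<String> {
    Some(value).filter(|s| !s.is_empty())
}

fn main() {
    assert_eq!(collapse(None), "");
    assert_eq!(collapse(Some(String::from("abc"))), "abc");

    assert_eq!(expand(String::new()), None);
    assert_eq!(expand(String::from("abc")), Some(String::from("abc")));
}
```

Note that the round trip loses information: after collapse, a Some("") and a None can no longer be told apart, which is fine only when the distinction really does not matter.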
Thanks again, I think I heard somewhere of these NonZero types, but did not read about it in the official Rust book, and I think I missed it even in "Programming Rust" by Jim Blandy.
That is a fair point, and if this is something you care about (you probably should in many cases) a newtype around string that does validation when you construct the type is the way to go. Then you only need to validate your user names, emails or whatever it might be in one place.
Even better, if it is something you can parse, prefer doing that. E.g. parse a GUID and represent it as a 128 bit integer in your program. Don't handle it as a string except at the very edge of your system when ingesting/outputting it.
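As a rough sketch of the GUID example: the Guid type below is hypothetical and only accepts the common 8-4-4-4-12 hex form, so treat it as an illustration of "parse at the edge", not as a complete GUID implementation.

```rust
use std::fmt;

/// Hypothetical newtype: a GUID stored as a 128-bit integer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Guid(u128);

impl Guid {
    /// Parses text such as "123e4567-e89b-12d3-a456-426614174000".
    fn parse(text: &str) -> Option<Self> {
        let groups: Vec<&str> = text.split('-').collect();
        let lengths: Vec<usize> = groups.iter().map(|g| g.len()).collect();
        if lengths != [8, 4, 4, 4, 12] {
            return None;
        }
        let hex: String = groups.concat();
        // `from_str_radix` would also accept a leading '+', so check first.
        if !hex.bytes().all(|b| b.is_ascii_hexdigit()) {
            return None;
        }
        u128::from_str_radix(&hex, 16).ok().map(Guid)
    }
}

/// Converts back to text only when leaving the system.
impl fmt::Display for Guid {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let h = format!("{:032x}", self.0);
        write!(f, "{}-{}-{}-{}-{}", &h[0..8], &h[8..12], &h[12..16], &h[16..20], &h[20..32])
    }
}

fn main() {
    let text = "123e4567-e89b-12d3-a456-426614174000";
    let guid = Guid::parse(text).expect("valid GUID");
    assert_eq!(guid.to_string(), text);
    assert!(Guid::parse("not-a-guid").is_none());
}
```

Inside the program a Guid is then just a Copy integer that is cheap to compare and hash, and no code path has to ask whether the string is well-formed.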
Yes, those are all interesting points -- a UTF-8 string can potentially contain invalid content, which an Option type cannot prevent. A side note -- I think Option<char> is also optimized by the compiler to the same size as char (4 bytes), because UTF-8 has bit patterns that are not allowed. As for s.is_empty() -- in Nim we had no isEmpty() method, only len(), so when I first saw s.is_empty() in Rust, I wondered if it is really useful. I am quite sure the generated machine code is identical.
If you specifically mean invalid UTF-8, a &str or String cannot contain that. All safe methods of construction validate that the data is valid UTF-8. That follows the pattern that I and others mentioned above, of the constructor ensuring the type is valid and other code being able to depend on it.
It can of course contain invalid data for your program (e.g. you expect an email address, but someone put in their physical address).
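A quick way to see the validation in action is String::from_utf8, which returns a Result. The helper name decode is only there for this sketch:

```rust
/// Small helper for this example: `None` if the bytes are not valid UTF-8.
fn decode(bytes: Vec<u8>) -> Option<String> {
    String::from_utf8(bytes).ok()
}

fn main() {
    // Plain ASCII is valid UTF-8.
    assert_eq!(decode(vec![b'h', b'i']), Some(String::from("hi")));

    // The byte 0xFF can never appear in valid UTF-8, so construction fails.
    assert_eq!(decode(vec![b'h', b'i', 0xFF]), None);
}
```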
A char in Rust is a Unicode scalar value, i.e. a code point that is not a surrogate (not encoded in UTF-anything). But yes, it has invalid bit patterns that can be used as niches.
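Both points can be checked with a few assertions. The size comparison reflects what current compilers do; as far as I know it is not a documented layout guarantee for char.

```rust
use std::mem::size_of;

fn main() {
    // An ordinary code point converts fine.
    assert_eq!(char::from_u32(0x41), Some('A'));

    // Surrogates and values above 0x10FFFF are not valid `char` values,
    // so these 32-bit patterns are free to serve as niches.
    assert_eq!(char::from_u32(0xD800), None);
    assert_eq!(char::from_u32(0x11_0000), None);

    // A `char` is always 4 bytes, and `None` fits into an unused pattern.
    assert_eq!(size_of::<char>(), 4);
    assert_eq!(size_of::<Option<char>>(), size_of::<char>());
}
```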
I don't know off the top of my head if an Option<String> has niches; String is "just" a wrapper around Vec<u8> that adds UTF-8 validation and utility methods.