Why isn't `Cow<str>` the default string type?

After reading the article Why your first FizzBuzz implementation may not work and seeing many others having trouble with strings in Rust, I keep wondering why (something like) Cow<str> is not used as the default string type in the language. It should work equally well for owned strings, static strings, and many other cases.

Are there any (common) cases where using it would be a bad idea?

What would be the pros and cons of such an approach? Has anything like this been tried before?


Cow<str> is really only needed if you need to produce an owned String at some point in your code and you also need flexibility in the type of input you accept. If you pass a String in as the Cow input, no allocation is needed, but if you pass in a &str, it will eventually have to be converted into that String.

Additionally, Cow is an enum, so there is a tiny matching cost that will occur every time the value is accessed.
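For reference, this is roughly how Cow is defined in std::borrow (simplified here; the real definition is spelled slightly differently but is equivalent), which is why every access has to branch on the variant:

// Simplified sketch of std::borrow::Cow. For Cow<str>,
// B = str, so Borrowed holds a &str and Owned holds a String.
pub enum Cow<'a, B: ?Sized + ToOwned> {
    Borrowed(&'a B),
    Owned(<B as ToOwned>::Owned),
}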

Consider the following function:

fn print_stuff(input: &str) {
    println!("Check out this awesome printing: {}", input);
}

Why should I force my caller to potentially pass in an owned string, when it is 100% not needed for my code? In many cases owned strings are not needed, which is why Cow is not the default.

In fact, more often than not (at least for me), &str is what gets used, so if you would like the flexibility of taking either a String or a &str, you might want to try AsRef<str>, which does not carry the cost of allocating an owned String.
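As a minimal sketch of that approach (reusing the print_stuff name from above), anything string-like can be accepted via AsRef<str>, since both String and &str implement it:

// Accept anything string-like via AsRef<str>; no allocation is forced.
fn print_stuff(input: impl AsRef<str>) {
    let s: &str = input.as_ref(); // cheap borrow, never an allocation
    println!("Check out this awesome printing: {}", s);
}

fn main() {
    print_stuff("a string literal");              // &str
    print_stuff(String::from("an owned String")); // String
}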


In their RustConf 2016 keynote, @aturon and @nikomatsakis proposed making String into a Cow-like type that could be backed by either a static string reference or a heap-allocated string, and adding a trivial coercion from string literals to String:

This thread has some discussion of that idea.


As far as I can tell, a Cow can always be used as a borrow. It only converts into an owned string if you explicitly tell it to, using .into_owned() or .to_mut(). Anything else you do with it leaves it the way it is. If you pass a Cow<str> into (the analogue of) your function, you would either move it, clone it with .clone() (which keeps borrowed data borrowed and copies owned strings), or pass it by reference, which allows auto-deref (and always borrows).
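A small sketch of that behaviour:

use std::borrow::Cow;

fn print_len(s: &str) {
    println!("{:?} has length {}", s, s.len());
}

fn main() {
    let borrowed: Cow<str> = Cow::Borrowed("hello");
    let owned: Cow<str> = Cow::Owned(String::from("world"));

    // Deref lets either variant be used as a &str, with no conversion.
    print_len(&borrowed);
    print_len(&owned);

    // Only an explicit call turns the borrowed variant into a String.
    let s: String = borrowed.into_owned(); // allocates here
    let t: String = owned.into_owned();    // reuses the existing String
    println!("{} {}", s, t);
}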

But, yes, I get that there's a cost to using an enum. It's very small, but it's not zero-cost. Also, this approach might cause more confusion instead of less, which is always a bummer. There's also the cost of being more implicit and less explicit: we can't tell statically whether the string is borrowed or owned.

There's a really good video on YouTube where someone runs a couple of benchmarks showing the time difference between using a &str and a String. You gain something like 50ns (i.e. 0.000000050 seconds) by avoiding a couple of allocations, yet the added complexity to your code isn't really worth the minuscule amount of time saved.

Once I learned that a String is (conceptually) a mutable array of char allocated on the heap and a &str is a string slice, most of the issues between String and &str disappeared. I don't think I've ever seen Cow<str> used anywhere in production or on GitHub.

I'd say that given the extra cognitive load, especially for beginners, you don't really gain enough performance-wise to make it worth it. Additionally, knowing exactly where in memory your string is and whether it's borrowed or owned makes things a lot easier to reason about.

Conceptually maybe, but not in reality! It's a Vec<u8>, not a Vec<char>.
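For example (the string is arbitrary, the counts are what matter):

fn main() {
    let s = String::from("héllo");
    // A String stores UTF-8 bytes, not an array of chars:
    println!("{} bytes", s.len());            // 6, since "é" is 2 bytes in UTF-8
    println!("{} chars", s.chars().count());  // 5
}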


@steveklabnik, good point, although I doubt you'd want to explain the nuances between bytes and UTF-8 and characters and code points to someone who's only just starting off :wink:

My argument against using a Cow<str> as the default string type is that the extra cognitive load would make things harder than they need to be. It would also increase Rust's learning curve instead of making it easier.

Imagine trying to explain how a Cow<str> works in the Strings section of The Book, telling people that their string could be either a slice into some other string or a heap-allocated vector of bytes, depending on whether something has happened in the past. You'd then have one string type that is represented completely differently in memory depending on its history. Compare that with having two types which each have exactly one job, where you choose between them depending on the scenario.

You are right that it only turns into an owned string when you explicitly tell it to. But, to me, the huge difference between Cow and str is that with Cow you are strongly implying that at some point in your code you will call to_owned; otherwise you wouldn't be using Cow in the first place. With str, on the other hand, calling to_owned is by no means a foregone conclusion.


I would say legitimate uses of Cow are limited to cases where you are (a) returning an object and (b) unable to determine statically whether the object will be a reference or something created on the fly. The Cow exists mostly to anchor the temporary that gets created, if it gets created.

If you can make that determination statically, then returning a concrete String or &str is preferable.

OTOH if you are not doing (a) then you can accept Borrow<str> and let the caller decide.
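For instance, a minimal sketch of letting the caller decide via Borrow<str> (shout is just an illustrative name):

use std::borrow::Borrow;

// Accept String, &str, or anything else that borrows as a str.
fn shout<S: Borrow<str>>(s: S) {
    println!("{}!", s.borrow().to_uppercase());
}

fn main() {
    shout("hello");               // &str
    shout(String::from("world")); // String
}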

So in some sense, Cow is strictly less powerful than either &str, String, or Borrow<str>, because it has to accommodate both possibilities and can only offer the intersection of both types' operations, and it lacks static information about which possibility it is (the latter is what makes it a nonzero cost).


A good example of that is if you have a function like so:

fn replace(s: &str, x: char, y: char) -> Cow<str> {...}

This hypothetical function would try to replace all occurrences of x in s with y. But it's possible that s doesn't contain any occurrences of x, so there's nothing to change (and thus no reason to allocate a fresh String): you can just echo the caller's s back directly.
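One possible body for that sketch, leaning on std's str::contains and str::replace to do the scanning (the elided lifetime ties the Cow to s):

use std::borrow::Cow;

fn replace(s: &str, x: char, y: char) -> Cow<str> {
    if s.contains(x) {
        // At least one occurrence: we have to build a fresh String.
        Cow::Owned(s.replace(x, &y.to_string()))
    } else {
        // Nothing to replace: hand the caller's slice back untouched.
        Cow::Borrowed(s)
    }
}

fn main() {
    assert!(matches!(replace("hello", 'z', 'q'), Cow::Borrowed(_)));
    assert!(matches!(replace("hello", 'l', 'r'), Cow::Owned(_)));
}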

With the above in mind, it's slightly surprising that the various str methods that perform replacement return String. I'm guessing the assumption is that a replacement is likely to occur.


How can someone tell? I have brute-force code which has more misses than hits.


This function isn't just hypothetical; String::from_utf8_lossy does exactly this:

This function returns a Cow<'a, str>. If our byte slice is invalid UTF-8, then we need to insert the replacement characters, which will change the size of the string, and hence, require a String. But if it's already valid UTF-8, we don't need a new allocation. This return type allows us to handle both cases.
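A quick sketch of both paths:

use std::borrow::Cow;

fn main() {
    // Valid UTF-8: the bytes are borrowed as-is, no allocation.
    let good = String::from_utf8_lossy(b"hello");
    assert!(matches!(good, Cow::Borrowed(_)));

    // Invalid UTF-8: the bad byte becomes U+FFFD, which needs a new String.
    let bad = String::from_utf8_lossy(b"hel\xFFlo");
    assert!(matches!(bad, Cow::Owned(_)));
    assert_eq!(bad, "hel\u{FFFD}lo");
}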


Thanks Steve. Any idea why the str replacement functions don't do the same?

I'm not sure; might just be an oversight.

The way Cow<str>s interact with Deref and lifetimes can be pretty annoying IMO and would be a big enough deal-breaker to make them not the default type.

Returning String is simpler, which seems like a good call to me. replace expects to replace something, while from_utf8_lossy expects to be able to pass the data through unchanged.

What I would question is why a static method on String returns a Cow. Why not put the method on Cow instead, for example?

That's my guess as well, but see @StefanoD's comment which I can see as being reasonable in some circumstances:

Consider also the str::to_uppercase/to_lowercase methods: they also return a String, but there are use cases where someone may want to "normalize" the casing blindly, so to speak, applying it to all arguments even if most are already in the proper case.
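That is, today this builds a fresh String even in the no-op case:

fn main() {
    let already = "HELLO";
    // to_uppercase returns a new String even when nothing changes:
    let upper: String = already.to_uppercase();
    assert_eq!(upper, already);
}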
