Can anyone explain Cow to me?

Hi, as per the title: I've read about Cow, and of course I've checked the official documentation and examples, but I cannot get my head around how it works and, mostly, why and when it's a useful pointer. I've also read a blog post where a guy got a big performance boost on some code by switching from Strings to Cow-ed strings.

Can anyone ELI5 me when I should think "Yeah, this is the right place to use Cow"? Thanks!

I assume you mean clone-on-write.
Cow is actually an enum. It can hold either a borrowed (immutable) reference or an owned, mutable clone of the same data. When you ask for mutable access (for example through to_mut()), the value of the enum changes from the borrowed reference to the owned clone.

The fact that the same pointer transparently stores both versions means that you can keep using the same pointer without knowing whether a clone occurred or not.
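To make the switch from borrowed to owned concrete, here is a minimal sketch. The helpers is_borrowed and ensure_suffix are hypothetical names made up for this illustration; only Cow and to_mut() come from the standard library.

```rust
use std::borrow::Cow;

// Hypothetical helper: reports which variant a Cow currently holds.
fn is_borrowed(value: &Cow<'_, str>) -> bool {
    matches!(value, Cow::Borrowed(_))
}

// Hypothetical helper: appends `suffix` only when it is missing.
fn ensure_suffix<'a>(input: &'a str, suffix: &str) -> Cow<'a, str> {
    let mut value = Cow::Borrowed(input);
    if !value.ends_with(suffix) {
        // to_mut() clones the borrowed data into a String (once)
        // and hands back a mutable reference to it.
        value.to_mut().push_str(suffix);
    }
    value
}

fn main() {
    let untouched = ensure_suffix("file.txt", ".txt");
    let changed = ensure_suffix("file", ".txt");
    println!("{} borrowed: {}", untouched, is_borrowed(&untouched));
    println!("{} borrowed: {}", changed, is_borrowed(&changed));
}
```

The caller gets a Cow<str> either way and can use it like a &str; an allocation only happens on the path that actually had to change the text.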

The copy-on-write pattern saves you unnecessary copies, increasing performance in some situations. Take for example the btrfs file system, which allows you to clone files with almost no cost. The clone just points to the same blocks of data as the old file. If you change the clone, only then new blocks of data must be allocated on disk, and only for the changed part of the file. This saves disk space and time.

The Cow object makes this pattern transparent to use.

Is that helping?


A classic example where this is used is JSON parsing. Usually, when parsing a JSON object containing a string, you can just return a slice into the original JSON. But since it is valid for JSON to contain escaped characters, e.g. "abc\ntest", you cannot always return a slice, because the string would have to be unescaped first. A Cow can be used to return a slice when possible, and a String when there are escaped characters to handle.
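A rough sketch of that idea, not taken from any real JSON library: unescape is a hypothetical function that only understands a few escapes (\n, \t, \" and \\) and ignores the rest of the JSON grammar, such as \uXXXX.

```rust
use std::borrow::Cow;

// Hypothetical, simplified unescaper for the contents of a JSON string.
fn unescape(raw: &str) -> Cow<'_, str> {
    // Fast path: nothing to unescape, so just hand back the slice.
    if !raw.contains('\\') {
        return Cow::Borrowed(raw);
    }

    // Slow path: build a new String with the escapes resolved.
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            // Covers \" and \\; any other escaped char is passed through.
            Some(other) => out.push(other),
            // A trailing backslash is kept as-is in this sketch.
            None => out.push('\\'),
        }
    }
    Cow::Owned(out)
}

fn main() {
    println!("{:?}", unescape("abc"));
    println!("{:?}", unescape(r"abc\ntest"));
}
```

Strings without a backslash never allocate; only the escaped ones pay for a String.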


Two examples:

  • A type that unifies String with &'static string literals

    Imagine having some struct with a name field, which is created many times, with some known default name: "default_name". And you want users to dynamically be able to change that name.

    • The dynamic part of the name requires that the type of the name field be something that can be heap-allocated. For instance, a String

    • The fact that many such structs are created with a default name (or with one of a set of known default names) means that you would like to avoid the unnecessary heap allocation of "default_name".to_string() that would be required to type-check against String (cf. the first point)

    In that case, having name: Cow<'static, str> allows the field to hold both Strings and &'static str, thanks to an O(1) conversion with Cow::from or the resulting .into():

    use std::borrow::Cow;
    
    struct Struct {
        name: Cow<'static, str>,
        // ...
    }
    
    impl Default for Struct {
        #[inline]
        fn default ()
          -> Self
        {
            Self {
                name: "default_name".into(), // no heap-allocation!
                // ...
            }
        }
    }
    
    impl Struct {
        pub
        fn set_name (self: &'_ mut Self, name: String) // or even better: `impl Into<Cow<'static, str>>`
        {
            self.name = name.into(); // no copy
        }
    }
    
    • Drawback: reading the string of a Cow<'_, str> incurs a branch ("which variant of the enum is this?")
  • perform some copy-less conversion for most cases, while keeping the option to copy and mutate as needed for some cases (c.f., @alice's great example about JSON parsing).

    An example of it in the standard library is: String::from_utf8_lossy()

    which is like str::from_utf8, i.e., it checks and on success upgrades a slice of bytes to a valid UTF-8 slice of bytes (a.k.a. str), except that on failure, instead of returning an Err that needs to be handled, it just replaces the input bad UTF-8 bytes with the character, which only then requires that the bytes be copied into the heap. Returning Cow<'_, str> is a way to unify the return type for both cases (returning String would mean the the bytes are always unconditionally heap-copied, and returning &str is not possible when the input slice is not guaranteed to be valid UTF-8).
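A small sketch of that behaviour. String::from_utf8_lossy is the real standard library function; lossy_was_copied is a hypothetical helper added only to make the two cases visible.

```rust
use std::borrow::Cow;

// Hypothetical helper: true when from_utf8_lossy had to allocate a String.
fn lossy_was_copied(bytes: &[u8]) -> bool {
    matches!(String::from_utf8_lossy(bytes), Cow::Owned(_))
}

fn main() {
    // Valid UTF-8: the result borrows from the input bytes.
    let valid = String::from_utf8_lossy(b"hello");
    // 0xFF can never appear in valid UTF-8, so the result is a new String.
    let invalid = String::from_utf8_lossy(b"hel\xFFlo");

    println!("{:?} copied: {}", valid, lossy_was_copied(b"hello"));
    println!("{:?} copied: {}", invalid, lossy_was_copied(b"hel\xFFlo"));
}
```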


Just a side question, since the documentation mentions that Cow is a smart pointer: if I recall correctly, smart pointers are not zero cost.

use std::{borrow::Cow, mem};

fn main() {
    dbg!(mem::size_of::<Cow<str>>()); // 32
    dbg!(mem::size_of_val::<str>(&"")); // 0
    dbg!(mem::size_of_val::<String>(&"".to_string())); // 24
    dbg!(mem::size_of_val::<Cow<str>>(&Cow::Borrowed::<str>(""))); // 32
    dbg!(mem::size_of_val::<Cow<str>>(&Cow::Owned::<str>(
        "".to_string()
    ))); // 32
    dbg!(mem::size_of_val::<str>("a")); // 1
    dbg!(mem::size_of_val::<String>(&"a".to_string())); // 24
    dbg!(mem::size_of_val::<Cow<str>>(&Cow::Borrowed::<str>("a"))); // 32
    dbg!(mem::size_of_val::<Cow<str>>(&Cow::Owned::<str>(
        "a".to_string()
    ))); // 32
}

Link to playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=fa5f3cfd1d61c064f62029374b8266fe

Looking at the above code, does that mean that both Cow::Borrowed and Cow::Owned require the same storage size (32 from what I tested, it may differ) even though the value is borrowed?

Correct me if I am wrong.

Cow::Borrowed and Cow::Owned are enum variants of the same type, so since they are the same type, they take up the same amount of space. And no, using a Cow is not zero cost, because accessing it requires an if to check whether it is borrowed or owned. However this would usually be much cheaper than copying the slice every time, so it's worth it.
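To show where that "if" comes from, here is a hand-written sketch of roughly what reading a Cow<str> involves. as_str is a hypothetical function for illustration, not the standard library's actual Deref implementation.

```rust
use std::borrow::Cow;

// Hypothetical stand-in for what dereferencing a Cow<str> has to do:
// check the variant, then return the string slice inside it.
fn as_str<'a>(value: &'a Cow<'_, str>) -> &'a str {
    match value {
        Cow::Borrowed(s) => *s,       // already a &str
        Cow::Owned(s) => s.as_str(),  // borrow from the owned String
    }
}

fn main() {
    let borrowed: Cow<'_, str> = Cow::Borrowed("a");
    let owned: Cow<'_, str> = Cow::Owned("a".to_string());
    println!("{} {}", as_str(&borrowed), as_str(&owned));
}
```

The match is the only extra work per access, which is normally far cheaper than cloning the data up front.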
