Practical differences between Cow- and Borrow-based implementations of string newtypes

Hi folks. I'd like to better understand the difference between what seem to be two valid approaches to parsing a newtype from either a &str or a String. This type does not need to support mutation.

Consider a Username type that may be used in situations where we know a &str will be valid for the lifetime of the Username (such as a Username that is parsed from an HTTP request and does not outlive the request), and situations where the Username outlives the value it's parsed from (for example, when retrieving a raw username from a database).

The first approach uses Cow. This seems readable, with few generic type parameters flying around, and it fits the idiomatic pattern of "a str if possible, but a String if you need it":

use std::borrow::Cow;

// ParseUsernameError and the two length constants are defined elsewhere.
pub struct CowUsername<'a>(Cow<'a, str>);

impl<'a> TryFrom<Cow<'a, str>> for CowUsername<'a> {
    type Error = ParseUsernameError;

    fn try_from(raw: Cow<'a, str>) -> Result<Self, Self::Error> {
        CowUsername::parse(raw)
    }
}

impl<'a> CowUsername<'a> {
    fn parse(raw: Cow<'a, str>) -> Result<Self, ParseUsernameError> {
        if raw.len() < USERNAME_MIN_LENGTH {
            return Err(ParseUsernameError::TooShort);
        }
        if raw.len() > USERNAME_MAX_LENGTH {
            return Err(ParseUsernameError::TooLong);
        }

        Ok(CowUsername(raw))
    }
}
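
For context, call sites in the two situations would look roughly like this (a sketch; handle and load are just illustrative names):

fn handle(request_param: &str) -> Result<(), ParseUsernameError> {
    // Borrowed: the Username lives no longer than the request data.
    let _username = CowUsername::try_from(Cow::Borrowed(request_param))?;
    Ok(())
}

fn load(db_value: String) -> Result<CowUsername<'static>, ParseUsernameError> {
    // Owned: the Username outlives the row it was read from.
    CowUsername::try_from(Cow::Owned(db_value))
}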

The second take is generic over some T: Borrow<str>, which includes both String and &str. This is a little less readable, and it has the disadvantage of requiring separate TryFrom implementations for String and &str: a single generic impl would conflict with the blanket impl<T, U: Into<T>> TryFrom<U> for T in core.

use std::borrow::Borrow;

pub struct BorrowUsername<T: Borrow<str>>(T);

impl<'a> TryFrom<&'a str> for BorrowUsername<&'a str> {
    type Error = ParseUsernameError;

    fn try_from(raw: &'a str) -> Result<Self, Self::Error> {
        Self::parse(raw)
    }
}

impl TryFrom<String> for BorrowUsername<String> {
    type Error = ParseUsernameError;

    fn try_from(raw: String) -> Result<Self, Self::Error> {
        Self::parse(raw)
    }
}

impl<T: Borrow<str>> BorrowUsername<T> {
    /// Parse a string-like object into a Username.
    fn parse(raw: T) -> Result<Self, ParseUsernameError> {
        let raw_str = raw.borrow();
        if raw_str.len() < USERNAME_MIN_LENGTH {
            return Err(ParseUsernameError::TooShort);
        }
        if raw_str.len() > USERNAME_MAX_LENGTH {
            return Err(ParseUsernameError::TooLong);
        }

        Ok(BorrowUsername(raw))
    }

    /// Create a Username without checking the validity of the input.
    pub fn new_unchecked(raw: T) -> Self {
        BorrowUsername(raw)
    }
}
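
And a quick usage sketch: the concrete type parameter is inferred from the input, and each line goes through a different TryFrom impl.

fn demo() -> Result<(), ParseUsernameError> {
    let _borrowed: BorrowUsername<&str> = BorrowUsername::try_from("alice")?;
    let _owned: BorrowUsername<String> = BorrowUsername::try_from(String::from("alice"))?;
    Ok(())
}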

Are there any other important differences to note? Any (more) clear reasons to prefer one over the other?

You can transform a CowUsername (say, to normalize some Unicode shenanigans) and the username can remain borrowed when normalization is not needed. With BorrowUsername the underlying type is exposed, so if you want to support the same normalization, you'd have to switch to an owned type for all usernames, even those that don't need normalizing.
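
Roughly like this, with ASCII lowercasing standing in for real normalization (normalized is a made-up method name):

impl<'a> CowUsername<'a> {
    // Only allocates when the input actually changes; a borrowed
    // Cow stays borrowed otherwise.
    pub fn normalized(self) -> CowUsername<'a> {
        if self.0.chars().any(|c| c.is_uppercase()) {
            CowUsername(Cow::Owned(self.0.to_lowercase()))
        } else {
            self
        }
    }
}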

3 Likes

Are there any other important differences to note?

Using generics causes monomorphization cost: for each concrete type used, the generic code must be separately compiled to machine code. So if you have lots of code all using BorrowUsername as well as other similar generics for other pieces of data, you can multiply the total code size and compilation time.
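
If that ever becomes a problem, the usual mitigation (a general pattern, not something from your code) is to keep the generic function as a thin shim over a non-generic body, so the heavy part is compiled only once:

use std::borrow::Borrow;

pub fn greet<T: Borrow<str>>(username: T) {
    greet_inner(username.borrow());
}

// All the heavy logic lives here and is monomorphized exactly once.
fn greet_inner(name: &str) {
    println!("hello, {name}");
}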

2 Likes

That's a great point. I hadn't considered how validation requirements, etc. might change over time. Thanks!

True, but Cow is itself generic over a lifetime and a type, so you'd pay some of this cost with either approach. You could argue that Cow<'a, str> is such a common instantiation that its code has effectively already been generated, though. I.e. in the first approach, the compiler produces one Cow<'a, str> instantiation that all the string newtypes (Username, Email, etc.) wrap, whereas the Borrow approach results in at least two instantiations for EACH newtype. Good thinking.

As an aside, you may find that the most efficient way of doing this is to make a wrapper type on str directly.

#[repr(transparent)]
struct Username(str);

This allows for zero-cost translation between str and Username as well as enabling composite types (such as Cow<'a, Username>). Note that this implementation of Username is !Sized, so most of the time you'll manipulate it via &Username or &mut Username.
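
Note that Cow<'a, Username> additionally requires Username: ToOwned. Here's a sketch of one way that impl could look, reusing the #[repr(transparent)] layout guarantee (my sketch, not the only option):

impl ToOwned for Username {
    type Owned = Box<Username>;

    fn to_owned(&self) -> Box<Username> {
        let boxed: Box<str> = Box::from(&self.0);
        // Sound because #[repr(transparent)] gives Box<str> and
        // Box<Username> identical layouts.
        unsafe { Box::from_raw(Box::into_raw(boxed) as *mut Username) }
    }
}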

1 Like

I like this line of thinking a lot but, since the type is unsized, how would you instantiate it? From the Rust docs, it looks like working with custom dynamically sized types takes some hoop-jumping that would negate the benefit of the transparent wrapper.

True, but Cow is also generic over lifetime and type, so you'd pay this cost with either approach.

Yes, Cow is a generic type, but Cow<'a, str> is a concrete type (for codegen purposes; the lifetime doesn't count). So, the Rust function

fn do_something(u: CowUsername<'_>, e: CowEmail<'_>) { ... }

produces a single machine code function, but the generic version

fn do_something<TU, TE>(u: BorrowUsername<TU>, e: BorrowEmail<TE>) { ... }

requires up to four monomorphizations just for basic strings: <String, String>, <String, &str>, <&str, String>, <&str, &str>. This isn't necessarily a significant cost for small functions, but think twice before doing it to complex “application layer” functions that do lots of things — all their code has to be duplicated.

2 Likes

You can use a little bit of unsafe code to translate. This is what #[repr(transparent)] allows us to safely do:

impl Username {
    unsafe fn from_str_unchecked<'a>(s: &'a str) -> &'a Username {
        unsafe { std::mem::transmute(s) }
    }
}

// Placeholder validation for the example.
fn is_a_username(_s: &str) -> bool {
    true
}

impl<'a> TryFrom<&'a str> for &'a Username {
    type Error = ();

    fn try_from(s: &'a str) -> Result<&'a Username, ()> {
        if is_a_username(s) {
            Ok(unsafe { Username::from_str_unchecked(s) })
        } else {
            Err(())
        }
    }
}
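
At the call site this then reads like any other fallible conversion (a sketch that assumes it lives in the same module as Username, since the field is private):

fn example(raw: &str) -> Result<(), ()> {
    let username: &Username = raw.try_into()?;
    // `username` points at the same bytes as `raw`; no copy happened.
    println!("{} bytes", username.0.len());
    Ok(())
}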

Please do not transmute() pointers under any circumstances, and especially not fat pointers (&DST): their layout is not guaranteed. You should create the pointers from scratch by taking the address of a place, or by casting other valid pointers, like this:

unsafe fn from_str_unchecked<'a>(s: &'a str) -> &'a Username {
    unsafe {
        &*(s as *const str as *const Username)
    }
}

Transmute preserves bits and bytes exactly, so if the layouts of the two pointer types aren't the same, the result is an incorrect value for the resulting pointer. Pointers created by taking the address of a place, or by as-casting another valid pointer, are always guaranteed to have the correct layout, because the compiler itself synthesizes them.

2 Likes

Clearly explained, thank you :slight_smile:

This is a cool optimisation (taking @H2CO3's comment on pointer transmutation into account), and not something I'd have thought of. Dipping into unsafe to express the concept of a stringish username is too big a cognitive hit for most applications, though. At least, I'm certainly not writing anything with these performance requirements :sweat_smile: