Help design a "smart string" optimization

The smartstring crate provides a "small string" optimization, where strings up to 23 bytes (on 64 bit platforms) can be stored inline. The beef crate provides space-optimized Cow representations that store the Cow in just three usize (or even two) rather than the four that std Cow requires (due to being fully generic).

So why not combine them? Sure, it might be an unholy amalgamation of things with no real use case, but it'll be an interesting learning experience. I'm playing around with creating SmartBeef, a type that is three u64 big, and can be either

  • a heap allocated growable string (String);
  • a borrowed string (&str);
  • an inline string, up to 23 bytes (str([u8; 23]));
  • or a reference counted shared heap string, because why not (Arc<str>).

(It might be useful to use a bit to indicate if the borrowed string is static, so SmartBeef<'a>::to_owned() -> SmartBeef<'static> can avoid copying a static string if it's already 'static. I don't have that allocated for yet.)

Also: the smartstring crate relies on assumptions about std's String layout. I don't want to do that, but by still laying out my type to be compatible with std String, I hope to make it possible for LLVM to optimize away copies, though I don't want to explicitly require this to be true.

Here's an "exposition only" pseudo union showing my current layout idea:

// Exposition only
#[repr(C, packed, align(usize))]
#[cfg(target_pointer_width = "64", target_endian = "little")]
union SmartBeef {
    // identifier: nonzero in bytes 16..=19; zero in bytes 20..=24
    GrowableHeap {
        pointer: ptr::NonNull<u8>,  // bytes 00..=07
        length: u32,                // bytes 08..=11
        _pad: 0u32,                 // bytes 12..=15
        capacity: NonZeroU32,       // bytes 16..=19
        _pad: 0u32,                 // bytes 20..=23
    },
    // identifier:    zero in bytes 16..=19; zero in bytes 20..=23
    Borrowed {
        pointer: ptr::NonNull<u8>,  // bytes 00..=07
        length: u32,                // bytes 08..=11
        _pad: 0u32,                 // bytes 12..=15
        _pad: 0u32,                 // bytes 16..=19
        _pad: 0u32,                 // bytes 20..=23
    },
    // identifier: 0x01..0x18 in byte 23
    Inline {
        bytes: [u8; 23],            // bytes 00.=22
        length: u8 in 0x01 .. 0x18, // byte  23
    },
    // identifier: 0xFF    in bytes 16..=23
    SharedHeap {
        pointer: ptr::NonNull<u8>,  // bytes 00..=07
        length: u32,                // bytes 08..=11
        _pad: 0u32,                 // bytes 12..=15
        _pad: 0xFFFFFFFFFFFFFFFFu64 // bytes 16..=23
    },
}

A couple notes:

  • The empty string can be encoded as ("", 0u32, 0u32, 0u32, 0u32) or ([0; 23], 0u8). I don't see a specific benefit to disallowing a zero-length borrowed string at this moment, and do see a benefit to using it over zero-length inline string. Deref being just if byte[23] == 0 { str::from_raw_parts(bytes[0..8], bytes[8..16]) } else { &bytes[0..23] } is very convenient. All zeros being a zero length inline string is possible, but means Deref has to check for it, and mutation still can't just set the length to 0 (e.g. truncating an inline string), it has to zero out the pointer as well.
  • I don't know what I want to do on big endian platforms. The obvious and simple solution is to abandon String layout compatibility, and just use the above layout and the same code path that from_raw_partss the String (as I don't plan on relying on the std layout). Perhaps the copy elision I want to see is impractical to see from just automatic optimization, and I should drop the pretense of matching std layout and just pack the obvious way that works near identically on all platforms: (ptr: NonNull<u8>, len: u32, cap: u32, _pad: [u8; 7], inline_tag).

What do you think?

5 Likes

I'm afraid I'm not knowledgeable enough to comment on the technical side but I want to note that I find the name SmartBeef brilliant :stuck_out_tongue: as well as the description "an unholy amalgamation". Sounds like a science experiment gone wrong ... I'd be tempted to try it out just on that basis.

What about Rc? That's the type that I have been using for strings in my interpreted SQL-like language implementation. Incidentally, I store strings of up to 15 bytes inline when storing them as bytes (larger strings get stored as 7 bytes inline and an 8 byte ID value). The 15 is a somewhat arbitrary number, and maybe it could be a configured ( for each table field ), but IMO most real-world "short" strings (words/names) are not more than 15 bytes.

What's the effect of the “packed” repr in your union pseudo-code?

Actually nothing in this case, it's there because

  • it indicates that there's an intention to explicitly list all alignment and padding manually, and
  • I previously was using a different notation that might've implied adding padding, though the current layout avoids that.

It's an interesting trade-off between Arc and Rc. I think I prefer staying Send/Sync here, but using Rc is a valid trade-off.

Personally, I'd rather keep all my strings continuous for this. (That's also why I'm not considering adding ropes as a potential backing representation.) Additionally, I'm not really interested in bringing in a global string table for this; the point of using Arc is to extend copy-on-write semantics to 'static shared strings as well as borrowed shared strings.

While I'd be interested in an actual measurement of this in an international environment, the 23 byte count comes from necessity. String is 3Ă—usize; the ideal is to match String's layout in the String case.

But even if we don't match String's layout, and we limit string length to u32 (4 GiB, and I don't feel happy limiting it further), 2Ă—usize is just barely enough space for Cow, not enough for also tagging an inline string. (Specifically, (pointer, len: u32, cap: u32); cap=0 implies borrowed. You could cut the capacity again (potentially all the way to 16 MiB) and use the final byte (the high byte in little-endian) as the inline tag, but taking the cut and masking that out (shifting it out in big-endian) doesn't seem great, when we can just use the extra space common for String anyway.)

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.