This is my short Umbra String implementation: https://play.rust-lang.org/?version=nightly&mode=release&edition=2021&gist=3b8ffb39ba197b7b656f805da80767a1
#![feature(vec_into_raw_parts)]
use std::{ alloc, mem::ManuallyDrop };
/// The flag bit mask for the tag
const FLAG_BIT_MASK: u8 = 0x80;
/// How many bytes of data the small string buffer can hold.
/// Calculated under the assumption that a pointer is the same size as a usize.
/// Is one of 5, 11, 23 for 16, 32, 64 bit systems respectively.
const SMALL_STRING_SIZE: usize = size_of::<*const ()>() * 3 - 1;
/// The tag of the union, indicating which variant is currently stored.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tag {
Small,
Big,
}
#[cfg(target_endian = "little")]
#[repr(C)]
struct SmallString {
data: [u8; SMALL_STRING_SIZE],
len: u8, // only valid for 0..=SMALL_STRING_SIZE
}
#[cfg(target_endian = "little")]
#[repr(C)]
struct BigString {
data: *mut u8,
len: usize, // NOTE: cannot exceed isize::MAX (0..=SMALL_STRING_SIZE are also valid)
cap: usize, // NOTE: cannot exceed isize::MAX
}
#[cfg(target_endian = "big")]
#[repr(C)]
struct SmallString {
len: u8,
data: [u8; SMALL_STRING_SIZE],
}
#[cfg(target_endian = "big")]
#[repr(C)]
struct BigString {
cap: usize,
len: usize,
data: *mut u8,
}
/// A Small String Optimization (SSO) type that can store small strings inline.
/// A Small String is any string thats less than or equal to `SMALL_STRING_SIZE` bytes long, which is 23 bytes on 64 bit systems.
/// The `tag` is the highest bit in the capacity field, since it can only go up to isize::MAX while being a usize and as such is unused.
/// The `SmallString` and `BigString` structs are carefully layed out such that the most significant byte of the capacity field of the big string
/// is at the same memory location as the length field of the small string.
#[repr(C)]
pub union UmbraString {
small: ManuallyDrop<SmallString>,
big: ManuallyDrop<BigString>,
}
impl UmbraString {
/// Creates a new `UmbraString` from a string slice.
/// If the string is small enough, it will be stored inline, otherwise it will be stored on the heap.
/// This is an O(n) operation.
/// Recommended if you dont have / dont want to move a heap allocated string.
pub fn new_clone(other: &str) -> Self {
let len = other.len();
if len <= SMALL_STRING_SIZE {
unsafe { Self::new_small_unchecked(other) }
} else {
Self::new_big_clone(other)
}
}
/// Creates a new `UmbraString` from a String.
/// Does not inline the string, keeps it on the heap. If inlining is desired, use `new_consume_and_inline`.
/// This is an O(1) operation.
/// Recommended if you already have a heap allocated string that you dont need anymore, otherwise use `new_clone`.
pub fn new_consume(other: String) -> Self {
Self::new_big_take(other)
}
/// Creates a new `UmbraString` from a String.
/// If the string is small enough, it will be stored inline, otherwise it will be stored on the heap.
/// This is an O(n) operation if the string is small, O(1) otherwise.
/// `new_consume` is generally the more recommended way, but this might be useful if you want to
/// free some small memory from the heap. Although it does incure an extra copy for small strings
/// and if those later need to be turned into `String`s again, then an extra allocation is also needed.
pub fn new_consume_and_inline(other: String) -> Self {
let len = other.len();
if len <= SMALL_STRING_SIZE {
unsafe { Self::new_small_unchecked(&other) }
} else {
Self::new_big_take(other)
}
}
// Creates a new inline `UmbraString` from a string slice without checking the length.
unsafe fn new_small_unchecked(str: &str) -> Self {
let len = str.len();
let mut small = SmallString {
data: [0; SMALL_STRING_SIZE],
len: (len as u8) | FLAG_BIT_MASK,
};
small.data[..len].copy_from_slice(str.as_bytes());
Self { small: ManuallyDrop::new(small) }
}
// Copies a string slice into a new heap allocated `UmbraString`.
fn new_big_clone(str: &str) -> Self {
let len = str.len();
unsafe {
let big = BigString {
data: alloc::alloc(alloc::Layout::from_size_align_unchecked(len, 1)),
cap: len,
len,
};
std::ptr::copy_nonoverlapping(str.as_ptr(), big.data, len);
Self { big: ManuallyDrop::new(big) }
}
}
// Takes a String and moves it into the new `UmbraString`.
fn new_big_take(str: String) -> Self {
let (pointer, len, cap) = str.into_raw_parts();
let big = BigString {
data: pointer,
len,
cap,
};
Self { big: ManuallyDrop::new(big) }
}
/// Get the tag of the this string.
pub fn tag(&self) -> Tag {
// TODO: this will always read the tag from the small string, even if the big string is stored. I'm not sure if this is UB.
if ((unsafe { self.small.len }) & FLAG_BIT_MASK) == 0 {
Tag::Big
} else {
Tag::Small
}
}
/// Get the string slice of this string.
pub fn as_str(&self) -> &str {
let len = self.len();
unsafe {
match self.tag() {
Tag::Small => std::str::from_utf8_unchecked(&self.small.data[..len]),
Tag::Big =>
std::str::from_utf8_unchecked(std::slice::from_raw_parts(self.big.data, len)),
}
}
}
pub fn as_mut_str(&mut self) -> &mut str {
let len = self.len();
unsafe {
match self.tag() {
Tag::Small => std::str::from_utf8_unchecked_mut(&mut self.small.data[..len]),
Tag::Big =>
std::str::from_utf8_unchecked_mut(
std::slice::from_raw_parts_mut(self.big.data, len)
),
}
}
}
/// Consumes self and returns a String.
/// If the string is stored inline, a new heap allocated String is created, otherwise the existing heap allocated String is returned.
pub fn into_string(self) -> String {
let len = self.len();
unsafe {
match self.tag() {
Tag::Small => String::from_utf8_unchecked(Vec::from(&self.small.data[..len])),
Tag::Big => {
let string = String::from_raw_parts(self.big.data, len, self.big.cap);
std::mem::forget(self);
string
}
}
}
}
pub fn len(&self) -> usize {
unsafe {
match self.tag() {
Tag::Small => (self.small.len & !FLAG_BIT_MASK) as usize,
Tag::Big => self.big.len,
}
}
}
pub fn capacity(&self) -> usize {
unsafe {
match self.tag() {
Tag::Small => SMALL_STRING_SIZE,
Tag::Big => self.big.cap,
}
}
}
}
impl Drop for UmbraString {
fn drop(&mut self) {
println!("Dropping {:?}", self.tag());
match self.tag() {
Tag::Small => {
// no need to drop anything, its just some bytes
}
Tag::Big => {
drop(unsafe { String::from_raw_parts(self.big.data, self.big.len, self.big.cap) });
}
}
}
}
#[cfg(test)]
mod test {
use super::*;
#[test]
fn test_new_clone() {
let small = UmbraString::new_clone("Hello, world!");
let big = UmbraString::new_clone(&"Hello, world!".repeat(3));
assert_eq!(small.tag(), Tag::Small);
assert_eq!(small.as_str(), "Hello, world!");
assert_eq!(small.into_string(), "Hello, world!".to_string());
assert_eq!(big.tag(), Tag::Big);
assert_eq!(big.as_str(), "Hello, world!".repeat(3));
assert_eq!(big.into_string(), "Hello, world!".repeat(3));
}
#[test]
fn test_new_consume() {
let small_but_stored_big = UmbraString::new_consume("Hello, world!".to_string());
let big = UmbraString::new_consume("Hello, world!".repeat(3));
assert_eq!(small_but_stored_big.as_str(), "Hello, world!");
assert_eq!(small_but_stored_big.tag(), Tag::Big);
assert_eq!(big.as_str(), "Hello, world!".repeat(3));
assert_eq!(big.tag(), Tag::Big);
}
#[test]
fn test_new_consume_and_inline() {
let small = UmbraString::new_consume_and_inline("Hello, world!".to_string());
let big = UmbraString::new_consume_and_inline("Hello, world!".repeat(3));
assert_eq!(small.as_str(), "Hello, world!");
assert_eq!(small.tag(), Tag::Small);
assert_eq!(big.as_str(), "Hello, world!".repeat(3));
assert_eq!(big.tag(), Tag::Big);
}
}
I'm using unions and carefully laid out structs for little and big-endian and the pointer widths. I wanted to ask if there is any UB left in this code, and if you have any concerns about my approach. One specific question I have regarding my code is if it is UB to access a union member which is not the last member I have written to, since my union discriminant is stored inside the union itself. I've already ran this through MIRI on both big and little endian systems and it didnt see anything.