Implementation detail of smartstring

The smartstring docs say:

Given that we use the knowledge that a certain bit in the memory layout of String will always be unset as a discriminant

I want to know which bit...
From the source:

impl Discriminant {
    #[inline(always)]
    const fn from_bit(bit: bool) -> Self {
        if bit {
            Self::Inline
        } else {
            Self::Boxed
        }
    }
}

impl Marker {
    #[inline(always)]
    pub(crate) const fn discriminant(self) -> Discriminant {
        Discriminant::from_bit(if UPSIDE_DOWN_LAND {
            self.0 & 0x01 != 0
        } else {
            self.0 & 0x80 != 0
        })
    }
}

It seems the first byte is the key...

But why is that bit of String always 0?

SmartString is the same size as String and relies on pointer alignment to be able to store a discriminant bit in its inline form that will never be present in its String form

If your pointer must always be aligned to a 32-bit or 64-bit boundary, say, the lowest bits of your aligned pointers (memory addresses) will always end in some number of zeroes.
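To make the alignment argument concrete, here is a minimal sketch. The helper name low_bits_clear is made up for illustration and is not part of smartstring. It uses a Box<u64> because u64 has a type alignment greater than 1; a String's buffer holds u8, whose type alignment is 1, so for String the zero bit comes from how the allocator behaves in practice rather than from the type itself.

```rust
use std::mem::align_of;

/// Hypothetical helper: true if the address of `ptr` has all bits
/// below the alignment of `T` cleared.
fn low_bits_clear<T>(ptr: *const T) -> bool {
    // For an alignment of 8 the mask is 0b111, for 4 it is 0b11, etc.
    let mask = align_of::<T>() - 1;
    (ptr as usize) & mask == 0
}

fn main() {
    let boxed = Box::new(0u64);
    let ptr: *const u64 = &*boxed;

    // A valid Box<u64> is always aligned for u64, so the low bits are zero.
    assert!(low_bits_clear(ptr));
    println!("{:#x} & {:#x} == 0", ptr as usize, align_of::<u64>() - 1);
}
```

Those always-zero low bits are the spare room a tagged pointer can use to hold a discriminant.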

On a little-endian machine, I would expect the lowest bit of the pointer to be the first bit in SmartString's memory.
But that doesn't seem to match the source code...

Where am I going wrong?
Perhaps I need a more detailed explanation...

There's no endianness involved once you are dealing with abstract numeric values. Number-based bitwise operations don't transmute everything to a native-endian blob of bytes, they treat the number as… well, a number. Thus x & 0x01 always gives you the least significant bit of an integer.

If this weren't true, you couldn't rely on basically anything staying consistent across platforms. A literal "1" would suddenly become 2^63, which is certainly not a useful abstraction.
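A small sketch of the difference between numeric operations and byte layout, using only standard integer methods (low_byte is a hypothetical helper name):

```rust
/// Hypothetical helper: the numerically lowest 8 bits of `x`.
fn low_byte(x: u32) -> u8 {
    (x & 0xFF) as u8
}

fn main() {
    let x: u32 = 0x0102_0304;

    // Bitwise operations act on the value, so this holds on every platform.
    assert_eq!(low_byte(x), 0x04);
    assert_eq!(x & 0x01, 0);

    // Endianness only shows up when the value is viewed as bytes in memory.
    assert_eq!(x.to_le_bytes(), [0x04, 0x03, 0x02, 0x01]);
    assert_eq!(x.to_be_bytes(), [0x01, 0x02, 0x03, 0x04]);

    // The native layout matches one of the two, depending on the target.
    let expected_first = if cfg!(target_endian = "little") { 0x04 } else { 0x01 };
    assert_eq!(x.to_ne_bytes()[0], expected_first);
}
```

So x & 0x01 is always the least significant bit of the number, while "the first byte in memory" is the part that depends on the target.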

See Tagged Pointers for a starter on the general technique.

The validate function is quite revealing:

pub fn validate() {
    let mut s = String::with_capacity(5);
    s.push_str("lol");
    assert_eq!(3, s.len(), "SmartString memory layout check failed");
    assert_eq!(5, s.capacity(), "SmartString memory layout check failed");
    let ptr: *const String = &s;
    let ptr: *const usize = ptr.cast();
    let first_bytes = unsafe { *ptr };
    assert_ne!(3, first_bytes, "SmartString memory layout check failed");
    assert_ne!(5, first_bytes, "SmartString memory layout check failed");
    let first_byte = unsafe { *(ptr as *const u8) };
    #[cfg(target_endian = "little")]
    assert_eq!(
        0,
        first_byte & 0x01,
        "SmartString memory layout check failed"
    );
    #[cfg(target_endian = "big")]
    assert_eq!(
        0,
        first_byte & 0x80,
        "SmartString memory layout check failed"
    );
}

It assumes that the first pointer-sized word of the String (8 bytes on a 64-bit target) is its raw heap pointer.
Then it validates that the least significant bit of the first byte (the lowest byte of the pointer) is 0 on a little-endian machine, and that the most significant bit of the first byte (the highest byte of the pointer) is 0 on a big-endian machine.

I just can't understand the validation...

Is this to say that an address pointing to the heap always looks like 0b...xxx00 or 0b...xxx000?

Then why does the big-endian case use the highest bit of the highest byte? @quinedot

Well, that was me relying on their English description, and it's accurate for little endian. After skimming the code, looks like they're assuming you never get memory in the upper-half of the address space for big-endian instead. (Typically that's kernel memory, though maybe you could get ahold of such a pointer via a syscall or the like.) There's also this:

// This hack isn't going to work out very well on 32-bit big endian archs,
// so let's not compile on those.
assert_cfg!(not(all(target_endian = "big", target_pointer_width = "32")));

Edit: To be more explicit, they're putting their marker bit into the first byte of the data structure, which corresponds to the first byte of the pointer, given a String. That's the low byte in little endian (so relying on alignment by using the lowest bit) and the high byte in big endian (so relying on lower-half of address space by using the highest bit).
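Here is a rough sketch of both cases, simulating each byte order with to_le_bytes and to_be_bytes so it runs on any machine. The address is a made-up example value (even, and in the lower half of a 64-bit address space), and the helper names are hypothetical, not taken from smartstring.

```rust
/// First byte in memory if `addr` is stored little-endian (hypothetical helper).
fn first_byte_le(addr: u64) -> u8 {
    addr.to_le_bytes()[0]
}

/// First byte in memory if `addr` is stored big-endian (hypothetical helper).
fn first_byte_be(addr: u64) -> u8 {
    addr.to_be_bytes()[0]
}

fn main() {
    // Assumed example of a heap address; not taken from a real process.
    let addr: u64 = 0x0000_7f3a_1c20_4010;

    // Little endian: the first byte is the LOW byte of the pointer.
    // Its 0x01 bit is clear because the address is even (alignment).
    assert_eq!(first_byte_le(addr), 0x10);
    assert_eq!(first_byte_le(addr) & 0x01, 0);

    // Big endian: the first byte is the HIGH byte of the pointer.
    // Its 0x80 bit is clear because the address is below 2^63.
    assert_eq!(first_byte_be(addr), 0x00);
    assert_eq!(first_byte_be(addr) & 0x80, 0);

    // The same two facts stated numerically, with no byte order involved.
    assert_eq!(addr & 1, 0);
    assert_eq!(addr >> 63, 0);
}
```

Either way, the bit being tested lives in the first byte of the structure; only the reason it is guaranteed to be zero differs.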
