Manually adding padding to `FastStr` gains 9x boost on performance, but why?

Hi, I'm the author of FastStr crate, and recently I found a wired problem that the clone cost of FastStr is really high. For example, an empty FastStr clone costs about 40ns on amd64 compared to about 4ns of a normal String.

After some time of investigation, I found that this is because the Repr::Inline part has really great affect on the performance. And after I added a padding to it(change the type of len from u8 to usize), the performance boosts about 9x. But the root cause is still not clear.

Related PR: feat: change type of len to usize to optimize clone cost by PureWhiteWu · Pull Request #6 · volo-rs/faststr · GitHub

Furthermore, I've tried the following methods, but none helps:

  1. change INLINE_CAP to 24
  2. change INLINE_CAP to 22 and added a padding to the Inline variant: Inline {_pad: u64,len: u8,buf: [u8; INLINE_CAP],},
  3. change INLINE_CAP to 22 and add a new struct Inline without the _pad field

To change the INLINE_CAP to 22 is only for not increasing the size of FastStr itself when add an extra padding, so the performance is nothing to do with it.

I think it's a alignment issue. instead of

Inline {
    _pad: u64,
    len: u8,
    buf: [u8; INLINE_CAP],
}

can you try the following alternatives:

// alternative #1
#[derive(Clone)]
#[repr(usize)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}
// alternative #2
#[derive(Clone)]
enum Repr {
    //...
    Inline(Inline),
}
#[repr(align(8))]
Inline {
    len: u8,
    buf: [u8; INLINE_CAP],
}

the reason (I think) is due to the way how rust layout enum fields. this enum has very few variants, so I think u8 is chosen to be the discriminant, which means there are padding bytes between the discriminant and all other variant fields, but not between the FastStr::Inline variant. in pseudo code:

#[repr(C)]
struct ReprBytesVariant {
    discriminant: u8,
    // note here will be padding bytes
    bytes: Bytes,
}
#[repr(C)]
struct ReprInlineVariant {
    discriminant: u8,
    len: u8,
    buf: [u8; INLINE_CAP],
}
union Repr {
    Bytes: ReprBytesVariant,
    Inline: ReprInlineVariant,
}
1 Like

forget another alternative:

#[derive(Clone)]
#[repr(C)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}

Hi, thanks very much for your reply!
I've tried to just change the type of len to usize, and seems that it also works.
The problem is that, why does the alignment causes such a big perfomance gap?

maybe sufficient alignment enables simd, or maybe it's due to extra branching instructions, or maybe overlapping with padding bytes of other variants prohibits certain optimizations. just my random guesses though, I don't really know.

if you are curious, you can check the emitted assembly code, or profile and measure it.

5 Likes

Crosspost in reddit: Manually adding padding to FastStr gains 9x boost on performance, but why?