Manually adding padding to `FastStr` gains 9x boost on performance, but why?

PureWhite · December 23, 2023, 11:09am

Hi, I'm the author of FastStr crate, and recently I found a wired problem that the clone cost of FastStr is really high. For example, an empty FastStr clone costs about 40ns on amd64 compared to about 4ns of a normal String.

After some time of investigation, I found that this is because the Repr::Inline part has really great affect on the performance. And after I added a padding to it(change the type of len from u8 to usize), the performance boosts about 9x. But the root cause is still not clear.

Related PR: feat: change type of len to usize to optimize clone cost by PureWhiteWu · Pull Request #6 · volo-rs/faststr · GitHub

Furthermore, I've tried the following methods, but none helps:

change INLINE_CAP to 24
change INLINE_CAP to 22 and added a padding to the Inline variant: Inline {_pad: u64,len: u8,buf: [u8; INLINE_CAP],},
change INLINE_CAP to 22 and add a new struct Inline without the _pad field

To change the INLINE_CAP to 22 is only for not increasing the size of FastStr itself when add an extra padding, so the performance is nothing to do with it.

nerditation · December 23, 2023, 12:22pm

I think it's a alignment issue. instead of

Inline {
    _pad: u64,
    len: u8,
    buf: [u8; INLINE_CAP],
}

can you try the following alternatives:

// alternative #1
#[derive(Clone)]
#[repr(usize)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}

// alternative #2
#[derive(Clone)]
enum Repr {
    //...
    Inline(Inline),
}
#[repr(align(8))]
Inline {
    len: u8,
    buf: [u8; INLINE_CAP],
}

the reason (I think) is due to the way how rust layout enum fields. this enum has very few variants, so I think u8 is chosen to be the discriminant, which means there are padding bytes between the discriminant and all other variant fields, but not between the FastStr::Inline variant. in pseudo code:

#[repr(C)]
struct ReprBytesVariant {
    discriminant: u8,
    // note here will be padding bytes
    bytes: Bytes,
}
#[repr(C)]
struct ReprInlineVariant {
    discriminant: u8,
    len: u8,
    buf: [u8; INLINE_CAP],
}
union Repr {
    Bytes: ReprBytesVariant,
    Inline: ReprInlineVariant,
}

nerditation · December 23, 2023, 12:25pm

forget another alternative:

#[derive(Clone)]
#[repr(C)]
enum Repr {
    Empty,
    Bytes(Bytes),
    ArcStr(Arc<str>),
    ArcString(Arc<String>),
    StaticStr(&'static str),
    Inline { len: u8, buf: [u8; INLINE_CAP] },
}

PureWhite · December 23, 2023, 12:44pm

Hi, thanks very much for your reply!
I've tried to just change the type of len to usize, and seems that it also works.
The problem is that, why does the alignment causes such a big perfomance gap?

nerditation · December 23, 2023, 1:04pm

maybe sufficient alignment enables simd, or maybe it's due to extra branching instructions, or maybe overlapping with padding bytes of other variants prohibits certain optimizations. just my random guesses though, I don't really know.

if you are curious, you can check the emitted assembly code, or profile and measure it.

vague · December 24, 2023, 6:16am

Crosspost in reddit: Manually adding padding to FastStr gains 9x boost on performance, but why?

system · March 23, 2024, 6:16am

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.

Topic		Replies	Views
Announcing iString: a String type with small string optimization announcements	9	2580	January 12, 2023
Force padding to be zero, not uninit help	8	210	October 5, 2024
Why is f32::from_str so slow help	26	2804	January 12, 2023
Struct padding rules in Rust help	11	2890	April 6, 2022
String concatenation best practices/performance? help	5	10573	January 12, 2022

Manually adding padding to `FastStr` gains 9x boost on performance, but why?

Related topics