Hi, I'm the author of the FastStr crate, and recently I found a weird problem: the clone cost of FastStr is really high. For example, cloning an empty FastStr costs about 40ns on amd64, compared to about 4ns for a normal String.
After some investigation, I found that this is because the Repr::Inline variant has a really big effect on performance. After I added padding to it (changing the type of len from u8 to usize), performance improved by about 9x. But the root cause is still not clear.
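For context, here is a minimal timing sketch (not the crate's actual benchmark; the function name and iteration count are my own) showing how such a per-clone cost can be measured for an empty String:

```rust
use std::hint::black_box;
use std::time::Instant;

// Averages the cost of cloning an empty String over many iterations.
// black_box keeps the compiler from optimizing the clones away.
fn avg_clone_nanos(iters: u32) -> f64 {
    let s = String::new();
    let start = Instant::now();
    for _ in 0..iters {
        black_box(black_box(&s).clone());
    }
    start.elapsed().as_nanos() as f64 / iters as f64
}

fn main() {
    println!("avg String clone: {:.1} ns", avg_clone_nanos(1_000_000));
}
```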
Furthermore, I've tried the following methods, but none of them helps:
- change INLINE_CAP to 24
- change INLINE_CAP to 22 and add padding to the Inline variant: Inline { _pad: u64, len: u8, buf: [u8; INLINE_CAP] }
- change INLINE_CAP to 22 and add a new Inline struct without the _pad field
Changing INLINE_CAP to 22 is only to avoid increasing the size of FastStr itself when adding the extra padding, so the performance has nothing to do with it.
The reason, I think, is the way Rust lays out enum fields. This enum has very few variants, so a u8 is likely chosen as the discriminant, which means there are padding bytes between the discriminant and the fields of all other variants, but not before the FastStr::Inline variant's fields. In pseudocode:
#[repr(C)]
struct ReprBytesVariant {
    discriminant: u8,
    // note: padding bytes will be inserted here to align `bytes`
    bytes: Bytes,
}

#[repr(C)]
struct ReprInlineVariant {
    discriminant: u8,
    len: u8,
    buf: [u8; INLINE_CAP],
}

union Repr {
    Bytes: ReprBytesVariant,
    Inline: ReprInlineVariant,
}
Hi, thanks very much for your reply!
I've tried just changing the type of len to usize, and it seems that also works.
The problem is: why does the alignment cause such a big performance gap?
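To see what the widened len changes, here is a small sketch (the struct names are mine, and the standalone structs only approximate the enum variants' payloads) comparing the alignment of the two shapes:

```rust
use std::mem::{align_of, size_of};

const INLINE_CAP: usize = 22;

// Original shape: every field is 1-byte aligned, so the whole payload is too.
struct InlineU8 {
    len: u8,
    buf: [u8; INLINE_CAP],
}

// Changed shape: the usize len raises the payload's alignment to 8 bytes.
struct InlineUsize {
    len: usize,
    buf: [u8; INLINE_CAP],
}

fn main() {
    println!("u8 len:    size = {}, align = {}",
             size_of::<InlineU8>(), align_of::<InlineU8>());
    println!("usize len: size = {}, align = {}",
             size_of::<InlineUsize>(), align_of::<InlineUsize>());
}
```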
Maybe sufficient alignment enables SIMD, or maybe it's due to extra branching instructions, or maybe overlapping with the padding bytes of other variants prohibits certain optimizations. Just my random guesses though; I don't really know.
If you're curious, you can check the emitted assembly, or profile and measure it.