🧵 Stringlet UTF-8 Hack Option Niche?

With the release of Stringlet 0.10, I’ve looked at Option and Result niche optimization. Alas I didn’t find anything that seems applicable here.

I’ve come up with a scheme that would work for (all inline and if needed, other) strings: According to the UTF-8 standard no byte may currently (and maybe forever) be 0b1111_1xxx. That gives eight possible niche values in the first byte of non-zero sized UTF-8 byte arrays. Any way of expressing this to the compiler would be highly welcome!

I have another related need. I want to introduce a comfort wrapper unifying all kinds. Each being repr(C), the enum would implicitly (or if need be, explicitly) also be. So each, and thus the enum, would share the above niche:

enum Stringlet<const SIZE: usize> {
    Fixed(FixedStringlet<SIZE>),
      Var(  VarStringlet<SIZE>),
     Trim( TrimStringlet<SIZE>),
     Slim( SlimStringlet<SIZE>),
}

Here only VarStringlet stores the actual length in one extra byte. Since stringlets will rarely be 255 bytes big, that leaves room to store the discriminator. (Currently SlimStringlet, and hence the whole enum, is even capped at size 64, but I’m looking at relaxing that.) Again any way of expressing this to the compiler would be highly welcome!

I don't think there's any way to express things like "this particular bit pattern is impossible, trust me" without pattern types, which AFAIK are not anywhere near stabilization. The closest you could get would be a NonZero<U8> that you shift whenever you read/write it so that 0 maps to a different bit pattern.

You can't put the denominator into that place in memory just because it will rarely be used. I'm a little confused by your methodology here. What do you expect to happen when that memory does get used?

And what does "implicitly repr(C)" mean? Why would the enum itself being repr(C) allow (or it not being, prevent) the compiler to optimize your theoretical UTF-8 niche?

This works:

#[repr(u8)]
enum Utf8CodeUnit {
    CU0, CU1, CU2, ..., CU247
}

You can do it with an enum -- see this debug output, especially the largest_niche:

           largest_niche: Some(
               Niche {
                   offset: Size(0 bytes),
                   value: Int(
                       I8,
                       false,
                   ),
                   valid_range: 0..=247,
               },
           ),

(Same idea as @tczajka posted just before me)

I can, because SIZE is a generic constant. So mostly the type will have that niche, and only the three biggest sizes will not. Those will need an extra byte for the Option.

I expressed myself badly. All those four kinds are generics of the same type StringletBase. Since they are repr(C), the enum could share that same 1st possibly niche byte across all four kinds.

You guys are amazing! This sounds very promising. And I didn’t know those cool analytic tools.

I suppose tczajka’s enum is enough? Or is there any security gained from giving the explicit values?

I like to be explicit, but it's not strictly necessary.

I'm still not quite understanding you, but I think that's just my morning brain. I can see how you could theoretically take advantage of an oversized length field when you know it won't get that large, but making niches dependent on a const generic seems infeasible.

That makes sense now, my bad.

For TrimStringlet and SlimStringlet I have another, bigger UTF-8 hack. When the length is less than SIZE, I store an illegal value 0b11xx_xxxx in the last byte to encode it. With this, that would become illegal also to the Rust type-system, for some values. Luckily, for SIZE == 1 that value will be within the enum. So, when the first byte is also the last, at least they don’t collide.

This is becoming rather hacky, OMG! :face_with_peeking_eye:

The compiler currently does not support optimizing things like this:

enum AB { A = 1, B = 2 }
enum CD { C = 3, D = 4 }
enum ABCD { AB(AB), CD(CD) }

It's an obvious wish, but needs implementation work to be done.

The compiler, as of today, only supports single-value niches, but optimizing ABCD would need it to support knowing that ABCD::AB can be encoded as 1 or 2.

Wow indeed! There seems to be some special handling for Option<Option<bool>> or Option<Result<bool, ()>>, which both manage to be only one byte.

To enable this for stringlets as well, I was hoping for a bigger niche – and seem to have found it above.

Option and Result<_, ()> are not special cases; they are instances of the general case that is supported, where only one variant has data and all the other variants can be fit in niches. ABCD is a case that is not supported, because two variants have data.

Aha, I unintentionally put () as a dummy. Already Option<Result<bool, bool>> fails. I thought it would do some mapping. OTOH, that wouldn’t even be zero cost. And non-zero-non-one types complementing bool aren’t that plentiful.

Never stop learning something new. So Result niche optimisation isn’t that useful. But I’ll take whatever I can get.

Well, it does take effect in the common case of Result<(), SomeErrorEnum>. But since Results are rarely stored in data structures, and far more often returned and then immediately consumed, niche optimization isn’t that valuable for Result even when it is possible.

Note that &Result<bool, bool> can be mapped to a &bool with match, therefore an actual bool byte (0u8 or 1u8) must be present for any variant.
Also, as no methods are invoked on the Result during match, library cannot even do ad-hoc fixes like statically allocated 0u8 and 1u8.

I actually have that for my constructor size checking fits() function. So good to know!

@scottmcm is there already an RFC for the implementation of complementing enums? I could prepare my code (by making the Error enum start beyond the kind discriminator) for that, and link this as a reason.

I also just realized that union could be useful for the comfort wrapper. This way, you get to write the variant-locating code rather than compiler.

Something like this?

As an optimization it just needs someone to do the work. When it's non-guaranteed and thus could just be reverted (though obviously we'd rather not undo these optimizations, but we could), there's no need for an RFC.

That work isn't trivial, though, so as of yet nobody's picked it up.

Wow, that's some lovely magic, xie xie Hantong! You put the niche at either end, depending on endianness. Is that just an observation, or is that the guaranteed position? If the last byte were to be the niche, I'd be screwed -- even only for the earlier u8-limiting enum proposition.

That's because I'm not dealing with a number, but with a UTF-8 string. So no endiannes, but I have another hack, where the last byte of much longer strings may take all 255 values.

Also this does not seem possible for me, due to const generic limitations. You manually put a 7, but I'd need SIZE - 1.