Why is `RTSecret<T, std::cell::Cell<u8>>` the same size as `RTSecret<T, std::cell::Cell<u16>>`, and how can it be optimized so that the former is smaller than the latter?

This is related to this issue.

Context

I am the author of sosecrets-rs, and right now I am writing RTSecret<T, MEC: typenum::Unsigned>, which represents a secret that returns an Err or panics only if the value of its internal counter (a counter that measures the number of times the secret is exposed; I call it ec) exceeds the runtime unsigned integer value represented by the type parameter MEC.

In that issue, @2e71828 helped me write a declarative macro named impl_choose_int!(...), which, in a brute-force manner, implements the trait below:

trait ChooseInt {
    type Output;
}

for all type-level unsigned integers representable in 1 to 64 bits.

Question
The runtime secret struct RTSecret looks like so:

struct RTSecret<
    T,
    MEC: ChooseInt + typenum::Unsigned,
>(T, std::cell::Cell<<MEC as ChooseInt>::Output>);

The intent is for <MEC as ChooseInt>::Output to be u8 if U1 <= MEC <= U255, u16 if U256 <= MEC <= U65535, and so on, so that RTSecret can 'save' space.
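For example, this is the mapping I expect (a sketch, assuming impl_choose_int! from the linked issue has already been expanded):

use typenum::{U10, U1000};

// Assuming the ChooseInt impls from the macro expansion are in scope:
type SmallCounter = <U10 as ChooseInt>::Output;    // expected to be u8
type LargerCounter = <U1000 as ChooseInt>::Output; // expected to be u16

fn main() {
    assert_eq!(core::mem::size_of::<SmallCounter>(), 1);
    assert_eq!(core::mem::size_of::<LargerCounter>(), 2);
}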

However, to my shock (Playground), RTSecret<i32, U1000> is the same size as RTSecret<i32, U10> according to std::mem::size_of, whereas I would expect the latter to be smaller.

I understand that size_of takes alignment into account: e.g. i32 is 4 bytes and <U10 as ChooseInt>::Output = u8 is 1 byte, but because of alignment, 3 bytes of padding are added so that the struct is 8 bytes altogether.

However, does it really mean that the layout at runtime is going to also be 8 bytes? If so, then this optimization will be a big waste.

Yes, it does. A type T's size is always a multiple of its alignment, so that values of T can be placed in a slice with a stride of size_of::<T>(); otherwise, the second element of the slice would be improperly aligned. That's what @CAD97 was referring to in their reply to your initial post.
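A small illustration of that rule (Padded here is just an example struct, not from your crate):

use std::mem::{align_of, size_of};

#[allow(dead_code)]
struct Padded(i32, u8); // 5 bytes of fields, 4-byte alignment

fn main() {
    assert_eq!(align_of::<Padded>(), 4);
    // The 5 bytes of fields are rounded up to the next multiple of the alignment,
    // so each element of a [Padded] slice starts at a 4-byte boundary.
    assert_eq!(size_of::<Padded>(), 8);
}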

2 Likes

This page has more information about repr(packed), which you may want to use: https://doc.rust-lang.org/nomicon/other-reprs.html#reprpacked

Edit: This shouldn't be used when you take a reference to self.0, because that reference may not be correctly aligned under repr(packed). This is important if, for example, you have an array of RTSecrets.
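For example, a rough sketch of the effect (with an illustrative struct, not the real RTSecret):

use std::mem::size_of;

#[allow(dead_code)]
#[repr(packed)]
struct PackedSecret(i32, u8);

fn main() {
    // No padding is inserted, so the whole struct is 5 bytes,
    // at the cost of the caveats about field references mentioned above.
    assert_eq!(size_of::<PackedSecret>(), 5);
}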

2 Likes

It won't always be 8 bytes; the compiler pads so that T's alignment is always met when you have an array of RTSecret<T, _>. If T is 8 or 16 bits wide, for example, there will be no padding, or at most 1 byte of it.

But this is a very marginal optimization - it only applies if T has an alignment of 1 or 2 bytes (not 4), or if MEC would be 64 bits and T has an alignment of 4 bytes. Otherwise, for performance reasons, the compiler will prefer to pad.

It's in theory possible for the compiler to use the padding as a "niche" for the niche-filling optimization (where it could put the discriminant of something like Option<RTSecret<T, _>> in the padding of RTSecret, making Option<RTSecret<T, _>> the same size as RTSecret<T, _>), but I don't believe it does this today. (Edit: see @2e71828's aside; making this possible would involve changes to language semantics.)
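A quick way to see that this isn't done today (illustrative struct; this is what I'd expect on current compilers):

use std::mem::size_of;

#[allow(dead_code)]
struct S(i32, u8); // 3 bytes of padding

fn main() {
    assert_eq!(size_of::<S>(), 8);
    // The discriminant is not squeezed into S's padding, so Option grows:
    assert_eq!(size_of::<Option<S>>(), 12);
}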

2 Likes
Aside re: niche optimization

My understanding is that this is blocked by language semantics: if there's a type T stored within an enum variant, then it's possible to get an &mut T pointing to it. Reading padding bytes is generally UB because they are allowed to be uninitialized, but writing over them is not, even if the write places uninit there.

If downstream code does this, it will wipe out the discriminant value. That will either change the active enum variant or make the discriminant uninitialized, either of which will very likely lead to UB in safe code.

The only way to make this work would be for the compiler to restrict the legal bit patterns of some or all of the padding bytes, so that copying one T over another couldn't possibly affect an enum variant stored there, but I don't know whether that would be too breaking a change for existing code.

4 Likes

I feel upset though; like I guess my fundamentals are really weak (I come from a Python background) and I'm perhaps too optimistic about my capability to contribute to the Rust ecosystem.

I guess the optimization still has its place: if the user of RTSecret<T, MEC> uses, e.g., T = u16 and MEC = U10000 (so the internal counter is of type u16), then it does 'save' space compared to (u16, usize) without this optimization.
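For example, a rough comparison (using plain tuples as stand-ins, assuming a 64-bit target):

use std::mem::size_of;

fn main() {
    // With the optimization: u16 secret + u16 counter.
    assert_eq!(size_of::<(u16, u16)>(), 4);
    // Without it: u16 secret + usize counter (8 bytes on a 64-bit target).
    assert_eq!(size_of::<(u16, usize)>(), 16);
}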

It is just that, as pointed out by @CAD97 and @2e71828, it will be rare for users to benefit from this.

1 Like

Hi sir @2e71828

For the macro you wrote, if I want to extend it to also emit type AtomicOutput = AtomicU8, AtomicU16, etc., what do you think is the best way to do so?

The simplest approach is to copy and paste the macro and create a new trait with an associated type AtomicOutput = AtomicU8, etc., but I don't think that is elegant. However, I am unsure how to turn, say, B0 => u8 into type AtomicOutput = AtomicU8, since in the macro implementation $out is u8 - i.e. I don't know how to map u8 to AtomicU8 within a macro.

Don't feel bad about it; this sort of stuff (alignment restrictions) is something that you only encounter when you get deep into how CPUs work, and it's not something the vast majority of programmers even know about, let alone care about.

Most of the time, the things that matter are bigger picture than these details. Getting the big picture right and then worrying about the details is a far more efficient way of working than getting the details right first, only to have to work out whether any of that code is usable, or whether the big picture is wrong and the details don't matter.

6 Likes

The easiest way is probably to define a helper trait:

use core::sync::atomic::AtomicU8;

trait AsAtomic {
    type Atomic;
}

impl AsAtomic for u8 {
    type Atomic = AtomicU8;
}

// also impls for u16 and u32…

Then, in the macro, you can refer to <$out as AsAtomic>::Atomic.
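For example, a simplified, hypothetical arm (not the real impl_choose_int!; it just shows where the helper trait plugs in):

macro_rules! impl_counter {
    ($num:ty => $out:ty) => {
        impl ChooseInt for $num {
            type Output = $out;
            // The atomic counterpart is derived from the same $out token:
            type AtomicOutput = <$out as AsAtomic>::Atomic;
        }
    };
}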

2 Likes

Haha @2e71828, thank you for your reply;

I went and did it the 'macro' way: Playground.

macro_rules! convert_to_atomic {
    (u8) => {
        core::sync::atomic::AtomicU8
    };
    
    (u16) => {
        core::sync::atomic::AtomicU16
    };
    
    (u32) => {
        core::sync::atomic::AtomicU32
    };
    
    (u64) => {
        core::sync::atomic::AtomicU64
    };
    
    (u128) => {
        core::sync::atomic::AtomicU128  
    };
}

And changed the $out:ty to $out:tt.

But I honestly think your solution is much simpler and more elegant; I shall adopt yours.

The atomics are because I'm writing RTSyncSecret soon, which is the thread-safe version of RTSecret.

1 Like

Hi @farnz

Thank you for taking the time to pen this encouragement. I appreciate it. It shows that the Rust community is an inclusive and encouraging one. :heart:

Yeah, I think having an idea, implementing it and testing it out, only to find that it doesn't work as intended (or at least not as ideally as I had hoped), is still a great lesson. I learned about alignment, etc. from a textbook in the past but had obviously forgotten all of it - I guess it's time to pick it up again. :wink: Thank you once again.

1 Like

Hi sir @2e71828

Sorry to bother you again.

Looks like I encountered an insurmountable issue.

Playground

// other impls omitted
impl AsAtomic for u64 {
    type Output = core::sync::atomic::AtomicU64;
}

impl AsAtomic for u128 {
    // if cpu word size is not 128 bits, then associated type `Output` exists
    #[cfg(not(target_pointer_width = "128"))]
    type Output = core::sync::atomic::AtomicU64;
}

In the macro impl which you had kindly provided:

// omitted ...
// Implement one
    (
        @prev_args ($($prev_args:ident,)*);
        @prev_num $prev_num:ty;
        @rest_args ($arg:ident, $($rest_args:ident,)*);
        @rest_out ($out:tt; $($rest_out:tt;)*);
    )
    => {
        impl<$($prev_args,)* $arg> ChooseInt for UInt<$prev_num, $arg> {
            type Output = $out;
            // If I put #[cfg(not(target_pointer_width = "128"))] here, then support
            // for AtomicU64 and smaller sizes is lost too.
            type AtomicOutput = <$out as AsAtomic>::Output;
        }
        impl_choose_int!{
            @prev_args ($($prev_args,)* $arg,);
            @prev_num UInt<$prev_num, $arg>;
            @rest_args ($($rest_args,)*);
            @rest_out ($($rest_out;)*);
        }
    };

Question
It seems that type AtomicOutput = <$out as AsAtomic>::Output; compiles, as shown in the example expansion:

impl<B00, B01, B02, B03, B04, B05, B06, B07, B10, B11, B12, B13, B14, B15,
    B16, B17, B20, B21, B22, B23, B24, B25, B26, B27, B30, B31, B32, B33, B34,
    B35, B36, B37, B40, B41, B42, B43, B44, B45, B46, B47, B50, B51, B52, B53,
    B54, B55, B56, B57, B60, B61, B62, B63, B64, B65, B66, B67, B70, B71, B72,
    B73, B74, B75, B76, B77, B80, B81, B82> ChooseInt for
    UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UInt<UTerm,
    B00>, B01>, B02>, B03>, B04>, B05>, B06>, B07>, B10>, B11>, B12>, B13>,
    B14>, B15>, B16>, B17>, B20>, B21>, B22>, B23>, B24>, B25>, B26>, B27>,
    B30>, B31>, B32>, B33>, B34>, B35>, B36>, B37>, B40>, B41>, B42>, B43>,
    B44>, B45>, B46>, B47>, B50>, B51>, B52>, B53>, B54>, B55>, B56>, B57>,
    B60>, B61>, B62>, B63>, B64>, B65>, B66>, B67>, B70>, B71>, B72>, B73>,
    B74>, B75>, B76>, B77>, B80>, B81>, B82> {
    type Output = u128;
    type AtomicOutput = <u128 as AsAtomic>::Output;
}

type AtomicOutput = <$out as AsAtomic>::Output; resolves to AtomicU64 on a 64-bit machine like mine and will NOT compile on a 128-bit machine (if such a thing exists). Is there any way to handle this case? Because if there is no way, then support for a sync u128 will be pointless as well.

There is no AtomicU128 (as yet), so you cannot work on 128 bit values atomically in Rust.

Some CPUs have instructions that could implement it, but you'd then need a fallback to a Mutex<u128> or similar for CPUs that don't have an AtomicU128.

1 Like

Thank you for your reply @farnz

In your opinion, is it worthwhile to reach for atomic_traits to implement the compile-time-determined AtomicU* in my users' crates? Or maybe I don't need the crate if all I need is probably compare_exchange and fetch_add.

Also, it seems very difficult to support u128 in the non-sync version and, simultaneously, an atomic version of it in the sync version; I guess, as you suggested, I might have to reach for Mutex<u128> for the + functionality.

Anyway, is there any way for conditional compilation to detect CPUs that don't support atomics of a certain size, so that a fallback to mutexes can be used in those scenarios?

Thank you.

You can use the target_has_atomic conditional-compilation flag for this. The easiest way to make this work is probably to add associated functions to the AsAtomic trait for the operations that you need, and then implement it twice for each integer type:

(untested)

use std::sync::Mutex;
use std::sync::atomic::{AtomicU64, Ordering};

#[cfg(target_has_atomic = "64")]
impl AsAtomic for u64 {
    type Output = AtomicU64;
    fn replace(target: &Self::Output, x: Self) -> Self {
        target.swap(x, Ordering::SeqCst)
    }

    // and so on...
}

#[cfg(not(target_has_atomic = "64"))]
impl AsAtomic for u64 {
    type Output = Mutex<u64>;
    fn replace(target: &Self::Output, x: Self) -> Self {
        std::mem::replace(&mut target.lock().unwrap(), x)
    }

    // and so on...
}
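For completeness, a minimal sketch of the trait definition these impls assume (the exact method list is up to you; include whichever operations RTSyncSecret actually needs):

trait AsAtomic: Sized {
    type Output;
    // Swap in a new counter value, returning the old one.
    fn replace(target: &Self::Output, x: Self) -> Self;
    // ...plus whichever other operations (fetch_add, compare_exchange, etc.) you need.
}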
1 Like

The other thing to think about is whether u128 is worth supporting at all.

First, we start with a couple of useful approximations: 2^10 is approximately equal to 10^3. One year is roughly 2^25 seconds. A clock tick at 1 GHz takes 1 nanosecond; you can look up the clock speed and "peak instructions per clock" for a CPU to determine how much work you can do with that CPU in 1 nanosecond.

A program accessing the secret once every nanosecond is accessing the secret 10^9 times every second, which approximates to 2^30. With a u32 permitted-accesses counter, this hypothetical program runs out of accesses in 2^2 seconds - or 4 seconds. With a u64 permitted-accesses counter, it runs out in 2^34 seconds, which is approximately 2^9 years, or 500 years.

Given those approximations, is there a use case for a larger counter (u128), or is u64 "so big it'll be someone else's problem" if it runs out? For example, if you expect to access the secret 1,000 times per nanosecond, you might need a larger counter (but then you come into the domain of "is my CPU fast enough that this is plausible?"); if you expect to access once a microsecond (1,000 nanoseconds), then a 64 bit counter is now worth 500,000 years worth of accesses, making a u128 harder to justify.
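As a quick sanity check of those numbers (assuming one access per nanosecond):

fn main() {
    let accesses_per_second: u128 = 1_000_000_000; // one access per nanosecond
    let seconds_per_year: u128 = 60 * 60 * 24 * 365; // ~3.15e7, roughly 2^25

    let u32_exhausted_after_secs = u32::MAX as u128 / accesses_per_second;
    let u64_exhausted_after_years =
        u64::MAX as u128 / accesses_per_second / seconds_per_year;

    println!("u32 counter lasts ~{u32_exhausted_after_secs} s");      // ~4 s
    println!("u64 counter lasts ~{u64_exhausted_after_years} years"); // ~584 years
}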

4 Likes

Good day to you @farnz and @2e71828

First, @2e71828, I will take up your (and @farnz's) idea of using Mutex on architectures that do not support atomics.

Second, @farnz, I owe you a lot on this. Thank you for taking the time to think this through thoroughly and even do the calculations. I completely agree with you and will drop support for u128 in both the sync and non-sync versions of the runtime Secret.

I will also add both of you to the Credits section of the repo's README.md.

1 Like

Hi @farnz and @2e71828

Just to be absolutely sure, since I am very new to implementing stuff across different architectures: is it usually sufficient to implement for target_pointer_width = "32" and "64"? Or is it necessary to consider 'weird' widths like 7, 31, etc.?

1 Like

No problem; I've tried to explain my calculations clearly so that you can repeat the process I went through for "why not u128" in your later work. If anything I did is unclear, please do ask; these sorts of calculations (which you can do in your head if you know enough usable approximations) are useful tools to avoid spending too much time on low value things.

You should only care about widths that you know you're going to support. 32 and 64 matter because you know you'll be supporting at least one of x86, RISC-V or ARM, where 8-, 16- and 32-bit atomics always exist and 64-bit atomics definitely exist on the 64-bit variants. Until you know you're dealing with a machine that has 7- or 31-bit atomics, don't bother adding support for them. Basically, don't add code for a case that you're not yet confident you want to handle.

3 Likes