Best type alias for uXX type corresponding to usize

I need to do something like

#[cfg(target_pointer_width = "64")]
pub type PlatformWord = u64;
#[cfg(not(target_pointer_width = "64"))]
pub type PlatformWord = u32;

because I need a uXX type identical in size to usize. usize does not participate in Into/From conversions except for trivial cases, and I need to write trait bounds like usize: Into<u32>.
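As a minimal sketch of what the alias buys, assuming only 32- and 64-bit targets as in the snippet above: the alias is an ordinary fixed-width integer, so it can appear in Into/From bounds, and a cast to usize is lossless because the widths match. The helper names `widen` and `to_index` are hypothetical and exist only for illustration.

```rust
#[cfg(target_pointer_width = "64")]
pub type PlatformWord = u64;
#[cfg(not(target_pointer_width = "64"))]
pub type PlatformWord = u32;

// Compile-time check of the assumption: the alias is as wide as usize.
// This fails to compile on 16-bit targets, which the sketch does not support.
const _: () = assert!(std::mem::size_of::<PlatformWord>() == std::mem::size_of::<usize>());

/// Hypothetical helper: accepts any type that converts losslessly into PlatformWord.
/// The same bound written against usize would reject u32, since usize only
/// implements From for bool, u8 and u16.
pub fn widen<T: Into<PlatformWord>>(value: T) -> PlatformWord {
    value.into()
}

/// Hypothetical helper: a same-width cast, so no bits are lost.
pub fn to_index(word: PlatformWord) -> usize {
    word as usize
}
```

With this in place, `widen(7u32)` compiles on both 32- and 64-bit targets.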

The (debatable) logic is that an integer of target_pointer_width bits is a "natural" type for the architecture.

Possibilities are: PlatformWord, TargetWord, ArchWord, HostWord, but I'm leaning recently towards Usize.

I'm sure someone else had the same problem, so I'd love to hear suggestions and comments. This is just cosmetic, of course.

EDIT: I forgot to say that in the intended usage, values of type PlatformWord are basically never indices. So in an ideal world there would be a best integer type for a platform, with size possibly different from usize, and I'd use that.

NativeWord is another contender.

UPDATE: I found another motivation that has nothing to do with the “best word” angle that caused so much controversy.

In sux, we methodically delegate through structures all traits that can be delegated. If you put a rank structure around a bit vector, you'll still get all the bit vector's read-only access by delegation, besides the ranking methods. The traits are many, and delegations (courtesy of ambassador) are complicated.

Some structures work both for u32 and u64, and so they get twice the delegations. In some cases, I'd like to use usize for these structures because it scales with the platform, and in those cases it's reasonable because the structures are related to indexing (e.g., the selection structure on an Elias–Fano representation), and usize is for indices.

If I don't have NativeWord, I get thrice the delegations because I have to delegate for usize, too. That's a very good reason to have a type alias like the one I'm asking for.

Yes, you can cfg your default choice into the structure, but then every crate using it has to cfg their usage of the structure if they ever have to write an explicit type, which is a nightmare.

The idea of “word”, or there being an “ideal integer type for the platform” at all, is largely obsolete.

In today’s hardware, conversions between integer sizes are cheap (single cycle instructions, most likely) and what is expensive is accessing memory — the round-trip time taken for the response to a read request to arrive at the CPU, and how much of your data can fit in CPU cache before some of it has to be evicted.

Therefore, in many cases, you should design your data structures to use the integer size that is as big as needed to fit your data, and no bigger, so that more data can fit in the same amount of cache or transfer time.
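To make the footprint argument concrete, here is a small sketch of the arithmetic. The helper name `payload_bytes` is hypothetical, and the figure ignores allocator overhead and any padding inside larger structures.

```rust
use std::mem::size_of;

/// Hypothetical helper: bytes needed to store `len` elements of type T contiguously.
pub fn payload_bytes<T>(len: usize) -> usize {
    len * size_of::<T>()
}
```

For one million values that all fit in 32 bits, `payload_bytes::<u32>(1_000_000)` is 4,000,000 bytes, while `payload_bytes::<u64>(1_000_000)` is 8,000,000 bytes, so the narrower type keeps twice as many elements in the same amount of cache.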

There are exceptions where somewhat larger is better, such as ensuring alignment for SIMD operations, or fitting a structure to the size of a cache line to avoid false sharing in multithreaded programs, but these are specific situations where you design a specific data structure (and verify the results with benchmarks and other testing), not cases where just using the “platform word” will get you better results.

Therefore, you should probably not define this type alias, because it will not actually help you.

Well, that depends on the type of computation—memory-intensive or computation-intensive. It's a bit of a generalization.

If what you say were true, we'd all use u128, and that's not happening.

For example, if you have to allocate large sequences, size becomes a problem. At that point you have two forces: larger words contain larger values, usually necessary for larger structures (e.g., counters), but larger structures need smaller components to not exceed the core memory.

Resources are not infinite; size matters. Hence a suggested "reasonable middle", lacking more information, will always be a good idea. It is a natural way to scale things.

In fact, I'd rather have a native-word crate with cfg for various architectures in which the word is not necessarily the same as usize.

For local variables (that wouldn't ever go into memory) isize/usize is the best choice.

You're advocating exactly a "native word"; you just don't realize it.

So, back to my question: I want to use isize / usize (your idea—not mine). But I need to have them so that I can use the Into/From framework to write bounds for conversions.

What name would you use?

PS: I know there's the az crate with strict conversions. It's just too much for what I need to do.

I don’t agree with @khimru that “for local variables (that wouldn't ever go into memory) isize/usize is the best choice”. In my experience of micro-optimization, choices at the “local variables” level often have no effect, or the opposite effect from what you might expect. You should always either:

  • pick a type that is adequate to contain the data and call it done, or
  • measure which choice has better performance.

General principles often don’t hold up when you actually benchmark.[1]

Choosing usize is certainly the right answer when “the data” that you need to hold is “any possible slice index”, but it is not the right answer for performance. Trying and measuring is the right answer for performance.


  1. And your benchmark is not necessarily measuring what you wanted it to, either. This stuff is not simple. ↩︎

I mean, you can and it's often fine.

Take something like this, where you have all the local variables as u128 and do the * and + in u128:

pub fn demo(z: &mut u32, a: &u32, b: &u32, c: &u32) {
    let a = *a as u128;
    let b = *b as u128;
    let c = *c as u128;

    *z = (a * b + c) as u32;
}

If you look in LLVM (https://rust.godbolt.org/z/5aMKEa9bq), it shrinks the intermediate result anyway:

define void @demo(ptr noalias noundef writeonly align 4 captures(none) dereferenceable(4) initializes((0, 4)) %z, ptr noalias noundef readonly align 4 captures(none) dereferenceable(4) %a, ptr noalias noundef readonly align 4 captures(none) dereferenceable(4) %b, ptr noalias noundef readonly align 4 captures(none) dereferenceable(4) %c) unnamed_addr {
start:
  %_6 = load i32, ptr %a, align 4
  %_8 = load i32, ptr %b, align 4
  %_10 = load i32, ptr %c, align 4
  %_12 = mul i32 %_8, %_6
  %_11 = add i32 %_12, %_10
  store i32 %_11, ptr %z, align 4
  ret void
}

So use whatever type you want in locals; it tends to not actually matter.

I dunno about ARM, but for x86-64, 32- and 64-bit values are mostly identical. Maybe you need an extra sign-extending instruction going from 32 to 64, maybe the slightly longer encoding most 64-bit operations have reduces your instruction cache hit rate... but I really doubt it. Heck, even that longer encoding (an added REX byte) was probably already being used to access the upper 8 registers, so even that doesn't really matter.

It's overwhelmingly likely that data cache is your bottleneck, and packing more data in is going to matter far more than any "native word size", to the point that you could probably run a dozen instructions to unpack each value and come out ahead in many common cases.

However;

this is something that does kind of annoy me: it would be good to be able to declare that my code doesn't work on 16-bit architectures, so that there's an infallible From<u32> implementation for usize, like the one for u16. For now, something like .try_into().expect("requires at least 32-bit pointer size") should already be optimized to a no-op.
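Spelled out, that workaround could look like the sketch below. The function name `u32_to_usize` is hypothetical; the conversion itself uses only the standard TryInto trait.

```rust
// Needed only before the 2021 edition, where TryInto is not yet in the prelude.
use std::convert::TryInto;

/// Hypothetical helper: converts a u32 to usize.
/// Panics on targets whose usize is narrower than 32 bits if the value does not fit.
pub fn u32_to_usize(value: u32) -> usize {
    value
        .try_into()
        .expect("requires at least 32-bit pointer size")
}
```

On 32- and 64-bit targets the conversion can never fail, so the check is expected to be compiled away, though that is an optimization rather than a language guarantee.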

I've just asked a personal opinion on a name... is anybody going to answer instead of dispensing unrequested knowledge? :joy:

Let's rephrase it: what would you call usize/isize if you couldn't call them usize/isize?

uptr/iptr or uaddr/iaddr, something like that.

Given that u/isize are defined as:

The pointer-sized (un)signed integer type.

It seems sensible to reference that, rather than any concept of "machine word size".

That said, you then have a strong question of what the actual intent is, given you say it's not intended for indices (the natural usage of a pointer-sized integer).

Is there anything more semantic behind:

because I need a uXX type identical in size to usize.

that you could use to describe this type, given that it's not really anything to do with "machine word"?

You then say:

So it's very vague what you're looking for.

At the moment the closest I can figure is something like stdint's int_fast{N}_t which are defined by cppreference as:

fastest signed integer type with width of
at least 8, 16, 32 and 64 bits respectively

which leads to this internals thread:

In Fortran there is integer, which defaults to 4-byte integer but may also be other size depending on platform or compiler options. So why not simply int (so, iint and uint after prefixing)?

Language history had gone away from this, though, and Fortran has evolved to be more explicit on requested maximum representable integer values.

Ok, but it's a type alias. Types are usually camel case in Rust. That'd be Uptr/Iptr or Uaddr/Iaddr. Not very nice visually. The starting "I" looks like an "l" in many fonts. Well, they're indistinguishable in the font I'm using now.

Once again, I'm not arguing for or against the idea of a "best machine word".

All I need is a nice type alias for the fixed-width integer of the same size as usize/isize, because usize/isize do not participate in Into/From impls, except for trivial cases.

Seriously, I'm going with Usize.

To be even clearer—I want to be able to write something like

PrimitiveNumberAs<WhateverWillItBe> + Into<WhateverWillItBe>

using num-primitive, to be able to use PrimitiveNumberAs::as_to::<WhateverWillItBe> (possibly followed by an as usize) with no precision loss.

Sure. I was only suggesting a name that carried a similar meaning in at least one other language, since from previous replies we have no obvious name with this meaning in rust.

I would argue that using integers in data structures that depend on the pointer size of the architecture is a design flaw.

That said, if you regularly need to convert arbitrary (unsigned) integers into usize, a simple trait gets the job done safely:

    pub trait ToUsize {
        fn to_usize(&self) -> usize;
    }

    macro_rules! to_usize {
        ($type:ident) => {
            impl ToUsize for $type {
                fn to_usize(&self) -> usize {
                    *self as usize
                }
            }
        };
    }

    to_usize!(usize);

    #[cfg(target_pointer_width = "64")]
    to_usize!(u64);

    #[cfg(any(target_pointer_width = "64", target_pointer_width = "32"))]
    to_usize!(u32);

    to_usize!(u16);
    to_usize!(u8);
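A generic function can then take the trait as a bound. The sketch below repeats a compact version of the trait so that it compiles on its own; the function `get_at` is hypothetical and only shows one possible use.

```rust
pub trait ToUsize {
    fn to_usize(&self) -> usize;
}

// Same idea as the macro above, accepting a comma-separated list of types.
macro_rules! to_usize {
    ($($type:ident),*) => {
        $(impl ToUsize for $type {
            fn to_usize(&self) -> usize {
                // Lossless: each impl exists only where the type fits in usize.
                *self as usize
            }
        })*
    };
}

to_usize!(usize, u16, u8);
#[cfg(any(target_pointer_width = "64", target_pointer_width = "32"))]
to_usize!(u32);
#[cfg(target_pointer_width = "64")]
to_usize!(u64);

/// Hypothetical helper: indexes a slice with any integer that fits in usize.
pub fn get_at<T, I: ToUsize>(items: &[T], index: I) -> Option<&T> {
    items.get(index.to_usize())
}
```

Because the impls are gated by cfg, calling `get_at` with a u64 index simply fails to compile on a 32-bit target instead of truncating silently.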

I have also wanted this. Example use case: bignum arithmetic. You want to operate on maximum words that have hardware addition, multiplication, division. Using smaller or larger words would be less efficient.
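As a rough sketch of that use case, here is multi-limb addition with carry propagation. The `Limb` alias and the function `add_limbs` are hypothetical, and fixing the limb at u64 is an assumption made for illustration; choosing the limb width per platform is exactly the open question.

```rust
/// Hypothetical limb type; u64 is an assumption, not a recommendation.
pub type Limb = u64;

/// Adds two little-endian limb slices of equal length.
/// Returns the sum limbs and the carry out of the most significant limb.
pub fn add_limbs(a: &[Limb], b: &[Limb]) -> (Vec<Limb>, bool) {
    assert_eq!(a.len(), b.len(), "operands must have the same number of limbs");
    let mut out = Vec::with_capacity(a.len());
    let mut carry = false;
    for (&x, &y) in a.iter().zip(b) {
        // At most one of the two additions can overflow for a given limb.
        let (partial, overflow_1) = x.overflowing_add(y);
        let (sum, overflow_2) = partial.overflowing_add(carry as Limb);
        out.push(sum);
        carry = overflow_1 || overflow_2;
    }
    (out, carry)
}
```

Halving the limb width doubles the number of loop iterations for the same operand, which is where the preference for the widest limb with hardware support comes from.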

That may sometimes be true, but often it is possible to organize your computations in a cache-efficient way so that memory access is not the bottleneck.

After re-reading the thread I see you are just asking for a word.
Sorry for the late response, I got confused by all the responses.

"Usize" is wrong in my mind, because it confused me. I looked it up in the docs and could not find it in the reference with that capitalization, but did find something in std::simd that had a capital Usize, and that sent me sideways. Then I realized it was a word you made up.
I vote for NativeWord because it is obvious and sticks out. You can probably alias it to "Usize" locally with a "use as" line if you want.

Haven't we discussed that already? What would you do with a modern 32-bit x86 platform, where 64-bit addition and multiplication exist (via paddq/pmullq), while division doesn't exist?

Would you want it to be 32-bit or 64-bit? What about a less modern platform where addition is 64-bit, but 64-bit multiplication is not available?

You either accept some simple rule that provides you with a good, but not perfect, result or you need to benchmark every little thingie… in both cases “machine word” doesn't give you good estimates.