Tuples inside an Array

I have an app in which one method creates an array of UInt64 values that form pairs, stored consecutively in the array. This array is then passed to another method, which reads the values, updates them, and stores them back into the array.

For example, if the array has n elements (always known beforehand), say the values represent left/right pairs (l, r). I store them consecutively as ary[0] = l1, ary[1] = r1, ary[2] = l2, ary[3] = r2, etc. In the method that uses them, they are read out, and put back, in the same order.

But this requires 2 writes to store each pair, and when the pairs are used, it requires 2 reads to get the values and 2 writes to put back the updated values.
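For concreteness, here is a minimal Rust sketch of that interleaved layout (the function and variable names are illustrative, not from my actual code):

fn store_pair(ary: &mut [u64], i: usize, l: u64, r: u64) {
    ary[2 * i] = l;     // first write
    ary[2 * i + 1] = r; // second write
}

fn load_pair(ary: &[u64], i: usize) -> (u64, u64) {
    (ary[2 * i], ary[2 * i + 1]) // two reads
}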

So I was thinking (ominous practice) that maybe it would be faster and/or more memory efficient to store each pair as a tuple {l, r} in a single array slot. Then, when I use them, I can do just one memory read and write, like this (this would also make the code conceptually simpler to read and understand, because it would be clearer that these two values form a pair that is always processed together):

l, r = ary[i]

......

ary[i] = {l', r'}

and save a memory read and write.
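In Rust terms, the tuple version would look something like this (a sketch; the + 1 updates are just stand-ins for the real computation):

fn update_pair(ary: &mut [(u64, u64)], i: usize) {
    let (l, r) = ary[i];     // one indexed read gets the whole pair
    ary[i] = (l + 1, r + 1); // one indexed write puts both back
}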

So... is this possible the way I described, and would it really be of substantial benefit in terms of performance or memory efficiency? (I DO NOT want to create a 2D array.)

Consider what would happen if you did store them as an array of tuples.
First you access the tuple, then you read/write either l or r from/to that tuple. That's 2 reads/writes for just 1 value¹, and 3 reads/writes for one of each.

If you stored the values in 2 separate arrays, an l array and an r array, you would read/write from/to the l array for the first value, then the r array for the second. That's also 2 reads/writes.
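As a sketch, that two-array (structure-of-arrays) layout would look like this (names invented for illustration):

fn update_pair_soa(ls: &mut [u64], rs: &mut [u64], i: usize) {
    ls[i] += 1; // one read/write on the l array
    rs[i] += 1; // one read/write on the r array
}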

And your original interleaved array is likewise 2 reads/writes. But I wouldn't recommend that representation if the two values in a pair are conceptually different; e.g. a coarse-grained temperature measurement and a pressure measurement in mbar should be newtyped instead, for type safety reasons.
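A sketch of that newtype approach, with Celsius and Millibar as invented names:

#[derive(Clone, Copy)]
struct Celsius(u64);
#[derive(Clone, Copy)]
struct Millibar(u64);

// Swapping the two fields anywhere is now a compile error, whereas with a
// plain (u64, u64) it would compile and silently misbehave.
fn record(readings: &mut Vec<(Celsius, Millibar)>, t: Celsius, p: Millibar) {
    readings.push((t, p));
}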

¹ I'm purposely ignoring any possible compiler optimizations and/or caching behavior, for the sake of simplicity. Such an optimization could be, e.g., that the tuple access is optimized out entirely.

It looks like both versions do two separate memory reads: Compiler Explorer

However, the version with tuples does generate slightly less machine code, because it does only one bounds check instead of two. (I tried doing things like pre-slicing the array or reversing the order of the reads to fix this, but I couldn't find a way that works without unsafe.)
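For comparison, a plausible non-tuple counterpart (the thread doesn't show its exact code, so this is a reconstruction): each indexing operation carries its own bounds check, which accounts for the extra machine code:

pub fn put1(v: &mut [u64], i: usize) {
    v[2 * i] += 1;     // first bounds check
    v[2 * i + 1] += 1; // second bounds check
}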


For some access patterns, the tuple version does manage to use SIMD registers to update both locations at once, while the bounds checks in the non-tuple version prevent this: Compiler Explorer

Rust code:

pub fn put2(v: &mut [(u64, u64)], i: usize) {
    // One indexing operation, so only one bounds check for the whole pair.
    let (l, r) = &mut v[i];
    *l += 1;
    *r += 1;
}

Assembly:

example::put2:
        push    rax
        cmp     rdx, rsi
        jae     .LBB1_2
        shl     rdx, 4
        movdqu  xmm0, xmmword ptr [rdi + rdx]
        pcmpeqd xmm1, xmm1
        psubq   xmm0, xmm1
        movdqu  xmmword ptr [rdi + rdx], xmm0
        pop     rax
        ret
.LBB1_2:
        lea     rax, [rip + .L__unnamed_3]
        mov     rdi, rdx
        mov     rdx, rax
        call    qword ptr [rip + core::panicking::panic_bounds_check@GOTPCREL]
        ud2

Do these SIMD optimizations also take effect on other platforms, e.g. the various WASM targets?

It doesn't appear to generate SIMD code on wasm32.

On armv7 (I tried both with and without NEON enabled), the generated code looks like this. It doesn't use SIMD arithmetic, but it does use some instructions like stm to combine reads and writes into fewer instructions than the non-tuple version:

example::put2:
        push    {r11, lr}
        cmp     r2, r1
        bhs     .LBB1_2
        ldr     r1, [r0, r2, lsl #4]!
        ldmib   r0, {r2, r3, r12}
        adds    r1, r1, #1
        adc     r2, r2, #0
        adds    r3, r3, #1
        adc     r12, r12, #0
        stm     r0, {r1, r2, r3, r12}
        pop     {r11, pc}
.LBB1_2:
        ldr     r3, .LCPI1_0
        mov     r0, r2
.LPC1_0:
        add     r3, pc, r3
        mov     r2, r3
        bl      core::panicking::panic_bounds_check
        .inst   0xe7ffdefe
.LCPI1_0:
        .long   .L__unnamed_3-(.LPC1_0+8)

On aarch64, the generated code looks like this. I don't really know aarch64 assembly and am not certain if this is using SIMD registers/instructions, but I think it is?

example::put2:
        str     x30, [sp, #-16]!
        cmp     x2, x1
        b.hs    .LBB1_2
        lsl     x8, x2, #4
        ldr     q0, [x0, x8]
        mov     w9, #1
        dup     v1.2d, x9
        add     v0.2d, v0.2d, v1.2d
        str     q0, [x0, x8]
        ldr     x30, [sp], #16
        ret
.LBB1_2:
        adrp    x8, .L__unnamed_3
        add     x8, x8, :lo12:.L__unnamed_3
        mov     x0, x2
        mov     x2, x8
        bl      core::panicking::panic_bounds_check
        brk     #0x1

I forgot to pass the -C target-feature=+simd128 flag. With the feature enabled, it does use SIMD on wasm32: Compiler Explorer

example::put2:
        block           
        local.get       2
        local.get       1
        i32.ge_u
        br_if           0
        local.get       0
        local.get       2
        i32.const       4
        i32.shl 
        i32.add 
        local.tee       2
        local.get       2
        v128.load       0:p2align=3
        v128.const      1, 1
        i64x2.add
        v128.store      0:p2align=3
        return
        end_block
        local.get       2
        local.get       1
        i32.const       .L__unnamed_3
        call    core::panicking::panic_bounds_check
        unreachable
        end_function
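For reference, one way to pass that feature to rustc directly (the exact invocation here is an assumption; the thread used Compiler Explorer):

rustc --target wasm32-unknown-unknown -C opt-level=3 -C target-feature=+simd128 --crate-type=lib put2.rs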

System: PCLinuxOS, kernel 5.13.10, Rust 1.54.

My laptop specs:
System76, Intel i7-6700HQ (Skylake, 14nm) CPU, 2.6-3.5 GHz clock, 4C|8T, and 16GB mem.
This was bought in 2016 (5 yrs old now).

After making the requisite changes where needed, the performance is the same, and the (stripped) binary is also the same size for both versions. But the source code is a couple of lines shorter, and a little bit easier to read|understand (IMHO). I would be interested to see if it would be different on a newer processor.

So if nothing else, I learned a new|different option to put in my toolbox.

Thanx to all!
