Fast increment of a pointer


#1

I am trying to put the pedal to the metal with Rust. I increment a pointer like so:

self.begin = self.begin.offset(1);

The disassembly comes up with:

lea rcx, qword[rcx + 1]

What I want is

inc rcx

I am no expert in assembly, so can anyone tell me if the second is faster and, if so, how to achieve it in (unsafe!) Rust?

Steve


#2

It looks like rcx already holds the value of the begin field, loaded from your self type earlier. Note that lea doesn’t access memory: lea rcx, qword[rcx + 1] only computes rcx + 1 and writes the result back to rcx, so for the register itself the result is the same as inc rcx. If begin lived in memory rather than in a register, inc could also take a memory operand instead. The difference is in the side effects: inc modifies the arithmetic flags (all except the carry flag), whereas lea leaves the flags untouched, so lea can avoid partial flag stalls. On some CPUs lea may also be executed by the address generation unit, rather than the ALU like inc. It’s hard to say much more than that without seeing the surrounding code.

Compilers will use either instruction as they see fit, depending on register allocation, nearby code, the target CPU, etc.
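
For reference, here is a minimal sketch of the kind of code being discussed. The original struct isn't shown, so the `Cursor` type, its `end` field and the `next_byte` method are assumptions made up for illustration; only the `begin` field and the one-element increment come from the question. Whether the increment becomes `lea`, `inc` or `add` is up to the compiler, and the source looks the same either way.

```rust
/// Hypothetical cursor over a byte slice. Only `begin` comes from the
/// question; `end` and the methods are assumed for illustration.
/// The caller must keep the underlying slice alive while the cursor is used.
struct Cursor {
    begin: *const u8,
    end: *const u8,
}

impl Cursor {
    fn new(data: &[u8]) -> Self {
        let range = data.as_ptr_range();
        Cursor { begin: range.start, end: range.end }
    }

    /// Reads the current byte and advances `begin` by one element.
    fn next_byte(&mut self) -> Option<u8> {
        if self.begin == self.end {
            return None;
        }
        // SAFETY: `begin != end`, so `begin` points at a live element of the
        // slice, and `begin + 1` is at most one past the end of the slice.
        unsafe {
            let value = *self.begin;
            // Equivalent to `self.begin.offset(1)` for a forward step.
            self.begin = self.begin.add(1);
            Some(value)
        }
    }
}

fn main() {
    let data = [10u8, 20, 30];
    let mut cursor = Cursor::new(&data);
    let mut total = 0u32;
    while let Some(byte) = cursor.next_byte() {
        total += u32::from(byte);
    }
    println!("sum of bytes: {total}");
}
```

Comparing the disassembly of a release build with and without a specific target CPU is probably the easiest way to see which instruction gets picked.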

I doubt you’re really bottlenecked by this though? :slight_smile:


#3

Thanks for the expert response. Yeah, it’s the inner loop, so it’s important. Fiddling with data structures improved nothing; however, I manually unrolled the loop into 4 segments and got a 50% speedup.
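
The actual loop isn't shown in this thread, so the following is only a rough sketch of what unrolling by 4 can look like, using a made-up byte-summing loop as a stand-in. The function name `sum_unrolled` and the workload are assumptions; the sketch also uses safe slices rather than raw pointers.

```rust
/// Sums the bytes of `data` with the loop body manually unrolled four times.
/// Hypothetical stand-in for the real inner loop, which is not shown.
fn sum_unrolled(data: &[u8]) -> u64 {
    // Four independent accumulators, so the additions within one iteration
    // do not depend on each other.
    let mut acc = [0u64; 4];

    let mut chunks = data.chunks_exact(4);
    for chunk in &mut chunks {
        acc[0] += u64::from(chunk[0]);
        acc[1] += u64::from(chunk[1]);
        acc[2] += u64::from(chunk[2]);
        acc[3] += u64::from(chunk[3]);
    }

    // Handle the 0..=3 leftover elements that did not fill a whole chunk.
    let mut total: u64 = acc.iter().sum();
    for &byte in chunks.remainder() {
        total += u64::from(byte);
    }
    total
}

fn main() {
    let data: Vec<u8> = (0..=255).collect();
    println!("sum: {}", sum_unrolled(&data));
}
```

Any speedup from a change like this depends heavily on the loop body and the CPU, so it is worth measuring rather than assuming.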


#4

It doesn’t matter. Strictly speaking, LEA can be executed on fewer execution units than INC on certain CPUs. For instance, on Haswell LEA can be executed on two execution ports (1, 5), while INC can be executed on four (0, 1, 5, 6). That said, the compiler should be careful enough not to emit LEA where it would cause extra latency that INC wouldn’t. Performance is identical on pretty much every 64-bit CPU, including the oldest ones.

On certain CPUs, like Intel Atom or AMD K8/K10, LEA is executed in the AGU instead of the ALU. On Intel Atom, this means LEA can actually be faster than INC when surrounded by arithmetic instructions: because LEA doesn’t occupy the ALU responsible for integer calculations, it can be executed in parallel with other arithmetic instructions.

By default, the Rust compiler for the x86_64 platform generates code for a generic 64-bit x86 CPU, which includes Intel Atom, which is why it balances usage of the ALU and AGU. There is no real cost in doing so, and it makes code faster on those CPUs.

You can try compiling the code specifically for the CPU you are using with the -C target-cpu=native option, as opposed to optimizing for all CPUs that exist. Note that the resulting binary may not work on other CPUs, as it may use features implemented only by your CPU.
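
If you want to check what the target CPU setting actually changed, one option is to compare what the compiler was allowed to assume with what the running CPU reports. This is just a sketch: AVX2 is picked as an arbitrary example feature, and the helper names `built_with_avx2` and `cpu_has_avx2` are made up here.

```rust
/// True if the compiler was allowed to assume AVX2 when building this crate
/// (for example because a newer target CPU was selected at compile time).
fn built_with_avx2() -> bool {
    cfg!(target_feature = "avx2")
}

/// True if the CPU this program is running on reports AVX2 support.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn cpu_has_avx2() -> bool {
    std::is_x86_feature_detected!("avx2")
}

/// Fallback for non-x86 targets, where AVX2 does not exist.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
fn cpu_has_avx2() -> bool {
    false
}

fn main() {
    println!("built with AVX2 enabled: {}", built_with_avx2());
    println!("running CPU has AVX2:    {}", cpu_has_avx2());
}
```

With the default generic x86_64 target the first line should normally print false even on a CPU that supports AVX2, which is the portability trade-off described above.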


#5

The compiler probably knows what it’s doing; LEA can often be faster than INC/ADD. Relevant Stack Overflow question: