Measuring CPU clock cycles for a single instruction

Hey everyone!

I would like to measure the time, in CPU clock cycles, it takes to execute a single instruction. Something along the lines of this pseudocode:

pre = read_current_clock_cycle_counter();

my_instruction();

post = read_current_clock_cycle_counter();

print(post - pre);

How would I do this in Rust? I already found the tsc-timer crate but this only measures reference cycles, not actual CPU clock cycles.

C example

I know that in C you can just use the rdtsc (or rdtscp on newer CPU models) instruction like this:

static inline uint64_t bench_start(void)
{
    unsigned cycles_low, cycles_high;
    asm volatile("CPUID\n\t"
        "RDTSCP\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        : "=r" (cycles_high), "=r" (cycles_low)
        ::"%rax", "%rbx", "%rcx", "%rdx");

    return (uint64_t) cycles_high << 32 | cycles_low;
}

static inline uint64_t bench_end(void)
{
     unsigned cycles_low, cycles_high;
     asm volatile("RDTSCP\n\t"
         "mov %%edx, %0\n\t"
         "mov %%eax, %1\n\t"
         "CPUID\n\t"
         : "=r" (cycles_high), "=r" (cycles_low)
         ::"%rax", "%rbx", "%rcx", "%rdx");
     return (uint64_t) cycles_high << 32 | cycles_low;
}

Well, I mean, inline asm is stable in Rust afaik. So can't you just do the same thing?
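Something like this ought to work with stable asm!, though treat it as a sketch rather than a drop-in port of the C snippet: it reads the TSC with rdtscp and uses lfence instead of cpuid as the serializing step, and the function name is mine, not from the C code.

use std::arch::asm;

// Read the TSC after the measured code: rdtscp waits for earlier instructions
// to finish, and the trailing lfence keeps later instructions from starting
// before the read completes.
fn bench_end() -> u64 {
    let lo: u32;
    let hi: u32;
    unsafe {
        asm!(
            "rdtscp",           // EDX:EAX = TSC, ECX = IA32_TSC_AUX
            "lfence",
            out("eax") lo,
            out("edx") hi,
            out("ecx") _,       // discard the processor-id value rdtscp writes
            options(nomem, nostack),
        );
    }
    ((hi as u64) << 32) | (lo as u64)
}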

There is a crate which does this, but I'm not sure it does it properly either.

You can't really measure a single instruction well, so run it in a loop and either have another thread stop it or break out after so many iterations.

RDTSC will give you clock ticks (kinda), but you want to serialize before it, so people used to put CPUID in front of it; on more modern archs there is RDTSCP. So you want to do something like:

x = RDTSCP
loop {
    TIMETHIS
    n += 1
}
y = RDTSC (no P)

And keep a counter. This has minimal overhead, but you also need to black-box the result so the compiler doesn't optimize it out. And timing single instructions isn't useful anyway; they depend too much on context, so TIMETHIS needs to be a little more substantive. Overall this is just really hard to do well, but I'm sure there's a perf crate somewhere that has, or might have, done it properly?
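Roughly, in Rust, that loop might look like this, using the intrinsics plus std::hint::black_box (the measure/work names are placeholders, and this uses lfence rather than cpuid to keep the final read from drifting):

use core::arch::x86_64::{__rdtscp, _mm_lfence, _rdtsc};
use std::hint::black_box;

// Serialize, read the TSC, run the body many times through black_box so it
// isn't optimized away, read the TSC again, and average.
fn measure<F: FnMut() -> u64>(mut work: F, iterations: u64) -> u64 {
    let mut aux = 0u32;
    unsafe {
        let pre = __rdtscp(&mut aux);   // waits for earlier instructions to finish
        for _ in 0..iterations {
            black_box(work());          // keep the result live so the loop isn't removed
        }
        _mm_lfence();                   // fence before the final, non-serializing read
        let post = _rdtsc();
        (post - pre) / iterations
    }
}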

While researching a bit more I found out that we have access to _rdtsc() through core::arch::x86_64.
Given this, and the formula Intel gives in this paper for calculating clock cycles:

# of cycles = (TSC_post - TSC_pre) / iterations

I should be able to compute the clock cycles of a single instruction like this:

let pre = unsafe { core::arch::x86_64::_rdtsc() };
for _i in 0..1_000_000 {
    my_instruction();
}
let post = unsafe { core::arch::x86_64::_rdtsc() };
let cycles = (post - pre) / 1_000_000;

Or did I miss something?

On modern processors, this is essentially not a well-posed question, because they can fuse instructions, run several at the same time, do register renaming, etc.

I suggest just referencing an existing table, such as https://www.agner.org/optimize/instruction_tables.pdf

(And if you want to benchmark, use Criterion.rs to benchmark a useful unit of work.)
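For reference, a minimal Criterion benchmark for a small unit of work might look something like this (the benchmark name and the summing loop are just placeholders, not anything from this thread):

use criterion::{black_box, criterion_group, criterion_main, Criterion};

// Benchmark a small but complete unit of work instead of a lone instruction.
fn bench_sum(c: &mut Criterion) {
    c.bench_function("sum 1..=1000", |b| {
        b.iter(|| (1u64..=1000).fold(0, |acc, x| acc + black_box(x)))
    });
}

criterion_group!(benches, bench_sum);
criterion_main!(benches);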


The first rdtsc needs to be an rdtscp instruction or be preceded by a cpuid instruction. The rdtscp instruction was created for this particular use case, so you can do either one.

Benching a single instruction isn't very effective. A lot of compiler optimizations get in the way, though it's easy enough to turn those off.

The CPU issues are much harder to deal with. It will often reorder instructions to get around dependencies or hide latencies - modern Intel CPUs devote a huge amount of silicon to this and have become extremely good at it.

Also, CPU pipelines mean instructions have a minimum time in the core. Kind of. More silicon is devoted to hiding this latency too, so it might look like an instruction has completed while it still hasn't been committed yet.

All these mean your timings are meaningless.

The latency number is usually calculated as the number of cycles an instruction adds to an instruction chain. If you have a 7-cycle-deep pipeline, a 4-cycle latency means the chain will take 4 cycles more. However, an instruction's latency can vary depending on the instructions around it, and that latency can be hidden, so it really only makes sense to measure instructions in a small conceptual unit of work.
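To make that concrete, the usual trick is to build a chain where each operation depends on the previous result, so the operations cannot overlap. This is only a sketch; the add and the iteration count are illustrative.

use std::hint::black_box;

// Each add depends on the previous value of x, so the adds cannot execute in
// parallel and the loop's cycles-per-iteration approximates the add latency.
fn dependent_chain(iters: u64) -> u64 {
    let mut x = black_box(1u64);
    for _ in 0..iters {
        x = x.wrapping_add(black_box(1)); // serial dependency through x
    }
    x
}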

I'm not saying you need to do a full transaction or even a full function, but maybe a small loop body (even there the CPU can "help" you out a little and ruin your timing by caching the fetch and decode of the loop instructions if it is small enough). Or some instructions get fused with another op; this happens in both the macro-op and micro-op domains, and is seen a lot in address calculations and test/jump pairs.

The latency you care about has a context to it, so you need to measure the context too. The good part about that is you don't have to measure as much, since the CPU will hide a bunch of it. Shaving a few cycles by changing instructions, as gauged by single-instruction latency counts, might not do anything at all or might even perform worse.

I once saw someone trying to speed up a unit of code by using smaller registers and trying to do multiple tests at once. That didn't work: using the partial registers just caused a register dependency, because those partial registers aliased the larger registers in hardware. This is also why I try to never write assembly -- why he or she doesn't, I mean. "It was faster on paper" is my only defense.

So if it is still interesting to do this, make sure you do the rdtscp/cpuid+rdtsc instruction(s), try to keep as much of your loop machinery out of the benched code as possible, and run it a ton. It is useful to do this under VTune or similar and collect CPU stats to see what is happening (e.g., if you have a test to terminate the loop, you want to make sure it is always predicted). Other things like pinning to a core and removing all other work from that core can also help, but scheduling shouldn't pop up during these tiny tests. And don't bother trying to measure memory latencies; it is really hard to intentionally work around the CPU, and you wind up measuring things like how large the store-forwarding buffer is.
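Pinning can be as simple as something like the following before the timing loop. This is a sketch that assumes the core_affinity crate's get_core_ids/set_for_current API; double-check against the crate docs before relying on it.

// Pin the benchmarking thread to one core so the scheduler does not migrate it
// mid-measurement; migration adds noise to the readings.
fn pin_to_first_core() {
    if let Some(core_ids) = core_affinity::get_core_ids() {
        if let Some(first) = core_ids.into_iter().next() {
            core_affinity::set_for_current(first);
        }
    }
}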

I don't know anything about the Criterion(?) crate that people use. I haven't looked at it and don't know how it does things.

Good luck.


I'm curious as to what you are hoping to achieve by measuring the execution time of a single instruction.

On modern-day processors this hardly makes any sense anymore. A single instruction may get run in parallel with other operations thanks to the multiple dispatch available today, so it may well have zero execution time! On the other hand, if that instruction is accessing memory it may cause a cache miss, which will make it run 10, 100 or more times slower.

I'm sure there is a bunch of other things that can make execution time very variable. Gone are the days when one could simply look at the number of clocks for an instruction in the manual.

That's a little bit of an overstatement, I think. While renaming and macro fusion can do some things that come close to removing ops, they still take up a decoder slot, a reservation station entry, a register entry, and probably a few other resources. Even a correctly predicted branch will take up some resources (ignoring the BTB and branch predictor entries).

If you have a memory dependency, you will want to prime your L1 or move everything into registers first, to make sure you aren't just being dominated by memory costs.
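For example, one untimed pass over the data before the timed loop is usually enough; this sketch is purely illustrative.

// One untimed pass so the timed loop starts with the working set in L1
// instead of measuring cold cache misses.
fn warm_up(data: &[u64]) -> u64 {
    data.iter().copied().fold(0, |acc, x| acc ^ x)
}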

These micro-benchmarks are difficult to do, and some ops are close to impossible, but people still do them and get usable results. The required technical understanding can be quite high in some cases, though. There are uop instruction tables that were derived experimentally; somebody has to do the work, and it doesn't come from the CPU manufacturers.

Some people around here seem to forget it is important to encourage experimentation (and I don't mean by holding back known answers). So try to give the current best knowledge, but I would never tell somebody not to do something like this.

Only a little. Intel Skylake and AMD Zen achieve greater than 2 instructions per clock in the right circumstances. Other architectures come close; see "Fig. 2. Instruction-per-cycle (IPC) for 400.perlbench on recent x86...".

To my mind, if I need two things done and they both get done in the same time, then one of them has zero execution time.

Of course that all depends on the instruction sequence and its register dependencies being friendly to such parallel dispatch.

One could do that. But again, one is preparing the context, the situation the instruction runs in, to get that result. In other contexts, like in your actual applications, that result no longer has any useful meaning.

I'm with you. I'm all for people experimenting and finding out for themselves. I don't mean to say "don't do that", I only mean to point out the issues one might need to take into account.

My experience of such micro-benchmarking has often led to a lot of head scratching and frustration. Changing things that should not have any effect turns out to have an effect, like simply moving code or data around in memory, even if it is logically the same. Even just getting the same result twice for the exact same code is tricky.

That's not a good way to think about hardware that is incredibly parallel. It has multiples of pretty much everything. One of those instructions is still doing a lot of work that the instruction behind it could have been doing instead.

There are some optimizations that come very close to taking no resources after decode, but most still have some resource usage attached to them, and it is helpful to have a proper understanding of this if you want to be really good in a field that deals with the edges of performance - you don't have to know everything in detail, but you should know the model well enough instead of just pointing to magic faery dust as an explanation of why unintuitive things happen.

What then is the better way to think about it?

Well, if there are 3 instructions to be done and two of them get done at the same time, then three instructions have happened in 2 clocks. Does it matter which instruction is which? Sounds like 1 of them is happening in no time to me.

As far as I can tell none of us knows anything in detail. What actually happens inside these processors is not disclosed by their manufacturers.

I was not pointing to any "magic faery dust". I was pointing at the IPC claims of those manufacturers. Which as far as I can tell are supported by independent benchmarks.

To my mind a measured IPC of two already indicates half the instructions are happening "for free" as it were, even if we have no idea about the internal juggling of the processor. We can just measure the results.

I'd be very interested to hear of a better way to look at the situation.