Reference Speed

Are there statistics about how fast an array lookup or pointer lookup is in Rust? I know it depends, but I'd like a rough sense of the range and the factors it depends on.

There are some numbers related to different types of pointer accesses here: Latency Numbers Every Programmer Should Know.

5 Likes

It's the same as in C, and reduces to the speed of memory accesses on your hardware. Typically, if the address is in the L1 cache, it's fetched in a single cycle (i.e. fractions of a nanosecond). Each higher cache level roughly increases the time tenfold. So L2 cache is ~10ns, L3 is ~100ns, and fetching from RAM is on the order of microseconds. That's a ballpark estimate; specific times and factors will differ.

Oh, and some range of addresses may be memory-mapped to a file in your file system (which can be located on an SSD in the typical case, on an HDD if you're unlucky, or on a network mount if you're very unlucky). Trying to access an address in this range will generally cause a page fault, pausing your program while the OS goes and reads/writes the relevant parts of the file. This means that a simple pointer dereference may take milliseconds, or in the worst case unbounded time.
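
A minimal sketch of what that can look like from Rust, assuming the memmap2 crate and a hypothetical file name (neither appears in this thread): the index into the mapping is an ordinary slice access, but the first touch of each page can block on I/O.

    use std::fs::File;
    use memmap2::Mmap; // assumed crate for memory-mapping

    fn main() -> std::io::Result<()> {
        let file = File::open("some-large-file.bin")?; // hypothetical path
        // Safety: we assume nothing truncates the file while it's mapped.
        let map = unsafe { Mmap::map(&file)? };

        // An ordinary-looking slice index, but if the page isn't resident yet
        // the load stalls on a page fault while the OS reads it from disk
        // (or over the network for a network mount).
        let first_byte = map[0];
        println!("first byte: {first_byte}");
        Ok(())
    }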

2 Likes

I think you mean microseconds.

No. A 1 GHz CPU cycle would mean 1 cycle = 1 nanosecond.

There's no number that can usefully be added up to get a total time.

Because of The Myth of RAM, in practice the time to read memory depends on the order in which you're doing it, and the size of the working set involved.

And that's assuming that there even is a pointer lookup involved. Just because you wrote one doesn't mean that LLVM actually emits one.
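
As a small, hypothetical illustration of that last point (not from the thread): with optimizations enabled, LLVM can often fold an indexed load into a constant, so the lookup you wrote never touches memory at all.

    // In a release build this usually compiles down to returning the
    // constant 30 directly: no table in memory, no load instruction.
    pub fn lookup() -> u32 {
        const TABLE: [u32; 4] = [10, 20, 30, 40];
        TABLE[2]
    }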

6 Likes

That was about RAM access, which is actually closer to a microsecond than to a nanosecond: around 50ns since the beginning of the 21st century, with very little change over time.

A more practical document to read, if you want to know more, is What Every Programmer Should Know About Memory.

But the basic story is: forget what you were taught in college about O(1) operations and constant-time memory access.

Those models were correct back when they were written, in the 1960s-1970s, when computers were slow and RAM was fast. They are not even remotely close to being correct today, when the same operation can take one CPU cycle or 500 CPU cycles.

Yes, physics is a bitch.

6 Likes

Are there good heuristics for when L1, L2, and main memory are being used?

Main memory latency is 100ns according to the document you posted. L1 size is generally <100 KiB, L2 size <10 MiB. How can this code execute in 0.5s, as it does?

It is looking up memory 100M times, and the vector is too big for L1 or L2. In theory it should take 100ns * 100M = 10s. What is a better way to apply the helpful heuristics you posted?

    // Needs the `rand` (for `shuffle`) and `chrono` crates; the original post
    // did not show how `rng` or `N` were set up, so that part is filled in here.
    use rand::seq::SliceRandom;

    const N: usize = 100_000_000;
    let mut rng = rand::thread_rng();

    let mut vec: Vec<usize> = (0..N).collect();
    vec.shuffle(&mut rng);
    let start = chrono::Local::now();
    let mut total = 0;
    for i in &vec {
        total += vec[*i];
    }
    println!("time {:?}", chrono::Local::now() - start);
    println!("total {:?}", total);

MacBook Pro (16-inch, 2021)
Processor: 3.2GHz Apple M1 chip

This isn't testing latency. You can issue lots of loads at once for this kind of reduction. To test latency you'd want to do some kind of pointer-chasing.

1 Like

Reasoning about the timing of memory gets really complicated because of all the moving parts.

In your example, you are iterating over a long vector and using the value you get to figure out which element to add to the total.

That sort of random access pattern is normally really bad for cache hit rates, but it might be that the pre-fetcher was taught to understand this pattern. If you squint, it's very close to walking a linked list, except you are loading an offset instead of a pointer. This sort of pointer chasing is used all over the place (trees, linked lists, etc.), so it's quite reasonable for the pre-fetcher to be optimised for that pattern.

The processor can also take advantage of the fact that there's no inter-dependence between values: it can initiate multiple loads at the same time, or do the additions out of order, because it has some elements in cache that it can add while waiting for other elements to load from main memory.

A useful tool for investigating these sorts of memory questions is cachegrind. It gives you a breakdown of the instruction and data caches at different levels:

==31751== I   refs:      27,742,716
==31751== I1  misses:           276
==31751== L2  misses:           275
==31751== I1  miss rate:        0.0%
==31751== L2i miss rate:        0.0%
==31751== 
==31751== D   refs:      15,430,290  (10,955,517 rd + 4,474,773 wr)
==31751== D1  misses:        41,185  (    21,905 rd +    19,280 wr)
==31751== L2  misses:        23,085  (     3,987 rd +    19,098 wr)
==31751== D1  miss rate:        0.2% (       0.1%   +       0.4%)
==31751== L2d miss rate:        0.1% (       0.0%   +       0.4%)
==31751== 
==31751== L2 misses:         23,360  (     4,262 rd +    19,098 wr)
==31751== L2 miss rate:         0.0% (       0.0%   +       0.4%)
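
For reference, a typical invocation looks something like valgrind --tool=cachegrind ./your-binary (the binary name here is just a placeholder); run it against a release build so the numbers reflect optimized code.
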
4 Likes

Can you explain how it's doing that? Is it magically doing things in parallel even though the code is specified sequentially?

But even if the cache stores the answers, the cache is not big enough to store many of them, is it? And I am looking up all the answers, so this code must reference main memory at least N (100,000,000) times, no?

Yes, it's called "out of order execution". While the CPU is waiting for a piece of RAM, it continues to execute other instructions. So you have multiple requests to RAM being processed at the same time.

No. There's cache prefetching, which tries to predict, in advance, which bytes you may need to retrieve from memory in the near future.

And in your code the access is linear, so it's very easy to predict.

Without these mechanisms the steady march of DDR generations would have made no sense! SDR, DDR, DDR2, DDR3, DDR4, DDR5… they all have approximately the same latency, yet deliver more and more data in the same request! DDR5 is even designed to fulfill two independent requests from RAM at the same time!

That's all an attempt to hide the difference between RAM latency and L1 cache latency, because it's so huge: between 500 and 1000 times by now!

1 Like

There's no magic involved. Also, what might appear sequential in code might not be compiled to what you expect or be executed the way you expect. Other posters have explained out-of-order execution. In your main loop you're looking up vec[vec[j]] in sequence. All of these lookups are independent and the inner lookups vec[j] are highly predictable.

If you want to see memory latency in action, construct a random cycle: make a random permutation x of the integers 0 through n - 1, then construct an array y by setting y[x[i]] to x[(i + 1) % n]. Then do something like

let mut i = y[0];
let mut j = 0;
while i != 0 {
    j = (2 * j + i) % 1009; // do something with i
    i = y[i];
}

Now each lookup depends on the previous one so they cannot be done in parallel, and since y is a random cycle of length n, you are guaranteed to execute for n steps and most of the lookups will not be in cache.
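
For completeness, here is one way the whole construction described above could look, assuming the rand crate (as in the earlier snippet) and std::time::Instant for timing; treat it as a sketch rather than a polished benchmark:

    use rand::seq::SliceRandom; // assumed: rand crate for the shuffle
    use std::time::Instant;

    fn main() {
        let n: usize = 100_000_000;
        let mut rng = rand::thread_rng();

        // Random permutation x of 0..n.
        let mut x: Vec<usize> = (0..n).collect();
        x.shuffle(&mut rng);

        // y[x[i]] = x[(i + 1) % n] turns the permutation into a single n-cycle.
        let mut y = vec![0usize; n];
        for i in 0..n {
            y[x[i]] = x[(i + 1) % n];
        }

        // Each load depends on the previous one, so the CPU cannot overlap them.
        let start = Instant::now();
        let mut i = y[0];
        let mut j = 0usize;
        while i != 0 {
            j = (2 * j + i) % 1009; // do something with i so the loop isn't optimized away
            i = y[i];
        }
        println!("time {:?}, j = {}", start.elapsed(), j);
    }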

3 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.