Zero-cost abstraction or misunderstanding?

Hey everyone ^^,
I'm just playing with Rust pointers and trying to measure some runtime cost of the borrow checker!
I scripted these few lines of code on the Rust Playground ... and got random results,
with nothing conclusive: sometimes one way wins, sometimes the other. Since I'm not competent enough to check the disassembly, can I rely on Rust's zero-cost abstractions, or is my method simply not discriminating enough, with the OS/machine introducing noise into the results?
Since it's a fun topic I hope to get some quick answers ^^ thanks anyway.

use std::cell::RefCell;
#[allow(unused_imports)]
use std::sync::atomic::{AtomicU64, Ordering::Relaxed};
use std::time::Instant;
const BENCH_ITER_COUNT: u64 = 100_001;
fn main() {

    // Performance benchmark test:
    
    // loop & results memory config... 
    let mut result_data: Vec<String> = Vec::new();
    let mut raw_val: Vec<u128> = Vec::new();
    let mut refcell_val: Vec<u128> = Vec::new();
    let mut count = 0u64;
    
    // benchmark loop...
    while count < BENCH_ITER_COUNT {
    
        // raw way...
        unsafe {
            let mut m3 = SomeMemory::new();
            let checka = Instant::now();
            //(*(&mut m3 as *mut SomeMemory)).mem2.store(count, Relaxed);
            (*(&mut m3 as *mut SomeMemory)).mem4 = count;
            let ref value = *(&m3 as *const SomeMemory);
            let timea = checka.elapsed().as_nanos();
            result_data.push(format!(
                "raw -> val:{:?} memory mutation perf:{}(ns)",
                value, timea
            ));
            raw_val.push(timea);
        }
        
        // Refcell way...
        
        let m4 = RefCell::new(SomeMemory::new());
        let checkb = Instant::now();
        let ref m5 = m4;
        //m5.borrow_mut().mem2.store(count, Relaxed);
        m5.borrow_mut().mem4 = count;
        let timeb = checkb.elapsed().as_nanos();
        result_data.push(format!(
            "RefCell -> val:{:?} memory mutation perf:{}(ns)",
            m5.borrow(),
            timeb
        ));
        refcell_val.push(timeb);
        count += 1;
    }
    
    let raw_min = raw_val.iter().min().unwrap();
    let refcell_min = refcell_val.iter().min().unwrap();
    
    // result display.
    println!(
        "\nOS exception:\nraw way: {}ns (at best)\nrefcell way: {}ns (at best)\n",
        raw_min, refcell_min
    );
    
    if raw_val.len() == refcell_val.len() {
        // compute the average.
        let raw_average = raw_val.iter().sum::<u128>() / (raw_val.len() as u128);
        let refcell_average = refcell_val.iter().sum::<u128>() / (refcell_val.len() as u128);
        
        // result display.
        println!(
            "Average performance counter :\nfor raw: {}ns\nfor RefCell: {}ns",
            raw_average, refcell_average
        );
    }
}
#[allow(dead_code)]
#[derive(Debug)]
struct SomeMemory {
    mem1: AtomicU64,
    mem2: AtomicU64,
    mem3: AtomicU64,
    mem4: u64,
}
impl SomeMemory {
    fn new() -> Self {
        Self {
            mem1: AtomicU64::new(0u64),
            mem2: AtomicU64::new(0u64),
            mem3: AtomicU64::new(0u64),
            mem4: 0,
        }
    }
}

(Playground)

Output:


OS exception:
raw way: 711ns (at best)
refcell way: 713ns (at best)

Average performance counter :
for raw: 784ns
for RefCell: 832ns

Errors:

   Compiling playground v0.0.1 (/playground)
    Finished release [optimized] target(s) in 1.20s
     Running `target/release/playground`

Micro-benchmarking is super hard. For anything under hundreds of milliseconds, it's critically important to use something like Criterion.rs to make it more likely that you get a meaningful result.
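A minimal Criterion sketch, assuming `criterion` as a dev-dependency and a file under benches/ (the bench name and workload are illustrative, not the OP's exact code):

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use std::cell::RefCell;

// Criterion runs the closure many times per sample and does the
// statistics (warm-up, outlier rejection, confidence intervals) for you.
fn refcell_store(c: &mut Criterion) {
    let cell = RefCell::new(0u64);
    c.bench_function("refcell_store", |b| {
        b.iter(|| {
            // black_box keeps the optimizer from deleting the store.
            *cell.borrow_mut() = black_box(42u64);
        })
    });
}

criterion_group!(benches, refcell_store);
criterion_main!(benches);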

8 Likes

RefCell is not intended to be a zero-cost abstraction: it defers borrow checking to run time, where you have to pay for it with extra computational work. Which is one reason you should avoid it and use normal references or Cell whenever reasonably possible.
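For illustration, a tiny sketch of Cell, which copies values in and out instead of tracking borrows at run time:

use std::cell::Cell;

fn main() {
    // Cell has no borrow flag: get() copies the value out, set() writes it back.
    let counter = Cell::new(0u64);
    counter.set(counter.get() + 1); // no runtime borrow bookkeeping
    assert_eq!(counter.get(), 1);
}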

2 Likes

But what I see is that RefCell is not so expensive at runtime next to a raw pointer (sometimes it's even better on average... see for yourself, you just have to increase the const value) on a large set of mutable memory, which is the point of my issue... is this an LLVM trick or something?

1 Like

"zero cost" does not mean low, non-zero cost; it means no added cost.

1 Like

Of course, "zero-cost abstraction" doesn't fit here; compiler optimisation would be the right explanation for the fact that on a large set of mutable values I don't really see any huge differences (?...). RefCell is often faster than the raw pointer itself, which is unexpected... but that is certainly not the right way to measure the borrow checker's runtime cost... I'm going to look at micro-benchmarking for that and forget average values...

I wouldn't be sure that's what you're actually measuring. The measurement windows in your OP are so small (3-4 asm instructions) that any difference is probably dwarfed by variations in calling Instant::now() and elapsed().
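You can measure the instrument itself; a quick sketch (numbers vary by machine and OS):

use std::time::Instant;

fn main() {
    // Timing "nothing" shows the overhead and jitter of Instant::now()
    // and elapsed() alone, which can swamp a 3-4 instruction window.
    for _ in 0..5 {
        println!("empty measurement: {} ns", Instant::now().elapsed().as_nanos());
    }
}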

6 Likes

And in a larger context, safe Rust such as RefCell has more optimization potential than unsafe Rust, thanks to the LLVM noalias attribute, as benchmarks of significant size might demonstrate.
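A sketch of why (function names are illustrative): with references, LLVM may assume the two pointers are distinct; with raw pointers it cannot.

// With &mut, the final sum can be folded to a constant because `a` and
// `b` are guaranteed not to alias.
pub fn via_refs(a: &mut u64, b: &mut u64) -> u64 {
    *a = 1;
    *b = 2;
    *a + *b // can become `3` at compile time
}

// With raw pointers, `a` and `b` might be the same location (then the
// result is 4), so the compiler must keep the reload of *a.
pub unsafe fn via_ptrs(a: *mut u64, b: *mut u64) -> u64 {
    *a = 1;
    *b = 2;
    *a + *b
}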

3 Likes

How do you easily locate the loop's instructions? Rust disassembly is too spaghetti at first glance... because that's exactly what I need to look at, in fact...

1 Like

Good, I have found this and it's exactly what I need:

thanks

1 Like

Final diagnosis:
with cargo asm and the --rust flag I was able to locate the instructions and judge their quality,
and RefCell seems to be surprisingly efficient... to my mind the runtime tradeoff is worth the benefit
when you don't have control over the context of your function call...

(with raw pointer: 1 instruction)
(*(&mut m3 as *mut SomeMemory)).mem4 = count;
str x27, [sp, #216]

(with RefCell smart pointer: 2 instructions)
m5.borrow_mut().mem4 = count;
str x27, [sp, #128]
str xzr, [sp, #96]

1 Like

In GodBolt, right-click on the Rust code in question and "Reveal linked code."

Something like

        mov     qword ptr [rsp + 248], r15
        lea     rax, [rsp + 224]
        mov     qword ptr [rsp + 280], rax

vs.

        cmp     qword ptr [rsp + 96], 0
        jne     .LBB13_96
        mov     qword ptr [rsp + 128], r15
        mov     qword ptr [rsp + 96], 0

on whatever the default GodBolt target is. (Plus whatever impact having the panic paths imparts).

2 Likes

Oh great!! (godbolt.org) It's amazing...

1 Like

The borrow checker runs purely at compile time. Lifetime information is not passed to compiler backends like LLVM. It's guaranteed that the lifetime system never affects the produced machine code; it can only reject code that fails the check.

What you're testing here is the runtime cost of RefCell, which is a highly efficient single-threaded lock, but not the borrow checker itself.
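A small sketch of that runtime check:

use std::cell::RefCell;

fn main() {
    // RefCell's "lock" is a borrow-state flag checked at run time;
    // breaking the borrow rules is caught when the program runs,
    // not at compile time.
    let cell = RefCell::new(0u64);
    let _shared = cell.borrow();
    // A mutable borrow while a shared borrow is live fails the check:
    assert!(cell.try_borrow_mut().is_err());
}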

4 Likes

There are so many ways you can fool yourself when measuring.

  • You're using Instant to measure something that is a few instructions long. That's like using a tape measure to measure a few atoms: you're only measuring the manufacturing tolerance of the instrument itself, not the thing you're actually trying to measure. Try measuring no work at all (Instant::now().elapsed()) and you'll see that you get numbers around one microsecond. To accurately measure something very small, you need to repeat it a lot of times within a single measurement. You can, for example, make a loop that runs a million times between starting and stopping the timer (see the sketch after this list).
  • Rust is a very high-level language. Instructions are not laid out the same way as your source code is structured. What looks like a very concrete assignment in the source code will actually not be compiled directly as an assignment, but will just tell the compiler “hold my assignment” :beer:, and the compiler can then insert an assignment wherever needed, or nowhere at all if it's redundant. Your assignments can happen before the timer starts, after the timer stops, or not at all if the compiler can prove that the memory already holds the value it needs. There can also be other, irrelevant instructions placed in the section that is measured by the timer.
  • Operations on RefCell are inlined and can be optimized as a zero-cost abstraction, but this entirely depends on the context. In one context the lock may be optimized away, and in another one it will not, so you have to look at the actual context where you consider using a RefCell.
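A minimal sketch of the repeat-inside-one-measurement idea from the first bullet (std::hint::black_box assumes Rust 1.66+):

use std::cell::RefCell;
use std::hint::black_box;
use std::time::Instant;

fn main() {
    // Time a million mutations in one window and divide, so the timer's
    // own overhead is amortized instead of dominating the measurement.
    let cell = RefCell::new(0u64);
    let start = Instant::now();
    for i in 0..1_000_000u64 {
        *cell.borrow_mut() = black_box(i);
    }
    let total = start.elapsed().as_nanos();
    println!("~{} ns per mutation (including loop overhead)", total / 1_000_000);
}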

Let's now look at the actual code that comes out of compilation.

unsafe pointer operations

Here you can see that two instructions that are irrelevant to what you're trying to measure have been inserted in the section where the timer is running.

call qword ptr [rip + std::time::Instant::now@GOTPCREL] call Instant::now
mov qword ptr [rsp + 120], rax store first half of Instant on stack
mov qword ptr [rsp + 128], rdx store second half of Instant on stack
mov qword ptr [rsp + 248], r15 store counter in struct
lea rax, [rsp + 224] compute something irrelevant for later use
mov qword ptr [rsp + 280], rax store something irrelevant on the stack for later use
lea rdi, [rsp + 120] set up function argument with pointer to Instant
call qword ptr [rip + std::time::Instant::elapsed@GOTPCREL] call Instant::elapsed

RefCell

Here you can see that the assignment that locks the RefCell has been optimized away, but the instruction that checks that the cell is unlocked and the instruction that unlocks it have not.

call qword ptr [rip + std::time::Instant::now@GOTPCREL] call Instant::now
mov qword ptr [rsp + 280], rax store first half of Instant on stack
mov qword ptr [rsp + 288], rdx store second half of Instant on stack
cmp qword ptr [rsp + 80], 0 compare RefCell lock to 0
jne .LBB13_96 panic if non-zero
mov qword ptr [rsp + 112], r15 store counter in struct
mov qword ptr [rsp + 80], 0 set RefCell lock to zero
lea rdi, [rsp + 280] set up function argument with pointer to Instant
call qword ptr [rip + std::time::Instant::elapsed@GOTPCREL] call Instant::elapsed
5 Likes

(For info: u128 is what elapsed().as_nanos() natively returns.) That may not be the right pick for purely measuring the runtime, but it followed from the original idea of computing the average runtime cost over millions of iterations.
Looking straight down at the assembly code, we now have a clear idea of what's going on under the hood and can see that LLVM is still the final master in Rust of how code gets optimised,
and the most effective calliper was in fact the asm instruction count, not an Instant timing evaluation.

By the way, how can we directly write asm instructions from Rust code?

(Going beyond what the Rust compiler chooses for you is always tempting... I know the machine often wins, but Rust can only be competitive with C++ with this kind of fine tuning, and that's why I enjoy Rust code so much.)

Have a nice day/night all, and thanks for all your comments; they were instructive.

(Additional info: the asm scope from Rust,
from asm - The Rust Unstable Book):

#![feature(asm)]

// Requires a nightly toolchain; computes o = i + 5 in x86 assembly.
fn main() {
    let i: u64 = 3;
    let o: u64;
    unsafe {
        asm!(
            "mov {0}, {1}",
            "add {0}, {number}",
            out(reg) o,
            in(reg) i,
            number = const 5,
        );
    }
    assert_eq!(o, 8);
}
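For reference, inline asm was later stabilized as std::arch::asm (Rust 1.59); a sketch of the same example on stable, using an immediate instead of a const operand (x86-64 only, since mov/add are x86 instructions):

use std::arch::asm;

fn main() {
    let i: u64 = 3;
    let o: u64;
    unsafe {
        // o = i, then o += 5.
        asm!(
            "mov {0}, {1}",
            "add {0}, 5",
            out(reg) o,
            in(reg) i,
        );
    }
    assert_eq!(o, 8);
}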

Correct.

(Although it has been proposed that Rust should do some optimizations before handing over to LLVM. We'll see in a few years.)

Look at the story behind Matt Godbolt's Compiler Explorer. The point was to show that you don't need a lower-level language than C++: you can write your C++ code and see that the assembly that comes out of it is just as good as what you would write in a lower-level language. The same applies to Rust. The typical way to optimize a program is not to write assembly, but to look in Compiler Explorer at what assembly code comes out of the compiler, and then modify your C++ or Rust source until the assembly looks good enough. You can find his talks on YouTube.

3 Likes

No need to wait, MIR optimization is already a thing.

2 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.