Rust array performance

Hi, I'm working on a trading application in Rust and, while writing some of its components, I've hit a few things I don't understand despite spending a few days on them. I'm not (yet, I hope) a confirmed Rustacean, so sorry if these questions seem trivial.

The first component I have is a buffer backed by a statically sized array, using const generics.
For the required size (1_000_000 items), I need to allocate it on the heap, so here is the code for this buffer:

```rust
use std::cell::Cell;
use std::fmt::Debug;

use crate::interfaces::StandardBuffer;

pub struct SimpleRingBuffer<T, const S: usize>
where T: Sized + Clone + Default + Debug
{
    buffer: Box<[Option<T>; S]>,
    index: Cell<u64>,
}

impl<T, const S: usize> SimpleRingBuffer<T, S>
where T: Sized + Clone + Default + Debug
{
    fn vec_to_boxed_array() -> Box<[Option<T>; S]> {
        let boxed_slice = vec![Some(T::default()); S].into_boxed_slice();
        let ptr = Box::into_raw(boxed_slice) as *mut [Option<T>; S];

        unsafe { Box::from_raw(ptr) }
    }
}
```


```rust
impl<T, const S: usize> StandardBuffer<u64, T, S> for SimpleRingBuffer<T, S>
where T: Sized + Clone + Default + Debug,
{
    fn new() -> impl StandardBuffer<u64, T, S> {
        SimpleRingBuffer {
            index: Cell::new(0),
            buffer: Self::vec_to_boxed_array(),
        }
    }

    fn insert_at(&mut self, elt: Option<T>, idx: u64) {
        self.buffer[idx as usize % (S - 1)] = elt;
    }

    fn get(&self, idx: u64) -> Option<T> {
        self.buffer[idx as usize % (S - 1)].clone()
    }

    fn get_mut(&mut self, idx: u64) -> Option<&mut Option<T>> {
        Some(&mut self.buffer[idx as usize % (S - 1)])
    }

    fn delete(&mut self, idx: u64) {
        let idx = idx as usize % (S - 1);
        if self.buffer[idx].is_some() {
            self.buffer[idx] = None;
        }
    }

    fn display_buffer(&self) {
        for (i, elt) in self.buffer.iter().enumerate() {
            println!("The value of the elements after insert ring elements are {}, {:?}", i, elt);
        }
    }

    fn get_next_free_index(&self) -> u64 {
        self.index.replace(self.index.get() + 1)
    }
}
```

So, I ran some micro-benchmarks on it.
Environment: Intel i7-7700HQ, 2.8 GHz, 4 physical cores, 8 logical processors, 8 GB memory, Windows 10.

Cargo.toml:

```toml
[dependencies]
nohash = "0.2.0"
crossbeam-channel = "0.5.13"
mimalloc = "0.1"

[profile.release]
codegen-units = 1
lto = "fat"
panic = "abort"
opt-level = "s"
```

First bench (bench1):

```rust
fn test_buffer_array_int() {
    println!("The test of ring buffer: insert u64");
    let mut hm = SimpleRingBuffer::<u64, 1_000_000>::new();
    let base_index = OrderIdProvider::new().get_range_id();
    for _ in 0..10 {
        let t0 = Instant::now();
        for i in 0..1_000_000 {
            let id = OrderIdProvider::get_full_order_id(base_index, hm.get_next_free_index());
            hm.insert_at(Some(i), id);
        }
        let elapsed = t0.elapsed().as_secs_f64();
        println!("insert 1000 000 u64 in the buffer.Time elapsed: {:.3}", elapsed);
    }
}
```

Note that the static method get_full_order_id takes totally negligible time.
Here is the code:

```rust
use rand::Rng;
use rand::rngs::ThreadRng;

static MASK_L: u64 = 0x0000_0000_FFFF_FFFF;
static MASK_H: u64 = 0xFFFF_FFFF_0000_0000;

pub struct OrderIdProvider {
    rand_gen: ThreadRng,
}

impl OrderIdProvider {
    pub fn get_base_id(full_id: u64) -> u64 {
        (full_id & MASK_H) >> 32
    }

    pub fn get_order_index(full_id: u64) -> u64 {
        full_id & MASK_L
    }

    pub fn get_full_order_id(id_base: u64, order_index: u64) -> u64 {
        let id_shift: u64 = id_base << 32;
        id_shift + order_index
    }

    pub fn get_range_id(&mut self) -> u64 {
        self.rand_gen.gen_range(10u64..=(u64::MAX >> 32))
    }

    pub fn new() -> OrderIdProvider {
        OrderIdProvider {
            rand_gen: rand::thread_rng(),
        }
    }
}
```

Results of bench1:

```
The test of ring buffer: insert u64
insert 1000 000 u64 in the buffer.Time elapsed: 0.004
insert 1000 000 u64 in the buffer.Time elapsed: 0.003
insert 1000 000 u64 in the buffer.Time elapsed: 0.003
insert 1000 000 u64 in the buffer.Time elapsed: 0.002
insert 1000 000 u64 in the buffer.Time elapsed: 0.002
insert 1000 000 u64 in the buffer.Time elapsed: 0.002
insert 1000 000 u64 in the buffer.Time elapsed: 0.005
insert 1000 000 u64 in the buffer.Time elapsed: 0.004
insert 1000 000 u64 in the buffer.Time elapsed: 0.003
insert 1000 000 u64 in the buffer.Time elapsed: 0.003
```

So, as expected, it is really fast.

Second test, using a simple struct (bench2):

```rust
fn test_perf_buffer_insert_point() {
    let mut buffer = SimpleRingBuffer::<Point, 1_000_000>::new();
    // run insertion
    println!("The test of ring buffer with struct like point: insertion");
    let base_index = OrderIdProvider::new().get_range_id();
    println!("base index {}:", base_index);
    for _ in 0..10 {
        let t0 = Instant::now();

        for k in 0..1_000_000 {
            let id = OrderIdProvider::get_full_order_id(base_index, buffer.get_next_free_index());
            buffer.insert_at(Some(Point { x: 45, y: 78 + k, id }), id);
        }
        let elapsed = t0.elapsed().as_secs_f64();
        println!("Time elapsed: {:.3} sec", elapsed);
    }
}
```

Results of bench2:

```
The test of ring buffer with struct like point: insertion
base index 301404768:
Time elapsed: 0.006 sec
Time elapsed: 0.005 sec
Time elapsed: 0.005 sec
Time elapsed: 0.005 sec
Time elapsed: 0.005 sec
Time elapsed: 0.005 sec
Time elapsed: 0.006 sec
Time elapsed: 0.005 sec
Time elapsed: 0.005 sec
Time elapsed: 0.005 sec
```

Well, up to this point the results are consistent: we expect that moving the struct (size: 16 bytes) from the stack to the heap (correct me if I'm wrong, but as far as I know, structs are value types first allocated on the stack) should take a bit more time than moving a u64.
So the time (in CPU execution) seems to be roughly a function O(a·x), where x is the size of the element and a is some constant.
NB: I've tried creating the Point instances on the heap using Box, but, and I don't really understand why, performance is significantly worse.

So now I'm trying with a more complex struct.
Here is the struct:

```rust
pub struct Order<'a, T = OrderType>
where T: Debug + Default + Copy + PartialEq
{
    pub order_data: Option<OrderData<'a>>,
    pub order_variant: Option<T>,
    pub order_status: OrderStatus,
    pub order_id: u64,
    pub basket_id: Option<u32>,
}
```

The size can vary because OrderData contains several fields with string slices.

But in this run, I use default values.
Third test, using Order (bench3):

```rust
fn test_perf_buffer_insert_order() {
    let mut buffer = SimpleRingBuffer::<Order<_>, 1_000_000>::new();
    // run insertion
    println!("The test of ring buffer with struct like order simple: insertion");
    let base_index = OrderIdProvider::new().get_range_id();
    println!("base index {}:", base_index);
    for _ in 0..10 {
        let t0 = Instant::now();
        for k in 0..1_000_000 {
            let id = OrderIdProvider::get_full_order_id(base_index, buffer.get_next_free_index());

            buffer.insert_at(
                Some(Order::new_on_draft(
                    OrderType::Limit(LimitOrderData {
                        price: 45.2,
                        quantity: 458 + k,
                        tif: TimeInForce::Day,
                        side: OrderSide::Buy,
                    }),
                    &OrderData::default(),
                )),
                id,
            );
        }
        let elapsed = t0.elapsed().as_secs_f64();
        println!("Time elapsed: {:.3} sec", elapsed);
    }
}
```

Results of bench3:

```
The test of ring buffer with struct like order simple: insertion
base index 1838332593:
Time elapsed: 0.040 sec
Time elapsed: 0.043 sec
Time elapsed: 0.032 sec
Time elapsed: 0.034 sec
Time elapsed: 0.034 sec
Time elapsed: 0.033 sec
Time elapsed: 0.034 sec
Time elapsed: 0.033 sec
Time elapsed: 0.032 sec
Time elapsed: 0.034 sec
```

OK... At this point I'm not comfortable with that: why is it so expensive to move a struct into the corresponding heap cell of the array? I'd think it should be something like memcpy(x, sizeof(x), ...). Is there some reason that explains the cost? Is there a way to improve it?
Creating the object on the stack takes almost nothing...
Hmm, but now (and sorry for the verbose code above), I've just experimented with the following:

```rust
fn test_perf_buffer_insert_order() {
    let mut buffer = SimpleRingBuffer::<Order<_>, 1_000_000>::new();
    // run insertion
    println!("The test of ring buffer with struct like order simple: insertion");
    let base_index = OrderIdProvider::new().get_range_id();
    println!("base index {}:", base_index);
    for _ in 0..10 {
        let t0 = Instant::now();
        for k in 0..1_000_000 {
            // let id = OrderIdProvider::get_full_order_id(base_index, buffer.get_next_free_index());

            buffer.insert_at(Some(Order::new_on_draft(OrderType::Limit(LimitOrderData { price: 45.2, quantity: 458 + k, tif: TimeInForce::Day, side: OrderSide::Buy }), &OrderData::default())), 4578125899);
            buffer.insert_at(Some(Order::new_on_draft(OrderType::Limit(LimitOrderData { price: 45.2, quantity: 458 + k, tif: TimeInForce::Day, side: OrderSide::Buy }), &OrderData::default())), 12305);
            buffer.insert_at(Some(Order::new_on_draft(OrderType::Limit(LimitOrderData { price: 45.2, quantity: 458 + k, tif: TimeInForce::Day, side: OrderSide::Buy }), &OrderData::default())), 100045789555);
            buffer.insert_at(Some(Order::new_on_draft(OrderType::Limit(LimitOrderData { price: 45.2, quantity: 458 + k, tif: TimeInForce::Day, side: OrderSide::Buy }), &OrderData::default())), 7840012543);
        }
        let elapsed = t0.elapsed().as_secs_f64();
        println!("Time elapsed: {:.3} sec", elapsed);
    }
}
```

And the results:

```
The test of ring buffer with struct like order simple: insertion
base index 4013720174:
Time elapsed: 0.003 sec
Time elapsed: 0.003 sec
Time elapsed: 0.003 sec
Time elapsed: 0.003 sec
Time elapsed: 0.003 sec
Time elapsed: 0.003 sec
Time elapsed: 0.003 sec
Time elapsed: 0.003 sec
Time elapsed: 0.003 sec
Time elapsed: 0.003 sec
```

So now it is really fast, despite the fact that we are moving the same structs to the heap; why doesn't this take time?
And moreover, why does using a variable index for insertion (we can replace id with the loop variable k; the results are the same) take 10x longer than using a constant number as the index?
I believe access to array elements uses pointer arithmetic, so it is just constant time: address[0] + offset, isn't it?
Does someone have ideas about this?
Thanks for your time and help.

Please read the pinned formatting guide and fix your post.

```
code or pasted terminal output goes here
```
2 Likes

It doesn't matter whether your data is on the stack or on the heap. What you are probably measuring is write-to-RAM performance.

  • bench1: Option<u64> has a size of 16 bytes. You write ~16 MB.
  • bench2: The size of Option<Point> is not visible from the code. I'd guess 24 bytes. This means you write 24 MB.
  • bench3: From your description it is reasonable that Option<Order> is about 64 bytes, which would mean you write 64 MB in total.

This fits the times you measure if the write speed is about 3-5 GB/s.

Some comments on your code:

```rust
fn vec_to_boxed_array() -> Box<[Option<T>; S]> {
    // No unsafe, and the unwrap is optimized away
    vec![Some(T::default()); S].into_boxed_slice().try_into().unwrap()
}
```
3 Likes
```rust
buffer.insert_at(Some(Order::new_on_draft(OrderType::Limit(LimitOrderData { price: 45.2, quantity: 458 + k, tif: TimeInForce::Day, side: OrderSide::Buy }), &OrderData::default())), 4578125899);
buffer.insert_at(Some(Order::new_on_draft(OrderType::Limit(LimitOrderData { price: 45.2, quantity: 458 + k, tif: TimeInForce::Day, side: OrderSide::Buy }), &OrderData::default())), 12305);
buffer.insert_at(Some(Order::new_on_draft(OrderType::Limit(LimitOrderData { price: 45.2, quantity: 458 + k, tif: TimeInForce::Day, side: OrderSide::Buy }), &OrderData::default())), 100045789555);
buffer.insert_at(Some(Order::new_on_draft(OrderType::Limit(LimitOrderData { price: 45.2, quantity: 458 + k, tif: TimeInForce::Day, side: OrderSide::Buy }), &OrderData::default())), 7840012543);
```

And moreover, why does using a variable index for insertion (we can replace id with the loop variable k; the results are the same) take 10x longer than using a constant number as the index?

These writes all go to the same few locations, so the better performance is likely due to your cache.

1 Like

Use cargo asm to check what the code compiles to.

Make sure your benchmark is accurate (there are many pitfalls in using Instant).

1 Like

Hi Kornel, thanks for your advice!
I'll make use of it!

Hi Richardscollin, thanks for your response; I hadn't considered caching here...

Hi Bruecki,
Many thanks for your answer and for taking the time to review.
So you mean that the performance here is practically determined by the hardware; that makes sense indeed. Do you maybe have suggestions for more performant code here? Some strategies to improve? Thanks for your help.

If your program is limited by memory bandwidth, then to go faster, you need to use less of it per unit of work.

  • Look for ways to make your data structures more compact — storing the same information in fewer bytes. You can afford to spend many CPU cycles on “unpacking” the data.
  • Reduce the number of times you read the same data again, at a different time. The optimizer can handle cases where it is obvious, e.g. accessing the same element twice within a single loop iteration, and the CPU cache will handle such cases when the optimizer doesn’t, but if you, for example, make two passes over a large vector, the second pass will have to read all the data again, because it won’t be in the cache, and the optimizer won’t rewrite your algorithm to eliminate the second pass.

And, always, when you’ve made changes, benchmark them to confirm whether they improve performance. It would be bad if you complicated your code to try to make it faster, but did not in fact make it faster — besides the effects on readability, it’s also then harder to find other things to try.

1 Like

I have not really looked at what you are doing there, but when dealing with large amounts of data there are some simple things to consider:

  1. Modern processors are very fast, but their memory systems cannot keep up. Hence we have fast cache memory on the processor to speed up work and minimise access to the slow main memory. Think of it like keeping a 12-pack in your fridge rather than running to the store every time you want a beer.

  2. As a result of 1), traversing a 2D array in row-major order rather than column-major order will be much faster, as it works on data in the cache more often.

  3. Also, in the same vein, working through an array sequentially will be faster than randomly accessing elements. Great if your algorithms allow that.

  4. It helps to minimise the size of elements so that more of the data you are working on fits into fast cache. That leads to the idea of "structs of arrays" rather than "arrays of structs", and to "Entity Component Systems".

Take a look at this video for a nice discussion of these and other techniques for writing hardware-friendly software: CppCon 2016, Timur Doumler, "Want fast C++? Know your hardware!" (https://www.youtube.com/watch?v=BP6NxVxDQIs). It uses C++, but everything there applies to Rust as well.

Hi ZiCog,
Many thanks for that; "mechanical sympathy" is such an interesting topic... Thank you so much!

Hi Kpreid,
Thanks for your time and advice. I'll try to make my structs more compact and better aligned.
Yes, agreed about the best practices on readability vs complexity; thanks for pointing that out.