Dyn Traits performance

Hi,

For my business requirements (basically, we need the flexibility to create new order types, as structs, easily and without redelivering the entire platform), I need to use dyn Trait instead of an enum with one variant per order.
See below the parts of the code needed to understand my concern...
What is really insane is the performance of dynamic dispatch...

Structs (orders)

pub struct FooOrder<'a> 
{
    pub order_data: Option<OrderData<'a>>,
    pub order_variant: Box<dyn TOrderType>,
    pub order_id: u64,
    pub basket_id: Option<AtomicCell<u32>>,
    pub action_counter: AtomicCell<u16>,
}

impl<'a> FooOrder<'a> {
 pub fn static_execute(&mut self){
    self.order_id = 45;
    let exec_data = ExecutionData{
        order_quantity:45,
        side: OrderSide::Buy,
        last_execution_price:12.3,
        average_price:45.5,
        last_execution_quantity:14,
        remaining_quantity:11,
        cumulative_quantity:11
    };
    _=self.order_variant.execute(&exec_data);

 }
}

pub trait TOrderType{
    fn execute(&self, exec_data: &ExecutionData) ->Result<bool, OrderErrorMessages>;
    fn update(&self, value: Box<dyn TOrderType>);
}


pub struct LimitOrderData {
    pub price: f32,
    pub quantity: u32,
    pub tif: TimeInForce,
    pub side: OrderSide
}


pub struct MarketOrderData {
    pub quantity: u32,
    pub tif: TimeInForce,
    pub side: OrderSide
}

// there are a lot of order types

pub struct ExecutionData {
    pub side: OrderSide,
    pub last_execution_price: f32,
    pub average_price:f32,
    pub last_execution_quantity: u32,
    pub remaining_quantity: u32,
    pub order_quantity: u32,
    pub cumulative_quantity: u32
}

pub enum OrderSide{
    Buy,
    Sell
}
pub enum TimeInForce {
    Day,
    Gtc,
    Gtd(u64),//expiration time
    Ioc,
    Fok,
    Moo,
    Loo,
    Moc,
    Loc
}

So, here is the test:

fn test_dynamic_trait(){
    let t0 = Instant::now();
    for _it in  0..100_000_000{
        let ord = FooOrder {
            order_data:Some(OrderData::default()),
            order_variant: Box::new(MarketOrderData{
                quantity: 25,
                tif: TimeInForce::Day,
                side: OrderSide::Buy
            }),
            basket_id:Some(AtomicCell::new(0)),
            order_id:25,
            action_counter: AtomicCell::new(12)
        };
        _=ord.execute(&ExecutionData{
            order_quantity:45,
            side: OrderSide::Buy,
            last_execution_price:12.3,
            average_price:45.5,
            last_execution_quantity:14,
            remaining_quantity:11,
            cumulative_quantity:11
        });
    }
    let elapsed = t0.elapsed().as_secs_f64();
        println!("test dynamic invoke Time elapsed: {:.3} sec. ", elapsed);
}

This takes 4.9 sec!!! That is roughly 49 ns per dyn call. How is that possible? In C++, Java, and C#, virtual method calls take between 0.8 and 2 ns each on my PC.
The latest JIT versions use PGO, but even with older versions, the time per dynamic invocation is under 5 ns.

Using static invocation, the time for the same example (using an enum wrapping the order structs as order_variant) is 0.052 sec, so really what I expected from Rust.
(Creating the objects takes around 0.052 sec; the method call itself takes under 1 ns.)

Is there a way to understand what is happening? I would expect a difference of around 2x or 3x, but not more than 25x...

Standard question: are you running in --release mode? Without it Rust doesn't care about speed and everything is 10-100× slower.

Use bencher or criterion to test speed of Rust code reliably.

Use cargo asm to see generated assembly to check if it's doing what you expect.


In your code you're including the cost of Box::new, which creates a new allocation and deallocation on every iteration. Allocations are relatively expensive, and this may dominate the time in the benchmark.

Rust supports &mut dyn Trait (typically used in function arguments) which can be used with any borrowed data without necessarily boxing on the heap.
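
To illustrate the point about borrowed trait objects, here is a minimal sketch. The types are simplified, hypothetical stand-ins (OrderType, MarketOrder, apply_execution and demo are not from the original code); the only thing it tries to show is that a &mut dyn parameter still dispatches dynamically while the order itself can live on the stack.

```rust
// Hypothetical, simplified stand-ins for the types in this thread.
pub struct ExecutionData {
    pub remaining_quantity: u32,
}

pub trait OrderType {
    fn execute(&mut self, exec_data: &ExecutionData) -> bool;
}

pub struct MarketOrder {
    pub executions_seen: u32,
}

impl OrderType for MarketOrder {
    fn execute(&mut self, exec_data: &ExecutionData) -> bool {
        self.executions_seen += 1;
        exec_data.remaining_quantity == 0
    }
}

// Takes a borrowed trait object: the call goes through the vtable,
// but the caller decides where the value lives (stack, Vec, Box, ...).
pub fn apply_execution(order: &mut dyn OrderType, exec_data: &ExecutionData) -> bool {
    order.execute(exec_data)
}

// Example use: the order is a plain local variable, no Box involved.
pub fn demo() -> (bool, u32) {
    let mut order = MarketOrder { executions_seen: 0 };
    let exec = ExecutionData { remaining_quantity: 0 };
    let filled = apply_execution(&mut order, &exec);
    (filled, order.executions_seen)
}
```

This does not remove the need for Box (or a similar container) when the trait object has to be stored as a struct field; it only avoids an allocation where a borrow is enough.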


Hi Kornel,

Yes, of course I'm using release mode:

Cargo.toml
nohash = "0.2.0"
crossbeam-channel="0.5.13"
mimalloc = "0.1"
[profile.release]
codegen-units = 1
lto = "fat"
panic = "abort"
opt-level = "s"

About Box::new: it takes a really short time. Doing only this,

fn test_dynamic_trait(){
    let t0 = Instant::now();
    for _it in  0..100_000_000{
        let ord = FooOrder {
            order_data:Some(OrderData::default()),
            order_variant: Box::new(MarketOrderData{
                quantity: 25,
                tif: TimeInForce::Day,
                side: OrderSide::Buy
            }),
            basket_id:Some(AtomicCell::new(0)),
            order_id:25,
            action_counter: AtomicCell::new(12)
        };
/*
        _=ord.execute(&ExecutionData{
            order_quantity:45,
            side: OrderSide::Buy,
            last_execution_price:12.3,
            average_price:45.5,
            last_execution_quantity:14,
            remaining_quantity:11,
            cumulative_quantity:11
        });
*/
    }
    let elapsed = t0.elapsed().as_secs_f64();
        println!("test dynamic invoke Time elapsed: {:.3} sec. ", elapsed);
}

it takes <1 ms for the 100_000_000 iterations, so we can exclude that cost... No, it is only the dynamic call to execute that takes an abnormal amount of time... And I don't understand why. Have you had a similar use case? What were your results? Thanks for your help!

How much work is the execute method doing in this test? It may be inlined in the non-dyn test, but it cannot be inlined with dyn. In general, it is usually the lack of inlining that causes the performance difference with dyn, not the dyn function call itself.

As a separate issue, to determine whether the boxing is the issue, I suggest boxing with both tests (enum and dyn) rather than removing the call to execute.
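
A rough sketch of what "boxing with both tests" could look like. All names here (OrderVariant, enum_path, dyn_path, is_filled) are hypothetical and simplified; the idea is only that both paths pay for one heap allocation, so any remaining difference comes from the dispatch and inlining rather than from Box.

```rust
// Hypothetical, simplified order types.
pub struct MarketOrderData {
    pub quantity: u32,
}

pub struct LimitOrderData {
    pub price: f32,
    pub quantity: u32,
}

pub trait OrderType {
    fn is_filled(&self, remaining: u32) -> bool;
}

impl OrderType for MarketOrderData {
    fn is_filled(&self, remaining: u32) -> bool {
        remaining == 0
    }
}

impl OrderType for LimitOrderData {
    fn is_filled(&self, remaining: u32) -> bool {
        remaining == 0
    }
}

// Enum version that also boxes its payload, so both tests allocate.
pub enum OrderVariant {
    Market(Box<MarketOrderData>),
    Limit(Box<LimitOrderData>),
}

impl OrderVariant {
    // Static dispatch: a match on the variant, which the compiler may inline.
    pub fn is_filled(&self, remaining: u32) -> bool {
        match self {
            OrderVariant::Market(o) => o.is_filled(remaining),
            OrderVariant::Limit(o) => o.is_filled(remaining),
        }
    }
}

// One allocation plus a statically dispatched call.
pub fn enum_path(remaining: u32) -> bool {
    let order = OrderVariant::Market(Box::new(MarketOrderData { quantity: 25 }));
    order.is_filled(remaining)
}

// One allocation plus a dynamically dispatched call.
pub fn dyn_path(remaining: u32) -> bool {
    let order: Box<dyn OrderType> = Box::new(MarketOrderData { quantity: 25 });
    order.is_filled(remaining)
}
```

Timing enum_path and dyn_path in otherwise identical loops should give a fairer comparison than commenting out the call in one of them.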


Flamegraph might be useful to give you some additional insight.

I'm not sure it makes a difference in this particular case, but your release profile is optimizing for binary size (opt-level = "s") rather than performance (opt-level = 3). See also Profiles - The Cargo Book


Also, just to be sure, it's not enough to place a [profile.release] section in Cargo.toml. You actually have to build or run with the commands cargo build --release or cargo run --release. Sorry if that was obvious.


Try without opt-level = "s"? This option prioritizes binary size over performance. It may significantly degrade performance since the compiler becomes more conservative with code inlining.

Your loop doesn't measure the time it takes to use Box, because this result is not used for an external side effect, so the entire struct and all of its allocations are optimized out (LLVM understands Rust's alloc/dealloc and treats it as side-effect-free).

You should use std::hint::black_box or a proper benchmark tool.
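
For reference, a minimal sketch of how std::hint::black_box could be applied to a loop like the one above. OrderType, MarketOrder and time_boxed_calls are hypothetical names with simplified types, and black_box is only a best-effort hint to the optimizer, so treat this as an illustration rather than a rigorous benchmark.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical, simplified stand-ins for the thread's types.
pub trait OrderType {
    fn is_filled(&self, remaining_quantity: u32) -> bool;
}

pub struct MarketOrder {
    pub quantity: u32,
}

impl OrderType for MarketOrder {
    fn is_filled(&self, remaining_quantity: u32) -> bool {
        remaining_quantity == 0
    }
}

// Allocates a boxed order and calls it on every iteration.
// Returns how many calls reported "filled" and the elapsed time.
pub fn time_boxed_calls(iterations: u32) -> (u32, Duration) {
    let start = Instant::now();
    let mut filled = 0u32;
    for i in 0..iterations {
        let boxed: Box<dyn OrderType> = Box::new(MarketOrder { quantity: 25 });
        // Passing the box through black_box asks the optimizer to treat it
        // as used, so the allocation is less likely to be removed.
        let order = black_box(boxed);
        // Hiding the argument and the result does the same for the call.
        if black_box(order.is_filled(black_box(i % 2))) {
            filled += 1;
        }
    }
    (filled, start.elapsed())
}
```

With this shape, commenting out the call no longer lets the whole loop body disappear, so the allocation-only and allocation-plus-call timings become comparable.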


Hi Rikiborg

Thanks,
Yes, all of my tests are run with cargo run --release.
Thanks for pointing out the opt-level value; I forgot to change it indeed...
The execute method does nothing!!

impl<'a> TOrderType for FooOrder<'a> {
    fn execute(&self, exec_data: &ExecutionData) ->Result<bool, OrderErrorMessages> {
        self.order_variant.execute(exec_data)
    }

    fn update(&self, value: Box<dyn TOrderType>) {
        self.order_variant.update(value)
    }
}

and order_variant is, in this example,


impl TExecution<MarketOrderData> for MarketOrderData {
    fn execute(&mut self, exec_data: &ExecutionData) -> Result<bool, OrderErrorMessages> {
        Ok(exec_data.remaining_quantity == 0)
    }
}

I'll change to opt-level = 3 and retry..

Hi..
Trying with opt-level = 3 and cargo run --release,
the time is still poor...
test dynamic method invoke Time elapsed: 4.888 sec
I do not understand... Sorry for asking, but do you have on your side some proven use case with intensive dynamic dispatch calls?

Hi Kornel

Thanks for your reply...
But when my loop only creates the objects (without calling the method), as I mentioned above, it takes zero time...
Do you think there is something subtle related to that which slows down the dynamic invocation in the same loop?

Hi Kornel,

Sorry, I misunderstood your answer...
So do you think memory allocation is really the dominant cost when calling the dynamic method?

Hi Kornel

Thanks for your answer,
I agree, but we can't use a bare dyn Trait as a struct field, which is what I need here, so I need some container such as Box...
Maybe you have a better approach; I'd appreciate it.

pub struct FooOrder<'a> 
{
    pub order_data: Option<OrderData<'a>>,
    pub order_variant: Box<dyn TOrderType>,
    pub order_id: u64,
    pub basket_id: Option<AtomicCell<u32>>,
    pub action_counter: AtomicCell<u16>,
}

Do you have that many order types as to need dynamic dispatch for it?

Hi jumpnbrownweasel
Thanks for your suggestion, I'll try it and see..

Hi firebits.io

Yes, there are a lot, and the business can ask us to create one more for a specific strategy, so using an enum is a really inflexible approach..

Hi cyypherus

Hmm, interesting, I'll try to use it, but what surprises me is that I'm facing this problem with just the basic way of invoking a dyn Trait method. But you're right, I'll use this tool to try to see what happens as well.

Hi Kornel,

Hmm, you're right. I've moved the order creation out of the loop (which is indeed more relevant: in the real world the orders are created from another workflow and the executions are applied to existing orders..) and now the method invocation takes 0 sec:

fn test_dynamic_trait(){
    let t0 = Instant::now();
    let ord = FooOrder {
        order_data:Some(OrderData::default()),
        order_variant: Box::new(MarketOrderData{
            quantity: 25,
            tif: TimeInForce::Day,
            side: OrderSide::Buy
        }),
        basket_id:Some(AtomicCell::new(0)),
        order_id:25,
        action_counter: AtomicCell::new(12)
    };
    for _it in  0..100_000_000{
        // let ord = FooOrder {
        //     order_data:Some(OrderData::default()),
        //     order_variant: Box::new(MarketOrderData{
        //         quantity: 25,
        //         tif: TimeInForce::Day,
        //         side: OrderSide::Buy
        //     }),
        //     basket_id:Some(AtomicCell::new(0)),
        //     order_id:25,
        //     action_counter: AtomicCell::new(12)
        // };
        _=ord.execute(&ExecutionData{
            order_quantity:45,
            side: OrderSide::Buy,
            last_execution_price:12.3,
            average_price:45.5,
            last_execution_quantity:14,
            remaining_quantity:11,
            cumulative_quantity:11
        });
    }
    let elapsed = t0.elapsed().as_secs_f64();
        println!("test dynamic method invoke Time elapsed: {:.3} sec. ", elapsed);
}

I've learned a lot today....


This is why benchmarking tools provide things like setup isolation and warmup time; so that your measurement is focused only on the interesting part.

Benchmarking is a deep rabbit hole. But getting useful information from a benchmark is so much more than "just run it in a loop".
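
As a rough illustration of setup isolation and warmup without any external crate, here is a hand-rolled sketch. The names (OrderType, MarketOrder, bench_dyn_call) are hypothetical, and a dedicated tool such as criterion does far more (statistics, outlier handling), so this is only meant to show the shape of the idea.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Hypothetical, simplified stand-ins for the thread's types.
pub trait OrderType {
    fn is_filled(&self, remaining: u32) -> bool;
}

pub struct MarketOrder;

impl OrderType for MarketOrder {
    fn is_filled(&self, remaining: u32) -> bool {
        remaining == 0
    }
}

// Times only the dyn calls. Returns the number of "filled" results
// (so the work has an observable effect) and the elapsed time.
pub fn bench_dyn_call(warmup: u32, iterations: u32) -> (u32, Duration) {
    // Setup: done once, outside the timed region.
    let order: Box<dyn OrderType> = Box::new(MarketOrder);

    // Warmup: run the same code untimed so caches and branch predictors settle.
    for i in 0..warmup {
        black_box(black_box(&order).is_filled(black_box(i)));
    }

    // Timed region: only the calls themselves.
    let start = Instant::now();
    let mut filled = 0;
    for i in 0..iterations {
        // black_box on the receiver discourages devirtualizing the call away.
        if black_box(&order).is_filled(black_box(i)) {
            filled += 1;
        }
    }
    (filled, start.elapsed())
}
```

Dividing the returned Duration by the iteration count gives a per-call estimate; if that estimate is still 0, the optimizer has most likely removed the work being measured.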


Hi parasyte
Yes indeed, sorry for providing such a simple "benchmark" here.
I'll keep improving my knowledge of Rust and its best practices. Thanks to all of you here, I'm grateful to have such a good and experienced community..
