Understanding clone vs clone_from, and in place calculations

Hello,

I have two different implementations of a simple rk4 solver below. The first was my original attempt which did a whole lot of cloning while pushing results to a vec. I was hopeful in my second attempt that by preallocating, and performing all operations in place, that I would see some big performance gains. In the end, the benchmarks were actually pretty consistently 10% slower than the first implementation. I'm guessing that maybe there's some compiler optimizations that make this a wash, or that under the hood I'm basically doing the same thing using clone_from as I would be cloning to a vec::with_capacity. Any insights to help me understand why this was the case? Any recommendations on a better way to perform these operations? I cant make the state copyable since the state is created dynamically at run time through a ui.

pub fn solve_fixed_rk4<F, T>(solver: &Solver<F, T>) -> (Vec<f64>, Vec<T>)
where
    F: Fn(T, f64) -> T,
    T: Integrable,
{
    let Solver {
        func,
        x0,
        tstart,
        tstop,
        dt,
        ..
    } = solver;

    assert!(dt.abs() > f64::EPSILON, "0.0 dt not allowed!");

    let half_dt = dt / 2.0;
    let mut x = x0.clone();
    let mut t = *tstart;

    let result_length = ((tstop - tstart) / dt).ceil() as usize;
    let mut result = Vec::with_capacity(result_length);
    let mut time = Vec::with_capacity(result_length);

    result.push(x.clone());
    time.push(t);

    while t <= *tstop {
        let k1 = func(x.clone(), t);
        let k2 = func(x.clone() + k1.clone() * half_dt, t + half_dt);
        let k3 = func(x.clone() + k2.clone() * half_dt, t + half_dt);
        let k4 = func(x.clone() + k3.clone() * *dt, t + dt);

        x = x + (k1 + k2 * 2.0 + k3 * 2.0 + k4) * *dt / 6.0;
        t += dt;

        result.push(x.clone());
        time.push(t);
    }

    (time, result)
}
pub fn solve_fixed_rk4<F, P, T>(solver: &Solver<F, P, T>) -> (Vec<f64>, Vec<T>)
where
    F: OdeFunction<P, T>,
    T: Integrable,
{
    let Solver {
        func,
        x0,
        parameters,
        tstart,
        tstop,
        dt,
        ..
    } = solver;

    assert!(dt.abs() > f64::EPSILON, "0.0 dt not allowed!");

    let half_dt = dt / 2.0;
    let x = &mut x0.clone();
    let mut t = *tstart;

    let result_length = ((tstop - tstart) / dt).ceil()  as usize + 1;
    let mut result = Vec::with_capacity(result_length);
    result.extend(std::iter::repeat(x0.clone()).take(result_length));
    let mut time = Vec::with_capacity(result_length);
    time.extend(std::iter::repeat(0.0).take(result_length));
    let mut idx = 1 as usize; // 0 will be manually stored

    result[0].clone_from(x0);
    time[0].clone_from(&t);
    t += dt;

    //rk4 stages
    let k1 = &mut x0.clone();
    let k2 = &mut x0.clone();
    let k3 = &mut x0.clone();
    let k4 = &mut x0.clone();

    //tmp variables for in place stage calculation
    let k2_tmp = &mut x0.clone();
    let k3_tmp = &mut x0.clone();
    let k4_tmp = &mut x0.clone();
    let x2 = &mut x0.clone();
    let x3 = &mut x0.clone();
    let x4 = &mut x0.clone();

    while t <= *tstop {
        x2.clone_from(x);
        x3.clone_from(x);
        x4.clone_from(x);

        //k1
        func(k1, &x, parameters, t);
        //k2
        k2_tmp.clone_from(k1);
        *k2_tmp *= half_dt;
        *x2 += k2_tmp;
        func(k2, x2, parameters, t + half_dt);
        //k3
        k3_tmp.clone_from(k2);
        *k3_tmp *= half_dt;
        *x3 += k3_tmp;
        func(k3, x3, parameters, t + half_dt);
        //k4
        k4_tmp.clone_from(k3);
        *k4_tmp *= *dt;
        *x4 += k4_tmp;
        func(k4, x4, parameters, t + dt);

        //in place chaining for (k1 + 2k2 + 2k3 + k4) * dt/6

        *k2 *= 2.0;        
        *k3 *= 2.0;
        *k1 += k2;        
        *k1 += k3;        
        *k1 += k4;
        *k1 *= dt / 6.0;
        *x += k1;

        result[idx].clone_from(x);
        time[idx].clone_from(&t);
        idx += 1;
        t += dt;
    }

    (time, result)
}
  1. Be sure to run with --release.
  2. Dynamically created state can be Copy just as easily as any other state.
  3. It's impossible to know how expensive cloning is without knowing what T is. If it's just f64, then I wouldn't expect any improvement.
  4. It seems that your Integrable trait is a subtrait of Add<Output = Self>, but you probably can also make it a subtrait of for<'a> Add<&'a Self, Output = Self>. This will let you change x.clone() + k2.clone() to x.clone() + &k2.
1 Like

I probably should have included the benchmark. This is for the in place version, but the out-of-place version is basically same thing.

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use differential_equations::{
    solver::{Solver, SolverMethod},
    Integrable,
};
use std::ops::{AddAssign, DivAssign, MulAssign};

// PendulumState represents the state of the pendulum with angle `theta` and angular velocity `omega`.
#[derive(Debug, Clone, Copy)]
struct PendulumState {
    theta: f64,
    omega: f64,
}

impl AddAssign<&mut Self> for PendulumState {
    fn add_assign(&mut self, other: &mut Self) {
        self.theta += other.theta;
        self.omega += other.omega;
    }
}

impl MulAssign<f64> for PendulumState {
    fn mul_assign(&mut self, rhs: f64) {
        self.theta *= rhs;
        self.omega *= rhs;
    }
}

impl DivAssign<f64> for PendulumState {
    fn div_assign(&mut self, rhs: f64) {
        self.theta /= rhs;
        self.omega /= rhs;
    }
}

struct PendulumParameters {
    pub g: f64,
    pub l: f64,
}

// Define the equations of motion for the pendulum.
fn pendulum_dynamics(dx: &mut PendulumState, x: &PendulumState, p: &Option<PendulumParameters>, _t: f64) {
    let p = p.as_ref().unwrap();    

    dx.theta.clone_from(&x.omega);    
    dx.omega = -(p.g/p.l) * x.theta.sin();
}

fn run_simulation() {
    let initial_state = PendulumState {
        theta: 1.0,
        omega: 0.0,
    }; // Initial state of the pendulum

    let p = PendulumParameters {g: 9.81, l: 1.0};
    
    let solver = Solver {
        func: |dx,x,p,t| pendulum_dynamics(dx,x,p,t),
        x0: initial_state,
        parameters: Some(p),
        tstart: 0.0,
        tstop: 1000.0,
        dt: 0.1,
        solver: SolverMethod::Rk4Classical,
    };

    let (time, results) = solver.solve();

}


fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("pendulum_simulation", |b| b.iter(|| black_box(run_simulation())));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);
  1. Dynamically created state can be Copy just as easily as any other state.

I am very interested to understand this. It would solve a lot of my problems. Is there some crate, datatype, or concept that I'm missing that allows me to copy a variable sized collection?

  1. It's impossible to know how expensive cloning is without knowing what T is. If it's just f64, then I wouldn't expect any improvement.

I've tried to make my solver general for any types that impl Integrable, but the immediate use case is something like this simplified case, where I'm passing MyState to the solver as T.

struct MyState {
     b: Vec<B>,
     c: Vec<C>,
}

#[derive(Clone,Copy)]
struct B {
     state_1: f64,
     state_n: f64,
     //etc.
}

#[derive(Clone,Copy)]
struct C {
     state_2:f64,
     state_n:f64,
     //etc.
}

In the future, it's expected that the f64 will be come much more complicated generic Ts, but they will always be copyable. The only thing I can't figure out how to make copyable in MyState is the Vec<B> and Vec<C>, since those are created and added to A dynamically by the user.

  1. It seems that your Integrable trait is a subtrait of Add<Output = Self>, but you probably can also make it a subtrait of for<'a> Add<&'a Self, Output = Self>. This will let you change x.clone() + k2.clone() to x.clone() + &k2.

Thanks, indeed the out-of-place version impl Add which was my first attempt, but when I impl AddAssign for the in-place version I ran into that issue and impl Integrable as you suggested.

pub trait Integrable:
    for<'a> AddAssign<&'a Self> + for<'a> MulAssign<f64> + for<'a> DivAssign<f64> + Sized + Clone + Debug
{
}

You shouldn't expect your benchmark with PendulumState to have the same performance characteristics as MyState since they're very different structs. PendulumState is extremely cheap to clone since it's only 16 plain bytes, so changing things involving cloning is not going to make much difference.

The reason Copy isn't implemented for Vec isn't because it has dynamic length or built from user input, it's because it's an owned heap allocation. You can use a copyable alternative like tinyvec::ArrayVec, but if there's a lot of data, copying will be slow whether or not it actually implements Copy.

What issue? The code you showed should work, though you don't need the higher-ranked lifetimes on MulAssign<f64> or DivAssign<f64>.

pub trait Integrable:
    for<'a> AddAssign<&'a Self> + MulAssign<f64> + DivAssign<f64> + Sized + Clone + Debug
{
}

Okay, everything above is either irrelevant, minor, or unavoidable when it comes to performance. But your original question was about clone_from, and there is a simple, non-obvious thing you can do to make clone_from faster.

One of the first things to do when making a new struct is to derive Clone. This should be done for almost every type, for a few reasons:

  • Almost every type can be cloned.
  • For most types, the derived Clone::clone impl is ideal.
  • Clone::clone_from is rarely used, so usually it's not important to override it even if the default isn't ideal.

Lets do that for MyStruct.

#[derive(Clone)]
pub struct MyState {
    b: Vec<B>,
    c: Vec<C>,
}

This expands into roughly this code:

impl Clone for MyState {
    fn clone(&self) -> Self {
        Self {
            b: self.b.clone(),
            c: self.c.clone(),
        }
    }
}

That leaves the default clone_from unchanged, which looks like this:

fn clone_from(&mut self, source: &Self) {
    *self = source.clone()
}

Lets look at what happens when you use clone_from in your function.

// x2 and x are MyState
x2.clone_from(x);
// becomes
x2 = x.clone();

It's just a clone! We've gained nothing. It doesn't reuse x2's heap allocation like we want. We can fix this by overriding clone_from.

fn clone_from(&mut self, source: &Self) {
    self.b.clone_from(&source.b);
    self.c.clone_from(&source.c);
}

For this, we're trusting that Vec has a good clone_from function, which of course it does. If it didn't, we could do clear and then extend_from_slice. You could do something similar if you're using a different collection that doesn't have a good clone_from.

The final version:

pub struct MyState {
    b: Vec<B>,
    c: Vec<C>,
}

impl Clone for MyState {
    fn clone(&self) -> Self {
        Self {
            b: self.b.clone(),
            c: self.c.clone(),
        }
    }

    fn clone_from(&mut self, source: &Self) {
        self.b.clone_from(&source.b);
        self.c.clone_from(&source.c);
    }
}
1 Like

Ahh ok very interesting. I'll be implementing this code for my actual custom State struct today. I reverted back to the out of place calculations for now but will benchmark the actual use case with your recommendations and report back any findings.

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.