Is there a faster way to Clone a Vec?

Today I thought I would see if I could optimise the implementation of Clone for my BTreeMap. Previously I was using a CursorMut to append cloned values (which was incidentally already slightly faster than the std implementation). A more "direct" implementation turned out to be much faster still (about 5 or 6 times), probably partly because there are no re-allocations. As part of that, I had to implement Clone for my Vec types; here is one:

impl<T> Clone for ShortVec<T>
where
    T: Clone,
{
    fn clone(&self) -> Self {
        let mut c = Self::new(self.cap as usize);
        c.allocate(self.alloc as usize);
        for i in 0..self.len() {
            c.push(self.ix(i).clone());
        }
        c
    }
}

(From btree_experiment 0.1.94 - Docs.rs)

So it is allocating the same amount as the original, then pushing the cloned values. This works well enough, but I wondered if there is a more direct and more efficient way to clone using raw pointers, especially for types that are Copy (and so can just be copied with a ptr copy operation). Am I missing something?

I know that std does some Clone/Copy specialization tricks, but AFAIK there’s no stable mechanism for that. You could write a separate method that requires T: Copy, which either calls out to memcpy explicitly or moves the values in a loop (trusting the optimizer to do its job).
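
For illustration, the Copy-only path can be a single memcpy-equivalent call; this is just a sketch, and copy_to_ptr is a made-up name rather than anything in std:

use std::ptr;

// Sketch: for T: Copy the whole payload can be copied in one go, with no
// per-element clone calls and no drop concerns.
// Safety: `dst` must be valid for writes of `src.len()` elements and must not
// overlap `src`.
unsafe fn copy_to_ptr<T: Copy>(src: &[T], dst: *mut T) {
    ptr::copy_nonoverlapping(src.as_ptr(), dst, src.len());
}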

As always, you should benchmark this stuff, because the optimizer might already figure it out for you. You might also be able to help it along by deinterleaving the length bumps from the clones.

1 Like

I think I want something a bit like this, but for a variable-length slice:

(although that is experimental/nightly anyway).

But the ptr module doesn't seem to have anything. It may well be that the compiler is doing a good job of optimising this anyway; I haven't tried looking at the generated code, and it seems to be running very fast as it is!

That one is basically the same as yours, and it is for variable length slices. It has the copy_from_slice counterpart that you could also make for yours.

You can be more efficient with the iteration, but calling clone on every element is unavoidable. You can only avoid that if the element type implements Copy, and there's no way to check for that. The closest thing is needs_drop, which you probably should use for your drop impl, but that is just for drop, and there are types that don't need drop yet still aren't Copy.
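
For example, the needs_drop pattern in a drop impl looks something like this (a sketch; the raw buffer and length are placeholders for whatever your type actually stores):

use std::mem::needs_drop;
use std::ptr;

// Sketch: skip the per-element pass entirely when T has no drop glue.
// Safety: the first `len` elements at `buf` must be initialized and must not
// be used again after this call.
unsafe fn drop_elements<T>(buf: *mut T, len: usize) {
    if needs_drop::<T>() {
        for i in 0..len {
            ptr::drop_in_place(buf.add(i));
        }
    }
    // Deallocating the buffer itself would follow, regardless of T.
}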

3 Likes

Oh, it looks like I could use that function, except for it being experimental/nightly. I only just found it and hadn't looked properly at what it was.

I wonder if some of the specialization mess could be sorted out by a similar best-effort function try_coerce<T, U>(T) -> Result<U, T>, which will generally return Ok when T: CoerceUnsized<U> but may have spurious failures if the compiler can’t prove that the lifetimes always work out.

I would expect the optimizer to be able to devirtualize most code that branches on this when U = dyn Something. It wouldn’t help with Copy directly because it’s not object safe, but there may be some ways around that by defining a helper trait.

If all you're doing is a statically fixed number of calls to drop_in_place (e.g. calling drop_in_place::<[T]>), then you shouldn't check needs_drop, as drop_in_place already includes an equivalent check and doesn't do any work for types without any drop glue.

needs_drop is for collections more like HashMap where iteration work can be saved when the collection elements can just be immediately forgotten about.
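
In other words, in a drop impl shaped like the sketch above, the explicit check can simply be removed; drop_in_place on a whole slice already compiles to nothing when T has no drop glue (again just a sketch):

use std::ptr;

// Sketch: one call covers the whole buffer, and it is already a no-op for
// element types without drop glue, so no needs_drop::<T>() guard is needed.
// Safety: the first `len` elements at `buf` must be initialized and must not
// be used again after this call.
unsafe fn drop_elements<T>(buf: *mut T, len: usize) {
    ptr::drop_in_place(ptr::slice_from_raw_parts_mut(buf, len));
}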

Indeed, a weak form of specialization that allows false negatives can be sound for arbitrary bounds. Specifically, the specialization must be "weak" in that it only sees whatever bounds are already known to hold in the caller (with limited transitivity). Thus it becomes essentially equivalent to the caller choosing one of a set of functions to call based on the bounds it knows hold. If lifetimes are never transitive for this purpose, then no unsound specialization on lifetimes exists, so long as the compiler is sufficiently careful during impl selection.

However, allowing this makes polymorphization a semantics-changing transform, and polymorphization is broadly considered something the compiler should be permitted to do, for both compile-time and runtime performance benefits. More generally, weak best-effort specialization just ends up being somewhat chaotic about which arm gets selected, which, while not technically unsound, is still far from desirable.

It ends up somewhat similar to the currently std-internal const_eval_select, where the caller is asked to guarantee (on pain of UB) that which implementation arm ends up selected is externally unobservable (of course excluding certain side channels declared "unobservable", such as timing). The compiler can decide to take the more performant arm when possible, but refuses to let you comprehend how or why it makes the decisions it does in this domain yet.

2 Likes

I thought I would see what std::vec::Vec does, and it is a bit mysterious, to me at least:

impl<T: Clone, A: Allocator + Clone> Clone for Vec<T, A> {
    #[cfg(not(test))]
    fn clone(&self) -> Self {
        let alloc = self.allocator().clone();
        <[T]>::to_vec_in(&**self, alloc)
    }

    // HACK(japaric): with cfg(test) the inherent `[T]::to_vec` method, which is
    // required for this method definition, is not available. Instead use the
    // `slice::to_vec` function which is only available with cfg(test)
    // NB see the slice::hack module in slice.rs for more information
    #[cfg(test)]
    fn clone(&self) -> Self {
        let alloc = self.allocator().clone();
        crate::slice::to_vec(&**self, alloc)
    }

    fn clone_from(&mut self, other: &Self) {
        crate::slice::SpecCloneIntoVec::clone_into(other.as_slice(), self);
    }
}

What is an "inherent" method?

A method that's defined on a type directly, in contrast with a trait method or standalone function.

struct S;
impl Clone for S {
    // trait method
    fn clone(&self) -> Self { S }
}

impl S {
    fn inherent_method(&self) {}
}

fn standalone(_: &S) {}
2 Likes

Well in the end I wrote this function:

// Safety: `p` must point to writable space for at least `src.len()` values of `T`.
unsafe fn clone_to_ptr<T: Clone>(src: &[T], mut p: *mut T) {
    for x in src {
        p.write(x.clone());
        p = p.add(1);
    }
}
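
For illustration, here is how such a helper can be exercised against a plain Vec destination (just a demonstration, not the code in the crate); if a clone panics part-way, the already-written values are leaked but nothing is unsound, since set_len is only called after the loop finishes:

fn clone_vec<T: Clone>(src: &[T]) -> Vec<T> {
    let mut v: Vec<T> = Vec::with_capacity(src.len());
    unsafe {
        // Write the clones into the spare capacity, then claim them.
        clone_to_ptr(src, v.as_mut_ptr());
        v.set_len(src.len());
    }
    v
}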

Using this turned out to be a bit faster than multiple calls to push.

I started trying out criterion, results for cloning a BTreeMap with 1,000 or 2,000 elements:

     Running benches\crit_bench.rs (target\release\deps\crit_bench-9ddef60c69b7e48c.exe)
Gnuplot not found, using plotters backend
Clone/Exp/1000          time:   [559.30 ns 576.82 ns 600.50 ns]
Found 13 outliers among 100 measurements (13.00%)
  1 (1.00%) high mild
  12 (12.00%) high severe
Clone/Std/1000          time:   [6.2996 µs 6.3124 µs 6.3266 µs]
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  9 (9.00%) high severe
Clone/Exp/2000          time:   [1.3082 µs 1.3125 µs 1.3181 µs]
Found 9 outliers among 100 measurements (9.00%)
  3 (3.00%) high mild
  6 (6.00%) high severe
Clone/Std/2000          time:   [12.711 µs 12.746 µs 12.791 µs]
Found 10 outliers among 100 measurements (10.00%)
  2 (2.00%) high mild
  8 (8.00%) high severe

So my clone is about 10 times faster than the standard library's.

1 Like

The implementation lives in the slice module since the code for turning a slice into a vec and cloning a vec is shared.

1 Like

According to git blame the code is 4 years old. I don't think the comments about better code generation hold anymore. My experiments with Compiler Explorer show better code generation with a zip-based loop:

The std code has two branch checks in the inner loop for cloning [String]. The fact that cloning [u32] is not converted to a memcpy in the current version shows that LLVM does not understand it as well as the zip-based variant.
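
For reference, a zip-based clone loop looks roughly like this (illustrative only, not the exact Compiler Explorer input):

use std::mem::MaybeUninit;

// Iterating source and destination together gives the optimiser a single trip
// count to work with, which can let simple element types collapse to a memcpy.
// Note: this leaks the already-written elements if a clone panics.
fn clone_into_uninit<T: Clone>(src: &[T], dst: &mut [MaybeUninit<T>]) {
    assert!(dst.len() >= src.len());
    for (d, s) in dst.iter_mut().zip(src) {
        d.write(s.clone());
    }
}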

Your code does not handle a panic in clone gracefully: the already-cloned values will be leaked (forgotten) if a following clone panics. This is sound, but it might lead to unexpected bugs if the user is not aware of it.
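
A common way to fix that is a small drop guard that owns the already-cloned prefix until the loop finishes. A minimal sketch (not the crate's code):

use std::{mem, ptr};

// Safety: `dst` must be valid for writes of `src.len()` elements.
unsafe fn clone_to_ptr_guarded<T: Clone>(src: &[T], dst: *mut T) {
    // If a clone panics, the guard's Drop runs during unwinding and drops the
    // elements that were already written to `dst`.
    struct Guard<T> {
        dst: *mut T,
        initialized: usize,
    }
    impl<T> Drop for Guard<T> {
        fn drop(&mut self) {
            unsafe {
                ptr::drop_in_place(ptr::slice_from_raw_parts_mut(self.dst, self.initialized));
            }
        }
    }

    let mut guard = Guard { dst, initialized: 0 };
    for (i, x) in src.iter().enumerate() {
        dst.add(i).write(x.clone());
        guard.initialized = i + 1;
    }
    // Success: the destination now owns the values, so defuse the guard.
    mem::forget(guard);
}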

2 Likes

Oh right, I had completely forgotten to take proper account of panics; my implementation of retain is also all wrong and needs fixing.

New implementation of retain:

    pub fn retain<F>(&mut self, mut f: F)
    where
        F: FnMut(&K, &mut V) -> bool,
    {
        // struct is needed to fix invariants if f panics.
        PairRetainer::new(self, &mut f).go();
    }

with

struct PairRetainer<'a, K, V, F>
where
    F: FnMut(&K, &mut V) -> bool,
{
    pv: &'a mut PairVec<K, V>,
    f: &'a mut F,
    i: usize,
    r: usize,
}

impl<'a, K, V, F> PairRetainer<'a, K, V, F>
where
    F: FnMut(&K, &mut V) -> bool,
{
    fn new(pv: &'a mut PairVec<K, V>, f: &'a mut F) -> Self {
        Self { pv, f, i: 0, r: 0 }
    }

    fn go(&mut self) {
        unsafe {
            let pv = &mut self.pv;
            while self.i < pv.len() {
                let (k, v) = pv.ixm(self.i);
                if (self.f)(k, v) {
                    if self.r != self.i {
                        let kv = pv.get(self.i);
                        pv.set(self.r, kv);
                    }
                    self.r += 1;
                } else {
                    pv.get(self.i);
                }
                self.i += 1;
            }
        }
    }
}

impl<'a, K, V, F> Drop for PairRetainer<'a, K, V, F>
where
    F: FnMut(&K, &mut V) -> bool,
{
    fn drop(&mut self) {
        unsafe {
            let pv = &mut self.pv;
            while self.i < pv.len() {
                if self.r != self.i {
                    let kv = pv.get(self.i);
                    pv.set(self.r, kv);
                }
                self.r += 1;
                self.i += 1;
            }

            pv.len -= (self.i - self.r) as u16;
            pv.trim();
        }
    }
}

from btree_experiment/src/vecs.rs at main · georgebarwood/btree_experiment · GitHub

My latest attempt (hopefully correct) at quickly cloning my PairVec (it does seem to run fast, just as fast as my previous faulty code):

impl<K, V> Clone for PairVec<K, V>
where
    K: Clone,
    V: Clone,
{
    fn clone(&self) -> Self {
        let mut c = Self::new(self.capacity, self.alloc_unit);
        c.allocate(self.alloc as usize);
        let mut n = self.len;
        if n > 0 {
            unsafe {
                let (mut sk, mut sv) = self.ixp(0);
                let (mut dk, mut dv) = c.ixmp(0);
                loop {
                    let k = (*sk).clone();
                    let v = (*sv).clone();
                    dk.write(k);
                    dv.write(v);
                    // Bump the length as each pair is written, so that c's Drop
                    // cleans up the already-cloned pairs if a later clone panics.
                    c.len += 1;
                    n -= 1;
                    if n == 0 {
                        break;
                    }
                    sk = sk.add(1);
                    sv = sv.add(1);
                    dk = dk.add(1);
                    dv = dv.add(1);
                }
            }
        }
        c
    }
}

from btree_experiment 0.1.96 - Docs.rs
