Copying stuff in memory


#1

I’m researching the topic of copying stuff in memory from one location to another. Trying to figure out what “zero-cost abstractions” mean in this context and when do they break. Compare these three ways of copying a vector:

fn copy_iter(src: &[u8], dst: &mut Vec<u8>) {
    for &byte in src {
        dst.push(byte);
    }
}

fn copy_index(src: &[u8], dst: &mut Vec<u8>) {
    let mut index = 0;
    unsafe { dst.set_len(SIZE); }
    while index < src.len() {
        dst[index] = src[index];
        index += 1;
    }
}

fn copy_push_all(src: &[u8], dst: &mut Vec<u8>) {
    dst.push_all(src);
}

On my machine, on 10M vector they run approximately in:

  • iter: 0.010
  • index: 0.007
  • push_all: 0.002

I’m trying to figure out the reason for such a big difference. My ideas so far:

  • push_all is the fastest because it forgoes any border checking during the loop using unsafe access to pointers
  • index does index check on both src[index] and dst[index]
  • iter, on every Vec::push also does an extra shift of the base pointer to current length

Questions:

  • Where am I wrong?
  • As far as I understand the concept of zero-cost abstractions, the iterator solution, ideally, should be possible to be made as fast as the index-based one. Or not?
  • Why Vec::push_all does set_len() on every iteration instead of settings it once at the end?
  • What’s the theoretically fastest way of copying things in memory? My current understanding is that the vectorized loop is as fast as it gets, but I admittedly don’t know very much about it.

May be there’s some blog post about it already. If not, I’d like to write one!


#2

You might be interested in the discussion around push_all, which was stabilized for interesting reasons:


#3

Thanks, will look into it.

btw, I just found a similar discussion about zipping iterators where I found out that by turning my slices into explicitly sized ones (let src = &src[..SIZE]) I can eliminate bounds checks during the loop and make index as fast as push_all.

Still, what bothers me the most, is the iterator-based code. It’s too idiomatic to be slow! :slight_smile:


#4

Hm, I don’t quite understand, how does copy_index works. What is this SIZE? It is an unbound variable. If you are calling set_len(x) on the vector, you must to make sure, that its capacity is at least x, otherwise you will violate memory safety.

I haven’t measured it, but it seems that the main work during vector extension is not the index access, but the multiple reallocations of the internal buffer. That is, calling reserve should make copy_iter version faster.

copy_index bypasses this reallocations completely, possible violating memory safety =)

copy_push_all of cause allocates the necessary buffer in one steps.

Also, your numbers say that index is the fastest, there must be a typo =)

PS: I’d love to see the actual code of the benchmark to check my hypothesis about reserve


#5

Yeah, I omitted all the preparation for brevity. Everything is pre-allocated to the hard-coded SIZE, here’s the entire source.

And yes, I’ve made an order of magnitude typo in numbers. Fixed now, thanks!

PS: I’d love to see the actual code of the benchmark to check my hypothesis about reserve

Yeah, that would be the slowest. (Update: 1.6x slower on a non-preallocated vector with iter)


#6

Yes, this makes more sense now, and I am completely unable to explain any figures here :slightly_smiling:

However I think that there is a major conceptual difference between iter and others. In theory, iter does not know that the buffer was pre allocated, so it has to extend the vector by one element at a time (in practice, it is likely inlined).

The most idiomatic way to write copy would be, imo dst.extend(src.iter()). And this can in theory be efficient, because the type of [u8].iter() is the ExactSizeIterator, and so this version has the same knowledge as copy_index and copy_push_all. But to do so, you need specialization.

Interestingly enough, if I replace

for &byte in src {
    dst.push(byte);
}

with

dst.extend(src.iter());

Then cargo run --release -- iter consistently becomes 1.5 times faster, BUT cargo run --release -- index becomes 3 times slower on my machine =)


#7

Shouldn’t you add a set_capacity to your iter_push_all? That way, it doesn’t reallocate more than once.


#8

That actually cleared up a lot, thanks again.

So I understand that push_all/extend_from_slice is intended to be the fastest idiomatic way of copying data in memory, and the very talk of specializing extend means that the general case of copying from an iterator is un-optimizable to that level. Suddenly it seems very obvious :slight_smile:


#9

Then cargo run --release -- iter consistently becomes 1.5 times faster, BUT cargo run --release -- index becomes 3 times slower on my machine =)

Something’s wrong here :slight_smile: Calling .push() is pretty much what extend() does anyway. And it certainly shouldn’t affect index! I find that results on my machine vary over time depending on if it’s on AC power vs. battery and the overall state of memory, I believe. May be you’re seeing those effects too?


#10

I find that results on my machine vary over time depending on if it’s on AC power vs. battery and the overall state of memory, I believe. May be you’re seeing those effects too?

The result is reproducible. Here is what I’ve got just now: https://gist.github.com/matklad/81b68284e0e6ba70c4c2 (rustc 1.5.0 (3d7cd77e4 2015-12-04)

My point here is not to try to explain the numbers I get with extend, but to question validity of numbers from the original benchmark :slightly_smiling:

Calling .push() is pretty much what extend() does anyway.

Not really, it tries to use size information, although not as efficiently, as it is possible for ExactSizeIterator: https://github.com/rust-lang/rust/blob/master/src/libcollections/vec.rs#L1337


#11

Yes, it’s very bothersome, but we haven’t been able to solve it yet. I think of this as the “.extend()” problem, that we haven’t been able to make a generic extend that matches push_all when you input a slice. There’s been some resolutions suggested, from trusted iterator length, to impl specialization. Hopefully we’ll get there eventually.

.zip() being nonoptimal is also very annoying.


#12

But .extend() doesn’t use the size information, it does a straightforward while let ... .next(), and on each iteration it does the same thing as .push(): checks the length, reserves more space if needed, gets the mutable pointer to the end, writes a value, increments the length. The only difference is in the method of getting the pointer, but I think it’s immaterial.


#13

What I originally meant was that the normal iterating code (for &byte in src { dst.push(byte) }) — not .extend() — can’t be made as fast as push_all(). And I was wondering if it’s inevitable. If I understand it correctly, the difference is that repeatedly calling .push() implies a bounds check and length increment on every iteration. And it’s actually the bounds check that kills the performance. I wonder why branch prediction doesn’t eliminate this.


#14

It uses size_hint to reserve memory, while push just doubles the amount of storage. That is in general .extend for slices will reallocate buffer at most once, while combination of iter and push will do several reallocations. For this particular benchmark we know that this should not matter, because of preallocated vector, but the compiler may not be allowed to make this conclusion.


#15

Ah! I’ve been skipping that part because it’s not relevant for the test, as the buffer is pre-allocated. Thanks for pointing this out.


#16

So that the vector length is correct. The Clone::clone() call is arbitrary user code that may panic, and correct length means that all currently pushed elements will drop correctly if there is a panic. Your efficient loop of course depends on clone being trivial, inlined, and thus known to be without panic, so the set len can be moved out of the loop by the optimizer.


#17

memcpy is the fastest. It’s supplied by the platform’s libc and is usually very optimized. See also “How to memcpy bytes in stable rust”. clone_from_slice is replaced by memcpy by the optimizer if it’s a &[u8] slice.


#18

So it’s special-cased in the compiler?


#19

In libstd or rustc no, but in llvm yes; It’s got a thing called loop idiom recognize that will replace recognized loops with memcpy, memset and some other candidates where poissible. Which means that it’s a special case you can use yourself:

let mut data = [1u8; 1024];
for elt in &mut data { *elt = 0; }

llvm will recognize this and call memset.


#20

Ok, got it. I heard about llvm doing this and was hoping that it’ll recognize a loop with .push() too but now I understand that with bounds check and gowing memory it’s a bit too complex to look like a simple mov instruction.