Into_iter from Vec<Vec<_>> memory consumption

I am using a structure that is basically a Vec<Vec<_>>. I do some map calls to transform the structure, using the into_iter method at both levels, but the memory consumption seems to skyrocket. I guess this is because the drop (free) occurs after the big iterator is dropped, not after each element is. This means that even if you use vec_of_vec.into_iter().map(...).collect(), the memory usage is doubled. Is this true?

If this is true, will calling std::mem::drop inside the first-level map cause the consumed element to be deallocated early? Something like:

let v: Vec<Vec<_>> = /* ... */;
// ...
v.into_iter().map(|p| {
    // ...
    std::mem::drop(p);
}).collect::<Vec<_>>()

When you call into_iter, the vector is immediately destroyed: ownership of its allocation is moved into a new value of type std::vec::IntoIter, and the vector itself no longer exists. The allocation is freed when that IntoIter value is dropped.
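
A minimal sketch of that ownership hand-off (the variable names here are just illustrative):

let v = vec![vec![1u32, 2], vec![3, 4]];
let iter = v.into_iter(); // `v` is consumed; its buffer is now owned by the `IntoIter`
// println!("{}", v.len()); // would not compile: use of moved value `v`
drop(iter); // the outer buffer (and any inner Vecs still inside it) are freed here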

You can't call drop on the vector after calling into_iter, because the call to into_iter already consumes the vector.

Regardless, how would you expect that to work? collect is implemented as a loop that takes one element at a time from the iterator, does something with it, and puts it in the resulting vector. Thus you can't deallocate the big vector until you've moved every element into the return value of collect.
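
Roughly, you can picture v.into_iter().map(f).collect::<Vec<_>>() as something like the following. This is a simplified model with a made-up map_collect helper, not the real standard-library code (which may be smarter, e.g. reusing the allocation in some cases):

// Simplified model of `v.into_iter().map(f).collect::<Vec<U>>()`.
fn map_collect<T, U>(v: Vec<T>, mut f: impl FnMut(T) -> U) -> Vec<U> {
    let mut iter = v.into_iter();                          // takes ownership of the original buffer
    let mut out = Vec::with_capacity(iter.size_hint().0);  // second buffer allocated up front
    while let Some(x) = iter.next() {
        out.push(f(x));                                    // elements move over one at a time
    }
    out
    // the original buffer is freed only here, when `iter` goes out of scope
}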

As for the inner vectors, their ownership is given to the closure in map, which is free to deallocate them or pass them on to be stored in the output vector. Calling drop explicitly shouldn't change anything — if it is not returned by the closure, it will still be automatically dropped when it goes out of scope at the end of the closure.
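
For illustration, both of these hypothetical variants free each inner vector at the same point, once per iteration of the loop driven by collect:

let v: Vec<Vec<u32>> = vec![vec![1, 2], vec![3, 4, 5]];

// Explicit drop inside the closure...
let a: Vec<usize> = v
    .clone()
    .into_iter()
    .map(|p| {
        let n = p.len();
        std::mem::drop(p);
        n
    })
    .collect();

// ...is equivalent to just letting `p` go out of scope at the end of the closure.
let b: Vec<usize> = v.into_iter().map(|p| p.len()).collect();

assert_eq!(a, b);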

I see. Thank you for your answer.

Then by calling into_iter at both levels, the drop should run for each inner element as it is consumed.

Do you know of any ways to profile memory allocations then?

I know there's a tool called massif which you can run using valgrind. I haven't used it in ages, but since Rust currently uses the system allocator, I'd guess you can use it in the same way as you'd profile e.g. a C application.

If you look up old guides on the matter, they may tell you to turn off jemalloc, but that's no longer necessary as it's off by default.

A flat vec.into_iter().collect() will temporarily use up to 2.5 times the memory of vec, because:

  1. Memory backing the original vec can't be partially freed, so it is freed only after collect() finishes
  2. collect will need memory for the new result
  3. If collect can't guess the final size it will need, it will keep doubling the capacity and reallocating as elements arrive.

The third point can be fixed by using extend on Vec::with_capacity(), but the first two are unavoidable.
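
A sketch of that fix, assuming you know a reasonable upper bound yourself. filter is used here because its size hint has a lower bound of 0, so a plain collect may have to grow the output repeatedly:

let v: Vec<u32> = (0..1_000).collect();

// Reserve the known upper bound up front, then extend instead of collect.
let mut out: Vec<u32> = Vec::with_capacity(v.len());
out.extend(v.into_iter().filter(|x| x % 3 == 0));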

However, the nested vecs will be freed one by one, unless you choose to keep them in the newly collected vec. If you just pass them through from one vec to another, they won't reallocate and won't increase memory usage.
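
A small demo (with an invented Noisy wrapper type, purely for illustration) that makes those per-element drops visible:

struct Noisy(Vec<u32>);

impl Drop for Noisy {
    fn drop(&mut self) {
        println!("dropping inner vec of len {}", self.0.len());
    }
}

fn main() {
    let v = vec![Noisy(vec![1, 2]), Noisy(vec![3, 4, 5])];
    // Each `Noisy` (and the inner Vec it owns) is dropped inside the loop,
    // before `collect` finishes and before the outer buffer is freed.
    let lens: Vec<usize> = v.into_iter().map(|n| n.0.len()).collect();
    println!("{:?}", lens); // prints after both "dropping ..." lines
}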

So the original vector's heap allocation won't be deallocated until the collect completes, while the heap allocations used by the inner vectors will be freed one by one, if they are not referenced further?

Indeed.

Iterator::size_hint?

Yes, that's how it guesses the size. Some combinators preserve the size hint, while others can't. For example, if you just use map on an iterator, it knows the size hint is preserved, but if you use filter or flat_map, it won't know how to predict the correct size, so the size hint won't tell it anything useful.
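
A couple of concrete values from the standard adapters:

let v = vec![1, 2, 3, 4];

// `map` passes the exact size hint through unchanged...
assert_eq!(v.iter().map(|x| x * 2).size_hint(), (4, Some(4)));

// ...while `filter` can only promise "between 0 and 4 elements".
assert_eq!(v.iter().filter(|&&x| x > 2).size_hint(), (0, Some(4)));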

An example where you might want the with_capacity + extend method: you're using flat_map with an iterator you know always produces two elements, i.e. the output length is twice the input length. However, collect would have to look at the size hint of every sub-iterator to tell the size in advance, as there is no way to signal that every sub-iterator has length two. In this case you can pre-allocate 2 * len to get the benefit anyway.
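
A sketch of that pattern, with a hypothetical duplicate helper where every input item is known to expand to exactly two outputs:

// Each item expands to exactly two outputs, but `flat_map`'s size hint can't
// promise that in advance, so we reserve 2 * len ourselves.
fn duplicate(v: Vec<u32>) -> Vec<u32> {
    let mut out = Vec::with_capacity(2 * v.len());
    out.extend(v.into_iter().flat_map(|x| [x, x]));
    out
}

For example, duplicate(vec![1, 2]) returns vec![1, 1, 2, 2] without any intermediate reallocation.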
