When one uses `fold` with something like HashMap what happens to the HashMap since it is not copy?

When one uses fold with something like HashMap what happens to the HashMap since it is not Copy?

//a silly example
		(1..10).fold(HashMap::<String, u8>::with_capacity(10), |mut s, c| {
        //here what happens to the hashmap? does it get copied? 
			s.insert(c.to_string(), c);
			s
		})

You can't use fold on a HashMap directly. You have to use it on an Iterator, which means you either borrow the HashMap or take ownership of it (depending on how you construct the iterator), wherein it will be released or dropped, respectively, when the iterator goes out of scope.

I am sorry seems I wasnt very clear I will add an update to my question

Let's just annotate the types in question:

(1..10).fold(
    HashMap::<String, u8>::with_capacity(10),
    |mut s: HashMap<String, u8>, c: u8| -> HashMap<String, u8> {
        s.insert(c.to_string(), c);
        s
    },
);

You can see that the callback called by fold takes and returns the owned HashMap, i.e. it is first moved from fold into the callback and then from callback back to fold.

I meant what exactly happens and what’s the cost of it since it’s a value isn’t it moved constantly from inner stack to the outer and Vice verse

In debug builds, yes, it will be moved back and forth. In optimized builds, the moves will be optimized away.

And it's not moving the entire hash table data (since that's allocated separately on the heap), only the few bytes of the “root” of the data structure.

6 Likes

Remember that HashMap isn't Copy (and it can't be, because it needs to do cleanup when dropped). So it'll only get copied by some explicit operation -- like .clone() -- and thus it can't be being copied here because you didn't call such a thing. (And you can look at fold to see it doesn't have a Clone bound on anything, so it's not copying stuff either.)

4 Likes

I meant something like memcpy in and out on every closure

In debug builds, there will be a shallow memcpy in and out on every closure. The HashMap’s main (heap) data is untouched, but it will copy size_of::<HashMap<String, u8>>() many bytes, which are the pointers and other metadata that are part of the HashMap value on the stack, not any data on the heap. Let’s see what that is…

use std::mem::size_of;
use std::collections::HashMap;
fn main() { dbg!(size_of::<HashMap<String, u8>>()); }
[src/main.rs:3] size_of::<HashMap<String, u8>>() = 48

So there you have it, it copies 48 bytes on the Rust playground (in current stable rust), which are 6 machine words (in the Rust playground, which is 64-bit).

…this post is edited after I learned that 48/8 is not 4…

But you shouldn’t worry about debug builds; Rust’s monomorphization means that in an optimized build, there will most likely be no copying at all. And even if it copies 6 words between stack frames twice per iteration, that’s most likely negligible compared to allocating and filling the string, and inserting into the map. In particular, calling insert will probably read these 4 words that make up (the outer / shallow layer of) the HashMap multiple times, too, and follow all the pointers, etc…


For completeness let’s look at what these 48 bytes / 6 words actually represent. HashMap is implemented in the hashbrown crate, here, consisting of the hasher, and a “RawTable”, and it also talks about an “allocator”. The allocator is actually a zero-sized struct (in the default case), and the hasher contains a key of two u64s. The raw table contains the remaining 4 words; apparently 3 usizes and one pointer, as defined here. The pointer points (in)to the HashMap’s heap data, and the usize values keep track of the size, the number of items, and the number of items that still can be added before a resize. (Sounds like it might be slightly redundant, but perhaps it’s more efficient than to try working just with 2 usize values.)

So that’s what’s getting copied when you move a HashMap, 2 u64s, 3 usizes, and 1 pointer. (All of this is of course subject to change; HashMap could at any point be changed to be slightly more than 6 or slightly less than 6 words; it most likely won’t ever be changed to be significantly larger than 6 words.)

7 Likes

I see. I was worried that the buckets would be copied as well I thought they were somehow created in the stack rather than the heap but if to think of it after your explanation that's pretty silly since they are dynamically sized objects.

Nit: The 48-byte size is 6 64-bit words, 8 bytes each. The extra 2 you didn't account for would be the RandomState as the default S: BuildHasher. Note that std's default S is different than hashbrown's. Other deterministic hashers do have zero-sized builders, often based on BuildHasherDefault.

1 Like

God… don’t tell anybody that I’ve studied mathematics for 3 years. Yeah 48/8 = 6, great. And – on second thought – of course an instance of RandomState isn’t zero-sized :see_no_evil:, it doesn’t randomize each Hasher, but only once for the whole BuildHasher

I have edited my previous post accordingly.

5 Likes

For any non-trivial amount of work being done in the closure, this is irrelevant.

That said, moving the container every time is something that LLVM can't completely remove (assuming things haven't changed in the past ~18 months). See Iterator::fold is a little slow compared to bare loop · Issue #76725 · rust-lang/rust · GitHub for some notes about it, and Lint for `fold` closure that never moves the accumulator · Issue #6053 · rust-lang/rust-clippy · GitHub for adding a clippy lint about it.

So, in general, if you don't need move access to something, then you can just not use fold. Use for_each instead:

let mut s = HashMap::<String, u8>::with_capacity(10);
(1..10).for_each(|c| {
    s.insert(c.to_string(), c);
});

And then it's quite obvious that the hashmap isn't getting copied.

(Of course this should just be map+collect for the specific example, which will get the size_hint correct and thus doesn't need the manual with_capacity, but since you said it's a "silly example" I assume your real case actually needs the distinction for some reason.)

2 Likes

Tangent, but is there a way to get a deterministic HashMap, that doesn't use rand() at all?

RandomState is initialized with randomized keys to create its DefaultHashers, but you can also use DefaultHasher::new or its Default implementation to create deterministic hashers. So for a HashMap, that would be S = BuildHasherDefault<DefaultHasher>, instead of S = RandomState. The actual hashing algorithm has no randomness, because it has to compute consistent hashes for a given map.

But that determinism is a trade-off against denial-of-service protection. If that's fine in your scenario, you may also want to try a faster hashing algorithm. The Rust compiler itself does this with rustc-hash, where there are type aliases for S = BuildHasherDefault<FxHasher>.

4 Likes

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.