What's the most memory-efficient way to store a sparse Vec (while preserving the addresses!) without sacrificing a significant amount of access speed?

Hello,
I have a sparse vec (something like [1, 2, 0, 0, 0, 6, <a few hundred more zeros>, 2, 8, <few hundred more zeros>, etc.]), which (obviously) takes up a lot of space in memory. Is there any way to use less space while still preserving the right addresses? (For example, the "6" in the example vector, at address 5, needs to still have index 5, but the in-memory representation could be different.) It also needs to be almost as fast as Vec. Any ideas?

Thanks,
ndrewxie

What about using a std::collections::HashMap<usize, Value>?
You set the key to be the index, and the value to be, well, the value. This might take a bit more boilerplate code, but that could probably be abstracted away.


Edit:

In fact, I bet it could be changed to not use the standard hasher, and instead use one that doesn't actually hash the key at all and just passes it through, since it's only ever going to be hashing a single usize (the index).
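For example, here's a minimal sketch of what such a passthrough hasher could look like (NoOpHasher and SparseVec are made-up names for illustration, not anything from std):

use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// A "hasher" that just passes the usize key through unchanged.
#[derive(Default)]
struct NoOpHasher(u64);

impl Hasher for NoOpHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, bytes: &[u8]) {
        // Fallback for keys that aren't a bare usize: fold the bytes in.
        for &b in bytes {
            self.0 = (self.0 << 8) | b as u64;
        }
    }
    fn write_usize(&mut self, i: usize) {
        // usize keys are hashed via this method, so this is the passthrough.
        self.0 = i as u64;
    }
}

// A HashMap keyed by usize that skips real hashing entirely.
type SparseVec<V> = HashMap<usize, V, BuildHasherDefault<NoOpHasher>>;

fn main() {
    let mut v: SparseVec<u8> = HashMap::default();
    v.insert(5, 6);
    assert_eq!(v.get(&5), Some(&6));
}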

In case you haven't already seen it, sprs looks like a pretty good solution.

How fast is the access time?

I suppose I worded the question incorrectly - yes, that sparse Vec is used as a map, from Vec<[usize; 16]> to [usize; 16]. What I'm doing is computing the hash of the Vec<[usize; 16]> (the hash must be a usize), then returning the data at that address in the Vec. The question is: what is the best hash algorithm to avoid having lots of unused space in the Vec?

I believe this is a double misunderstanding; I was talking about it like this:

use std::collections::HashMap;

struct Value;
let mut my_sparse_vec: HashMap<usize, Value> = HashMap::new();
my_sparse_vec.insert(0, Value);
my_sparse_vec.insert(300, Value); // my_sparse_vec doesn't contain 300 empty values

This would probably work nicely for really sparse vecs like you originally described, but not so well for ones with lots of actual entries, because each entry also has to store its index, which isn't necessary with a Vec<Value> seeing as it is contiguous in memory.


What I was originally talking about with the hasher was just making it a passthrough, since hashing the indices is unnecessary: they are all going to be unique, and they are already plain numbers.

Yeah, I'm trying to avoid hashmap here

I did some benchmarking, and HashMap isn't really that fast, so I'm using a custom approach (it beat HashMap by quite a bit), but it only mapped Vec<usize> to usize. Now I have to expand it to map Vec<[usize; 16]> to [usize; 16], and now the algorithm uses too much memory.

The previous algorithm (mapping Vec<usize> to usize) relied on the fact that, for my purposes, the key (the Vec) is the same length every time and each "dimension" has a known maximum value. So if the Vec was [2, 3] and the maxima were [5, 6], the hash would be 2 * 5 + 3 * 6. My algorithm then goes over to the sparse Vec and fetches from location 2 * 5 + 3 * 6.

Hmm, I don't think that I completely understand this.

So, basically, I'm making a custom mapping from a known-size Vec of usizes to a usize, e.g.:

[2, 3] maps to 5
[3, 4] maps to 3
[2, 4] maps to 9

etc. But it's very sparse: most of the Vec<usize> keys don't have a value associated with them.

To make things simple, I'll make the assumption that the Vec is of length 2 (but it can easily be extended to longer lengths).

So, let's say the input was this:

[2, 4] -> 2
[3, 5] -> 3

Then, for each "dimension" in the input, I get the maximum, so the maxima would be [3, 5]. Then, I convert each vector to an address ("flattening") by multiplying each coordinate by its respective maximum and summing (so the first one would have address 2 * 3 + 4 * 5 = 26). Then, I store the value for that key at that address in the Vec (so after processing the first line, the vec would be):

[<26 zeros>, 2, ]

So "2" is at address 26. The problem is that that was for mapping Vec to usize, now I have to map Vec<[usize; 16]> to [usize; 16]. I could extend the algorithm (I tried), but now the "map vector" has a few thousand zeroes. Definitely not a good use of memory. Now that I type it, however, it seems that that way is the only way to get the hash without getting collisions (2 different Vec<[usize; 16]>s getting the same "flattened" address). Any ideas?

So you're trying to emulate something similar to this (in pseudo-Rust):

type Index = Vec<usize>; // This is your [2, 4] stuff
let data: Vec<Option<usize>> = Vec::new(); // This is your data ([2, 4] -> 2)
fn to_usize(index: Index) -> usize {
    // Your algorithm to turn the index into an address
}
// `data` contains Options because a real value may be 0,
// and so `data` would be the "sparse" vec in question

Yup! The problem is that the algorithm to get the index results in a VERY sparse vector. I read the documentation for the Hash trait in the standard library, and it has a lot of features and generality that I don't need (that's probably what's causing the slowdown).

Also, the thing I'm trying to emulate is what I did in the past (and it worked very well, despite the fact that it was sparse - that wasn't a problem back then). The problem is that now, instead of:

type Index = Vec<usize>;
let data: Vec<Option<usize>> = Vec::new();

I have to do:

type Index = Vec<[usize; 16]>;
let data: Vec<Option<[usize; 16]>> = Vec::new();
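For comparison, a plain HashMap keyed directly on the full index already handles this shape, since Vec<[usize; 16]> implements Hash and Eq, and only the entries that actually get inserted cost memory. Just a rough illustration (the speed concern mentioned above still applies):

use std::collections::HashMap;

fn main() {
    // Key directly on the full index; only inserted entries cost memory.
    let mut data: HashMap<Vec<[usize; 16]>, [usize; 16]> = HashMap::new();

    let index = vec![[1usize; 16], [2; 16]];
    data.insert(index.clone(), [7; 16]);

    assert_eq!(data.get(&index), Some(&[7; 16]));
}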

Great! :smile:
So the HashMap would still work in this case:

type Value = usize;
// PassthroughHasher is fictional, but does what its name suggests
let data: HashMap<usize, Value, PassthroughHasher> = HashMap::default();

but as you mentioned this is slow, so perhaps another solution? The mentioned sprs crate might work...


Wait so your index is actually a Vec<[usize; 16]>? I was actually referring to a [usize; 16] but I stupidly put in a vec...

I don't really want to use another crate right now, just trying to fix the suboptimal way to get the index

So wait, do you or do you not want to use a HashMap? Because if you could make the hasher basically a no-op/passthrough kind of thing, then that would probably work just fine, as long as you had a function:

type Index = // Whatever you want it to be
fn to_usize(index: Index) -> usize {
    // Your algorithm
}

the hash map would still work.
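To make that concrete, here's a rough sketch of how the pieces could fit together. SparseMap and toy_to_usize are made-up names for illustration, and the inner map uses the default hasher here (a passthrough hasher could be dropped in as its third type parameter):

use std::collections::HashMap;

// Illustrative wrapper: the caller supplies the index -> usize mapping,
// and the inner map is keyed by that usize.
struct SparseMap<V> {
    inner: HashMap<usize, V>,
    to_usize: fn(&[usize]) -> usize,
}

impl<V> SparseMap<V> {
    fn new(to_usize: fn(&[usize]) -> usize) -> Self {
        SparseMap { inner: HashMap::new(), to_usize }
    }
    fn insert(&mut self, index: &[usize], value: V) {
        self.inner.insert((self.to_usize)(index), value);
    }
    fn get(&self, index: &[usize]) -> Option<&V> {
        self.inner.get(&(self.to_usize)(index))
    }
}

fn main() {
    // Toy mapping just for the example (a real one would be your flattening).
    fn toy_to_usize(idx: &[usize]) -> usize {
        idx.iter().sum()
    }

    let mut map = SparseMap::new(toy_to_usize);
    map.insert(&[2, 4], 2usize);
    assert_eq!(map.get(&[2, 4]), Some(&2));
}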

Hmm, interesting, I didn't know you could do that! I'll try it. Also, just curious, how does the HashMap save that memory internally (because a sparse Vec is obviously not how they do it lmao)?

Wait, I think the big misunderstanding is that HashMap<K, V> is actually HashMap<K, V, S = RandomState>, so you can actually replace the hasher...
Anyway, I'm not actually sure how they do it, but I presume it's something similar to this:

use std::hash::{Hash, Hasher};
use std::marker::PhantomData;

struct HashMap<K, V, S> {
    kv_pairs: Vec<(u64, V)>,
    _marker: PhantomData<(K, S)>,
}

// The real trait bound is BuildHasher (a factory for hashers);
// Hasher + Default is used here just so `S` can be instantiated directly.
impl<K: Hash, V, S: Hasher + Default> HashMap<K, V, S> {
    pub fn insert(&mut self, k: K, v: V) {
        let mut hasher = S::default();
        k.hash(&mut hasher);
        let key: u64 = hasher.finish();
        self.kv_pairs.push((key, v));
    }
    pub fn get(&self, k: &K) -> Option<&V> {
        let mut hasher = S::default();
        k.hash(&mut hasher);
        let key: u64 = hasher.finish();
        for (stored_key, v) in &self.kv_pairs {
            if *stored_key == key {
                return Some(v); // v is a reference
            }
        }
        None
    }
}

(But like 100x more optimized)

OK, thanks!


So.... Solved? :sweat_smile: