I'm working on a program that counts unique substrings of a given length within a string sequence of data. I iterate through the sequence, inserting a Vec as a hashmap key for each unique substring.
I received some feedback that I haven't been able to get:
"Keep in mind that a Vec<T> holds 24 bytes even when empty - and when not empty, you need to count the allocation plus its overhead. If all your key vectors are the same length (which they are, but the specific length is defined by user input), you can save quite some space (and speed, by avoiding indirections) by switching to something like HashMap<[u8; N], u32> , where N is the number of characters."
I'm looking to follow this advice about creating fixed-length byte arrays to improve on what I was kind of doing before, which was:
for i in 0..(sequence.len() + 1).saturating_sub(substring_len) {
*data_hashmap.entry(&sequence[i..i + substring_len]).or_insert(0) += 1;
}
As my user input for N, the length of the unique substrings my program counts, is not a constant, continuing the discussion from Attempt to use a non-constant value in a constant, where, @kornel suggests to use vec![Default::default(); n] or Vec::with_capacity(n), I wrote the following, but I'm not sure if this is helping:
for i in 0..(sequence.len() + 1).saturating_sub(substring_len) {
let mut bytes_vec: Vec<u8> = vec![Default::default(); substring_len];
bytes_vec.extend_from_slice(&sequence[i..i + substring_len]);
*data_hashmap.entry(bytes_vec).or_insert(0) += 1;
}
I think this advice you received is wrong. You can only use the type [u8; N] if the length N is a compile-time constant. If the length is determined at run-time, this option will not work for your use case.
This code looks great. Since your hashmap keys are just borrowed slices &[u8], you are able to create them without any allocation or copying. You should probably continue using this code.
This creates a Vec filled with zeroes and then appends the substring to it, so you end up with a vec that is twice as long as you intended, which looks something like [0, 0, 0, 0, 1, 2, 3, 4].
If you do need to use a Vec<u8> key instead of an &[u8] key, you could do it like this:
let bytes_vec = sequence[i..i + substring_len].to_vec();
However, if your code works with &[u8] keys, then you should continue using those, and avoid allocating new Vecs.
I'm updating a DashMap, using rayon to process multiple sequences in parallel, so I've needed to use Vec<u8> for the keys, whereas I ran into lifetime issues if I tried to use &[u8].
You've made my day with this suggestion! I think it's the solution I've been looking for very unclearly. This has sped things up:
for i in 0..(sequence.len() + 1).saturating_sub(substring_len) {
let boxed_key: Box<[u8]> = Box::from(&sequence[i..i + substring_len]);
*data_hashmap.entry(boxed_key).or_insert(0) += 1;
}