How to read sub-slices of a &[u8] - where read is only what it takes to compare

I'm using a HashMap to take inventory of the unique [u8] values I encounter and to count them - for all practical purposes, a word frequency count.

  match counter.get_mut(bytes) {
      Some(count) => {
          *count += 1;
      }
      None => {
          counter.insert(bytes.to_owned(), 1);
      }
  }

To help scream through the data, I have an index of offsets into an Mmap, such that the entries at (index) and (index + 1) give the two offsets that bound a subslice of the Mmap. I skip some constant number of entries in the index, look up the values at (index) and (index + 1) to get the next subslice, and so on. Only when the value of the slice is not already in the counter do I copy the bytes.

The question is, given that all I need is to establish equality with a pre-existing key (using the HashMap), what is the very minimum required to "read-in" the subslice to perform this lookup? For instance, do I have to create and manage a buffer? At the level of the CPU I have to load the bytes of that subslice into a register, but what does that (load and compare) translate to?

Thanks to anyone with a mania and expertise for speed reading.

- E

Assuming you are using memmap::Mmap, you can access slices of it using indexing syntax, because Mmap dereferences to [u8]. This does not require copying or allocating a buffer. The slice is simply a pointer and a length referring into the memory-mapped data.

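use std::collections::HashMap;
use memmap::Mmap;

fn increment(counter: &mut HashMap<Vec<u8>, usize>, mmap: &Mmap, i: usize, j: usize) {
    // Borrow the bytes in place; nothing is copied here.
    let bytes = &mmap[i..j];
    // A HashMap<Vec<u8>, _> can be queried with a &[u8] key.
    match counter.get_mut(bytes) {
        Some(count) => {
            *count += 1;
        }
        None => {
            // Only a new key is copied into an owned Vec<u8>.
            counter.insert(bytes.to_owned(), 1);
        }
    }
}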

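To tie this back to the index of offsets described in the question, here is a minimal sketch that walks consecutive offset pairs and counts the sub-slices they bound. The function name `count_slices` and the plain `&[u8]` parameter are assumptions made for illustration; the `&[u8]` stands in for the memory-mapped bytes so the example needs only the standard library.

```rust
use std::collections::HashMap;

/// Counts the sub-slices `data[offsets[k]..offsets[k + 1]]`.
/// `data` stands in for the memory-mapped bytes (hypothetical setup).
fn count_slices(data: &[u8], offsets: &[usize]) -> HashMap<Vec<u8>, usize> {
    let mut counter: HashMap<Vec<u8>, usize> = HashMap::new();
    // Each window holds two adjacent offsets: the start and end of one sub-slice.
    for pair in offsets.windows(2) {
        // Borrowing a sub-slice creates a pointer and a length; no bytes are copied.
        let bytes = &data[pair[0]..pair[1]];
        match counter.get_mut(bytes) {
            Some(count) => *count += 1,
            None => {
                // The bytes are copied only the first time a value is seen.
                counter.insert(bytes.to_owned(), 1);
            }
        }
    }
    counter
}
```

On a hit, the only reads of the sub-slice are the ones the map itself performs: hashing the bytes, then comparing them with the stored key that matches. No intermediate buffer is needed on your side.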

Thank you both. @mbrubeck I am using memmap. I was using a function between the counter and the lookup. In doing so I was getting caught up in whether to share a ref to be mutated in the body of the fn, to avoid lifetime issues, or to take other approaches that likely involved a memcpy.

I suspect that for some tasks the bindings associated with a function call have a “side-effect” or artifact of some sort that prevents a clean read of the slice as you have presented it here. Perhaps there is a limit to indirection in that it’s not always free. I look forward to giving it (the code presented) a go tomorrow morning.
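For what it's worth, here is a small sketch of how a helper function can sit between the index and the counter while still only borrowing. The names `subslice` and `bump` are hypothetical, and a plain `&[u8]` again stands in for the mapped data. The lifetime `'a` says only that the returned slice borrows from `data`, so the call hands back a pointer and a length rather than a copy.

```rust
use std::collections::HashMap;

/// Hypothetical helper: the sub-slice bounded by index entries `k` and `k + 1`.
/// The result borrows from `data`; no bytes are copied.
fn subslice<'a>(data: &'a [u8], offsets: &[usize], k: usize) -> &'a [u8] {
    &data[offsets[k]..offsets[k + 1]]
}

/// Hypothetical helper: increments the count for `bytes`,
/// copying them into an owned key only when they are new.
fn bump(counter: &mut HashMap<Vec<u8>, usize>, bytes: &[u8]) {
    match counter.get_mut(bytes) {
        Some(count) => *count += 1,
        None => {
            counter.insert(bytes.to_owned(), 1);
        }
    }
}
```

Calling `bump(&mut counter, subslice(data, &offsets, k))` keeps the mutable borrow of the counter and the shared borrow of the data separate, which may sidestep the lifetime trouble described above.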
