Possible to hold HashMap Entry but release/clone the key?

Is it possible (in stable Rust) to release a borrowed key when grabbing/creating an Entry in a HashMap?

I made a crude 1BRC (one-billion-row challenge) solution using my sipp parser crate and noticed that a lot of time is spent allocating String objects for the "station" names found in the input (even though each station name will appear many times). In an attempt to reduce the run time, I wrote a parser method which returns a Cow<'_, [char]> instead of an owned String, with the intention of adjusting my 1BRC solution to use the following logic:

  1. If the HashMap already contains the station name, then grab the existing Entry and release the borrowed key (the Cow) and also the parser object.
  2. If the HashMap does not already contain the station name, then clone the borrowed key and use it to create a new Entry in the HashMap, releasing the borrowed key and also the parser object.
  3. Use the (now released) parser object to continue reading the data (for the current station) and then use the Entry to stash the data under the corresponding key within the HashMap.

But no matter how I attempted this, the borrow of the parser object remained, making it impossible to continue using the parser object (to extract the data that follows the station name).

Is there something about HashMap which makes it impossible to release the initially borrowed key in this scenario, or am I just not approaching it correctly?

Please show the code, at least the full types you're talking about and the code that does the borrowing. It could be a simple implementation mistake, or it could be a design mistake.

There's simply no API for this. The closest is raw_entry_mut, but it's nightly. Or you can use hashbrown directly and use entry_ref.

Other things you can try:

  • If you have all 1 billion rows in memory, you could just store &str refs to it.
  • You could use an array-based string like this to avoid allocation costs.
1 Like

Your borrow problems may stem from needing to always have the same borrow duration on your Cows, which probably requires an arena or such to make sense.

(But it's hard to judge your actual problems from prose.)

I binned all of the previous attempts which were dead ends. So I just spent ten minutes knocking together another attempt, but in doing so I realised that both HashMap::entry and HashMap::insert expect to take ownership of the key, so my attempt to pass in a reference does seem doomed to fail.

Yeah, I'm belatedly realising that you're right. The stable HashMap doesn't seem to be geared for this purpose.

My parser crate streams a few bytes at a time, so the all-in-RAM option is out, though in honesty I don't fancy trying to hold the 1BRC input file all in RAM (it's 12.6 GiB in size).

I like the look of the hashbrown method entry_ref, though. I might swap that in as a replacement for the std HashMap, and see if I can get this to play nice. (Has Rust's std HashMap taken code from that crate?)

1 Like

Yes, std is currently some flavor of hashbrown underneath.[1]


  1. implementation detail subject to change ↩ī¸Ž

2 Likes

I'm not sure whether this counts as an "arena", but I worked around the problem in the way shown below. Now the HashMap just maps from the station name to a usize (avoiding the need for lingering references), and then a separate Vec is used to hold the actual record objects, with the mapped usize representing the index within the Vec where the corresponding station record can be found.

This (I believe) avoids the need to clone the station name every time, and instead just allows the station name to be used to find out whether we already have a Record. Does feel like we're losing the safety of Rust by manually managing memory (vector index) locations, though.

Sadly, the time to process the 1BRC input file is only reduced from 2m to 1m49s (about 9.7% faster) so it's not the huge saving I was hoping for.

// Map from station name (received from parser as Cow<[char]>) to Vec index.
let mut record_ids: HashMap<Cow<'_, [char]>, usize> = HashMap::new();
// And use a vector to hold the actual Record objects.
let mut records = Vec::with_capacity(625);

// Keep parsing until no content remains.
while parser.has_more()? {
    // The read_chars_up_to method returns a Cow<[char]> instead of a String.
    let name = parser.read_chars_up_to(';')?;
    let id: usize = match record_ids.get(&name) {
        // If station name already exists, grab the id (index) for it.
        Some(id) => *id,
        // If the station name never seen before, determine next free
        // id (index) and shove it into the map.
        None => {
            let new_id = records.len();
            let key = Cow::Owned(name.into_owned());
            record_ids.insert(key, new_id);
            new_id
        }
    };
    parser.require(';')?;
    let value = match parser.accept('-')? {
        true => parse_negative_number(&mut parser)?,
        false => parse_positive_number(&mut parser)?,
    };

    if id == records.len() {
        // This is a newly found station name, so create new Record and
        // push onto the end of the vector (so that its position equals the id).
        records.push(Record::from_initial_observation(value));
    } else {
        // We've seen this station name before, so just grab its record and
        // update its data with the newly found value.
        let record = records.get_mut(id).unwrap();
        record.add_observation(value);
    }
    parser.accept('\n')?;
}

You'd probably call this string interning. But there's nothing different lifetime-wise than the original method. The thing that makes this work is that you're doing two lookups, which you could do in a HashMap<String, Vec<Record>> just as easily.

Also, converting str to [char] is probably taking a lot of the time.

1 Like