Reuse map from outer scope?


#1

Hi, I have some code that reads each line of a file as a String, then stores each &str segment of that line in a map for later use.

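use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};
use unicode_segmentation::UnicodeSegmentation;

fn main() {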
    let file = File::open("input.txt").unwrap();
    let reader = BufReader::new(file);

    let mut record: HashMap<&str, u8> = HashMap::new();

    for line in reader.lines() {
        match line {
            Ok(l) => {
                UnicodeSegmentation::graphemes(l.as_str(), true)
                    .for_each(|x| *record.entry(x).or_insert(0) += 1);
                for (s, count) in record.drain() {
                    // do something fancy
                }
            },
            Err(_) => (),
        }
    }
}

The above code won’t compile: the &str values stored in the map, which lives in the outer scope, borrow from l in the inner loop, and l does not live long enough.
Is there a way I can prove to the compiler that the map will be empty after each iteration? That way I could reuse the same map instead of moving the map initialization into the loop.


#2

It’s because graphemes are taken from l, which is taken from line, which is discarded on every iteration of the loop, so the record hash map would become invalid after one loop iteration.

You can collect all lines into a Vec first, so that they have a permanent place to live for longer than one loop iteration.
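
A minimal sketch of that idea, not a drop-in replacement: the helper `segments` is a hypothetical std-only stand-in for `UnicodeSegmentation::graphemes` (it yields one `&str` per `char`), and `distinct_per_line` is an illustrative name. Because the `Vec<String>` outlives the map, the `&str` keys stay valid and the map can be declared once and cleared per line.

```rust
use std::collections::HashMap;
use std::io::{BufRead, Cursor};

/// Hypothetical stand-in for `graphemes`: one `&str` slice per `char`,
/// borrowed from `line`.
fn segments<'a>(line: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    line.char_indices()
        .map(move |(i, c)| &line[i..i + c.len_utf8()])
}

/// Returns the number of distinct segments found on each line.
fn distinct_per_line<R: BufRead>(reader: R) -> Vec<usize> {
    // Own every line up front so the keys below can borrow from them.
    // Lines that fail to read are skipped, as in the original code.
    let lines: Vec<String> = reader.lines().filter_map(Result::ok).collect();

    // Declared once, outside the loop; it borrows from `lines`.
    let mut record: HashMap<&str, u8> = HashMap::new();
    let mut out = Vec::with_capacity(lines.len());
    for line in &lines {
        for seg in segments(line) {
            *record.entry(seg).or_insert(0) += 1;
        }
        out.push(record.len());
        // `clear` empties the map but keeps its allocated capacity.
        record.clear();
    }
    out
}

fn main() {
    let input = Cursor::new("aab\nxyz\n");
    println!("{:?}", distinct_per_line(input));
}
```

The trade-off is that every line is held in memory at once.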


#3

But that would need to allocate a Vec. The reason I would like to reuse the map is to avoid allocation as much as possible.


#4

But how else do you imagine this to work? e.g. how would you do it in C?

Without keeping track of what strings have been created, it’s literally impossible to free them later. If you do Box::leak(line.into_boxed_str()) it’ll be safe and it’ll work, but you won’t have the strings later to free their memory.
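
To make the leak variant concrete, here is a small sketch. `leak_line` is a hypothetical helper name; `Box::leak` and `String::into_boxed_str` are the std calls mentioned above. The returned `&'static str` can be stored in a map that lives anywhere, but the memory behind it is never reclaimed.

```rust
use std::collections::HashMap;

/// Leaks `line` on purpose and returns a `&'static str` to its contents.
/// Nothing owns the allocation afterwards, so it is never freed.
fn leak_line(line: String) -> &'static str {
    Box::leak(line.into_boxed_str())
}

fn main() {
    // The map can live outside the loop because its keys are `'static`.
    let mut record: HashMap<&'static str, u8> = HashMap::new();
    for owned in vec![String::from("hello"), String::from("world")] {
        let line = leak_line(owned);
        *record.entry(line).or_insert(0) += 1;
    }
    println!("{} keys", record.len());
}
```

Memory use grows with every line read, so this only suits short-lived programs or small inputs.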


#5

You can also read the whole file into memory with fs::read_to_string() and use s.lines() (or s.split('\n')) to iterate over the lines. Then each line is a slice borrowed from that one in-memory String.
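
A rough sketch of this approach, under some assumptions: `segments` is a hypothetical std-only stand-in for the grapheme iterator (one `&str` per `char`), `distinct_per_line` is an illustrative name, and the file name is taken from the original post. Every key borrows from the single `contents` string, so the map can sit outside the loop.

```rust
use std::collections::HashMap;
use std::fs;

/// Hypothetical stand-in for `graphemes`: one `&str` slice per `char`.
fn segments<'a>(line: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    line.char_indices()
        .map(move |(i, c)| &line[i..i + c.len_utf8()])
}

/// Returns the number of distinct segments found on each line of `contents`.
fn distinct_per_line(contents: &str) -> Vec<usize> {
    // One map for the whole file; its keys borrow from `contents`.
    let mut record: HashMap<&str, u8> = HashMap::new();
    let mut out = Vec::new();
    for line in contents.lines() {
        for seg in segments(line) {
            *record.entry(seg).or_insert(0) += 1;
        }
        out.push(record.len());
        record.clear(); // empty the map, keep its capacity
    }
    out
}

fn main() {
    match fs::read_to_string("input.txt") {
        Ok(contents) => println!("{:?}", distinct_per_line(&contents)),
        Err(e) => eprintln!("could not read input.txt: {}", e),
    }
}
```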


#6

Each l is of no use after its own iteration; I put record.drain() there just to clear the map’s contents.


#7

Your code does the equivalent of this:

for line in reader.lines() {
    record.entry(&line);
    drop(line);
}

which is like:

record.entry(&line);
drop(line);
// search in second iteration of the loop to compare against previous line
record.keys().any(|x| *using x crashes*)

#8

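The best I can do now is:

use std::collections::HashMap;
use std::fs::File;
use std::io::{BufRead, BufReader};
use unicode_segmentation::UnicodeSegmentation;

fn main() {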
    let file = File::open("input.txt").unwrap();
    let reader = BufReader::new(file);

    for line in reader.lines() {
        match line {
            Ok(l) => {
                let mut record: HashMap<&str, u8> = HashMap::new();
                UnicodeSegmentation::graphemes(l.as_str(), true)
                    .for_each(|x| *record.entry(x).or_insert(0) += 1);
                for (s, count) in record.drain() {
                    // do something fancy
                }
            },
            Err(_) => (),
        }
    }
}

This way I avoid reading the entire file into memory, but I allocate a map on every iteration.


#9

My intent was like:

record.entry(&line);
// do something in current iteration
record.clear();
drop(line);

#10

The borrow checker does not take that .drain() into account. It doesn’t know that the call leaves the map empty.

You can:

  1. Hack it with unsafe. Transmute &line to &'static line, so that the borrow checker will not prevent you from storing potentially-unsafe pointers.

  2. Create that temporary hashmap each iteration of the loop. Perhaps use with_capacity if you have a good estimate, to reduce the number of reallocations. Hopefully freeing followed by allocation of the same-sized block will pick the same block again from some hot cached free list, so the overhead will be small.
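
Option 2 could look roughly like the sketch below. The constant `EXPECTED_DISTINCT` is a made-up estimate, `segments` is a hypothetical std-only stand-in for the grapheme iterator (one `&str` per `char`), and `process_lines` is an illustrative name. Whether the allocator really hands back the same block is not something this code can guarantee.

```rust
use std::collections::HashMap;
use std::io::{BufRead, Cursor};

/// Hypothetical estimate of distinct segments per line; tune for real data.
const EXPECTED_DISTINCT: usize = 64;

/// Hypothetical stand-in for `graphemes`: one `&str` slice per `char`.
fn segments<'a>(line: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    line.char_indices()
        .map(move |(i, c)| &line[i..i + c.len_utf8()])
}

/// Returns the number of distinct segments found on each line.
fn process_lines<R: BufRead>(reader: R) -> Vec<usize> {
    let mut out = Vec::new();
    for line in reader.lines() {
        let l = match line {
            Ok(l) => l,
            Err(_) => continue, // skip unreadable lines, as in the original
        };
        // A fresh map per line, pre-sized so it rarely has to grow.
        let mut record: HashMap<&str, u8> = HashMap::with_capacity(EXPECTED_DISTINCT);
        for seg in segments(&l) {
            *record.entry(seg).or_insert(0) += 1;
        }
        out.push(record.len());
        // `record` is dropped here, before `l`, so no borrow outlives the line.
    }
    out
}

fn main() {
    println!("{:?}", process_lines(Cursor::new("aab\nxyz\n")));
}
```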


#11

Hack it with unsafe. Transmute &line to &'static line, so that the borrow checker will not prevent you from storing potentially-unsafe pointers.

How can I manually drop the transmuted variable afterwards?


#12

You don’t. References are never freed by definition. The compiler will drop the owned line regardless of what you unsafely do to its references.
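
For illustration only, a sketch of what the unsafe variant might look like. It relies on an invariant the compiler cannot check: the map is cleared before the owned line is dropped, and no key escapes the loop body. `segments` is a hypothetical std-only stand-in for the grapheme iterator (one `&str` per `char`), and `distinct_per_line` is an illustrative name.

```rust
use std::collections::HashMap;
use std::io::{BufRead, Cursor};
use std::mem;

/// Hypothetical stand-in for `graphemes`: one `&str` slice per `char`.
fn segments<'a>(line: &'a str) -> impl Iterator<Item = &'a str> + 'a {
    line.char_indices()
        .map(move |(i, c)| &line[i..i + c.len_utf8()])
}

/// Returns the number of distinct segments found on each line.
fn distinct_per_line<R: BufRead>(reader: R) -> Vec<usize> {
    // The keys claim to be `'static`, but each one is only valid while the
    // line it came from is alive.
    let mut record: HashMap<&'static str, u8> = HashMap::new();
    let mut out = Vec::new();
    for line in reader.lines() {
        let l = match line {
            Ok(l) => l,
            Err(_) => continue,
        };
        for seg in segments(&l) {
            // SAFETY (assumed invariant): `record.clear()` below removes this
            // key before `l` is dropped, and the key is not copied elsewhere.
            let seg: &'static str = unsafe { mem::transmute::<&str, &'static str>(seg) };
            *record.entry(seg).or_insert(0) += 1;
        }
        out.push(record.len());
        // Must run before `l` goes away; the references themselves need no cleanup.
        record.clear();
        // `l` (the owned String) is dropped here by the compiler as usual.
    }
    out
}

fn main() {
    println!("{:?}", distinct_per_line(Cursor::new("aab\nxyz\n")));
}
```

If any code path keeps a key around past `record.clear()`, this becomes a use-after-free, which is exactly what the borrow checker was preventing.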


#13

Are you sure you need to, though? Did you measure? Just a small anecdote: My code iterates over millions of entities, and for each iteration step, I need a Vec that will hold up to 2 elements. I also thought that I should reuse the same Vec, so I made a proper benchmark, then hoisted the Vec out of the loop to reuse it… and it made no difference! I’m not sure why, maybe the allocator was smart or the compiler or both of them, but that shows that maybe you don’t need to try to do what you want right now.
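
A crude way to try this kind of comparison is sketched below. It is not the benchmark from the anecdote: `fresh_vec` and `reused_vec` are made-up functions, the workload is artificial, and wall-clock timing with `Instant` is noisy. The optimizer may also remove the allocation entirely, so treat any numbers as a hint and build with optimizations before reading anything into them.

```rust
use std::time::Instant;

/// Allocates a new two-element Vec on every iteration.
fn fresh_vec(n: u64) -> u64 {
    let mut total = 0;
    for i in 0..n {
        let mut v = Vec::with_capacity(2);
        v.push(i);
        v.push(i + 1);
        total += v.iter().sum::<u64>();
    }
    total
}

/// Allocates once, then clears and refills the same Vec.
fn reused_vec(n: u64) -> u64 {
    let mut total = 0;
    let mut v = Vec::with_capacity(2);
    for i in 0..n {
        v.clear();
        v.push(i);
        v.push(i + 1);
        total += v.iter().sum::<u64>();
    }
    total
}

fn main() {
    let n = 1_000_000;

    let start = Instant::now();
    let a = fresh_vec(n);
    let fresh = start.elapsed();

    let start = Instant::now();
    let b = reused_vec(n);
    let reused = start.elapsed();

    // Both variants must compute the same result.
    assert_eq!(a, b);
    println!("fresh: {:?}, reused: {:?}", fresh, reused);
}
```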


#14

In my case, it’s roughly 5% performance difference.