Understanding hashmaps

I have a function that I want to cache, like this:

pub fn is_match(&self, regex: Regex) -> bool {
    let mut list = self.matches.borrow_mut();
    match list.get(regex.as_str()) {
        Some(x) => *x,
        None => {
            let value = regex.is_match(self.get_name());
            list.insert(regex.as_str().to_string(), value);
            value
        }
    }
}

matches is a RefCell<HashMap<String, bool>>. The problem is I am getting a lot of cache misses. In an app that uses this function about 200K times, with just about 200 combinations, I should be getting almost no cache misses. But it's as if it never inserts anything there.

Tips?

See https://github.com/frosklis/dinero-rs/issues/41#issuecomment-792122867 for more context.

It seems strange that you are taking a Regex by value. Doesn't it mean that you are cloning it many times?

I don't see anything obvious. How do you know you're seeing cache misses (and not some other kind of slowdown)?


Maybe. I ran profiling and the time is mostly spent on the line regex.is_match so it does not seem to matter.

Because in my local version I put some println! to see what's going on and it seems to be the case. Like so:

if self.get_name().ends_with("nómina") {
    print!("{}     --{}--", self.id, regex_str);
}
match list.entry(regex_str) {
    Entry::Vacant(entry) => {
        *self.misses.borrow_mut() += 1;
        if self.get_name().ends_with("nómina") {
            println!("{}     MISS", self.get_name());
        }
        *entry.insert(regex.is_match(self.get_name()))
    }
    Entry::Occupied(entry) => {
        *self.hits.borrow_mut() += 1;
        if self.get_name().ends_with("nómina") {
            println!("{}     HIT", self.get_name());
        }
        *entry.get()
    }
}

self.id is a random usize created when the object is created. I wanted to check whether I was actually calling the same object, which was my first suspicion.

In the greater scheme of things, my app is about 3 times slower than ledger-cli. In a real-life scenario I am doing about 200K regex comparisons, which take slightly more than half of the running time, and they could be reduced by a factor of 1K if that piece of code worked as intended.

Are you sure that self is pointing to the same value each time, or could you be creating a new object with an empty cache each time?

Have you tried printing out the contents of the cache?

dbg!(&*list);

This is strange, let's see, the current function is like so:


    pub fn is_match(&self, regex: Regex) -> bool {
        // todo delete printlns
        let mut list = self.matches.borrow_mut();
        let regex_str = regex.as_str().to_string();
        // *list
        //     .entry(regex_str)
        //     .or_insert(regex.is_match(self.get_name()))

        if self.get_name().ends_with("nómina") {
            print!("{}     --{}--", self.id, regex_str);
            dbg!(&*list);
        }
        match list.entry(regex_str) {
            Entry::Vacant(entry) => {
                *self.misses.borrow_mut() += 1;
                if self.get_name().ends_with("nómina") {
                    println!("{}     MISS", self.get_name());
                }
                *entry.insert(regex.is_match(self.get_name()))
            }
            Entry::Occupied(entry) => {
                *self.hits.borrow_mut() += 1;
                if self.get_name().ends_with("nómina") {
                    println!("{}     HIT", self.get_name());
                }
                *entry.get()
            }
        }
    }

And the output (the last part of it, it's about 15 MB) is:

  • stdout
3152039713130906693     --(?i)vactivo--Activo:ING:Cuenta nómina     MISS
3152039713130906693     --(?i)vactivo--Activo:ING:Cuenta nómina     MISS
  • stderr (from dbg!)

[src/models/account.rs:85] &*list = {
    "Agua": false,
    "(?i)^(Activo:DeGiro)|stockplan|broker": false,
    "(?i)(^Activo:Préstamos P2P)|(coinbase|crowdestor|pagatelia|urbanitae|mytriplea|civislend|plus capital|btc-e)": false,
    "Hogar:Comunidad": false,
    "(?i)^Activo:(.*:(cobas|fondo|azvalor|indexa)|Renta 4)": false,
    "(^Activo:Inmuebles)|(^Activo:Coches)": false,
    "^Activo:Holding": false,
    "(?i)Hogar:Comunidad": false,
    "(?i)(^Activo:Raisin Ceci|:raisin:depósitos)|(^Activo:(.*naranja|.*depósito|.*impuestos cero))": false,
    "(?i)^Activo:(Paypal|Efectivo|Transferwise|Revolut|.*corriente|.*nómina|.*cuenta común)": true,
}
[src/models/account.rs:85] &*list = {
    "Agua": false,
    "(?i)^(Activo:DeGiro)|stockplan|broker": false,
    "(?i)(^Activo:Préstamos P2P)|(coinbase|crowdestor|pagatelia|urbanitae|mytriplea|civislend|plus capital|btc-e)": false,
    "Hogar:Comunidad": false,
    "(?i)^Activo:(.*:(cobas|fondo|azvalor|indexa)|Renta 4)": false,
    "(^Activo:Inmuebles)|(^Activo:Coches)": false,
    "^Activo:Holding": false,
    "(?i)Hogar:Comunidad": false,
    "(?i)(^Activo:Raisin Ceci|:raisin:depósitos)|(^Activo:(.*naranja|.*depósito|.*impuestos cero))": false,
    "(?i)^Activo:(Paypal|Efectivo|Transferwise|Revolut|.*corriente|.*nómina|.*cuenta común)": true,
}

What I don't get is that there is a MISS both times. After the first miss, the entry should be added to the cache and (?i)vactivo be part of the hashmap.

Not sure what's going on.

(But thanks, I didn't know about the dbg macro)

Does self implement Clone? Is there any chance you have multiple copies with the same id, each with their own cache? You could check whether self is the same reference by printing it as a pointer:

println!("self = {:p}", self);
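
To make the suggestion above concrete, here is a minimal sketch of what a deep clone does to a per-object cache. The Account below is a hypothetical, stripped-down stand-in for the struct in this thread (only the matches field is kept), and addr is a helper invented for the example.

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Hypothetical stand-in for the thread's Account; only the cache field is kept.
#[derive(Debug, Clone)]
struct Account {
    matches: RefCell<HashMap<String, bool>>,
}

impl Account {
    fn new() -> Self {
        Account { matches: RefCell::new(HashMap::new()) }
    }

    // Address of this instance, handy for logging which object a call landed on.
    // (Helper made up for this example.)
    fn addr(&self) -> usize {
        self as *const Account as usize
    }
}

fn main() {
    let original = Account::new();
    // Deep clone: a second Account with its own, independent cache.
    let copy = original.clone();

    println!("original = {:p}", &original);
    println!("copy     = {:p}", &copy);
    println!("same instance? {}", std::ptr::eq(&original, &copy));

    // Inserting through the copy does not populate the original's cache.
    copy.matches.borrow_mut().insert("(?i)activo".to_string(), true);
    println!("original cache size: {}", original.matches.borrow().len());
    println!("copy addr as usize:  {}", copy.addr());
}
```

If the two printed addresses differ between calls that you expected to hit the same object, each call is filling a different cache, which would look exactly like a stream of misses.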

It is Clone.

#[derive(Debug, Clone)]
pub struct Account {
    name: String,
    origin: Origin,
    note: Option<String>,
    isin: Option<String>,
    aliases: HashSet<String>,
    check: Vec<String>,
    assert: Vec<String>,
    payee: Vec<Regex>,
    default: bool,
    matches: RefCell<HashMap<String, bool>>,
    hits: RefCell<usize>,
    misses: RefCell<usize>,
    id: usize,
}

Let me try the other thing. Thanks for the help.

Looks like I'm getting somewhere: it is not always the same reference, which it should be.

I'll figure it out eventually. But if I have an Rc<Account> (my struct is called account), shouldn't I always get the same reference when I clone it?

Thanks a lot.

Yes, as long as you are cloning the Rc. One way to double-check that you are cloning the Rc is to write Rc::clone(&acct) instead of acct.clone(). (Then you will get a compile-time error if acct is not actually an Rc.)

Another thing you could try is removing derive(Clone) from your struct (unless there are times when you do need to do a “deep” clone).
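
As a rough illustration of the difference, the sketch below contrasts cloning the Rc with cloning the value inside it. Account is again a hypothetical, cut-down version of the struct in this thread, and lookup is a made-up helper that only reports hit or miss; it is not the real is_match.

```rust
use std::cell::RefCell;
use std::collections::HashMap;
use std::rc::Rc;

// Hypothetical, cut-down Account: just the cache.
#[derive(Debug, Clone)]
struct Account {
    matches: RefCell<HashMap<String, bool>>,
}

// Made-up helper: returns true on a cache hit, false on a miss.
// On a miss the key is recorded so the next lookup hits.
fn lookup(account: &Account, key: &str) -> bool {
    let mut cache = account.matches.borrow_mut();
    if cache.contains_key(key) {
        true
    } else {
        cache.insert(key.to_string(), true);
        false
    }
}

fn main() {
    let shared = Rc::new(Account { matches: RefCell::new(HashMap::new()) });

    // Rc::clone copies the pointer; both handles see one Account and one cache.
    let handle = Rc::clone(&shared);
    println!("same allocation:   {}", Rc::ptr_eq(&shared, &handle));
    println!("first lookup hit:  {}", lookup(&shared, "k"));
    println!("second lookup hit: {}", lookup(&handle, "k"));

    // Cloning the Account itself (explicitly, through a deref) makes an
    // independent copy whose cache starts as a snapshot of the original.
    let deep: Account = (*shared).clone();
    println!("deep lookup hit:   {}", lookup(&deep, "other"));
    println!("shared cache size: {}", shared.matches.borrow().len());
    println!("deep cache size:   {}", deep.matches.borrow().len());
}
```

Rc::ptr_eq is a cheap way to assert in a test or a debug build that two handles really point at the same Account.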

Turns out this is what's happening.

When I do this, it's not fine:

println!("Before clone: {:p}", posting.account);
let account = posting.account.clone();
println!("After  clone: {:p}", account);

When I do this, it is fine:

println!("Before clone: {:p}", posting.account);
let account = Rc::clone(&posting.account);
println!("After  clone: {:p}", account);

Thanks so much. I'll change it in all the places I do this.

Interesting! Out of curiosity, what is the type of the posting variable, and what type does its account field have?

The posting is a &Posting.

pub fn eval(
    node: &Node,
    posting: &Posting,
    transaction: &Transaction<Posting>,
    commodities: &mut List<Currency>,
)
...

This is a Posting.

#[derive(Debug, Clone)]
pub struct Posting {
    pub(crate) account: Rc<Account>,
    pub amount: Option<Money>,
    pub balance: Option<Money>,
    pub cost: Option<Cost>,
    pub kind: PostingType,
    pub comments: Vec<Comment>,
    pub tags: RefCell<Vec<Tag>>,
    pub payee: Option<Rc<Payee>>,
    pub transaction: RefCell<Weak<Transaction<Posting>>>,
    pub origin: PostingOrigin,
}

Huh. I can't figure out why posting.account.clone() would not resolve to Rc::clone in that code.

This is not true; I read the log wrong.

I read the log wrong.
