What is the most efficient way to convert `&str` of `HashMap<&str, ()>` to lowercase?

Hi,

Let's say we have the following code:

use std::collections::HashMap;

fn main() {
    let src = "FÖÖ BÄR baz foo abc def BÄR";
    let mut case_sensitive = HashMap::new();
    let mut case_insensitive = HashMap::new();
    for x in src.split(' ') {
        case_sensitive.entry(x).or_insert(());
        // Is there anything better than allocating a new `String`?
        let x_i: String = x.to_lowercase();
        case_insensitive.entry(x_i).or_insert(());
    }
    dbg!(case_sensitive);
    dbg!(case_insensitive);
}

Playground

What would be the most efficient way to get a HashMap whose keys are lower-case, so it's possible to do lookups that are case-insensitive?

One idea I had, was to allocate a String to which the case-insensitive values are written, and then lower-case this string:

use std::collections::HashMap;

fn main() {
    let src = "FÖÖ BÄR baz foo abc def BÄR";
    let mut case_sensitive = HashMap::new();
    let mut case_insensitive = String::new();
    for x in src.split(' ') {
        case_sensitive.entry(x).or_insert(());
        case_insensitive.push_str(x);
        case_insensitive.push(' ');
    }
    let case_insensitive = case_insensitive.to_lowercase();
    let case_insensitive: HashMap<&str, ()> = case_insensitive
        .trim_end()
        .split(' ')
        .map(|x| (x, ()))
        .collect();
    dbg!(case_sensitive);
    dbg!(case_insensitive);
}

Playground

This would result in three allocations (which would be fine).

Another approach I thought of, was to use something like SmartString, but I can't find a crate that has such a type that has to_lowercase.

One problem is, that &str can contain non-ASCII characters (without that requirement str::as_bytes + char::make_ascii_lowercase + str::from_utf8 probably could be used).

The input I actually have is cssparser::CowRcStr (which implements Deref<Target = str>), in case that would make a difference.

smallstr::SmallString will only make a heap allocation if the held string is too large for its statically-allocated buffer. It implements FromIterator<char>, so you should be able to do something like this to avoid the intermediate String value (untested):

let case_insensitive_key: SmallString<[u8;16]> =
    x.chars().flat_map(char::to_lowercase).collect();

EDIT: Looking at the SmartString docs, this approach would work with it too.

1 Like

@2e71828: Thank you!

I'm curious if there are other significantly different approaches, so I'll keep this as "un-solved" a bit longer.

If you only lowercase for sake of comparison, then the most efficient way is not to convert at all, but wrap them in case-ignorant-comparison wrapper:

4 Likes

@kornel: Nice! This should work for my particular use-case.

Are you sure, this always will be faster (on average), than creating a SmartString with @2e71828's approach (if the string is small enough to not allocate on the heap)? Wouldn't Unicase have to do the case-folding at every comparison?

Another approach If many of the strings are already lower case, you could store a Cow<'_,str> in the lowercase version to only allocate when there's an uppercase character. Since your inputs are kind of a cow type you might be able to use them directly

use std::collections::HashSet;
use cssparser::CowRcStr;

fn main() {
    println!("Hello, world!");

    let src = "FÖÖ BÄR baz foo abc def BÄR";

    let src:Vec<_> = src.split(' ').map(CowRcStr::from).collect();

    let mut case_sensitive = HashSet::new();
    let mut case_insensitive = HashSet::new();
    let mut buf = String::new();

    for x in src {
        buf.push_str(&x);
        buf.make_ascii_lowercase();

        if buf == x.as_ref() {
            // CowRcStr should be very cheap to clone
            case_insensitive.insert(x.clone());
        } else {
            case_insensitive.insert(CowRcStr::from(buf.clone()));
        }

        case_sensitive.insert(x);
        buf.clear();
    }
}

@Cocalus: Thank you, this solution looks interesting too!

I've selected @2e71828's approach as the solution because it answers my question and is a great answer for it, but I decided to use @kornel's approach because it fits my use-case well and is a bit easier to implement (including the remainder of the program), and might even be faster than other solutions.

Thanks everybody!