What is the most efficient way to convert `&str` of `HashMap<&str, ()>` to lowercase?

Hi,

Let's say we have the following code:

use std::collections::HashMap;

fn main() {
    let src = "FÖÖ BÄR baz foo abc def BÄR";
    let mut case_sensitive = HashMap::new();
    let mut case_insensitive = HashMap::new();
    for x in src.split(' ') {
        case_sensitive.entry(x).or_insert(());
        // Is there anything better than allocating a new `String`?
        let x_i: String = x.to_lowercase();
        case_insensitive.entry(x_i).or_insert(());
    }
    dbg!(case_sensitive);
    dbg!(case_insensitive);
}

Playground

What would be the most efficient way to get a HashMap whose keys are lower-case, so it's possible to do lookups that are case-insensitive?

One idea I had, was to allocate a String to which the case-insensitive values are written, and then lower-case this string:

use std::collections::HashMap;

fn main() {
    let src = "FÖÖ BÄR baz foo abc def BÄR";
    let mut case_sensitive = HashMap::new();
    let mut case_insensitive = String::new();
    for x in src.split(' ') {
        case_sensitive.entry(x).or_insert(());
        case_insensitive.push_str(x);
        case_insensitive.push(' ');
    }
    let case_insensitive = case_insensitive.to_lowercase();
    let case_insensitive: HashMap<&str, ()> = case_insensitive
        .trim_end()
        .split(' ')
        .map(|x| (x, ()))
        .collect();
    dbg!(case_sensitive);
    dbg!(case_insensitive);
}

Playground

This would result in three allocations (which would be fine).

Another approach I thought of, was to use something like SmartString, but I can't find a crate that has such a type that has to_lowercase.

One problem is, that &str can contain non-ASCII characters (without that requirement str::as_bytes + char::make_ascii_lowercase + str::from_utf8 probably could be used).

The input I actually have is cssparser::CowRcStr (which implements Deref<Target = str>), in case that would make a difference.

smallstr::SmallString will only make a heap allocation if the held string is too large for its statically-allocated buffer. It implements FromIterator<char>, so you should be able to do something like this to avoid the intermediate String value (untested):

let case_insensitive_key: SmallString<[u8;16]> =
    x.chars().flat_map(char::to_lowercase).collect();

EDIT: Looking at the SmartString docs, this approach would work with it too.

1 Like

@2e71828: Thank you!

I'm curious if there are other significantly different approaches, so I'll keep this as "un-solved" a bit longer.

If you only lowercase for sake of comparison, then the most efficient way is not to convert at all, but wrap them in case-ignorant-comparison wrapper:

4 Likes

@kornel: Nice! This should work for my particular use-case.

Are you sure, this always will be faster (on average), than creating a SmartString with @2e71828's approach (if the string is small enough to not allocate on the heap)? Wouldn't Unicase have to do the case-folding at every comparison?

Another approach If many of the strings are already lower case, you could store a Cow<'_,str> in the lowercase version to only allocate when there's an uppercase character. Since your inputs are kind of a cow type you might be able to use them directly

use std::collections::HashSet;
use cssparser::CowRcStr;

fn main() {
    println!("Hello, world!");

    let src = "FÖÖ BÄR baz foo abc def BÄR";

    let src:Vec<_> = src.split(' ').map(CowRcStr::from).collect();

    let mut case_sensitive = HashSet::new();
    let mut case_insensitive = HashSet::new();
    let mut buf = String::new();

    for x in src {
        buf.push_str(&x);
        buf.make_ascii_lowercase();

        if buf == x.as_ref() {
            // CowRcStr should be very cheap to clone
            case_insensitive.insert(x.clone());
        } else {
            case_insensitive.insert(CowRcStr::from(buf.clone()));
        }

        case_sensitive.insert(x);
        buf.clear();
    }
}

@Cocalus: Thank you, this solution looks interesting too!

I've selected @2e71828's approach as the solution because it answers my question and is a great answer for it, but I decided to use @kornel's approach because it fits my use-case well and is a bit easier to implement (including the remainder of the program), and might even be faster than other solutions.

Thanks everybody!

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.