Converting HashMap with word indices to HashMap with letter indices

Sorry if my title is a bit gibberish. I'm creating an algorithm that detects abbreviations in a String. I've more or less completed the detecting part; now I'm just trying to return the found abbreviations in a suitable format.

I have a BTreeMap<usize, Option<String>>, where the usize is the index of the word in the input String after splitting on whitespace. I'm not super picky about the output format, but it should use the index of each char instead of each word. So, for example, if I have the String "i like abc", the map I currently have would be { 1: None, 2: None, 3: Some("abc") }. What I'm currently trying to convert it to is a HashMap<Range<usize>, String>, where the Range represents the characters that the abbreviation spans.

I've been struggling with the algorithm to do this. I got pretty close with the following:

let mut res = HashMap::<Range<usize>, String>::new();

    let mut i = 0;
    let mut j;
    for (index, opt) in abbrs {
        match opt { // generate range
            Some(abbr) => {
                j = i + abbr.len();
                res.insert(i..j, abbr);
            }
            None => {
                i += words[index].len();
            }
        }
        i += 1; // to make up for space
    }

But the indices were off. I think it may be related to the fact that I'm doing a lot of Regex stuff to the String, but I tried to limit the damage: instead of completely removing a match, I just substitute every character in the match with a space.

My full function:

pub fn detect_acronyms(text: String, excl_dict: &Vec<String>, add_dict: &Vec<String>) -> HashMap<Range<usize>, String> {
    let rm_re = Regex::new("[\\.\\?!,\\(\\)\\d:;-_+=|]").unwrap();
    let tag_re = Regex::new("\\*\\*[^\\[]*\\[[^\\]]*\\]").unwrap();
    let vowel_re = Regex::new("[aeuoi]").unwrap();

    // returns the ranges of char indices that were replaced and the string with matched characters replaced with spaces
    let (mut ranges, mapped_no_tags) = replace_with_space(&text, &tag_re);
    let (cleaned_ranges, mapped_clean_text) = replace_with_space(&mapped_no_tags, &rm_re);
    let words = mapped_clean_text.split(" ").map(|x| x.to_owned()).collect::<Vec<String>>();

    ranges.extend(cleaned_ranges);

    let mut impossible_bigrams_rdr = csv::Reader::from_path("./data/dict/bad_bigrams.csv").unwrap();
    let mut impossible_bigrams_re_match = String::new();
    let mut bigram_rcrds = impossible_bigrams_rdr.records();
    impossible_bigrams_re_match.push_str(&bigram_rcrds.next().unwrap().unwrap()[0]);

    for bigram in bigram_rcrds {
        let txt = bigram.unwrap()[0].to_string();
        impossible_bigrams_re_match.push_str(&format!("|{}", txt));
    }

    let bigram_re = Regex::new(&format!("[^ ]*({})[^ ]*", impossible_bigrams_re_match)).unwrap();

    // Rules:
    // * Acronyms will always be < 5 characters, unless they fit an "always" rule
    // * Acronyms won't be in dict
    // * Every word in the medical abbreviation dictionary is an acronym
    // * Every word with an illegal bigram is an acronym

    let abbrs: BTreeMap<usize, Option<String>> = words
        .iter()
        .enumerate()
        .map(|(i, word)| (i, word.to_lowercase().trim().to_string()))
        .map(|(i, word)| if word.len() >= 1 { (i, Some(word)) } else { (i, None) }) // len >= 1 is so important it gets its own filter, oohlala
        .map(|(i, word)| 
            {
                match word {
                    Some(abbr) => {
                        if  (
                                abbr.len() < 5 && 
                                !excl_dict.contains(&abbr)
                            ) || (
                                bigram_re.is_match(&abbr) ||
                                !vowel_re.is_match(&abbr) ||
                                add_dict.contains(&abbr)
                            ) {
                                (i, Some(abbr))
                        } else {
                            (i, None)
                        }
                    }
                    None => (i, None)
                }
            }
        )
        .collect(); // mmmm delicious spaghetti 

    let mut res = HashMap::<Range<usize>, String>::new();

    let mut i = 0;
    let mut j;
    for (index, opt) in abbrs {
        match opt { // generate range
            Some(abbr) => {
                j = i + abbr.len();
                res.insert(i..j, abbr);
            }
            None => {
                i += words[index].len();
            }
        }
        i += 1; // to make up for space
    }

    res
}

Thank you!!

This is very Perl-esque in that it's using Regex like a sledgehammer and making a lot of assumptions about the incoming text, but let's see what we can do.

Pokes at code awhile

OK, the heart of your question is really: how do you split on ASCII spaces and know the span of each returned "word"?

Frankly, there should be an offset_of function so you didn't have to keep track of the starting offset yourself, but there isn't. Let's consult the documentation on split to check for any gotchas... Splitting on a character seems to have a sane implementation: splitting a string that contains n occurrences of the character will iterate over n+1 substrings.

This means that the first "word" is always at offset 0, and every returned "word" except the last is followed by a space.
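To make the n+1 behaviour concrete, here is a minimal sketch. The helper name `pieces` is made up for illustration; it only wraps `str::split` on a single ASCII space.

```rust
/// Hypothetical helper: splits on single ASCII spaces and collects the pieces.
/// A string containing n spaces always produces n + 1 pieces, some possibly empty.
fn pieces(text: &str) -> Vec<&str> {
    text.split(' ').collect()
}
```

With this, `pieces("i like abc")` gives `["i", "like", "abc"]` (two spaces, three pieces), and `pieces("a  b ")` gives `["a", "", "b", ""]` (three spaces, four pieces). The empty pieces matter here: if matches were blanked out with spaces, as described in the question, runs of spaces will show up as empty "words" that still occupy one delimiter each.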

If we're keeping track of the offset as we go, we don't really care about our offset once we've seen the last item from the iterator, so we can just act like every item is followed by one delimiter. Something like this:

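use std::ops::Range;

// Yields each piece of `text` together with its byte range within `text`.
fn split_ranges(text: &str, delimiter: char) -> impl Iterator<Item = (Range<usize>, &str)> {
    let mut offset = 0;
    text.split(delimiter)
        .map(move |word| {
            let range = offset..offset + word.len();
            // Skip the piece plus the delimiter after it (1 byte for an ASCII space).
            offset = range.end + delimiter.len_utf8();
            (range, word)
        })
}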
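As a rough sketch of how the same offset tracking could feed the HashMap<Range<usize>, String> from the question: one thing that looks like it throws the indices off in the original loop is that `i` only advances by the word's length in the `None` branch, so every abbreviation found shifts the later ranges. The function name `abbreviation_ranges` and the `is_abbr` predicate below are hypothetical; the predicate stands in for the dictionary and bigram rules, which are not reproduced here.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Maps the byte range of every word accepted by `is_abbr` to the word itself.
/// `is_abbr` is a hypothetical stand-in for the real detection rules.
fn abbreviation_ranges<F>(text: &str, is_abbr: F) -> HashMap<Range<usize>, String>
where
    F: Fn(&str) -> bool,
{
    let mut res = HashMap::new();
    let mut offset = 0;
    for word in text.split(' ') {
        let range = offset..offset + word.len();
        // Advance past the word and its trailing space for every word,
        // whether or not it turns out to be an abbreviation.
        offset = range.end + 1;
        if !word.is_empty() && is_abbr(word) {
            res.insert(range, word.to_string());
        }
    }
    res
}
```

For example, `abbreviation_ranges("i like abc", |w| w == "abc")` should contain the single entry `7..10 => "abc"`. Note that these are byte offsets, which is what slicing a &str expects; they only coincide with char indices for ASCII text. If char indices are really what's needed, counting `word.chars().count()` instead of `word.len()` would be one way to get them.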