Sorry if my title is a bit gibberish. I'm writing an algorithm that detects abbreviations in a String. The detection part is more or less done; what I'm stuck on is returning the found abbreviations in a suitable format.
I have a BTreeMap<usize, Option<String>>, where the usize is the index of a word in the input String after splitting it on each whitespace. I'm not super picky about the output format, but it should use the index of each char instead of each word. So, for example, if I have the String "i like abc", the map I currently have would be { 0: None, 1: None, 2: Some("abc") }. What I'm currently trying to convert it to is a HashMap<Range<usize>, String>, where the Range represents the characters that the abbreviation spans.
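To make the target format concrete, here is a minimal sketch of the word-index to char-range mapping described above. The helper name `word_char_ranges` is hypothetical, and it assumes the text is split on single spaces exactly as with `split(' ')`. It counts chars rather than bytes; `str::len` returns a byte count, so the two only agree for ASCII input.

```rust
use std::ops::Range;

/// Hypothetical helper: the char-index range of every piece produced by
/// `text.split(' ')`, in order. Empty pieces (from repeated spaces) get
/// an empty range.
fn word_char_ranges(text: &str) -> Vec<Range<usize>> {
    let mut ranges = Vec::new();
    let mut start = 0; // char index where the current word begins
    for word in text.split(' ') {
        let end = start + word.chars().count();
        ranges.push(start..end);
        start = end + 1; // skip the single space that separated the words
    }
    ranges
}
```

For "i like abc" this gives 0..1, 2..6 and 7..10, so the abbreviation at word index 2 would map to 7..10.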
I've been struggling with the algorithm to do this. I got pretty close with the following:
    let mut res = HashMap::<Range<usize>, String>::new();
    let mut i = 0;
    let mut j;
    for (index, opt) in abbrs {
        match opt { // generate range
            Some(abbr) => {
                j = i + abbr.len();
                res.insert(i..j, abbr);
            }
            None => {
                i += words[index].len();
            }
        }
        i += 1; // to make up for space
    }
But the indices were off. I think it may be related to the fact that I run several regex passes over the String. To keep the offsets stable I don't remove a match entirely; instead I substitute every character in the match with a space.
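One thing that stands out in the loop above, independent of the regex passes: when a word is an abbreviation, `i` only moves forward by 1, so every later range starts too early. A minimal sketch of a loop that advances past the word in both cases follows. The function name `abbr_ranges` is hypothetical, and it assumes `words` came from `split(" ")` and that the map has one entry per word with keys 0..n, as `enumerate()` would produce. Offsets here are byte offsets, because `String::len` counts bytes.

```rust
use std::collections::{BTreeMap, HashMap};
use std::ops::Range;

/// Sketch: turn per-word results into byte ranges of the split text.
/// Assumes every word is followed by exactly one separator byte.
fn abbr_ranges(
    words: &[String],
    abbrs: BTreeMap<usize, Option<String>>,
) -> HashMap<Range<usize>, String> {
    let mut res = HashMap::new();
    let mut start = 0; // byte offset of the current word
    for (index, opt) in abbrs {
        // Use the original word's length, not the lowercased/trimmed copy,
        // since those transformations can change the length.
        let end = start + words[index].len();
        if let Some(abbr) = opt {
            res.insert(start..end, abbr);
        }
        // Advance past the word and its trailing space in every case.
        start = end + 1;
    }
    res
}
```

With the words of "abc or xyz" and both "abc" and "xyz" flagged, this yields 0..3 and 7..10, whereas the original loop would produce 0..3 and 4..7.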
My full function:
    pub fn detect_acronyms(text: String, excl_dict: &Vec<String>, add_dict: &Vec<String>) -> HashMap<Range<usize>, String> {
        let rm_re = Regex::new("[\\.\\?!,\\(\\)\\d:;-_+=|]").unwrap();
        let tag_re = Regex::new("\\*\\*[^\\[]*\\[[^\\]]*\\]").unwrap();
        let vowel_re = Regex::new("[aeuoi]").unwrap();
        // returns the ranges of char indices that were replaced and the string with matched characters replaced with spaces
        let (mut ranges, mapped_no_tags) = replace_with_space(&text, &tag_re);
        let (cleaned_ranges, mapped_clean_text) = replace_with_space(&mapped_no_tags, &rm_re);
        let words = mapped_clean_text.split(" ").map(|x| x.to_owned()).collect::<Vec<String>>();
        ranges.extend(cleaned_ranges);
        let mut impossible_bigrams_rdr = csv::Reader::from_path("./data/dict/bad_bigrams.csv").unwrap();
        let mut impossible_bigrams_re_match = String::new();
        let mut bigram_rcrds = impossible_bigrams_rdr.records();
        impossible_bigrams_re_match.push_str(&bigram_rcrds.next().unwrap().unwrap()[0]);
        for bigram in bigram_rcrds {
            let txt = bigram.unwrap()[0].to_string();
            impossible_bigrams_re_match.push_str(&format!("|{}", txt));
        }
        let bigram_re = Regex::new(&format!("[^ ]*({})[^ ]*", impossible_bigrams_re_match)).unwrap();
        // Rules:
        // * Acronyms will always be < 5 characters, unless they fit an "always" rule
        // * Acronyms won't be in dict
        // * Every word in the medical abbreviation dictionary is an acronym
        // * Every word with an illegal bigram is an acronym
        let abbrs: BTreeMap<usize, Option<String>> = words
            .iter()
            .enumerate()
            .map(|(i, word)| (i, word.to_lowercase().trim().to_string()))
            .map(|(i, word)| if word.len() >= 1 { (i, Some(word)) } else { (i, None) }) // len >= 1 is so important it gets its own filter, oohlala
            .map(|(i, word)|
                {
                    match word {
                        Some(abbr) => {
                            if (
                                abbr.len() < 5 &&
                                !excl_dict.contains(&abbr)
                            ) || (
                                bigram_re.is_match(&abbr) ||
                                !vowel_re.is_match(&abbr) ||
                                add_dict.contains(&abbr)
                            ) {
                                (i, Some(abbr))
                            } else {
                                (i, None)
                            }
                        }
                        None => (i, None)
                    }
                }
            )
            .collect(); // mmmm delicious spaghetti
        let mut res = HashMap::<Range<usize>, String>::new();
        let mut i = 0;
        let mut j;
        for (index, opt) in abbrs {
            match opt { // generate range
                Some(abbr) => {
                    j = i + abbr.len();
                    res.insert(i..j, abbr);
                }
                None => {
                    i += words[index].len();
                }
            }
            i += 1; // to make up for space
        }
        res
    }
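Two details of the cleaning step may also matter for the offsets; both are sketched below with the standard library only. First, inside a character class `;-_` is a range from ';' to '_', which covers the uppercase ASCII letters but not '-' itself, so `rm_re` as written would blank out capitals. Second, blanking only keeps offsets aligned if each replaced char is swapped for as many spaces as it has bytes. The names `blank_out` and `in_semicolon_underscore_range` are hypothetical, and `blank_out` is only a stand-in for `replace_with_space`, whose body isn't shown here: it takes a predicate instead of a Regex and doesn't return the replaced ranges.

```rust
/// Hypothetical stand-in for `replace_with_space`: blanks every char for
/// which `should_blank` returns true, keeping the byte length unchanged.
fn blank_out(text: &str, should_blank: impl Fn(char) -> bool) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        if should_blank(ch) {
            // One space per byte of the original char keeps byte offsets aligned.
            for _ in 0..ch.len_utf8() {
                out.push(' ');
            }
        } else {
            out.push(ch);
        }
    }
    out
}

/// Mirrors the `;-_` part of the character class, which is read as a range.
fn in_semicolon_underscore_range(ch: char) -> bool {
    (';'..='_').contains(&ch)
}
```

If the hyphen is meant literally, escaping it or moving it to the end of the class (for example `[...:;_+=|-]`) avoids the range interpretation.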
Thank you!!