Unicode tokenization

If you want something untested that you can just paste into your code, something like this should work on simple Tamil text:

pub fn get_letters(s: &str) -> Vec<&str> {
    let mut result = vec![];
    let mut start = 0;
    while start < s.len() {
        // Find the byte offset of the next non-combining character after
        // the base character at `start`; that offset ends the current
        // letter. If every remaining character is combining, the letter
        // runs to the end of the string.
        let end = s[start..].char_indices().skip(1)
            .find(|(_, c)| !is_tamil_combining_char(*c))
            .map(|(i, _)| start + i)
            .unwrap_or(s.len());
        result.push(&s[start..end]);
        start = end;
    }
    result
}

fn is_tamil_combining_char(c: char) -> bool {
    matches!(c,
        '\u{0B82}' // Anusvara
        // Vowel signs
        | '\u{0BBE}' | '\u{0BBF}' | '\u{0BC0}' | '\u{0BC1}' | '\u{0BC2}' | '\u{0BC6}'
        | '\u{0BC7}' | '\u{0BC8}' | '\u{0BCA}' | '\u{0BCB}' | '\u{0BCC}'
        | '\u{0BCD}' // Virama
        | '\u{0BD7}' // Au length mark
        | '\u{200C}' // ZWNJ
        | '\u{200D}' // ZWJ
    )
}


Be warned again: this will return incorrect results for most non-Tamil scripts, including emoji and some Latin diacritics. It can also be incorrect for some Tamil text, if that text uses combining characters not found in the list above.
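To make both the intended behavior and the failure mode concrete, here is a standalone check (the example strings are my own, and the listing repeats the functions above so it compiles on its own):

```rust
pub fn get_letters(s: &str) -> Vec<&str> {
    let mut result = vec![];
    let mut start = 0;
    while start < s.len() {
        // End the current letter at the next non-combining character.
        let end = s[start..].char_indices().skip(1)
            .find(|(_, c)| !is_tamil_combining_char(*c))
            .map(|(i, _)| start + i)
            .unwrap_or(s.len());
        result.push(&s[start..end]);
        start = end;
    }
    result
}

fn is_tamil_combining_char(c: char) -> bool {
    matches!(c,
        '\u{0B82}'
        | '\u{0BBE}' | '\u{0BBF}' | '\u{0BC0}' | '\u{0BC1}' | '\u{0BC2}' | '\u{0BC6}'
        | '\u{0BC7}' | '\u{0BC8}' | '\u{0BCA}' | '\u{0BCB}' | '\u{0BCC}'
        | '\u{0BCD}' | '\u{0BD7}' | '\u{200C}' | '\u{200D}'
    )
}

fn main() {
    // தமிழ் = த + மி (ம + vowel sign ி) + ழ் (ழ + virama ்)
    assert_eq!(get_letters("தமிழ்"), vec!["த", "மி", "ழ்"]);

    // The failure mode: U+0301 (combining acute) is not in the list,
    // so the decomposed grapheme "é" is split into two pieces.
    assert_eq!(get_letters("e\u{0301}"), vec!["e", "\u{0301}"]);

    println!("ok");
}
```

If you need this to work across scripts, the usual answer is grapheme cluster segmentation (UAX #29), e.g. via the `unicode-segmentation` crate, rather than a hand-maintained character list.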
