Unicode tokenization

Hello there- Rust noob question.
I’m developing a Tamil language (தமிழ்) tokenizer using Rust.

This is what I have working now, but it seems excessive to me. How could I improve it?

/** Split a tamil-unicode stream into
 * tamil characters (individuals).
 */
 pub fn get_letters(x:&str) -> Vec<String> {
     /* Splits the @word into a character-list of tamil/english
     *characters present in the stream. This routine provides a robust tokenizer
     *for Tamil unicode letters. */
     let mut v: Vec<String> = Vec::new();
     let mut tmp:String=String::from("");
     for (idx,c) in x.chars().enumerate() {
         if x.is_char_boundary(idx) {
             if ( tmp.len() != 0 ) {
                 v.push(format!("{}",tmp));
                 v.push(format!("{}",c));
             } else {
                 v.push(format!("{}",c));
             }
             tmp.clear();
         } else {
             tmp =  format!("{}{}",tmp,c);
         }
     }
     if tmp.len() != 0 {
         v.push(tmp);
     }
     v
 }

The `x.is_char_boundary(idx)` check doesn't make sense, so I doubt it's "working" in the way you expect. idx counts characters, because it comes from enumerate() on an iterator over characters, but is_char_boundary takes a byte index.

Usually the suggestion for fixing code like this is to use char_indices instead. But idx isn't actually used inside the loop anyway, so why would you need to check whether it's a character boundary (which would always be the case if it were the actual index of the character c)? And how does this distinguish between Tamil and English characters? I'm confused.
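To make the mismatch concrete, here is a minimal sketch comparing the positions that chars().enumerate() and char_indices() yield for the same string. The helper names char_counts and byte_offsets are made up for this illustration. Only the second set of numbers is meaningful as an argument to is_char_boundary.

```rust
/// Positions produced by `chars().enumerate()`: a running count of chars.
fn char_counts(s: &str) -> Vec<usize> {
    s.chars().enumerate().map(|(i, _)| i).collect()
}

/// Positions produced by `char_indices()`: the byte offset of each char.
fn byte_offsets(s: &str) -> Vec<usize> {
    s.char_indices().map(|(i, _)| i).collect()
}

fn main() {
    let s = "தமிழ்";
    // Each of these five code points takes three bytes in UTF-8.
    println!("{:?}", char_counts(s)); // [0, 1, 2, 3, 4]
    println!("{:?}", byte_offsets(s)); // [0, 3, 6, 9, 12]

    // Every offset from char_indices() is a char boundary by construction...
    assert!(byte_offsets(s).iter().all(|&i| s.is_char_boundary(i)));
    // ...but a char count such as 1 or 2 lands in the middle of a code point.
    assert!(!s.is_char_boundary(1));
    assert!(!s.is_char_boundary(2));
}
```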

if tmp.is_empty() {
v.push(tmp);
tmp.push(c);
4 Likes

You can use the unicode-segmentation library to handle grapheme clusters, words, and sentences for you.

10 Likes

EDIT: This is incorrect, see new version below.

I don't know how Unicode's Tamil language encoding works, but it looks like you're trying to get individual codepoints. That's what chars() does already; does this do what you want:

 pub fn get_letters(x:&str) -> Vec<String> {
    x.chars().map(Into::<String>::into).collect()
 }

My code works, as I’ve tested it.

Sorry, there's not enough context here.

Let me explain: the code is supposed to take “தமிழ்” and return “த”, “மி”, “ழ்”.

Personally, I'd probably write it like this:

use unicode_segmentation::UnicodeSegmentation; // 1.7.1

pub fn get_letters(x:&str) -> Vec<String> {
    x.graphemes(true).map(Into::<String>::into).collect()
 }
 
fn main() {
    dbg!(get_letters("தமிழ்"));
}



   Compiling playground v0.0.1 (/playground)
    Finished dev [unoptimized + debuginfo] target(s) in 1.33s
     Running `target/debug/playground`
[src/main.rs:8] get_letters("தமிழ்") = [
    "த",
    "மி",
    "ழ\u{bcd}",
]
2 Likes

The Unicode Tamil block uses combining characters for vowel signs and puḷḷi/virama. This means that a single Tamil glyph might be represented by more than one Unicode code point (or char).

For example, this code shows that the glyph மி is encoded as two codepoints, U+0BAE (TAMIL LETTER MA) followed by U+0BBF (TAMIL VOWEL SIGN I), and ழ் is encoded as U+0BB4 (TAMIL LETTER LLLA) followed by U+0BCD (TAMIL SIGN VIRAMA):

let s = "தமிழ்";
let chars: Vec<char> = s.chars().collect();
println!("{:?}", chars);
// prints ['த', 'ம', 'ி', 'ழ', '\u{bcd}']

The Rust standard library doesn't have a built-in way to detect combining marks, or letters made of multiple code points. Methods like is_char_boundary can only find the boundaries between code points. If you want to treat each glyph as a single unit, you'll need to use the unicode-segmentation crate mentioned above to break it into Extended Grapheme Clusters instead:

use unicode_segmentation::UnicodeSegmentation;

let s = "தமிழ்";
let graphemes: Vec<&str> = s.graphemes(true).collect();
println!("{:?}", graphemes);
// prints ["த", "மி", "ழ\u{bcd}"]    

Note: The confusion between byte indices and char indices in your current code causes it to fail on strings like this:

let s = "தமிழ் Hello";
println!("{:?}", get_letters(s));
// prints ["த", "மி", "ழ", "\u{bcd} ", "H", "el", "l", "o"]

Notice that it joined 'e' and 'l' into a single “letter.”

Rust Playground with all the code from this comment.

6 Likes

Well, I don’t think it works:

fn main() {
    dbg!(get_letters("தமிழ்aaaaaaaaaதமிழ்"));
}
    Finished dev [unoptimized + debuginfo] target(s) in 1.88s
     Running `target/debug/playground`
[src/main.rs:30] get_letters("தமிழ்aaaaaaaaaதமிழ்") = [
    "த",
    "மி",
    "ழ",
    "\u{bcd}a",
    "a",
    "aa",
    "a",
    "aa",
    "a",
    "aத",
    "ம",
    "ி",
    "ழ",
    "\u{bcd}",
]


As @trentj already mentioned, your code is mixing up different kinds of string indexing. In effect it uses the lengths of the UTF-8 codepoint encodings (measured in bytes) to determine the lengths of the graphemes (measured in codepoints). Your code only works correctly by chance on certain inputs.
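A rough sketch of where the "by chance" comes from: every code point in the test word needs three bytes in UTF-8, while ASCII letters need one, so a byte-boundary test applied to char counts happens to line up with some Tamil input and drifts once one-byte characters are mixed in. The helper utf8_lengths is a name invented here for illustration.

```rust
/// Number of UTF-8 bytes used by each code point in `s`.
fn utf8_lengths(s: &str) -> Vec<usize> {
    s.chars().map(char::len_utf8).collect()
}

fn main() {
    // Pure Tamil input: every code point is three bytes wide.
    println!("{:?}", utf8_lengths("தமிழ்")); // [3, 3, 3, 3, 3]
    // Mixed input: the ASCII letter is only one byte wide.
    println!("{:?}", utf8_lengths("ழ்a")); // [3, 3, 1]
}
```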

1 Like

Thanks for all the replies. I don’t want a third party solution. I’ll repost once I’ve settled on something that's workable for my needs/uses.

Doing this correctly requires Unicode tables that are not included in the Rust standard library. You can write code to download and parse those tables yourself, but I recommend using a widely-tested library for it. Note that the maintainer of unicode-segmentation is a member of the Rust core team, and the Rust compiler itself relies on this library. It's not just some random code from the Internet.

If you only care about grapheme clusters in the Tamil script, you could hard-code the list of combining characters from that block. But you would still get incorrect results for multi-code-point graphemes in other scripts (including ones sometimes found in English text, such as emoji and various diacritics).

18 Likes

Also note, the Rust standard library is not meant to be a complete toolbox for processing text, or for most other tasks. It's designed to be used in combination with external crates.

4 Likes

If you want something untested that you can just paste into your code, something like this should work on simple Tamil text:

pub fn get_letters(s: &str) -> Vec<&str> {
    let mut result = vec![];
    let mut start = 0;
    while start < s.len() {
        let end = s[start..].char_indices().skip(1)
            .find(|(_, c)| !is_tamil_combining_char(*c))
            .map(|(i, _)| start + i)
            .unwrap_or(s.len());
        result.push(&s[start..end]);
        start = end;
    }
    result
}

fn is_tamil_combining_char(c: char) -> bool {
    matches!(c,
        '\u{0B82}' // Anusvara
        // Vowel signs
        | '\u{0BBE}' | '\u{0BBF}' | '\u{0BC0}' | '\u{0BC1}' | '\u{0BC2}' | '\u{0BC6}'
        | '\u{0BC7}' | '\u{0BC8}' | '\u{0BCA}' | '\u{0BCB}' | '\u{0BCC}'
        | '\u{0BCD}' // Virama
        | '\u{0BD7}' // Au length mark
        | '\u{200C}' // ZWNJ (zero width non-joiner)
        | '\u{200D}' // ZWJ (zero width joiner)
    )
}


Be warned again, this will return incorrect results for most non-Tamil scripts, including emoji and some Latin diacritics. It could also be incorrect for some Tamil text, if it uses combining characters that are not found in the list above.
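The first half of that warning is easy to check. The sketch below restates the function from this post, with the predicate condensed into ranges that should cover the same code points as the list above (that condensing is mine, so treat it as an assumption), and feeds it a Latin "e" followed by U+0301 (COMBINING ACUTE ACCENT). A grapheme-aware splitter would keep those two together; the hard-coded list does not know about U+0301 and splits them.

```rust
/// Hard-coded Tamil combining characters, written as ranges.
/// Assumed equivalent to the explicit list in the post above.
fn is_tamil_combining_char(c: char) -> bool {
    matches!(c,
        '\u{0B82}'                 // Anusvara
        | '\u{0BBE}'..='\u{0BC2}'  // Vowel signs
        | '\u{0BC6}'..='\u{0BC8}'
        | '\u{0BCA}'..='\u{0BCC}'
        | '\u{0BCD}'               // Virama
        | '\u{0BD7}'               // Au length mark
        | '\u{200C}'               // ZWNJ
        | '\u{200D}'               // ZWJ
    )
}

/// Split `s` so that each piece is one base char plus any following
/// chars that the hard-coded predicate recognizes as combining.
pub fn get_letters(s: &str) -> Vec<&str> {
    let mut result = vec![];
    let mut start = 0;
    while start < s.len() {
        let end = s[start..].char_indices().skip(1)
            .find(|(_, c)| !is_tamil_combining_char(*c))
            .map(|(i, _)| start + i)
            .unwrap_or(s.len());
        result.push(&s[start..end]);
        start = end;
    }
    result
}

fn main() {
    // Simple Tamil text is grouped as intended.
    println!("{:?}", get_letters("தமிழ்"));
    // "e" + COMBINING ACUTE ACCENT comes back as two separate pieces.
    println!("{:?}", get_letters("e\u{301}"));
}
```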

1 Like

is_char_boundary doesn't do what you think. You assume it's a boundary between visible, human-recognizable characters. It is NOT. It's only a check whether a byte is a part of an incomplete UTF-8 sequence. It's a primitive, low-level function that has no idea what a character looks like.
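A small sketch of what that means in practice (the helper name boundaries is made up here): for the single visible letter மி, is_char_boundary reports a boundary in the middle of the letter, between the consonant and its vowel sign, because those are two separate code points.

```rust
/// All byte indices in `0..=s.len()` that `is_char_boundary` accepts.
fn boundaries(s: &str) -> Vec<usize> {
    (0..=s.len()).filter(|&i| s.is_char_boundary(i)).collect()
}

fn main() {
    // One visible letter made of two code points:
    // U+0BAE (TAMIL LETTER MA) + U+0BBF (TAMIL VOWEL SIGN I).
    let s = "\u{0BAE}\u{0BBF}";
    // Index 3 splits the letter in two, yet it counts as a boundary.
    println!("{:?}", boundaries(s)); // [0, 3, 6]
}
```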

Rust's standard library has no way of doing what you want. This is intentional. What you need is grapheme clusters, and that requires a special algorithm and lookup tables that you need to get from a 3rd party crate.

I don’t want a third party solution

You will not be able to be productive and write non-trivial programs in Rust by sticking only to the standard library. Rust is designed to rely on crates.io for almost everything.

8 Likes

I maintain open-tamil for Python, and recently ported some pieces of that large library into Rust as the tamil crate: https://crates.io/crates/tamil

Thanks to the comments here I tested my code, and str::is_char_boundary is not sufficient for my use.

My algorithm is here: get_letters routine
