How to work with strings and graphemes similar to SQL? How to avoid crate proliferation?

My concrete question is:
How can I do similar things in Rust (on Rust's UTF-8 strings) without having to worry about how many bytes a grapheme (as I understand it, the equivalent of a character) occupies?

SUBSTRING ( expression , start , length ) 
LEFT ( character_expression , integer_expression ) 
LEN ( string_expression )
RIGHT ( character_expression , integer_expression ) 
STUFF ( character_expression , start , length , replaceWith_expression )

I have no idea.

But if you can get it to work with unicode text l̷̢̢̰̬͇̙͉͕̠̠̥̂̿̋͑̕͝͠ͅͅĩ̴̡̢̛̠̻̫̲͉̤̱̟͍̤̳͔͐̍̔̈́͊̒͂͋̈́̉̔̕̚͜͜͠k̸͍̳̜̗̰̼̦̟̖̳̥̙̗̂́̓̎͌͊͘ȩ̴͔̝̤̳̖̜̓̽̕ ̶̪̺͖̈́̃t̷̪̯̟̳͍̲͔̎͋̿̉̒̑̓̊̾̊̒̚͘ḩ̷̦͂̈́͗͌̏̏̇̔̈́͒̒̆̄̈́̚͠į̸̨̡̛̤͚͓̯͎̘̪̙̟̮͈͔͔̈́͋̉̾̃̎̒̈́́̾͂́ͅś̷̘̙̜̯͖̄͆̿̄̑̄̄͝ you will be way ahead of the game.

You can start with unicode-segmentation, specifically with UnicodeSegmentation::graphemes(), which gives you an iterator over graphemes; you can then use the usual iterator methods to splice the string any way you wish.
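
For example, here is a minimal sketch of a grapheme-based SUBSTRING built on that iterator (the function name is my own; it assumes the unicode-segmentation crate is in Cargo.toml):

use unicode_segmentation::UnicodeSegmentation;

/// Hypothetical grapheme-based SUBSTRING: skip `start` grapheme clusters, take `len`.
fn substr_graphemes(s: &str, start: usize, len: usize) -> String {
    s.graphemes(true).skip(start).take(len).collect()
}

fn main() {
    // "e\u{301}" is one grapheme made of two code points.
    assert_eq!(substr_graphemes("e\u{301}clair", 0, 2), "e\u{301}c");
}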

If you only use a limited subset of characters and you know what you're doing, you could normalise the Unicode using NFC and just iterate over code points with the standard library's str operations.
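
A sketch of that approach, assuming the unicode-normalization crate:

use unicode_normalization::UnicodeNormalization;

fn main() {
    let decomposed = "e\u{301}"; // 'e' + combining acute accent: two code points
    let precomposed = "\u{e9}";  // 'é' as a single precomposed code point
    assert_ne!(decomposed, precomposed);

    // After NFC both collapse to the same code-point sequence,
    // so plain chars()-based operations behave consistently.
    let normalized: String = decomposed.nfc().collect();
    assert_eq!(normalized, precomposed);
    assert_eq!(normalized.chars().count(), 1);
}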

1 Like

According to the documentation that @trentj linked to, all of those operations work on codepoints and not graphemes. Something like this should be equivalent (compiles, but untested):

/// SUBSTRING ( expression , start , length )
fn substr(s: &str, start: usize, len: usize) -> String {
    s.chars().skip(start).take(len).collect()
}

/// LEFT ( character_expression , integer_expression )
fn left(s: &str, len: usize) -> String {
    s.chars().take(len).collect()
}

/// LEN ( string_expression )
fn len(s: &str) -> usize {
    s.chars().count()
}

/// RIGHT ( character_expression , integer_expression )
fn right(s: &str, len: usize) -> String {
    if len == 0 {
        Default::default()
    } else {
        // Byte index of the `len`-th char counted from the end.
        let start = s.char_indices().rev().nth(len - 1).unwrap().0;
        s[start..].into()
    }
}

/// STUFF ( character_expression , start , length , replaceWith_expression )
fn stuff(s: &str, start: usize, len: usize, r: &str) -> String {
    s.chars().take(start)                    // keep the first `start` chars
        .chain(r.chars())                    // splice in the replacement
        .chain(s.chars().skip(start + len))  // drop the `len` replaced chars, keep the rest
        .collect()
}

(Playground)
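
A quick sanity check against the STUFF example from the T-SQL documentation (keep in mind that these functions take 0-based indices, while T-SQL's are 1-based):

fn main() {
    // T-SQL: SELECT STUFF('abcdef', 2, 3, 'ijklmn') returns 'aijklmnef'.
    // With 0-based indexing the equivalent call uses start = 1:
    assert_eq!(stuff("abcdef", 1, 3, "ijklmn"), "aijklmnef");
    assert_eq!(substr("héllo wörld", 6, 5), "wörld");
    assert_eq!(right("héllo", 3), "llo");
}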


And here are versions of substr, left, and right that return a slice of the original string instead of making a heap allocation:

fn idx_of(s: &str, idx: usize) -> usize {
    // Byte index of the `idx`-th char, or `s.len()` if the string is shorter.
    s.char_indices().nth(idx).map(|(x, _)| x).unwrap_or(s.len())
}

/// SUBSTRING ( expression , start , length )
fn substr(s: &str, start: usize, len: usize) -> &str {
    &s[idx_of(s, start)..idx_of(s, start + len)]
}

/// LEFT ( character_expression , integer_expression )
fn left(s: &str, len: usize) -> &str {
    &s[..idx_of(s, len)]
}

/// RIGHT ( character_expression , integer_expression )
fn right(s: &str, len: usize) -> &str {
    &s[idx_of(s, s.chars().count().saturating_sub(len))..]
}

(Playground)
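
Hypothetical usage, showing that the indices are counted in chars and the results borrow from the input without allocating:

fn main() {
    let s = "grüße aus münchen";
    assert_eq!(substr(s, 2, 3), "üße");
    assert_eq!(left(s, 5), "grüße");
    assert_eq!(right(s, 7), "münchen");
}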

4 Likes

Also note that it might not be very efficient to build strings by simply composing the operations above; it can be a lot more performant to chain iterators and use skip(), take(), take_while(), etc., like @2e71828 did in the post above.

1 Like

Or, most efficiently, always retrieve, store, and pass around UTF-8 byte indices instead of char indices. Then you don't need any UTF-8 decoding or iteration at all, and you can use Rust's native UTF-8 indexing/slicing/splicing operations directly. The Rust standard library makes this very easy because the string operations that work with lengths and indices all take and return byte indices.

(The operations are still Unicode-aware; for example str.find(...) will never return an index that is not on a code-point boundary.)
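
A small illustration of that workflow (the string is made up; the methods are all from std):

fn main() {
    let s = "münchen: café";
    // find() returns a byte index that is guaranteed to lie on a
    // char boundary, so it can be used for slicing directly.
    if let Some(idx) = s.find(':') {
        let city = &s[..idx];
        let rest = s[idx + 1..].trim_start();
        println!("{} | {}", city, rest); // münchen | café
    }
}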

2 Likes

I think this is the missing piece I was looking for to understand how to work with strings: the chars method.
It looks like this method does what I expect.

But then I don't understand this:

It's important to remember that char represents a Unicode Scalar Value, and may not match your idea of what a 'character' is. Iteration over grapheme clusters may be what you actually want. This functionality is not provided by Rust's standard library, check crates.io instead.

Maybe the notes about "graphemes" in the documentation just confused me unnecessarily, and with chars everything is much simpler?

Using code points is simpler than graphemes but then you need to consider Unicode normalisation -- two different codepoint sequences can represent the same grapheme. So it really depends on what you want to do with your strings and where they're coming from.

Some things that appear as a single character visually are represented by a sequence of codepoints. Both these functions and the SQL ones have a chance of splitting a base character from its diacritical marks, for example.
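
For example (a made-up snippet using only std), a char-based LEFT can strand a combining accent:

fn main() {
    // "é" written as 'e' + U+0301 COMBINING ACUTE ACCENT: two chars, one grapheme.
    let s = "e\u{301}clair";
    let first: String = s.chars().take(1).collect();
    // Prints a bare "e": the combining accent stayed behind in the rest of the string.
    println!("{:?}", first);
}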

2 Likes

This reads logically. But I can find only char_indices(&self). Do you mean something else?

Often when you want to do something like splice or trim a string at a particular index, it's because you found that index in some previous inspection of the string. In Rust, operations like str::find and str::rfind return byte indices, which you can then pass directly to operations like str::split_at or String::insert.

Again, it would be useful to have a concrete case where you use these functions. Something you can provide code for like, "remove all the text after the first semicolon," or "strip trailing emojis from the text."
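
For instance, "remove all the text after the first semicolon" falls straight out of find() plus slicing (hypothetical helper name):

/// Keep everything before the first ';' (or the whole string if there is none).
fn before_semicolon(s: &str) -> &str {
    match s.find(';') {
        Some(idx) => &s[..idx],
        None => s,
    }
}

fn main() {
    assert_eq!(before_semicolon("héllo; wörld"), "héllo");
    assert_eq!(before_semicolon("no semicolon"), "no semicolon");
}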

5 Likes

One of the languages that lets you do this in O(1) time, Python, stores strings roughly like a Box<[char]>. But this isn't a good default for a low-level language like Rust: [char] wastes memory, and you rarely need to slice strings in a context where you aren't already iterating over them.

1 Like

Generally the main reason you rarely end up actually needing grapheme clusters in practice comes from the answer to the question "Where did 7 and 5 come from?"

Pretty much every way the standard library provides to obtain those values gives you a byte index rather than a "grapheme cluster" index, and once you have the byte index, you can just use the methods from the standard library.

3 Likes

Also note that if "7" and "5" are hard-coded numbers, then your SQL code will break in the middle of multi-code-point grapheme clusters, producing nonsensical-looking output for input strings containing flag emoji, combining diacritics, Thai vowel signs, etc., because SQL also does not have built-in support for extended grapheme clusters.
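
The char-based Rust versions above have the same failure mode (illustrative snippet, std only):

fn main() {
    // A flag emoji is two regional-indicator code points forming one grapheme.
    let s = "🇩🇪!";
    let first: String = s.chars().take(1).collect();
    // `first` is a lone regional indicator, not a complete flag.
    println!("{} chars, {} bytes", first.chars().count(), first.len()); // 1 chars, 4 bytes
}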

3 Likes

Yeah, for example the following statement returns 6 and 2:

SELECT LENGTH("কী"), CHAR_LENGTH("কী")

For comparison:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "কী";
    println!("Bytes: {}", s.len());
    println!("Chars: {}", s.chars().count());
    println!("Graphemes: {}", s.graphemes(true).count());
}

Output:

Bytes: 6
Chars: 2
Graphemes: 1

2 Likes

I understand that I must have run into a "problem" there while reading the documentation that is not normally a problem. Reading about these "graphemes", which I had never heard of before, scared me into thinking that I could not do things as simple as the ones I asked about with Rust. In real life, however, everything seems to be a lot simpler, and I guess I don't need to worry about not being able to do such simple string operations with Rust.

I had already asked myself after reading the introduction about strings: how can it be that such (in my opinion almost trivial) actions are not possible?

But in Rust, I guess they just think much further ahead than I ever did, and that's where graphemes come into play (and scare beginners like me).

One of the languages that lets you do this in O(1) time, Python, stores strings roughly like a Box<[char]>. But this isn't a good default for a low-level language like Rust: [char] wastes memory, and you rarely need to slice strings in a context where you aren't already iterating over them.

This is very revealing: because Rust is a low-level language, you have to think about such things. And Python is simply more wasteful with memory. That's probably what makes Rust so special: they don't just care about safe code, they also think about other important things.

I am also impressed by the compiler, which gives such useful hints. What a contrast to working with VS and some Microsoft product error messages. And I was also very impressed by the built-in documentation with cargo doc --open.

2 Likes

Basically, if you want to implement MS SQL bindings, then you need to do the same thing MS SQL does, even if it's "wrong."

This might be a bit complicated if you run your database with a UCS-2 collation, because it means you need to support splitting surrogate pairs in half.
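
If you do need to reproduce UTF-16 code-unit semantics, str::encode_utf16 from the standard library gives you the counts without allocating (illustrative snippet):

fn main() {
    // '💖' is outside the BMP, so it takes a surrogate pair in UTF-16.
    let s = "a💖";
    println!("chars: {}", s.chars().count());                     // 2
    println!("UTF-16 code units: {}", s.encode_utf16().count());  // 3
}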

Isn't it the same with Python? Out of the box Python does not understand graphemes either. You need to install the grapheme module:

$ python3
Python 3.7.3 (default, Dec 20 2019, 18:57:59) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import grapheme
>>> s = "কী"
>>> len(s)
2
>>> grapheme.length(s)
1
>>> len(s.encode('utf-8'))
6
2 Likes

Yes, Python doesn't understand graphemes. Built-in methods like slicing operate on Unicode scalar values.

I don't think having to worry about graphemes is common; I've never needed them.

Even without graphemes, Unicode is tricky. Consider (Rust Playground):

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    for s in [
        "\u{03d3}",
        "\u{03d2}\u{0301}",
        "\u{038e}",
        "\u{03a5}\u{0301}",
    ]
    .iter()
    {
        println!("{}", s);
        println!("Bytes: {}", s.len());
        println!("Chars: {}", s.chars().count());
        println!("Graphemes: {}\n", s.graphemes(true).count());
    }
}

All 4 strings represent the same "letter" (GREEK UPSILON WITH ACUTE AND HOOK SYMBOL) but they have different representations. All of them are valid, but the 4 normalisations are not identical. When comparing Unicode strings you need to be careful about what and how you're comparing/hashing/etc., and ideally not only validate your input as valid UTF-8 but also normalise it the same way.

(Interestingly, my browsers don't render those 4 strings the same way; only the last two look the same.)

1 Like