How to work with strings and graphemes similar to SQL? How to avoid crate proliferation?

My concrete question is:
How can I do similar things in Rust (on Rust's UTF-8 strings) without having to worry about how many bytes a grapheme (as I understand it, the equivalent of a character) occupies?

SUBSTRING ( expression , start , length ) 
LEFT ( character_expression , integer_expression ) 
LEN ( string_expression )
RIGHT ( character_expression , integer_expression ) 
STUFF ( character_expression , start , length , replaceWith_expression )

I have no idea.

But if you can get it to work with unicode text l̷̢̢̰̬͇̙͉͕̠̠̥̂̿̋͑̕͝͠ͅͅĩ̴̡̢̛̠̻̫̲͉̤̱̟͍̤̳͔͐̍̔̈́͊̒͂͋̈́̉̔̕̚͜͜͠k̸͍̳̜̗̰̼̦̟̖̳̥̙̗̂́̓̎͌͊͘ȩ̴͔̝̤̳̖̜̓̽̕ ̶̪̺͖̈́̃t̷̪̯̟̳͍̲͔̎͋̿̉̒̑̓̊̾̊̒̚͘ḩ̷̦͂̈́͗͌̏̏̇̔̈́͒̒̆̄̈́̚͠į̸̨̡̛̤͚͓̯͎̘̪̙̟̮͈͔͔̈́͋̉̾̃̎̒̈́́̾͂́ͅś̷̘̙̜̯͖̄͆̿̄̑̄̄͝ you will be way ahead of the game.

You can start with unicode-segmentation, specifically with UnicodeSegmentation::graphemes(), which gives you an iterator over graphemes; you can then use the usual iterator methods to splice the string any way you wish.
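
For example, here is a minimal sketch of a grapheme-based SUBSTRING built on that iterator (the function name is my own; it assumes the unicode-segmentation crate is in Cargo.toml):

use unicode_segmentation::UnicodeSegmentation;

/// Hypothetical grapheme-based SUBSTRING: skip `start` grapheme clusters, take `len`.
fn substr_graphemes(s: &str, start: usize, len: usize) -> String {
    s.graphemes(true).skip(start).take(len).collect()
}

fn main() {
    // "e\u{301}" is one grapheme made of two code points.
    assert_eq!(substr_graphemes("e\u{301}clair", 0, 2), "e\u{301}c");
}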

If you only use a limited subset of characters and you know what you're doing, you could normalise the Unicode using NFC and just iterate over code points with the standard library's str operations.
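
A sketch of that approach, assuming the unicode-normalization crate:

use unicode_normalization::UnicodeNormalization;

fn main() {
    let decomposed = "e\u{301}"; // 'e' + combining acute accent: two code points
    let precomposed = "\u{e9}";  // 'é' as a single precomposed code point
    assert_ne!(decomposed, precomposed);

    // After NFC both collapse to the same code-point sequence,
    // so plain chars()-based operations behave consistently.
    let normalized: String = decomposed.nfc().collect();
    assert_eq!(normalized, precomposed);
    assert_eq!(normalized.chars().count(), 1);
}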

1 Like

According to the documentation that @trentj linked to, all of those operations work on codepoints and not graphemes. Something like this should be equivalent (compiles, but untested):

/// SUBSTRING ( expression , start , length )
fn substr(s: &str, start: usize, len: usize) -> String {
    s.chars().skip(start).take(len).collect()
}

/// LEFT ( character_expression , integer_expression )
fn left(s: &str, len: usize) -> String {
    s.chars().take(len).collect()
}

/// LEN ( string_expression )
fn len(s: &str) -> usize {
    s.chars().count()
}

/// RIGHT ( character_expression , integer_expression )
fn right(s: &str, len: usize) -> String {
    if len == 0 {
        Default::default()
    } else {
        // Byte index of the `len`-th char counted from the end.
        let start = s.char_indices().rev().nth(len - 1).unwrap().0;
        s[start..].into()
    }
}

/// STUFF ( character_expression , start , length , replaceWith_expression )
fn stuff(s: &str, start: usize, len: usize, r: &str) -> String {
    s.chars().take(start)                    // keep the first `start` chars
        .chain(r.chars())                    // splice in the replacement
        .chain(s.chars().skip(start + len))  // drop the `len` replaced chars, keep the rest
        .collect()
}

(Playground)
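
A quick sanity check against the STUFF example from the T-SQL documentation (keep in mind that these functions take 0-based indices, while T-SQL's are 1-based):

fn main() {
    // T-SQL: SELECT STUFF('abcdef', 2, 3, 'ijklmn') returns 'aijklmnef'.
    // With 0-based indexing the equivalent call uses start = 1:
    assert_eq!(stuff("abcdef", 1, 3, "ijklmn"), "aijklmnef");
    assert_eq!(substr("héllo wörld", 6, 5), "wörld");
    assert_eq!(right("héllo", 3), "llo");
}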


And here are versions of substr, left, and right that return a slice of the original string instead of making a heap allocation:

fn idx_of(s: &str, idx: usize) -> usize {
    // Byte index of the `idx`-th char, or `s.len()` if the string is shorter.
    s.char_indices().nth(idx).map(|(x, _)| x).unwrap_or(s.len())
}

/// SUBSTRING ( expression , start , length )
fn substr(s: &str, start: usize, len: usize) -> &str {
    &s[idx_of(s, start)..idx_of(s, start + len)]
}

/// LEFT ( character_expression , integer_expression )
fn left(s: &str, len: usize) -> &str {
    &s[..idx_of(s, len)]
}

/// RIGHT ( character_expression , integer_expression )
fn right(s: &str, len: usize) -> &str {
    &s[idx_of(s, s.chars().count().saturating_sub(len))..]
}

(Playground)
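
Hypothetical usage, showing that the indices are counted in chars and the results borrow from the input without allocating:

fn main() {
    let s = "grüße aus münchen";
    assert_eq!(substr(s, 2, 3), "üße");
    assert_eq!(left(s, 5), "grüße");
    assert_eq!(right(s, 7), "münchen");
}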

4 Likes

Also note that it might not be very efficient to build strings by simply composing the operations above; it can be a lot more performant to chain iterators and use skip(), take(), take_while(), etc., like @2e71828 did in the post above.

1 Like

Or, most efficiently, always retrieve, store, and pass around UTF-8 byte indices instead of char indices. Then you don't need any UTF-8 decoding or iteration at all, and you can use Rust's native UTF-8 indexing/slicing/splicing operations directly. The Rust standard library makes this very easy because the string operations that work with lengths and indices all take and return byte indices.

(The operations are still Unicode-aware; for example str.find(...) will never return an index that is not on a code-point boundary.)
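
A small illustration of that workflow (the string is made up; the methods are all from std):

fn main() {
    let s = "münchen: café";
    // find() returns a byte index that is guaranteed to lie on a
    // char boundary, so it can be used for slicing directly.
    if let Some(idx) = s.find(':') {
        let city = &s[..idx];
        let rest = s[idx + 1..].trim_start();
        println!("{} | {}", city, rest); // münchen | café
    }
}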

2 Likes

I think this is the missing piece I was looking for to understand how to work with strings: the chars method.
It looks like this method does what I expect.

But then I don't understand this:

It's important to remember that char represents a Unicode Scalar Value, and may not match your idea of what a 'character' is. Iteration over grapheme clusters may be what you actually want. This functionality is not provided by Rust's standard library, check crates.io instead.

Maybe the notes about "graphemes" in the documentation just confused me unnecessarily, and with chars everything is much simpler?

Using code points is simpler than graphemes but then you need to consider Unicode normalisation -- two different codepoint sequences can represent the same grapheme. So it really depends on what you want to do with your strings and where they're coming from.

Some things that appear as a single character visually are represented by a sequence of codepoints. Both these functions and the SQL ones have a chance of splitting a base character from its diacritical marks, for example.
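
For example (a made-up snippet using only std), a char-based LEFT can strand a combining accent:

fn main() {
    // "é" written as 'e' + U+0301 COMBINING ACUTE ACCENT: two chars, one grapheme.
    let s = "e\u{301}clair";
    let first: String = s.chars().take(1).collect();
    // Prints a bare "e": the combining accent stayed behind in the rest of the string.
    println!("{:?}", first);
}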

2 Likes

This reads logically. But I can find only char_indices(&self). Do you mean something else?

Often when you want to do something like splice or trim a string at a particular index, it's because you found that index in some previous inspection of the string. In Rust, operations like str::find and str::rfind return byte indices, which you can then pass directly to operations like str::split_at or String::insert.

Again, it would be useful to have a concrete case where you use these functions. Something you can provide code for like, "remove all the text after the first semicolon," or "strip trailing emojis from the text."
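
For instance, "remove all the text after the first semicolon" falls straight out of find() plus slicing (hypothetical helper name):

/// Keep everything before the first ';' (or the whole string if there is none).
fn before_semicolon(s: &str) -> &str {
    match s.find(';') {
        Some(idx) => &s[..idx],
        None => s,
    }
}

fn main() {
    assert_eq!(before_semicolon("héllo; wörld"), "héllo");
    assert_eq!(before_semicolon("no semicolon"), "no semicolon");
}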

5 Likes

One of the languages that lets you do this in O(1) time, Python, stores strings roughly like a Box<[char]>. But this isn't a good default for a low-level language like Rust: [char] wastes memory, and you rarely need to slice strings in a context where you aren't already iterating over them.

1 Like

Generally the main reason you rarely end up actually needing grapheme clusters in practice comes from the answer to the question "Where did 7 and 5 come from?"

Pretty much every way the standard library provides to obtain those values gives you a byte index rather than a "grapheme cluster" index, and once you have the byte index, you can just use the methods from the standard library.

3 Likes

Also note that if "7" and "5" are hard-coded numbers, then your SQL code will break in the middle of multi-code-point grapheme clusters, producing nonsensical-looking output for input strings containing flag emoji, combining diacritics, Thai vowel signs, etc., because SQL also does not have built-in support for extended grapheme clusters.
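
The char-based Rust versions above have the same failure mode (illustrative snippet, std only):

fn main() {
    // A flag emoji is two regional-indicator code points forming one grapheme.
    let s = "🇩🇪!";
    let first: String = s.chars().take(1).collect();
    // `first` is a lone regional indicator, not a complete flag.
    println!("{} chars, {} bytes", first.chars().count(), first.len()); // 1 chars, 4 bytes
}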

3 Likes

Yeah, for example the following statement returns 6 and 2:

SELECT LENGTH("কী"), CHAR_LENGTH("কী")

For comparison:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let s = "কী";
    println!("Bytes: {}", s.len());
    println!("Chars: {}", s.chars().count());
    println!("Graphemes: {}", s.graphemes(true).count());
}

Output:

Bytes: 6
Chars: 2
Graphemes: 1

2 Likes

I understand that I must have run into a "problem" there while reading the documentation that is not normally a problem. Reading about these "graphemes", which I had never heard of before, scared me into thinking that I could not do things as simple as the ones I asked about with Rust. In real life, however, everything seems to be a lot simpler, and I guess I don't need to worry about not being able to do such simple string operations with Rust.

I had already asked myself after reading the introduction about strings: how can it be that such (in my opinion almost trivial) actions are not possible?

But in Rust, I guess they just think much further ahead than I ever did, and that's where graphemes come into play (and scare beginners like me).

One of the languages that lets you do this in O(1) time, Python, stores strings roughly like a Box<[char]>. But this isn't a good default for a low-level language like Rust: [char] wastes memory, and you rarely need to slice strings in a context where you aren't already iterating over them.

This is very revealing: because Rust is a low-level language, you have to think about such things. And Python is simply more wasteful with memory. That's probably what makes Rust so special: they don't just care about safe code, they also think about other important things.

I am also impressed by the compiler, which gives such useful hints. What a contrast to working with VS and some Microsoft product error messages. And I was also very impressed by the built-in documentation with cargo doc --open.

2 Likes

Basically, if you want to implement MS SQL bindings, then you need to do the same thing MS SQL does, even if it's "wrong."

This might be a bit complicated if you run your database with a UCS-2 collation, because it means you need to support splitting surrogate pairs in half.
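
If you do need to reproduce UTF-16 code-unit semantics, str::encode_utf16 from the standard library gives you the counts without allocating (illustrative snippet):

fn main() {
    // '💖' is outside the BMP, so it takes a surrogate pair in UTF-16.
    let s = "a💖";
    println!("chars: {}", s.chars().count());                     // 2
    println!("UTF-16 code units: {}", s.encode_utf16().count());  // 3
}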

Isn't it the same with Python? Out of the box Python does not understand graphemes either. You need to install the grapheme module:

$ python3
Python 3.7.3 (default, Dec 20 2019, 18:57:59) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import grapheme
>>> s = "কী"
>>> len(s)
2
>>> grapheme.length(s)
1
>>> len(s.encode('utf-8'))
6
2 Likes

Yes, Python doesn't understand graphemes. Built-in methods like slicing operate on Unicode scalar values.

I don't think having to worry about graphemes is common; I've never needed them.

Even without graphemes, Unicode is tricky. Consider (Rust Playground):

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    for s in [
        "\u{03d3}",
        "\u{03d2}\u{0301}",
        "\u{038e}",
        "\u{03a5}\u{0301}",
    ]
    .iter()
    {
        println!("{}", s);
        println!("Bytes: {}", s.len());
        println!("Chars: {}", s.chars().count());
        println!("Graphemes: {}\n", s.graphemes(true).count());
    }
}

All 4 strings represent the same "letter" (GREEK UPSILON WITH ACUTE AND HOOK SYMBOL) but they have different representations. All of them are valid, but the 4 normalisations are not identical. When comparing Unicode strings you need to be careful about what and how you're comparing/hashing/etc., and ideally not only validate your input as valid UTF-8 but also normalise it the same way.

(Interestingly, my browsers don't render those 4 strings the same way; only the last two look the same.)

1 Like