How to work with strings and graphemes similar to SQL? How to avoid crate proliferation?

Germo · February 9, 2021, 10:33am

I am reading the Rust documentation sequentially, and in the chapter on strings I see some problems for my intended use of Rust.

Getting grapheme clusters from strings is complex, so this functionality is not provided by the standard library. Crates are available on crates.io if this is the functionality you need.

I work primarily with databases (mssql) and Rust would be interesting to me for an interaction with databases. In TSQL ~~a grapheme uses one or two bytes depending on the type.~~ I have many easy possibilities to work with strings:

LEN, LEFT, RIGHT, STUFF, SUBSTRING, PATINDEX, ...

After reading the documentation on "strings", I realize why this can't or won't be so easy in Rust because of UTF-8. A crate like unicode_segmentation returns iterators. UnicodeSegmentation implements some split methods. I would also need other crates for a non-Unicode varchar equivalent.

Is this how you have to work in Rust to get "trivial" string operations? An unstable crate here, with a link to another one there, and then you have to piece everything together yourself, just to be able to work with strings as conveniently as in other languages?

Or I've seen that for something as common as a GUID, you have to use an extra crate. Shouldn't there be more of this in the standard library? Especially to make it easier for Rust beginners to get started?

Do finished programs then end up like this, with hundreds of crates gathered for all possible purposes? Sometimes a crate for only one data type? And can it then happen that different developers in the same project get the same functionality via different crates?

Or is there some sort of prioritization or recommendation system for crates to prevent such proliferation?

Since I work with the SQL Server, it is quite simple there. You work with a certain version of SQL Server, it has certain features, and you can use them or not.

alice · February 9, 2021, 10:43am

All of the SQL commands you have listed can be performed on strings with the Rust standard library, without extra crates.

Well, except maybe the STUFF command. But the standard library does provide it for a Vec<u8> under the name splice.
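To make that concrete, here is a rough sketch of how a few of the listed functions could be written with only the standard library. The helper names (`len`, `left`, `right`, `char_index`) are made up for illustration, and the sketch assumes you want to count Unicode code points (Rust `char`s); what T-SQL actually counts depends on the column type and collation. PATINDEX takes wildcard patterns, so only a plain CHARINDEX-style search is shown.

```rust
// Hypothetical helpers mirroring a few T-SQL string functions.
// Positions and lengths count Unicode code points (Rust `char`s),
// not grapheme clusters and not bytes.

/// LEN: number of code points. (T-SQL's LEN also ignores trailing
/// spaces; this sketch does not.)
fn len(s: &str) -> usize {
    s.chars().count()
}

/// LEFT: the first `n` code points.
fn left(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s, // fewer than n + 1 chars: return everything
    }
}

/// RIGHT: the last `n` code points.
fn right(s: &str, n: usize) -> &str {
    if n == 0 {
        return "";
    }
    match s.char_indices().rev().nth(n - 1) {
        Some((byte_idx, _)) => &s[byte_idx..],
        None => s, // fewer than n chars: return everything
    }
}

/// CHARINDEX-like search: 1-based code point position of `needle`
/// in `haystack`, or 0 if it does not occur.
fn char_index(needle: &str, haystack: &str) -> usize {
    match haystack.find(needle) {
        // `find` returns a byte index; count the chars before it.
        Some(byte_idx) => haystack[..byte_idx].chars().count() + 1,
        None => 0,
    }
}
```

With these, `left("héllo", 2)` gives `"hé"` even though `é` takes two bytes, because the slicing happens on byte offsets derived from `char_indices`.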

alice · February 9, 2021, 10:54am

Regarding the unicode segmentation crate in particular, it was actually supported by the standard library at some point in the past, but it was removed due to the large size of the tables that are needed to implement the feature.

Germo · February 9, 2021, 11:03am

@alice
My understanding from the documentation (Storing UTF-8 Encoded Text with Strings - The Rust Programming Language) is that this Vec<u8> normally can't be used, because it works on bytes. You could use it only if you ensure that one grapheme is one byte. That could only work for varchar content. But for nvarchar content (Unicode) a grapheme could be one, two, or more bytes. That's the reason I'm asking here.

alice · February 9, 2021, 11:06am

It is true that if you want to use splice by going through Vec<u8>, you would need to first compute the byte indexes that correspond to the character indexes in question, but if you do that, going through Vec<u8> and back will work.
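A minimal sketch of that round trip, assuming "character index" means code point index and using 0-based positions (T-SQL's STUFF is 1-based). The names `byte_offset` and `stuff` are hypothetical:

```rust
/// Byte offset of the `n`-th code point (0-based), or the string's
/// byte length if `n` is at or past the end.
fn byte_offset(s: &str, n: usize) -> usize {
    s.char_indices().nth(n).map_or(s.len(), |(i, _)| i)
}

/// Hypothetical STUFF-like helper: delete `len` code points starting
/// at 0-based position `start` and insert `replacement` there.
fn stuff(s: &str, start: usize, len: usize, replacement: &str) -> String {
    // Step 1: turn character positions into byte positions.
    let from = byte_offset(s, start);
    let to = byte_offset(s, start + len);

    // Step 2: go through Vec<u8> and splice the byte range.
    let mut bytes: Vec<u8> = s.as_bytes().to_vec();
    // `splice` returns an iterator over the removed bytes; dropping it
    // finishes the replacement.
    drop(bytes.splice(from..to, replacement.bytes()));

    // Step 3: back to a String. Both cut points are char boundaries
    // and the replacement is valid UTF-8, so this cannot fail.
    String::from_utf8(bytes).expect("spliced on char boundaries")
}
```

For example, `stuff("héllo wörld", 6, 5, "Rust")` produces `"héllo Rust"`. The same effect is available without leaving `String` via `String::replace_range`, which takes the byte range directly.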

alice · February 9, 2021, 11:19am

Regarding your other questions, the amount of small crates defining one or two types you are going to need isn't really that big. I've used Rust for five years, and this has never been a problem at all.

Germo · February 9, 2021, 11:28am

you would need to first compute the byte indexes that correspond to the character indexes in question

I think that is the point. First I would need to convert a string (each string?) into some "vector of graphemes" before I can do something with these vectors. And for me as a beginner, it looks very complicated. I would expect that there would be something to work with "out of the box" when working with strings. Maybe there is, and it's just not in this place in the documentation?

alice · February 9, 2021, 11:33am

No, you don't necessarily have to convert it into a "vector of graphemes". If you have a string containing "aøb" or something like that, you can splice away the ø by splicing from index 1 to 3.

I mean the thing is, when are you ever going to be calling splice on a string with indexes that are for grapheme clusters? Like, how did you get that index? If you are using str::find, it is going to give you the byte index, which can be used in splice directly without any sort of conversion.
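As a small sketch of that point: the byte index from `str::find` can be passed straight to `String::replace_range` (the in-place counterpart of splicing on a `String`). The helper name `replace_first` is made up for this example:

```rust
/// Replace the first occurrence of `needle` in `text` with
/// `replacement`. Returns false if `needle` was not found.
fn replace_first(text: &mut String, needle: &str, replacement: &str) -> bool {
    match text.find(needle) {
        Some(start) => {
            // `start` is a byte index and `needle.len()` is a byte
            // length, so the range always falls on char boundaries.
            text.replace_range(start..start + needle.len(), replacement);
            true
        }
        None => false,
    }
}
```

Calling `replace_first(&mut s, "ø", "")` on a `String` holding `"aøb"` leaves `"ab"`, the same result as `s.replace_range(1..3, "")`, and no grapheme or character counting is involved at any point.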

Germo · February 9, 2021, 11:33am

I see, here was a similar topic a few years ago:

But no real solution is described there.

alice · February 9, 2021, 11:34am

There is a solution described. If you want something more complicated than offered by the standard library, use the unicode-segmentation crate.

SkiFire13 · February 9, 2021, 12:25pm

The problems with this are:

Bloating the stdlib. You're complaining that you end up with hundreds of dependencies. Would anything change if those dependencies were in every project thanks to the stdlib?
In case there were multiple ways to do something, what should the stdlib choose? The risk is that users will use another crate anyway because the stdlib didn't offer that niche feature they needed.
Who should maintain them? Do you expect the current team to maintain hundreds of crates with the same quality level of the rest of the stdlib?
Backward compatibility. If some of those crates are found to have a bad API surface we need to keep it as is for the sake of backward compatibility. We already have std::sync::mpsc that nobody uses because flume and crossbeam are much better but it can't be removed and must be supported.

I think someone once proposed it, but the stdlib team (or whoever would have had to build that system) feared being accused of favoritism. There are, however, crates that are the de facto standard for certain things, for example serde. There's also lib.rs (note that it's unofficial) for searching popular crates.

Germo · February 9, 2021, 12:35pm

@SkiFire13

Thank you for the answer. That is very plausible and therefore well understandable.

Right now it all looks a little complicated, and that is putting me off a little from starting to program in Rust. But maybe it will turn out later that it is not as complicated as it seems at first glance. Right now, I'm afraid that I'll probably have to turn strings into vectors of graphemes a lot before I can do certain operations with them.

Because probably the most important question is: how do I use Rust to solve MY tasks? And will that even be possible, so that the investment of time in learning Rust could also be worthwhile?

I am also a bit surprised that Rust is one of the most popular languages according to surveys, but is hardly used. So, like the beloved and unattainable princess.

But since I am basically enthusiastic about the Rust concept, I will not give up so quickly.

qaopm · February 9, 2021, 12:53pm

I think this is the right approach. Give it a go and you'll see. I haven't done much Unicode processing in Rust (beyond basics) but from what I've seen Rust offers pretty similar functionality to what e.g. Python would give you. I.e. out of the box you won't be able to work with graphemes, only with Unicode code points, and it's up to you as the caller to deal with graphemes, normalisation, etc.

Don't be afraid to use external crates -- many of them are maintained by the same people as the Rust standard library.

One property of compiled languages like Rust is that even if you include a massive crate in your project, it doesn't mean that the entire crate will be built into the final executable. Only bits that are actually used are built in, unused parts are optimised out.

BurntSushi · February 9, 2021, 1:02pm

If you describe the problems you want to solve in more detail, then I'm sure folks would be happy to help you figure out the specifics. It might be the case that you don't need to deal with graphemes (as defined by Unicode) explicitly. Basic string handling often doesn't need to do it at all.

But yes, Rust's standard library is small and that is intentional. There are lots of reasons for it, but one of them is that std can't just make a breaking change and release a 2.0 without major difficulty. Crates can. Thus, crates are easier to evolve.

As with any tool you use, you should do your due diligence on each crate you bring into your dependency tree.

Germo · February 9, 2021, 1:16pm

I have described my ambitious plans. But there was no answer which framework could be suitable (Maybe Seed?):

ZiCog · February 9, 2021, 1:23pm

As far as I can tell the complexity here is not in Rust but in Unicode. Unicode is a horribly complex standard and a difficult thing to deal with.

C++ can't handle unicode either, and that language has been around far longer.

However, when it comes to reading and writing strings from SQL I have not had to deal with any of that complexity. It's all just bytes right?

2e71828 · February 9, 2021, 1:32pm

This seems like TSQL is either using the term grapheme differently from the way Unicode does or that it doesn't support the full set of Unicode graphemes. The CJK block and its extensions contain more than 65k single-codepoint characters before even considering the craziness that can come from combining marks.

trentj · February 9, 2021, 1:56pm

From a cursory search¹, it seems that MS SQL servers support UCS-2, UTF-16 and more recently UTF-8 with different options and types. These are all Unicode encodings. You should know which one(s) you are using and understand the difference between bytes, code units, code points and graphemes.

In all encodings, a grapheme may comprise any number of code points.

A code point is the type represented by Rust's char type. It is encoded as one or more code units, according to some encoding.

UTF-8 is the encoding used by Rust's str. It is a variable-length encoding where the code unit is a single byte. Each code point takes from 1 to 4 code units, for a total size of 1 to 4 bytes.
UTF-16 is a variable-length encoding where the code unit is 2 bytes. Each code point takes either 1 or 2 code units, for a total size of either 2 or 4 bytes.
UCS-2 is a fixed-length encoding where the code unit is 2 bytes. Each code point takes 1 code unit, which is 2 bytes. Because it is limited to 16 bits, UCS-2, unlike both UTFs, cannot represent all Unicode code points.
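The difference between these units can be checked with the standard library alone. The following is only an illustrative sketch; `sizes` and `fits_ucs2` are hypothetical names, not existing APIs:

```rust
/// Sizes of one string measured in the units described above.
/// Returns (UTF-8 bytes, UTF-16 code units, code points).
fn sizes(s: &str) -> (usize, usize, usize) {
    (s.len(), s.encode_utf16().count(), s.chars().count())
}

/// True if every code point fits in a single 16-bit code unit,
/// i.e. the string could be stored as UCS-2 without loss.
fn fits_ucs2(s: &str) -> bool {
    s.chars().all(|c| c.len_utf16() == 1)
}
```

For `"ø"` this gives 2 bytes, 1 UTF-16 code unit and 1 code point; for `"😀"` it gives 4 bytes, 2 UTF-16 code units and 1 code point, and `fits_ucs2` returns false. An `e` followed by the combining acute accent U+0301 is 2 code points but displays as one grapheme, which none of these counts capture.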

I don't find any documentation on using TSQL for working with graphemes or grapheme clusters, so whatever you're doing, that's probably not it. You're probably talking either about code points or code units, both of which Rust's standard library is adequate to deal with (although perhaps more convenient with certain crates). But you need to figure out which it is.

¹ Collation and Unicode support - SQL Server | Microsoft Docs

Germo · February 9, 2021, 2:04pm

My statement that in SQL Server a grapheme is 2 byte was wrong. I also looked again and saw that I wrote nonsense there.

However, I never thought about how Unicode is stored. Because I can work with graphemes via the string functions. So I can use a SQL Server string like a vector and as a developer I don't need to think about how that is stored internally. So I can say: give me a substring starting at position 7 with length 5. Similar with replace.
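For comparison, a request like "substring starting at position 7 with length 5" could be sketched in Rust as below. This assumes positions count code points and are 1-based as in T-SQL; `substring` is a hypothetical helper, and a start below 1 is simply treated as 1 here, which is not exactly what T-SQL does.

```rust
/// Hypothetical SUBSTRING-like helper: `start` is 1-based, and both
/// `start` and `len` count code points (`char`s), not bytes.
fn substring(s: &str, start: usize, len: usize) -> &str {
    // Find the byte offset where the requested start position begins.
    let tail = match s.char_indices().nth(start.saturating_sub(1)) {
        Some((i, _)) => &s[i..],
        None => return "", // start is past the end of the string
    };
    // Cut the tail after `len` code points, if it is that long.
    match tail.char_indices().nth(len) {
        Some((i, _)) => &tail[..i],
        None => tail,
    }
}
```

`substring("héllo wörld", 7, 5)` returns `"wörld"` without building any intermediate vector. For replacing text, `str::replace` already exists in the standard library: `"héllo wörld".replace("wörld", "Rust")` yields `"héllo Rust"`.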

BurntSushi · February 9, 2021, 2:24pm

I meant specific string handling problems that you need help with. That post is more about seeking general advice. I'm talking about a specific string handling problem that you think graphemes are a solution to. Showing some code, what you've tried, expected outputs and so on would be appropriate here. Without specifics, we can just chase our tails in generality indefinitely.

As to your post about web forms with automatic SQL server integration and Windows GUI apps, I would say that that area of Rust is probably in the "very early adopter" phase. So if you use Rust for something like that, you should probably be expecting to blaze your own trail.
