Some traits to ease the use of Vec of chars as a String

Hello,

I have made some traits that ease the use of Vec<char> by beginners, and I implemented many functions and methods for them. I intend to teach Rust to my daughter and also to some friends, so using this code, at least at the beginning, will help. I also think it is easier for small, algorithm-intensive code that does not need to be high performance but where you would like random access in O(1).
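
To illustrate the O(1) random access mentioned above, here is a minimal sketch. The trait `ToCharVec`, its method `to_char_vec` and the helper `chars_to_string` are hypothetical names made up for this example; they are not the API of the repository linked below.

```rust
/// Hypothetical helper trait for this sketch only
/// (not the API of the linked repository).
trait ToCharVec {
    fn to_char_vec(&self) -> Vec<char>;
}

impl ToCharVec for str {
    fn to_char_vec(&self) -> Vec<char> {
        // One O(n) pass that decodes the UTF-8 bytes into Unicode scalar values.
        self.chars().collect()
    }
}

/// Collects a slice of chars back into a String.
fn chars_to_string(chars: &[char]) -> String {
    chars.iter().collect()
}

fn main() {
    // "coração", written with escapes so the encoding is unambiguous.
    let v = "cora\u{e7}\u{e3}o".to_char_vec();
    // Indexing a Vec<char> is O(1); str offers no indexing by character.
    println!("{}", v[4]);
    // Slice by char positions, then convert back to a String.
    println!("{}", chars_to_string(&v[0..4]));
}
```

The price is the up-front O(n) conversion and four bytes per element, which seems acceptable for teaching code.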

The license is MIT Open Source.

SubStrings Slices and Random String Access in Rust
https://github.com/joaocarvalhoopen/SubStrings_Slices_and_Random_String_Access_in_Rust

This is a very simple thing, and there should be lots of implementations of things like this, but I didn't find any.

Thank you,

Best regards,
João

Something else you could consider, if you don't want to deal with Unicode (because char-by-char doesn't work either), is

Hello,

Thank you @scottmcm for your example, but my daughter and I, like the majority of Portuguese people, speak Portuguese, English and French from school, and understand Spanish and Italian, because they are similar languages that also derive from Latin, like Portuguese.

Maybe one day I will understand the German language.

The problem with ASCII is that I simply can't represent any of those languages properly in plain ASCII.

For example, the Portuguese character "ç" isn't in ASCII; you only have "c".
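
A small sketch of that limitation, using only the standard library. The helper name `fits_in_ascii` is made up for this example.

```rust
/// Returns true if every character of `text` fits in ASCII.
fn fits_in_ascii(text: &str) -> bool {
    text.is_ascii()
}

fn main() {
    // "cabeça", with "ç" written as the precomposed code point U+00E7.
    let word = "cabe\u{e7}a";
    println!("{}", fits_in_ascii(word));
    println!("{}", fits_in_ascii("cabeca"));
    // In UTF-8 the "ç" takes two bytes, so byte length and char count differ.
    println!("{} bytes, {} chars", word.len(), word.chars().count());
}
```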

Thank you,

Best regards,
João

Hello,

and at least in those cases of Portuguese it appears to work.

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=90da3e7096647b4d42d539f82476d229

Thank you,

Best regards,
João

Well, this gets into the problems with doing Unicode -- it depends how you write it, and in ways that you can't see. I can make almost all of those not work again as https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=557f15fc85622d32e561fe877e14a3bc:

It's, of course, up to you to decide whether you're ok with ignoring those problems for your exercises.

But if you're wondering why the standard library doesn't provide any helpers for dealing with slices of char, things like this are why.
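
As a minimal sketch of the kind of breakage meant here (this is not the code from the playground link, and `reverse_by_char` is a name made up for the example), the same visible text can be stored in two ways, and reversing it char by char only works for one of them:

```rust
/// Reverses a string one `char` (Unicode scalar value) at a time.
fn reverse_by_char(text: &str) -> String {
    text.chars().rev().collect()
}

fn main() {
    // Precomposed: "ç" is the single code point U+00E7.
    let precomposed = "a\u{e7}o";
    // Decomposed: "c" followed by U+0327 COMBINING CEDILLA; it renders the same.
    let decomposed = "ac\u{327}o";

    // The two strings look identical on screen but are not equal.
    println!("{}", precomposed == decomposed);
    println!("{}", reverse_by_char(precomposed));
    // Here the cedilla ends up after the "o" instead of after the "c".
    println!("{:?}", reverse_by_char(decomposed));
}
```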

If, on the other hand, you decide you'd like to tackle these problems too, then the usual recommendation seems to be the unicode-segmentation crate (listed on Lib.rs).

Not a technical comment but more of a principle/pedagogy related one: if you are teaching a beginner, you should not socialize her/him in the incorrect "strings are just arrays of chars" mindset. S/he will have trouble un-learning all the wrong assumptions you plant with this.

If you are trying to teach people how to deal with arrays, then that's simple – use arrays! Use arrays of numbers, booleans, enums, tuples – there are many ways of showing useful examples of array-based algorithms without teaching your student something that is technically incorrect.

For the time being, treat strings as atomic types. Don't try to get into "well a string is an array of chars but it actually isn't, and there is no such thing as a character, and even if there is, then the char type isn't that".

Later, when s/he is comfortable with the ideas of arrays, collections, iterators in general, you can tell her/him the whole story. With Unicode being a multi-level variable-width encoding, the difference between bytes and code points and code units and grapheme clusters, encodings, the ambiguity of "the length" of a string, grapheme width, and all of that fluff. But don't actively mislead her/him. That's what """easy""" languages have been doing for half a century, and the results are disastrous. If we were to act as responsible professionals, we should teach our children properly, and not lie to them for the sake of questionable simplicity or alleged comfort.
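
The ambiguity of "the length" can be sketched with the standard library alone. The function name `lengths` is made up for this example, and grapheme clusters are left out because counting them needs an external crate.

```rust
/// Returns (UTF-8 bytes, UTF-16 code units, Unicode scalar values) for `text`.
fn lengths(text: &str) -> (usize, usize, usize) {
    (text.len(), text.encode_utf16().count(), text.chars().count())
}

fn main() {
    // U+00E7 (ç) followed by U+1F600 (an emoji outside the Basic Multilingual Plane).
    let (bytes, utf16_units, scalars) = lengths("\u{e7}\u{1F600}");
    println!("{} bytes, {} UTF-16 units, {} chars", bytes, utf16_units, scalars);
}
```

Three different numbers come out for the same two visible symbols, and none of them is "the" length.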

Hello @H2CO3, thank you for your input.

In principle you are completely correct, but the same thing can be said about physics: why do we teach Newtonian physics to children in high school when we know that Quantum Field Theory and General Relativity are more correct abstractions of reality?

Because in practice those children are not ready for the inherent complexity of the better models, known to physicists, of how the world really works. But you can give children the simple models, tell them that there is a more precise and more complex representation of reality, and even try to explain to them, in really high-level and general terms, some of the differences between the different models of reality.

I think that the same principle applies to teaching beginners the complexities of text processing with strings.

Best regards,
João

That's neither a good analogy, nor a fair comparison.

I did write in my previous comment that

I.e., I fully acknowledged that you shouldn't expose the student to the complexity of the subject all at once. Because in programming, fortunately, we do have a choice. We can just use arrays to teach arrays, and do so 100% technically correctly.

In physics, however, we don't really have a choice, as there is no separate 100% correct and simple way of teaching a little piece of objective truth (and even QM isn't the complete truth, since there are many missing pieces in basic research). So we necessarily have to resort to approximations and simplifications – which, however, are still immensely useful and basically correct in everyday life (mechanical engineers who design car engines don't need to consider QM).

However, this is absolutely not the case with text processing and Unicode. Teaching that a string is just an array of characters is not even approximately true, nor is it useful in practice. As others have already demonstrated with examples, it's trivial to come up with realistic scenarios where the assumptions break down quickly, in everyday, non-contrived text processing tasks.

Hence, the two fields have very different requirements, the two subjects are very different, and there is no reason we should not go for a better method of teaching if we are so fortunate that we have it at hand.

Hello @H2CO3, thank you again for your input.

But I really think it's a valid analogy, and I will give you a similar comparison, this time in CS. How many CS engineers really know the ins and outs of IEEE 754, the IEEE Standard for Floating-Point Arithmetic that almost every core uses? There are complexities, problems and peculiarities waiting in almost every piece of numerical code in existence. But in reality, every person who needs, or thinks they are using, an infinite-precision decimal representation uses f32 or f64 in their daily work without regard for all the intricacies of the IEEE 754 standard.

There are many developers who don't even know the difference between a floating-point representation and a decimal representation, or that the latter has to be used, for example, for monetary or financial code.
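
A short sketch of why this matters for money. Storing amounts as whole cents in an integer is a stand-in I chose for a real decimal type, and `add_cents` is a made-up name.

```rust
/// Adds two amounts stored as whole cents; integer addition is exact.
fn add_cents(a: i64, b: i64) -> i64 {
    a + b
}

fn main() {
    // 0.1 and 0.2 have no exact binary representation, so the sum is not exactly 0.3.
    let sum = 0.1_f64 + 0.2_f64;
    println!("{}", sum == 0.3);
    // 10 cents + 20 cents is exactly 30 cents.
    println!("{}", add_cents(10, 20) == 30);
}
```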

The same could be said for other arithmetic representations, such as interval arithmetic, in which the error inherent in the finite precision of each mathematical operation is propagated and can be quantified at the end of a calculation.

Best regards,
João

It's of no particular importance how many people actually happen to know it. Anyone who uses floats should know about their most important caveats. The difference is not only a theoretical exercise – again, it is of practical importance, because it affects for example numerical stability. One might get completely bogus results from certain computations (simulations, stiff differential equations, or simple arithmetic where differences happen to be tiny compared to the subtrahends) if unaware of the intricacies of floating-point numbers.
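
A tiny sketch of such a bogus result, assuming f64 and a made-up function name `add_then_subtract`: a small term is lost when it is added to a much larger one, and the following subtraction exposes the loss.

```rust
/// Computes (big + small) - big, which equals `small` in exact arithmetic.
fn add_then_subtract(big: f64, small: f64) -> f64 {
    (big + small) - big
}

fn main() {
    // Near 1e16 neighbouring f64 values are 2 apart, so adding 1.0 is rounded away.
    println!("{}", add_then_subtract(1e16, 1.0));
    // With operands of similar magnitude the result is as expected.
    println!("{}", add_then_subtract(1e3, 1.0));
}
```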

That many people don't know or don't care about these caveats is unfortunate, but it is a non sequitur to conclude that therefore they wouldn't need to know, either. They need to, and professionals should strive to teach enthusiastic but less knowledgeable people about the right way to use tools. We shouldn't recommend sloppiness, because it will lead to errors and suffering. The fact that it happened before is not an excuse for continuing bad practice.

For most European languages, as they should be encoded in most cases, a Vec<char> will actually work quite well. Most accented letters can be written as "letter" + "combining accent", which is two Unicode code points (or two char elements), but can also be written as a single "accented character", which is one code point and one char element. And to me, as a native Swedish speaker, that is the correct way of writing them. There are multiple ways to write something that looks like an a with a ring above, but there is one way to write the actual character å, and that fits in a single char.
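
The two spellings can be compared directly with the standard library. This is only a sketch, and `char_count` is a made-up helper name.

```rust
/// Counts the `char` elements (Unicode scalar values) in `text`.
fn char_count(text: &str) -> usize {
    text.chars().count()
}

fn main() {
    // The actual character å: one code point, U+00E5.
    let precomposed = "\u{e5}";
    // "a" followed by U+030A COMBINING RING ABOVE: looks the same, two code points.
    let decomposed = "a\u{30a}";

    println!("{} vs {}", char_count(precomposed), char_count(decomposed));
    println!("{}", precomposed == decomposed);
}
```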

Now, some environments tend to provide strings in one form and some in another, but there is a crate, unic_normal, that makes it easy to convert a string with any form of code points to a string of the form you prefer, which should make it easy to e.g. reverse the letters of a word or check if two words are anagrams or any other such example ...
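
As a sketch of the anagram case, assuming both inputs are already in the same precomposed form (the normalization step itself is omitted here, and the function names are made up):

```rust
/// Returns the chars of `word` in sorted order.
fn sorted_chars(word: &str) -> Vec<char> {
    let mut chars: Vec<char> = word.chars().collect();
    chars.sort_unstable();
    chars
}

/// Two words are anagrams if they contain the same chars, ignoring order.
/// Assumes both inputs use the same normalization form.
fn are_anagrams(a: &str, b: &str) -> bool {
    sorted_chars(a) == sorted_chars(b)
}

fn main() {
    // Both precomposed: the same letters in a different order.
    println!("{}", are_anagrams("ra\u{e7}a", "a\u{e7}ar"));
    // Same visible word, one decomposed: the comparison fails without normalization.
    println!("{}", are_anagrams("rac\u{327}a", "ra\u{e7}a"));
}
```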

Another way, which may be enlightening, is to try to convert the string you work with to ISO-8859-15 or a similar encoding that was used in your country before the switch to Unicode. Then you can easily work with "characters as bytes" in the expected language, while being made aware that some strings contain other stuff that the program being written can't handle. Then if someone enters text in Chinese or Greek, the program can respond with "sorry, this doesn't look like Portuguese to me" or similar.

A beginner doesn't need to handle all the strange cases that can come up perfectly, but it is a good thing to be made aware that strange cases can come up, and to at least handle them as errors. That is why Rust is actually a great language for beginners. :)
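
A minimal sketch of that idea with the standard library only. It targets ISO-8859-1 rather than ISO-8859-15, because the first 256 Unicode code points map directly onto ISO-8859-1 bytes; ISO-8859-15 differs in a handful of positions (it has the euro sign, for example) and would need a small lookup table. The name `to_latin1` is made up.

```rust
/// Converts `text` to ISO-8859-1 bytes, or returns the first char that does not fit.
fn to_latin1(text: &str) -> Result<Vec<u8>, char> {
    text.chars()
        .map(|c| {
            if u32::from(c) <= 0xFF {
                // Code points U+0000..=U+00FF are exactly the ISO-8859-1 bytes.
                Ok(c as u8)
            } else {
                Err(c)
            }
        })
        .collect()
}

fn main() {
    // "maçã" with precomposed letters, then two Greek letters.
    for input in ["ma\u{e7}\u{e3}", "\u{3b1}\u{3b2}"] {
        match to_latin1(input) {
            Ok(bytes) => println!("{} bytes, one per character", bytes.len()),
            Err(c) => println!("sorry, {:?} doesn't look like Portuguese to me", c),
        }
    }
}
```

Note that a decomposed "c" + combining cedilla is rejected too, which is a visible error rather than a silent bug.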

Thank you @Kaj, @H2CO3 and @scottmcm. Each of you contributed so that my little traits on Vec<char> for teaching Rust to my daughter became better and more resilient to problems like reverse order and others.

I used the crate that @Kaj suggested for Unicode normalization and added it to get_vec_chars(), the method that does the conversion from String and from &str to Vec<char>.

Then I used the examples that @scottmcm created, and the ones that I created and that he then modified, to see how they behave with the crate. As @Kaj said, for Portuguese on my computer, Unicode as Vec<char> works pretty well for the things that I type on my keyboard.

The problematic characters never appear while I type.

But with the normalization even that problem disappeared.

Then I went over the 1100 lines of my little code and changed it here and there to consistently use get_vec_chars(). I ran my tests and all passed, nice!

So, in conclusion, thank you all for helping me transform this little code for teaching my daughter Rust into something that has fewer problems than it used to have. With this change, the performance of the code is lower because it has to make an intermediate copy as a String. But in this case the best performance isn't the main goal.

Thank you,

Best regards,
João

Hello again,

This is a small program that I have made with the Vec<char> small "lib".

less_fp - Simple less with fixation points.
https://github.com/joaocarvalhoopen/less_fp_-_Simple_less_with_fixation_points

Thank you,

Best regards,
João