Is there a Rust way to modify a String in place?

I also heard that Apple is working on a private cloud feature of a processor that allows encrypted data in memory. They even offer a $1M bounty if someone breaks their protection. So right, soon UTF-8 will look like a joke.

UTF-8 has nothing to do with encryption.

It feels like you're trolling.

I think they're arguing that memory is so cheap (by various examples), we should store String as an array of char for better usability. Is that right @MOCKBA?

Absolutely, why waste CPU cycles on the extra processing required by UTF-8? I doubt that there are many languages using UTF-8 for string representation in memory. I can understand the initial intention, but the reality is different. Rust has tons of good features, but error handling and strings are not the bright spots of the language.

My point was that many more cycles are wasted, and latency is increased, by CPU cache misses during char processing when a String uses more memory, especially 4X as much. This occurs no matter how much memory you have, because the memory bus is limited and CPU caches are limited.
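To put a rough number on the "4X" figure, here is a minimal sketch comparing the UTF-8 payload of a string with what the same text would occupy as an array of `char`. The helper names `utf8_bytes` and `char_array_bytes` are made up for illustration, and the sketch only counts bytes; it does not measure cache behavior.

```rust
use std::mem::size_of;

/// Bytes used by the UTF-8 payload of a string slice.
fn utf8_bytes(s: &str) -> usize {
    s.len()
}

/// Bytes the same text would need as a slice of `char`
/// (Rust's `char` is always 4 bytes wide).
fn char_array_bytes(s: &str) -> usize {
    s.chars().count() * size_of::<char>()
}

fn main() {
    let text = "Hello, world!";
    println!("UTF-8 : {} bytes", utf8_bytes(text));
    println!("[char]: {} bytes", char_array_bytes(text));
}
```

For pure ASCII text the `char` array is exactly four times larger, so four times as many cache lines have to be pulled in to scan it.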

Also if memory is encrypted then accessing memory becomes even more costly. So I don't see how memory encryption was supposed to be an argument in favor of UTF-32.

You're right. What did you mean, @MOCKBA?

I'm currently porting an application from C++ to Rust. I'm not doing it to get more robust code; performance is the only thing I'm considering. So far, the Rust version is still behind the C++ one, which uses wchar for strings. The results are consistent on x64 and Arm64 platforms. The performance gap is about 2%. An interesting fact is that the Rust code was initially 20% faster than the C++ code, so the C++ code needed tuning as well. Since I'm still racing with C++, I'm trying to write Rust code with minimal overhead. I'm glad to know that Rust gives me the advantage you mentioned, but it isn't the key.

So much so that Java strings are no longer just UTF-16, they're (in Rust terms) more like

```rust
enum StringRepr {
    Latin1(Box<[u8]>),
    UTF16(Box<[u16]>),
}
```

and checking whether the string is ISO-8859-1 or UTF-16 on every access is still faster overall than always using UTF-16.
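As a rough illustration of that idea (not how the JVM actually implements it), the enum can be fleshed out so that construction picks the narrow representation when every code point fits in one byte, and each access branches on the variant. The methods `new`, `code_unit`, and `byte_len` are hypothetical names for this sketch.

```rust
/// Two-representation string, loosely modeled on the enum above.
enum StringRepr {
    Latin1(Box<[u8]>),
    UTF16(Box<[u16]>),
}

impl StringRepr {
    /// Pick Latin-1 when every code point is <= U+00FF, else UTF-16.
    fn new(s: &str) -> Self {
        if s.chars().all(|c| (c as u32) <= 0xFF) {
            StringRepr::Latin1(s.chars().map(|c| c as u8).collect())
        } else {
            StringRepr::UTF16(s.encode_utf16().collect())
        }
    }

    /// Every access pays one branch on the representation.
    fn code_unit(&self, i: usize) -> Option<u16> {
        match self {
            StringRepr::Latin1(bytes) => bytes.get(i).map(|&b| u16::from(b)),
            StringRepr::UTF16(units) => units.get(i).copied(),
        }
    }

    /// Size of the stored payload in bytes.
    fn byte_len(&self) -> usize {
        match self {
            StringRepr::Latin1(bytes) => bytes.len(),
            StringRepr::UTF16(units) => units.len() * 2,
        }
    }
}

fn main() {
    let western = StringRepr::new("caf\u{e9}");
    let chinese = StringRepr::new("\u{4f60}\u{597d}");
    println!("western payload: {} bytes", western.byte_len());
    println!("chinese payload: {} bytes", chinese.byte_len());
    println!("first unit: {:?}", western.code_unit(0));
}
```

Note that the choice is made per string from its contents, so a single program can hold both kinds at once.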

Just to be sure you know, since I didn't explain, the code I posted for replacing a char, although it works for multibyte chars, will not cause reallocation when replacing with a char of the same length.
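For readers who missed the earlier post, here is a minimal sketch of one way to do such a replacement with `String::replace_range`. This is not necessarily the code referred to above, and `replace_first_char` is a made-up helper name. When the old and new chars have the same UTF-8 length, the bytes are overwritten inside the existing buffer.

```rust
/// Replaces the first occurrence of `from` with `to` inside `s`.
/// Returns true if a replacement happened.
fn replace_first_char(s: &mut String, from: char, to: char) -> bool {
    match s.find(from) {
        Some(start) => {
            // `find` returns a byte offset; the char spans `len_utf8()` bytes.
            let end = start + from.len_utf8();
            let mut buf = [0u8; 4];
            s.replace_range(start..end, to.encode_utf8(&mut buf));
            true
        }
        None => false,
    }
}

fn main() {
    // Both U+00EF and U+00E9 take two bytes in UTF-8.
    let mut s = String::from("na\u{ef}ve");
    let capacity_before = s.capacity();
    let replaced = replace_first_char(&mut s, '\u{ef}', '\u{e9}');
    println!("replaced: {replaced}, result: {s}");
    println!("capacity unchanged: {}", s.capacity() == capacity_before);
}
```

If `to` is longer than `from`, the same call still works, but the tail of the string is shifted and the buffer may grow.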

Until you do proper benchmarking, you can't be certain that a given mechanism will have a real performance impact. For example, I recently tried branchless programming, which is supposed to reduce pipeline flushes, but the actual performance dropped by 15%. So it is clearly hype. I'm afraid that UTF-8 encoded strings in memory are similar. Sure, I need to compare OsString and String to be certain. I will do that, but not now.

A good point. Rust should probably just follow the Java approach and choose the storage dynamically based on your region. For example, if you are in Germany, all strings are Latin1, but when you move to China, your strings become UTF-16.

It's obvious, do not worry.

Right. One of the properties of UTF-8 is that ASCII characters can never be confused for parts of multibyte codepoints. That's so that old C code or whatever can interpret newlines or NUL terminators properly and so that the multibyte codepoints don't require special handling in a lot of older string algorithms.
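A small sketch of that property: every byte of a multibyte UTF-8 sequence has its high bit set, so a byte-oriented scan for an ASCII byte such as `\n` can never match in the middle of a character. The function names below are illustrative only.

```rust
/// True if every byte of every non-ASCII char in `s` is >= 0x80,
/// i.e. no multibyte sequence contains a byte that looks like ASCII.
fn multibyte_bytes_are_non_ascii(s: &str) -> bool {
    s.chars().filter(|c| !c.is_ascii()).all(|c| {
        let mut buf = [0u8; 4];
        c.encode_utf8(&mut buf).bytes().all(|b| b >= 0x80)
    })
}

/// Byte-oriented newline count, the way old C-style code would do it.
fn count_newline_bytes(s: &str) -> usize {
    s.bytes().filter(|&b| b == b'\n').count()
}

fn main() {
    let text = "\u{4f60}\u{597d}\nw\u{f6}rld\n";
    println!("high bit set on all multibyte bytes: {}", multibyte_bytes_are_non_ascii(text));
    println!("newlines found by byte scan: {}", count_newline_bytes(text));
    println!("newlines found by char scan: {}", text.matches('\n').count());
}
```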

I think you're jumping to conclusions much too quickly. Branchless programming not working in one instance is not a suitable basis for concluding that branchless programming is hype.

That's a fair observation; the benefits of any approach can only be verified in a particular case. For me, the increased development time and reduced readability of the code were the key factors. So I switched to more important tasks, such as adding new features to the product. I will probably reconsider the conclusion when I have some spare time.

I've never found the "we have more RAM so let's waste it" argument very compelling. People got more RAM because it allowed them to do their tasks better, whatever those tasks were. Which means the extra RAM is already being put to use, so the last thing we need is to waste it.

Besides, there are still tons of people out there on older hardware, and even newer hardware with low specs (I'm one of them). Saying "screw you" to all those people just isn't okay in my opinion.

And... that's not at all what OsString is. It's not UTF-32, it's whatever the OS uses. Except it's not even that, it's some cursed variation. So I don't know where that came from at all.

You seem to be an English speaker. For me, as someone who speaks Chinese, UTF-8 is a waste.

True, it's much less worthwhile for certain other languages. But the most you could argue from that is that we could use different string types for different languages, not that Rust should default to wasting space for English, rather than wasting it for Chinese.

It's not.

  1. If you have markup in the file, UTF-8 is often smaller anyway. For example, HTML pages in Asian languages are typically larger in UTF-16 than in UTF-8, because the space saved on the markup outweighs the extra space spent on the text. (This is particularly true if you ever use data: URLs.)
  2. If you actually care about wasting space, use compression, which saves far more space than a different encoding ever will.

https://utf8everywhere.org/#asian
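A quick way to check the markup point yourself is to compare encoded sizes directly. This is only a sketch with a tiny made-up HTML snippet, not a measurement of real pages; `utf8_len` and `utf16_len` are illustrative helper names.

```rust
/// Size in bytes when encoded as UTF-8.
fn utf8_len(s: &str) -> usize {
    s.len()
}

/// Size in bytes when encoded as UTF-16 (2 bytes per code unit).
fn utf16_len(s: &str) -> usize {
    s.encode_utf16().count() * 2
}

fn main() {
    // Four CJK characters: 3 bytes each in UTF-8, 2 bytes each in UTF-16.
    let bare = "\u{4f60}\u{597d}\u{4e16}\u{754c}";
    // The same text wrapped in a small amount of ASCII markup.
    let html = "<p class=\"greeting\">\u{4f60}\u{597d}\u{4e16}\u{754c}</p>";

    println!("bare text: UTF-8 {} vs UTF-16 {}", utf8_len(bare), utf16_len(bare));
    println!("with tags: UTF-8 {} vs UTF-16 {}", utf8_len(html), utf16_len(html));
}
```

The bare text is smaller in UTF-16, but once the ASCII markup is included UTF-8 comes out ahead, because each markup character costs one byte instead of two.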
