Python-like string in Rust

Is there a crate that implements an optimized string data type made of code points (Unicode scalar values), for use in scripting languages like Python and mine (not done yet)? (In case anyone doesn't know, Python's str data type consists of code points.)

Something like:

let s = SVString::from("a\u{10FFFF}");
assert_eq!(s.len(), 2);
assert_eq!(s[1], '\u{10FFFF}');

With interoperability with std::string::String.

(I'd use it for my language. I'm not implementing the runtime yet, but it'll be useful to know.)

Rust's String already represents a sequence of chars, which are Unicode scalar values. let s = "a\u{10FFFF}" compiles as-is. You can iterate with .chars(). If you want constant-time indexing, you can use Vec<char>, which uses more memory.

fn main() {
    let s = "a\u{10FFFF}".chars().collect::<Vec<char>>();
    assert_eq!(s.len(), 2);
    assert_eq!(s[1], '\u{10FFFF}');
}


@mdHMUpeyf8yluPfXI @jbe

Actually, String dereferences to str, which internally consists of UTF-8 code units. Right, Rust string data types can be iterated with chars(), but I'm rather looking for something like this PEP: PEP 393 – Flexible String Representation | peps.python.org

The Unicode string type is changed to support multiple internal representations, depending on the character with the largest Unicode ordinal (1, 2, or 4 bytes). This will allow a space-efficient representation in common cases, but give access to full UCS-4 on all systems.

If I use Vec<char>, then the string at runtime for my scripting language will always take 4 bytes for each character.

What happens in Python at runtime is that the string data type stores every character as either 1 byte, 2 bytes or 4 bytes (Latin-1, UCS-2 or UCS-4). As the quote says, it depends on the largest Unicode ordinal in the string. Rust's string data type is simply UTF-8, so it's not efficient for my case.
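To make the trade-off concrete, here is a minimal sketch of how a PEP 393-style representation could pick its width from the largest code point. The names width_for and flexible_size are hypothetical, and object headers are ignored, so the numbers only describe the character payload:

```rust
/// Bytes per code point that a PEP 393-style layout would pick (hypothetical helper).
fn width_for(s: &str) -> usize {
    // The widest code point decides the width for the whole string.
    let max = s.chars().map(|c| c as u32).max().unwrap_or(0);
    if max <= 0xFF {
        1 // Latin-1
    } else if max <= 0xFFFF {
        2 // UCS-2
    } else {
        4 // UCS-4
    }
}

/// Payload size in bytes for the flexible layout, headers ignored.
fn flexible_size(s: &str) -> usize {
    s.chars().count() * width_for(s)
}
```

For "a\u{10FFFF}" this gives 8 bytes, where UTF-8 needs 5 and Vec<char> also needs 8; for a Latin-1 string such as "naïve" it gives 5 bytes against 6 in UTF-8.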

I have also made such a flexible string type myself, for instance: GitHub - matheusdiasdesouzads/rust-scalar-value-string: Scalar Value String for Rust (a month ago)

But I'd like something yet more optimized.

I would like to know what case that is. Do you have an example? Have you benchmarked this? Is it really slower in Rust than Python?


Rust stores code points quite efficiently in a String or str, because each character only takes as many bytes as required by UTF-8. E.g. the string "Hello" takes 5 bytes only.

The only inefficient thing is when you want to access code points.
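As a rough illustration of that cost, the sketch below looks up a code point by index in a plain &str. The helper names char_at and byte_offset_of are made up for this example; both have to walk the string from the start, so each lookup is O(n) in the index:

```rust
/// Returns the code point at `index`, walking from the start of the string (O(n)).
fn char_at(s: &str, index: usize) -> Option<char> {
    s.chars().nth(index)
}

/// Byte offset at which the code point at `index` starts, if there is one.
fn byte_offset_of(s: &str, index: usize) -> Option<usize> {
    s.char_indices().nth(index).map(|(offset, _)| offset)
}
```

Byte offsets, once known, can be used for constant-time slicing, which is why code that keeps offsets around instead of code point indices rarely pays this cost.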

I'm not sure if I understand you right, but do you want a type which stores the whole string either as ASCII, UTF-16, or UTF-32?


The only thing that you need to know about Python is that you cannot use Unicode that way:

$ python
Python 3.10.9 (main, Dec  7 2022, 13:47:07) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> "naïve" == "naı̈ve"
False
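The same thing can be seen from Rust. In this sketch the two spellings are written with escapes, assuming the first literal uses the precomposed U+00EF and the second uses U+0131 followed by the combining U+0308, as they appear to in the session above. The helper code_points is hypothetical:

```rust
/// Collects the code points of a string as `u32` values.
fn code_points(s: &str) -> Vec<u32> {
    s.chars().map(|c| c as u32).collect()
}
```

code_points("na\u{ef}ve") has 5 entries and code_points("na\u{131}\u{308}ve") has 6, so the two strings compare unequal and the same index lands on different characters, even though a reader sees nearly the same word.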

Why do you want to do that? What's the point? I guess it wouldn't be too hard to create something like that, but it's very un-Rust-like.

Rust tries to do things correctly, and Python-style: “hey, everything is simple… except it isn't” is just not something I see embraced by Rust.

You'll have to choose between constant-time char indexing and a memory footprint smaller than that of char. You can't have both.

I'm creating a compiler for my language and will implement its runtime somewhere in the future, probably in Rust. The string data type there will be similar to Python's.

Yes, so that I can use it for my multi-purpose scripting language. Looking up code points by index in a UTF-8 string can be inefficient, depending on the use case.

I've just said what I want above: implement the runtime for my scripting language in Rust, so I need some compliant string type.

I believe there may be cases where something like this could be faster:

enum FlexString {
    Ascii(Vec<u8>), // must not contain any value >= 0x80
    Utf16(Vec<u16>), // must not contain surrogates
    Unicode(Vec<char>),
}

But I think it heavily depends on the use case. And I don't know of any crate providing this.
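A minimal sketch of how such an enum could be filled in and indexed, under the invariants stated in the comments. The constructor new and the methods len and get are assumptions made for illustration, not the API of any existing crate:

```rust
#[derive(Debug, PartialEq)]
enum FlexString {
    Ascii(Vec<u8>),     // every value is < 0x80
    Utf16(Vec<u16>),    // BMP only, never contains surrogates
    Unicode(Vec<char>), // anything else
}

impl FlexString {
    /// Picks the narrowest variant that can hold every code point of `s`.
    fn new(s: &str) -> Self {
        let max = s.chars().map(|c| c as u32).max().unwrap_or(0);
        if max < 0x80 {
            FlexString::Ascii(s.bytes().collect())
        } else if max <= 0xFFFF {
            // A char is never a surrogate, so each one fits a single u16 here.
            FlexString::Utf16(s.chars().map(|c| c as u32 as u16).collect())
        } else {
            FlexString::Unicode(s.chars().collect())
        }
    }

    /// Number of code points, in constant time.
    fn len(&self) -> usize {
        match self {
            FlexString::Ascii(v) => v.len(),
            FlexString::Utf16(v) => v.len(),
            FlexString::Unicode(v) => v.len(),
        }
    }

    /// Code point at index `i`, in constant time.
    fn get(&self, i: usize) -> Option<char> {
        match self {
            FlexString::Ascii(v) => v.get(i).map(|&b| b as char),
            FlexString::Utf16(v) => v.get(i).and_then(|&u| char::from_u32(u as u32)),
            FlexString::Unicode(v) => v.get(i).copied(),
        }
    }
}
```

With this sketch, FlexString::new("a\u{10FFFF}") has len() 2 and get(1) returns Some('\u{10FFFF}'), matching the example in the question. Note that get returns the char by value: an Index impl returning &char is not possible for the narrow variants, because no char is stored in them.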


Well, this is what the crate I made earlier does: GitHub - matheusdiasdesouzads/rust-scalar-value-string: Scalar Value String for Rust. I'd refactor a few things though: use Gc instead of Arc and rename SvString to SVString.

Hmm, but I also wanted to split the string internally, in case there are millions of characters in mixed ranges. See: optimization - String representation in Python runtimes - Software Engineering Stack Exchange

For having true interoperability, you would need to have a cheap reference-to-reference conversion to &str. This only works if you internally store a copy of the string as UTF-8 (consecutively) in memory.

Thus, you might store a UTF-16 or UTF-32 representation additionally, but you will at least need to have the UTF-8 string in memory if you want to have full "interoperability" (i.e. being able to pass your string to a function expecting &str without having to allocate first).


That's obviously the only valid use for such a structure, but then you don't have to provide anything with a String-like interface on the Rust side. In particular, that approach means that a change to one single character may suddenly reallocate and change the whole string.

That's fine if you want to duplicate Python's non-solution to a non-problem, but in that case the interface on the Rust side is not all that important. And since it's such a narrow use case, I'm not sure people would dedicate much time to optimizing it.

I've only seen that after I wrote an answer.

No, you don't need it. Since it would be stupid to use it anywhere except when you want to duplicate Python's non-solution to a non-problem, you would want it used narrowly, specifically only in the implementation of your language runtime.

It's not hard to create such a type, since you don't need interoperability with &str/String.

You can also look at how V8 handles strings. Only, once you've started on the “useless busywork for no good reason” road, you may spend the rest of your life there.

But the important thing to note is that you usually only need this in the language runtime, which limits the interface of your string but makes it possible to do lots of fancy pseudo-optimizations.

How your GC works, how most of the code written in your language works, there are lots of moving parts.

You may use Cow-style interface and create UTF-8 representation on demand.
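One way to read that suggestion, as a sketch: borrow when the stored bytes are already valid UTF-8 and allocate otherwise. The enum mirrors the FlexString from earlier in the thread, and the method name as_str is an assumption for illustration:

```rust
use std::borrow::Cow;

enum FlexString {
    Ascii(Vec<u8>),     // every value is < 0x80
    Utf16(Vec<u16>),    // BMP only, never contains surrogates
    Unicode(Vec<char>), // anything else
}

impl FlexString {
    /// Returns a UTF-8 view, allocating only when the storage is not UTF-8 already.
    fn as_str(&self) -> Cow<'_, str> {
        match self {
            // ASCII is a subset of UTF-8, so this borrows when the invariant holds.
            FlexString::Ascii(bytes) => String::from_utf8_lossy(bytes),
            // Wider variants have to be re-encoded into a fresh String.
            FlexString::Utf16(units) => Cow::Owned(String::from_utf16_lossy(units)),
            FlexString::Unicode(chars) => Cow::Owned(chars.iter().collect()),
        }
    }
}
```

Callers can pass &*value.as_str() to anything that expects &str. Only the ASCII case is free; the other two pay for an allocation on every call unless the runtime caches the result.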

But everything would depend on how often you need to convert between that language-runtime type and str, on how GC works in that scripting language, and on a bazillion other things.

Thus I don't think it ever makes sense to create something like that without tying it to the runtime of said language.

Yeah, I know I'll only use such a type to allow execution of my language's code. And, yes, I may want to convert this SVString into String or &str depending on the context... And there's overhead, which I don't mind.

Regardless of whether your desired data structure would be the best fit for your use case, your requirements are so precise that it's unlikely anyone has made such a library. Have you run into any real performance issues with your current implementation of the data structure? If not, worrying about it amounts only to premature optimization.


In the end, since my language only has a parser so far, I decided to change the string representation to UTF-8 so that I don't have to deal with this optimization case. I'm not even sure Python has been used successfully as a multi-purpose scripting language.