Scalar Value string

I made a string type, SvString. It is mainly intended for implementing runtimes for scripting languages such as Python, where a string is a sequence of Unicode Scalar Values (not, say, UTF-8 or UTF-16 code units).

Consult documentation

Just missing:

  • split (accept regex)
  • replace (accept regex)
  • match (accept regex)

The goal with this type is...

  • Manipulate characters like in Python or ECMAScript. With this string type, one should be able to easily port code from Python or ECMAScript (in ES, strings consist of 16-bit UTF-16 code units, while in Python, strings consist of Scalar Values). This string type is closer to Python's str primitive type.
  • Flexible representation, based on this PEP. It stores every character as Latin-1 (one octet), UCS-2 (two octets) or UCS-4 (four octets), depending on the highest code point in the string.
  • Interning, preventing duplicating string contents
  • Freedom to use out-of-bounds indexes. For example, you can do SvString::empty().substr(1..2) == SvString::empty()
  • Operations should return the same string type. For example, SvString::from("ZXC").to_lowercase() returns SvString::from("zxc"), not "zxc" or std::string::String::from("zxc").

You can check the implementation here: https://github.com/matheusdiasdesouzads/rust-scalar-value-string

I've tested it a bit:

let s = SvString::from("foo");
let mut it = s.chars();
assert_eq!(it.next().unwrap(), 'f');
assert_eq!(it.next().unwrap(), 'o');
assert_eq!(it.next().unwrap(), 'o');

let s = SvString::from("abc.zxc");
assert!(s.ends_with(".zxc"));
assert!(s.ends_with('c'));
assert_eq!(s.substr(1..2), SvString::from("b"));

But does it support UTF-8? It doesn't feel flexible otherwise.

Also, does this mean that if I have any 2 or 4 byte character in a String (for example 'è' or similar letters which are common in Europe) then every other character will also occupy 2 or 4 bytes?

This seems really dangerous IMO. If I make some mistake while indexing a string I don't want to silently get wrong results.


Do you have any benchmarks comparing it with the stdlib's implementation or flexstr's?

Looking at your code, it seems you use unsafe to access a static mut containing a RefCell. This seems terribly unsafe (it's almost impossible to use a static mut correctly) and not thread-safe (you're going to have data races with those RefCells).

That's right, it'll use the same size for every character. The representation is based on PEP: https://legacy.python.org/dev/peps/pep-0393/
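To make the width selection concrete, here is a minimal sketch of the idea behind PEP 393. The `Repr` enum and the `choose_repr` and `bytes_per_char` functions are hypothetical names made up for illustration; the crate's real internals may look different. Note that 'è' is U+00E8, which still fits in Latin-1, so a string like "caffè" would stay at one byte per character. A character such as '€' (U+20AC) is what forces two bytes.

```rust
/// Hypothetical storage enum (an assumption, not the crate's actual type).
#[derive(Debug, PartialEq)]
enum Repr {
    Latin1(Vec<u8>), // every code point is <= U+00FF
    Ucs2(Vec<u16>),  // every code point is <= U+FFFF
    Ucs4(Vec<u32>),  // at least one code point is above U+FFFF
}

/// Pick the narrowest fixed-width storage that fits every scalar value.
fn choose_repr(s: &str) -> Repr {
    // The widest scalar value decides the width of *every* element.
    let max = s.chars().map(|c| c as u32).max().unwrap_or(0);
    if max <= 0xFF {
        Repr::Latin1(s.chars().map(|c| c as u32 as u8).collect())
    } else if max <= 0xFFFF {
        Repr::Ucs2(s.chars().map(|c| c as u32 as u16).collect())
    } else {
        Repr::Ucs4(s.chars().map(|c| c as u32).collect())
    }
}

/// Bytes used per character for a given representation.
fn bytes_per_char(repr: &Repr) -> usize {
    match repr {
        Repr::Latin1(_) => 1,
        Repr::Ucs2(_) => 2,
        Repr::Ucs4(_) => 4,
    }
}
```

The trade-off is that indexing by character position is O(1) in all three cases, at the cost of widening the whole string when a single wide character appears.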

Allowing out-of-bounds manipulations makes it easier to adapt code from ECMAScript (for example, say you need to implement Node.js path operations in Rust), and is advantageous in other ways too.
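As a rough illustration of the clamping behaviour, here is a sketch that works on a plain slice of chars. `substr_clamped` is a hypothetical helper written for this post, not the crate's `substr`; it only assumes that out-of-range indexes are clamped to the string length, similar to how `slice` and `substring` behave in ECMAScript. A standard Rust slice would panic on the same ranges.

```rust
use std::ops::Range;

/// Hypothetical helper: take `range` from `chars`, clamping both ends
/// to the valid interval instead of panicking.
fn substr_clamped(chars: &[char], range: Range<usize>) -> Vec<char> {
    // Clamp the end to the length first, then the start to the end,
    // so an inverted or fully out-of-range request yields an empty result.
    let end = range.end.min(chars.len());
    let start = range.start.min(end);
    chars[start..end].to_vec()
}
```

With this rule, asking for 1..2 of an empty string simply gives back an empty string, which matches the SvString::empty().substr(1..2) example from the first post.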

Oh, I wasn't aware this can be problematic. The compiler didn't warn about anything, such as not implementing the Sync trait, so I thought it'd be fine. Do you know of a better replacement for this?

You used unsafe, that's enough of a reason to know what you're doing.

For the lazy initialization you can use the once_cell crate (which is also planned to be added to the stdlib, see https://github.com/rust-lang/rust/issues/74465).

For mutating the HashMap and its contents you should use either a specialized HashMap for concurrency, or a normal HashMap wrapped in a RwLock.

This should allow you to avoid having to use a static mut and unsafe.
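A minimal sketch of that suggestion, with no static mut and no unsafe, could look like the following. It uses `std::sync::OnceLock` (stable since Rust 1.70) in place of the once_cell crate, which is a swap made here only to keep the example dependency-free. The names `POOL`, `pool` and `intern`, and the choice of `HashSet<Arc<str>>`, are assumptions for illustration and not the crate's actual design.

```rust
use std::collections::HashSet;
use std::sync::{Arc, OnceLock, RwLock};

// Hypothetical global pool, lazily initialised on first use.
static POOL: OnceLock<RwLock<HashSet<Arc<str>>>> = OnceLock::new();

fn pool() -> &'static RwLock<HashSet<Arc<str>>> {
    POOL.get_or_init(|| RwLock::new(HashSet::new()))
}

/// Return a shared handle for `s`, reusing the stored one if it exists.
fn intern(s: &str) -> Arc<str> {
    {
        // Fast path: many threads may hold the read lock at once.
        let set = pool().read().unwrap();
        if let Some(existing) = set.get(s) {
            return Arc::clone(existing);
        }
    } // read lock is released here, before the write lock is taken

    // Slow path: exclusive lock. Check again, because another thread
    // may have inserted the same string in the meantime.
    let mut set = pool().write().unwrap();
    if let Some(existing) = set.get(s) {
        return Arc::clone(existing);
    }
    let fresh: Arc<str> = Arc::from(s);
    set.insert(Arc::clone(&fresh));
    fresh
}
```

This sketch never removes entries, so the pool only grows. Removing an entry when the last outside handle is dropped needs extra care and is left out here.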

Just a note that when you use unsafe, you're telling the compiler "you trust me: I'll guarantee that all of Rust's validity requirements are upheld." It won't warn you because, basically, you told it not to.

If you're just starting out, you haven't learned all of the requirements -- they are quite tricky even for experts -- so there's a high likelihood that using unsafe will result in unsound code. If you're trying to do something that requires unsafe, it's time to go read up on why it requires unsafe -- what conditions you're promising to uphold.

I have no experience with multi-threading, but thank you! I should test the crate to see if it works across threads, but I've switched to what you recommended.

I'm now doing this to access the interned strings:

let mut p1 = INTERNED.lock().unwrap();
let p1 = p1.get_mut().unwrap();
// p1: &mut HashMap

This is used in fn intern() and fn drop() (for Arc<StringRepr0>). I wanted to have a separate function to retrieve INTERNED, but I keep getting a lifetime error, so I just dumped these 2 lines in each function.

I want to block every other thread during this operation. Am I doing the right thing here, or could a panic occur when two or more threads perform this operation at the same time?

The .unwrap() after lock will only panic if another thread panicked while holding the lock, so if that can't happen it's fine. The .get_mut().unwrap() is useless, however: you don't need any RwLock inside a Mutex. Either remove the Mutex or all the RwLocks.
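On the lifetime error with a separate accessor function: one way that usually works is to return the lock guard itself, with a 'static lifetime, rather than a reference to the map. The sketch below assumes a Mutex-only design (no inner RwLock). The map's key and value types, and the `interned`, `intern` and `forget` functions, are made up for illustration since the real definitions aren't shown in this thread.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, MutexGuard, OnceLock};

// Hypothetical table; the real key/value types are not shown in the thread.
static INTERNED: OnceLock<Mutex<HashMap<String, Arc<str>>>> = OnceLock::new();

/// Lock the table and hand back the guard. Because INTERNED is a static,
/// the guard can carry the 'static lifetime. The lock is held until the
/// guard is dropped.
fn interned() -> MutexGuard<'static, HashMap<String, Arc<str>>> {
    INTERNED
        .get_or_init(|| Mutex::new(HashMap::new()))
        .lock()
        // Panics only if another thread panicked while holding the lock.
        .unwrap()
}

fn intern(s: &str) -> Arc<str> {
    let mut map = interned();
    let entry = map.entry(s.to_owned()).or_insert_with(|| Arc::from(s));
    entry.clone()
}

fn forget(s: &str) {
    // The guard is a temporary here, so the lock is released
    // at the end of this statement.
    interned().remove(s);
}
```

One thing to watch for: std's Mutex is not reentrant, so calling interned() again on the same thread while an earlier guard is still alive will deadlock or panic.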

I've moved the type out of my previous crate rust-fb. It's scalar-value-string on crates.io (docs). The goal of this type is now to be used when implementing runtimes for scripting languages like Python.

Strings in my language (Violent ES) would also consist of Scalar Values, but I'm not working on it now.