Frank's Rust String Class


#1

Hello fellow Rust-users,
Since I find the built-in String types a royal pain in the backside, I created my own String class. It surely does not make sense to convert back and forth between UTF-8 and UTF-32. After all, this language is supposed to be efficient, if I am not mistaken.

Please feel free to download from

http://frankgerlach.d-n-s.name/RustStringClass.txt

Also feel free to modify and use in any way you want to. Needless to say it is free of charge and no patents have been filed (if such an insane thing would be possible).

Frank Gerlach
Tailfingen
Germany


#2

… but why would you want to use UTF-32, anyway?


#3

Well, it seems to be the canonical character representation.

What would you suggest instead?


#4

Also, could I parameterize the character type (like C++ template classes)?


#5

char isn’t a character; it’s a mis-named code point. Using UTF-32 doesn’t gain you anything; it’s still a variable-width encoding of text, so you still can’t do O(1) indexing.

It just makes almost every string take more memory and tricks people into thinking they don’t have to worry about composite characters.
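
To make the point about code points concrete, here is a minimal sketch using only the standard library. The helper name `sizes` is made up for illustration. It reports how many bytes a text needs as UTF-8, how many code points it contains, and how many bytes the same code points would need at four bytes each.

```rust
/// Hypothetical helper: returns (UTF-8 bytes, code points, bytes at 4 bytes per code point).
fn sizes(text: &str) -> (usize, usize, usize) {
    let code_points = text.chars().count();
    (
        text.len(),                               // length in bytes, as stored by String/&str
        code_points,                              // number of `char`s (code points)
        code_points * std::mem::size_of::<char>(), // size if every code point took 4 bytes
    )
}
```

For `"e\u{301}"` (an "e" followed by a combining acute accent) this yields two code points for what a reader sees as one character, so indexing by code point still does not index by character.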

Yes, although I’m still not sure what the point would be. UTF-8 is about the only sane encoding for text, outside of specialised domains.


#6

The Rust String will be more efficient than yours for a lot of use cases too, especially when you input or output UTF-8.

Very brief notes on the code:

  • Vec::truncate doesn’t release any memory, use shrink_to_fit if you want that.
  • It’s very uncommon to have an API in a language other than English, but I like it; it’s funny, and I’m all for using your native language when you want to, so I don’t see a problem with it.
  • Use Rust’s #[test] feature to easily add tests! Cargo makes it easy to run them, too.
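
A small sketch of the first and last notes, with illustrative names only: `capacities` records the capacity of a `Vec` after `truncate` and again after `shrink_to_fit`, and the `#[test]` function is the kind of check that `cargo test` picks up.

```rust
/// Hypothetical helper: returns (length, capacity after truncate, capacity after shrink_to_fit).
fn capacities() -> (usize, usize, usize) {
    let mut v: Vec<u8> = Vec::with_capacity(100);
    v.extend_from_slice(b"hello world");

    v.truncate(5); // drops elements, keeps the allocation
    let after_truncate = v.capacity();

    v.shrink_to_fit(); // asks the allocator to release the unused space
    (v.len(), after_truncate, v.capacity())
}

#[test]
fn truncate_keeps_capacity() {
    let (len, after_truncate, after_shrink) = capacities();
    assert_eq!(len, 5);
    assert!(after_truncate >= 100);
    assert!(after_shrink <= after_truncate);
}
```

The exact capacity after `shrink_to_fit` is up to the allocator, so the test only checks that it did not grow.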

#7

I am in the process of building an HTTP server. ASCII would suffice here.

And the built-in String stuff is really bad when trying to write a CLEAN lexer.
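
For comparison, a rough sketch of how the first line of an HTTP request could be split with the built-in `&str` alone. The function name `parse_request_line` and its checks are assumptions for illustration, not a complete or standards-conforming parser.

```rust
/// Hypothetical helper: splits "METHOD TARGET VERSION" into three borrowed slices.
/// No data is copied; each part points into `line`.
fn parse_request_line(line: &str) -> Option<(&str, &str, &str)> {
    // Strip a trailing CR/LF if present.
    let line = line.trim_end_matches(|c: char| c == '\r' || c == '\n');
    let mut parts = line.splitn(3, ' ');
    let method = parts.next()?;
    let target = parts.next()?;
    let version = parts.next()?;
    if method.is_empty() || target.is_empty() || !version.starts_with("HTTP/") {
        return None;
    }
    Some((method, target, version))
}
```

A call such as `parse_request_line("GET /index.html HTTP/1.1\r\n")` returns the three parts as slices of the input.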

Maybe “I am holding it wrong”, but Rust surely needs a String class on par with Java or STL.


#8

Thanks for this feedback, I will change the code accordingly.


#9

I thought HTTP used latin-1 or something, not ASCII. In either case, this means your string is going to use four times more memory than is necessary.

How so? I’ve found that Rust’s string type really only lacks a few niceties, notably popping a grapheme from the front. That was why I wrote the strcursor crate.

Again, what is it missing? Just don’t say “getting char at an index” because that’s almost never a meaningful or reasonable thing to do.

And Java’s a bad example: last I checked, it used UTF-16, and its char isn’t even a complete code point! I don’t know if it checks for unpaired surrogates…


#10

Yeah, maybe it is just a matter of “Java habit” and “C++ habit”…


#11

Rust’s String type is by far the most conveniently correct string abstraction I’ve ever used.


#12

Well, in the real world (automotive industry) we do funny things with strings. For example, the FIN (or VIN), which identifies your car.

In some cases it is useful to access this string in a random-access manner.

E.g. I only want the product line “222”, then I might not be interested in reading the first three characters, because I already know the entire data set belongs to my company…

Or many other funny encodings inside the FIN.

Copied from somewhere:
e.g. WDB2110061A123456

WDB211 : Baureihe (model series) 211 (new E-Class)
006 : Motornummer (engine number)
1 : Linkslenker (left-hand drive; 2 = Rechtslenker, right-hand drive)
A : Herstellungsplatz (production site; A is Sindelfingen)
123456 : Folgenummer (serial number)


#13

Or in a language parser - the Rust way of comparing a string with a literal seems to be rather costly, because of the conversion into a Vec.

But as I say, maybe I haven’t yet got a good documentation and I simply “hold it wrong”…


#14

Then use &s[3..]. But this doesn’t sound like Unicode text; it sounds like mixed text and structured data. You should probably be processing this as an array of u8s or something. Or an array of Ascii (there’s a crate for this floating around somewhere, I think). Or decoding into a structured type.
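
A minimal sketch of both suggestions, assuming the FIN is ASCII and 17 characters long. The function names are invented for this example.

```rust
/// Hypothetical helper: drops the 3-character manufacturer prefix of an ASCII FIN.
/// Returns None instead of panicking if the input has the wrong shape.
fn without_prefix(fin: &str) -> Option<&str> {
    if fin.is_ascii() && fin.len() == 17 {
        Some(&fin[3..])
    } else {
        None
    }
}

/// Hypothetical helper: the same idea on raw bytes, picking out positions 3..6.
fn baureihe_bytes(fin: &[u8]) -> Option<&[u8]> {
    fin.get(3..6)
}
```

Both functions return views into the original data, so random access into an ASCII record costs only a bounds check.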

It sounds very much like you’re using the wrong type for the job, then complaining that the type is wrong.

This is just completely false.

For your own sake, you should make sure you understand the Rust standard library before you expend a huge effort re-implementing it to solve non-existent problems. :slight_smile:


#15

Comparing a string with a literal or a Vec<u8> with a byte string literal should absolutely not be costly. I usually think that our String, for example, might be a little bit harder to use exactly because we care so much about providing low-overhead operations.

Comparing a string to a literal involves no conversion into vec or any costly operation except for the string data comparison itself:

let s = "hallo".to_string();
let is_equal = s == "hallo";

The only conversion happening here is that s is converted from String to &str, which is zero-cost in practice. &str is a lightweight string view type that allows simple, low-overhead operations on substrings. Unlike in C, we can create a substring easily from just a start location and a length, because we don’t use string terminators (the zero byte in C).
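
The view behaviour can be checked directly. In this sketch (helper names are illustrative) a substring is taken by slicing, and a pointer comparison confirms that it lies inside the original buffer rather than in a copy.

```rust
/// Hypothetical helper: the product-line part of an ASCII FIN, as a borrowed view.
fn product_line(fin: &str) -> &str {
    &fin[3..6]
}

/// Hypothetical helper: true if `part` points into the memory owned by `whole`.
fn shares_buffer(whole: &str, part: &str) -> bool {
    let start = whole.as_ptr() as usize;
    let p = part.as_ptr() as usize;
    p >= start && p + part.len() <= start + whole.len()
}
```

A `&str` is just a pointer plus a length, so `std::mem::size_of::<&str>()` equals two machine words.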


#16

Ok, thanks for educating me :smile:

One cannot shake off 20 years of C, Pascal and Java easily…


#17

Don’t worry about it; every language I’ve ever used (including Rust) perpetuates at least some misconceptions about how text works. It’s a minor miracle anything involving text works at all.


#18

Well, I think I effectively did what you said: I created a random-access array type of some sort :smile:

Certainly for HTTP an octet would suffice and UTF-32 is excessive.

The FIN actually uses even fewer characters than ASCII and can be compressed even more (which we did in order to squeeze more data sets into RAM).
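
One way such a compression could look, purely as an illustration and not the scheme described above: assuming the usual VIN alphabet of digits and upper-case letters without I, O and Q (33 symbols), a 17-character FIN can be read as a base-33 number, which stays below 2^86 and therefore fits in a `u128`. The names `ALPHABET`, `pack` and `unpack` are made up for this sketch.

```rust
/// Assumed alphabet: digits plus upper-case letters except I, O and Q.
const ALPHABET: &[u8; 33] = b"0123456789ABCDEFGHJKLMNPRSTUVWXYZ";

/// Hypothetical helper: packs a 17-character FIN into one integer (base 33).
fn pack(fin: &str) -> Option<u128> {
    if fin.len() != 17 {
        return None;
    }
    let mut acc: u128 = 0;
    for b in fin.bytes() {
        // Position of the byte in the alphabet; None for any other byte.
        let idx = ALPHABET.iter().position(|&c| c == b)?;
        acc = acc * 33 + idx as u128;
    }
    Some(acc)
}

/// Hypothetical helper: inverse of `pack`.
fn unpack(mut packed: u128) -> String {
    let mut buf = [0u8; 17];
    // Fill from the last character backwards, one base-33 digit at a time.
    for slot in buf.iter_mut().rev() {
        *slot = ALPHABET[(packed % 33) as usize];
        packed /= 33;
    }
    String::from_utf8(buf.to_vec()).expect("alphabet is ASCII")
}
```

Storing only the low 11 bytes of the result would bring a record down from 17 bytes to 11. Whether that is worth the extra decoding work depends on the data set.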


#19

Hey Frank, I used your FIN example for my own learning session.

I guess you will love the result:

#[derive(Debug)] // so that println!("{:?}", fin); works
#[allow(non_snake_case)] // otherwise the compiler complains about the capitalized names
struct FIN {
    pub Firma: [u8; 3],
    pub Baureihe: u16,
    pub Motornummer: u8,
    pub Linkslenker: bool, // (false = Rechtslenker)
    pub Herstellungsplatz: char,
    pub Folgenummer: u32,
}

#[allow(non_snake_case)]
#[allow(dead_code)]
impl FIN {
    fn from_str(fin: &str) -> FIN {
        fn sub_str<'a>(str: &'a str, offset: &mut u8, size: u8) -> &'a str {
            let start = *offset;
            *offset += size;
            &str[start as usize .. *offset as usize]
        }

        use std::str::FromStr;
        
        let mut offset = 0;

        let firma: [u8; 3] = match fin.len() {
            17 => {
//                fin.bytes().take(3).into_array()
                offset = 3;

                let mut b = fin.bytes();
                [b.next().unwrap(),
                 b.next().unwrap(),
                 b.next().unwrap()]
            }
            14 => {
                [87, 66, 68]  // "WBD"
            }
            _ => panic!("Ungültiger FIN '{}'", fin),
        };

        // collections::slice::SliceConcatExt => unstable
        // let s = fin.chars().take(3).concat().parse::<u16>();
        
        FIN {
            Firma: firma,
            Baureihe: FromStr::from_str(sub_str(fin, &mut offset, 3)).unwrap(),
            Motornummer: FromStr::from_str(sub_str(fin, &mut offset, 3)).unwrap(),
            Linkslenker: sub_str(fin, &mut offset, 1).chars().next().unwrap() == '1',
            Herstellungsplatz: sub_str(fin, &mut offset, 1).chars().next().unwrap(),
            Folgenummer: FromStr::from_str(sub_str(fin, &mut offset, 6)).unwrap(),
        }
    }
    
    fn als_Rechtslenker(&self) -> FIN {
        FIN {Linkslenker: false, .. *self}
    }

    fn to_string(&self) -> String {
        // to_ascii_uppercase is an inherent slice method; the deprecated AsciiExt import is not needed.
        let mut buf = String::with_capacity(17);
        let firma = String::from_utf8(self.Firma.to_ascii_uppercase()).unwrap();
        buf.push_str(&format!("{}{:03}{:03}", firma, self.Baureihe, self.Motornummer));
        buf.push(if self.Linkslenker {'1'} else {'2'});
        // Zero-pad the serial number so the result always has 17 characters.
        buf.push_str(&format!("{}{:06}", self.Herstellungsplatz, self.Folgenummer));
        buf
    }
}

use std::fmt;
use std::collections::HashMap;

impl fmt::Display for FIN {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        let hp = { // production sites (Herstellungsplätze)
            let mut p: HashMap<char, &'static str> = HashMap::new();
            p.insert('A', "Sindelfingen");
            p.insert('B', "Buxtehude");
            p
        };

        // Propagate write errors with `?` instead of silencing the unused-result warning.
        f.write_str(if self.Linkslenker { "Lenker: links" } else { "Lenker: für Engländer ;)" })?;
        write!(f, "\nHerstellungsplatz: {}",
               hp.get(&self.Herstellungsplatz).unwrap_or(&"Unbekannt"))?;

        fmt::Display::fmt("\n", f)
    }
}


#[test]
fn test_it() {
    let fin = FIN {
        Firma: b"wbd".to_owned(),
        Baureihe: 211,
        Motornummer: 6,
        Linkslenker: true,
        Herstellungsplatz: 'A',
        Folgenummer: 123456,
    };

    assert_eq!("WBD2110061A123456", fin.to_string());
    assert_eq!("WBD2120061B123789", FIN::from_str("WBD2120061B123789").to_string());
}

fn main() {
    let fin1 = FIN {
        Firma: b"WBD".to_owned(),
        Baureihe: 211,
        Motornummer: 6,
        Linkslenker: true,
        Herstellungsplatz: 'A',
        Folgenummer: 123456,
    };
    
    let fin2 = FIN {Herstellungsplatz: 'C', Folgenummer: fin1.Folgenummer + 1, .. fin1};
    
    println!("{}:", fin1.to_string());
    println!("{}", fin1);
    println!("{}", fin1.als_Rechtslenker());

    println!("\n{}:", fin2.to_string());
    println!("{:?}", fin2);
    println!("{}", fin2);

    println!("\n{:?}", FIN::from_str("2220032B234567"));
}
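
A possible variation on the code above, sketched under assumptions: implementing the standard `FromStr` trait and returning a `Result` lets callers write `"…".parse()` and handle malformed input without a panic. The types `ParsedFin` and `FinError` are hypothetical and simplified. They are not part of the code above.

```rust
use std::str::FromStr;

/// Hypothetical, simplified counterpart of the FIN struct (manufacturer prefix omitted).
#[derive(Debug, PartialEq)]
struct ParsedFin {
    baureihe: u16,
    motornummer: u16,
    linkslenker: bool,
    herstellungsplatz: char,
    folgenummer: u32,
}

/// Hypothetical error type; the variants are illustrative only.
#[derive(Debug, PartialEq)]
enum FinError {
    NotAscii,
    BadLength(usize),
    BadField(&'static str),
}

impl FromStr for ParsedFin {
    type Err = FinError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Byte offsets equal character offsets only for ASCII input.
        if !s.is_ascii() {
            return Err(FinError::NotAscii);
        }
        // 17 characters: with manufacturer prefix; 14 characters: without it.
        let body = match s.len() {
            17 => &s[3..],
            14 => s,
            n => return Err(FinError::BadLength(n)),
        };
        let linkslenker = match &body[6..7] {
            "1" => true,
            "2" => false,
            _ => return Err(FinError::BadField("Linkslenker")),
        };
        Ok(ParsedFin {
            baureihe: body[0..3]
                .parse::<u16>()
                .map_err(|_| FinError::BadField("Baureihe"))?,
            motornummer: body[3..6]
                .parse::<u16>()
                .map_err(|_| FinError::BadField("Motornummer"))?,
            linkslenker,
            herstellungsplatz: body.as_bytes()[7] as char,
            folgenummer: body[8..14]
                .parse::<u32>()
                .map_err(|_| FinError::BadField("Folgenummer"))?,
        })
    }
}
```

With this in place, `"WDB2110061A123456".parse::<ParsedFin>()` yields `Ok(..)`, while a string of the wrong length yields `Err(FinError::BadLength(..))`.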

#20

Hi Andy,
it seems you are quite deep inside Rust :smile:

I currently don’t have the time to look into your code, but thanks nevertheless.

I will have a detailed look at it later.