UTF-8 strings with a maximum defined length in bytes

I am currently writing Rust code against the Gemini specification, which defines one particular field like this:

<META> is a UTF-8 encoded string of maximum length 1024 bytes, whose meaning is <STATUS> dependent.

Of course, as the transmitter of this value I can manually perform a series of steps:

  • Obtain the intended value
  • Assert it is valid UTF-8
  • Assert it is <= 1024 bytes
  • Place it in a struct for handling, or make my new() implementation error out if these constraints are not met
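The manual steps above could be sketched roughly as follows. This is only an illustration: `MAX_META_BYTES`, `MetaError` and `validate_meta` are made-up names, not anything from the specification or an existing crate.

```rust
/// Maximum <META> length in bytes, taken from the spec text quoted above.
const MAX_META_BYTES: usize = 1024;

/// Why a candidate value was rejected (hypothetical error type).
#[derive(Debug, PartialEq)]
enum MetaError {
    InvalidUtf8,
    TooLong { len: usize },
}

/// Checks that `raw` is valid UTF-8 and at most `MAX_META_BYTES` bytes long.
fn validate_meta(raw: &[u8]) -> Result<&str, MetaError> {
    // `str::from_utf8` validates the encoding without copying the data.
    let text = std::str::from_utf8(raw).map_err(|_| MetaError::InvalidUtf8)?;
    // `str::len` counts bytes, not characters, which is what the spec limits.
    if text.len() > MAX_META_BYTES {
        return Err(MetaError::TooLong { len: text.len() });
    }
    Ok(text)
}
```

Note that a string of 513 two-byte characters such as `é` is rejected here even though it has far fewer than 1024 characters, because the check is on the byte length.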

But now it seems the reins are loosened: nothing in the type system guarantees that the field still satisfies the <= 1024 bytes property after the struct is created. It is now just a Vec<u8> or a String.

In the context of my application this is somewhat pedantic, but now I'm curious—is there any convention for how to enforce maximum-length strings in types, other than this "outer" check? Particularly for a restriction based on UTF-8 bytes rather than characters?

I've searched both the forum and crates and haven't managed to pull up anything relevant.

I don’t know of anything pre-written, but you could define a wrapper struct like this:

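use std::iter::FromIterator; // already in the prelude since the 2021 edition
use std::ops::Deref;
use std::str::FromStr;

struct MetaString {
    len: usize,       // number of bytes of `data` currently in use
    data: [u8; 1024], // inline storage; only `data[..len]` is meaningful
}

impl Deref for MetaString {
    type Target = str;
    fn deref(&self) -> &str { /* ... */ }
}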

impl FromStr for MetaString { /* ... */ }

// These will have to `panic` on overflow, so you’ll want to
// provide non-trait versions that return `Result` as well:
impl FromIterator<char> for MetaString { /* ... */ }
impl Extend<char> for MetaString { /* ... */ }
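For illustration, here is one way the elided bodies might be filled in. It is a sketch, not the poster's actual code: `CAP`, `CapacityError` and `try_push` are hypothetical names, `try_push` stands in for the `Result`-returning non-trait version mentioned in the comment, and `FromIterator` is left out to keep it short.

```rust
use std::ops::Deref;
use std::str::FromStr;

/// Capacity in bytes (assumed name for the 1024-byte limit).
const CAP: usize = 1024;

/// Inline buffer holding at most `CAP` bytes of UTF-8.
pub struct MetaString {
    len: usize,
    data: [u8; CAP],
}

/// Returned when a value would exceed the capacity (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct CapacityError;

impl MetaString {
    pub fn new() -> Self {
        MetaString { len: 0, data: [0; CAP] }
    }

    /// Fallible counterpart to `Extend<char>`: appends one character, or
    /// leaves the value unchanged and returns an error if it does not fit.
    pub fn try_push(&mut self, c: char) -> Result<(), CapacityError> {
        let mut buf = [0u8; 4];
        let encoded = c.encode_utf8(&mut buf).as_bytes();
        let end = self.len + encoded.len();
        if end > CAP {
            return Err(CapacityError);
        }
        self.data[self.len..end].copy_from_slice(encoded);
        self.len = end;
        Ok(())
    }
}

impl Deref for MetaString {
    type Target = str;
    fn deref(&self) -> &str {
        // Only complete UTF-8 sequences are ever written to `data[..len]`,
        // so this check is not expected to fail.
        std::str::from_utf8(&self.data[..self.len]).expect("buffer holds valid UTF-8")
    }
}

impl FromStr for MetaString {
    type Err = CapacityError;
    fn from_str(s: &str) -> Result<Self, CapacityError> {
        if s.len() > CAP {
            return Err(CapacityError);
        }
        let mut out = MetaString::new();
        out.data[..s.len()].copy_from_slice(s.as_bytes());
        out.len = s.len();
        Ok(out)
    }
}

impl Extend<char> for MetaString {
    /// Panics on overflow, because the trait gives no way to return an error.
    fn extend<I: IntoIterator<Item = char>>(&mut self, iter: I) {
        for c in iter {
            self.try_push(c).expect("MetaString capacity exceeded");
        }
    }
}
```

Because characters are pushed whole, a multi-byte character that does not fit is rejected rather than truncated, so the buffer never ends in a partial UTF-8 sequence. The trade-off of the inline array is that every value occupies the full 1024 bytes plus the length, however short the string is.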

You can newtype the String, and then never provide mutable access inside the new type.

mod meta {
    pub struct GeminiMeta(String);


    impl GeminiMeta {
        pub fn new(meta: String) -> Result<Self, String> {
            // The spec allows up to and including 1024 bytes; `len()` counts bytes.
            if meta.len() <= 1024 {
                Ok(Self(meta))
            } else {
                Err(meta)
            }
        }

        pub fn get(&self) -> &str {
            &self.0
        }
        
        pub fn into_inner(self) -> String {
            self.0
        }
    }
}

Outside this module, a GeminiMeta is guaranteed to be at most 1024 bytes long, because the only way to construct one is through new() and nothing hands out mutable access to the inner String. You can rely on that.

This is basically how String works in the standard library. It is a newtype around Vec<u8> that uses privacy to guarantee that the contents are valid UTF-8. It then relies on this guarantee internally for optimizations using unsafe code.

You could also implement Deref (with Target = str, and without DerefMut) instead of providing get.
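A minimal sketch of that Deref variant, assuming the same newtype as above with the limit read as "at most 1024 bytes". The helper `is_gemtext` is a made-up function, included only to show auto-deref at a call site.

```rust
mod meta {
    use std::ops::Deref;

    pub struct GeminiMeta(String);

    impl GeminiMeta {
        pub fn new(meta: String) -> Result<Self, String> {
            if meta.len() <= 1024 {
                Ok(Self(meta))
            } else {
                Err(meta)
            }
        }
    }

    // Read-only access: `Deref` without `DerefMut` exposes the `&str`
    // methods but gives callers no way to grow the string.
    impl Deref for GeminiMeta {
        type Target = str;
        fn deref(&self) -> &str {
            &self.0
        }
    }
}

use meta::GeminiMeta;

/// Any `&str` method can be called on the wrapper through auto-deref.
fn is_gemtext(meta: &GeminiMeta) -> bool {
    meta.starts_with("text/gemini")
}
```

With this in place, a `&GeminiMeta` can also be passed wherever a `&str` is expected via deref coercion, so an explicit `get()` is rarely needed.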


If no mutation is necessary, you can use Box<str> instead of String. That drops the capacity field, which saves one usize per value (8 bytes on 64-bit targets).
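As a rough sketch of the Box<str> variant: `BoxedMeta` and `saved_bytes` are hypothetical names, and the size difference shown is what current standard library layouts give in practice rather than a documented guarantee.

```rust
use std::mem::size_of;

/// Immutable, length-checked wrapper storing only a pointer and a length.
pub struct BoxedMeta(Box<str>);

impl BoxedMeta {
    pub fn new(meta: String) -> Result<Self, String> {
        if meta.len() <= 1024 {
            // `into_boxed_str` discards any excess capacity.
            Ok(Self(meta.into_boxed_str()))
        } else {
            Err(meta)
        }
    }

    pub fn get(&self) -> &str {
        &self.0
    }
}

/// Difference in inline size between `String` and `Box<str>`.
fn saved_bytes() -> usize {
    size_of::<String>() - size_of::<Box<str>>()
}
```

The conversion may reallocate if the String had spare capacity, so this mainly pays off when many values are kept around for a long time.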