Checking whether the String contains a Character Other than the Specific Characters

I have the following code:

#[derive(Eq, PartialEq, PartialOrd, Ord, Clone, Hash, Debug)]
pub struct Id(String);

impl TryFrom<&str> for Id {
    type Error = Error;

    fn try_from(value: &str) -> Result<Self, Self::Error> {
        let mut value_i = value.chars();

        if match value_i.next() {
            Some(some) => match some {
                '_' | 'A'..='Z' | 'a'..='z' => true,
                '0'..='9' | _ => false,
            },
            None => return Err(Error::EmptyId),
        } {
            value_i.next();

            while let Some(c) = value_i.next() {
                match c {
                    '_' | '0'..='9' | 'A'..='Z' | 'a'..='z' => continue,
                    _ => return Err(Error::BadId(String::from(value))),
                }
            }
        }

        Ok(Id(String::from(value)))
    }
}

And this code looks horrible to me.

It does work correctly, but I feel like there is a better way to achieve what the function does. I would like to also ask about naming. Should I rename the structure type Id to Ident or Identifier, or is it fine?

If what the function does is not clear, it reads the given string value and returns an error if it contains a character other than '_' | '0'..='9' | 'A'..='Z' | 'a'..='z', and it also checks whether the string starts with a number.

I would use a regular expression in this case.

use {
    once_cell::sync::Lazy,
    regex::Regex,
};

pub enum Error {
    EmptyId,
    BadId(String)
}

#[derive(Eq, PartialEq, PartialOrd, Ord, Clone, Hash, Debug)]
pub struct Id(String);

impl TryFrom<&str> for Id {
    type Error = Error;

    fn try_from(value: &str) -> Result<Self, Self::Error> {
        static RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"^[_a-zA-Z][_a-zA-Z0-9]+$").unwrap());
        
        let trimmed_value = value.trim();
        
        if trimmed_value.is_empty() {
            return Err(Error::EmptyId)
        }
        
        if RE.is_match(trimmed_value) {
          Ok(Id(String::from(trimmed_value)))          
        } else {
          Err(Error::BadId(String::from(trimmed_value)))
        }
    }
}

How is the performance of your solution compared to the original? I would be really surprised when there is not an order of magnitude difference?

For the original, I think there are functions like is_ascii_alphabetic() and is_ascii_alphanumeric() which would make it simpler. Or perhaps convert the char to upper case first, so the test 'A'..='Z' is sufficient. But I guess due to Unicode that is not fast -- for plain ASCII it would be fast of course.

I honestly wouldn't care about the performance of such a trivial thing.

Actually I would try to do all the character comparisons with plain ASCII. I would assume that we could convert the original Unicode to Ascii, and then make a faster evaluation.

Why is RE static?

The performance is indeed important.

1 Like

I wonder if you are missing an else on the outer if? It seems like starting with a digit is a fast track to acceptance: "0!@#$%^%" is accepted.

This is what I've used for a simiar style test:

fn is_char_machine_type_allowed(ch: char) -> bool {
    matches!(ch, '0'..='9' | 'a'..='z' | '-' | '+')
}

fn valid_machine_type(mtype: &str) -> Option<String> {
    let mut chars = mtype.chars();
    let ch = chars.next()?; // non-empty!
    if !ch.is_ascii_alphanumeric() { return None; }
    if !mtype.chars().all(is_char_machine_type_allowed) { return None; }
    return Some(mtype.to_owned());
}
#[cfg(test)]
mod tests {
    mod id_tests {
        use super::super::Id;

        #[test]
        fn test_try_from() {
            assert!(
                Id::try_from("_0Aa1Bb2Cc3Dd4Ee5Ff6Gg7Hh8Ii9JjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz")
                    .is_ok()
            );
            assert!(Id::try_from(
                "_0 A!a\"1#B$b%2&C'c(3)D*d+4,E-e.5/Ff6Gg7Hh8Ii9JjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz",
            ).is_err());
        }
    }
}

Is how the function is tested, and it seems to work fine. Where would I need an else statement?

This test fails.

assert!(Id::try_from("0!@#$%^%").is_err());

Avoids performance pitfalls with having to reiniitalize the regex in every iteration. See the relevant documentation

This is the result of an AI benchmark. The second version was an AI improved version of the original, and in the last it used ASCII comparison:

 Running benches/id_bench.rs (target/release/deps/id_bench-3970369e494f111b)

Gnuplot not found, using plotters backend
original_id time: [28.337 ns 28.451 ns 28.572 ns]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild

idiomatic_id time: [23.265 ns 23.309 ns 23.361 ns]
Found 8 outliers among 100 measurements (8.00%)
6 (6.00%) high mild
2 (2.00%) high severe

ascii_id time: [19.232 ns 19.410 ns 19.602 ns]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high severe

It really does so. Why though?

I would think that the second third _ => return Err(Error::BadId(String::from(value))), in:

match c {
    '_' | '0'..='9' | 'A'..='Z' | 'a'..='z' => continue,
    _ => return Err(Error::BadId(String::from(value))),
}

Would prevent that.

Thanks.

Do you mean doing value.bytes() instead of value.chars()?

After I asked if a plain ASCII comparison would be faster, the AI generated somethink with "let bytes = value.as_bytes();"

True. That could be faster.

Here is the complete benchmark output including the regex version from above. The regex version is actually surprisingly fast:

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/id_bench.rs (target/release/deps/id_bench-508a63bae12acbf8)
Gnuplot not found, using plotters backend
original_id             time:   [26.521 ns 26.554 ns 26.589 ns]
                        change: [-7.0437% -6.6336% -6.2226%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

idiomatic_id            time:   [19.639 ns 19.690 ns 19.747 ns]
                        change: [-15.343% -15.013% -14.678%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

ascii_id                time:   [18.166 ns 18.255 ns 18.348 ns]
                        change: [-7.7957% -6.7476% -5.7584%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

regex_id                time:   [48.737 ns 48.851 ns 48.966 ns]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

A tip: You can use backticks to quote code and output like so:

```
Code here
```

This would make the benchmark output more readable.

Especially on mobile with lots of line breaks your existing post is very unreadable.

OK, I tried. Personally for me, the output was fine, and now as code, I have a medium size view with horizontal and vertical scroll bars. That is, long lines, and bottom of code, needs scrolling.