Checking whether the String contains a Character Other than the Specific Characters

anon53519680 · June 15, 2025, 12:13pm

I have the following code:

#[derive(Eq, PartialEq, PartialOrd, Ord, Clone, Hash, Debug)]
pub struct Id(String);

impl TryFrom<&str> for Id {
    type Error = Error;

    fn try_from(value: &str) -> Result<Self, Self::Error> {
        let mut value_i = value.chars();

        if match value_i.next() {
            Some(some) => match some {
                '_' | 'A'..='Z' | 'a'..='z' => true,
                '0'..='9' | _ => false,
            },
            None => return Err(Error::EmptyId),
        } {
            value_i.next();

            while let Some(c) = value_i.next() {
                match c {
                    '_' | '0'..='9' | 'A'..='Z' | 'a'..='z' => continue,
                    _ => return Err(Error::BadId(String::from(value))),
                }
            }
        }

        Ok(Id(String::from(value)))
    }
}

And this code looks horrible to me.

It does work correctly, but I feel like there is a better way to achieve what the function does. I would like to also ask about naming. Should I rename the structure type Id to Ident or Identifier, or is it fine?

If what the function does is not clear, it reads the given string value and returns an error if it contains a character other than '_' | '0'..='9' | 'A'..='Z' | 'a'..='z', and it also checks whether the string starts with a number.

firebits.io · June 15, 2025, 12:19pm

I would use a regular expression in this case.

use {
    once_cell::sync::Lazy,
    regex::Regex,
};

pub enum Error {
    EmptyId,
    BadId(String)
}

#[derive(Eq, PartialEq, PartialOrd, Ord, Clone, Hash, Debug)]
pub struct Id(String);

impl TryFrom<&str> for Id {
    type Error = Error;

    fn try_from(value: &str) -> Result<Self, Self::Error> {
        static RE: Lazy<Regex> = Lazy::new(|| Regex::new(r"^[_a-zA-Z][_a-zA-Z0-9]+$").unwrap());
        
        let trimmed_value = value.trim();
        
        if trimmed_value.is_empty() {
            return Err(Error::EmptyId)
        }
        
        if RE.is_match(trimmed_value) {
          Ok(Id(String::from(trimmed_value)))          
        } else {
          Err(Error::BadId(String::from(trimmed_value)))
        }
    }
}

StefanSalewski · June 15, 2025, 12:42pm

How is the performance of your solution compared to the original? I would be really surprised when there is not an order of magnitude difference?

For the original, I think there are functions like is_ascii_alphabetic() and is_ascii_alphanumeric() which would make it simpler. Or perhaps convert the char to upper case first, so the test 'A'..='Z' is sufficient. But I guess due to Unicode that is not fast -- for plain ASCII it would be fast of course.

firebits.io · June 15, 2025, 12:45pm

I honestly wouldn't care about the performance of such a trivial thing.

StefanSalewski · June 15, 2025, 12:48pm

Actually I would try to do all the character comparisons with plain ASCII. I would assume that we could convert the original Unicode to Ascii, and then make a faster evaluation.

anon53519680 · June 15, 2025, 12:52pm

Why is RE static?

anon53519680 · June 15, 2025, 12:53pm

The performance is indeed important.

duelafn · June 15, 2025, 12:59pm

I wonder if you are missing an else on the outer if? It seems like starting with a digit is a fast track to acceptance: "0!@#$%^%" is accepted.

This is what I've used for a simiar style test:

fn is_char_machine_type_allowed(ch: char) -> bool {
    matches!(ch, '0'..='9' | 'a'..='z' | '-' | '+')
}

fn valid_machine_type(mtype: &str) -> Option<String> {
    let mut chars = mtype.chars();
    let ch = chars.next()?; // non-empty!
    if !ch.is_ascii_alphanumeric() { return None; }
    if !mtype.chars().all(is_char_machine_type_allowed) { return None; }
    return Some(mtype.to_owned());
}

anon53519680 · June 15, 2025, 1:04pm

#[cfg(test)]
mod tests {
    mod id_tests {
        use super::super::Id;

        #[test]
        fn test_try_from() {
            assert!(
                Id::try_from("_0Aa1Bb2Cc3Dd4Ee5Ff6Gg7Hh8Ii9JjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz")
                    .is_ok()
            );
            assert!(Id::try_from(
                "_0 A!a\"1#B$b%2&C'c(3)D*d+4,E-e.5/Ff6Gg7Hh8Ii9JjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz",
            ).is_err());
        }
    }
}

Is how the function is tested, and it seems to work fine. Where would I need an else statement?

duelafn · June 15, 2025, 1:06pm

This test fails.

assert!(Id::try_from("0!@#$%^%").is_err());

SkiFire13 · June 15, 2025, 1:08pm

Avoids performance pitfalls with having to reiniitalize the regex in every iteration. See the relevant documentation

StefanSalewski · June 15, 2025, 1:09pm

This is the result of an AI benchmark. The second version was an AI improved version of the original, and in the last it used ASCII comparison:

 Running benches/id_bench.rs (target/release/deps/id_bench-3970369e494f111b)

Gnuplot not found, using plotters backend
original_id time: [28.337 ns 28.451 ns 28.572 ns]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high mild

idiomatic_id time: [23.265 ns 23.309 ns 23.361 ns]
Found 8 outliers among 100 measurements (8.00%)
6 (6.00%) high mild
2 (2.00%) high severe

ascii_id time: [19.232 ns 19.410 ns 19.602 ns]
Found 2 outliers among 100 measurements (2.00%)
2 (2.00%) high severe

anon53519680 · June 15, 2025, 1:10pm

It really does so. Why though?

I would think that the second third _ => return Err(Error::BadId(String::from(value))), in:

match c {
    '_' | '0'..='9' | 'A'..='Z' | 'a'..='z' => continue,
    _ => return Err(Error::BadId(String::from(value))),
}

Would prevent that.

anon53519680 · June 15, 2025, 1:10pm

Thanks.

anon53519680 · June 15, 2025, 1:12pm

Do you mean doing value.bytes() instead of value.chars()?

StefanSalewski · June 15, 2025, 1:14pm

After I asked if a plain ASCII comparison would be faster, the AI generated somethink with "let bytes = value.as_bytes();"

anon53519680 · June 15, 2025, 1:16pm

True. That could be faster.

StefanSalewski · June 15, 2025, 1:21pm

Here is the complete benchmark output including the regex version from above. The regex version is actually surprisingly fast:

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.00s

     Running benches/id_bench.rs (target/release/deps/id_bench-508a63bae12acbf8)
Gnuplot not found, using plotters backend
original_id             time:   [26.521 ns 26.554 ns 26.589 ns]
                        change: [-7.0437% -6.6336% -6.2226%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) high mild
  6 (6.00%) high severe

idiomatic_id            time:   [19.639 ns 19.690 ns 19.747 ns]
                        change: [-15.343% -15.013% -14.678%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

ascii_id                time:   [18.166 ns 18.255 ns 18.348 ns]
                        change: [-7.7957% -6.7476% -5.7584%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

regex_id                time:   [48.737 ns 48.851 ns 48.966 ns]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

Vorpal · June 15, 2025, 1:41pm

A tip: You can use backticks to quote code and output like so:

```
Code here
```

This would make the benchmark output more readable.

Especially on mobile with lots of line breaks your existing post is very unreadable.

StefanSalewski · June 15, 2025, 1:46pm

OK, I tried. Personally for me, the output was fine, and now as code, I have a medium size view with horizontal and vertical scroll bars. That is, long lines, and bottom of code, needs scrolling.

Topic		Replies	Views
Questions about performance in regards to my generic regular expressions learning project help	13	1304	January 12, 2023
Frank's Rust String Class	31	5981	January 12, 2023
Jetscii now works with (future) stable Rust 1.27.0 announcements	12	4752	January 12, 2023
Why my rust code is 2 times slower than my JS code help	18	4227	November 10, 2019
PSA: regex got a lazy DFA. it's fast. report problems!	3	1926	January 12, 2023

Checking whether the String contains a Character Other than the Specific Characters

Related topics