String processing to match valid UUID

Hi there, I've been working on a little tool and wrote this function to check whether an input string is a valid UUID (example: 2a1c18cb-1bc6-4520-bdc3-09139558b783). I'm looking for feedback on how to make this function more concise or perform better. I've been interested in and trying to get better at using iterators and method chains for string processing, so I'm not interested in solutions that use for loops or anything like that, even if they would be better (but feel free to share them nonetheless if you think they're particularly good or interesting).

pub async fn match_uuid(string: &str) -> Option<Uuid> {
    // Check if there are exactly 5 segments. This is necessary because the `zip()` method only
    // runs for the first 5 segments and will ignore anything that comes after.
    if !(string.split('-').count() == 5) {
        return None;
    }

    // Check if segments of the UUID are the proper lengths and if they are made up of hex digits.
    let segment_lengths = [8, 4, 4, 4, 12];
    string
        .split('-')
        .zip(segment_lengths)
        .all(|(seg, len)| is_hexstring_of_len(seg, len))
        .then(|| Uuid {
            likely_genuine: is_genuine_hash(
                &string
                    .chars()
                    .filter(|c| c.is_ascii_hexdigit())
                    .collect::<String>(),
            )
            .unwrap(),
        })
}

I'm also curious if there's a neat way, or a way at all, to do the is_hexstring_of_len function as a single chain of methods, instead of needing to do the length check 'separately':

pub fn is_hexstring_of_len(string: &str, len: usize) -> bool {
    string.len() == len && string.chars().all(|c| c.is_ascii_hexdigit())
}
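
For reference, one possible way to fold the length check into the chain is to count the characters with try_fold and bail out at the first non-hex character. This is only a sketch, and the name is_hexstring_of_len_chained is made up for illustration. Unlike the version above, it cannot reject a wrong-length string without walking it.

```rust
/// Counts characters while validating them. `try_fold` short-circuits with
/// `None` at the first character that is not an ASCII hex digit.
pub fn is_hexstring_of_len_chained(string: &str, len: usize) -> bool {
    string
        .chars()
        .try_fold(0usize, |n, c| c.is_ascii_hexdigit().then_some(n + 1))
        == Some(len)
}
```

Since every accepted character is ASCII, the character count equals the byte length that string.len() would report.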

Well, one fairly obvious way is to get rid of the heap-allocated strings altogether: just convert each UUID to a u128, then operate on those instead.
Bit operations tend to be a lot faster than string operations, and a u128 is a lot more cache friendly than a String as well.
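
A minimal sketch of such a conversion, assuming the hyphens are simply skipped and not validated by position (the function name uuid_to_u128 is hypothetical, not from any library):

```rust
/// Folds the hex digits of `s` into a u128 without allocating.
/// Hyphens are skipped wherever they appear; any other non-hex character,
/// or a digit count other than 32, yields `None`.
pub fn uuid_to_u128(s: &str) -> Option<u128> {
    let (value, digits) = s
        .chars()
        .filter(|&c| c != '-')
        .try_fold((0u128, 0u32), |(acc, n), c| {
            let d = c.to_digit(16)?;
            // checked_mul returns None if shifting in another digit would overflow
            Some((acc.checked_mul(16)? | u128::from(d), n + 1))
        })?;
    (digits == 32).then_some(value)
}
```

Once the value is a u128, bit-level statistics such as value.count_ones() are available directly.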

That's a good point, I could do something like that for the is_genuine_hash function, which performs some statistics to see if the data is 'random' enough to be genuinely generated. That function is generic, though: I have multiple functions to match other hex-based string formats of different lengths (all the common SHA hashes, for example), and because of that I want it to work with an arbitrary number of bytes, so I just pass the hex string to it directly. There's also some logic which specifically uses knowledge of the hex representation of data to detect sequences that don't "look" random in hex. For example, 0xAAAAAAAAAAAA has the same number of ones and zeroes in binary, making it appear statistically random if you just look at the distribution of bits, though obviously it's not likely the result of a random generator because in hexadecimal all symbols are the same.
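
To make the 0xAAAAAAAAAAAA example concrete, here is a small sketch that compares the two views of the same data. It is not the actual is_genuine_hash logic; bit_balance and distinct_hex_symbols are hypothetical helpers made up for this illustration.

```rust
use std::collections::HashSet;

/// Returns (number of one bits, total number of bits) for a hex string,
/// or `None` if a character is not a hex digit.
pub fn bit_balance(hex: &str) -> Option<(u32, u32)> {
    hex.chars().try_fold((0, 0), |(ones, total), c| {
        let d = c.to_digit(16)?;
        Some((ones + d.count_ones(), total + 4))
    })
}

/// Counts how many different hex symbols occur, ignoring case.
pub fn distinct_hex_symbols(hex: &str) -> usize {
    hex.chars()
        .filter(|c| c.is_ascii_hexdigit())
        .map(|c| c.to_ascii_lowercase())
        .collect::<HashSet<_>>()
        .len()
}
```

For "AAAAAAAAAAAA", bit_balance reports 24 ones out of 48 bits, a perfectly even split, while distinct_hex_symbols reports a single symbol.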

I can't really replace anything else with a u128 either. An IPv6 address and a UUID can both be represented as a u128, but the point of my tool is that it can detect and differentiate between known string formats such as these, which requires working on the strings directly.
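
As a small illustration of that point, the sketch below (ipv6_to_u128 is a made-up helper name) uses the standard library's Ipv6Addr to turn an IPv6 string into a u128. An address written with the same hex digits as the example UUID produces the same integer the UUID would, so the integer alone no longer says which format the input had.

```rust
use std::net::Ipv6Addr;

/// Parses an IPv6 address and returns its 128 bits as an integer,
/// or `None` if the string is not a valid IPv6 address.
pub fn ipv6_to_u128(s: &str) -> Option<u128> {
    s.parse::<Ipv6Addr>().ok().map(u128::from)
}
```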

In the most basic sense, all I'm doing is what most people would use regex for, but manually, to get better at using Rust's string processing features.

Would you consider:

use uuid::Uuid;

pub fn match_uuid(string: &str) -> Option<Uuid> {
    Uuid::parse_str(string).ok()
}

fn main() {
    assert!(match_uuid("2a1c18cb-1bc6-4520-bdc3-09139558b783").is_some());
    assert!(match_uuid("sixteen tons of bananas").is_none());
}

In the most basic sense, all I'm doing is what most people would use regex for, but manually, to get better at using Rust's string processing features.

Ah, then probably not.

How about this?


pub fn is_hyphenated_uuid(string: &str) -> bool {
    let rest = match hexits(string, 8) {
        Some(rest) => rest,
        None => return false,
    };
    
    let rest = match hyphen(rest) {
        Some(rest) => rest,
        None => return false,
    };

    let rest = match hexits(rest, 4) {
        Some(rest) => rest,
        None => return false,
    };
    
    let rest = match hyphen(rest) {
        Some(rest) => rest,
        None => return false,
    };

    let rest = match hexits(rest, 4) {
        Some(rest) => rest,
        None => return false,
    };
    
    let rest = match hyphen(rest) {
        Some(rest) => rest,
        None => return false,
    };

    let rest = match hexits(rest, 4) {
        Some(rest) => rest,
        None => return false,
    };
    
    let rest = match hyphen(rest) {
        Some(rest) => rest,
        None => return false,
    };

    hexits(rest, 12).is_some()
}

fn hexits(string: &str, count: usize) -> Option<&str> {
    // `get` returns None instead of panicking (as `split_at` would) when the
    // string is shorter than count or count is not on a char boundary
    let prefix = string.get(..count)?;
    let rest = &string[count..];

    if prefix.chars().all(|c| c.is_ascii_hexdigit()) {
        Some(rest)
    } else {
        None
    }
}

fn hyphen(string: &str) -> Option<&str> {
    string.strip_prefix('-')
}

fn main() {
    assert!(is_hyphenated_uuid("2a1c18cb-1bc6-4520-bdc3-09139558b783"));
    assert!(!is_hyphenated_uuid("sixteen tons of bananas"));
}

This is intended to be a translation of the appropriate regular expression, avoiding the use of regexp machinery but nonetheless evaluating the string character-by-character.

The repetitive match blocks could be replaced with a macro, to make the code more readable:

macro_rules! consume {
    ($expr:expr) => {
        match $expr {
            Some(rest) => rest,
            None => return false,
        }
    }
}


pub fn is_hyphenated_uuid(string: &str) -> bool {
    let rest = consume!(hexits(string, 8));
    let rest = consume!(hyphen(rest));
    let rest = consume!(hexits(rest, 4));
    let rest = consume!(hyphen(rest));
    let rest = consume!(hexits(rest, 4));
    let rest = consume!(hyphen(rest));
    let rest = consume!(hexits(rest, 4));
    let rest = consume!(hyphen(rest));

    hexits(rest, 12).is_some()
}

fn hexits(string: &str, count: usize) -> Option<&str> {
    // `get` returns None instead of panicking (as `split_at` would) when the
    // string is shorter than count or count is not on a char boundary
    let prefix = string.get(..count)?;
    let rest = &string[count..];

    if prefix.chars().all(|c| c.is_ascii_hexdigit()) {
        Some(rest)
    } else {
        None
    }
}

fn hyphen(string: &str) -> Option<&str> {
    string.strip_prefix('-')
}

fn main() {
    assert!(is_hyphenated_uuid("2a1c18cb-1bc6-4520-bdc3-09139558b783"));
    assert!(!is_hyphenated_uuid("sixteen tons of bananas"));
}

Another variation, that makes heavier use of iterator transformations:

const UUID_PATTERN: [fn(char) -> bool; 36] = [
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hyphen,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hyphen,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hyphen,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hyphen,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
    is_hexdigit,
];

fn is_hexdigit(c: char) -> bool {
    c.is_ascii_hexdigit()
}

fn is_hyphen(c: char) -> bool {
    c == '-'
}

pub fn is_hyphenated_uuid(string: &str) -> bool {
    string.len() == UUID_PATTERN.len() && string.chars()
        .zip(UUID_PATTERN.iter())
        .all(|(char, predicate)| predicate(char))
}

fn main() {
    assert!(is_hyphenated_uuid("2a1c18cb-1bc6-4520-bdc3-09139558b783"));
    assert!(!is_hyphenated_uuid("sixteen tons of bananas"));
}

Just pass an IntoIterator<Item = char> (or u8) instead.
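
A rough sketch of what that signature could look like. The body here is a placeholder statistic (more than four distinct hex digits), not the real is_genuine_hash, and the name looks_random is hypothetical:

```rust
/// Accepts any source of chars, so callers can pass a filtered iterator
/// directly instead of collecting into a String first.
pub fn looks_random<I>(digits: I) -> bool
where
    I: IntoIterator<Item = char>,
{
    let mut seen = [false; 16];
    for c in digits {
        // Characters that are not hex digits are ignored in this sketch.
        if let Some(d) = c.to_digit(16) {
            seen[d as usize] = true;
        }
    }
    seen.iter().filter(|&&s| s).count() > 4
}
```

The call site then becomes looks_random(string.chars().filter(|c| c.is_ascii_hexdigit())), with no intermediate allocation.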

Also, drop the async, it's plain unnecessary.

The rest of the code can also be simplified (Playground):

fn is_hexstring_of_len(string: &str, len: usize) -> bool {
    string.len() == len && string.bytes().all(|b| b.is_ascii_hexdigit())
}

pub fn match_uuid(string: &str) -> Option<Uuid> {
    let mut parts = string.split('-');
    let mut bytes = [0_u8; 32];
    let mut slice = bytes.as_mut_slice();
    
    for len in [8, 4, 4, 4, 12] {
        let seg = parts.next()?;
            
        if !is_hexstring_of_len(seg, len) {
            return None;
        }

        let (chunk, rest) = slice.split_at_mut(len);
        chunk.copy_from_slice(seg.as_bytes());
        slice = rest;
    }
    
    let concat = str::from_utf8(&bytes).ok()?;
    let likely_genuine = is_genuine_hash(concat);
    
    Some(Uuid { likely_genuine })
}

Suggestion.

         slice = rest;
     }
+
+    // Forbid trailing junk
+    let None = parts.next() else { return None };
 
     let concat = str::from_utf8(&bytes).ok()?;

Yes! That's definitely something I had in mind while starting out, but then flip-flopping between iterator combinators and a procedural for knocked it right out of my head.

When I originally wrote it, my solution didn't detect trailing junk either. From checking the docs I knew zip() stops as soon as either iterator ends, but I didn't make the connection that this would mean any trailing hyphen-separated junk would be completely ignored. I only found out after adding a test called check_invalid_too_many_segments. I expected all tests to pass, as I only added them in case I wanted to change the function in the future, and was puzzled about it for a bit. The check for exactly 5 segments at the top of match_uuid was only added after I found out why the test was failing.
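
The zip() behaviour can be reproduced in isolation with a sketch like the following. The names check_zip_only and check_strict are made up for this example; the second variant pulls segments from a mutable iterator and then requires it to be exhausted, in the spirit of the trailing-junk suggestion above.

```rust
const SEGMENT_LENGTHS: [usize; 5] = [8, 4, 4, 4, 12];

fn is_hex_of_len(seg: &str, len: usize) -> bool {
    seg.len() == len && seg.bytes().all(|b| b.is_ascii_hexdigit())
}

/// `zip` alone: it stops at the shorter iterator, so extra (or missing)
/// segments are never looked at.
pub fn check_zip_only(s: &str) -> bool {
    s.split('-')
        .zip(SEGMENT_LENGTHS)
        .all(|(seg, len)| is_hex_of_len(seg, len))
}

/// Drives the iteration from the expected lengths, then requires the
/// segment iterator to have nothing left.
pub fn check_strict(s: &str) -> bool {
    let mut parts = s.split('-');
    let all_match = SEGMENT_LENGTHS
        .iter()
        .all(|&len| parts.next().map_or(false, |seg| is_hex_of_len(seg, len)));
    all_match && parts.next().is_none()
}
```

Note that the zip-only version accepts not just trailing segments but also inputs with too few segments, such as a lone "2a1c18cb".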

This code is called in an async context, I should have specified!

Thanks for all the suggestions everyone! You taught me some new patterns and methods that I'm sure I'll utilize in some ways.

Indeed, but @quinedot's addition above is superior because it avoids creating two copies of the iterator and iterating over the string twice. (And it was in fact my original motivation for rewriting the consuming zip as a mutable iterator.)

It's still unnecessary unless you directly pass it to a higher-order function that expects the return value to be a literal future. If you are just calling it, just call it. There's nothing async in this function, so making it artificially async and then awaiting will not result in better concurrency, more speed, or any other benefit. You are just making it harder to call from non-async code.

Oh, you're right! I thought calling non-async methods from Axum handlers would be a compile error, but I guess not. Compiler's fine with it, so I guess I must've misremembered and made things async from the start, expecting it to be necessary. Thanks for the heads up.

If you couldn't call sync code from async code, then 90% of the standard library would be useless. slice.len() > 0? HashMap insertion? Out!

The advice is to not call blocking I/O from async, because that defeats the performance advantage of async (i.e., some other task could do something useful while the current task is waiting for I/O to return). But for code that's inherently sync and has no suspend points, you can't do any better than running it all at once.
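
As a self-contained illustration, here is a sketch of an async function calling a plain sync one. The handler function is a hypothetical stand-in for something like an Axum handler, and block_on is a deliberately tiny demo executor built on the standard library's Wake trait; real code would use a runtime.

```rust
use std::future::Future;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Plain synchronous check: no I/O and no suspend points.
pub fn has_uuid_length(s: &str) -> bool {
    s.len() == 36
}

/// Hypothetical async caller.
pub async fn handler(input: String) -> bool {
    // A sync function is simply called; there is nothing to `.await`.
    has_uuid_length(&input)
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Tiny executor for the demo only: polls the future in a loop until ready.
pub fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

Because handler contains no suspend points, the first poll already returns Poll::Ready.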

Iterating over the string twice seems less than ideal. Instead, I would use string.len() != 36, since that can be compiled into a simple integer comparison.
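
A sketch of how that could look when combined with the zip-based check (the function name is_uuid_shaped is made up here). The reasoning: the five segments total 8 + 4 + 4 + 4 + 12 = 32 characters, plus 4 hyphens makes 36, so if the total length is 36 and the first five segments have the right lengths, there is no room left for a sixth segment.

```rust
/// Length check first (a plain integer comparison), then a single pass
/// over the segments.
pub fn is_uuid_shaped(string: &str) -> bool {
    string.len() == 36
        && string
            .split('-')
            .zip([8usize, 4, 4, 4, 12])
            .all(|(seg, len)| seg.len() == len && seg.bytes().all(|b| b.is_ascii_hexdigit()))
}
```

Inputs with too few segments are rejected too, since their correctly sized segments could not add up to 36 bytes.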
