Are the Regex ascii classes correct?

david-waterworth · August 25, 2022, 10:44pm

I'm not sure if this forum is the correct place to raise questions about crates.io packages?

I've observed that my Rust regexes often fail to match what I think they should, in particular when I use regex101 to develop them and then port them to Rust. When I looked closer I found that the ASCII character classes in the crate documentation vary considerably from what I'm used to, i.e. the table in Regex Tutorial - POSIX Bracket Expressions.

For example, the Rust crate documentation (regex - Rust) defines [:print:] as [ -~], yet according to www.regular-expressions.info there are many more printable ASCII characters: [\x20-\x7E].

Similarly [:graph:] and [:punct:] are quite different.

I've not tested them all, but I can certainly confirm that Rust's [:punct:] misses the characters [$+<>=^`|~] despite them being included in other flavours.

Is this deliberate? Does Rust follow some specific flavour? Even so, surely the printable and graphical expressions should match more than a handful of characters?

cuviper · August 25, 2022, 10:50pm

[ -~] is a range with a hyphen separator -- that's everything from SPACE (0x20) through TILDE (0x7E), exactly like you found elsewhere. I didn't compare the others, but I suspect they are similar.

Also, make sure you use double-brackets in your pattern, as [:punct:] only matches those literal characters, while [[:punct:]] should match the character class.

BurntSushi · August 25, 2022, 11:23pm

I think it's not the wrong place, but the regex repo does have Discussions enabled. When you ask a question there, it gets fed right into my inbox. If I'm free and the question is easy, I can often answer pretty quickly.

[ -~] is indeed how [:print:] is defined and it is precisely equivalent to [\x20-\x7E]. \x20-\x7E is just another way to write it. The ASCII space character corresponds to the \x20 codepoint and the ASCII tilde character corresponds to the \x7E codepoint.

The [:graph:] definition matches what's on regular-expressions.info exactly.

The [:punct:] definition is actually slightly different! The regular-expressions.info website defines it as [!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_‘{|}~] whereas the regex crate defines it as [!-/:-@\[-\x60{-~]. Those look very different, but the former is written without any ranges and so it is more verbose.

If you actually translate these classes into their corresponding HIR, then we can see things a bit more precisely:

$ regex-cli debug hir '[!"\#$%&\x27()*+,\-./:;<=>?@\[\\\]^_‘{|}~]'
[!"\#$%&\x27()*+,\-./:;<=>?@\[\\\]^_‘{|}~]
------------------------------------------
    parse time:  53.102µs
translate time:  16.502µs

Hir {
    kind: Class(
        Unicode(
            ClassUnicode {
                set: IntervalSet {
                    ranges: [
                        ClassUnicodeRange {
                            start: "!",
                            end: "/",
                        },
                        ClassUnicodeRange {
                            start: ":",
                            end: "@",
                        },
                        ClassUnicodeRange {
                            start: "[",
                            end: "_",
                        },
                        ClassUnicodeRange {
                            start: "{",
                            end: "~",
                        },
                        ClassUnicodeRange {
                            start: "‘",
                            end: "‘",
                        },
                    ],
                },
            },
        ),
    ),
    info: HirInfo {
        bools: 1,
    },
}
$ regex-cli debug hir '[!-/:-@\[-`{-~]'
[!-/:-@\[-`{-~]
---------------
    parse time:  12.968µs
translate time:  6.764µs

Hir {
    kind: Class(
        Unicode(
            ClassUnicode {
                set: IntervalSet {
                    ranges: [
                        ClassUnicodeRange {
                            start: "!",
                            end: "/",
                        },
                        ClassUnicodeRange {
                            start: ":",
                            end: "@",
                        },
                        ClassUnicodeRange {
                            start: "[",
                            end: "`",
                        },
                        ClassUnicodeRange {
                            start: "{",
                            end: "~",
                        },
                    ],
                },
            },
        ),
    ),
    info: HirInfo {
        bools: 1,
    },
}

From the looks of it, the regular-expressions.info definition is missing \x60 (tilde) and adds ‘, where the latter is U+2018. The regex crate takes its definition from UTS#18:

$ regex-cli debug hir '[\p{ascii}&&[\p{gc=Punctuation}\p{gc=Symbol}--\p{alpha}]]'
[\p{ascii}&&[\p{gc=Punctuation}\p{gc=Symbol}--\p{alpha}]]
---------------------------------------------------------
    parse time:  50.673µs
translate time:  101.398µs

Hir {
    kind: Class(
        Unicode(
            ClassUnicode {
                set: IntervalSet {
                    ranges: [
                        ClassUnicodeRange {
                            start: "!",
                            end: "/",
                        },
                        ClassUnicodeRange {
                            start: ":",
                            end: "@",
                        },
                        ClassUnicodeRange {
                            start: "[",
                            end: "`",
                        },
                        ClassUnicodeRange {
                            start: "{",
                            end: "~",
                        },
                    ],
                },
            },
        ),
    ),
    info: HirInfo {
        bools: 1,
    },
}

which is precisely equivalent to [!-/:-@\[-\x60{-~].

This is exactly what is prescribed in UTS#18, and that conformance is documented in the regex UNICODE.md doc: https://github.com/rust-lang/regex/blob/master/UNICODE.md#rl12a-compatibility-properties

So is adding U+2018 correct? Well, since U+2018 isn't ASCII and these are called ASCII character classes (even on regular-expressions.info), I'd say... no, definitely not. regular-expressions.info appears to have an error.

But what about \x60? Is it correct to include that in [:punct:]? Well, UTS#18 includes it, and it's also easy to check that POSIX includes it too:

$ echo '`' | grep -E '[[:punct:]]'
`

So it looks like regular-expressions.info has two errors for [:punct:]. However, I think it's just one error. My best guess is that they meant to include \x60 and not U+2018, but when writing \x60, some word processor or whatever automatically translated it to U+2018. Look at where U+2018 appears in their character class: [!"\#$%&\x27()*+,\-./:;<=>?@\[\\\]^_‘{|}~]. It's sandwiched between _ (\x5F) and { (\x7B). So that's exactly where tilde or \x60 should go. (I keep spelling it out as "tilde" because Markdown makes it difficult to put a tilde inside inline code blocks. Sigh.)

So accounting for that error, I'd say the intended meaning of the ASCII definition of [:punct:] on regular-expressions.info matches precisely what the regex crate does.
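As a cross-check that doesn't involve the regex crate at all, the four ranges above can be compared with Rust's standard `char::is_ascii_punctuation`, which is documented with the same four ranges. This is a sketch; `in_punct_ranges` is a name invented for the illustration.

```rust
/// Hypothetical helper: membership in the four ranges from this thread,
/// i.e. `!-/`, `:-@`, `[-\x60` and `{-~`.
fn in_punct_ranges(c: char) -> bool {
    matches!(c, '!'..='/' | ':'..='@' | '['..='`' | '{'..='~')
}

/// True if the ranges agree with the standard library for every ASCII character.
fn ranges_agree_with_std() -> bool {
    (0u8..=0x7F)
        .map(char::from)
        .all(|c| in_punct_ranges(c) == c.is_ascii_punctuation())
}

fn main() {
    println!("ranges agree with std: {}", ranges_agree_with_std());
    // \x60 is in the set; U+2018 is not even ASCII.
    println!("\\x60 in ranges: {}", in_punct_ranges('\x60'));
    println!("U+2018 is ASCII: {}", '\u{2018}'.is_ascii());
}
```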

When saying stuff like "[:punct:] misses these characters", it is SUPER SUPER SUPER important to include the actual code you've written, its inputs and its outputs. For example, this program

use regex::Regex;
fn main() {
    let re = Regex::new(r"[[:punct:]]").unwrap();
    let haystack = "$+<>=^`|~";
    for m in re.find_iter(haystack) {
        println!("{:?}", m.as_str());
    }
}

outputs:

"$"
"+"
"<"
">"
"="
"^"
"`"
"|"
"~"

Which clearly demonstrates that [[:punct:]] matches every character in the string $+<>=^`|~. Playground link: Rust Playground

The flavor is "Perl-like", but limited to the subset of expressions that are actually regular (so no look-around or backreferences). However, this is just a flavor and like most regex engines, it doesn't behave identically to any other regex engine.

david-waterworth · August 25, 2022, 11:42pm

Sorry, I hadn't understood that the '-' represented a range. I'm using the huggingface tokenizers library, which is implemented in Rust, and it's not matching those characters. But I ran your simple example with the same version of regex (1.3) and Rust (2018) and that works, so it seems to be something in that library and not the Rust regex library that's causing my issue.

cuviper · August 26, 2022, 3:35pm

Except that's not tilde (~), that's a backtick (`) -- though they're on the same key on US layouts.

(The Markdown trick is to repeat and space it out, like `` ` ``.)

((And for inception, I had to write that as ``` `` ` `` ```, etc.))

BurntSushi · August 26, 2022, 4:22pm

Ug. Derp. Yes you are right. Stupid brain flipped things around.

Test: `

Wow, neat. Thanks!

system · November 24, 2022, 4:23pm

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.
