I think it's not the wrong place, but the regex repo does have Discussions enabled. When you ask a question there, it gets fed right into my inbox. If I'm free and the question is easy, I can often answer pretty quickly.
[ -~] is indeed how [:print:] is defined, and it is precisely equivalent to [\x20-\x7E]. \x20-\x7E is just another way to write it. The ASCII space character corresponds to the \x20 codepoint and the ASCII tilde character corresponds to the \x7E codepoint.
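If you want to convince yourself of the equivalence without pulling in the regex crate, here is a minimal sketch using only the standard library. The helper names `is_print` and `classes_agree` are made up for illustration. It assumes that [:print:] is the same as [:graph:] plus the space character, which is how the ASCII classes are usually described.

```rust
/// Membership test for [ -~], written as the inclusive byte range \x20-\x7E.
/// (Hypothetical helper, not part of the regex crate.)
fn is_print(b: u8) -> bool {
    (0x20..=0x7E).contains(&b)
}

/// Checks every ASCII byte: the range test above should agree with
/// "space, or an ASCII graphic character" as defined by the standard library.
fn classes_agree() -> bool {
    (0u8..=127).all(|b| is_print(b) == (b == b' ' || b.is_ascii_graphic()))
}

fn main() {
    // Space and tilde are the two endpoints of the range.
    println!("space = {:#04X}, tilde = {:#04X}", b' ', b'~');
    println!("range agrees with std on all ASCII bytes: {}", classes_agree());
}
```

This only checks the ASCII definition. It says nothing about how the class behaves when Unicode mode is involved.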
The [:graph:] definition matches what's on regular-expressions.info exactly.
The [:punct:] definition is actually slightly different! The regular-expressions.info web site defines it as [!"\#$%&'()*+,\-./:;<=>?@\[\\\]^_‘{|}~] whereas the regex crate defines it as [!-/:-@\[-\x60{-~]. Those look very different, but the former is written without any ranges and so it is more verbose.
If you actually translate these classes into their corresponding HIR, then we can see things a bit more precisely:
$ regex-cli debug hir '[!"\#$%&\x27()*+,\-./:;<=>?@\[\\\]^_‘{|}~]'
[!"\#$%&\x27()*+,\-./:;<=>?@\[\\\]^_‘{|}~]
------------------------------------------
parse time: 53.102µs
translate time: 16.502µs
Hir {
kind: Class(
Unicode(
ClassUnicode {
set: IntervalSet {
ranges: [
ClassUnicodeRange {
start: "!",
end: "/",
},
ClassUnicodeRange {
start: ":",
end: "@",
},
ClassUnicodeRange {
start: "[",
end: "_",
},
ClassUnicodeRange {
start: "{",
end: "~",
},
ClassUnicodeRange {
start: "‘",
end: "‘",
},
],
},
},
),
),
info: HirInfo {
bools: 1,
},
}
$ regex-cli debug hir '[!-/:-@\[-`{-~]'
[!-/:-@\[-`{-~]
---------------
parse time: 12.968µs
translate time: 6.764µs
Hir {
kind: Class(
Unicode(
ClassUnicode {
set: IntervalSet {
ranges: [
ClassUnicodeRange {
start: "!",
end: "/",
},
ClassUnicodeRange {
start: ":",
end: "@",
},
ClassUnicodeRange {
start: "[",
end: "`",
},
ClassUnicodeRange {
start: "{",
end: "~",
},
],
},
},
),
),
info: HirInfo {
bools: 1,
},
}
From the looks of it, the regular-expressions.info definition is missing \x60 (backtick) and adds ‘, where the latter is U+2018. The regex crate takes its definition from UTS#18:
$ regex-cli debug hir '[\p{ascii}&&[\p{gc=Punctuation}\p{gc=Symbol}--\p{alpha}]]'
[\p{ascii}&&[\p{gc=Punctuation}\p{gc=Symbol}--\p{alpha}]]
---------------------------------------------------------
parse time: 50.673µs
translate time: 101.398µs
Hir {
kind: Class(
Unicode(
ClassUnicode {
set: IntervalSet {
ranges: [
ClassUnicodeRange {
start: "!",
end: "/",
},
ClassUnicodeRange {
start: ":",
end: "@",
},
ClassUnicodeRange {
start: "[",
end: "`",
},
ClassUnicodeRange {
start: "{",
end: "~",
},
],
},
},
),
),
info: HirInfo {
bools: 1,
},
}
which is precisely equivalent to [!-/:-@\[-\x60{-~].
This is exactly what is prescribed in UTS#18, and that conformance is documented in the regex crate's UNICODE.md doc (regex/UNICODE.md at master · rust-lang/regex · GitHub).
So is adding U+2018 correct? Well, since U+2018 isn't ASCII and these are called ASCII character classes (even on regular-expressions.info), I'd say... no, definitely not. regular-expressions.info appears to have an error.
But what about \x60? Is it correct to include that in [:punct:]? Well, UTS#18 includes it, and it's also easy to check that POSIX includes it too:
$ echo '`' | grep -E '[[:punct:]]'
`
So it looks like regular-expressions.info has two errors for [:punct:]. However, I think it's just one error. My best guess is that they meant to include \x60 and not U+2018, but when writing \x60, some word processor or whatever automatically translated it to U+2018. Look at where U+2018 appears in their character class: [!"\#$%&\x27()*+,\-./:;<=>?@\[\\\]^_‘{|}~]. It's sandwiched between _ (\x5F) and { (\x7B). So that's exactly where backtick, or \x60, should go. (I keep spelling it out as "backtick" because Markdown makes it difficult to put a backtick inside inline code blocks. Sigh.)
So accounting for that error, I'd say the intended meaning of the ASCII definition of [:punct:] on regular-expressions.info matches precisely what the regex crate does.
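The difference between the two definitions can also be computed directly, with no regex engine involved. The following is a rough sketch using only the standard library. The function names are hypothetical, and the character list in `site_punct` is my transcription of the class as it is quoted in this post, with its escapes removed, so treat it as an assumption rather than an authoritative copy of the web site.

```rust
use std::collections::BTreeSet;

/// Characters in the class as quoted from regular-expressions.info,
/// including U+2018 where a backtick was presumably intended.
fn site_punct() -> BTreeSet<char> {
    "!\"#$%&'()*+,-./:;<=>?@[\\]^_\u{2018}{|}~".chars().collect()
}

/// The four ranges of [!-/:-@\[-\x60{-~], expanded into individual characters.
fn crate_punct() -> BTreeSet<char> {
    ('!'..='/')
        .chain(':'..='@')
        .chain('['..='`')
        .chain('{'..='~')
        .collect()
}

/// Characters in the range-based definition that the quoted list lacks.
fn missing_from_site() -> Vec<char> {
    crate_punct().difference(&site_punct()).copied().collect()
}

/// Characters in the quoted list that the range-based definition lacks.
fn extra_in_site() -> Vec<char> {
    site_punct().difference(&crate_punct()).copied().collect()
}

fn main() {
    println!("missing from the quoted list: {:?}", missing_from_site());
    println!("extra in the quoted list: {:?}", extra_in_site());
    // The range-based set should also line up with std's notion of ASCII punctuation.
    let all_punct = crate_punct().iter().all(|c| c.is_ascii_punctuation());
    println!("ranges are all ASCII punctuation per std: {}", all_punct);
}
```

If the transcription is right, the only missing character is the backtick and the only extra one is U+2018, which fits the "one error, not two" reading.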
When saying stuff like this, it is SUPER SUPER SUPER important to include the actual code you've written, its inputs and its outputs. For example, this program
use regex::Regex;

fn main() {
    let re = Regex::new(r"[[:punct:]]").unwrap();
    let haystack = "$+<>=^`|~";
    for m in re.find_iter(haystack) {
        println!("{:?}", m.as_str());
    }
}
outputs:
"$"
"+"
"<"
">"
"="
"^"
"`"
"|"
"~"
Which clearly demonstrates that [[:punct:]] matches every character in the string $+<>=^`|~. Playground link: Rust Playground
The flavor is "Perl-like", but limited to the subset of expressions that are actually regular (so no look-around or backreferences). However, this is just a flavor and like most regex engines, it doesn't behave identically to any other regex engine.