How to use 7-bit escape sequences

I want to color a string written out to standard output on Linux.
How do I write the octal byte 033 in string literals?

The Rust Reference has the following statement, but I'm not sure what it means.

Is \x used for both hexadecimal and octal escape sequences?
If so, how does it distinguish between 33 in octal and 33 in hexadecimal?

7-bit escapes

The escape sequence consists of \x followed by an octal digit then a hexadecimal digit.

Literal expressions - The Rust Reference

Rust doesn't support octal escapes in string literals. A 7-bit escape is still written in hexadecimal; the only difference is that the first digit is limited to 0-7 instead of 0-F.

You'll need to specify the byte in hexadecimal - octal 33 is hexadecimal 1b, so \x1b for this literal.
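As a rough sketch of the original use case, the snippet below wraps a string in the usual ANSI sequences for a red foreground and a reset. The helper name `red` is made up for illustration, and it assumes the terminal understands ANSI SGR codes.

```rust
/// Wraps `text` in the ANSI SGR codes for red foreground and reset.
/// `red` is a hypothetical helper, not part of any library.
fn red(text: &str) -> String {
    // \x1b is the ESC byte (octal 033, decimal 27).
    // "[31m" selects a red foreground and "[0m" resets all attributes.
    format!("\x1b[31m{}\x1b[0m", text)
}

fn main() {
    // On an ANSI-capable terminal this should show "error" in red.
    println!("{}", red("error"));
}
```

The crates mentioned elsewhere in this thread are probably the better choice once you need more than one or two colors.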

I use the colored crate to do that: colored - Rust

There's also nu_ansi_term or ansiconst if you want more than just colouring from your ANSI terminal codes.

I'm confused. "An octal digit followed by a hexadecimal digit" is very obviously not the same as "two octal digits or two hexadecimal digits", which is what your interpretation would imply. How could the documentation possibly mean that?

It seems like the '7 bit escape' thing is just an '8 bit escape' where the first digit is 0-7.
I don't see the purpose in having such a thing.

7-bit escapes are used in Unicode string and character literals, where they map to the subset of Unicode scalar values that are identical to ASCII and can be represented as single bytes in UTF-8. For USVs outside that range, you must use Unicode escape sequences like \u{FF}.

8-bit escapes are used in byte and byte string literals (which are not UTF-8, so can contain arbitrary u8 values).
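To make the distinction concrete, here is a small sketch that looks at the bytes produced by each kind of escape. The helper `utf8_bytes` is a hypothetical name introduced only for this example.

```rust
/// Returns the UTF-8 bytes of a string's value.
/// `utf8_bytes` is a hypothetical helper for this example only.
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}

fn main() {
    // 7-bit escape in a string literal: one ASCII byte.
    println!("{:x?}", utf8_bytes("\x7f"));
    // Above 0x7f a string literal needs a Unicode escape,
    // which is encoded as two bytes in UTF-8.
    println!("{:x?}", utf8_bytes("\u{ff}"));
    // 8-bit escape in a byte string literal: any u8 value, stored as-is.
    println!("{:x?}", b"\xff");
    // let bad = "\xff"; // does not compile: out of range for a string literal
}
```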

I forgot to mention why this difference is enforced. One possible reason is that removing the b from a byte string literal (like b"\x41\x42") to turn it into a Unicode string literal (like "\x41\x42") will result in a string whose UTF-8 representation exactly matches the original byte string, or a compile-time error if that is not possible.

That is, it ensures that various types of literals that look the same are guaranteed to have the same in-memory representation.
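A minimal sketch of that guarantee, assuming a hypothetical helper `same_representation` that compares a string's UTF-8 bytes with a byte string:

```rust
/// Returns true if the UTF-8 bytes of `text` equal `bytes`.
/// `same_representation` is a hypothetical helper for this example only.
fn same_representation(text: &str, bytes: &[u8]) -> bool {
    text.as_bytes() == bytes
}

fn main() {
    // Dropping the `b` prefix keeps the in-memory bytes identical.
    assert!(same_representation("\x41\x42", b"\x41\x42"));
    assert_eq!("\x41\x42", "AB");

    // If "\xff" were allowed in a string literal it could not be the single
    // byte 0xff, because that is not valid UTF-8. The Unicode escape for the
    // same number produces two bytes instead.
    assert!(!same_representation("\u{ff}", b"\xff"));
}
```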

The escape sequence consists of \x followed by an octal digit then a hexadecimal digit.

I think this documentation is needlessly confusing because base 8 isn't used here. It should just say two hexadecimal digits, and that the number must be in the range 0x00 - 0x7f.

I think it depends where it's documenting this.

If it's part of the lexer, it's just writing out the regex x[0-7][a-fA-F0-9] in English, and talking about the individual USVs in the source code is the correct approach in the lexer.

Note that for the value the same section describes it as

is the result of interpreting the final two characters in the escape sequence as a hexadecimal integer

which sounds more like what you're talking about.

This is the reference, so it's phrased for that, and that's very different from how I'd informally describe it to people.

Also, the page clarifies

In the definitions of escapes below:

  • An octal digit is any of the characters in the range [0-7].
  • A hexadecimal digit is any of the characters in the ranges [0-9], [a-f], or [A-F].

OK so my proposed definition matches this: two hexadecimal digits in the range 00..=7f. What's wrong with that? I don't see what is informal about that, it's just as formal.

Imagine if the range was from \x00 to \x6c for some reason. How would you describe that? Clearly just specifying the range is both more general and cleaner.

Imagine there was a token that was required to be a decimal number between 00 and 39, would it be better to say "one quaternary digit and one decimal digit, whose value is computed by reinterpreting the quaternary digit as a decimal digit", or would it be better to simply say "a two-digit number in the range 0..=39"? The latter is equally formal, and much shorter and clearer.

Unicode escapes also have a limited range, the lexer has to reject numbers bigger than \u{10ffff}, and yet the grammar and its English description don't specify the range using regexes, it just says "up to 6 digits".
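For reference, the range limit on Unicode escapes can be observed at runtime through `char::from_u32`. This is only a sketch of the value-level rule, not of what the lexer itself does; `is_scalar_value` is a hypothetical helper name.

```rust
/// Returns true if `value` is a Unicode scalar value, i.e. something a
/// `char` can hold. `is_scalar_value` is a hypothetical helper name.
fn is_scalar_value(value: u32) -> bool {
    char::from_u32(value).is_some()
}

fn main() {
    // The largest valid scalar value.
    println!("{}", is_scalar_value(0x10ffff)); // prints "true"
    // One past the end of the range.
    println!("{}", is_scalar_value(0x110000)); // prints "false"
    // Six hex digits, so it fits "up to 6 digits", but it is still too large.
    println!("{}", is_scalar_value(0xAA0000)); // prints "false"
}
```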

There's a bunch of open conversations about exactly what the lexer should accept in which positions. For example, '\u{AA0000}' is clearly not semantically valid, but it's possible that it might be lexically valid, either to a proc macro or as an ignored tt.

So from the perspective of the reference, I still want both "this is the regex that you should put in your tokenizer" that does talk in terms of the individual characters.

From the perspective of

There's an important lexical choice to make there. If that was the range, is \x6d valid inside a cfg(FALSE) or not? Is it legal to call ignore_one_tt!('\x6d'); because the lexer doesn't care, even if let c = '\x6d'; has to fail?

As a programmer, I almost certainly don't care. But the reference has to describe such things.

Ha, in the reference it's exactly the reverse of what you want.

In the section about Tokens it says exactly what I suggested:

A 7-bit code point escape starts with U+0078 (x) and is followed by exactly two hex digits with value up to 0x7F

In the section about Expressions it says:

The escape sequence consists of \x followed by an octal digit then a hexadecimal digit.

But my point is that a description of the form "exactly two hex digits with value up to 0x7F" is exactly equivalent to the regex [0-7][a-fA-F0-9]. So it's described properly either way. Regexes aren't the only way to define tokens.

Sure, in the formal grammar you can use a regex instead, but we're talking about the part in English.

If it's unclear what happens under cfg(FALSE) or in macros, then switching from one way of describing it to the other doesn't clarify it because the two definitions are equivalent. So that's a separate issue.
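The equivalence claimed here can be checked mechanically. The sketch below encodes both English descriptions as predicates over the two characters after \x and compares them for every pair of ASCII characters. All function names are made up for this example.

```rust
/// Description 1: an octal digit followed by a hexadecimal digit.
fn octal_then_hex(first: char, second: char) -> bool {
    first.is_digit(8) && second.is_ascii_hexdigit()
}

/// Description 2: exactly two hex digits with value up to 0x7F.
fn two_hex_up_to_7f(first: char, second: char) -> bool {
    match (first.to_digit(16), second.to_digit(16)) {
        (Some(hi), Some(lo)) => hi * 16 + lo <= 0x7f,
        _ => false,
    }
}

/// Compares the two descriptions over every pair of ASCII characters.
fn descriptions_agree() -> bool {
    (0u8..=127).all(|a| {
        (0u8..=127).all(|b| {
            let (first, second) = (a as char, b as char);
            octal_then_hex(first, second) == two_hex_up_to_7f(first, second)
        })
    })
}

fn main() {
    assert!(descriptions_agree());
    assert!(octal_then_hex('1', 'b')); // \x1b is accepted
    assert!(!two_hex_up_to_7f('8', '0')); // \x80 is rejected
}
```

This only shows that the two descriptions accept the same character pairs. It says nothing about where in the compiler the check is performed, which is the separate question raised above.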
