fn parse_str(string: &str) -> String {
    let mut result = String::new();
    let mut i = 0;
    while i < string.len()-1 {
        match string.chars().nth(i).unwrap() {
            '\\' => {result.push_str(match string.chars().nth(i+1).unwrap() {
                'n' => "\n",
                't' => "\t",
                'r' => "\r",
                '0' => "\0",
                '\\' => "\\",
                '\'' => "\'",
                '"' => "\"",
                'x' => {
                    let hex = &string[i + 2..i + 4];
                    let decoded = u8::from_str_radix(hex, 16)
                        .map(|num| num as char)
                        .unwrap_or('\0');
                    return decoded.to_string();
                }
                'u' => {
                    /* ??? */
                }
                _ => panic!("Invalid escape code.")
            }); i += 1}
            _ => result.push(string.chars().nth(i).unwrap())
        }
        i += 1
    }
    result
}
I was wondering how I would parse the \u escape. It does not have a fixed length, since it is written as \u{xxxx}, and I'm not sure whether there is a function like from_str_radix that handles it.
For what it's worth, that's the code I'm using in a lexer generator. It's not fully optimized, but it does the job (the function returns a basic Result type, so you'll have to adapt that if you want to keep String as a return type):
pub(crate) fn decode_str(strlit: &str) -> Result<String, String> {
    let mut result = String::new();
    let mut chars = strlit.chars();
    while let Some(c) = chars.next() {
        match c {
            '\\' => {
                result.push(match chars.next().ok_or(format!("'\\' incomplete escape code in string literal '{strlit}'"))? {
                    'n' => '\n',
                    'r' => '\r',
                    't' => '\t',
                    '\'' => '\'',
                    '\\' => '\\',
                    'u' => {
                        if !matches!(chars.next(), Some('{')) { return Err(format!("malformed unicode literal in string literal '{strlit}' (missing '{{')")); }
                        let mut hex = String::new();
                        loop {
                            let Some(h) = chars.next() else { return Err(format!("malformed unicode literal in string literal '{strlit}' (missing '}}')")); };
                            if h == '}' { break; }
                            hex.push(h);
                        };
                        let code = u32::from_str_radix(&hex, 16).map_err(|_| format!("'{hex}' isn't a valid hexadecimal value"))?;
                        char::from_u32(code).ok_or_else(|| format!("'{hex}' isn't a valid unicode hexadecimal value"))?
                    }
                    unknown => return Err(format!("unknown escape code '\\{unknown}' in string literal '{strlit}'"))
                });
            }
            _ => result.push(c)
        }
    }
    Ok(result)
}
Here's the specification of the literals I'm using, just in case. I'm not limiting the length of the hexadecimal digits, but any overflow will be caught by from_str_radix:
fragment HexDigit : [0-9a-fA-F];
fragment UnicodeEsc : 'u{' HexDigit+ '}';
fragment EscChar : '\\' ([nrt'\\] | UnicodeEsc);
fragment CharLiteral : '\'' Char '\'';
fragment StrLiteral : '\'' Char Char+ '\'';
CharLit : CharLiteral;
StrLit : StrLiteral;
PS: It looks like the syntax highlighter has a problem with escape sequences.
Since my function returns a String and not a Result<String, String>, I tried to adapt your code to my function without changing anything outside the 'u' match arm. Tell me if I did it correctly:
/* parse_str: escape special characters in strings */
fn parse_str(string: &str) -> String {
    let mut result = String::new();
    let mut i = 0;
    while i < string.len()-1 {
        match string.chars().nth(i).unwrap() {
            '\\' => {result.push_str(match string.chars().nth(i+1).unwrap() {
                'n' => "\n",
                't' => "\t",
                'r' => "\r",
                '0' => "\0",
                '\\' => "\\",
                '\'' => "\'",
                '"' => "\"",
                'x' => {
                    let hex = &string[i + 2..i + 4];
                    let decoded = u8::from_str_radix(hex, 16)
                        .map(|num| num as char)
                        .unwrap_or('\0');
                    return decoded.to_string();
                }
                'u' => {
                    if !matches!(string.chars().nth(i+1).unwrap(), '{') { panic!("Invalid \\u escape code.") }
                    let mut hex = String::new();
                    loop {
                        let h = string.chars().nth(i+1).unwrap();
                        if h == '}' { break; }
                        hex.push(h);
                    };
                    let code = u32::from_str_radix(&hex, 16)
                        .map(|num| char::from_u32(num).unwrap())
                        .unwrap_or('\0');
                    let decoded = char::from_u32(code as u32).unwrap();
                    return decoded.to_string()
                }
                _ => panic!("Invalid escape code.")
            }); i += 1}
            _ => result.push(string.chars().nth(i).unwrap())
        }
        i += 1
    }
    result
}
EDIT: No, it does not work. It panics with an invalid escape code error.
It looks fine, but I strongly recommend testing each case in your unit tests.
If the string you're parsing comes from any user input, I'd also recommend returning a Result, or at least an Option, rather than panicking, which should be reserved for internal errors. Also, if the unicode value is incorrect, you return '\0', which may be surprising. But it all depends on how you're using it, of course.
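To illustrate the point about '\0', here is a minimal sketch of a helper that reports a bad code point instead of silently replacing it. The name hex_to_char is hypothetical (it is not part of either function above); it only assumes the digits between \u{ and } have already been collected into a string:

```rust
/// Hypothetical helper: turns the digits found between `\u{` and `}` into a
/// `char`, reporting failures instead of substituting '\0'.
fn hex_to_char(hex: &str) -> Result<char, String> {
    // Fails on empty input, non-hexadecimal digits, or values that overflow u32.
    let code = u32::from_str_radix(hex, 16)
        .map_err(|e| format!("'{hex}' isn't a valid hexadecimal value: {e}"))?;
    // Fails on surrogates (D800..=DFFF) and on anything above 10FFFF.
    char::from_u32(code)
        .ok_or_else(|| format!("U+{code:X} isn't a valid Unicode scalar value"))
}
```

A caller that has to keep returning String can still decide to panic at the call site, for example with hex_to_char(&hex).unwrap_or_else(|e| panic!("{e}")), which at least keeps the error message.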
Returning a simple &str does not work because the value does not live long enough, so returning a String is a workaround. It should panic (maybe I will change this myself later in the dev process). What's important is that it does not work. I added a print!("\u{7FFF}") when my program starts, and entered "\u{7FFF}" manually (which goes through parse_str). It panicked because it does not recognize the escape. (The program is a lisp interpreter.)
翿Type an s-expr or use C-c to quit.
eval> "\u{7FFF}" ; The weird thing in the corner is this unicode
thread 'main' panicked at src/interpreter.rs:251:75:
Invalid \u escape code.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
You seem to increment i twice, though it's hard to tell if that's intended due to the strange indentation. You should replace your string.chars().nth(i).unwrap() and simply iterate on the chars, without using any indices. I'm sure the problem is coming from there.
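To make that concrete: in the posted 'u' arm, string.chars().nth(i+1) is still the 'u' itself, so the '{' check can never succeed, and the loop after it reads the same character forever because i never changes. Below is a rough sketch of how the function could look when it walks a single chars() iterator instead. It keeps the String return type and the panics from the original; the handling of \x (exactly two hex digits) is my assumption about what was intended:

```rust
/// Sketch of `parse_str` that walks the characters once and panics on
/// malformed input, like the original function.
fn parse_str(string: &str) -> String {
    let mut result = String::new();
    let mut chars = string.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            result.push(c);
            continue;
        }
        // Every arm yields a `char`, so the decoded value can simply be pushed.
        let decoded = match chars.next().expect("Incomplete escape code.") {
            'n' => '\n',
            't' => '\t',
            'r' => '\r',
            '0' => '\0',
            '\\' => '\\',
            '\'' => '\'',
            '"' => '"',
            'x' => {
                // Assumption: exactly two hexadecimal digits follow `\x`.
                let hex: String = chars.by_ref().take(2).collect();
                assert!(hex.len() == 2, "Invalid \\x escape code.");
                let byte = u8::from_str_radix(&hex, 16).expect("Invalid \\x escape code.");
                byte as char
            }
            'u' => {
                if chars.next() != Some('{') {
                    panic!("Invalid \\u escape code.");
                }
                // Collect digits up to the closing brace, which is consumed too.
                let mut hex = String::new();
                loop {
                    match chars.next() {
                        Some('}') => break,
                        Some(h) => hex.push(h),
                        None => panic!("Invalid \\u escape code."),
                    }
                }
                let code = u32::from_str_radix(&hex, 16).expect("Invalid \\u escape code.");
                char::from_u32(code).expect("Invalid \\u escape code.")
            }
            _ => panic!("Invalid escape code."),
        };
        result.push(decoded);
    }
    result
}
```

Because the iterator is the only cursor, there is no index to keep in sync, the last character is no longer skipped, and an empty string no longer underflows string.len()-1.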
1.
error[E0308]: mismatched types
--> src/interpreter.rs:263:21
|
263 | code.to_string()
| ^^^^^^^^^^^^^^^^ expected `&str`, found `String`
|
help: consider borrowing here
|
263 | &code.to_string()
| +
2.
error[E0716]: temporary value dropped while borrowed
--> src/interpreter.rs:263:22
|
235 | '\\' => {result.push_str(match string.chars().nth(i+1).unwrap() {
| -------- borrow later used by call
...
263 | &code.to_string()
| ^^^^^^^^^^^^^^^-
| | |
| | temporary value is freed at the end of this statement
| creates a temporary value which is freed while still in use
|
= note: consider using a `let` binding to create a longer lived value
// as_str does the same
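Both errors come from the same place: the other arms are string literals (&'static str), while the 'u' arm builds a String at run time, and borrowing that String inside the arm gives push_str a reference to a value that is already gone. One way around it, sketched below with hypothetical helper names (simple_escape and push_code_point are not from the code above), is to have the escapes produce a char and call push instead of push_str, since a char is copied rather than borrowed:

```rust
/// Hypothetical helper: maps the character after a backslash to the character
/// it stands for, or `None` if it is not one of the simple escapes.
fn simple_escape(code: char) -> Option<char> {
    match code {
        'n' => Some('\n'),
        't' => Some('\t'),
        'r' => Some('\r'),
        '0' => Some('\0'),
        '\\' | '\'' | '"' => Some(code),
        _ => None,
    }
}

/// Hypothetical helper: appends the code point to `result` and returns true,
/// or returns false when `code` is not a valid Unicode scalar value.
fn push_code_point(result: &mut String, code: u32) -> bool {
    match char::from_u32(code) {
        Some(ch) => {
            result.push(ch);
            true
        }
        None => false,
    }
}
```

With this shape no arm ever needs to hand out a &str, so neither E0308 nor E0716 can come up.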
OK, try this. I haven't done the "x" part; that I'll leave to you. You already have something that takes hexadecimal, so the best would be to merge both or use a sub-function.
There are a few others, but those I saw were pretty old and/or with a 0.x version, without any unit test. I haven't checked if any of them was sound, so make sure to test the results.