Code review - serialization library

This is serialization and de-serialization for two of the many custom representations used by Second Life and Open Simulator. There's an XML version and a binary version. I may submit this to "crates.io". This format isn't used for much. but there are third party programs that use it, and implementations in several languages.

The XML part seems to have more conversion between "str" and "string" than should be necessary. I've been struggling with that.

1 Like

Interesting, I like seeing seeing crates that deal with game data formats (working on one myself).

These are my thoughts:

Most of the functions inside parse_value() you can also install as methods on Cursor by importing bytes::Buf, if you are okay with a panicking implementation (I am, since for me it's an unrecoverable error anyways).

Just some bikeshedding: functions like generate_value, parse and parse_value are pretty vague, because you can't tell at a glance what they convert into what. Maybe you can implement some of these in terms of convert traits?

You can use the various methods on Option and Result (these are seriously amazing) to express things like:

match output {
        Some(out) => Ok(out),
        None => Err(anyhow!("Unexpected end of data, no <llsd> block.")),
    }

as my_result = out.ok_or_else(|| anyhow!("Unexpected end of data, no <llsd> block."))

parse has two variables both named buf. That is confusing and might possibly not do what you expect. It's also a pretty big function; consider refactoring it into smaller ones.

this is an unreachable (?) pattern.

Definitely don't want to panic based on data that comes in from the network. Occasional invalid data is normal for Second Life. It's an old system in C++, there's user-created content, and there's also Open Simulator, a mostly compatible set of servers with several variants. It's not quite as hostile an environment as reading web pages, but there are errors.

Only "parse" is exported, and that is a common name for things which get parsed. The top level function, the one exported from the library, is LLSDValue::parse, which is reasonably unambiguous.

Agreed. I generally dislike shadowing variables. But Rust sort of encourages it, in that you can write

let x = 1;
let x = x + 1;

without the compiler complaining of redeclaration.

The only reason the function looks big is because cargo fmt turns an error message call into 5 lines. When "fmt" gets something too long for one line, it breaks it up into one element per line, which gets bulky.

If it were unreachable, there would be a compiler warning. There are some other possible XML event types, such as CDATA, Decl, and Processing Instruction, none of which are allowed in this format. They need to be caught as errors.

Oh, I see now; you're matching _ => ..., which is actually Ok(_) => .... That was what confused me. I (as someone who isn't familiar with xml) didn't realize you weren't matching all Ok(Event) types.

That's useful information to include in the error message, e.g.:

Ok(other_event) => {
                return Err(anyhow!(
                    "Encountered illegal event {:?} at position {} while parsing map",
                    other_event,
                    reader.buffer_position(),
                ))
            }

This is something you can change :slight_smile:
Look into custom rustfmt options, for line width you can change max_width = some_number.

True, I was mostly thinking in terms of someone who wants to add/update functionality or is curious how the parser works.

Sadly, don't have time to do a proper review, but the following thing immediately jumped at me:

let snippet = &msgstring[0..usize::min(60,msgstring.len())]; // beginning of malformed LLSD      
        Err(anyhow!("LLSD format not recognized: {:?}", snippet))

Rust strings are guaranteed to be valid utf8, but they use raw byte positions for indexing. If you slice a string, the index might fall in the middle of a unicode codepoint. The result is a panic: Rust Playground

To get a short prefix for debug purposes, I'd write something like this:

fn short_prefix(mut s: &str) -> &str {
    let n = 60;
    if s.len() >= n {
        for idx in (n-4 .. n).rev() {
            if let Some(p) = s.get(..idx) {
                s = p;
                break
            }
        }
    }
    
    assert!(s.len() < n);
    s
}

(untested, compiles, but is not guaranteed to work)

1 Like

If you slice a string, the index might fall in the middle of a unicode codepoint. The result is a panic:

Oh, that makes sense. Is there a standard function to get the first N characters of a string, or a format descriptor to edit with a width limit? It seems a common enough function that there ought to be support for it somewhere.

There's "string.truncate(n)", but it counts bytes, not characters, and panics if the truncate point is in the middle of a character. That's somewhat non-useful semantics.

So I wrote a one-line truncate to N chars.

You should use take and accept an &str instead (for interoperability with non-String types that do deref to str), and you don't need the turbofish:

fn trim_length(s: &str, length: usize) -> String {
    s.chars().take(length).collect()
}

Even better: you don't need to allocate –

fn trim_length(s: &str, n_chars: usize) -> &str {
    let n_bytes = s.char_indices().nth(n_chars).map_or(s.len(), |(i, _)| i);
    &s[..n_bytes]
}
2 Likes

I know. cargo fmt has two modes for long lists of items: too high and too wide. You get to pick, but not get more intelligent behavior.

I know. cargo fmt has two modes for long lists of items: too high and too wide. You get to pick, but not get more intelligent behavior.

You might be looking for use_small_heuristics = "Max". But yeah, overall it's impossible to get "perfect" formatting with rustmft. You either have to settle for good enough, or format everything by hand.

Oh, that makes sense. Is there a standard function to get the first N characters of a string, or a format descriptor to edit with a width limit? It seems a common enough function that there ought to be support for it somewhere.

Sadly, the isn't. The reason for that is that, due to unicode, "what is a character" and "what is width" is a hard question with the proper answer changing over time. In particular, taking n chars is, pedantically, always wrong. The nth character might be the middle of complex unicode grapheme cluster like :ru: , for example. So, the solution is:

  • if you want to do it properly, you grab unicode_width - Rust and unicode_segmentation - Rust
  • if you want to do it quickly correctly enough (when the output is intended for the programmer and not the user), you need to write you own code to iterate chars
3 Likes

Right. I once wrote grapheme-aware word wrap in Rust, so I understand the problem.

1 Like