RFC: serde xml support


#1

Hi everyone.

TLDR: I’m writing the xml support for rust-serde. I’d like some comments on what others expect from Rust<->Xml conversions.

Now, I’m not an expert when it comes to the xml-standard and xsd-schemata. I’m making most of this up as I go.

Some information about xml:

  • every xml document must have exactly one root element
  • every xml document must have some info at the start about the xml version and stuff
    • looks like <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    • I don’t think that’s relevant for Rust
      • I’m simply assuming it’s always utf-8
    • I haven’t implemented reading that
    • I’m probably going to simply ignore such a prefix
  • xsd-schemata allow the description of xml documents
    • sequences, optional elements, choices between elements…
    • ordered and unordered child-elements (basically struct fields in order or out of order)
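
Ignoring the version prefix could be sketched roughly as follows. This is only an illustration using the standard library; `skip_xml_declaration` is a hypothetical helper, not something serde or the xml deserializer provides, and it is more lenient than the spec because it tolerates leading whitespace.

```rust
/// Returns the document with a leading `<?xml ... ?>` declaration removed.
/// Hypothetical helper for illustration only.
fn skip_xml_declaration(doc: &str) -> &str {
    // Leniency (an assumption): whitespace before the declaration is accepted.
    let trimmed = doc.trim_start();
    if let Some(rest) = trimmed.strip_prefix("<?xml") {
        // Require whitespace after `<?xml` so that other processing
        // instructions such as `<?xml-stylesheet ...?>` are left alone.
        if rest.starts_with(char::is_whitespace) {
            // The declaration ends at the first `?>`.
            if let Some(end) = rest.find("?>") {
                return rest[end + 2..].trim_start();
            }
        }
    }
    trimmed
}
```

The encoding attribute is never inspected here, which matches the "always utf-8" assumption above.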

Current state

I can parse any int, float, char or string from xml like <start_tag>value</end_tag>. Note: start_tag and end_tag can be arbitrary, since xml has no way to express a bare value without a surrounding element.
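
To make the idea concrete, here is a minimal sketch that is independent of serde. `parse_simple` is a hypothetical function: it ignores the tag names entirely and hands the text between them to `FromStr`.

```rust
use std::str::FromStr;

/// Parses the text between the first `>` and the last `</`.
/// The tag names are ignored and not compared with each other.
/// Hypothetical helper, not the actual deserializer.
fn parse_simple<T: FromStr>(xml: &str) -> Option<T> {
    let start = xml.find('>')? + 1;
    let end = xml.rfind("</")?;
    if end < start {
        return None;
    }
    xml[start..end].parse().ok()
}
```

For instance, `parse_simple::<i32>("<n>42</n>")` yields `Some(42)`, and a mismatched pair such as `<x>1.5</whatever>` is accepted just the same.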

I can parse structs whose fields are sequences (any tuple, [T;N], Vec), ints, floats, chars, strings, Options, or other structs. An example:

#[derive(PartialEq, Debug, Serialize, Deserialize)]
struct Inner {
    a: (),
    b: (usize, String, i8),
    c: Vec<String>,
}
#[derive(PartialEq, Debug, Serialize, Deserialize)]
struct Outer {
    inner: Option<Inner>,
}

Outer {
    inner: Some(Inner {
        a: (),
        b: (2, "boom".to_string(), 88),
        c: vec![
            "abc".to_string(),
            "xyz".to_string(),
        ]
    })
}

<Outer>
    <inner>
        <c>abc</c>
        <c>xyz</c>
        <a/>
        <b>2</b>
        <b>boom</b>
        <b>88</b>
    </inner>
</Outer>

As you can see, sequences in xml are simply the same element repeated over and over (with changing contents).
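
A rough sketch of that rule, assuming the child elements have already been read into (name, text) pairs: every repetition of a name is appended to that field's sequence, regardless of where it appears. `group_children` is a hypothetical helper, not how the serde visitor is actually structured.

```rust
use std::collections::BTreeMap;

/// Groups child elements by name, keeping the values of each name in
/// document order. Hypothetical helper for illustration.
fn group_children<'a>(children: &[(&'a str, &'a str)]) -> BTreeMap<&'a str, Vec<&'a str>> {
    let mut fields = BTreeMap::new();
    for &(name, text) in children {
        // A repeated element name extends the same sequence.
        fields.entry(name).or_insert_with(Vec::new).push(text);
    }
    fields
}
```

Fed with the children of <inner> from the example, the entry for "b" holds "2", "boom" and "88", which is what the tuple field b is then built from.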

Future

Xml-Attributes

An encoded instance of struct A { x: String } could look like <A x="foo" /> instead of <A><x>foo</x></A>
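
The two encodings side by side, written out by hand. This is only a sketch: both functions are hypothetical, and escaping of special characters in the value is deliberately left out.

```rust
struct A {
    x: String,
}

/// Field encoded as an attribute of the struct's element.
/// No escaping is done (an omission made for brevity).
fn to_attribute_form(a: &A) -> String {
    format!("<A x=\"{}\" />", a.x)
}

/// Field encoded as a child element.
fn to_element_form(a: &A) -> String {
    format!("<A><x>{}</x></A>", a.x)
}
```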

Rust-Enums

The enum kind could be encoded as its own tag or just as an attribute of the outer tag:

<A>
  <x xsi:type="Cake" />
  <x><Cake/></x>
</A>

The issue is that using a special attribute would interfere with struct fields being encoded as attributes. It would be a requirement that the xsi:type attribute always comes first. Otherwise we cannot decide which enum variant it is before trying to parse the enum contents, and that would require unbounded (until the next >) lookahead.
Encoding the enum kind as its own tag is the clean way from a parser-design point of view, but it would be incompatible with xsd.
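
The lookahead problem can be seen in a small sketch. `xsi_type` is a hypothetical helper that needs the complete start tag in hand before it can name the variant, because the attribute may come after any number of others.

```rust
/// Finds the value of `xsi:type` anywhere inside a start tag.
/// Hypothetical helper; it assumes double-quoted attribute values.
fn xsi_type(start_tag: &str) -> Option<&str> {
    let key = "xsi:type=\"";
    let from = start_tag.find(key)? + key.len();
    let len = start_tag[from..].find('"')?;
    Some(&start_tag[from..from + len])
}
```

With `<x a="1" xsi:type="Cake">` the attribute a has to be skipped or buffered before the variant is known, which is exactly what a streaming parser would like to avoid.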

check all closing tags

Currently closing tags (</foo>) are not compared to their opening tags to see if the name matches. This is just a nice-to-have, but it isn’t possible yet in serde without doing heap allocations.
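
For illustration, a check like this can be sketched outside of serde with a stack of open tag names. Note that the stack is itself a heap allocation, which is the cost mentioned above. `tags_balanced` is a hypothetical function and assumes that neither attribute values nor comments contain a > character.

```rust
/// Returns true if every closing tag matches the most recent open tag.
/// Hypothetical sketch; assumes no `>` inside attribute values or comments.
fn tags_balanced(xml: &str) -> bool {
    let mut open: Vec<&str> = Vec::new(); // heap-allocated stack of names
    let mut rest = xml;
    while let Some(lt) = rest.find('<') {
        let gt = match rest[lt..].find('>') {
            Some(offset) => lt + offset,
            None => return false, // unterminated tag
        };
        let tag = &rest[lt + 1..gt];
        rest = &rest[gt + 1..];
        if tag.starts_with('?') || tag.starts_with('!') {
            continue; // declaration, comment or doctype
        }
        if let Some(name) = tag.strip_prefix('/') {
            // Closing tag: must match the innermost open element.
            if open.pop() != Some(name.trim()) {
                return false;
            }
        } else if !tag.ends_with('/') {
            // Opening tag: remember its name (attributes are cut off).
            open.push(tag.split_whitespace().next().unwrap_or(""));
        }
        // Self-closing tags such as `<b/>` need no bookkeeping.
    }
    open.is_empty()
}
```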

serialization

Once the deserializer does its job for all relevant rust types it would be nice to also serialize stuff to xml.

mixed content

example xml:

<root>hi <b>you</b><i>!</i></root>

I have no idea what I should do with that. Is that a sequence of

enum Mixed<T> { Text(String), Element(T) }

Or should this be the String “hi <b>you</b><i>!</i>”?
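
To make the first option concrete, this is what the example would look like as a sequence of Mixed values. The `Inline` enum and both functions are made up for the illustration; nothing like this exists in the implementation yet.

```rust
#[derive(Debug, PartialEq)]
enum Mixed<T> {
    Text(String),
    Element(T),
}

/// Hypothetical element type for the `<b>` and `<i>` children.
#[derive(Debug, PartialEq)]
enum Inline {
    B(String),
    I(String),
}

/// The example `<root>hi <b>you</b><i>!</i></root>` built by hand.
fn example() -> Vec<Mixed<Inline>> {
    vec![
        Mixed::Text("hi ".to_string()),
        Mixed::Element(Inline::B("you".to_string())),
        Mixed::Element(Inline::I("!".to_string())),
    ]
}

/// Concatenates all text while dropping the markup.
fn plain_text(parts: &[Mixed<Inline>]) -> String {
    let mut out = String::new();
    for part in parts {
        match part {
            Mixed::Text(t) => out.push_str(t),
            Mixed::Element(Inline::B(t)) | Mixed::Element(Inline::I(t)) => out.push_str(t),
        }
    }
    out
}
```

The sequence form keeps the structure, so the plain text "hi you!" can still be recovered from it, while the String form would need to be parsed a second time to get at the elements.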

Root element

In case of structs, should the root element be named after the struct name? Or don’t we really care about those since the parser doesn’t require them to figure out what it’s parsing?

Not deserializable Xml

Can you think of any xml that could not be deserialized to a Rust type but should be?


#2

This seems incorrect. According to the XML 1.0 spec, the XML declaration (e.g. <?xml version="1.0" encoding="UTF-8" standalone="yes"?>) is optional. In XML 1.1 it’s mandatory, but XML 1.1 parsers MUST be able to parse documents without it regardless.

Other than this tiny correction, the project seems fine. Reminds me of good old XStream for Java.


#3

Another issue is this xsd

<xs:choice minOccurs="1" maxOccurs="unbounded">
    <xs:element name="c" type="A" />
    <xs:element name="d" type="B" />
</xs:choice>

and the corresponding xml:

<some_outer_struct>
    <c/>
    <d/>
    <c/>
    <c/>
    <c/>
    <d/>
</some_outer_struct>

Technically this should be parsed to a SomeOuterStruct { x: Vec<AB> } with enum AB { A, B }. But there’s no real way to get the element name ‘x’, since it never appears in the xml. This kind of sequence cannot be parsed without modifying the xsd and xml.
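
The target types and a hand-written mapping could look like the sketch below. `from_child_names` is hypothetical and works on the bare element names; it sidesteps the problem instead of solving it, because a field-name-driven deserializer has nothing called x to match against.

```rust
#[derive(Debug, PartialEq)]
enum AB {
    A,
    B,
}

#[derive(Debug, PartialEq)]
struct SomeOuterStruct {
    x: Vec<AB>,
}

/// Maps the element names `c` and `d` to the variants of `AB`.
/// Hypothetical helper; returns None for any other element name.
fn from_child_names(names: &[&str]) -> Option<SomeOuterStruct> {
    let x = names
        .iter()
        .map(|name| match *name {
            "c" => Some(AB::A),
            "d" => Some(AB::B),
            _ => None,
        })
        .collect::<Option<Vec<AB>>>()?;
    Some(SomeOuterStruct { x })
}
```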


#4

So this is a tool like Code Synthesis’s XSD but for Rust?

There are from my experience several cases where type names must be invented by the code generator, but I’ve never come across @oli_obk’s case. I just tested, and XSD generates something like struct SomeOuterStruct { vector<A> c; vector<B> d; } which is not really correct… first time I’ve encountered an issue like this.


#5

Yeah, I’ve worked with Code Synthesis XSD. Thanks for testing my example.

Actually serde is much easier to use than Code Synthesis XSD, since you do not have to write any code for serialization or deserialization, even though you are deserializing to your own types. There’s no code generation step.

What you don’t get is xsd-validity checking. You only get a guarantee about the rust types. There might be more types that can be represented with xsd but not with my serde-xml implementation.


#6

Another issue is whitespace. Should <a> </a> be

  1. a string of some whitespace chars
  2. or should it be a unit?

Both have advantages:

  1. You can encode/decode strings that only contain whitespace
  2. You can insert arbitrary whitespace without affecting the structure

Both have disadvantages:

  1. <a> </a> cannot be parsed as a struct where all fields should be defaulted, but <a></a> and <a/> can?
  2. You cannot create strings that only consist of whitespace -> you always lose information when parsing -> you cannot parse into an arbitrary structure representing your xml (like a dom-tree)
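
The two options can be written down as a policy switch. This is a sketch with made-up names (`Content`, `WhitespacePolicy`, `interpret`), not a proposal for the actual API.

```rust
#[derive(Debug, PartialEq)]
enum Content {
    Unit,
    Text(String),
}

#[derive(Clone, Copy)]
enum WhitespacePolicy {
    /// Option 1: whitespace-only content is a string.
    Preserve,
    /// Option 2: whitespace-only content is a unit.
    Ignore,
}

/// Interprets the text between `<a>` and `</a>` under the given policy.
fn interpret(text: &str, policy: WhitespacePolicy) -> Content {
    match policy {
        WhitespacePolicy::Ignore if text.trim().is_empty() => Content::Unit,
        _ if text.is_empty() => Content::Unit,
        _ => Content::Text(text.to_string()),
    }
}
```

Under Preserve, `<a> </a>` and `<a></a>` give different results; under Ignore they are the same, but a string consisting of a single space can no longer be represented.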

#7

Encoding struct members in attributes doesn’t scale to complicated types, recursive types, etc.


#8

But it’s very nice to have for the lowest-level of structs.


#9

The question is more about decoding existing xml. I can simply use attributes as struct fields that deserialize to simple types. Otherwise attributes can never be parsed from xml.


#10

Let’s say we have a struct:

struct A {
    x: Vec<(i32, String, char)>
}

An instance of the struct looks like

let a = A {
    x: vec![
        (42, "Cake".to_string(), '♫'),
        (0, "Blub".to_string(), 'c'),
    ],
};

A corresponding xml would look like

<A>
    <x>42</x>
    <x>Cake</x>
    <x>♫</x>
    <x>0</x>
    <x>Blub</x>
    <x>c</x>
</A>

But this gets problematic with Vec<Vec<T>>. If I implement this as above, then you’d always read a vec![vec![...]] (a Vec with a single inner Vec that contains all elements). I don’t think this can be prevented.
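
A small sketch of why the information is lost: once every element is written as a repeated <x>, differently nested inputs produce the same output. `flatten` is a hypothetical stand-in for the encoding described above.

```rust
/// Writes every inner element as its own `<x>` element, which is what
/// the repeated-element encoding amounts to for nested sequences.
fn flatten(nested: &[Vec<i32>]) -> Vec<String> {
    nested
        .iter()
        .flatten()
        .map(|value| format!("<x>{}</x>", value))
        .collect()
}
```

Both `vec![vec![1, 2], vec![3]]` and `vec![vec![1, 2, 3]]` flatten to the same three elements, so a reader has no way to tell where one inner Vec ends and the next begins.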

@erickt : what do you think, maybe extending serde with “fixed size sequences” would work, as then I could error out on layered unsized sequences.


#11

@oli_obk, Thanks a lot for this!

The only issue I’m having is - how to handle empty (self-closing) tags?

I’m trying to deserialize an xml where, if there’s no data for a tag, it has a self-closing tag. Something like this:

<root>
 <foo>
  <a>Hello</a>
  <b>World</b>
 </foo>
 <foo>
  <a>Hi</a>
  <b/>
 </foo>
</root>

I’ve tried this but it does not work. Is there a better way to get it working?


#12

Ah. That is an oversight… thanks for the test case.


#13

Fixed. It doesn’t work for root elements yet, though.


#14

Thanks a lot, works great!