How can I write elements with CDATA in XML?

Hello,
I am relatively new to Rust and am currently struggling with the following issue. I have the following struct:

#[derive(Serialize, Deserialize, Debug)]
pub struct XmlCountry {
    #[serde(rename = "ShortName")]
    pub short_name: String,
    #[serde(rename = "Name")]
    pub full_name: String,
    #[serde(rename = "Comment", default)]
    pub comment: String,
    #[serde(rename = "PrivateComment", default)]
    pub private_comment: Option<String>,
    #[serde(rename = "Order", default)]
    pub order: i32,
    #[serde(skip_serializing, skip_deserializing)]
    _extra: Option<serde_json::Value>,  // To ignore any extra elements
}

I want the Comment and PrivateComment elements to be always written wrapped in:
<![CDATA[]]>

I have tried multiple methods, such as custom serializers, but I cannot get it to work. The closest I got was creating a custom CData struct and a custom serializer like this:

fn serialize_cdata<S>(value: &CData, serializer: S) -> Result<S::Ok, S::Error>
where
    S: Serializer,
{
    let cdata_str = format!("<![CDATA[{}]]>", value.0);
    serializer.serialize_str(&cdata_str)
}

But even with this, in the XML file I get:

<Comment>&lt;![CDATA[myvalue]]&gt;</Comment>

instead of:

<Comment><![CDATA[myvalue]]></Comment>

Is there really no way to write proper CDATA sections with Rust?

Of course. But this isn't a Rust issue, it's a serde issue.

serde works by defining a data model that connects types (like XmlCountry) to serializers (like serde_json, quick_xml, etc.). This data model, as far as I am aware, has no concept of "CDATA". As such, it's impossible to communicate that you want a particular string to be encoded as CDATA.

That's also why your "wrapper" code doesn't work: the string you generate is being passed to the XML serializer, which then correctly escapes the < and >. If it didn't, any string that contained those would be corrupted.
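To make that concrete, here is a minimal sketch of what effectively happens to your string. The names escape_markup and serialize_element are made up for illustration and are not part of serde or quick-xml; real serializers are more thorough, but the effect on the wrapper string is the same.

```rust
/// Escapes the characters that are significant in XML character data.
/// Simplified stand-in for the escaping an XML serializer applies to every string.
fn escape_markup(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    for ch in raw.chars() {
        match ch {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            other => out.push(other),
        }
    }
    out
}

/// Hypothetical serializer step: it only ever receives a plain string,
/// so it cannot know that the caller meant part of it as markup.
fn serialize_element(name: &str, value: &str) -> String {
    format!("<{0}>{1}</{0}>", name, escape_markup(value))
}
```

Calling serialize_element("Comment", "<![CDATA[myvalue]]>") yields <Comment>&lt;![CDATA[myvalue]]&gt;</Comment>, which is exactly the output you are seeing.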

As an aside: from what I remember, CDATA should never be necessary; it's just another way of writing character data.

Theoretically, you could do this in an XML serializer using a backchannel to communicate "hey, the next string you see should be encoded as CDATA", but this would be non-standard, and require you to write your own custom serializers. Also, I checked the first few XML serializers for serde that popped up in a quick search, and none of them seemed to support anything of the sort.

If you absolutely need to have those fields encoded as CDATA, you're going to need either a dedicated XML serialization library that supports this, or to write the output code yourself.

Another aside: Rust doesn't have classes. Trying to map Rust concepts to your existing object-oriented thinking can lead to a lot of unnecessary problems.

Thanks!! I have also searched a lot and could not find anything that supports my scenario, so I have already started implementing my own "write_to_xml" following this pattern:


fn write_cdata<W: std::io::Write>(writer: &mut Writer<W>, key: &str, value: &CData) {
    let elem = BytesStart::new(key);
    writer.write_event(Event::Start(elem.borrow())).unwrap();
    // Note: Event::Text escapes its content, so the CDATA markers built here are escaped too.
    writer.write_event(Event::Text(BytesText::new(format!("<![CDATA[{}]]>", &value.0).as_str()))).unwrap();
    writer.write_event(Event::End(BytesStart::new(key).to_end().borrow())).unwrap();
}

fn write_option_cdata<W: std::io::Write>(writer: &mut Writer<W>, key: &str, value: &Option<CData>) {
    if let Some(cdata) = value {
        write_cdata(writer, key, cdata);
    }
}

fn write_text<W: std::io::Write>(writer: &mut Writer<W>, key: &str, value: &str) {
    let elem = BytesStart::new(key);
    writer.write_event(Event::Start(elem.borrow())).unwrap();
    writer.write_event(Event::Text(BytesText::new(value))).unwrap();
    writer.write_event(Event::End(BytesStart::new(key).to_end().borrow())).unwrap();
}

#[derive(Serialize, Deserialize, Debug)]
pub struct XmlCountry {
    #[serde(rename = "ShortName")]
    pub short_name: String,
    #[serde(rename = "Name")]
    pub full_name: String,
    #[serde(rename = "Comment", default)]
    pub comment: String,
    #[serde(rename = "PrivateComment", default)]
    pub private_comment: Option<String>,
    #[serde(rename = "Order", default)]
    pub order: i32,
    #[serde(skip_serializing, skip_deserializing)]
    _extra: Option<serde_json::Value>,  // To ignore any extra elements
}

impl XmlCountry {
    pub fn write_to_xml_writer<W: std::io::Write>(&self, writer: &mut Writer<W>) {
        writer.write_event(Event::Start(BytesStart::new("COUNTRY"))).unwrap();

        write_text(writer, "ShortName", &self.short_name);
        write_text(writer, "Name", &self.full_name);
        write_cdata(writer, "Comment", &self.comment);
        write_option_cdata(writer, "PrivateComment", &self.private_comment);
        write_text(writer, "Order", &self.order.to_string());

        writer.write_event(Event::End(BytesStart::new("COUNTRY").to_end().borrow())).unwrap();
    }
}

Note that this is just a snippet. My real structs are much bigger and contain multiple sub-structs, each of which calls its children's write_to_xml_writer in turn.

I will let you know how this works out.

Thanks again for confirming what I already assumed!

PS: in my case, CDATA is necessary because I want to pass the XML to an existing compiled executable that expects it this way, and I cannot change the executable itself...

Have you already notified all the stakeholders that the “existing executable” is broken, doesn't support the XML standard, and that you are working on a workaround?

If a program expects certain parts of the XML to be passed as CDATA, then that program doesn't accept XML but something that resembles XML while following different rules.

Keeping that explanation and reasoning in code is very important to “contain the damage”.

That's the approach used by the Linux kernel, and it works extremely well in practice. When hardware is broken, one can't just say so and do nothing: a kernel that only works with unbroken hardware is useless in practice, because every existing piece of hardware is broken in one way or another. Yet it's very important to clearly demarcate the parts that are workarounds for broken hardware, to ensure they don't contaminate other parts of the kernel, or else Hyrum's Law will do very unpleasant things to you.

From the point of view of a conformant XML application, the fragments

<Comment><![CDATA[myvalue]]></Comment>

and

<Comment>myvalue</Comment>

are identical. They should produce the same events in an event-driven processor, and the same DOM in a DOM-driven processor, and the application should not be able to tell the two documents apart once they have been parsed.

The purpose of explicit <![CDATA[…]]> markup is to delimit character data when that data may contain things that look like markup: classically, things like scripts, which may include <, >, &, and other characters with significance to an XML processor's parsing stage. This avoids the need to individually escape those characters with XML entities (&lt;, &gt;, &amp;, and so on, or their numeric equivalents), but does not change the meaning of the document or the structure it encodes.
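As a rough illustration of that equivalence, the sketch below encodes the same string both ways and decodes each back to the same text. All three function names are hypothetical, and the decoder is deliberately simplified: it assumes the element content is either one complete CDATA section or entity-escaped text using only &lt;, &gt; and &amp;, which is far less than a real XML parser handles.

```rust
/// Encodes text as ordinary character data using entity escapes.
fn encode_escaped(raw: &str) -> String {
    // '&' must be replaced first so the entities added afterwards stay intact.
    raw.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

/// Encodes text as a single CDATA section.
/// Assumption: `raw` does not contain the terminator "]]>".
fn encode_cdata(raw: &str) -> String {
    format!("<![CDATA[{}]]>", raw)
}

/// Simplified decoder: recovers the text from either encoding.
fn decode_content(content: &str) -> String {
    let cdata_body = content
        .strip_prefix("<![CDATA[")
        .and_then(|rest| rest.strip_suffix("]]>"));
    match cdata_body {
        // CDATA content is taken literally.
        Some(inner) => inner.to_string(),
        // Otherwise undo the entity escapes; "&amp;" goes last.
        None => content
            .replace("&lt;", "<")
            .replace("&gt;", ">")
            .replace("&amp;", "&"),
    }
}
```

The two encodings differ byte for byte, yet decode_content returns the same string for both, which is all a conformant consumer is supposed to see.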

If you need to control serialization to the level of ensuring that a given character data sequence is always wrapped in <![CDATA[…]]>, then you'd almost certainly need to work below the level of a serde serializer. High-level libraries generally expect that the processor receiving the document will make similar assumptions to those laid out above, and do not provide the level of control you want.

Unfortunately, though I did do some brief digging on crates.io, I was unable to find a library that appeared likely to meet your specific needs. Is it practical for you to fix the program that receives this document, so that it does not differentiate between character data and character data wrapped in a <![CDATA[…]]> in this way?

FYI, quick-xml already does something like this to disambiguate between children, attributes and text content. In particular, #[serde(rename = "@foo")] identifies the attribute foo, and likewise $text/$value identify the text content. It would not be surprising if something like #cdata identified text content as CDATA, but unfortunately this is not implemented. It doesn't seem too hard, so @HumanWannabe, if you can, you could try implementing it.

Thank you everyone for your suggestions!

In the end, I worked around the problem by creating a (very) custom XML writer that writes directly into a String that is then written into a file. This was because even at the most basic level (e.g. writer.write_event(Event::Text(BytesText::new(value))).unwrap();), using quick_xml would always result in encoding "<" and ">" (as @derspiny rightly pointed out). So in the end, my solution was in the direction of:

fn write_cdata_element(writer: &mut String, key: &str, value: &CData) {
    writeln!(writer, "    <{}><![CDATA[{}]]></{}>", key, value.0, key).unwrap();
}

fn write_option_cdata_element(writer: &mut String, key: &str, value: &Option<CData>) {
    if let Some(cdata) = value {
        write_cdata_element(writer, key, cdata);
    }
}

fn write_element(writer: &mut String, key: &str, value: &str) {
    writeln!(writer, "    <{}>{}</{}>", key, value, key).unwrap();
}

(yes... even the indentation is hard-coded...)
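One caveat with hand-written output like this: a value containing "]]>" would end the CDATA section early, and a plain element containing < or & would produce malformed XML. Below is a hedged sketch of a slightly more defensive variant. It assumes values are passed as plain &str rather than through the CData wrapper, and the helper names (cdata_section, escape_plain, write_plain_element) are made up for illustration.

```rust
use std::fmt::Write;

/// Wraps `raw` in CDATA, splitting any "]]>" across two sections
/// so that it cannot terminate the section early.
fn cdata_section(raw: &str) -> String {
    format!("<![CDATA[{}]]>", raw.replace("]]>", "]]]]><![CDATA[>"))
}

/// Escapes the characters that are significant in ordinary character data.
fn escape_plain(raw: &str) -> String {
    raw.replace('&', "&amp;")
        .replace('<', "&lt;")
        .replace('>', "&gt;")
}

/// Writes `<key><![CDATA[value]]></key>` with the hard-coded indentation.
fn write_cdata_element(out: &mut String, key: &str, value: &str) -> std::fmt::Result {
    writeln!(out, "    <{}>{}</{}>", key, cdata_section(value), key)
}

/// Writes `<key>value</key>` with the value entity-escaped.
fn write_plain_element(out: &mut String, key: &str, value: &str) -> std::fmt::Result {
    writeln!(out, "    <{}>{}</{}>", key, escape_plain(value), key)
}
```

Returning std::fmt::Result instead of calling unwrap() lets the caller decide how to handle a failed write. Whether the legacy receiver copes with a value split across two CDATA sections is something that would need to be checked against that executable.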

As I said before, changing the receiver executable is not straightforward, as it is legacy software that is already distributed to specific parties. In the longer term, I will try to push for updating it, but for now I had to find a quick solution around the problem. I clearly take the point of @khimru that this should be marked as a workaround, and luckily it will not further contaminate my code, as it does not affect my other structures and the rest of the software.
If I find the time, I will also look into implementing a workaround in quick_xml, following the suggestion of SkiFire13 in the direction of jb-alvarado in the post you shared (i.e. allowing unencoded text).

Once again, thank you everyone for helping me out with this. Even if I had to devise a custom solution, your comments helped me a lot and saved me precious time.

If you are using quick-xml: Event::Text is for normal text (with escaping), while Event::CData is for writing CDATA sections.