Serde serialize to not supported type

Hi all,

I'm implementing a serde::Serializer for a binary format that does not distinguish between signed and unsigned integers. What should the functions serialize_i8, serialize_i16, serialize_i32 and serialize_i64 do?

One option is to just cast it to the unsigned variant and serialize that, then have the deserializer cast it back if asked for a signed type.
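
For example, a minimal sketch of that option (assuming serialize_u32 and a Result<()> alias as in the serde tutorial):

// Sketch: reinterpret the signed value's bits as unsigned and delegate.
fn serialize_i32(self, v: i32) -> Result<()> {
    // `as u32` keeps the bit pattern; a deserializer could cast back with `as i32`
    // when a signed type is requested
    self.serialize_u32(v as u32)
}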

Ty! I'm always enchanted by the responsiveness of this community.

The fact is that trying to serialize a signed int to this format is always incorrect, so I would like the compiler to give me an error if I try to do it. Is that possible?

I read the Serde documentation again, but I think I'm not understanding something fundamental.
For example, if my binary format distinguishes between 1-, 2-, 3- and 4-byte integers, how can the Serializer know how to serialize 0_u32?

Typically serde knows which kind of integer a variable is, which you can use to pick between 1, 2 and 4 bytes. You are not going to be able to pick the 3 byte format.

OK, maybe Serde is not the right tool. What I'm trying to do is implement a binary network protocol. Besides several messages and responses, this protocol also defines a custom binary format, with basic types like integers, sequences, etc.

When I implement serde::Serializer for my Serializer I have to implement the function serialize_u32, so, following the serde tutorial, I defined Serializer as:

pub struct Serializer {
    output: Vec<u8>
}

Then at first I did:

// Implementing serde::Serializer for Serializer
fn serialize_u32(self, v: u32) -> Result<()> {
    const MAX_16: u32 = u16::MAX as u32;   // 2^16 - 1
    const MAX_24: u32 = (1 << 24) - 1;     // 2^24 - 1
    let bytes = v.to_le_bytes();
    match v {
        0..=MAX_16 => {
            self.output.push(bytes[0]);
            self.output.push(bytes[1]);
        }
        _ if v <= MAX_24 => {
            self.output.push(bytes[0]);
            self.output.push(bytes[1]);
            self.output.push(bytes[2]);
        }
        _ => {
            for b in bytes.iter() {
                self.output.push(*b);
            }
        }
    }
    Ok(())
}

But this is obviously wrong, because the right representation of a number depends on where I'm writing it (which message and which field). So what I'm doing now is not using Serde to serialize to this format, but just using a trait like:

trait IntoSv2 {
    fn into_sv2<T: std::io::Write>(&self, writer: &mut T) -> std::result::Result<(), Error>;
}

Then, for each type defined in Sv2 (e.g. U8, U16, U24, ...), I define a newtype (e.g. pub struct Sv2u8(u8);) that implements From in both directions (inner -> newtype and newtype -> inner) as well as IntoSv2. That way I can easily define new structs containing my basic types and implement IntoSv2 for them. These structs containing my basic types are the messages of the binary protocol I'm implementing.
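
For example, a minimal sketch of that pattern (Error here is my crate's error type, assumed to implement From<std::io::Error>):

pub struct Sv2u8(u8);

impl From<u8> for Sv2u8 {
    fn from(v: u8) -> Self {
        Sv2u8(v)
    }
}

impl From<Sv2u8> for u8 {
    fn from(v: Sv2u8) -> Self {
        v.0
    }
}

impl IntoSv2 for Sv2u8 {
    fn into_sv2<T: std::io::Write>(&self, writer: &mut T) -> std::result::Result<(), Error> {
        // a U8 is written as a single byte; `?` relies on Error: From<std::io::Error>
        writer.write_all(&[self.0])?;
        Ok(())
    }
}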

So, do you think that Serde is actually the wrong tool for something like this, or am I just not understanding how to use it?

TY

This has nothing to do with Serde, really. Serde isn't magic; it can't guess what format you are defining. In fact, the whole point of Serde is to decouple the serialization format from the serialized types.

Thus, if you are defining a serialization format, you have to make sure the data format unambiguously describes itself. If you want to distinguish between integers of different size, you'll have to come up with a way to be able to do that. For example, encode the length in the most significant couple of bits of the first byte.
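
For example, a sketch of that idea with a hypothetical helper (not tied to serde at all), packing the payload length into the top two bits of a leading tag byte:

// Hypothetical helper: write `value` as a tag byte plus 1-4 little-endian payload bytes.
// The number of payload bytes (minus one) is stored in the two most significant bits
// of the tag byte, so a reader can tell how many bytes to consume.
fn write_tagged_u32(out: &mut Vec<u8>, value: u32) {
    let bytes = value.to_le_bytes();
    let len: usize = if value <= 0xFF {
        1
    } else if value <= 0xFFFF {
        2
    } else if value <= 0xFF_FFFF {
        3
    } else {
        4
    };
    out.push(((len as u8) - 1) << 6);
    out.extend_from_slice(&bytes[..len]);
}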

For example, if the serialization format is:

utf-8 strings
a numbers -> A[0-9]*
b numbers -> B[0-9]*
c numbers -> C[0-9]*
d numbers -> D[0-9]*
array     -> S[base64 encoded bytes]

When I implement a Serializer for the above serialization format, I have to map the serde data model to that format, and the Rust types are automatically mapped to the serde data model.
So for convenience I could do:

SERDE -> MY FORMAT
u8  -> a numbers
u16 -> b numbers
u32 -> c numbers

Are you telling me that in my serialization library I should define a Rust newtype and map it to d numbers? For example:

pub struct DNumber(u16);

impl Serialize for DNumber {
    fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
    where
        S: ser::Serializer,
    {
        let v_bytes = self.0.to_ne_bytes();
        // use the first byte as a control byte: 0 tells the Serializer to
        // serialize the following bytes as a d number
        let control_byte: u8 = 0;
        serializer.serialize_bytes(&[control_byte, v_bytes[0], v_bytes[1]])
    }
}

pub struct Array(Vec<u8>);

impl Serialize for Array {
    fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
    where
        S: ser::Serializer,
    {
        // `serialize` takes `&self`, so build a new buffer instead of mutating `self.0` in place
        let control_byte: u8 = 1;
        let mut bytes = Vec::with_capacity(self.0.len() + 1);
        bytes.push(control_byte);
        bytes.extend_from_slice(&self.0);
        serializer.serialize_bytes(&bytes)
    }
}


// Implementing serde::Serializer for Serializer
fn serialize_bytes(self, v: &[u8]) -> Result<()> {
    match v[0] {
        0 => {
            let v = u16::from_ne_bytes([v[1], v[2]]);
            self.output += "D";
            self.output += &v.to_string();
        }
        1 => {
            todo!();
        }
        _ => panic!(),
    }
    Ok(())
}

// Then I use it 
#[derive(Serialize)]
pub struct MessageA {
    field1: u8,
    field2: u16,
    field3: u32,
    field4: DNumber,
    field5: Array,
}

What should I do with the serde data model types that do not map to anything in my serialization format, e.g. bool, floats, signed ints, etc. in the example above?

No, that is usually not necessary. I'm not sure why you would want to do that. What I'm telling you is, when a type calls the various methods of your Serializer for serializing primitive types, e.g. MySerializer::serialize_u8() or MySerializer::serialize_u16(), you should encode the primitive values in such a way that you can tell what type to emit back when deserializing. For example:

impl Serializer for &'_ mut MySerializer {
    fn serialize_u8(self, value: u8) -> Result<Self::Ok, Self::Error> {
        self.buffer.extend(&[b'A', value]);
        Ok(())
    }

    fn serialize_u16(self, value: u16) -> Result<Self::Ok, Self::Error> {
        self.buffer.extend(&[b'B', (value & 0xff) as u8, (value >> 8) as u8]);
        Ok(())
    }
}

impl<'de> Deserializer<'de> for &'_ mut MyDeserializer {
    fn deserialize_u8<V: Visitor<'de>>(self, visitor: V) -> Result<V::Value, Self::Error> {
        if self.buffer[self.position] == b'A' {
            let value = visitor.visit_u8(self.buffer[self.position + 1])?;
            self.position += 2;
            Ok(value)
        } else {
            Err(Error::custom("expected u8"))
        }
    }

    fn deserialize_u16<V: Visitor<'de>>(self, visitor: V) -> Result<V::Value, Self::Error> {
        if self.buffer[self.position] == b'B' {
            let num = self.buffer[self.position + 1] as u16 | (self.buffer[self.position + 2] as u16) << 8;
            let value = visitor.visit_u16(num)?;
            self.position += 3;
            Ok(value)
        } else {
            Err(Error::custom("expected u16"))
        }
    }
}

Just return Err(…) from the serialize_XXX() methods in order to signal an unsupported data type. Note though that it is quite unusual for a serialization format to not support e.g. booleans or signed integers.
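
For example, a sketch (assuming an Error type that implements serde::ser::Error, like the Error::custom used above); note that a compile-time error is not possible here, because the Serializer trait requires every serialize_* method to be implemented:

fn serialize_i32(self, _v: i32) -> Result<Self::Ok, Self::Error> {
    // this format has no signed integers, so reject them at run time
    Err(Error::custom("signed integers are not supported by this format"))
}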

Sorry, I have been too generic. The fact is that the serialization format defines several kinds of numbers: 1-, 2-, 3- and 4-byte. I can map u8 to the first, u16 to the second, and u32 to the last. What should I do for the 3-byte number? Also, I'm not interested in parsing the format, just serializing to it.

So it looks like the problem is not with serialization, but rather deserialization.

I guess there are two options:

Self-describing

The format itself knows the types of the data, which requires some headers. So let's say you have 1-, 2-, 3- and 4-byte numbers in a row:

      payload
 _ __ ___ ____
AaBbbCcccDdddd
^ ^  ^   ^
  headers

In this case, the deserializer can deserialize into the following types:

  • Aa can deserialize into u8, u16,… usize (as the u8 can be safely upcasted),
  • Bbb: u16, u32,…
  • Cccc: u32,… (a 3-byte number won't fit in u16)
  • Ddddd: u32,…

This is, by the way, the default mode for serde – the deserializer calls the visitor's methods like visit_u32, etc.

"no-headers"

  payload
__________
abbcccdddd

In this case, the caller needs to know the data format up-front, to consume the correct number of bytes.

  • a – u8, i8 (note the i8: if the caller knows the exact data definition, they can safely assume the i8 is stored in the u8 slot)
  • bb – u16, i16
  • ccc – no std type available; an attempt to use u32 would read the cccd bytes, which would result in garbage. You can perhaps create some wrapper struct U24(pub u32), which deserializes a u16 and then a u8 and combines them together (see the sketch after this list)
  • dddd – u32, i32.
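
A minimal sketch of such a wrapper (assuming the format stores the low 16 bits first and the high 8 bits after; adjust if the byte order differs):

use serde::{Deserialize, Deserializer};

pub struct U24(pub u32);

impl<'de> Deserialize<'de> for U24 {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
        D: Deserializer<'de>,
    {
        // read a (u16, u8) pair and combine the two parts into a single u32
        let (lo, hi) = <(u16, u8)>::deserialize(deserializer)?;
        Ok(U24(u32::from(lo) | (u32::from(hi) << 16)))
    }
}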

This is, I think, a little bit unconventional for serde, but I believe at least bincode works that way (?) – fortunately it looks like there is a thing called "type hints" (I never used them so I might be wrong here). I guess this works a little bit like that – the visitor declares the expected value (I guess this type declaration is what's called the type hint?), e.g.:

type Value = i32;

The deserializer is allowed to ignore this hint, e.g.:

The JSON Deserializer will call visit_i64 for any signed integer and visit_u64 for any unsigned integer, even if hinted a different type.

But in this case of a non-self-describing format, the deserializer should trust the type hint: given i32, it should always read 4 bytes and call the visit_i32 function.
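
For example, a sketch reusing the hypothetical MyDeserializer (with buffer and position) from the earlier example, assuming a little-endian encoding:

fn deserialize_i32<V: Visitor<'de>>(self, visitor: V) -> Result<V::Value, Self::Error> {
    // trust the hint: always consume exactly 4 bytes and hand back an i32
    if self.buffer.len() < self.position + 4 {
        return Err(Error::custom("unexpected end of input"));
    }
    let mut bytes = [0u8; 4];
    bytes.copy_from_slice(&self.buffer[self.position..self.position + 4]);
    self.position += 4;
    visitor.visit_i32(i32::from_le_bytes(bytes))
}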

The format is not self-describing, so I'm in the "no-headers" case. I'm not interested in deserializing it. What I'm trying to understand is the right way to handle these numbers in serde (and whether serde is the right tool). The 3-byte numbers defined by the serialization format have no equivalent in the serde data model, which is why I think I have to define a newtype.

pub struct ThreeBytesNumber(u32);

u32 is mapped by serde to 4 bytes.
ThreeBytesNumber(u32) is mapped by serde to 3 bytes.

Then I can serialize things like:

#[derive(Serialize)]
pub struct MessageA {
    // when serialized, serde writes 4 bytes into the buffer
    field1: u32,
    // when serialized, serde writes 3 bytes into the buffer
    field2: ThreeBytesNumber,
}

u32 is mapped to whatever you define your serializer to map it to, like any built-in number type.

Yep. Do you think that if I have to serialize into a format that has 3-byte and 4-byte numbers, the right approach is to (1) use serde, (2) serialize u32 to the 4-byte number, (3) define something like ThreeBytesNumber(u32), and (4) tell serde to serialize ThreeBytesNumber(u32) to the 3-byte number? In that case, how should I tell serde to serialize it to the 3-byte number? Does an approach like the one used here (Serde serialize to not supported type - #8 by Fi3) make any sense to you?

It depends on how you make 4-byte and 3-byte numbers different in your source data. This might be as simple as checking the value of the u32 and writing either three or four bytes to the output, for example.

This is not an option in my case, because the 4-byte range encloses the 3-byte range in my source data:
4-byte number: 0 ... 2^32 - 1
3-byte number: 0 ... 2^24 - 1

So, for example, 0 can be either a 4-byte number or a 3-byte number.

But why would a given number be serialized as either 4 bytes or 3 bytes? What's the difference?

If the difference can be represented with different types in the source data, you can do this outside of the Serializer – just make a newtype wrapper around u32 and implement Serialize for it by hand, converting it into a sequence of three u8s.

So, something like what I did in my message above (Serde serialize to not supported type - #8 by Fi3)?

pub struct ThreeBytesNumber(u32);

impl Serialize for ThreeBytesNumber {
    fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
    where
        S: ser::Serializer,
    {
        let v_bytes = self.0.to_le_bytes();
        // use the first byte as a control byte: 0 tells the Serializer to
        // serialize the following bytes as a 3-byte number
        let control_byte: u8 = 0;
        serializer.serialize_bytes(&[control_byte, v_bytes[0], v_bytes[1]])
    }
}

impl<'a, W: io::Write> ser::Serializer for &'a mut Serializer<W> {

fn serialize_bytes(self, v: &[u8]) -> Result<()> {
    // I use the first byte as a control byte so I can serialize both 3-bytes number and byte arrays.
    match v[0] {
        0 => {
            self.output.push(v[1]);
            self.output.push(v[2]);
            self.output.push(v[3]);
        }
        1 => {
            todo!();
        }
        _ => panic!(),
    }
    Ok(())
}
}

This will always panic, unless someone passes a [u8; 4] (or larger) into your serializer explicitly - in particular, it will always panic for ThreeBytesNumber. And even if you change the last line in serialize to use v_bytes[2], this will still fail if someone passes a [u8; N] with a leading zero (by eating that zero, as well as all elements past the fourth). The thing you need is simply this:


impl Serialize for ThreeBytesNumber {
    fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
    where
        S: ser::Serializer,
    {
        let v_bytes = self.0.to_le_bytes();
        serializer.serialize_bytes(&[v_bytes[0], v_bytes[1], v_bytes[2]])
    }
}

and serialize a byte sequence as, well, a byte sequence.
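
In other words, on the Serializer side this could be as simple as (a sketch, assuming output is a Vec<u8> as in the Serializer definition earlier in the thread):

fn serialize_bytes(self, v: &[u8]) -> Result<()> {
    // no control byte needed: whatever bytes the type hands us go straight to the output
    self.output.extend_from_slice(v);
    Ok(())
}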

@Fi3 sorry I've responded with info on deserialization instead of serialization. I think I must've misunderstood something :upside_down_face:

@Cerberuser (a little bit off-topic here, but I'm curious to ask)
Wondering if it'd be possible to keep this custom 3-byte serialization for ThreeBytesNumber with this serialization format, but still serialize it as a regular number when using JSON or YAML. I believe serialize_bytes in JSON would emit a base64-encoded string?
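
One possibility might be to branch on Serializer::is_human_readable() inside the Serialize impl; text formats like JSON and YAML report true, while a custom binary serializer would have to override it to return false (its default implementation returns true). A sketch:

impl Serialize for ThreeBytesNumber {
    fn serialize<S>(&self, serializer: S) -> std::result::Result<S::Ok, S::Error>
    where
        S: ser::Serializer,
    {
        if serializer.is_human_readable() {
            // JSON/YAML: emit a plain number
            serializer.serialize_u32(self.0)
        } else {
            // binary format: emit exactly three little-endian bytes
            let b = self.0.to_le_bytes();
            serializer.serialize_bytes(&[b[0], b[1], b[2]])
        }
    }
}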
