How to create a safe space-efficient enum/union?

Background

I am trying to parse an external data format into a data structure with the help of Serde with the idea of editing the file. At the moment, I am exploring different ways of optimizing memory usage and performance.

Currently, the data structure I'm deserializing is a Box<[&StringSlice]> which boils down to a Box<[&[u8]]>. The deserialized data structure looks like this:

["EU4txt", "date", "=", "1383.11.21", ..., "savegame_version", "=", "{", "first", "=", "1", ..., "}", ...]

While this representation of the data is optimal for space and performance, it is completely flat. That makes it very hard to search for specific data, that I wish to manipulate.

On the other hand, if I'm trying to deserialize the data format to a more complex data structure with nested boxed slices, that is very easy to search for specific data, the memory usage and performance becomes terrible.

After a bit of brainstorming I came up with a simple data structure, that is not flat, but also shouldn't take up more memory than what I currently have:

enum Node<'a> {
    Data(&'a StringSlice),
    List(NodeReference),
}

struct StringSlice([u8]);

struct NodeReference {
    index: usize,
    length: usize,
}

// Previously: Box<[&StringSlice]>
// Result: ["EU4txt", "date", "=", "1383.11.21", ..., "savegame_version", "=", "{", "first", "=", "1", ..., "}", ...]

let parsed: Box<[Box<[Node]>]> = /* deserialize */;
// Result: [[Data("EU4txt"), Data("date"), Data("="), Data("1383.11.21"), ..., Data("savegame_version"), Data("="), List(0, _), ...], [Data("first"), Data("="), Data("1"), ...]]

This is the data structure I wish to use, but I have a huge problem with its size compared to the Box<[&StringSlice]> I had before. &StringSlice only takes up 16 bytes on my computer, while Node takes up 24 bytes. That may not sound like much, but when putting it in relation, this increases the size of the parsed data by 50%!

My goal is to reduce the size of Node to 16 bytes. What I have in mind is having a 1 bit field, that tags the union as representing either Data or List and limit the size of the length of both the borrowed slice and the NodeReference to 63 bit or put in another way u(size - 1), because I never have to deal with such extreme lengths. That should allow me to achieve my goal.

How can I do that? Any tips are welcome; it doesn't have to be a complete solution. I'm just a bit lost when it comes to transferring this idea into actual code. I've never had to deal with this kinda problem before.

Regarding your StringSlice type, the only way I know of constructing it is by transmuting a slice, so that's what I did here, but I'm not sure why you're using it? Anyway you can do something like this:

pub struct Node<'a> {
    field_a: usize,
    field_b: usize,
    phantom: PhantomData<&'a [u8]>,
}
#[derive(Debug)]
pub enum NodeEnum<'a> {
    Data(&'a StringSlice),
    List(NodeReference),
}

const FLAG: usize = isize::min_value() as usize;

impl<'a> Node<'a> {
    pub fn from_slice(slice: &'a StringSlice) -> Self {
        assert!((slice.0.len() & FLAG) == 0);
        Node {
            field_a: slice.0.as_ptr() as usize,
            field_b: slice.0.len() | FLAG,
            phantom: PhantomData,
        }
    }
    pub fn from_node_ref(node_ref: NodeReference) -> Node<'static> {
        assert!((node_ref.length & FLAG) == 0);
        Node {
            field_a: node_ref.index,
            field_b: node_ref.length,
            phantom: PhantomData,
        }
    }
    pub fn as_enum(&self) -> NodeEnum<'a> {
        if (self.field_b & FLAG) == 0 {
            NodeEnum::List(NodeReference {
                index: self.field_a,
                length: self.field_b,
            })
        } else {
            unsafe {
                let slice: &[u8] = std::slice::from_raw_parts(
                    self.field_a as *const u8,
                    self.field_b ^ FLAG
                );
                NodeEnum::Data(StringSlice::new(slice))
            }
        }
    }
}

playground

Note how we have to keep track of the lifetime on the slice manually using a PhantomData.

1 Like

Thank you! :heart:

That's exactly what I need. The trick for how you created the FLAG is genius. I didn't know that as usize would convert an isize with all 1s to a usize with only a single 1 on the most-left side.

From str description:

String slices are always valid UTF-8.

My StringSlice type, on the other hand, is always valid ISO 8859-1.

I think you need to look up how the smallest possible isize is represented :slight_smile: hint: the conversion I did does not change the bits

The isize with all ones is -1.

1 Like

Good to know. I'm glad I went to the forum. I'm always learning something new. :smile:

Btw how are you constructing the StringSlice type in your own code? If you're using the transmute like I do in the playground I posted earlier, I just wanted to make sure you remembered to put #[repr(transparent)] on it, because you don't have it in your original code sample.

#[repr(transparent)]
pub struct StringSlice([u8]);

impl ::core::fmt::Debug for StringSlice {
    #[inline]
    fn fmt(
        &self,
        formatter: &mut ::core::fmt::Formatter<'_>,
    ) -> ::core::result::Result<(), ::core::fmt::Error> {
        ::core::fmt::Debug::fmt(
            &::encoding::types::Encoding::decode(
                ::encoding::all::ISO_8859_1,
                &self.0,
                ::encoding::DecoderTrap::Strict,
            )
            .unwrap_or_else(|_| unsafe { ::core::hint::unreachable_unchecked() }),
            formatter,
        )
    }
}

impl<'a> ::core::convert::From<&'a [u8]> for &'a StringSlice {
    #[inline(always)]
    fn from(from: &'a [u8]) -> Self {
        #[repr(C)]
        union Slices<'a> {
            byte_slice: &'a [u8],
            string_slice: &'a StringSlice,
        }

        unsafe { Slices { byte_slice: from }.string_slice }
    }
}

impl<'de: 'a, 'a> ::serde::de::Deserialize<'de> for &'a StringSlice {
    #[inline]
    fn deserialize<D>(deserializer: D) -> ::core::result::Result<Self, D::Error>
    where
        D: ::serde::de::Deserializer<'de>,
    {
        struct StringSliceVisitor;

        impl<'de> ::serde::de::Visitor<'de> for StringSliceVisitor {
            type Value = &'de StringSlice;

            fn expecting(&self, formatter: &mut ::core::fmt::Formatter) -> ::core::fmt::Result {
                formatter.write_str("a borrowed ISO 8859-1 string")
            }

            #[inline]
            fn visit_borrowed_bytes<E>(
                self,
                value: &'de [u8],
            ) -> ::core::result::Result<Self::Value, E>
            where
                E: ::serde::de::Error,
            {
                ::core::result::Result::Ok(::core::convert::Into::into(value))
            }
        }

        deserializer.deserialize_bytes(StringSliceVisitor)
    }
}
1 Like

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.