Serde json Deserializer stream

Hi, I posted this on Stack Overflow but got no replies. I foolishly posted on a weekend :slight_smile:

I have some JSON like the example below. The sites field contains a few thousand entries and the file is around 4 MB. If I read it with serde_json::from_str, my remaining 5 or 6 GB of memory is quickly consumed and the app crashes.

I see serde_json has Deserializer::from_str(&contents).into_iter::<T>().

How can I use this to stream just the sites field of the top-level object, since that is where all the repeating objects are?

{
 "debug": {
  "conn": {
   "database": "bmos",
   "host": "carneab4.memset.net",
   "password": "cms823nvc",
   "port": 5432,
   "user": "bmos"
  },
  "interval": 60,
  "name": "debug",
  "sites": {
   "house1": {
    "enable": true,
    "equip": [
     {
      "enable": true,
      "interval": 45,
      "ip": "127.0.0.1",
      "name": "gateway",
      "points": [
       {
        "auto_convert": true,
        "data_type": "float",
        "enable": true,
        "name": "point1",
        "order": "1234",
        "reg_type": "input",
        "register": 22,
        "scale": 1.0,
        "uid": 1,
        "units": "watts"
       },
       {
        "auto_convert": true,
        "data_type": "float",
        "enable": true,
        "name": "point2",
        "order": "1234",
        "reg_type": "input",
        "register": 22,
        "scale": 1.0,
        "uid": 2,
        "units": "watts"
       },
       .......
      
      ],
      "port": 502
     }
    ],
    "name": "house1",
    "plot": "1"
   },
   .....

   
  }
 }
}

Ideas I have tried are

1. I tried adding #[serde(deserialize_with = "site_stream_deserialize")] to the sites field and using that method to do a streaming read, but I don't know whether such a method has access to the original byte stream.

2. I tried to implement my own Deserialize. However, sites is still expected to be an object and can't just be read as a byte stream or a str.

impl<'de> Deserialize<'de> for Campus {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
        where D: serde::Deserializer<'de>
    {
        #[derive(Deserialize)]
        struct Outer {
            pub name: String,
            pub interval: i32,
            pub conn: DBConnection,
            pub sites: String,
        }

        let helper = Outer::deserialize(deserializer)?;
        let stream = Deserializer::from_str(&helper.sites).into_iter::<Site>();
        let mut sites: HashMap<String, Site> = HashMap::new();

        for site in stream {
            // Propagate parse errors instead of panicking on malformed input.
            let s = site.map_err(serde::de::Error::custom)?;
            sites.insert(s.name.clone(), s);
        }

        Ok(Campus {
            name: helper.name,
            interval: helper.interval,
            conn: helper.conn,
            sites,
        })
    }
}

I am not sure I can read this JSON as a stream, as the examples I have found all involve top-level repeating entities. Any advice would be great.

I guess I could try preprocessing the file to leave just the sites object,

or I could write a nom parser,

but 1 seems hacky and 2 is more work and possibly fragile.

Thanks

I don't think you'll be able to "access the original byte stream" for the sites sub-object. (It seems you wanted to do this so you could use Deserializer::into_iter().)

Check out the example at https://serde.rs/stream-array.html if you haven't already seen it. Looks like you could just put your stream-processing logic in the visit_seq function; basically, something like:

while let Some(site) = seq.next_element::<Site>()? { … }

(Oh, I've just noticed that sites is not an array but an object, so the visitor will look different. Edit: the https://serde.rs/deserialize-map.html and perhaps https://serde.rs/ignored-any.html examples should also be helpful for your case.)


The file is really small; it shouldn't consume that much memory. What are you deserializing it into? Perhaps the simpler solution would be to optimize the structs?

1 Like

Thanks, I will look at the links you sent.

The structures look fairly simple (below),
so the memory use did seem excessive.

#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct DBConnection {
    pub user: String,
    pub password: String,
    pub host: String,
    pub port: u16,
    pub database: String,
}


#[derive(Clone, Serialize, Deserialize)]
pub struct Campus {
    pub name: String,
    pub interval: i32,
    pub conn: DBConnection,
    pub sites: HashMap<String, Site>,
}

#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct Site {
    pub name: String,
    pub enable: bool,
    pub plot: String,
    pub equip: Vec<Equipment>,

    #[serde(skip)]
    pub campus: Option<Box<Campus>>,
}

#[derive(Clone, Serialize, Deserialize)]
pub struct Equipment {
    pub name: String,
    pub enable: bool,
    pub ip: String,
    pub port: u16,
    pub points: Vec<Point>,
    pub interval: u32,

    #[serde(skip)]
    pub site: Option<Box<Site>>,
}

#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct Point {
    pub data_type: String,
    pub uid: u8,
    pub enable: bool,
    pub auto_convert: bool,
    pub name: String,
    pub order: String,
    pub reg_type: String,
    pub register: u16,
    pub units: String,
    pub scale: f64,

    #[serde(skip)]
    pub equip: Option<Box<Equipment>>,
}

That's certainly a lot of Strings, which may contribute to the memory usage: on the default allocator a short String costs about 56 bytes (at least that seems to be the case on 64-bit Linux: 24 bytes inline, plus 16 bytes of malloc overhead and a 16-byte minimum allocation).

Looks like some of them could be changed to enums, such as units, order, and data_type. There are also tricks like Cow<str> (which requires reading the whole file into a string beforehand, so there's something to borrow from) or crates like https://lib.rs/crates/smol_str.

Anyway, while these strings may increase memory usage, from a 4 MB JSON file I'd expect something like 20–30 MB, not 6 GB :smiley:, so something here feels off...

That does seem to be the issue, so I decided to work out why.

I reduced the JSON a lot, until my system could just handle it, then ran it
under valgrind with massif, which produced the following file:

https://www.dropbox.com/s/hk1l5u8v8ajj1jp/massif.out.2061274?dl=0

It wasn't a problem with serde.

After reading the JSON, I run the following code:

for (_, campus) in m.iter_mut() {
    let campus_clone = campus.clone();

    for (_, site) in &mut campus.sites {
        site.campus = Some(Box::new(campus_clone.clone()));
        let site_clone = site.clone();

        for equip in &mut site.equip {
            equip.site = Some(Box::new(site_clone.clone()));
            assert!(equip.site.is_some());
            let equip_clone = equip.clone();

            for point in &mut equip.points {
                point.equip = Some(Box::new(equip_clone.clone()));
            }
        }
    }
}

This does silly cloning of sites and campuses so it can store a pointer in each equip -> site and each site -> campus. Worse, the clones compound: each site_clone already contains its boxed campus copy, and each equip_clone contains its boxed site copy, so every point ends up owning a full deep copy of the campus. I was careless with clone, probably because I got frustrated with the borrow checker here. I need to find a better way to do this section without cloning.

I wanted to ask where all those clone calls in the massif profile come from, but you've already answered :slight_smile:

Looks like you want access to the "parent" at each level (not the full parent, but just a "metadata" part of it, if I'm understanding correctly?). This looks like the perfect case for Rc or Arc. Basically, something like:

-let campus_clone = campus.clone();
+let campus_clone = Rc::new(campus.clone());

Now, cloning the clone should be cheap (as it's just a refcount bump).

Alternatively, you could just pass a reference when calling functions, instead of storing it in the struct. Not sure how feasible that is in your design, though.
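A minimal stdlib-only sketch of the Rc approach — the types here are simplified stand-ins for your Campus/Site structs:

```rust
use std::rc::Rc;

// Simplified stand-ins for the thread's Campus/Site.
#[derive(Clone)]
struct Campus {
    name: String,
}

struct Site {
    name: String,
    // All sites share one Rc'd copy of the campus metadata.
    campus: Option<Rc<Campus>>,
}

fn link(campus: &Campus, sites: &mut [Site]) {
    // One deep clone of the campus, then a cheap refcount bump per site.
    let shared = Rc::new(campus.clone());
    for site in sites {
        site.campus = Some(Rc::clone(&shared));
    }
}
```

Note that Rc gives a shared immutable handle; if the parent must also own the children, you'd reach for Weak back-references instead, to avoid a reference cycle.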

1 Like