Serde - memory-efficient parsing strategy for an object that depends on another object

I'm working on a process to parse a JSON file with serde_json. I got stuck trying to have one Visitor call another Visitor, if that is even possible. I would be grateful for any insight on how to continue. Also, if there is a simpler way to do what I am trying to achieve, please let me know.

Here is a simplified version of my data. The complete version contains more properties, but they are not relevant to this problem.
The important part is that,

  • a geometry object has a "boundary",
  • a "boundary" is defined by an array of vertex-indices,
  • a vertex-index is the index of a point in the "vertices" array.

Additionally,

  • a point is an array of two coordinates,
  • a coordinate is stored in the JSON file as an integer and we can obtain the original
    coordinate value by multiplying the coordinate-integer by the scaling factor from the
    "transform" object.
{
    "version": "1",
    "geometries": {
        "id1": {"type": "Polygon", "boundary": [0,1,2,3]},
        "id2": {"type": "Polygon", "boundary": [0,1,2]}
    },
    "transform": {"scale": [0.001,0.001]},
    "vertices": [[1000,1000],[2000,1000],[2000,2000],[2000,1000]]
}

My main goal is to make the parsing as memory efficient as possible, because the JSON files are large, up to a couple of GB.

In order to parse the data, I need to replace the vertex-indices in the boundary array with the actual, scaled points. Thus, what is "boundary": [0,1,2] in the source data needs to become "boundary": [ [1.0,1.0], [2.0,1.0], [2.0,2.0] ] in the target structure. Therefore, I need to make two passes over the data. In the first pass, deserialize the "vertices" and store them. In the second pass, deserialize the "geometries" and parse their "boundary" by using the values from the deserialized vertices.
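
As a minimal sketch of that replacement step, independent of serde: resolve_boundary is a hypothetical helper name, and returning None for an out-of-range index is an assumption about how bad input should be handled.

```rust
type Point = [f64; 2];

/// Replace each vertex index with the scaled point it refers to.
/// Returns None if any index is outside the vertices array.
fn resolve_boundary(
    indices: &[usize],
    vertices: &[[i32; 2]],
    scale: [f64; 2],
) -> Option<Vec<Point>> {
    indices
        .iter()
        .map(|&i| {
            let v = vertices.get(i)?;
            Some([f64::from(v[0]) * scale[0], f64::from(v[1]) * scale[1]])
        })
        .collect()
}

fn main() {
    // Values taken from the sample data above.
    let vertices = [[1000, 1000], [2000, 1000], [2000, 2000], [2000, 1000]];
    let scale = [0.001, 0.001];
    let boundary = resolve_boundary(&[0, 1, 2], &vertices, scale).unwrap();
    println!("{:?}", boundary);
}
```

The function only borrows the vertices, so the same first-pass data can be reused for every geometry without copying.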

Problem:
The really big object in the data is the "geometries". Therefore, I want to do the boundary-parsing while deserializing one geometry object. The alternative would be to deserialize all the "geometries" into some intermediary storage and then loop through that storage and do the boundary-parsing. However, this is exactly what I want to avoid, because it requires keeping all the intermediary geometries in memory.

My strategy:
Parse the data in two passes over the file.

  1. In the first pass, all the properties are read except the "geometries". This is ok, because it is a relatively small amount of data.
    The first pass is simply done with the derive macro and skipping the "geometries" field.
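use serde::Deserialize;
use std::collections::HashMap;

#[derive(Deserialize)]
struct SourceVertices {
    version: String,
    // Not deserialized in the first pass; left at its Default (an empty map).
    #[serde(skip)]
    geometries: HashMap<String, SourceGeometry>,
    transform: Transform,
    vertices: Vec<[i32; 2]>,
}

// DATA holds the JSON document as a &str.
let vertices: SourceVertices = serde_json::from_str(DATA).unwrap();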
  2. In the second pass, skip everything except the "geometries" object. During deserialization, enter the "geometries" object and, when visiting an entry, create a TargetGeometry object directly by doing the boundary-parsing with the data from SourceVertices.vertices.
type Point = [f64; 2];

#[derive(Debug)]
struct TargetGeometry {
    type_geom: String,
    boundary: Vec<Point>,
}

What I did so far:
Loads of profiling. When deserializing into intermediary storage, the peak memory use is around 6.5x the file size. I would like to get to around 3x the file size, because that is what is allocated on the heap when the data is stored in my target structure.

The first pass is pretty straightforward and I have that.

The streaming-deserialize of the "geometries" is also done. For this I implemented a custom Visitor (adapted from Parsing 20MB file using from_reader is slow · Issue #160 · serde-rs/json · GitHub). However, this streaming-deserialize can only deserialize a "geometries" object on its own. It cannot navigate down from the top of the document to reach the "geometries".

I have a custom Visitor for the second pass which skips everything except the "geometries" object.

What I am missing (I think):
I don't know how to connect the Visitor of the streaming-deserialize with the Visitor of the second pass.

Gist: serde_json nested deserialize · GitHub (Rust Playground)

Please let me know if you have some ideas on how I could continue.

In the meantime I figured it out. The deserialization in the second pass can be done with a seed from the first pass by using serde::de::DeserializeSeed.

Here is the updated Gist for future reference: deserialize with seed from a previous pass · GitHub