Processing JSON to construct a query

I'm trying to process a very large JSON file (1 TB) of GraphSON data using Serde to produce a Gremlin query. I know the exact format of the JSON file, but I can't map it directly to Rust structs because I think that would exhaust memory. The deserialized structs would look something like:

struct G {
    vs: Vec<V>,
    es: Vec<E>,
}

struct V {
    id: i32,
    label: String,
}

struct E {
    id: i32,
    label: String,
    from_vid: i32,
    to_vid: i32,
}

My idea was to invoke a callback for each produced V and E to update my global query. I may have found a way of doing this using serde-ignored, calling a callback that updates my state for each ignored struct. But I'm not sure it would work, and I wanted to know if there is another way of doing this.

I think GraphSON produces line-delimited JSON? In that case you could use serde_json::StreamDeserializer to deserialize it line by line instead of trying to deserialize the whole file (assuming a single line isn't a few GB worth of JSON).


Note that if you are using async/await to read it, then you can't use serde_json::StreamDeserializer. Instead, since it is line delimited, just read each line separately with read_until in a loop and pass each line to serde_json::from_slice.

Of course, this also works in non-async code, where StreamDeserializer may or may not be simpler.

I just checked my data and it seems that it's not line-delimited. I think the data is using an old GraphSON format... Example:

{
  "mode": "EXTENDED",
  "vertices": [
    {
      "short": {
        "type": "string",
        "value": "YBR236C"
      },
      "pin": {
        "type": "string",
        "value": "T"
      },
      "oid": {
        "type": "integer",
        "value": 1
      },
      "long": {
        "type": "string",
        "value": "ABD1 mRNA cap methyltransferase"
      },
      "_id": 1,
      "_type": "vertex"
    },
    {
      "short": {
        "type": "string",
        "value": "YOR151C"
      },
      "pin": {
        "type": "string",
        "value": "T"
      },
      "oid": {
        "type": "integer",
        "value": 2
      },
      "long": {
        "type": "string",
        "value": "RPB2 DNA-directed RNA polymerase II,140 kDa chain"
      },
      "_id": 2,
      "_type": "vertex"
    },
    ...
  ],
  "edges": [
    {
      "_id": 14362,
      "_type": "edge",
      "_outV": 2353,
      "_inV": 2354,
      "_label": "U-U"
    },
    {
      "_id": 14363,
      "_type": "edge",
      "_outV": 2354,
      "_inV": 2353,
      "_label": "U-U"
    }
  ]
}

Perhaps side-stepping your question, but have you taken a look at the simd_json crate? It's fast and might help overall, if not with your particular challenge. Last time I looked, you could instantiate only the parts of the JSON you're interested in.

The issue seems the same. I'm interested in processing the whole JSON file (i.e. getting an intermediate deserialized struct from the JSON, using it to update a state (a query), then reading the next struct, and so on). Line-by-line processing may work, but it requires that I rewrite my dataset to put each object on its own line.

StreamDeserializer doesn't specifically require a newline:

"values need to be a self-delineating value e.g. arrays, objects, or strings, or be followed by whitespace or a self-delineating value."

So:

{ "foo": 1 }

{}
[2, 3]{ "bar": {} }

"hello world"

would be perfectly fine to deserialize sequentially.

If the data really is a single massive JSON object, you are going to have to split it up manually and pass each piece to serde_json::from_slice separately.


This is already very similar to how Serde works. Serde calls methods on the Visitor trait for each item in the list. You can either call your callback from the Visitor (maybe with DeserializeSeed) or write the Visitor in a way that transforms the data during deserialization.

Another option could be to use RawValue. It lets you capture the raw text of each array so you can then process that string with a StreamDeserializer. It requires the input file to be in memory since it borrows from it, but mmapping the file might work.

If the issue is parsing the file would exhaust your system's ram, could you map the file into your program's address space using memory-mapped files then deserialize into borrowed types (e.g. String fields are replaced by &str or Cow<'a, str>)?

That way the OS will automatically page chunks of the file into and out of memory as needed, and because you borrow from the original source where possible, you avoid copying most of the data onto the heap.
