Zero-copy deserialization of arbitrary JSON data with serde

To deserialize arbitrary JSON data, I can use serde_json::Value:

use serde_json::Value;

let json_str = r#"{"id":123,"name":"John Doe","screen_name":"Unidentified","location":"Fringe"}"#;
let result: Value = serde_json::from_str(json_str).unwrap();

But this is not zero-copy, so for large, string-rich JSON data it can be resource-intensive.
Is it possible to do zero-copy deserialization in this case?

One zero-copy approach is to use serde_json::value::RawValue. It doesn't copy the data, but I need access to some fields in the JSON data, so I would have to parse the RawValue into a Value anyway.

Any other idea is also welcome. I really appreciate any help you can provide.


If you know which fields you need before parsing the whole JSON string, you could write your own indexer that stores a span marking the beginning and end of each field's value in the JSON string. The span can then be used to take a substring and parse only that into serde_json::Value.


If I understood your question correctly, using a field of type &'a str does zero-copy deserialization iff the JSON string contains no escape sequences, and returns an error otherwise.
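To illustrate why escape sequences matter here, this is a minimal std-only sketch, not serde's actual implementation. The helper name decode_json_str is hypothetical, and it handles only the single-character escapes (it rejects \uXXXX for brevity). When the raw text contains no backslash, the decoded string is byte-for-byte the input and can be borrowed. Otherwise a new buffer is unavoidable, which is why a plain &'a str cannot represent that case.

```rust
use std::borrow::Cow;

/// Decodes the body of a JSON string literal (the text between the quotes).
/// Hypothetical helper for illustration only.
fn decode_json_str(raw: &str) -> Result<Cow<'_, str>, String> {
    // Fast path: without a backslash the input bytes already are the final
    // string, so we can return a borrow of the input.
    if !raw.contains('\\') {
        return Ok(Cow::Borrowed(raw));
    }
    // Slow path: an escape changes the bytes, so we must allocate.
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            Some('n') => out.push('\n'),
            Some('t') => out.push('\t'),
            Some('r') => out.push('\r'),
            Some('b') => out.push('\u{0008}'),
            Some('f') => out.push('\u{000C}'),
            Some('"') => out.push('"'),
            Some('\\') => out.push('\\'),
            Some('/') => out.push('/'),
            // \uXXXX and malformed escapes are out of scope for this sketch.
            other => return Err(format!("unsupported escape: {:?}", other)),
        }
    }
    Ok(Cow::Owned(out))
}
```

With serde itself, a Cow<'a, str> field gives the same borrow-or-allocate behavior, but as far as I know only when the field is annotated with #[serde(borrow)]; without the attribute the Cow is always owned.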


I don't know whether such an approach would fit OP's requirement of parsing arbitrary JSON values.


The Value type is not special in any way. What I mean is, you can copy the type and replace the String variant with something that fits your need better, such as using &'a str or Cow<'a, str>. Both types support zero-copy deserialization. Someone might have even released a crate for that already.


Regarding "write your own indexer": could you elaborate a bit? Do you mean I need to write a custom deserializer and store the indexes of the fields I am interested in?

Thank you. I had a similar idea, but I was wondering whether it can be done directly with serde or whether there is already a crate that supports it.

I was thinking about creating a HashMap<String, Span> from your JSON file, where the keys are all the fields of your JSON (flattened) and the values are just pointers to the substring from which you can retrieve the raw value. Here is some pseudo-code:

// Could be helpful to parse the value later
enum Type {
    Number,
    String,
    EscapedString,
    Object,
    Array,
    Bool,
    Null,
}

struct Span {
    start: usize, // first byte of value inside json string
    end: usize, // last byte of value inside json string
    typ: Type,
}

As for an interface, I was imagining something like this:

use std::collections::HashMap;

fn main() {
    let json = r#"{
        "foo": 1,
        "bar": {
            "baz": "hello"
        },
        "bat": [
            "world"
        ]
    }"#;

    let index: HashMap<String, Span> = index_json(json).unwrap();

    // index: {
    //     "foo": Span { start: .., end: .., typ: Number },
    //     "bar": Span { start: .., end: .., typ: Object },
    //     "bar.baz": Span { start: .., end: .., typ: String },
    //     "bat": Span { start: .., end: .., typ: Array },
    //     "bat[0]": Span { start: .., end: .., typ: String },
    // }
}

fn index_json(json: &str) -> Result<HashMap<String, Span>, Error> {
    // ...
}

As for how I would implement it, I'm pretty sure you can do it with serde by writing your own deserializer. I personally would try to use lalrpop for that, though I'm not 100% sure how exactly to do it (I'm not sure whether lalrpop lets me retrieve the start and end indices of a value).
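A hand-written scanner is another option. The following is a minimal sketch under several assumptions: it indexes only the top-level keys of an object (no flattened "bar.baz" paths), it stores half-open byte ranges instead of the Span struct above, it keeps keys in their raw (unescaped) form, and it does not validate the JSON. The function names index_top_level, skip_value, skip_string and skip_ws are hypothetical.

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Advances past ASCII whitespace starting at `i`.
fn skip_ws(b: &[u8], mut i: usize) -> usize {
    while i < b.len() && matches!(b[i], b' ' | b'\t' | b'\n' | b'\r') {
        i += 1;
    }
    i
}

/// `start` must point at an opening quote; returns the index just past the closing quote.
fn skip_string(b: &[u8], start: usize) -> Option<usize> {
    let mut i = start + 1;
    while i < b.len() {
        match b[i] {
            b'\\' => i += 2, // skip the backslash and the escaped byte
            b'"' => return Some(i + 1),
            _ => i += 1,
        }
    }
    None // unterminated string
}

/// Returns the index just past the JSON value that starts at `start`.
fn skip_value(b: &[u8], start: usize) -> Option<usize> {
    match *b.get(start)? {
        b'"' => skip_string(b, start),
        b'{' | b'[' => {
            // Track nesting depth; strings are skipped so brackets inside them are ignored.
            let (mut depth, mut i) = (0usize, start);
            while i < b.len() {
                match b[i] {
                    b'"' => {
                        i = skip_string(b, i)?;
                        continue;
                    }
                    b'{' | b'[' => depth += 1,
                    b'}' | b']' => {
                        depth -= 1;
                        if depth == 0 {
                            return Some(i + 1);
                        }
                    }
                    _ => {}
                }
                i += 1;
            }
            None // unbalanced brackets
        }
        _ => {
            // Number, true, false or null: runs until a delimiter or whitespace.
            let mut i = start;
            while i < b.len() && !matches!(b[i], b',' | b'}' | b']' | b' ' | b'\t' | b'\n' | b'\r') {
                i += 1;
            }
            Some(i)
        }
    }
}

/// Maps each top-level key to the byte range of its raw value inside `json`.
fn index_top_level(json: &str) -> Option<HashMap<&str, Range<usize>>> {
    let b = json.as_bytes();
    let mut index = HashMap::new();
    let mut i = skip_ws(b, 0);
    if *b.get(i)? != b'{' {
        return None;
    }
    i = skip_ws(b, i + 1);
    if *b.get(i)? == b'}' {
        return Some(index);
    }
    loop {
        if *b.get(i)? != b'"' {
            return None;
        }
        let key_end = skip_string(b, i)?;
        let key = &json[i + 1..key_end - 1]; // text between the quotes
        i = skip_ws(b, key_end);
        if *b.get(i)? != b':' {
            return None;
        }
        i = skip_ws(b, i + 1);
        let value_end = skip_value(b, i)?;
        index.insert(key, i..value_end);
        i = skip_ws(b, value_end);
        match *b.get(i)? {
            b',' => i = skip_ws(b, i + 1),
            b'}' => return Some(index),
            _ => return None,
        }
    }
}
```

A lookup such as &json[index["bar"].clone()] then yields the raw text of that value, and only that substring needs to be handed to serde_json::from_str when the field is actually needed.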


Having said all this, I think @jonasbb's solution is far less work and quite elegant.


This crate seems to implement the approach of having a new Value type that borrows.


Thank you @jofas @jonasbb
I played around with what @jonasbb suggested: I basically copied serde_json's Value and its deserializer and modified them to use references instead of creating new Strings.

#[derive(Clone, Eq, PartialEq, Debug, Serialize)]
#[serde(untagged)]
pub enum Value<'a> {
    Null,
    Bool(bool),
    Number(Number),
    Bytes(&'a [u8]),
    String(&'a str),
    Array(Vec<Value<'a>>),
    Object(HashMap<&'a str, Value<'a>>),
}

Here's the playground: Rust Playground

I'm trying to figure out how to make Value::Bytes(&'a [u8]) work. That way I could avoid converting to &str in cases where my JSON input is a &[u8] (from a network call).
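One note on the conversion cost, as a small std-only sketch: std::str::from_utf8 validates the bytes but does not copy them, so the resulting &str still points into the original buffer. The helper names view_as_str and is_borrowed_from are hypothetical and exist only to make that visible.

```rust
use std::str;

/// Views a byte buffer as text. `str::from_utf8` checks that the bytes are
/// valid UTF-8 and returns a `&str` pointing into the same memory.
fn view_as_str(bytes: &[u8]) -> Result<&str, str::Utf8Error> {
    str::from_utf8(bytes)
}

/// Returns true if `text` lies entirely inside `buffer`'s memory,
/// i.e. it is a borrow of the buffer and not a copy.
fn is_borrowed_from(buffer: &[u8], text: &str) -> bool {
    let buf = buffer.as_ptr_range();
    let txt = text.as_bytes().as_ptr_range();
    buf.start <= txt.start && txt.end <= buf.end
}
```

So the &[u8] to &str step costs one validation pass over the input and no allocation. serde_json also provides serde_json::from_slice, which accepts a &[u8] directly.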