Reading JSON sequentially

The Rust JSON package needs the entire JSON text to be read and then parsed before you can do anything with it, which makes peak memory usage very high. Streaming XML parsers are common, but streaming JSON parsers are not. Is there a Rust streaming parser for JSON?

I don't think you can even read "back to back" JSON sequentially where multiple JSON outputs are concatenated.

(I have a program which reads a huge JSON file (tens of gigabytes), and can go over 32GB at peak. Once the data is loaded, it needs only a few hundred MB. It's mostly repeats of a modest size item, so all I really need is some serialization format that's readable sequentially. Do any of the already-implemented Serde formats offer that?)

Serde operates over the JSON text in a streaming fashion. You can write a custom Deserialize implementation that walks through the structure and pulls out whatever bits you need. It'll be a bit verbose (lots of visitors) but should be totally doable.

3 Likes

That makes sense. I will look into that. Once I can perform "get next token" (number, string, delimiter, etc.) on a JSON file, the rest is straightforward.

serde_json provides serde_json::StreamDeserializer, which gives you an iterator over concatenated JSON values. You can pair it with Deserializer::from_reader to feed it anything that implements Read.

playground

use serde_json::{Deserializer, Value};

fn main() {
    let mut rdr = std::io::Cursor::new(r#"{"k": 3}1"cool""stuff" 3{}  [0, 1, 2]"#);

    let stream = Deserializer::from_reader(&mut rdr).into_iter::<Value>();

    for value in stream {
        println!("{}", value.unwrap());
    }
}

Edit: Though I guess, rereading your post, I can't tell whether you have concatenated JSON objects or really one large object. This would only help with the former situation.

1 Like

That's good enough. Right now I have one big object, but it's really just an array of items logged by something I'm testing. It can easily be converted to separate objects. The whole point of this is to prevent 35GB of memory consumption doing it all at once. It's purely a test tool, so it doesn't have to go fast. Thanks.

Cargo itself uses newline-delimited JSON (each line is a JSON object).

If you first split by lines, you can parse each line in parallel.

1 Like

Likely a bit off topic, but since you were describing the amount of JSON you're parsing, I'm curious whether you have checked out the SIMD JSON parsing library: simd_json - Rust. I saw a presentation that describes the algorithms involved (published about a year ago). Good stuff.

Speed isn't a big issue. This is a temporary measure to connect two programs for easier debugging. One program is gathering data, and the other has a 3D GUI. It's easier to work on them separately, because the 3D library is still under development. Once things settle down I will combine them and eliminate the JSON file.

It turns out that serde_json and the json crate are not quite compatible. serde_json numbers have to be explicitly converted to f32, while json numbers can be converted implicitly. And the implicit string conversion in serde_json returns a quoted string. Not hard to fix, but an hour of extra work.