Async deserializing an array of json as a stream

Hi Rust community!

I have been looking for a solution to this for months, and I seem to be the only one who actually needs to deserialize a JSON array from an AsyncRead into structs. The idea is to download a huge JSON file and process its individual elements one by one while it is still downloading. The main goal is to avoid needing 4GB of RAM to hold everything in memory before processing a Vec of deserialized structs.

One library that does it very well for csv is the csv_async library with its AsyncDeserializer that creates a stream.

I tried to see if the destream_json crate would be appropriate for this use but in this issue the author kindly explained how serde is not compatible and that makes it hard to re-use easily.

I'm trying to do the same thing as the csv_async crate, that is, to create a stream from a JSON array, in this test repo. An example I found that reflects what I have in mind is this blog post: Efficient parsing of JSON record sets in Rust. However, it is synchronous, so it uses the trait bound Read instead of AsyncRead.

So, based on the blog post I'm trying to make it all async by replacing:

  • Read with AsyncRead
  • serde_json with tokio_serde
  • byteorder::ReadBytesExt with tokio_futures_byteorder::AsyncReadBytesExt

and by using async_compat to be able to use tokio_serde with my futures::AsyncRead reader.

After hours of trying to add AsyncSeek to my AsyncRead reader so that I can peek at the next char, I'm starting to be limited by my understanding of these types, trait bounds and lifetimes.

In my test repo I created a Minimal, Reproducible Example in the hope that someone could help me out. Or maybe tell me that what I'm trying to do is simply not achievable in the way that I have tried?

I'm defining a type in my deser_json module, which could surely be passed as a generic type parameter when creating the AsyncReader. But really, I would be happy to get something that works as it is, because it's a big enough challenge without that.

repo: GitHub - arnaudpoullet/deser_async_json_array_stream: Attempt at deserializing an AsyncRead array of jsons' stream

You probably want to use the StreamDeserializer of serde_json? It may not fit your payload if you don't have control over it, but for line-delimited (or whitespace-delimited) JSON it works very well.

The last time a question like this was posted, Alice had a good suggestion as well.

Usually when people run into this problem they are not actually deserializing a single large JSON object, but rather a sequence of many JSON objects separated by newlines. This is a lot easier to handle in a streaming way than a single big JSON object, e.g. you can use any of the read-line-by-line utilities for it.


I don't have control over my payload, unfortunately. It starts with [ and ends with ], and the elements are separated by commas. There is no way to split the payload on newlines or commas, because these chars also appear inside the rather large individual elements of the array.

The only way I can see this being efficient without loading the entire array first is by having an async Stream. But I understand this is not easy.

You may be able to find some inspiration here. That file implements a Stream that internally counts the opening and closing braces to find out where each object in a JSON list starts and ends, and then passes each individual object to serde_json::from_slice. The counting of braces happens in this file.
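To make the brace-counting idea concrete, here is a minimal sketch using only the standard library. It is not the code from the linked files: ArraySplitter and feed are hypothetical names, and it assumes the input is one well-formed top-level JSON array. It is fed byte chunks as they arrive, tracks nesting depth and string state, and returns the raw bytes of every element completed so far.

```rust
/// Hypothetical helper: incrementally splits one top-level JSON array
/// into the raw bytes of its elements. Assumes well-formed JSON.
#[derive(Default)]
struct ArraySplitter {
    depth: usize,     // 1 means "directly inside the outer array"
    in_string: bool,  // currently inside a JSON string literal
    escaped: bool,    // previous byte inside the string was a backslash
    current: Vec<u8>, // bytes of the element being accumulated
}

impl ArraySplitter {
    /// Feed the next chunk; returns every element this chunk completed.
    fn feed(&mut self, chunk: &[u8]) -> Vec<Vec<u8>> {
        let mut out = Vec::new();
        for &b in chunk {
            if self.in_string {
                // Brackets and commas inside strings must not be counted.
                self.current.push(b);
                if self.escaped {
                    self.escaped = false;
                } else if b == b'\\' {
                    self.escaped = true;
                } else if b == b'"' {
                    self.in_string = false;
                }
                continue;
            }
            match b {
                b'"' => {
                    self.in_string = true;
                    self.current.push(b);
                }
                b'[' | b'{' => {
                    // The outer '[' (seen at depth 0) belongs to no element.
                    if self.depth > 0 {
                        self.current.push(b);
                    }
                    self.depth += 1;
                }
                b']' | b'}' => {
                    self.depth = self.depth.saturating_sub(1);
                    if self.depth == 0 {
                        self.flush(&mut out); // outer array closed
                    } else {
                        self.current.push(b);
                    }
                }
                // A comma at depth 1 separates two elements.
                b',' if self.depth == 1 => self.flush(&mut out),
                // Skip whitespace between elements.
                _ if b.is_ascii_whitespace() && self.current.is_empty() => {}
                _ => {
                    if self.depth > 0 {
                        self.current.push(b);
                    }
                }
            }
        }
        out
    }

    /// Emit the accumulated element (if any), trimming trailing whitespace.
    fn flush(&mut self, out: &mut Vec<Vec<u8>>) {
        while self.current.last().map_or(false, |b| b.is_ascii_whitespace()) {
            self.current.pop();
        }
        if !self.current.is_empty() {
            out.push(std::mem::take(&mut self.current));
        }
    }
}
```

Each returned Vec<u8> can then be handed to serde_json::from_slice to get a struct. In an async setting, the same state machine could be driven by whatever chunks the AsyncRead yields, so only one element needs to be buffered at a time.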