Filtering a 200GB+ *.json file in Rust

  1. I have a *.json file. It is 200+ GB. (Not exactly sure, still running bunzip2).

  2. The format of the file is:

[ { object 1 }, { object 2 }, { object 3 }, ... ]

  3. My goal is to run a filter over this JSON file, outputting another *.json file.

  4. Please note: the size of the file exceeds memory + swap space, so I cannot construct the entire object in memory. Furthermore, the entire file is one object, an array. So I need some "cursor" that lets me point at subsections of the JSON object without constructing the entire object.

  5. I am looking for advice on how to approach this.

Since it is a single object, it is somewhat tricky. I once wrote some code that can do this. You can find it here. It is used by repeatedly calling next() until it returns None; at that point you add more bytes with a call to push(), then go back to calling next().

The code works by counting the currently open brackets and braces to keep track of when it reaches the end of an object, and it knows not to count those that appear inside strings.
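Sketched out, the counting loop might look something like this (my own illustration of the idea, not the code linked above; it assumes the scan starts at the first byte of an element):

```rust
/// Scan `buf` and return the index just past the first complete JSON
/// object or array, or None if more bytes are needed (mirroring the
/// push()/next() protocol described above).
fn next_object_end(buf: &[u8]) -> Option<usize> {
    let mut depth: u32 = 0;
    let mut in_string = false;
    let mut escaped = false;
    let mut started = false;
    for (i, &b) in buf.iter().enumerate() {
        if in_string {
            // Inside a string: only watch for the closing quote,
            // being careful about backslash escapes like \".
            if escaped {
                escaped = false;
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
            }
        } else {
            match b {
                b'"' => in_string = true,
                b'{' | b'[' => {
                    depth += 1;
                    started = true;
                }
                b'}' | b']' if started => {
                    depth -= 1;
                    if depth == 0 {
                        return Some(i + 1);
                    }
                }
                _ => {} // commas, whitespace, scalars, the enclosing array's brackets
            }
        }
    }
    None // incomplete: push more bytes and call again
}
```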

Producing a new JSON file is much easier: start by writing a [, then write an object, a comma, another object, and so on, finishing with a ].
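A minimal sketch of the writing side (my illustration; the key detail is that the comma goes before every element except the first):

```rust
use std::io::{self, Write};

/// Stream already-serialized objects into an output JSON array,
/// writing a comma before every element except the first.
fn write_array<W: Write>(mut out: W, objects: impl IntoIterator<Item = String>) -> io::Result<()> {
    out.write_all(b"[")?;
    for (i, obj) in objects.into_iter().enumerate() {
        if i > 0 {
            out.write_all(b",")?;
        }
        out.write_all(obj.as_bytes())?;
    }
    out.write_all(b"]")
}
```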

1 Like

Is "hunt down the person who generated this file and have a discussion with them about appropriate choice of data format" not an option here?

19 Likes

I've never done this, but this example seems potentially useful:

https://serde.rs/stream-array.html

3 Likes

The approach I would use is something like this:

  1. Get an object that implements Read
  2. Read one character at a time, until '[' is seen (and consumed).
  3. Use serde (or your favorite JSON parser) to read a single object, and process it. This leaves the reader pointing after the read object.
  4. Continue reading one character at a time. If you read a ']', you're done; if you read a ',', loop back to step 3 to process another object.

I haven't used serde yet, so I can't be more specific than this pseudocode. To be a bit more robust, you can panic if you see any characters you're not expecting while scanning for the next separator: the only legal whitespace characters in JSON are "\x09\x0A\x0D\x20".
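A rough, untested sketch of that loop using serde_json (the file name and the filter are placeholders; it constructs a fresh Deserializer per element and relies on each element being a {...} object, so serde_json never needs to peek past the value it just read):

```rust
use serde::Deserialize;
use serde_json::Value;
use std::fs::File;
use std::io::{BufReader, Read};

/// Read single bytes, skipping the four legal JSON whitespace bytes.
fn next_nonws(reader: &mut impl Read) -> std::io::Result<u8> {
    loop {
        let mut byte = 0u8;
        reader.read_exact(std::slice::from_mut(&mut byte))?;
        if !matches!(byte, b'\x09' | b'\x0A' | b'\x0D' | b'\x20') {
            return Ok(byte);
        }
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut reader = BufReader::new(File::open("huge.json")?); // placeholder path

    // Step 2: scan to the opening '[' and consume it.
    assert_eq!(next_nonws(&mut reader)?, b'[');

    // (Doesn't handle an empty top-level array.)
    loop {
        // Step 3: have serde_json deserialize exactly one element,
        // leaving the reader positioned just past it.
        let obj = Value::deserialize(&mut serde_json::Deserializer::from_reader(&mut reader))?;
        // ... run the filter on `obj` and write matches out ...
        let _ = obj;

        // Step 4: ',' means another element follows; ']' means done.
        match next_nonws(&mut reader)? {
            b',' => continue,
            b']' => break,
            other => panic!("unexpected byte: {:?}", other as char),
        }
    }
    Ok(())
}
```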

4 Likes

@2e71828: Yeah, thanks, this seems most reasonable to me too.

Basically, I am manually responsible for extracting strings[1] that each represent a single JSON object from the array, which I then pass to serde to parse and filter.

[1] I think, technically, your approach is even more efficient in that, by passing an offset to serde, these strings are never constructed in the first place.

1 Like

In an effort to convince serde_json to do such streaming directly (i.e. without me having to parse the outer array myself), I wrote this code. I haven't yet tested whether it really avoids loading the whole file into memory, but IMO there's a good chance it works, and I'm moving on from this topic for today. Feel free to test this approach.

Edit: After another look at the serde docs, I've come to think that this could all be approached much more easily by using serde_json's Deserializer directly. I hadn't really noticed it before. I'll continue with this later or tomorrow.
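For reference, the general shape of letting serde_json drive the outer array itself, via a seq visitor, might look roughly like this (again untested, my own sketch; the file name is a placeholder):

```rust
use serde::de::{Deserializer, SeqAccess, Visitor};
use serde_json::Value;
use std::fmt;
use std::fs::File;
use std::io::BufReader;

/// Visits the top-level JSON array and hands each element to a callback,
/// so no element outlives its own processing.
struct ForEach<F: FnMut(Value)>(F);

impl<'de, F: FnMut(Value)> Visitor<'de> for ForEach<F> {
    type Value = ();

    fn expecting(&self, f: &mut fmt::Formatter) -> fmt::Result {
        f.write_str("a JSON array")
    }

    fn visit_seq<A: SeqAccess<'de>>(mut self, mut seq: A) -> Result<(), A::Error> {
        // Elements are deserialized one at a time; each is dropped
        // before the next one is read from the stream.
        while let Some(value) = seq.next_element::<Value>()? {
            (self.0)(value);
        }
        Ok(())
    }
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = BufReader::new(File::open("huge.json")?); // placeholder path
    let mut de = serde_json::Deserializer::from_reader(reader);
    de.deserialize_seq(ForEach(|obj| {
        // ... apply the filter to `obj` and write matches out ...
        let _ = obj;
    }))?;
    de.end()?; // check there is no trailing garbage after the array
    Ok(())
}
```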

2 Likes

Seems like this problem has been solved already: Array of values without buffering · Serde

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.