I am using serde_json to parse large JSON files from the healthcare world. The healthcare industry has some of the worst data quality issues I have ever encountered, and there are many instances of malformed JSON files. Some issues I have been able to work around with attributes and custom deserializers, and others are a lost cause.
The latest issue I'm trying to work around is files with, essentially:
{"field": 0000000000}
which I realize is not valid JSON. I would like to treat the repeated zeros as just 0, but since it's not a valid Number, it fails to parse before I can do anything in a custom deserializer. Does anyone have any ideas for a way to approach this?
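To make the failure concrete, here is a minimal repro (the Record struct is just a stand-in for my real types):

use serde::Deserialize;

// Stand-in struct; the real types are far more involved.
#[derive(Debug, Deserialize)]
struct Record {
    field: u64,
}

fn main() {
    // Parsing dies on the number literal itself, long before any
    // custom deserializer attached to `field` would be invoked.
    let result: Result<Record, _> = serde_json::from_str(r#"{"field": 0000000000}"#);
    assert!(result.is_err());
    println!("{}", result.unwrap_err());
}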
Some details: the files are monstrously large, so I must use from_reader(), which means a simple preprocessing pass to search for the runs of zeros and replace them with 0 is not trivial. jq does accept numbers with leading zeros, so I could insert it into the pipeline to clean up the input, but that would probably introduce some overhead and I'd prefer to solve it in Rust. I would be willing to entertain a different crate, but the rest of the code is heavily serde-dependent, so I'd prefer to stay in the serde world.
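For reference, the jq detour I'm trying to avoid would look roughly like this: spawn jq as a child process and stream its stdout straight into from_reader, so nothing is buffered in full. The file path and Record struct are placeholders, and in reality the input comes off a network stream rather than a local file:

use std::fs::File;
use std::io::BufReader;
use std::process::{Command, Stdio};
use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct Record {
    field: u64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // jq tolerates the leading zeros and re-emits valid JSON on stdout.
    let input = File::open("input.json")?; // placeholder path
    let mut child = Command::new("jq")
        .arg("-c")
        .arg(".")
        .stdin(Stdio::from(input))
        .stdout(Stdio::piped())
        .spawn()?;
    let stdout = child.stdout.take().expect("stdout was piped");
    // Stream jq's output directly into serde_json without buffering the file.
    let data: Record = serde_json::from_reader(BufReader::new(stdout))?;
    println!("{:?}", data);
    child.wait()?;
    Ok(())
}

It works, but it adds another process and a full parse/re-serialize pass to every file, which is the overhead I'd rather avoid.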
This may be a "lost cause" situation, but I'm interested to see if anyone has any good ideas. Thanks.
In your position, needing to parse a variety of things that are “not quite JSON”, I would fork serde_json — creating a local “serde_not_quite_json” crate — and extend its parser as needed to deal with the specific cases encountered.
I tried a small workaround that might help with the 0000000000 issue. Since serde_json rejects it before custom deserialization, you can sanitize the input stream before parsing:
use std::io::{BufRead, BufReader, Cursor, Read};
use regex::Regex;
use serde::Deserialize;

fn sanitize_json<R: Read>(reader: R) -> impl BufRead {
    // Collapse any run of two or more zeros that follows a colon down to a single 0.
    let re = Regex::new(r#":\s*0{2,}"#).unwrap();
    let mut buffer = String::new();
    BufReader::new(reader).read_to_string(&mut buffer).unwrap();
    let sanitized = re.replace_all(&buffer, ": 0");
    BufReader::new(Cursor::new(sanitized.into_owned()))
}

#[derive(Debug, Deserialize)]
struct MyStruct {
    field: i32,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = Cursor::new(r#"{"field": 0000000000}"#);
    let clean_reader = sanitize_json(reader);
    let data: MyStruct = serde_json::from_reader(clean_reader)?;
    println!("{:?}", data);
    Ok(())
}
This replaces repeated zeros with 0 and lets you continue using serde_json::from_reader as usual. It’s a simple fix for predictable patterns like this.
I also tested the json5 crate, which handles other not-quite-JSON features like comments, single quotes, and trailing commas, but it still rejects numbers with leading zeros. So for that specific case, a sanitizer like this might be the most direct fix.
You're right, the original regex could mess up strings like "foo:000 bar". I’ve tried to fix it by tracking whether we're inside a string and only replacing repeated zeros after a colon, outside of quoted values:
use std::io::{BufRead, BufReader, Cursor, Read};
use serde::Deserialize;

fn sanitize_json<R: Read>(reader: R) -> impl BufRead {
    let mut buffer = String::new();
    BufReader::new(reader).read_to_string(&mut buffer).unwrap();
    let mut sanitized = String::with_capacity(buffer.len());
    let mut in_string = false;
    let mut chars = buffer.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            '"' => {
                in_string = !in_string;
                sanitized.push(c);
            }
            '\\' if in_string => {
                // Copy escape sequences verbatim so an escaped quote
                // doesn't toggle the in-string flag.
                sanitized.push(c);
                if let Some(escaped) = chars.next() {
                    sanitized.push(escaped);
                }
            }
            ':' if !in_string => {
                sanitized.push(':');
                // Skip optional whitespace
                while let Some(' ') = chars.peek() {
                    sanitized.push(chars.next().unwrap());
                }
                // Check for repeated zeros
                let mut zero_count = 0;
                while let Some('0') = chars.peek() {
                    chars.next();
                    zero_count += 1;
                }
                if zero_count >= 2 {
                    sanitized.push('0');
                } else {
                    for _ in 0..zero_count {
                        sanitized.push('0');
                    }
                }
            }
            _ => {
                sanitized.push(c);
            }
        }
    }
    BufReader::new(Cursor::new(sanitized))
}

#[derive(Debug, Deserialize)]
struct MyStruct {
    field: i32,
    note: String,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let reader = Cursor::new(r#"{"field": 0000000000, "note": "foo:000 bar"}"#);
    let clean_reader = sanitize_json(reader);
    let data: MyStruct = serde_json::from_reader(clean_reader)?;
    println!("{:?}", data);
    Ok(())
}
Output:
MyStruct { field: 0, note: "foo:000 bar" }
It seems to work for flat structures and might need more logic for nested objects or arrays, but it handles the basic case.
I'm unfamiliar with "simjson" and can't find a crate for it. I assume you meant the simd_json crate? That doesn't support streaming. It has a from_reader(), but it just reads the whole file into memory. The files I'm working with are often many tens of gigabytes compressed, or hundreds of gigabytes uncompressed; streaming is required.
Yes, something like this might work, but I cannot just call read_to_string and work with the entire file in memory, which is why I said doing a preprocessing step would be non-trivial. The uncompressed full JSON can be hundreds of gigabytes.
I also looked into json5 and didn't see anything about accepting zero prefixes, so I assumed it didn't.
Ah, probably. Not workable, as it actually parses JSON into a JsonData enum. I am using custom deserializers to parse JSON into bespoke data structures, deduping along the way, to reduce the vast amount of often replicated data into something manageable.
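For anyone curious, the deduping is in the spirit of the sketch below: a deserialize_with helper that interns repeated strings so identical values share one allocation. The names are made up and the real code is considerably more involved:

use std::cell::RefCell;
use std::collections::HashSet;
use std::sync::Arc;
use serde::{Deserialize, Deserializer};

thread_local! {
    // Interner for strings that repeat across millions of records.
    static INTERNED: RefCell<HashSet<Arc<str>>> = RefCell::new(HashSet::new());
}

// Deserialize a JSON string, returning a shared Arc<str> if we've seen it before.
fn dedup_string<'de, D: Deserializer<'de>>(d: D) -> Result<Arc<str>, D::Error> {
    let s = String::deserialize(d)?;
    INTERNED.with(|set| {
        let mut set = set.borrow_mut();
        if let Some(existing) = set.get(s.as_str()) {
            Ok(existing.clone())
        } else {
            let interned: Arc<str> = Arc::from(s);
            set.insert(interned.clone());
            Ok(interned)
        }
    })
}

#[derive(Debug, Deserialize)]
struct Claim {
    #[serde(deserialize_with = "dedup_string")]
    provider_name: Arc<str>, // placeholder field name
}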
Thanks @erelde and @jinschoi, you're absolutely right. My example buffers the full input with read_to_string, which wouldn't work for multi-GB streams or constrained setups like the one the OP described. I should've clarified it was more of a proof of concept to show how repeated zeros could be sanitized without touching quoted strings.
A streaming-safe version would need to track string state and colon context across buffer boundaries, definitely non-trivial. I might explore a BufRead adapter or tokenizer for that.
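To make that concrete, here is a rough sketch of the kind of adapter I have in mind. The ZeroSanitizer name and the details are made up for illustration: it wraps any Read, keeps the string/colon state on the struct so it survives chunk boundaries, and silently drops extra zeros as bytes pass through. It has the same caveats as the in-memory version (it only touches zeros right after a colon, and 0001 still comes out as the invalid 01):

use std::io::{self, Read};

// Parser state carried across read() calls so chunk boundaries don't matter.
#[derive(Clone, Copy)]
enum State {
    Normal,
    InString,
    InStringEscape,
    AfterColon,
    LeadingZeros,
}

// Illustrative adapter: collapses runs of zeros after a ':' (outside strings) to a single 0.
struct ZeroSanitizer<R> {
    inner: R,
    state: State,
}

impl<R: Read> ZeroSanitizer<R> {
    fn new(inner: R) -> Self {
        ZeroSanitizer { inner, state: State::Normal }
    }
}

impl<R: Read> Read for ZeroSanitizer<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        // One scratch buffer per call keeps the sketch simple; a real version
        // would reuse an internal buffer.
        let mut scratch = vec![0u8; buf.len()];
        loop {
            let n = self.inner.read(&mut scratch)?;
            if n == 0 {
                return Ok(0); // inner reader hit EOF (or buf was empty)
            }
            let mut out = 0;
            for &b in &scratch[..n] {
                let (emit, next) = match (self.state, b) {
                    (State::Normal, b'"') => (true, State::InString),
                    (State::Normal, b':') => (true, State::AfterColon),
                    (State::Normal, _) => (true, State::Normal),
                    (State::InString, b'\\') => (true, State::InStringEscape),
                    (State::InString, b'"') => (true, State::Normal),
                    (State::InString, _) => (true, State::InString),
                    (State::InStringEscape, _) => (true, State::InString),
                    (State::AfterColon, b' ' | b'\t' | b'\n' | b'\r') => (true, State::AfterColon),
                    (State::AfterColon, b'0') => (true, State::LeadingZeros),
                    (State::AfterColon, b'"') => (true, State::InString),
                    (State::AfterColon, _) => (true, State::Normal),
                    (State::LeadingZeros, b'0') => (false, State::LeadingZeros), // drop extra zero
                    (State::LeadingZeros, b'"') => (true, State::InString),
                    (State::LeadingZeros, _) => (true, State::Normal),
                };
                self.state = next;
                if emit {
                    buf[out] = b;
                    out += 1;
                }
            }
            if out > 0 {
                return Ok(out);
            }
            // The whole chunk was dropped zeros; read more before returning.
        }
    }
}

Usage would be serde_json::from_reader(std::io::BufReader::new(ZeroSanitizer::new(reader))), so nothing is held beyond one chunk, though it does cost a per-byte pass and one extra copy.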
Also confirmed json5 doesn’t accept leading-zero numbers either, so a sanitizer still seems necessary for that case.
Thanks for the response, anyway! I was thinking along those lines, too, but a BufRead adapter seems like it would involve a lot of unnecessary copying and slow down processing for every file just to handle this somewhat rare failure case. The first suggestion by @kpreid is probably more efficient, but... forking and maintaining my own "nearly serde_json" is not practical at the moment.
This might be a lost cause failure case.
What should happen for {"field": 0001}, if that case ever exists?
JavaScript truncates in that case, so maybe you can make use of QuickJS. There's a QuickJS crate in the Rust world. Or just grab QuickJS NG compiled to WASM and do something like this.
I can't use anything that can't be driven directly from a BufReader without reading the entire file first: first, because the files are far larger than available memory, and second, because they are most often being streamed and decompressed straight from S3. Thanks for the suggestion, though.
I'm not a Rustacean, so I don't know about the constraints. Rust has to have some kind of subprocess mechanism. If you have to read the whole file first, a simple RegExp will do the job.
It's unlikely simjson will work for you. Although it doesn't read the entire JSON into memory, it does store the final result there. I simply pointed out that you need to do a little research to find a crate that fits your requirements exactly. For example, simjson parses several JSON fragments glued together; that was a requirement in my case, but it may not be one in yours. Why did I point to simjson? Because it is very simple and easy to modify for a particular use case.
You may experiment with saphyr-parser. It does not do any interpretation of the values; it is only a parser, so you can write code to handle any input you encounter. It is a YAML parser, but it will take JSON input as well. Being a YAML parser it is slower, but writing a full-blown text preprocessor appears to be a significant amount of work and would also hurt performance.