No_std streaming JSON parser

Hi,

I've been working on this for a bit - picojson

The gist: a streaming parser that does not allocate, recurse, or panic, i.e. fully deterministic behavior (some test/demo results here). The caller always gets an error if we run out of buffer space or the input is broken.

The motivation is mostly use in really resource-constrained environments, e.g. microcontrollers with kilobytes of memory that aren't allowed to crash.

I'm looking for feedback on whether the top-level API makes sense, and of course on any other improvements to the internal implementation as well. I know I need to work on code de-duplication between the Slice/Stream parsers, and the longer I wait the harder it gets :slight_smile:

The absolute minimal use looks like this:

let json = r#"{"name": "value"}"#;
let mut parser = SliceParser::new(json);

while let Some(event) = parser.next() {
    match event.expect("Parse error") {
        Event::Key(key) => println!("Found key: {}", key),
        _ => {}
    }
}

In a more real-world use case you'd want to pass in a scratch buffer to allow dealing with string escapes:

let mut scratch = [0u8; 16]; // Just enough to hold the longest string
let parser = SliceParser::with_buffer(r#"{"msg": "Hello\nWorld"}"#, &mut scratch);

But parsing from a slice isn't that interesting; more useful (and more complex) is the StreamParser:

let json = br#"{"large": "document with lots of data..."}"#;
// Simulate reading only 4 bytes at a time
let reader = ChunkReader::new(json, 4);
let mut buffer = [0u8; 128];
let mut parser = StreamParser::new(reader, &mut buffer);

Of course, this is yet another demo reading from a slice. In real use, anything that implements the simple Reader trait can be used as a source (UART, network socket, file, etc.).
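For illustration, here's roughly what plugging in your own source could look like. The Reader trait shape below is my assumption of the interface, not copied from the crate, and SliceReader is a made-up stand-in for a UART or socket:

```rust
// Assumed trait shape (hypothetical, not the crate's actual definition):
// a fallible "fill this buffer, return bytes read" pull interface.
trait Reader {
    type Error;
    fn read(&mut self, buf: &mut [u8]) -> Result<usize, Self::Error>;
}

// Example source: drain a byte slice a few bytes at a time,
// as a stand-in for a UART or network socket.
struct SliceReader<'a> {
    data: &'a [u8],
    pos: usize,
}

impl<'a> Reader for SliceReader<'a> {
    type Error = core::convert::Infallible;

    fn read(&mut self, buf: &mut [u8]) -> Result<usize, Self::Error> {
        let n = (self.data.len() - self.pos).min(buf.len());
        buf[..n].copy_from_slice(&self.data[self.pos..self.pos + n]);
        self.pos += n;
        Ok(n) // 0 signals end of input
    }
}
```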

The parsing loop looks similar to the one above.

Here's a more full-featured example of how to selectively extract real data from a more complex document.

There's a lot more code in the crate than I anticipated this would need (parsers are never easy, huh?), so I'm not looking for anyone to dig through it all. I'd just appreciate any comments on how to make this better / more ergonomic / more usable.

One specific bit that I don't know how to do better: I have int-8/32/64 feature selection in Cargo.toml, to prevent 64-bit math primitives from being included on, say, 8-bit AVR targets. serde-json-core can solve this much more conveniently because the destination deserialization type is known at compile time:

Deserialization of integers doesn’t go through u64; instead the string is directly parsed into the requested integer type. This avoids pulling in KBs of compiler intrinsics when targeting a non 64-bit architecture.

With a pull parser I'm not sure there's a good way to make that work. I'd appreciate any thoughts!
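To make the serde-json-core idea concrete, independent of picojson: digits can be accumulated directly in a caller-chosen integer width, so 64-bit intrinsics are only instantiated when someone actually asks for a 64-bit type. Everything below (the Int trait, parse_digits) is hypothetical illustration, not the crate's API:

```rust
// Hypothetical sketch: accumulate decimal digits directly in the target
// width, so no u64 math is pulled in on small targets unless requested.
trait Int: Copy {
    const ZERO: Self;
    fn mul10_add(self, d: u8) -> Option<Self>;
}

macro_rules! impl_int {
    ($($t:ty),*) => {$(
        impl Int for $t {
            const ZERO: Self = 0;
            fn mul10_add(self, d: u8) -> Option<Self> {
                // Checked arithmetic: overflow becomes a parse error,
                // never a panic.
                self.checked_mul(10)?.checked_add(d as $t)
            }
        }
    )*};
}
impl_int!(i8, i16, i32, i64);

fn parse_digits<T: Int>(s: &str) -> Option<T> {
    if s.is_empty() {
        return None;
    }
    let mut acc = T::ZERO;
    for b in s.bytes() {
        if !b.is_ascii_digit() {
            return None;
        }
        acc = acc.mul10_add(b - b'0')?;
    }
    Some(acc)
}
```

The open question for a pull parser is where the `T` comes from, since there's no destination type at the event site; a generic accessor on the number event might be one way.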


I work in the same area but took a different path: I have a streaming JSON parser, and I am planning to migrate everything to "no_std". Perhaps I could have gone a different route; maybe I could fork your parser for my needs, or persuade you to collaborate.

What I have: I began with "Jiter", added streaming through "RJiter", and created a callback interface "scan_json". To facilitate the migration to "no_std", I am working on "bufvec". My git repo: GitHub - olpa/streaming_json: Process json while it's being generated

whether the top-level API makes sense

yes

any other improvements

  1. If there is an error, the error message should include the line and column.

  2. You need something like "write_long_bytes" (RJiter in rjiter::rjiter - Rust) to support values larger than the buffer. For example, images encoded as base64.

  3. If the application logic requires collecting values before processing them and the application is no_std, there is a challenge. My untested solution is "bufvec". Looking ahead, "bufvec" seems suitable for libraries as well, and using u8 slices can be ergonomic, eliminating the need for custom "string" implementations.
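On point 1, a sketch of how line/column could be tracked by feeding the tracker every consumed byte. Names here are made up, and the counters are u64 so long streams don't overflow usize on 16-bit targets:

```rust
// Hypothetical position tracker: feed it every byte the parser consumes
// and it maintains a 1-based line/column for error messages.
#[derive(Debug, PartialEq)]
struct Pos {
    line: u64, // u64, not usize: long streams on 16-bit targets
    col: u64,
}

impl Pos {
    fn new() -> Self {
        Pos { line: 1, col: 1 }
    }

    fn advance(&mut self, byte: u8) {
        if byte == b'\n' {
            self.line += 1;
            self.col = 1;
        } else {
            self.col += 1;
        }
    }
}
```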

This is interesting to me. I have a need for a streaming JSON parser that preserves formatting, so that a document can round-trip through my program exactly. I.e. I would need spaces and newlines to round-trip, as well as how numbers are formatted (for example 2 vs 2.0).

Is that something that this parser can manage and/or would it be difficult to build on top of this parser to add that?

Thanks for the feedback!

Currently there is no whitespace preservation at all, but numbers do get preserved: JsonNumber keeps the raw string around.

Adding a whitespace-preserving mode shouldn't be hard, and everything else should already be kept. I added a todo here.
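The raw-preserving idea is roughly this (a simplified sketch with made-up names, not the crate's actual JsonNumber definition):

```rust
// Keep the original text next to any parsed view, so "2" and "2.0"
// round-trip byte-for-byte even though both parse to the same value.
struct RawNumber<'a> {
    raw: &'a str,
}

impl<'a> RawNumber<'a> {
    fn as_f64(&self) -> Option<f64> {
        self.raw.parse().ok()
    }

    fn as_raw(&self) -> &'a str {
        self.raw
    }
}
```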


Thanks - that's a good call-out. I'll have to think about how to fit that in, as currently the full content position in the stream is not being tracked at all.

Also, in a full streaming use case the line count might actually overflow usize.

I'll have a look at how you are doing this. I get a feeling this could get very complicated - it's hard to determine on the parser side which data bits should be chunked and split, and which ones shouldn't.

My current approach is to basically leave the decisions on whether to copy or preserve the data entirely to the application side.

I think there's maybe room for a layer on top of the parser to deal with selective extraction of data (that's the main use case for a streaming parser anyway) and with copies.

Thanks for the comments - I'm having a look at your approach as well; it looks well aligned with what I'm trying to do.

Yes, it was a pain, but mostly because I wanted to re-use Jiter and also support decoding escapes. Otherwise the task is actually surprisingly easy after one insight: find the next active backslash, write the part before it, shift the buffer, repeat. It sounds confusing in this short note, but can be easy after some thought.
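That loop might look something like this minimal sketch (single-character escapes only; \uXXXX handling and the streaming buffer shift are omitted, and the function name is made up):

```rust
// Decode JSON single-character escapes from `input` into a
// caller-provided scratch buffer; returns None on an unsupported
// escape or if the scratch buffer is too small.
fn unescape<'a>(input: &[u8], out: &'a mut [u8]) -> Option<&'a [u8]> {
    let mut n = 0;
    let mut i = 0;
    while i < input.len() {
        let b = input[i];
        let decoded = if b == b'\\' {
            // Found an active backslash: decode the next byte.
            i += 1;
            match *input.get(i)? {
                b'n' => b'\n',
                b't' => b'\t',
                b'\\' => b'\\',
                b'"' => b'"',
                _ => return None, // \uXXXX etc. omitted in this sketch
            }
        } else {
            b
        };
        *out.get_mut(n)? = decoded; // error out instead of overflowing
        n += 1;
        i += 1;
    }
    Some(&out[..n])
}
```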

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.