Streaming JSON parser


#1

I am planning to implement a streaming JSON parser. The idea is to
encapsulate a stream reader (of bytes/chars) and return an iterator
that yields parsed JSON values.

For example, for the following sequence of text:

true
100
[2, 3]
{"name": "joe", "age": 20}

a stream parser should return 4 JSON values, one for each iteration. I
know how to implement the Iterator trait, but I don't know whether there
is anything similar to Go's io.Reader.
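
Roughly the shape I have in mind (just a sketch; `JsonStream` and `Json`
are placeholder names, not anything that exists yet):

```rust
use std::io::Read;

// Placeholder for whatever value type the parser ends up producing.
pub struct Json;

pub struct JsonStream<R: Read> {
    reader: R,
}

impl<R: Read> Iterator for JsonStream<R> {
    type Item = std::io::Result<Json>;

    fn next(&mut self) -> Option<Self::Item> {
        // Pull bytes from `self.reader` until one complete JSON value has
        // been parsed, yield it, and keep any leftover bytes for the next
        // call. Yield None once the underlying stream is exhausted.
        unimplemented!()
    }
}
```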


#2

Isn’t this what you want? https://docs.rs/serde_json/1.0.34/serde_json/de/struct.StreamDeserializer.html
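
A rough usage sketch (untested; the input string is just your four values
concatenated):

```rust
use serde_json::{Deserializer, Value};

fn main() {
    let data = r#"true 100 [2, 3] {"name": "joe", "age": 20}"#;

    // The StreamDeserializer yields one Result<Value> per JSON value in the input.
    let stream = Deserializer::from_str(data).into_iter::<Value>();
    for value in stream {
        println!("{:?}", value.unwrap());
    }
}
```

There is also Deserializer::from_reader if you want to feed it an io::Read directly.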


#3

For now I can use serde_json, and impl From<Value> for Json.
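
Roughly what I mean by that conversion (a sketch only; the Json enum below
is a stand-in, the actual jsondata type has different variants):

```rust
use serde_json::Value;

// Stand-in for jsondata's Json type; the variants here are illustrative.
#[derive(Debug)]
pub enum Json {
    Null,
    Bool(bool),
    Integer(i64),
    Float(f64),
    String(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

impl From<Value> for Json {
    fn from(val: Value) -> Json {
        match val {
            Value::Null => Json::Null,
            Value::Bool(b) => Json::Bool(b),
            Value::Number(n) => match n.as_i64() {
                Some(i) => Json::Integer(i),
                None => Json::Float(n.as_f64().unwrap_or(std::f64::NAN)),
            },
            Value::String(s) => Json::String(s),
            Value::Array(arr) => Json::Array(arr.into_iter().map(Json::from).collect()),
            Value::Object(map) => {
                Json::Object(map.into_iter().map(|(k, v)| (k, Json::from(v))).collect())
            }
        }
    }
}
```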

Though in the long term there might be issues regarding:

  • JSON5 support.
  • Sortable JSON that defines min-bound and max-bound as types.

Note that jsondata is focused on JSON from the perspective of a document database. The full scope is listed here.

I might still want some kind of streaming trait for reading a sequence of bytes/chars.


#4

std::io::Read or something from the bytes crate?


#5

The documentation for the read() method says:

If n is 0, then it can indicate one of two scenarios:

  1. This reader has reached its “end of file” and will likely no longer be able to produce bytes. Note that this does not mean that the reader will always no longer be able to produce bytes.
  2. The buffer specified was 0 bytes in length.

Looks like Read::read() may or may not block. To begin with, the following
are the three scenarios for a streaming read:

  1. Reading input from stdin.
  2. Reading from a File.
  3. Reading from network I/O.

Does the Read implementation for the above three types/cases guarantee EOF? I don't have experience using any of the three in Rust, hence I am asking here.

Thanks,


#6

The end-of-file thing: if you passed a non-zero-sized buffer, 0 always means EOF. There are very weird things that can have an EOF in the middle (like Unix named pipes, where EOF means the end of one "connection", but another "connection" might come later), but in general you want to handle that situation in the same way: just stop reading.
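
In code, "just stop reading" looks roughly like this (a sketch; read_to_end
does essentially the same thing for you):

```rust
use std::io::{ErrorKind, Read, Result};

// Read everything until EOF; works for stdin, a File, a TcpStream, or any
// other blocking Read.
fn read_all<R: Read>(mut source: R) -> Result<Vec<u8>> {
    let mut data = Vec::new();
    let mut buf = [0u8; 4096];
    loop {
        match source.read(&mut buf) {
            // Non-empty buffer and 0 bytes read means EOF.
            Ok(0) => break,
            Ok(n) => data.extend_from_slice(&buf[..n]),
            // Interrupted reads are harmless; just retry.
            Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        }
    }
    Ok(data)
}
```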

All the "usual" things are Read: files, stdin, TcpStream, etc. Some other things are too (byte slices, decompression wrappers).

Anyway, read corresponds to the OS read syscall, with all of its properties. Unless it's something that is not a file, of course.

If you haven’t put the thing into non-blocking mode, then it’s blocking.

So if you want to do the usual "blocking" implementation, just expect to get a blocking Read and propagate any errors from there. If you want to handle non-blocking as well, you'll want to implement futures::Stream instead of Iterator, and return NotReady when you get a "would block" error.

By the way, the read method is a pretty low-level thing; there is a bunch of other methods (read_to_end, read_exact, …) that are much more convenient and handle the corner cases for you. Also, you might want to use BufRead, which provides a few more methods and can be created from an arbitrary Read implementation by wrapping it in a BufReader.
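
For example (a sketch; the file name is made up):

```rust
use std::fs::File;
use std::io::{BufRead, BufReader};

fn main() -> std::io::Result<()> {
    let file = File::open("values.json")?; // any Read will do here
    let reader = BufReader::new(file);     // wrapping it gives us BufRead
    for line in reader.lines() {
        println!("{}", line?);
    }
    Ok(())
}
```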


#7

Thanks for the pointer, @vorner.

Looks like the unicode_reader crate has an adaptor over Read::bytes():
https://docs.rs/unicode_reader/0.1.1/unicode_reader/struct.CodePoints.html

This should help with iterating over characters.
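
If I read the docs right, usage would be roughly like this (untested; I am
assuming CodePoints can be built with From from any Read, and that it yields
io::Result<char> items):

```rust
use std::io::Cursor;
use unicode_reader::CodePoints;

fn main() {
    // A Cursor over bytes stands in for stdin / a File / a socket here.
    let reader = Cursor::new("[2, 3] {\"name\": \"joe\"}".as_bytes());
    for ch in CodePoints::from(reader) {
        // Each item should be an io::Result<char> decoded from the UTF-8
        // byte stream.
        print!("{}", ch.unwrap());
    }
    println!();
}
```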