This is my rust crate debut, so any feedback is greatly appreciated.
BodyImage provides uniform access to HTTP body payloads which may be buffered in RAM or in a temporary file, with optional memory mapping. This trades some file I/O cost for support of significantly larger bodies without the risk of exhausting RAM.
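To illustrate the idea (this is a hypothetical sketch, not body-image's actual API or type names): a body buffers in RAM until it crosses a size threshold, then spills to a temporary file; a memory-mapped variant is noted but omitted since it would need an external crate.

```rust
use std::fs::File;
use std::io::{Seek, SeekFrom, Write};

// Hypothetical sketch only -- not body-image's actual API. It shows the
// core idea: buffer a body in RAM until it crosses a threshold, then
// spill it to a temporary file.
enum Body {
    Ram(Vec<u8>),
    FsTemp(File),
    // A memory-mapped read variant (e.g. via a crate like `memmap`) is omitted.
}

const MAX_RAM: usize = 8 * 1024; // illustrative spill threshold

impl Body {
    fn push(&mut self, chunk: &[u8]) -> std::io::Result<()> {
        match self {
            Body::Ram(buf) if buf.len() + chunk.len() <= MAX_RAM => {
                buf.extend_from_slice(chunk);
                Ok(())
            }
            Body::Ram(_) => {
                // Crossed the threshold: spill the buffered bytes plus the
                // new chunk to a temporary file and switch state.
                let buf = match std::mem::replace(self, Body::Ram(Vec::new())) {
                    Body::Ram(b) => b,
                    _ => unreachable!(),
                };
                let mut path = std::env::temp_dir();
                path.push("body_image_spill_demo.tmp");
                let mut f = File::options()
                    .read(true).write(true).create(true).truncate(true)
                    .open(path)?;
                f.write_all(&buf)?;
                f.write_all(chunk)?;
                *self = Body::FsTemp(f);
                Ok(())
            }
            Body::FsTemp(f) => f.write_all(chunk),
        }
    }
}

fn spill_demo() -> std::io::Result<(bool, u64)> {
    let mut body = Body::Ram(Vec::new());
    for _ in 0..5 {
        body.push(&[0u8; 4096])?; // 20 KiB total, crossing the 8 KiB threshold
    }
    let spilled = matches!(body, Body::FsTemp(_));
    let len = match &mut body {
        Body::Ram(buf) => buf.len() as u64,
        Body::FsTemp(f) => f.seek(SeekFrom::End(0))?,
    };
    Ok((spilled, len))
}

fn main() -> std::io::Result<()> {
    let (spilled, len) = spill_demo()?;
    println!("spilled to file: {spilled}, total bytes: {len}");
    Ok(())
}
```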
The BARC container file format, reader, and writer support high-fidelity serialization of complete HTTP request/response dialogs, with additional metadata, and have broad use cases such as test fixtures, caching, and web crawling. A barc command line tool is also available. A sample file of one record:
Such foundational building blocks really help other people "do it right"; I've seen my fair share of DoS and injection attacks via image payloads, so you have my appreciation for that reason alone.
As for feedback:
Your BARC container sounds like something that should be its own crate (one that body-image depends on), not just a submodule.
Serialisation of web requests can be useful outside the context of body-image too, so making it available as a stand-alone component is the right thing to do.
I also tried looking up this BARC format, but couldn't find anything in a few minutes of searching. Is this your own invention?
For my use case of BodyImage, I process ridiculously large bodies (100 MiB to 1 GiB). I'm excited by all the other RAM (and thus cost) savings in replacing Java with Rust, but keeping these bodies out of RAM, at least while they are being downloaded, is essential, not just a security concern.
The BARC format is my invention, yes, but inspired by and improved upon:
I have thus far resisted the urge to split this into two crates (or even three or four, for the hyper integration and CLI), because the features have co-evolved and the barc module currently uses some crate-private access to the root module, pending a satisfactory public interface. The barc module also doesn't add much bulk or many external dependencies. The split could still be done in the future, at the cost of an (albeit breaking) name change.
Thanks for the feedback.
Wow, that is large indeed!
Thanks for the links! That helps!
Is there anything specific you needed to improve/change in the existing formats?
Speaking of ridiculous request sizes: I am currently using s3cmd to upload 600 GB genomics/bioinformatics files to an on-site object store; the S3 protocol uses HTTP under the hood, so I sympathize.
Mind if I ask, out of curiosity, what your use-case is?
Besides just being simpler, BARC offers:
- application extensible metadata archiving
- complete/symmetric request and response archiving (e.g. including the request body of a POST)
- per-record compression, via gzip or Brotli, which is well suited for HTML, CSS, JS and other text/* formats.
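To make the shape of those features concrete, here is a purely hypothetical sketch, NOT the actual BARC wire format: a record with symmetric, length-prefixed metadata, request, and response blocks. In the real format, per-record gzip or Brotli compression would wrap each record independently; that is only noted in a comment here, since demonstrating it would need external crates.

```rust
use std::io::{self, Read, Write};

// Illustrative record layout -- NOT the actual BARC wire format. It shows
// the shape of the features listed above: an application-extensible
// metadata block plus symmetric request/response blocks. Because each
// record is self-contained, it could be compressed (gzip/Brotli) on its
// own without affecting neighboring records.
struct Record {
    meta: Vec<u8>,     // application-extensible metadata
    request: Vec<u8>,  // full request, headers + body (e.g. a POST body)
    response: Vec<u8>, // full response, headers + body
}

fn write_record<W: Write>(w: &mut W, r: &Record) -> io::Result<()> {
    // Each block is length-prefixed with a big-endian u32.
    for block in [&r.meta, &r.request, &r.response] {
        w.write_all(&(block.len() as u32).to_be_bytes())?;
        w.write_all(block)?;
    }
    Ok(())
}

fn read_record<R: Read>(rd: &mut R) -> io::Result<Record> {
    let mut blocks = Vec::new();
    for _ in 0..3 {
        let mut len = [0u8; 4];
        rd.read_exact(&mut len)?;
        let mut buf = vec![0u8; u32::from_be_bytes(len) as usize];
        rd.read_exact(&mut buf)?;
        blocks.push(buf);
    }
    let response = blocks.pop().unwrap();
    let request = blocks.pop().unwrap();
    let meta = blocks.pop().unwrap();
    Ok(Record { meta, request, response })
}

fn main() -> io::Result<()> {
    let rec = Record {
        meta: b"url: https://example.com/".to_vec(),
        request: b"POST / HTTP/1.1\r\n\r\nhello".to_vec(),
        response: b"HTTP/1.1 200 OK\r\n\r\nworld".to_vec(),
    };
    let mut buf = Vec::new();
    write_record(&mut buf, &rec)?;
    let back = read_record(&mut buf.as_slice())?;
    println!("{}", String::from_utf8_lossy(&back.response));
    Ok(())
}
```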
My current use case is crawling US government public records for application in data journalism.