How do I store data in my own binary format?

I want to store my custom structures in a compact format. Right now I use CSV, but I'm unsatisfied with its bulkiness; for now I use base-62 encoding for integers and Google's polyline algorithm for geospatial data, and store the result in CSV.

I've read the specs of some custom formats, and they describe them roughly like: 1) the first N bytes are a version number, 2) the next N bytes hold the author and some metadata, 3) the bytes starting at X give the offsets of blocks 1, 2, etc., 4) the start of block 1 says how many records follow, and each record is written as a struct laid out like so, etc.
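
Concretely, I picture something like this (just a sketch of the kind of layout I mean; the field sizes are made up, std only, little-endian):

use std::io::{self, Write};

// Hypothetical header: 2-byte format version, 2-byte block count,
// then one u64 file offset per block, all little-endian.
fn write_header(mut w: impl Write, version: u16, block_offsets: &[u64]) -> io::Result<()> {
    w.write_all(&version.to_le_bytes())?;
    w.write_all(&(block_offsets.len() as u16).to_le_bytes())?;
    for offset in block_offsets {
        w.write_all(&offset.to_le_bytes())?;
    }
    Ok(())
}

fn main() -> io::Result<()> {
    let mut file = std::fs::File::create("data.bin")?;
    write_header(&mut file, 1, &[64, 4096])?;
    Ok(())
}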

I thought it would be nice to save data like this, but apparently the only options are:

  1. Write my own code: a lot of it, likely buggy, etc.

  2. Protobuf. Lots of intermediate generated code. It also seems suited to small messages rather than big files with lots of records. Some formats use their own custom structure with Protobuf messages embedded in blobs (which is essentially doing #1, offloading only part of the work onto protobuf).

  3. BSON. I tried a crate for it. It can only handle core Rust types, plus it works incredibly slowly. (Edit: I tested it against CSV, both in release mode; CSV was faster.)

I've heard a talk claiming that Serde is so powerful it can save to any format. Is there really no binary format backend for Serde? (I understand it would require some special treatment, like defining how to store enums and probably specifying a format version, etc., but I'd be willing to do that.)

You have a range of choices for binary data formats. Overview · Serde
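
The nice part is that the model code stays the same across serde backends and only the two (de)serializer calls change. A minimal sketch, assuming the bincode 1.x API and a made-up Record type:

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug)]
struct Record {
    id: u64,
    name: String,
    coords: Vec<(f64, f64)>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let rec = Record { id: 1, name: "a".into(), coords: vec![(55.7, 37.6)] };
    // bincode 1.x: serialize to a compact, non-self-describing byte vector.
    let bytes = bincode::serialize(&rec)?;
    let back: Record = bincode::deserialize(&bytes)?;
    println!("{} bytes, roundtrip ok: {:?}", bytes.len(), back);
    Ok(())
}

Swapping the two bincode calls for another serde backend (MessagePack, CBOR, etc.) doesn't change the struct definitions at all.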

BSON should work for any type implementing serde's traits? (their own example)


BSON definitely can't only handle primitive types. Also, define "slowly". How do you know it's slow? How slow is it? Are you compiling and running your code in release (i.e., optimized) mode?

Also, you are presenting BSON as if it were the only binary format. That's not even remotely the case. There are also MessagePack and CBOR, to mention only the two most popular. (I also designed a binary format with de-duplication of struct field names in mind, which makes it even smaller than MessagePack on average.)
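
To make that concrete, they all plug into the same derived traits; a rough sketch, assuming the bson 2.x to_vec/from_slice helpers and rmp-serde, with a made-up Point type:

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct Point {
    id: u32,
    lat: f64,
    lon: f64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let p = Point { id: 7, lat: 55.75, lon: 37.62 };

    // BSON (bson 2.x): works for any Serialize/Deserialize type, not just primitives.
    let bson_bytes = bson::to_vec(&p)?;
    let p2: Point = bson::from_slice(&bson_bytes)?;

    // MessagePack (rmp-serde): usually noticeably smaller than BSON.
    let msgpack_bytes = rmp_serde::to_vec(&p)?;
    let p3: Point = rmp_serde::from_slice(&msgpack_bytes)?;

    assert_eq!(p, p2);
    assert_eq!(p, p3);
    println!("bson: {} bytes, msgpack: {} bytes", bson_bytes.len(), msgpack_bytes.len());
    Ok(())
}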


Incidentally, the following statement of yours makes me wonder whether what you are looking for is really serialization to a file:

If you have "big files" with "lots of records", then you should probably use a real database, rather than serialize everything to an ad-hoc data exchange format that can only be queried and updated by serializing and parsing the whole dataset at once.

  • msgpack has variable-length integers.
  • bincode is quite simple and fast to serialize/deserialize.

You can also use whatever format and then brotli it.
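
For example, bincode output can be piped through the brotli crate via the std Write/Read traits; a sketch assuming bincode 1.x and brotli's CompressorWriter/Decompressor, with a made-up Track type:

use serde::{Deserialize, Serialize};
use std::io::{Read, Write};

#[derive(Serialize, Deserialize, Debug)]
struct Track {
    id: u64,
    points: Vec<(f64, f64)>,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let track = Track { id: 42, points: vec![(55.75, 37.62); 1000] };

    // Serialize with bincode, then compress the bytes with brotli.
    let raw = bincode::serialize(&track)?;
    let mut compressed = Vec::new();
    {
        // (buffer size, quality 0-11, log2 of window size)
        let mut w = brotli::CompressorWriter::new(&mut compressed, 4096, 9, 22);
        w.write_all(&raw)?;
    } // dropping the writer finishes the brotli stream

    // Decompress and deserialize again.
    let mut decompressed = Vec::new();
    brotli::Decompressor::new(&compressed[..], 4096).read_to_end(&mut decompressed)?;
    let back: Track = bincode::deserialize(&decompressed)?;

    println!("raw: {} bytes, brotli: {} bytes, id={}", raw.len(), compressed.len(), back.id);
    Ok(())
}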


I almost always run in release mode. IIRC, BSON was slower than Serde CSV.

At least it seemed like this after a bit of research. I'll give a look at these, thanks!

That's a very good question; I ended up writing almost an essay. :slight_smile: Over the last 10 years of working with geospatial data I've gradually gravitated from storing data in databases (PostGIS, SQLite) to simple files (FlatGeobuf and CSV). Even when I had a database, I had a makefile with lines like:

t/db_created.touch: sql/db_struct.sql
    psql $(connect_params) -f $< && touch $@

t/some_heavy_calc.touch: sql/calculation.sql ... 
    psql $(...) -f $< && touch $@

and the calculation files were like

drop table if exists ... cascade;
create table ... as (select ...)

The reason I'd drop or re-create the DB entirely was that the data stayed the same while the calculation logic changed significantly. Running the entire pipeline took too long, so I only needed to re-do the single step that had changed, hence the Makefile. (BTW, I couldn't find another tool that can track what does and doesn't need updating like this!)

We also leaned towards this approach when we needed a history of datasets and preferred to keep whole snapshots rather than figure out ways to keep historic records alongside the current ones. Projects with interconnected data (like a road graph, or a gravitational model between consumers and shops) also gravitate towards whole-dataset updates, which are a lot easier to keep and track in flat files.

DBs work well in projects where the logic is stable and the data is huge, while updates are tiny compared to the entire dataset, so incremental updates make sense.

SQLite isn't bad as a basic file format. The way it encodes rows is relatively space-efficient; it's self-describing, so you won't run into issues where an old file can't be read because you can't figure out how it was encoded; and it's very reliable.

If you ever need to search through your data but don't need to load the whole thing into memory, SQLite could be a great choice. If you don't ever need to do that, it's a bit less useful.

It probably won't be as fast to read and write the data as a custom format would be, but if you use prepared statements I don't imagine it would be prohibitively slow.
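
For a rough idea, with the rusqlite crate that could look like the sketch below (made-up records table; a prepared statement reused inside one transaction):

use rusqlite::{params, Connection};

fn main() -> rusqlite::Result<()> {
    // Opening the file creates it if it doesn't exist yet.
    let mut conn = Connection::open("records.sqlite")?;
    conn.execute(
        "CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, lat REAL, lon REAL)",
        params![],
    )?;

    // Reusing a prepared statement inside a transaction keeps bulk inserts fast.
    let tx = conn.transaction()?;
    {
        let mut insert = tx.prepare("INSERT INTO records (lat, lon) VALUES (?1, ?2)")?;
        for i in 0..1_000 {
            insert.execute(params![55.0 + i as f64 * 1e-4, 37.6])?;
        }
    }
    tx.commit()?;

    // Query without loading the whole file into memory.
    let mut stmt = conn.prepare("SELECT id, lat, lon FROM records WHERE lat > ?1 LIMIT 5")?;
    let rows = stmt.query_map(params![55.05], |row| {
        Ok((row.get::<_, i64>(0)?, row.get::<_, f64>(1)?, row.get::<_, f64>(2)?))
    })?;
    for row in rows {
        println!("{:?}", row?);
    }
    Ok(())
}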


I had the impression that SQLite is too tedious to use from Rust. Is it?

I did use SQLite in Python, even as persistent storage for collections (this package gives a nice key-value API: you can make a persistent dictionary stored in SQLite, Mongo, etc.).

Look at rkyv. It's not the smallest format, but in my research it was one of the fastest and very memory-efficient.

(GitHub - rkyv/rkyv: Zero-copy deserialization framework for Rust)
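
A minimal sketch of what using it looks like, assuming the rkyv 0.7 API and its validation feature; reads go through the archived type in place, without a separate deserialization pass:

use rkyv::{Archive, Deserialize, Serialize};

#[derive(Archive, Serialize, Deserialize, Debug, PartialEq)]
// check_bytes enables the validated rkyv::check_archived_root accessor
// (requires rkyv's "validation" feature).
#[archive(check_bytes)]
struct Record {
    id: u64,
    name: String,
}

fn main() {
    let rec = Record { id: 1, name: "hello".to_string() };

    // Serialize into an aligned byte buffer; 256 is the scratch-space size.
    let bytes = rkyv::to_bytes::<_, 256>(&rec).expect("serialize");

    // Zero-copy access: validate the buffer and read fields in place.
    let archived = rkyv::check_archived_root::<Record>(&bytes[..]).expect("validate");
    assert_eq!(archived.id, 1);
    assert_eq!(archived.name.as_str(), "hello");

    // A full owned Record can still be recovered if needed.
    let back: Record = archived.deserialize(&mut rkyv::Infallible).expect("deserialize");
    assert_eq!(back, rec);
}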


If you're used to having an ORM create queries for you, yeah, it's definitely more tedious in Rust at the moment. If you're writing the queries yourself, it's basically the same.

There is also the postcard format, which I like.
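
It's serde-based too; a tiny sketch, assuming its to_allocvec/from_bytes helpers (the alloc feature) and a made-up Sample type:

use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct Sample {
    id: u32,
    values: Vec<i16>,
}

fn main() -> Result<(), postcard::Error> {
    let s = Sample { id: 9, values: vec![1, -2, 3] };
    // Varint-encoded, very compact, not self-describing.
    let bytes = postcard::to_allocvec(&s)?;
    let back: Sample = postcard::from_bytes(&bytes)?;
    assert_eq!(s, back);
    println!("{} bytes", bytes.len());
    Ok(())
}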


It sounds like you're actually looking for a symmetrical version of nom_derive - Rust, similar to another thread here a week or two ago, so perhaps there is some missing crate here that lets you fine-tune your binary format locally.

You should also be able to use serde and write your own de/serializer, but at that point you're not getting much over using a low-level, non-self-describing binary format that's already supported, since you can't use local information as easily (you basically have to play games with implementing de/serialize (not -izer)).
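
To illustrate the "games with implementing de/serialize" part: you can hand-control the wire shape of one type while leaving everything else derived. A sketch (the Coord type and the fixed-point trick are made up for illustration):

use serde::ser::SerializeTuple;
use serde::{Serialize, Serializer};

// Made-up type: store coordinates as fixed-point i32 (1e-5 degrees) on the
// wire instead of two f64s, so a compact binary backend writes 8 bytes, not 16.
struct Coord {
    lat: f64,
    lon: f64,
}

impl Serialize for Coord {
    fn serialize<S: Serializer>(&self, serializer: S) -> Result<S::Ok, S::Error> {
        let mut tup = serializer.serialize_tuple(2)?;
        tup.serialize_element(&((self.lat * 1e5).round() as i32))?;
        tup.serialize_element(&((self.lon * 1e5).round() as i32))?;
        tup.end()
    }
}

fn main() {
    let c = Coord { lat: 55.75396, lon: 37.62039 };
    // Any serde backend works; bincode 1.x shown here as an example.
    let bytes = bincode::serialize(&c).unwrap();
    println!("{} bytes", bytes.len()); // 8 with bincode's default fixed-int encoding
}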


Flexbuffers seems to solve the problem. Thanks!

This may be a little off topic, but it looks like in addition to a data format you also need some kind of data pipeline, like dvc.

Is there any Rust equivalent?


No, it's not particularly tedious to use. What makes you think so?

I saw these examples: one set from here, and this:

use sqlx::{migrate::MigrateDatabase, Row, Sqlite, SqlitePool};

const DB_URL: &str = "sqlite://sqlite.db";

#[tokio::main]
async fn main() {
    if !Sqlite::database_exists(DB_URL).await.unwrap_or(false) {
        println!("Creating database {}", DB_URL);
        match Sqlite::create_database(DB_URL).await {
            Ok(_) => println!("Create db success"),
            Err(error) => panic!("error: {}", error),
        }
    } else {
        println!("Database already exists");
    }

    let db = SqlitePool::connect(DB_URL).await.unwrap();

    let result = sqlx::query("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY NOT NULL, name VARCHAR(250) NOT NULL);").execute(&db).await.unwrap();
    println!("Create user table result: {:?}", result);

    let result = sqlx::query(
        "SELECT name
         FROM sqlite_schema
         WHERE type ='table' 
         AND name NOT LIKE 'sqlite_%';",
    )
    .fetch_all(&db)
    .await
    .unwrap();

    for (idx, row) in result.iter().enumerate() {
        println!("[{}]: {:?}", idx, row.get::<String, &str>("name"));
    }
}

from here

Or is this from too low-level a crate?

sqlx is a query mapper, so it's an additional layer of abstraction above whatever database you happen to be using. The code doesn't seem too bad to me, though. If you want a lower-level, SQLite-specific wrapper, try rusqlite.


By the way, the example code above uses the "if not exists, create" anti-pattern, which is vulnerable to TOCTOU (time-of-check to time-of-use) races. That makes me question its quality as a whole.
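
For the record, sqlx can avoid that pattern entirely; a sketch assuming SqliteConnectOptions::create_if_missing:

use sqlx::sqlite::{SqliteConnectOptions, SqlitePool};

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    // Let SQLite create the file on open instead of checking for it first.
    let options = SqliteConnectOptions::new()
        .filename("sqlite.db")
        .create_if_missing(true);
    let db = SqlitePool::connect_with(options).await?;

    // A single idempotent statement; no separate existence check needed.
    sqlx::query(
        "CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY NOT NULL, name TEXT NOT NULL)",
    )
    .execute(&db)
    .await?;

    Ok(())
}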


It depends. SQLite is often used as a local settings store for a single user, and this is how you would do that.
