Configuration file formats

I try to use toml as much as possible, because it feels like something the Rust ecosystem has mostly settled on as a default format, and I think most sysadmins have encountered .ini files at some point in their career. However, I'm working on something that uses a highly hierarchical configuration format -- and noticed that this is a weak spot for toml.

I checked through what formats serde supports and found something called "ron", which felt a little verbose, but it's explicit in a way that I like -- or so I thought. Apparently to set something that's an Option in Rust it must be specified as a Some in the ron buffer. That's a little too explicit for me. Or rather, I personally could live with it but I'm certain my non-Rust-programming users would be confused by the syntax.

I loathe yaml with a flaming passion; however, it is very hierarchical by its nature, so I might use that. Before I do: are there any other formats suitable for configuration files that you think are uncontroversial to use, both with regard to the Rust community and for non-programmers? With serde support, of course.

1 Like

I was going to suggest YAML (that's what we use for our customers) but...

I have used INI files for hierarchical configuration. I know it works well for certain kinds of structure.

Instead of jumping straight to the what-file-format question, maybe you'd be better served by discussing the nature of the data. That discussion may reveal something you (and we) had not considered.

1 Like

The user needs to define a "service", which contains one or more "interfaces", each of which contains one or more "binds". All of these can contain other key/value pairs; some of the values are required and some are optional.

Conceptually:

service:
  id: "1234-5678"
  name: "some_service"
  interface:
    id: "8765-4321"
    name: "some_interface"
    bind:
      address: "192.168.1.1:8888"
      register: true
    bind:
      address: "127.0.0.1:2000"
      register: false
  interface:
    id: "1111-2222"
    name: "yet_another_interface"
    bind:
      address: "192.168.1.1:9999"
      register: true
service:
  id: "4444-5555"
  name: "more_service"
  interface:
    id: "7777-8888"
    bind:
      address: "10.0.0.1:11111"
      register: true

I'm leaning toward yaml, despite my dislike for it, because it seems to fit the internal format the best. But I would really like to know if there's a better option.
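Whatever file format ends up being chosen, the shape described above could be modelled on the Rust side roughly as follows. This is only a sketch: the type names `Service`, `Interface` and `Bind`, the helper `check_required`, and the choice of which fields are optional are assumptions for illustration (the listing only shows that an interface may lack a name). With serde one would presumably add `#[derive(Deserialize)]` to these types; that is left out here so the sketch depends only on the standard library.

```rust
// Hypothetical types mirroring the conceptual listing; not taken from any real crate.
#[derive(Debug, Clone, PartialEq)]
struct Bind {
    address: String,
    register: bool,
}

#[derive(Debug, Clone, PartialEq)]
struct Interface {
    id: String,
    // Assumed optional: one interface in the listing has no name.
    name: Option<String>,
    binds: Vec<Bind>,
}

#[derive(Debug, Clone, PartialEq)]
struct Service {
    id: String,
    // Assumed required.
    name: String,
    interfaces: Vec<Interface>,
}

/// Checks the "one or more" rules, which a plain Vec cannot express by itself.
/// Returns one message per violation.
fn check_required(services: &[Service]) -> Vec<String> {
    let mut problems = Vec::new();
    for service in services {
        if service.interfaces.is_empty() {
            problems.push(format!("service {} has no interfaces", service.id));
        }
        for interface in &service.interfaces {
            if interface.binds.is_empty() {
                problems.push(format!(
                    "interface {} of service {} has no binds",
                    interface.id, service.id
                ));
            }
        }
    }
    problems
}

/// Builds the second service from the listing by hand.
fn sample_services() -> Vec<Service> {
    vec![Service {
        id: "4444-5555".to_string(),
        name: "more_service".to_string(),
        interfaces: vec![Interface {
            id: "7777-8888".to_string(),
            name: None,
            binds: vec![Bind {
                address: "10.0.0.1:11111".to_string(),
                register: true,
            }],
        }],
    }]
}
```

The repeated `interface:` and `bind:` keys in the listing map onto `Vec` fields here, which is why a format with first-class lists of tables or nodes tends to fit this data more naturally.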

Is this a proper use of Serde? My knowledge may be outdated, but, last I heard, Serde was meant to serialize a program's state and then deserialize that same state: it was meant to parse input that was generated by itself or another piece of software, not to parse something a user wrote, a consequence of which is that its handling of errors in the input is somewhat an afterthought and not user-friendly. Maybe this has changed in the last few years; I don't know.

(Edit: My opinion on the actual question might be unhelpful, as I like YAML and wouldn't think to use anything else. 🙂)

(Edit 2: I guess I might think to use JSON instead if YAML support were too heavyweight for where the program needs to fit.)

Are names required?

If so, you could consider flattening it by having interfaces defined separately, and referenced by name from the services, or similar.

Dunno if that'd really be better, but it'd give more options for formats.

Interface 7777-8888 has no name, but they do all have ids.

1 Like

Is that triple typically distinct or does there tend to be a lot of duplication?

For example, pick one bind configuration. Is that bind configuration often duplicated throughout? Or, does that bind configuration tend to appear just once? Ditto for services and interfaces.

I am a recent convert away from loathing yaml myself, but only because there's one redeeming quality to it. It's a superset of JSON, so this YAML works:

# YAML is a way to comment JSON.
{
  "foo": "bar",
  "baz": true  # Set to false if you dare.
}

And, weirdly, so does this:

- FooConfig:
    {
        "foo": "bar",
        "baz": true  # Set to false if you dare.
    }
- BarConfig:
    {
        "bar": "baz",
        "baz": false # Set to true if you like.
    }

I'd probably choose JSON in your position, or YAML-as-JSON for comments.

1 Like

This is definitely not correct. Why couldn't you parse something a user wrote? I think you should be using Serde for reading and writing configs, because it's well-supported across the ecosystem, and many popular formats have serde parsers/writers.

1 Like

I respectfully disagree.

At work, we use a YAML file for something that is almost always written by a human and the actual experience is terrible. Serde doesn't provide any way to get the location of something in its source text, and previous proposals have been rejected as out-of-scope, which means if there are any semantic issues with the document you have no way to say "there's an error on this line over here" like rustc does.

For a file that is meant to be written by humans and will be later checked for higher-level consistency (e.g. to make sure there is a particular relationship between different items in the document), not having any way to track where each element of your document comes from is a deal breaker.

Additionally, the error messages you get from #[derive(Deserialize)] when something fails to deserialize are fine for a programmer, but not something you would ever want to show a user. If you want to get nice errors from deserializing it often means writing the Deserialize implementations yourself, which is impractical and goes against a lot of the reasons for using serde in the first place.

3 Likes

YAML looks perfectly suited for this kind of data, but I can appreciate a distaste for it.
You might enjoy StrictYAML (from HitchDev).

There's a crate for it, because of course there is 😉
https://crates.io/crates/strict-yaml-rust
This makes me think that the Python developer of StrictYAML is a Rustacean kindred spirit: "The Norway Problem - why StrictYAML refuses to do implicit typing and so should you" (HitchDev).

I've not actually used either StrictYAML or that rust implementation, but I must have made a mental note when stumbling across it because your question made me think to search for it.

1 Like

There is an extension for ron, which you can enable by adding #![enable(implicit_some)] to the file.

Unfortunately, it also doesn't emit spans, like Michael-F-Bryan mentions above.

I don't think you are right at all.

But it doesn't need to, does it? It's the parser's job to track source location – it's the only one that knows the low-level format, after all.

Just because serde-yaml doesn't currently do it, it's not impossible. It could implement such functionality if it chose to do so. In fact I have written a Serde format that does just that – it gives out byte-exact, Unicode-aware locations for every parser error.
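To make "the parser tracks source location" a bit more concrete: a parser that knows the byte offset of a problem can turn it into a human-readable position with a few lines of standard-library code. The sketch below is not the implementation mentioned above; `line_col` is a hypothetical helper, and "Unicode-aware" is taken here to mean that columns count Unicode scalar values rather than bytes (grapheme clusters are not handled).

```rust
/// Converts a byte offset into a 1-based (line, column) pair.
/// Columns count Unicode scalar values, not bytes.
/// Returns None if the offset is out of range or falls inside a multi-byte character.
fn line_col(source: &str, byte_offset: usize) -> Option<(usize, usize)> {
    if !source.is_char_boundary(byte_offset) {
        return None;
    }
    let before = &source[..byte_offset];
    // Every newline before the offset starts a new line.
    let line = before.matches('\n').count() + 1;
    // The current line begins just after the last newline, or at the start of the input.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let column = before[line_start..].chars().count() + 1;
    Some((line, column))
}
```

For example, with the input `"näme: 1\nbind: oops"`, the offset of `oops` maps to line 2, column 7.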

2 Likes

https://lib.rs/crates/serde_path_to_error

I bet something like this could also be implemented for other serde implementations

Edit: someone already did that in https://lib.rs/crates/format_serde_error, which I wasn't aware of and which seems pretty good.

1 Like

My point is that it does.

I think you might have misunderstood me - what I am referring to are semantic errors, not syntactic errors like invalid YAML or missing fields which would cause deserialize() to fail immediately.

For example, imagine you are using a YAML file to describe some sort of DAG (e.g. a data processing pipeline or orchestration tool where services might depend on another having already started) where an input node creates data and a processing node reads data from some input node and does stuff with it. The DAG's edges are denoted using an inputs array on processing nodes.

first:
  type: input
second:
  type: processing
  inputs:
  - non-existent-node
third:
  type: processing
  inputs:
  - third

This file has a couple of semantic errors:

  • The second node receives inputs from the non-existent-node node, but no such node was defined
  • The third node receives input from itself, which creates a loop in the DAG and doesn't make sense

You can write a document type which uses #[derive(serde::Deserialize)] to deserialize from YAML/JSON/whatever, however this will only detect syntactic errors. You would need to do a separate post-processing stage to make sure the overall document is semantically valid (possibly accessing external resources like checking if a file exists) - similar to why rustc has distinct parsing and type checking phases.

You then end up in a sticky situation where, because serde has no mechanism for retrieving the location of an arbitrary element (not just a parsing error), and because proposals have been rejected in both serde_yaml specifically and a more general mechanism in serde proper, it's not possible to get nice rustc-style error messages which provide line numbers and let you attach hints that refer to other sections of the config file.
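For illustration, the separate post-processing stage for the pipeline example might look something like the sketch below. It assumes the document has already been deserialized into plain Rust values; `NodeKind`, `SemanticError`, `validate` and `example_document` are hypothetical names, and the example document is built by hand so the snippet needs only the standard library. Note that the errors can only name nodes, not point at lines, which is exactly the limitation being described.

```rust
use std::collections::BTreeMap;

// Hypothetical in-memory form of the YAML document above.
#[derive(Debug)]
enum NodeKind {
    Input,
    Processing { inputs: Vec<String> },
}

#[derive(Debug, PartialEq)]
enum SemanticError {
    /// A processing node lists an input that is not defined anywhere.
    UnknownInput { node: String, input: String },
    /// A processing node lists itself as an input.
    SelfReference { node: String },
}

/// Runs after deserialization has succeeded and reports every semantic problem found.
fn validate(nodes: &BTreeMap<String, NodeKind>) -> Vec<SemanticError> {
    let mut errors = Vec::new();
    for (name, kind) in nodes {
        if let NodeKind::Processing { inputs } = kind {
            for input in inputs {
                if input == name {
                    errors.push(SemanticError::SelfReference { node: name.clone() });
                } else if !nodes.contains_key(input) {
                    errors.push(SemanticError::UnknownInput {
                        node: name.clone(),
                        input: input.clone(),
                    });
                }
            }
        }
    }
    errors
}

/// The three-node document from the YAML example, constructed by hand.
fn example_document() -> BTreeMap<String, NodeKind> {
    let mut nodes = BTreeMap::new();
    nodes.insert("first".to_string(), NodeKind::Input);
    nodes.insert(
        "second".to_string(),
        NodeKind::Processing { inputs: vec!["non-existent-node".to_string()] },
    );
    nodes.insert(
        "third".to_string(),
        NodeKind::Processing { inputs: vec!["third".to_string()] },
    );
    nodes
}
```

Calling `validate(&example_document())` yields one `UnknownInput` for `second` and one `SelfReference` for `third`. This sketch only catches direct self-references; longer loops would need a real cycle check.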

2 Likes

I'll add that even the handling of these errors by failing immediately provides a suboptimal user experience compared to collecting all the syntactic errors in the file to display at once (like a typical compiler), rather than making the user restart a server or otherwise reload the configuration after fixing each error to see the next error.

I don't mean that Serde should collect errors like that (which I imagine would pessimize it for reading actual serialized program state) but that Serde makes trade-offs that favor reading actual serialized program state over user-written configuration.
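The two strategies can be contrasted in a small sketch that validates bind addresses from a config using only the standard library. The function names `parse_fail_fast` and `parse_collect_all` are made up for this example, and the error wording is just a placeholder.

```rust
use std::net::SocketAddr;

/// Stops at the first bad address, so the user sees one problem per run.
fn parse_fail_fast(addresses: &[&str]) -> Result<Vec<SocketAddr>, String> {
    addresses
        .iter()
        .map(|a| {
            a.parse::<SocketAddr>()
                .map_err(|e| format!("bind address {:?}: {}", a, e))
        })
        .collect()
}

/// Keeps going after a failure and reports every bad address at once.
fn parse_collect_all(addresses: &[&str]) -> Result<Vec<SocketAddr>, Vec<String>> {
    let mut parsed = Vec::new();
    let mut errors = Vec::new();
    for a in addresses {
        match a.parse::<SocketAddr>() {
            Ok(addr) => parsed.push(addr),
            Err(e) => errors.push(format!("bind address {:?}: {}", a, e)),
        }
    }
    if errors.is_empty() {
        Ok(parsed)
    } else {
        Err(errors)
    }
}
```

Given a list with two bad entries, the first function reports only the first of them, while the second reports both in one pass, which saves the user a reload per error.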

(Edit: maybe moderators should split this discussion about Serde out of this topic about configuration file formats.)

2 Likes

However, wouldn't one need to perform such post-processing anyway? No configuration file parsing library can hope to detect arbitrary domain-specific semantic errors, so some manual work is surely required in such cases.

Furthermore, while it is theoretically nice to be able to report such errors with source location, if the configuration has deep semantics, then it likely also provides a means of identifying entities by something more meaningful than line/column. So instead of saying "cycle detected at line X character Y", the type-checker could say "entities Foo, Bar and Baz form a cycle which isn't allowed". This would arguably be even better and more human-friendly than spitting out a pair of numbers.
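As a rough sketch of what such a name-based report could look like, the following finds one dependency cycle with a depth-first search and describes it by entity name. The function names `find_cycle` and `describe_cycle` and the message wording are assumptions for illustration; the dependency map is taken to be already extracted from the deserialized config.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Returns the names on one dependency cycle, if any exists.
fn find_cycle(deps: &BTreeMap<String, Vec<String>>) -> Option<Vec<String>> {
    fn visit(
        node: &str,
        deps: &BTreeMap<String, Vec<String>>,
        done: &mut BTreeSet<String>,
        path: &mut Vec<String>,
    ) -> Option<Vec<String>> {
        // Meeting a node that is already on the current path closes a cycle.
        if let Some(pos) = path.iter().position(|n| n.as_str() == node) {
            return Some(path[pos..].to_vec());
        }
        // Nodes that were fully explored earlier cannot lead to a new cycle.
        if done.contains(node) {
            return None;
        }
        path.push(node.to_string());
        for next in deps.get(node).into_iter().flatten() {
            if let Some(cycle) = visit(next, deps, done, path) {
                return Some(cycle);
            }
        }
        path.pop();
        done.insert(node.to_string());
        None
    }

    let mut done = BTreeSet::new();
    let mut path = Vec::new();
    for node in deps.keys() {
        if let Some(cycle) = visit(node, deps, &mut done, &mut path) {
            return Some(cycle);
        }
    }
    None
}

/// Formats a cycle using entity names rather than line and column numbers.
fn describe_cycle(cycle: &[String]) -> String {
    match cycle {
        [] => String::from("no cycle"),
        [only] => format!("entity {} depends on itself, which is not allowed", only),
        [init @ .., last] => format!(
            "entities {} and {} form a cycle, which is not allowed",
            init.join(", "),
            last
        ),
    }
}
```

The order in which the names appear depends on where the search happens to enter the cycle, so the message lists the same entities but not necessarily starting from the one the user would expect.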

While this is orthogonal to the point being discussed, as a frequent user of command-line tools, I very much disagree with this. For instance, I wish rustc stopped after the first error instead of its current strategy of "you forgot to import this item but it's used in a macro so here's 1 MB of error messages". It's very distracting and overwhelming, and it doesn't add any value, since I can't focus on fixing more than one error at a time anyway.

1 Like

I'm interested to see this perspective. I suppose it depends on how fast or slow it is to re-run a program to get another error (potentially many times) and how many errors are redundant (would be fixed by fixing another error), in addition to how many errors are reported in total. I don't know how these variables look for the OP's program.

(For me, running rustc is slow enough to be irritating and the errors go in a side pane in my Emacs, where I can page around and easily navigate to their source locations. I suppose things are different for you?)

Just as a note: dhall might be a solution to the problem. I have no experience with it whatsoever, but wanted to throw it into the discussion in case OP doesn't know it.

1 Like

Similarly, KDL may be worth looking into - the reference implementation is written in Rust (and specifically aims to give good error messages), and it works very nicely with hierarchical data.

4 Likes