On using procedural macros to implement config files

TLDR

Have you any wisdom to offer on using Rust source as config files, perhaps supported by proc macros?

Background

I have an ever-growing bunch of utilities written in Rust, ranging from tiny to huge, which are chained together as separate steps in scientific simulations and data analysis. These tend to start off with a clap-based CLI, which becomes more complicated over time as the need for more parameters, tweaks and use cases emerges.

In one of these, the CLI got so complex and unwieldy that I replaced it with a configuration file written in TOML. However, this feels unsatisfactory, clunky and unergonomic. So I'm wondering whether it would make sense to use Rust source code for the config, perhaps supported by some procedural macros.

Cons

Run-time dependence on Rust compiler

Using Rust source as config files would require the user to have a Rust compiler available in order to run the code.

This is research code, so it is continually in flux as new requirements are discovered and implemented on an almost daily basis. Hence, the compiler is already a de facto requirement today. But this might become a problem in the future, when we have explored the problem space more thoroughly and want to provide a working product for others to use.

Pros

Liberation from TOML / Direct integration with Rust type system

The complexity of the information that needs to be expressed in the config file keeps growing. Coming up with mappings between TOML and what is usually a Rust datatype, implementing the conversion and testing various constraints is quite a pain, even with the support of serde.

Being able to simply write Rust expressions seems appealing in this context.

Extra compile-time feedback in IDE

Rust-source configs would get edit-time feedback from rust-analyzer, which is going to be orders of magnitude more useful than anything we could get in TOML.

As the code is in flux, the config file parser needs to keep up with the code's evolution. At the moment, divergences between the two only show up at run-time. Yes, there are tests, but these require extra effort to maintain, and they still give less direct feedback, and give it later, than Rust-source configs would.

Enhanced feedback via proc macros

One feature of these config files is that they must refer to data files in very long-winded locations. It is most irksome to launch a process and only then to discover typos in these names.

Proc macros should be able to issue errors right inside the editor, if some specified input file is missing, or the destination of some output file is not writable.

The questions

Can you see any other pros or cons?

Is this a crazy idea: will it cause more trouble than it's worth?

Are you aware of any relevant prior art?

Unless you are doing something really weird, you don't (shouldn't) need to "implement conversions" when working with Serde. That's exactly the purpose of the Serde data model – #[derive(Serialize, Deserialize)] will define mappings that convert seamlessly between the dynamically-typed serialized data and the statically-typed Rust data structures.

Can you give a specific example of how this does not work with your setup?

This doesn't seem like something that can only be mitigated using a proc-macro. Why couldn't you validate the passed config very early at runtime? E.g., I could imagine a Deserialize impl for a newtype that wraps PathBuf and only succeeds at deserialization if the file at the path exists, is writable, etc.
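A minimal sketch of that newtype idea, using only the standard library so that it stands on its own. The names `ExistingFile` and `MissingFile` are hypothetical and not taken from the code under discussion. The validation lives in a `TryFrom<PathBuf>` impl; with serde, such an impl can be hooked into deserialization through the container attribute `#[serde(try_from = "PathBuf")]`, which requires the error type to implement `Display`.

```rust
use std::convert::TryFrom;
use std::fmt;
use std::path::{Path, PathBuf};

/// Hypothetical newtype: a path that was checked, at construction time,
/// to point at an existing regular file.
#[derive(Debug, Clone, PartialEq)]
pub struct ExistingFile(PathBuf);

/// Hypothetical error type carrying the offending path.
#[derive(Debug, PartialEq)]
pub struct MissingFile(PathBuf);

impl fmt::Display for MissingFile {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "input file does not exist: {}", self.0.display())
    }
}

impl std::error::Error for MissingFile {}

impl TryFrom<PathBuf> for ExistingFile {
    type Error = MissingFile;

    /// Succeeds only if `path` names an existing regular file.
    fn try_from(path: PathBuf) -> Result<Self, Self::Error> {
        if path.is_file() {
            Ok(ExistingFile(path))
        } else {
            Err(MissingFile(path))
        }
    }
}

impl ExistingFile {
    /// Borrow the validated path.
    pub fn as_path(&self) -> &Path {
        &self.0
    }
}
```

Any config field of type `ExistingFile` would then fail while the config is being loaded, before the long-running work starts. The check is still a run-time check, and the file could disappear between loading the config and using it.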

In general, using a full programming language for configuration is ill-advised from a security standpoint. The list of security vulnerabilities related to allowing arbitrary code to be executed in a "configuration" file is endless. In principle, I should be telling you that "it's OK if you trust all users, since it's research code", but the sad truth is that "research" code inevitably ends up in production, so no, it's probably not a great idea to allow all of Rust to be run in a config file.

A mapping which is verified at run-time. There is value in having it verified at compile-time and even edit-time, which comes for free by writing the config in Rust.

My memory of the pain points is vague at the moment, but looking through the code the things that stand out as annoying are:

  1. I'm using uom to encode physical dimensions in the type system. Explaining to serde how to parse values with such units is an exercise in writing boilerplate by hand (with a signal-to-noise ratio of about 6%). (Perhaps this one can be mitigated with some macro_rules!.)

  2. Explaining to serde which things are mandatory / optional / forbidden, depending on what other things are set, was a pain.

  3. Having to write a bunch of tests that check that all this does what it's supposed to. Much of this would be verified by the type system if the config were written in Rust, or there would be nothing to verify because there would be no translation code.
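To make points 2 and 3 concrete, here is a rough sketch of what a Rust-source config could look like. Everything in it is hypothetical: `Metres` is a hand-written stand-in for a uom quantity, and `Source`, `Config` and `config()` are invented names for illustration, not taken from the real code. The idea is that each enum variant carries exactly the fields that are mandatory for it, so "forbidden" combinations cannot be written down at all.

```rust
/// Hypothetical stand-in for a uom length quantity.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Metres(pub f64);

/// Each variant lists exactly the fields that are mandatory for it;
/// optional ones are `Option`s, and fields that would be "forbidden"
/// for a variant simply do not exist on it.
#[derive(Debug, Clone, PartialEq)]
pub enum Source {
    Point {
        position: [Metres; 3],
    },
    Cylinder {
        radius: Metres,
        height: Metres,
        axis_offset: Option<Metres>,
    },
}

#[derive(Debug, Clone, PartialEq)]
pub struct Config {
    pub source: Source,
    pub n_events: u64,
}

/// The "config file": an ordinary Rust expression that the compiler
/// and rust-analyzer check against the types above.
pub fn config() -> Config {
    Config {
        source: Source::Cylinder {
            radius: Metres(0.25),
            height: Metres(1.5),
            axis_offset: None,
        },
        n_events: 1_000_000,
    }
}
```

Omitting `height`, or adding a `position` to the `Cylinder` variant, would be rejected by the compiler, so there is no translation layer left to test. The cost is the one listed under Cons: changing a value means recompiling.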

Of course it can be done early at run-time, but that's still less helpful than what a proc-macro can achieve. If checked at run-time, the message will

  • appear at run-time,
  • point to the line where the check is performed [edit: or maybe not: see Spanned in the toml crate];

while a proc-macro can make it

  • appear at edit-time
  • point to the line where the datum is written
  • inside the editor

The latter is clearly more valuable than the former. The question is, is it more trouble than it is worth?

And how is this worse than allowing arbitrary code to be executed in any other file?

In order to run this software, the user has to

  1. Download the code. It consists of many lines of unvetted arbitrary code in a Turing-complete language.

  2. Compile the code. This implicitly downloads a few orders of magnitude more unvetted code in direct and transitive dependencies.

  3. Write a config file. These few dozen lines of code written in a high-level DSL that talks about things that the user actually understands will be the only part of the code

    • written by the user
    • probably even seen by the user

And, somehow, number 3 is where we need to start worrying about the security implications of executing arbitrary code?

I appreciate that there may be issues with using a Turing-complete language for configuration, but I find the security argument to be completely specious, at least in this case.

Summary

Put another way, as configurations become more complex, the boundary between a program and its configuration becomes more ambiguous. In this sort of research code, the distinction between developer and user is often very nebulous. Splitting it into code and configuration and implementing them in different languages is, perhaps, both arbitrary and unhelpful.

It seems to me that you have a predetermined opinion, you have already decided, and you are not, in fact, asking for advice, but for confirmation. I cannot give you that confirmation with good conscience, based on the problems I explained above. If you wish to go with this approach, then feel free to do so anyway, but be warned that it is unusual and dangerous.

The difference between source code and configuration is that you don't control the configuration. You can give users a pre-compiled version of the software to be run, but you can't give them a pre-compiled version of the configuration, because their ability to modify the configuration is exactly the point of a config file.

Thus, when you give them some code to run, they have to trust you, and you can trust that your own code does what you intended. In contrast, when they specify an arbitrary executable configuration, then 1. you can no longer trust that it will co-operate nicely with your own code and do what you intended, and 2. they now also have to trust the potentially open set of people who tell them what to put in the config file (because users will have questions or problems with configuring the software, and they will ask for unverifiable advice).

I wonder how you reached this conclusion. But I suspect that going down this rabbit hole will not lead to anything constructive, so please don't be offended if I ignore further comments along these lines.

It's not clear to me who the attacker and the victim would be.

  • The developer doesn't need to trust the user (unless the code includes some sort of credentials allowing code to execute on the developer's machines).

  • The user needs to trust the developer (and those of all direct and transitive dependencies), regardless of whether the config language is Turing-complete, because the implementation language already is.

  • Yes, the user might solicit advice from the nasty world out there and get back dodgy advice. Whether or not the config language is Turing-complete doesn't significantly change the security implications, because the user has access to a shell on the local system and might be given the advice to type sudo rm -rf /* (or whatever).

Do you have any concrete examples of how you see this playing out? Are any of them, in practice, any more dangerous than the user having CLI access to the user's own machine?
