Strategies for parsing a format with different versions?

Does anyone know of strategies that can be used when writing a parser for a format which has gone through several versions over time?

A good example of what I'm talking about is the syntax for a programming language. Over time you'll add new bits of syntax or remove deprecated ones.

Similarly, there are about a dozen versions of DXF out there, and it's quite common to need to support parsing files (and serializing, but that's another conversation) written in different versions.

Also, does this idea have a proper name? I've been struggling to find the right incantation to feed Google when trying to find out more about the topic.

TL;DR:

  • Small syntax changes can be absorbed by Rust language features.
    • #[non_exhaustive] and #[deprecated] are useful for extending and deprecating native data structures.
  • I have no idea how to handle formats uniformly when their structure or semantics differ greatly.
    • In my case, I implemented a common entry point to load data, but could not unify the data structures.

Common belief

Syntax

Some syntax or structure updates can be handled well by #[non_exhaustive].
It lets you add enum variants and public struct fields without breaking existing user code.
For example, when you implement a syntax tree type for a programming language, non-exhaustiveness lets you add new expression types and new token types later.
(In fact, the syn crate uses this technique (in a legacy way) to support new syntax that may be added in the future.)
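
To make the #[non_exhaustive] point concrete, here is a minimal sketch. The `Expr` type and `eval` function are hypothetical names invented for illustration and are not taken from syn or any other crate. Note that the attribute only affects code in other crates: they are forced to write a wildcard arm, so adding a variant later does not break them.

```rust
/// Hypothetical expression type for an illustrative syntax tree.
/// `#[non_exhaustive]` tells downstream crates that more variants may appear.
#[non_exhaustive]
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Number(f64),
    Negate(Box<Expr>),
    Add(Box<Expr>, Box<Expr>),
}

/// Evaluates an expression, returning `None` for variants it does not know.
// Inside the defining crate the wildcard arm is unreachable, so the lint is
// silenced here. In a downstream crate the wildcard arm is mandatory.
#[allow(unreachable_patterns)]
pub fn eval(expr: &Expr) -> Option<f64> {
    match expr {
        Expr::Number(n) => Some(*n),
        Expr::Negate(inner) => eval(inner).map(|v| -v),
        Expr::Add(lhs, rhs) => Some(eval(lhs)? + eval(rhs)?),
        // Variants added in a later version of the format end up here.
        _ => None,
    }
}
```

If a later release adds, say, a `Multiply` variant, downstream `match` expressions keep compiling and fall through to their wildcard arm.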

However, you should be careful about removing existing elements.
Removing a public field of a struct or a variant of an enum is a breaking change, so it should be avoided when possible.
You can simply leave deprecated items in place, unused, instead of removing them.
The #[deprecated] attribute may be useful for this purpose.
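
As a rough sketch of keeping a retired item around instead of removing it (the `Token` type and its variants are hypothetical, made up for this example):

```rust
/// Hypothetical token type. `LegacyKeyword` is no longer produced by the
/// parser, but it stays in the enum so that existing user code still compiles.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Token {
    Ident,
    #[deprecated(note = "no longer produced by the parser; kept for compatibility")]
    LegacyKeyword,
    Number,
}

/// Describes a token. The library itself still has to mention the deprecated
/// variant to keep the match exhaustive, so the lint is silenced locally.
#[allow(deprecated)]
pub fn describe(token: Token) -> &'static str {
    match token {
        Token::Ident => "identifier",
        Token::LegacyKeyword => "legacy keyword (deprecated)",
        Token::Number => "number",
    }
}
```

User code that still names `Token::LegacyKeyword` gets a compiler warning carrying the note, rather than a hard error.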

Structure and semantics

I don't have a clear answer for how to handle changes in structure or semantics.
I personally think that if the structure or semantics differ greatly, they are essentially different formats.

The format's creator could convert data into a common internal representation or into the latest format.
However, this can be hard for third-party developers, because they know less about how the format may change in the future, or what is guaranteed.
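
For what it's worth, the "common internal representation" approach could look something like the sketch below. Everything in it is an assumption made for illustration: the `PointV1`, `PointV2`, and `Point` types, the units, and the choice of default for a missing value do not come from any real format.

```rust
/// Hypothetical record from version 1: integer millimetres, 2D only.
pub struct PointV1 {
    pub x_mm: i32,
    pub y_mm: i32,
}

/// Hypothetical record from version 2: floating-point metres, optional z.
pub struct PointV2 {
    pub x: f64,
    pub y: f64,
    pub z: Option<f64>,
}

/// Common internal representation, always in metres and always 3D.
#[derive(Debug, PartialEq)]
pub struct Point {
    pub x: f64,
    pub y: f64,
    pub z: f64,
}

impl From<PointV1> for Point {
    fn from(p: PointV1) -> Self {
        // Assumption: v1 data lies on the z = 0 plane.
        Point {
            x: f64::from(p.x_mm) / 1000.0,
            y: f64::from(p.y_mm) / 1000.0,
            z: 0.0,
        }
    }
}

impl From<PointV2> for Point {
    fn from(p: PointV2) -> Self {
        // Assumption: a missing z means 0.0.
        Point { x: p.x, y: p.y, z: p.z.unwrap_or(0.0) }
    }
}
```

The two "Assumption" comments are exactly the kind of decision that is easy for the format's owner and risky for a third party, who cannot be sure what a missing value is supposed to mean.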

My personal experience

When I wrote an FBX (yes, it's also by Autodesk!) parser (fbxcel) and its earlier experimental predecessors, I chose to make it explicit that there might be different parsers for different versions.
The format may change silently and drastically, so I gave up on supporting all versions through a single interface.

Additionally, because the format is proprietary, there may be obscure or rarely used features that I don't know about yet.
This means I might have to handle previously unknown parts of the specification even if the format itself is not updated...

Common entry point, different interface

In the fbxcel crate, I created a common entry point for all supported versions, and it returns version-specific parsers wrapped in a single enum type.
(I implemented the DOM types the same way: a common loader, and different DOM types wrapped in a single enum type.)
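
A stripped-down sketch of that shape follows. This is not the actual fbxcel API: the toy text format (a "version N" header line followed by a body), and the names `ParserV1`, `ParserV2`, `VersionedParser`, `LoadError`, and `parse_any` are all invented for illustration.

```rust
/// Hypothetical parser for v1, where the body is a comma-separated list.
pub struct ParserV1<'a> {
    body: &'a str,
}

impl<'a> ParserV1<'a> {
    pub fn fields(&self) -> Vec<&'a str> {
        self.body.split(',').collect()
    }
}

/// Hypothetical parser for v2, where the body is `key=value` pairs joined by `;`.
pub struct ParserV2<'a> {
    body: &'a str,
}

impl<'a> ParserV2<'a> {
    pub fn pairs(&self) -> Vec<(&'a str, &'a str)> {
        self.body.split(';').filter_map(|kv| kv.split_once('=')).collect()
    }
}

/// One enum holding whichever version-specific parser applies.
/// Marked non-exhaustive so that a future `V3` is not a breaking change.
#[non_exhaustive]
pub enum VersionedParser<'a> {
    V1(ParserV1<'a>),
    V2(ParserV2<'a>),
}

#[derive(Debug, PartialEq)]
pub enum LoadError {
    MissingHeader,
    UnsupportedVersion(String),
}

/// Common entry point: reads the header and hands back the matching parser.
pub fn parse_any(input: &str) -> Result<VersionedParser<'_>, LoadError> {
    let (header, body) = input.split_once('\n').ok_or(LoadError::MissingHeader)?;
    let version = header.strip_prefix("version ").ok_or(LoadError::MissingHeader)?;
    match version.trim() {
        "1" => Ok(VersionedParser::V1(ParserV1 { body })),
        "2" => Ok(VersionedParser::V2(ParserV2 { body })),
        other => Err(LoadError::UnsupportedVersion(other.to_string())),
    }
}
```

Callers match on the returned enum and then use whichever interface that version offers (`fields()` for v1, `pairs()` for v2), so the per-version differences stay visible in the types instead of being papered over.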
