API design question: how to wrap a family of types?

I am in the process of improving my rowan syntax tree library, and there’s a design question that has been bugging me for weeks, for which I don’t know a good answer. So, I’d like to dump my thoughts here in case someone comes up with a good way to solve my problem :slight_smile:

Simplifying, the syntax tree looks like this:

struct SyntaxKind(u16);

struct Node {
  kind: SyntaxKind,
  len: usize,
  children: Vec<NodeOrToken>
}

struct Token {
  kind: SyntaxKind,
  text: String,
}

enum NodeOrToken { Node(Node), Token(Token) }

in the real implementation, an elaborate memory management scheme is used instead, but that is not relevant for the present API question.

The interesting bit is the kind. The idea is that each language defines it’s own set of kinds, so, for example, Rust would have const IF_KEYWORD = SyntaxKind(10) and TOML would have const L_BRACKET = SyntaxKind(10).

However, when I print a Token of a specific language, I want to see SyntaxToken("if keyword") and not SyntaxToken(SyntaxKind(10)).

The easiest way to do that would be to make all types generic over K: Copy + Debug syntax kind. I don’t like this solution for the following reasons though:

  • as there’s a family of types, threading K everywhere is a pain
  • the implementation doesn’t actually use K anywhere, so making it generic seems unnecessary
  • the implementation is tricky (unsafe), so not woring about generics helps quite a bit
  • similar to the previous point, each language will get it’s own monomorhised copy of a tree, which is wasteful
  • because trees for different languages are fundamentally of different types now, you can’t, for example, put TOML and Rust trees into the same hash-map, which might be useful.

So, this is the fmt::Debug problem. There’s another one: syntax tree is a pretty foundational, public datatype. For this reason, I’d love to be able to write inherent and trait impls for syntax trees, even if I am using rowan as a library.

Because of this two issues, I want to find a way to conveniently wrap a family of types in newtypes.

Specifically, I expect the clients like Rust-analyzer to do something like

pub struct RustNode(rowan::Node);

impl fmt::Debug for RustNode { ... }

pub struct RustToken(rowan::Token);

What would be the most convenient way to do that?

The API for SyntaxNode is pretty big, and complex (it returns iterators). Just wrapping each and every function in RustNode seems like a lot of busywork, which needs to be repeated by every language.

I can imagine this can be helped a bit with roughly the following trait setup:

trat TreeApi {
    type Node;
    type Token;
    fn node_from_rowan(node: rowan::Node)  -> Self::Node;
    fn node_to_rowan(&node: Self::Node)  -> &rowan::Node;
    // a ton of other wrappers-unwrappers
}

trait NodeApi {
    type TreeApi = TreeApi<Node=Self>;
    fn parent(&self) -> Option<Self> { 
        Self::TreeApi::node_to_rowan(self).parent().map(Self::TreeApi::node_from_rowan) 
    }
    ...
}

but this seems very complicated, especially for iterator-returning methods.

I can also write a macro that generates the stupid boilerplate, but I’d love to avoid macro-based solutions: a macro would make this completely impenetrable for people new to the code base.

Here’s how API that I would like to wrap looks like: https://github.com/rust-analyzer/rowan/blob/e7a34eafdc0ccc1a6938cf74d07d88a69eae27cb/src/syntax_node.rs

How about the really dumb option?

Stop using fmt::Debug, and instead write your own debug<F: Formatter>(&self, writer: &mut F, language: &HashMap<u16, String>) function on your tree nodes. That way, you only need to thread the language down into diagnostic functions, instead of threading it through to all functions.

This might be a bit silly but if the main goal is to get SyntaxToken("if keyword") in the Debug output, another low-tech solution could be something like this:

struct SyntaxKind(&'static str);

You've mentioned that the memory management story for syntax trees is non-trivial, so it's probably a good idea to make everything else as dumb as possible.

Is it really important that different "families" of syntax trees have a different set of types? I imagine relaxing that constraint would make your life a lot easier in the long run, and if you were concerned that someone is accidentally mixing syntax trees from two documents/languages, you could probably sprinkle in some asserts (e.g. debug_assert_eq!(this_node.root(), other_node.root()))... It means detecting errors at runtime instead of compile-time, but I imagine most consumers of your library will be running high-level integration tests anyway.

I definitely want fmt::Debug to work for end-user types though... But yeah, providing at least convenience function to make implementing fmt::Debug easier is a good idea!

Yeah, in general, struct SyntaxKind(&'static SyntaxKindData) would be pretty general and work. There are two problems with it:

  • It will increase the size of every syntax tree node (that are aligned to 32 bits at the moment). This is significant, full-fidelity syntax trees weigh a ton :frowning:

  • I think I want for the users of the library to be able to write inherent methods and trait implementations for (wrapped) rowan types.

In that case, since generics are out and function parameters are out, about the only remaining way to get that data from the language implementer to the debugger, that I know of, would be putting the language definition into a lazy_static, RwLock<HashMap<NodeKind, Strung>>.

This, of course, requires token types to be globally unique within the application, but if you want to co-locate Rust nodes and TOML nodes within the same hash map without tagging them at runtime with the language type, wouldn't you need to do that anyway?

I can tag them at compile time, like this:

// in rowan crate
struct SyntaxNode { kind: SyntaxKind ... };

// in rust-analyzer crate
struct RustNode(rowan::SyntaxNode)

// here, we statically know the meaning of kind from `self.0`, so can write the correct string
impl fmt::Debug for RunstNode {}

// in tom crate
struct TomlNode(rowan::SyntaxNode)

// Ditto
impl fmt::Debug for TomlNode {}

If I need to process trees generically, I can extract .0 out of these typed wrappers. This solution, if implemented, works perfectly for me. The problem is, implemnting it is a pain, as there are like seven types I need to wrap, and each has a ton of methods.

1 Like

For the time-being, I ended up implementing something resembling "non-templated base class" pattern by just manually delegating all the methods :weary:

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.