Is borrowing a good choice if it lasts longer?

I’m struggling with the following decision. Any feedback is appreciated.

Borrowing is an easy choice when it doesn’t last long:

  • Function parameters
  • Iterated values (which will often be used and discarded)

But how about longer-term borrowing? For example, in a parser:

  • Input: String
  • Tokens are fragments of the input.
  • The abstract syntax tree also contains input string fragments.

I see two main alternatives:

  1. Tokens borrow (contain input string slices), AST nodes own (contain cloned input string slices).

  • The top-level scope contains both the input string and the parser invocation, so borrowing should work for tokens(?). The AST, however, should survive after the input string is thrown away, so borrowing is not an option there.
  2. Switch to Rc<String> (or Arc<String>) as early as possible.

    • Tokens and AST nodes contain cloned Rc<String>.

Pros and cons:

  1. With borrowing, I’m mainly worried about painting myself into a corner. The viral propagation of lifetime annotations also makes the code considerably more complicated.

  2. Not using slices wastes memory.
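The two alternatives could be sketched roughly like this (BorrowedToken, OwnedAstNode, and RcToken are made-up names, just to illustrate the shapes):

```rust
use std::ops::Range;
use std::rc::Rc;

// Alternative 1: tokens borrow from the input, AST nodes own their text.
#[allow(dead_code)]
struct BorrowedToken<'src> {
    text: &'src str, // slice of the input string
}

#[allow(dead_code)]
struct OwnedAstNode {
    text: String, // cloned out of the input, so the AST can outlive it
}

// Alternative 2: share the input via Rc and remember which part we mean.
struct RcToken {
    input: Rc<String>,  // cloning this is just a refcount bump
    span: Range<usize>, // byte range of this token within the input
}

impl RcToken {
    fn text(&self) -> &str {
        &self.input.as_str()[self.span.clone()]
    }
}

fn main() {
    let input = Rc::new(String::from("let x = 42;"));
    let tok = RcToken { input: Rc::clone(&input), span: 8..10 };
    assert_eq!(tok.text(), "42");
}
```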


I usually recommend that long-term borrowing should be avoided in pretty much all cases except for parsers.


This is likely to draw highly opinionated responses. The real answer is that it depends on your use case. Is copying AST nodes a performance bottleneck in your program? Is the issue with copying (program running time) or with total memory usage? Have you profiled? Do your programs run in memory-constrained environments? How often do you copy ASTs? How often do you mutate them?

The viral lifetime issue is solved fairly elegantly by something like yoke. There are also many zero-copy serialization crates out there, which purport to solve this problem with varying degrees of efficacy, usually trading performance for legibility/generality.

Any potential decision comes down to: is the cognitive overhead worth the performance gain? For that you really have to profile.


It depends a lot on what the design you anticipate is. For parsers, there's two common cases:

  • You want to keep the full source string around anyway, for e.g. diagnostics purposes.
  • You want to toss the full source string, because it's large and the set of text fragments you actually need to keep around (e.g. for distinguishing identifiers) is relatively small.

The majority of parsers end up taking the simpler route, and just borrowing from the source string. More sophisticated solutions might hold byte ranges for most tokens, and use some sort of string interning for identifiers where remembering the actual text matters.
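A rough sketch of that more sophisticated route (Interner and Token are hypothetical names; real parsers usually reach for an interning crate instead of hand-rolling this):

```rust
use std::collections::HashMap;
use std::ops::Range;

/// Minimal string interner: each distinct identifier is stored exactly once.
#[derive(Default)]
struct Interner {
    map: HashMap<String, u32>,
    strings: Vec<String>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&id) = self.map.get(s) {
            return id;
        }
        let id = self.strings.len() as u32;
        self.strings.push(s.to_owned());
        self.map.insert(s.to_owned(), id);
        id
    }

    fn resolve(&self, id: u32) -> &str {
        &self.strings[id as usize]
    }
}

/// Tokens store only a byte range into the source, plus an interned
/// symbol when the actual text matters (identifiers).
#[allow(dead_code)]
enum Token {
    Ident { span: Range<usize>, sym: u32 },
    Punct { span: Range<usize> },
}

fn main() {
    let src = "foo + foo";
    let mut interner = Interner::default();
    let a = interner.intern(&src[0..3]);
    let b = interner.intern(&src[6..9]);
    assert_eq!(a, b); // "foo" is stored exactly once
    assert_eq!(interner.resolve(a), "foo");
}
```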

Depending on your appetite for unsafe, there are nearly endless options available. But parsing is usually not the bottleneck of whatever processing you're doing, so while it's flashy to optimize text parsing and AST creation, it's typically not important enough to justify spending more effort on it than strictly necessary.

There's a middle-ground option, though: with a Cow-like structure, you can have an Ast<'src> tree which borrows from the input and offer an into_owned transform into an Ast<'static> which owns its own memory. Without clever use of unsafe and/or string interning this might result in more fragmented memory usage, but typically still less overall usage than keeping the full source string around (for string storage), since you can drop a lot of irrelevant source information (e.g. formatting, punctuation).
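A minimal version of that middle ground, with a single hypothetical Ident node standing in for a whole AST:

```rust
use std::borrow::Cow;

// A Cow-based node: borrows from the source while parsing, but can be
// detached from it on demand.
struct Ident<'src> {
    name: Cow<'src, str>,
}

impl<'src> Ident<'src> {
    /// Clone any still-borrowed text so the node no longer depends on
    /// the source string's lifetime.
    fn into_owned(self) -> Ident<'static> {
        Ident { name: Cow::Owned(self.name.into_owned()) }
    }
}

fn main() {
    let owned: Ident<'static>;
    {
        let src = String::from("fn main() {}");
        let borrowed = Ident { name: Cow::Borrowed(&src[3..7]) };
        owned = borrowed.into_owned();
    } // `src` is dropped here...
    assert_eq!(owned.name, "main"); // ...but the owned node survives
}
```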


There’s also a third option of using a type such as yoke::Yoke<&'static str, Rc<str>> that combines shared ownership with the ability to do slicing. yoke was already mentioned above, but I included a small demonstration. 🙂


And it (probably) doesn't matter.

The amount of text that a realistic project written by humans can accumulate is at the very most in the millions-of-lines ballpark, and that's the size of the most humongous code bases (e.g. LLVM, if memory serves).

This means that you will have strings which are a few MB, or tens of MB at most, with the AST consuming a comparable amount of memory (or even less, given that the AST doesn't copy all of the text verbatim). This seems minuscule compared with shitty modern webapps routinely slurping tens of MBs of JavaScript per page load, for example.

Not to mention that borrowed slices can actually end up not saving any meaningful amount of memory. On a typical 64-bit architecture, a slice is 16 bytes, yet most tokens are shorter than that: keywords are 3-8 bytes, operators and symbols typically 1-3, and the longer ones are comments and string literals. If you store a 16-byte slice for every 2-byte operator, you have already used 8 times the memory strictly necessary. At that point, you should probably not worry about storing it as a String and using 12 times the required memory.
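The arithmetic above is easy to verify (the sizes below assume a typical 64-bit target):

```rust
use std::mem::size_of;

fn main() {
    // A &str slice is a (pointer, length) pair: 16 bytes on 64-bit.
    assert_eq!(size_of::<&str>(), 16);
    // A String carries pointer + length + capacity: 24 bytes on the
    // stack, plus whatever it allocates on the heap.
    assert_eq!(size_of::<String>(), 24);
    // So a 2-byte operator stored as a slice costs 16 bytes (8x the
    // payload), and stored as a String about 24 bytes (12x) plus a
    // heap allocation.
    let op = "+=";
    assert_eq!(op.len(), 2);
}
```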

Lastly, I personally don't think it's worth it to trust an unsafe dependency such as yoke. I think it's the implementation of an anti-pattern anyway, as are almost all uses of self-referential data. Just store owned copies if you need to keep parts of the string, and optimize later if it actually turns out to be necessary.


To the Yoke solution, note that you'd typically apply it only at the top level. I.e. you'd typically not store Yoke<&str, Rc<str>> throughout your parse tree, you'd store &str and have Yoke<Syntax<'static>, Rc<str>> at the top level to "cover" the lifetime by attaching the data source to the data view.

I wouldn't say yoke is an anti-pattern, but it is and should be a niche pattern/use case. 99% of cases are fine with non-zero-copy parsing, well-nested borrows, or leaking the source data; yoke is designed for the 1% case where none of those work and yoking is a significant improvement (the data source is dense, which is fundamentally not the case for text-based protocols).


Actually, this could be a case where Rc<String> is preferable. Usual usage of the Yoke value would not dereference the Rc anyways, so the double indirection won’t matter; and the Rc<String> has the advantages of being half the size on the stack, and cheaper to construct if you have a String already.
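Both claims are easy to check (again assuming a typical 64-bit target):

```rust
use std::mem::size_of;
use std::rc::Rc;

fn main() {
    // Rc<String> is a thin pointer; Rc<str> is a fat (pointer, length) pair.
    assert_eq!(size_of::<Rc<String>>(), 8);
    assert_eq!(size_of::<Rc<str>>(), 16);

    // Rc<String> can take over an existing String's buffer without copying...
    let s = String::from("already allocated");
    let cheap: Rc<String> = Rc::new(s); // moves the String, no byte copy
    // ...while Rc<str> has to copy the bytes into a fresh allocation.
    let copied: Rc<str> = Rc::from(cheap.as_str());
    assert_eq!(cheap.as_str(), &*copied);
}
```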


Often that isn't even a choice you can make. For example growable collections with borrowed data can't simply accept insertion of a new item, because that item has to be stored somewhere else first, for a lifetime/scope known to be sufficient, and that has to be done without invalidating any other references (which collections in std don't guarantee).

Borrowed AST would be in that camp. You can build an AST in one shot borrowing from its source code, but then it's awful to work on, because you can't just insert new nodes into it. Minting loans with an appropriate lifetime requires adding a special, ever-growing memory pool (an arena). Or you can make the AST use Cow to support both borrowed and newly-allocated data, but then you have to consider whether the extra overhead of Cow costs more than you saved by borrowing.
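The Cow variant of that trade-off in miniature: borrowed and freshly-synthesized nodes can live in the same collection, so insertion needs no arena (a flat Vec stands in for a real tree here):

```rust
use std::borrow::Cow;

fn main() {
    let src = "a b";
    // Nodes borrowed from the source and freshly synthesized nodes can
    // coexist in the same collection when each one is a Cow.
    let mut nodes: Vec<Cow<'_, str>> = vec![
        Cow::Borrowed(&src[0..1]),
        Cow::Borrowed(&src[2..3]),
    ];
    // Inserting a new node needs no arena: it simply owns its own text.
    nodes.insert(1, Cow::Owned(String::from("new")));
    assert_eq!(nodes, ["a", "new", "b"]);
}
```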

Also in many cases support for long-term borrowed data gives underwhelming results, because instead of:

  1. Feed the data into an object that will own the data

it changes to:

  1. Feed the data into temporary storage
  2. Loan the temporary data to an object that will borrow the data

To be clear, parsing usually involves two steps:

  1. Turning the raw content into a syntax tree, where the syntax tree would typically contain borrowed slices of the raw input
  2. Turning the syntax tree into domain-specific objects, which would typically have a new owner
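Sketched with a hypothetical key=value parser, those two steps look like:

```rust
// Step 1: a syntax-level view that borrows slices of the raw input.
struct RawPair<'src> {
    key: &'src str,
    value: &'src str,
}

fn parse<'src>(line: &'src str) -> Option<RawPair<'src>> {
    let (key, value) = line.split_once('=')?;
    Some(RawPair { key: key.trim(), value: value.trim() })
}

// Step 2: a domain object that owns its data and outlives the input.
#[derive(Debug, PartialEq)]
struct Setting {
    key: String,
    value: String,
}

impl From<RawPair<'_>> for Setting {
    fn from(raw: RawPair<'_>) -> Self {
        Setting { key: raw.key.to_owned(), value: raw.value.to_owned() }
    }
}

fn main() {
    let setting: Setting;
    {
        let input = String::from("timeout = 30");
        setting = parse(&input).map(Setting::from).unwrap();
    } // the raw input is dropped here; the Setting survives
    assert_eq!(setting, Setting { key: "timeout".into(), value: "30".into() });
}
```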

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.