I am considering using integer codes instead of strings in my code for attribute names. Specifically, I want to parse XML-like files and represent attribute names as integers in the parsed struct. Additionally, I want to access attributes by name, where the name is known at compile time.
This leads to an API that can be used like this:

    let code = intern(attr_name); // Get code dynamically
    get_attr(intern!("foo")); // Or statically, should expand to get_attr(42)

To achieve this, I need to:
Share context between different calls to the intern! macro. This can be done as shown below. It works, but I am not sure it is correct: I can imagine it breaking if, for example, the compiler caches macro expansion results and skips my function for some intern! calls while recreating INTERNED from scratch.

    use proc_macro::TokenStream;
    use std::{collections::HashMap, sync::OnceLock, sync::RwLock};
    use syn::{parse_macro_input, LitStr};

    // Shared between all intern! expansions within one compiler process.
    static INTERNED: OnceLock<RwLock<HashMap<String, u64>>> = OnceLock::new();

    #[proc_macro]
    pub fn intern(input: TokenStream) -> TokenStream {
        let input = parse_macro_input!(input as LitStr);
        let value = input.value();
        let mut interned = INTERNED
            .get_or_init(|| RwLock::new(HashMap::new()))
            .write()
            .unwrap();
        // Reuse the existing code, or assign the next free one.
        let code = interned.get(&value).cloned().unwrap_or_else(|| {
            let code = interned.len() as u64;
            interned.insert(value.clone(), code);
            code
        });
        quote::quote!(#code).into()
    }

Access the entire map at runtime to resolve statically interned strings. However, this does not work: the `interned_map!` macro can be expanded before some `intern!` calls, so the map it returns will be missing those strings.
As far as I’m aware, there’s not a great way to do this right now. You could maybe use something like linkme or inventory to collect the list of strings, but they both rely on some platform-specific tricks that may not work everywhere.
Another option is to write a build script that greps your codebase for intern! invocations and then outputs a .rs file that you can import as a module with a #[path] directive.
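
To make the build-script idea a bit more concrete, here is a rough sketch of the scanning and code generation steps. The names `collect_interned` and `generate_table` are made up for illustration, and the scan is deliberately naive: it assumes no escapes inside the literal and no whitespace between `intern!(` and the opening quote. A real build script would probably want to parse the sources properly instead.

```rust
use std::collections::BTreeSet;

/// Naively collects the string literals passed to `intern!("...")` in `src`.
/// Assumption: plain literals only (no escapes, no raw strings).
fn collect_interned(src: &str) -> BTreeSet<String> {
    const NEEDLE: &str = "intern!(\"";
    let mut found = BTreeSet::new();
    let mut rest = src;
    while let Some(start) = rest.find(NEEDLE) {
        // Skip past the needle so `rest` starts at the literal's contents.
        rest = &rest[start + NEEDLE.len()..];
        match rest.find('"') {
            Some(end) => {
                found.insert(rest[..end].to_string());
                rest = &rest[end..];
            }
            None => break, // unterminated literal, give up
        }
    }
    found
}

/// Renders a Rust source file with every collected name; a name's code is
/// its index in the table. The sorted set keeps codes stable across builds.
fn generate_table(names: &BTreeSet<String>) -> String {
    let mut out = String::from("pub static INTERNED: &[&str] = &[\n");
    for name in names {
        out.push_str(&format!("    {:?},\n", name));
    }
    out.push_str("];\n");
    out
}
```

The build script would run `collect_interned` over each source file, merge the sets, and write the output of `generate_table` to the .rs file that the crate then pulls in as a module.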
If all your strings are known at compile time, this should just be an enum that knows how to turn itself into a &'static str and back (and/or is Display/FromStr).
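
For illustration, such an enum could look roughly like this. The attribute names (`id`, `class`, `href`) and the `UnknownAttr` error type are placeholders I made up; the real list depends on your format.

```rust
use std::fmt;
use std::str::FromStr;

/// Hypothetical attribute set; the discriminant doubles as the integer code.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
#[repr(u32)]
pub enum Attr {
    Id,
    Class,
    Href,
}

impl Attr {
    /// Maps the code back to its name without any runtime table.
    pub const fn as_str(self) -> &'static str {
        match self {
            Attr::Id => "id",
            Attr::Class => "class",
            Attr::Href => "href",
        }
    }
}

impl fmt::Display for Attr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(self.as_str())
    }
}

/// Returned when a name is not one of the known attributes.
#[derive(Debug, PartialEq, Eq)]
pub struct UnknownAttr;

impl FromStr for Attr {
    type Err = UnknownAttr;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "id" => Ok(Attr::Id),
            "class" => Ok(Attr::Class),
            "href" => Ok(Attr::Href),
            _ => Err(UnknownAttr),
        }
    }
}
```

With this, the parser calls `name.parse::<Attr>()` for dynamic names, and code that knows the name at compile time just writes `Attr::Href`.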
Then why don't you go full dynamic and just pass string literals to the interning function when you have one, and not-a-string-literal otherwise? I really don't see the benefit of the approach above.
If you cannot define a list of predefined symbols centrally, then you need to do a dynamic insertion at runtime. The linker tricks used by linkme can't deduplicate entries, so you'd end up with a separate entry for each instance of intern!("attr"), which definitely isn't what you want.
What you could do with linkme/inventory is have a list of tokens to insert up front so intern! just does a "try_get" instead of a "get_or_insert", but the difference is likely to be small with a decent interner. (The insert path should be #[cold] and outlined.)
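
As a sketch of what I mean by a cold, outlined insert path (this is a toy interner using only std, not how any particular crate does it; `preload`, `intern_slow` and `resolve` are names I picked for the example):

```rust
use std::collections::HashMap;
use std::sync::{OnceLock, RwLock};

#[derive(Default)]
struct Interner {
    codes: HashMap<String, u64>,
    names: Vec<String>, // index == code, used to resolve codes back to names
}

fn global() -> &'static RwLock<Interner> {
    static GLOBAL: OnceLock<RwLock<Interner>> = OnceLock::new();
    GLOBAL.get_or_init(|| RwLock::new(Interner::default()))
}

/// Hot path: a shared read lock and one map lookup.
pub fn intern(name: &str) -> u64 {
    if let Some(&code) = global().read().unwrap().codes.get(name) {
        return code;
    }
    intern_slow(name)
}

/// Insert path, kept out of line so the hot path stays small.
#[cold]
#[inline(never)]
fn intern_slow(name: &str) -> u64 {
    let mut guard = global().write().unwrap();
    // Re-check: another thread may have inserted between the two locks.
    if let Some(&code) = guard.codes.get(name) {
        return code;
    }
    let code = guard.names.len() as u64;
    guard.names.push(name.to_string());
    guard.codes.insert(name.to_string(), code);
    code
}

/// Inserts a known list up front so later lookups stay on the hot path.
pub fn preload(names: &[&str]) {
    for name in names {
        intern_slow(name);
    }
}

/// Resolves a code back to its name, if it has been interned.
pub fn resolve(code: u64) -> Option<String> {
    let guard = global().read().unwrap();
    guard.names.get(code as usize).cloned()
}
```

If the list passed to `preload` came from linkme/inventory, every `intern!` call site would hit only the read-lock path; without it, the first call per string pays for the insert once.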
What could be more directly advantageous would be to pre-hash fixed strings to remove that little bit of work from lookups. If you use a deterministic hash (e.g. FxHasher) you can do this, although it opens you up to potential HashDoS-style attacks.
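
To show the pre-hashing part in isolation, here is a small sketch. Note that I've swapped in FNV-1a rather than FxHasher, purely because it is easy to write as a `const fn` with nothing but std; the same HashDoS caveat applies to any fixed, unkeyed hash. `FOO_HASH` is just an example constant.

```rust
/// 64-bit FNV-1a, written as a `const fn` so it can run at compile time.
pub const fn fnv1a(s: &str) -> u64 {
    let bytes = s.as_bytes();
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325; // FNV offset basis
    let mut i = 0;
    // `for` loops are not allowed in const fn, hence the manual index.
    while i < bytes.len() {
        hash ^= bytes[i] as u64;
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3); // FNV prime
        i += 1;
    }
    hash
}

/// Evaluated by the compiler; no hashing happens at runtime for this name.
pub const FOO_HASH: u64 = fnv1a("foo");
```

Dynamic names go through the same `fnv1a` at runtime, so both kinds of key agree. Actually using the precomputed value requires a table that accepts a caller-supplied hash, which std's HashMap does not expose on stable.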
I somewhat doubt a lookup would be expensive enough to justify it (with a good interner), but you could also use an approach like the one tracing uses to amortize the cost, and only do the lookup the first time each intern! is hit, e.g. expand to something similar to `{ static CELL: OnceLock<Token> = OnceLock::new(); CELL.get_or_init(|| intern("attr")) }`. After the first execution this bypasses the global interner lock by first checking a lock local to that call site.
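
Spelled out as a declarative macro, the per-call-site cache could look something like the sketch below. The global interner here is a stand-in (a plain `Mutex<HashMap>`), and the `LOOKUPS` counter and `attr_foo` function exist only to make the caching observable; none of these names come from a real crate.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Mutex, OnceLock};

pub type Token = u64;

/// Counts how often the global interner is actually consulted (demo only).
pub static LOOKUPS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in global interner: takes the global lock on every call.
pub fn intern(name: &str) -> Token {
    static GLOBAL: OnceLock<Mutex<HashMap<String, Token>>> = OnceLock::new();
    LOOKUPS.fetch_add(1, Ordering::SeqCst);
    let mut map = GLOBAL
        .get_or_init(|| Mutex::new(HashMap::new()))
        .lock()
        .unwrap();
    if let Some(&tok) = map.get(name) {
        return tok;
    }
    let next = map.len() as Token;
    map.insert(name.to_string(), next);
    next
}

/// Each expansion gets its own `CELL`, so the global interner is consulted
/// only the first time a given call site runs.
macro_rules! intern {
    ($name:literal) => {{
        static CELL: ::std::sync::OnceLock<Token> = ::std::sync::OnceLock::new();
        *CELL.get_or_init(|| intern($name))
    }};
}

/// Example call site: repeated calls reuse the cached token.
pub fn attr_foo() -> Token {
    intern!("foo")
}
```

Calling `attr_foo()` in a loop touches the global map once; every later call is a single check of the call site's own `OnceLock`.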