Hello,
I have some code that is extremely perf-critical, and I'm looking for ideas about how to stay on the fast path.
A: This runs twice as fast as B:

```rust
struct Node<T> {
    stuff: Vec<Option<T>>,
}

impl<T> Node<T> {
    fn get_stuff(&self) -> &Option<T> {
        self.stuff.last().unwrap()
    }
}
```

```rust
// Caller; the get_stuff method is dispatched through some indirection.
// ...
let opt = node.get_stuff();
match opt {
    // ...
}
```
B: This runs half as fast as A:

```rust
// Node and the caller are the same as in A above.
impl<T> Node<T> {
    fn get_stuff(&self) -> Option<&T> {
        self.stuff.last().unwrap().as_ref()
    }
}
```
When I say there is a 2x difference, I don't just mean in this one call; I mean in overall program runtime. This is a massive difference.
I'm pretty sure that in the slow case, `get_stuff` hits a pipeline stall: the return can't proceed until the contents of the `Option` have been fetched, to decide whether it's `None`.
In the fast case, I'm guessing the CPU can start fetching the `Option` in parallel with the return, so by the time the calling code matches on the option, it's already in cache.
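For reference, here's a minimal standalone snippet (my illustration, not the original code) showing the conversion B performs: producing an `&Option<T>` is just a pointer copy, while `as_ref()` has to inspect the discriminant immediately to produce `Option<&T>`.

```rust
fn main() {
    let slot: Option<String> = Some("payload".to_string());

    // A-style: an &Option<T> is just a pointer; nothing about the
    // contents needs to be read to create or return it.
    let by_ref: &Option<String> = &slot;

    // B-style: as_ref() must read the discriminant here and now to
    // decide whether to produce Some(&T) or None.
    let converted: Option<&String> = by_ref.as_ref();

    assert!(by_ref.is_some());
    assert!(converted.is_some());
}
```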
My problem is that I can't just return a reference to the `Option`, because `Node` is an abstract interface, and some `Node` implementations don't store their contents as `Option<T>`.
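To make the constraint concrete, the interface presumably looks something like this minimal sketch (the trait and type names, and the second implementation, are my illustration, not the original code):

```rust
// Hypothetical shape of the interface, assuming dynamic dispatch
// (the "some indirection" mentioned above).
trait NodeLike<T> {
    // Works for any storage, but this is the slow signature (case B).
    fn get_stuff(&self) -> Option<&T>;
    // The fast signature (case A) can't live here: an implementation
    // that doesn't store Option<T> has no &Option<T> to hand out.
    // fn get_stuff(&self) -> &Option<T>;
}

// An implementation that stores options, like the code above.
struct VecNode<T> {
    stuff: Vec<Option<T>>,
}

impl<T> NodeLike<T> for VecNode<T> {
    fn get_stuff(&self) -> Option<&T> {
        self.stuff.last().unwrap().as_ref()
    }
}

// An implementation whose storage isn't Option<T> at all.
struct DenseNode<T> {
    stuff: Vec<T>,
}

impl<T> NodeLike<T> for DenseNode<T> {
    fn get_stuff(&self) -> Option<&T> {
        self.stuff.last()
    }
}
```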
I understand that it sounds like some of my stated goals are in conflict. So I'm just asking how other folks would tackle this particular bear.
Thank you for any thoughts.
Follow-up note: I considered replacing `Option` with my own `(&valid_bool, &T_union)` pair, returning the references to each component separately, which does stay on the fast path. But I wanted to see if there's a way to keep the niche (null-pointer) optimization from `Option`.