I've got a small tool that is a slightly more featureful version of the cut unix command. I've run into an issue trying to reuse a vec to store &str in it. My current solution is unsafe and I'm wondering if anyone has a performant way to stay safe?
I don't think it's ok. As far as rustc is concerned, you're aliasing a mutable reference, and that's undefined behavior.
I've been working off of @quinedot's playground; I think it resembles your situation closely enough. The problem is that you're keeping staging alive through the loop iterations, and the borrow checker can't see that you're dropping all the contents later on. One easy solution is to just recreate staging in every loop iteration. It seems somewhat wasteful, but in my experience the optimizer can easily see through this and remove the extra allocations (or the allocator is smart enough to reuse the memory here). You might want to check whether this is a hot spot anyway.
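A minimal sketch of that approach (function and variable names invented for illustration, not your actual code): the staging vec is created fresh inside the loop body, so none of its borrows outlive a single iteration.

```rust
// Hypothetical sketch: allocate `staging` anew each iteration so its
// borrows of `line` never escape the loop body.
fn split_lines(lines: &[String]) -> Vec<String> {
    let mut out = Vec::new();
    for line in lines {
        // Fresh Vec per iteration; the allocator will often hand back
        // the same block it just freed, so this is cheaper than it looks.
        let mut staging: Vec<&str> = Vec::new();
        for part in line.split(',') {
            staging.push(part);
        }
        out.push(staging.join("|"));
        // `staging` is dropped here, releasing its borrows of `line`.
    }
    out
}
```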
Adapting @quinedot’s playground, not the original:
You could check whether something like this (playground link) compiles to the desired result. In principle, doing .into_iter().map(…).collect() on a Vec<T>, turning it back into a Vec<T> (or even into a Vec<T1> where T1 has the same size and alignment as T), should operate in-place due to specializations in the standard library using the InPlaceIterable trait. I haven’t had the time to check whether this approach does indeed optimize in the desired way.
The idea is to have a Vec<Vec<&'static str>> outside of the loop, start by turning it into Vec<Vec<&'lifetime str>> by variance, then work with that, and after draining every inner vec, turn it back into Vec<Vec<&'static str>>. This works by using the into_iter-map-collect approach on two levels. I don’t know if the take(0) I’ve added helps or hurts; I thought it might help the optimizer be sure that the vec is empty at this point. One should probably look at the assembly (or do benchmarking) with and without the take(0) call. Another thing I don’t know is whether .map(|_| "") or something like .map(|_| unreachable!()) is better, or if it doesn’t make a difference.
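Condensed, the two-level round trip looks roughly like this (a sketch with invented input data and names, not the original program):

```rust
// Round-trip sketch: a Vec<Vec<&'static str>> lives across iterations;
// inside the loop it is coerced (by variance) to a shorter lifetime,
// and the into_iter/map/collect dance restores the 'static type,
// ideally without reallocating thanks to the in-place specialization.
fn run(inputs: &[&str]) -> usize {
    let mut total = 0;
    let mut staging_empty: Vec<Vec<&'static str>> = vec![Vec::new(); 3];
    for line in inputs {
        // Variance: &'static str shrinks to the lifetime of `line`.
        let mut staging: Vec<Vec<&str>> = staging_empty;
        for (i, part) in line.split(',').enumerate() {
            staging[i % 3].push(part);
        }
        total += staging.iter().map(Vec::len).sum::<usize>();
        // Empty every inner vec, then collect both levels back to 'static.
        staging_empty = staging
            .into_iter()
            .map(|mut inner| {
                inner.clear();
                inner.into_iter().take(0).map(|_| "").collect()
            })
            .collect();
    }
    total
}
```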
Yep, release mode and everything! It makes sense that it would be slower here; given the lifetimes involved, I don't think the optimizer could see a way to reuse that vec.
Looking around the internet a bit more, someone pointed out that this problem is similar to one that threads have, where the expensive resource (the thread) outlives the values sent to it for computation. I haven't figured out how to apply the scoping idea here, but it also seems promising.
The typical example is that &'a T can be converted into &'b T if 'b is a shorter lifetime than 'a, but the same mechanism also allows converting Vec<&'a T> into Vec<&'b T> or even Vec<Vec<&'a T>> into Vec<Vec<&'b T>>.
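In code, covariance makes conversions like these compile as plain moves, with no actual work done (the function names are made up for illustration):

```rust
// Covariance in the lifetime parameter: a Vec of longer-lived
// references can be used where shorter-lived ones are expected.
fn shorten<'b>(v: Vec<&'static str>, _marker: &'b str) -> Vec<&'b str> {
    v // no conversion needed; this is a plain move
}

fn shorten_nested<'b>(v: Vec<Vec<&'static str>>, _marker: &'b str) -> Vec<Vec<&'b str>> {
    v // variance applies through the nesting too
}
```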
I have a feeling you could avoid those new allocations with some shenanigans involving Vec::from_raw_parts and Vec::into_raw_parts, but I couldn't really do it with confidence (partially because of the "vec of vec"s, which adds an additional layer...).
@duck_tape In case that wasn’t clear, this step is the line
let mut staging = staging_empty;
A new variable inside of the loop is needed because variables can’t change their type and the lifetime of the strings is, conceptually, a new lifetime for every loop iteration. But variables don’t cost anything. The variable staging then has a type Vec<Vec<&'some_shorter_lifetime str>>. The coercion happens implicitly.
@steffahn That makes sense and is pretty nifty. Any recommendations on how to get visibility into exactly what the compiler is doing there and whether the InPlaceIterable specializations take place? It looks like running on nightly produces the same performance as stable as it stands now, which makes me think it's not yet as optimal as possible.
Overall I think I need to think hard about reorganizing the entire loop to try to get rid of the staging vec entirely. I just can't imagine how to do that yet.
I wouldn’t know how to get good insight into this, but at least it’s not an LLVM optimization but a hand-coded specialization in the standard library, so it’s in a sense more reliable. It even applies in debug mode. You can be fairly sure that there won’t ever be any allocations introduced by these .collect() calls in the future. Really, I wouldn’t have been surprised if there was no overhead at all from my solution. So if there still is some, it’s either LLVM failing to optimize away some stuff, or maybe some differences in the panic paths getting in the way of certain optimizations.
One thing for example that you could try is if replacing
values.into_iter().take(0).map(|_| "").collect()
with a transmute [i.e. unsafe { mem::transmute(values) }] makes a difference in performance. (I wouldn’t be surprised if it doesn’t make a difference though.)
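For that benchmark, the transmute variant could look like the following. Caveat: transmuting between two Vec types is not officially guaranteed to be sound even when only the lifetimes differ, so treat this strictly as a throwaway measurement experiment, not production code.

```rust
use std::mem;

// Unsafe experiment: erase the (shorter) lifetime of an already-empty
// Vec back to 'static. The empty check matters: if the vec still held
// elements, we would be laundering short-lived references into 'static.
fn reset_lifetime(v: Vec<&str>) -> Vec<&'static str> {
    assert!(v.is_empty());
    // Both types have identical layout; only the lifetime parameter differs.
    unsafe { mem::transmute(v) }
}
```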
Regarding optimizations around this loop in general, you could also do something like turning
for value in values.drain(..) {
    if print_delim {
        write!(&mut writer, "{}", &opts.output_delimiter).unwrap();
    }
    print_delim = true;
    write!(&mut writer, "{}", value).unwrap();
}
into
let mut drain = values.drain(..);
if let Some(value) = drain.next() {
    write!(&mut writer, "{}", value).unwrap();
}
for value in drain {
    write!(&mut writer, "{}", &opts.output_delimiter).unwrap();
    write!(&mut writer, "{}", value).unwrap();
}
or
let mut drain = values.drain(..);
if let Some(value) = drain.next() {
    write!(&mut writer, "{}", value).unwrap();
}
for value in drain {
    write!(&mut writer, "{}{}", &opts.output_delimiter, value).unwrap();
}
Okay, I believe I understand everything that is happening.
We create staging_empty outside the loop with a &'static str. Then in the loop we create a variable that coerces the &'static str lifetime to a lifetime that fits the loop.
Then we iterate over staging and drain each inner vec, ensuring they are empty. Then we return an effectively empty vec created by collecting an iterator over the now-empty values.
We are relying on the InPlaceIterable specialization to optimize the collect calls into reusing the underlying memory; its guarantees should apply since the items we produce have the same size and alignment as the ones we consume in both instances.
The InPlaceIterable stuff feels a little hand-wavy, but it seems to be working.
Replacing values.into_iter().take(0).map(...).collect() with iter::empty().collect() seems to keep the same performance which I think makes sense and looks a little tidier.
That change should introduce overhead though. Not using values as the basis for the iterator means that you’re deallocating the memory for all the values vectors. iter::empty().collect() is pretty much the same as Vec::new(). In particular, this will make the subsequent staging[pos].push(part) calls of later iterations more expensive.
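You can see the lost buffer directly (a small check, not from the original code):

```rust
// iter::empty().collect() behaves like Vec::new(): Empty's size hint is
// (0, Some(0)), so no allocation happens at all, and any capacity built
// up in previous iterations is thrown away rather than carried over.
fn fresh() -> Vec<&'static str> {
    std::iter::empty().collect()
}
```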
Which may explain why nightly and stable seem to have about the same performance, even though I would expect nightly to be faster since it can use the in-place trait.
Oh, no, that trait is used on stable as well. It’s unstable for use by anything external to std, but the standard library itself can use it internally, and it does use it on both stable and nightly.