Review of unsafe usage

Hello!

I've got a small tool that is a slightly more featureful version of the Unix cut command. I've run into an issue trying to reuse a Vec that stores &str in it. My current solution uses unsafe, and I'm wondering if anyone has a performant way to stay safe?

https://github.com/sstadick/hck/pull/2#discussion_r658429825

(The link points to a comment in the PR; the preview box doesn't seem to jump to the comment.)

I am pretty convinced that my usage of unsafe here is in fact safe and I could happily release this, but there must be a better way!

Comments welcome on any other sections of this code as well.

Here's a simplified version on the playground for convenience. The comments and logic may no longer make sense but I think I've captured the borrowing situation.

I don't think it's ok. As far as rustc is concerned, you're aliasing a mutable reference, and that's undefined behavior.

I've been working off of @quinedot's playground; I think it resembles your situation closely enough. The problem is that you're keeping staging alive across loop iterations, and the borrow checker can't see that you're dropping all of its contents later on. One easy solution is to just recreate staging in every loop iteration. It seems somewhat wasteful, but in my experience the optimizer can easily see through this and remove the extra allocations (or the allocator is smart enough to reuse the memory here). You might want to check whether this is a hot spot anyway.
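
Roughly what I mean, as a standalone sketch rather than the actual code from the PR (the four output groups and the tab delimiter are made up for illustration):

use std::io::BufRead;

fn process(reader: impl BufRead) -> std::io::Result<()> {
    for line in reader.lines() {
        let line = line?;
        // A fresh Vec<Vec<&str>> per line: its borrows of `line` end with
        // this iteration, so no unsafe is needed, at the cost of the new
        // allocations the original code was trying to avoid.
        let mut staging: Vec<Vec<&str>> = vec![Vec::new(); 4];
        for (pos, part) in line.split('\t').enumerate() {
            staging[pos % 4].push(part);
        }
        for (i, value) in staging.iter().flatten().enumerate() {
            if i > 0 {
                print!("\t");
            }
            print!("{}", value);
        }
        println!();
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let stdin = std::io::stdin();
    process(stdin.lock())
}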

Adapting @quinedot’s playground, not the original:

You could check whether something like this (playground link) compiles to the desired result. In principle, doing .into_iter().map(…).collect() on a Vec<T>, turning it back into a Vec<T> or even into a Vec<T1> where T1 has the same size and alignment as T, should operate in place due to specializations in the standard library using the InPlaceIterable trait. I haven’t had the time to check whether this approach does indeed optimize in the desired way.

The idea is to have a Vec<Vec<&'static str>> outside of the loop, start each iteration by turning it into a Vec<Vec<&'lifetime str>> by variance, then work with that, and after draining every inner vec, turn it back into a Vec<Vec<&'static str>>. This works by using the into_iter-map-collect approach on two levels. I don’t know if the take(0) I’ve added helps or hurts; I thought it might help the optimizer be sure that the vec is empty at this point, and one should probably look at the assembly (or do benchmarking) with and without the take(0) call. Another thing I don’t know is whether, instead of .map(|_| ""), something like .map(|_| unreachable!()) is better, or if it doesn’t make a difference.
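
In case the playground is hard to follow out of context, the shape of the idea is roughly this self-contained sketch (the line reading, the four groups, and the tab delimiter are stand-ins of mine, not the code from the PR):

use std::io::{self, BufRead};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let mut reader = stdin.lock();
    let mut line = String::new();
    // Lives across iterations; the &'static str item type means it never
    // borrows from `line`.
    let mut staging_empty: Vec<Vec<&'static str>> = vec![Vec::new(); 4];
    loop {
        line.clear();
        if reader.read_line(&mut line)? == 0 {
            break;
        }
        // Variance: Vec<Vec<&'static str>> coerces to Vec<Vec<&'line str>>,
        // so the existing allocations can now hold borrows of `line`.
        let mut staging: Vec<Vec<&str>> = staging_empty;
        for (pos, part) in line.trim_end().split('\t').enumerate() {
            staging[pos % 4].push(part);
        }
        // Drain every inner vec (this is where the values get written out),
        // then collect back to the &'static str item type; the in-place
        // specialization should reuse the allocations rather than freeing
        // and reallocating them.
        staging_empty = staging
            .into_iter()
            .map(|mut values| {
                for value in values.drain(..) {
                    print!("{}\t", value);
                }
                values.into_iter().take(0).map(|_| "").collect()
            })
            .collect();
        println!();
    }
    Ok(())
}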

Moving staging into the loop makes it take about twice as long :frowning:

Good point on the mutable aliasing though!

Just to be sure, you are running and measuring in release mode?

Too bad. I once had a similar problem, and I was really surprised that allocating inside the loop had no overhead.

Yep, release mode and everything! It makes sense that it would be slower here; given the lifetimes involved, I don't think the optimizer could see a way to reuse that vec.

This is close to the performance using unsafe. I haven't dug into the assembly or looked at how it's optimizing yet.

How does ^ work / where could I read a bit more about that?

Looking around the internet a bit more, someone pointed out that this problem is similar to the one threads have, where the expensive resource (the thread) outlives the values sent to it for computation. I haven't figured out how to apply the scoping idea here, but it also seems promising.

You should look here if you haven’t read that yet:

Subtyping and Variance - The Rustonomicon

Subtyping gives rise to subtyping coercion (first bullet point here), i.e. if T is a subtype of S then I can coerce values of type T into type S.

The typical example is that &'a T can be converted into &'b T if 'b is a shorter lifetime than 'a, but the same mechanism also allows converting Vec<&'a T> into Vec<&'b T> or even Vec<Vec<&'a T>> into Vec<Vec<&'b T>>.
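
A tiny self-contained example of that coercion, just for illustration (not from the playground):

// Vec is covariant in its item type, and &'static str is a subtype of
// &'a str, so this conversion is a no-op coercion; nothing runs at runtime.
fn shorten<'a>(v: Vec<Vec<&'static str>>, _anchor: &'a str) -> Vec<Vec<&'a str>> {
    v
}

fn main() {
    let local = String::from("only lives inside main");
    let mut shortened = shorten(vec![vec!["static string"]], &local);
    // Now the vec may also hold references that don't live forever:
    shortened[0].push(&local);
    println!("{:?}", shortened);
}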

I have a feeling you could avoid those new allocations with some shenanigans involving Vec::from_raw_parts and Vec::into_raw_parts, but I couldn't really do it with confidence (partially because of the "vec of vec"s, which add an additional layer...).

@duck_tape In case that wasn’t clear, this step is the line

let mut staging = staging_empty;

A new variable inside the loop is needed because variables can’t change their type, and the lifetime of the strings is, conceptually, a new lifetime for every loop iteration. But variables don’t cost anything. The variable staging then has the type Vec<Vec<&'some_shorter_lifetime str>>. The coercion happens implicitly.

@steffahn That makes sense and is pretty nifty. Any recommendations on how to get visibility into exactly what the compiler is doing there and whether the InPlaceIterable specializations take place? Running with nightly produces the same performance as stable as it stands now, which makes me think it's not yet as optimal as possible.

Overall I think I need to think hard about reorganizing the entire loop to try to get rid of the staging vec entirely. I just can't imagine how to do that yet.

Here's a branch with your solution baked in, in case anyone wants to see it with full context: https://github.com/sstadick/hck/tree/remove_unsafe

I played around with that a bit and agree, the nested layers make it complex enough that I'd rather just not.

I wouldn’t know how to get good insight into this, but at least it’s not an LLVM optimization but a hand-coded specialization in the standard library, so it’s in a sense more reliable. It even applies in debug mode. You can be fairly sure that there won’t ever be any allocations introduced by these .collect() calls in the future. Really, I wouldn’t have been surprised if there was no overhead at all from my solution. So if there still is some, it’s either LLVM failing to optimize away some stuff, or maybe some differences in the panic paths getting in the way of certain optimizations?

One thing you could try, for example, is whether replacing

values.into_iter().take(0).map(|_| "").collect()

with a transmute [i.e. unsafe { mem::transmute(values) }] makes a difference in performance. (I wouldn’t be surprised if it doesn’t make a difference though.)
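
Concretely, I mean something along these lines. This is only a sketch, and it is only sound because the vec is already empty at that point:

use std::mem;

// Convert an emptied Vec<&'a str> back into a Vec<&'static str> without the
// take(0)/map/collect round trip. Only sound because the vec holds no
// elements whose lifetime would be lengthened by the cast.
fn recycle<'a>(mut values: Vec<&'a str>) -> Vec<&'static str> {
    values.clear();
    // SAFETY: the vec is empty, and Vec<&'a str> and Vec<&'static str>
    // differ only in a lifetime, which has no runtime representation.
    unsafe { mem::transmute(values) }
}

fn main() {
    let s = String::from("short-lived");
    let v: Vec<&str> = vec![s.as_str()];
    let recycled: Vec<&'static str> = recycle(v);
    // The allocation is kept; no reallocation happened.
    println!("capacity: {}", recycled.capacity());
}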

Regarding optimizations around this loop in general, you could also do something like turning

for value in values.drain(..) {
    if print_delim {
        write!(&mut writer, "{}", &opts.output_delimiter).unwrap();
    }
    print_delim = true;
    write!(&mut writer, "{}", value).unwrap();
}

into

let mut drain = values.drain(..);
if let Some(value) = drain.next() {
    write!(&mut writer, "{}", value).unwrap();
}
for value in drain {
    write!(&mut writer, "{}", &opts.output_delimiter).unwrap();
    write!(&mut writer, "{}", value).unwrap();
}

or

let mut drain = values.drain(..);
if let Some(value) = drain.next() {
    write!(&mut writer, "{}", value).unwrap();
}
for value in drain {
    write!(&mut writer, "{}{}", &opts.output_delimiter, value).unwrap();
}

makes any difference.

Okay, I believe I understand everything that is happening.

We create staging_empty outside the loop with &'static str as the item type. Then in the loop we create a new variable, coercing the &'static str lifetime to a shorter lifetime that fits the loop iteration.

Then we iterate over staging and drain each inner vec, ensuring they are empty. Then we return an effectively empty vec created by collecting an iterator over the now-empty values.

We are relying on the InPlaceIterable specialization to make the collect calls reuse the underlying memory, and its guarantees should apply since the item type keeps the same size and alignment in both instances.

The InPlaceIterable stuff feels a little hand-wavy, but it seems to be working.

Replacing values.into_iter().take(0).map(...).collect() with iter::empty().collect() seems to keep the same performance, which I think makes sense, and it looks a little tidier.

That change should introduce overhead though. Not using values as the basis for the iterator means that you’re deallocating the memory for all the values vectors. iter::empty().collect() is pretty much the same as Vec::new(). In particular, this will make the subsequent staging[pos].push(part) calls of later iterations more expensive.
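
A quick way to see the difference (just an illustration; what the in-place version reports for capacity may depend on the Rust version):

fn main() {
    let mut values: Vec<&str> = Vec::with_capacity(64);
    values.push("x");
    values.clear();

    // Goes through the in-place collect machinery and can keep the buffer.
    let recycled: Vec<&'static str> = values.into_iter().take(0).map(|_| "").collect();
    // Equivalent to Vec::new(): the old buffer is gone, capacity is 0,
    // and the next pushes will have to allocate again.
    let fresh: Vec<&'static str> = std::iter::empty().collect();

    println!("recycled capacity: {}", recycled.capacity());
    println!("fresh capacity: {}", fresh.capacity());
}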

By the way, the optimization of x.into_iter().take(0).map(|_| "").collect() can be seen here

Compiler Explorer

My benchmarks are unreliable it seems. :upside_down_face:

Which may explain why nightly and stable seem to have about the same performance, even though I would expect nightly to be faster since it can use the InPlaceIterable trait.

Oh, no, that trait is used on stable as well. It's unstable for use by anything external to std, but the standard library itself can use it internally, and it does use it on both stable and nightly.
