Review of unsafe usage

Just to be sure, you are running and measuring in release mode?

Too bad. I once had a similar problem, and it really surprised me that allocating inside the loop had no overhead.

Yep, release mode and everything! It makes sense that it would be slower here given the lifetimes involved; I don't think the optimizer could see a way to reuse that vec.

This is close to the performance of the unsafe version. I haven't dug into the assembly or looked at how it's being optimized yet.

How does ^ work / where could I read a bit more about that?

Looking around the internet a bit more, someone pointed out that this problem is similar to one threads have, where the expensive resource (the thread) outlives the values sent to it for computation. I haven't figured out how to apply the scoping idea here, but it also seems promising.

You should look here if you haven’t read that yet:

Subtyping and Variance - The Rustonomicon

Subtyping gives rise to subtyping coercion (first bullet point here), i.e. if T is a subtype of S, then I can coerce values of type T into type S.

The typical example is that &'a T can be converted into &'b T if 'b is a shorter lifetime than 'a, but the same mechanism also allows converting Vec<&'a T> into Vec<&'b T> or even Vec<Vec<&'a T>> into Vec<Vec<&'b T>>.


I have a feeling you could avoid those new allocations with some shenanigans involving Vec::from_raw_parts and Vec::into_raw_parts, but I couldn't really do it with confidence (partly because of the "vec of vec"s, which add an additional layer...).

@duck_tape In case that wasn’t clear, this step is the line

let mut staging = staging_empty;

A new variable inside the loop is needed because variables can't change their type, and the lifetime of the strings is, conceptually, a new lifetime for every loop iteration. But variables don't cost anything. The variable staging then has the type Vec<Vec<&'some_shorter_lifetime str>>. The coercion happens implicitly.


@steffahn That makes sense and is pretty nifty. Any recommendations on how to get visibility into exactly what the compiler is doing there, and whether the InPlaceIterable specializations take place? It looks like running with nightly produces the same performance as stable as it stands now, which makes me think it's not yet as optimal as possible.

Overall I think I need to think hard about reorganizing the entire loop to try to get rid of the staging vec entirely. I just can't imagine how to do that yet.

Here's a branch with your solution baked in case anyone wants to see it with full context: GitHub - sstadick/hck at remove_unsafe

I played around with that a bit and agree, the nested layers make it complex enough that I'd rather just not.

I wouldn’t know how to get good insight into this, but at least it’s not an LLVM optimization but a hand-coded specialization in the standard library, so it’s in a sense more reliable. And it even applies in debug mode. You can be fairly sure that there won’t ever be any allocations introduced by these .collect() calls in the future. Really, I wouldn’t have been surprised if there was no overhead at all from my solution. So if there still is some, it’s either LLVM failing to optimize away some stuff, or maybe some differences in the panic paths getting in the way of certain optimizations?

One thing for example that you could try is if replacing

values.into_iter().take(0).map(|_| "").collect()

with a transmute [i.e. unsafe { mem::transmute(values) }] makes a difference in performance. (I wouldn’t be surprised if it doesn’t make a difference though.)

Regarding optimizations around this loop in general, you could also do something like turning

for value in values.drain(..) {
    if print_delim {
        write!(&mut writer, "{}", &opts.output_delimiter).unwrap();
    }
    print_delim = true;
    write!(&mut writer, "{}", value).unwrap();
}

into

let mut drain = values.drain(..);
if let Some(value) = drain.next() {
    write!(&mut writer, "{}", value).unwrap();
}
for value in drain {
    write!(&mut writer, "{}", &opts.output_delimiter).unwrap();
    write!(&mut writer, "{}", value).unwrap();
}

or

let mut drain = values.drain(..);
if let Some(value) = drain.next() {
    write!(&mut writer, "{}", value).unwrap();
}
for value in drain {
    write!(&mut writer, "{}{}", &opts.output_delimiter, value).unwrap();
}

makes any difference.

Okay, I believe I understand everything that is happening.

We create staging_empty outside the loop with &'static str elements. Then, in the loop, we create a variable that coerces the &'static str lifetime down to a lifetime that fits the loop iteration.

Then we iterate over staging and drain each inner vec, ensuring they are empty. Then we return an effectively empty vec created by collecting an iterator over the now-empty values.

We are relying on the InPlaceIterable specialization to optimize the collect calls to reuse the underlying memory, because the InPlaceIterable guarantees should apply since we are producing items of the correct size in both instances.

The InPlaceIterable stuff feels a little hand-wavy, but it seems to be working.

Replacing values.into_iter().take(0).map(...).collect() with iter::empty().collect() seems to keep the same performance which I think makes sense and looks a little tidier.

That change should introduce overhead though. Not using values as the basis for the iterator means that you’re deallocating the memory for all the values vectors: iter::empty().collect() is pretty much the same as Vec::new(). In particular, this will make the subsequent staging[pos].push(part) calls of later iterations more expensive.

By the way, the optimization of x.into_iter().take(0).map(|_| "").collect() can be seen here

Compiler Explorer


My benchmarks are unreliable it seems. :upside_down_face:

Which may explain why nightly and stable seem to have about the same performance, even though I would expect nightly to be faster since it can use the InPlaceIterable trait.

Oh, no, that trait is used on stable as well. It’s unstable for use by anything external to std, but the standard library itself can use it internally, and does so on both stable and nightly.


:exploding_head:

I'm not awesome at reading assembly, but as far as I can tell no allocations happen in the staging_empty = staging...

And just to round it out fully, if I swap iter::empty().collect() into that godbolt example, you can see the allocation slotted in there. So the InPlaceIterable specialization appears to be working, which is super cool.

Thanks @steffahn! I learned a ton in this thread :+1:

Thanks for making a playground example of this. That proved to be vital :+1:

