Yes, I am running it with --release.
Thanks you, I will try it. So in your view, is there some obvious "wrong" codes which can massively slow down the speed?
This line requires reallocating the contents of hold_set:
hold_set = &hold_set - &stop_set;
The reallocation can be avoided by modifying hold_set in-place:
for stop in stop_set.iter() {
hold_set.remove(stop);
}
(Though note that the optimal version of this is to iterate over whichever set is smaller, so if stop_set might be significantly larger than hold_set, that's worth doing — the other way around would work by calling hold_set.retain() to decide whether to keep each item.)
Yes, I tried it, but the result is the same. I guess this changing can speed up the code, but for the data: Vec<f32> I am using, there is no difference. So I guess there must lie some other codes slowing down the performance.
It feels like the inner-most loop will re-process already-processed data quite a lot.
I feel like I'm missing a lot, but it looks like you can run it at most once for every element that was not just added to hold_set.
i.e.if something was added to hold_set in the first outermost loop iteration, every iteration it stays, it will take longer to process, since you're re-chrcking against index 0..(j-1) even though you already know the answer
How large are your vectors? What you've written looks pretty fundamentally an O(n³) algorithm. Let's say you can process units of work in the loop at 5GHz. With 2000 data points in your vector at O(n³), that'll take ~2s. With 3000 data points, ~5s. 4000, 13s; 5000, 25s; 6000, 43s; 10000, 200s; 20000, 30m.
There are probably some data points less than 0, so you'll grow somewhat under O(n³). The main thing you'll want to do is figure out ways to just do less work.
The outer i loop seems necessary. However, there's one place I think there's wasted work that should be fairly simple to cut out:
Each iteration of i, you're comparing all of hold_set against 0..i. For every value except the one you just added, you already checked it against 0..i-1, and the answer isn't going to have changed.
Fixing that duplicated work will still leave you at O(n²) work, but that means 20000 units of work at 5GHz will only take 80ms instead of 30m. That's a big difference.