Unbelievably slow code, what am I doing wrong?

Here is the code:

```rust
use std::collections::HashSet;

use chrono::NaiveDateTime;

#[derive(Debug, Clone)]
pub enum Trade {
    TradeNo,
    TradeOpen(usize),
    TradeExit(Vec<usize>),
}

fn oo(time_vec: &Vec<NaiveDateTime>, data: Vec<f32>) -> (Vec<Trade>, Vec<Trade>) {
    let mut o_vec = vec![Trade::TradeNo; time_vec.len()];
    let mut e_vec = vec![Trade::TradeNo; time_vec.len()];
    let mut hold_set: HashSet<usize> = HashSet::new();
    let mut stop_set: HashSet<usize> = HashSet::new();
    for i in 0..o_vec.len() {
        if data[i] > 0f32 {
            o_vec[i] = Trade::TradeOpen(i);
            hold_set.insert(i);
        }
        stop_set.clear();
        for x in hold_set.iter() {
            let open_time = time_vec[*x];
            // walk backwards to the most recent index at or before the open time
            for j in (0..i).rev() {
                if time_vec[j] <= open_time {
                    if data[j] < 0f32 {
                        stop_set.insert(*x);
                    }
                    break;
                }
            }
        }
        if stop_set.len() != 0 {
            e_vec[i] = Trade::TradeExit(stop_set.iter().cloned().collect());
            hold_set = &hold_set - &stop_set;
        }
    }
    (o_vec, e_vec)
}
```

The function oo is too slow, so something in its body must be wrong, but I cannot figure out which part. Perhaps the cloned()?

Are you running with --release?

Consider running this under flamegraph to find hot spots.
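For example, assuming the cargo-flamegraph subcommand (the env var and output file name below are from its documentation):

```shell
# One-time install of the flamegraph cargo subcommand
cargo install flamegraph

# Keep debug symbols in the release build so the frames are readable,
# then profile; this writes flamegraph.svg in the current directory
CARGO_PROFILE_RELEASE_DEBUG=true cargo flamegraph

# Open flamegraph.svg in a browser and look for the widest frames
```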

Yes, I am running it with --release.
Thank you, I will try it. So in your view, is there some obviously "wrong" code that could massively slow things down?

Triply nested iterations look suspicious 🤨

This line requires reallocating the contents of hold_set:

```rust
hold_set = &hold_set - &stop_set;
```

The reallocation can be avoided by modifying hold_set in-place:

```rust
for stop in stop_set.iter() {
    hold_set.remove(stop);
}
```

(Though note that the optimal version of this iterates over whichever set is smaller, so if stop_set might be significantly larger than hold_set, that's worth handling — going the other way around works by calling hold_set.retain() to decide whether to keep each item.)


Yes, I tried it, but the result is the same. I guess this change can speed up the code, but for the data: Vec<f32> I am using, there is no difference. So I guess some other code must be slowing down the performance.

It feels like the inner-most loop will re-process already-processed data quite a lot.
I feel like I'm missing a lot, but it looks like it needs to run at most once for every element — for every element that was not just added to hold_set, it has already run.

i.e. if something was added to hold_set in the first outermost loop iteration, every iteration it stays in the set takes longer to process, since you're re-checking it against indices 0..(i-1) even though you already know the answer.

How large are your vectors? What you've written looks like a pretty fundamentally O(n³) algorithm. Let's say you can process units of work in the loop at 5GHz. With 2000 data points in your vector at O(n³), that'll take ~2s. With 3000 data points, ~5s. 4000, 13s; 5000, 25s; 6000, 43s; 10000, 200s; 20000, 30m.

There are probably some data points less than 0, so you'll grow somewhat under O(n³). The main thing you'll want to do is figure out ways to just do less work.

The outer i loop seems necessary. However, there's one place I think there's wasted work that should be fairly simple to cut out:

Each iteration of i, you're comparing all of hold_set against 0..i. For every value except the one you just added, you already checked it against 0..i-1, and the answer isn't going to have changed.

Fixing that duplicated work will still leave you at O(n²) work, but that means 20000 units of work at 5GHz will only take 80ms instead of 30m. That's a big difference.
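To illustrate the shape of that fix on a deliberately simplified model (this is not the oo logic — the predicate here is just "some value at or after the open index has gone negative", and the function names are invented): the naive version rescans every open element's whole range on every step, while the incremental version only processes the single index that became newly available, cutting O(n³) down to O(n²).

```rust
// Toy model: element x (opened at index x) "stops" at step i once some
// j with x <= j < i has data[j] < 0.0. Returns the first such step per x.

fn first_stop_naive(data: &[f32]) -> Vec<Option<usize>> {
    let n = data.len();
    let mut out = vec![None; n];
    for i in 0..n {
        for x in 0..=i {
            // every open element rescans its whole range, every step: O(n³)
            if out[x].is_none() {
                for j in x..i {
                    if data[j] < 0.0 {
                        out[x] = Some(i);
                        break;
                    }
                }
            }
        }
    }
    out
}

fn first_stop_incremental(data: &[f32]) -> Vec<Option<usize>> {
    let n = data.len();
    let mut out = vec![None; n];
    for i in 1..n {
        // only index i-1 is new since the last step; for every other
        // (element, index) pair the answer is already known
        let new_j = i - 1;
        if data[new_j] < 0.0 {
            for x in 0..=new_j {
                if out[x].is_none() {
                    out[x] = Some(i);
                }
            }
        }
    }
    out
}

fn main() {
    let data = [1.0, -2.0, 3.0, -4.0, 5.0];
    assert_eq!(first_stop_naive(&data), first_stop_incremental(&data));
    println!("{:?}", first_stop_incremental(&data));
    // [Some(2), Some(2), Some(4), Some(4), None]
}
```

The same trick applies to oo: run the full backward scan for an element once, when it joins hold_set, and on later iterations only consider the newly available index.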