Performance issue - High complexity code

I bought RUST by its performance so I decided to translate one project where performance matters a lot, from JAVA 11 to Rust.

The thing is the Version written in JAVA performance pretty much better more than 3x in single thread, +10X in multi thread

For context purpose: The most complex code is a function that trys to find an assigment between 2 sets, imagine that you have houses and stores, the stores have a fixed capacity and houses have necesities, you want to find the best assignment to walk less.

With all this in mind, I guess that the problem is how I use the variables, maybe clone() is called too much automatically, maybe reference access cause some unknown behavior.

Any upgrade that reduce while loop time will be great because it iterate over 5000 times. Sorry for the long code but I think everything is relevant in this case. You can't copy and paste this code if you want I can send you the git project link Here.

PD: I'm running with cargo run --release

pub fn evaluate(elem: &Element) -> EvaluatedElement {

    let p1 = properties::get_cast::<f64>("p1");
    let p2 = properties::get_cast::<usize>("p2");
    let p3 = properties::get_cast::<usize>("p3");
    let p4 = properties::get_cast::<f64>("p4");
    let p5 = properties::get_array::<usize>("p5");

    let mut kinds1 = kind1::get_map(); //almost 300 elements
    let kinds2 = = kind2::get_map(); //almost 300 elements

    let usables = elem.usables();

    for (i, &a) in usables.iter().enumerate() {
        if !a {
            &kinds1.remove(&(i + 1));
        }
    }

    let mut assignations = HashMap::new();

    for k in (1..=p2).rev() {
        let mut kinds2_sub = HashMap::with_capacity((&kinds2).len());
        for (_, p) in kinds2.iter() {
            if p.val1[k - 1] == 0 {
                continue;
            }

            &kinds2_sub.insert(p.id, Kind2Sub {
                parent: p.clone(),
                val2: p.val1[k - 1],
                val3: std::f64::MAX,
                kind1_id: std::usize::MAX,
            });
        }

        let mut opt_kind1_id: Option<usize> = Option::None;

        while !&kinds2_sub.is_empty() {//arround 5500 times loop

            for mut l in kinds2_sub.values_mut() {
                match opt_kind1_id {
                    None => (),
                    Some(id) => if !l.kind1_id == id { continue; },
                }

                l.val3 = std::f64::MAX;
                l.kind1_id = std::usize::MAX;

                for b in kinds1.values_mut() {
                    let dist_b_l = calc_dist(b.id, l.id);
                    if dist_b_l > p4
                        || (p1 as usize).min(l.val2) > p4 + b.val3
                        || b.val2 < k
                        || (l.val2 < (2 * p4) && (b.val3 as i16 - l.val2 as i16) < 0)
                    { continue; }

                    let tmp = dist_b_l * p1.min(l.val2 as f64);

                    if l.val3 > tmp {
                        l.val3 = tmp;
                        l.kind1_id = b.id;
                    }
                }
            }

            let lc = kinds2_sub.values_mut().min_by(|x, y| x.val3.partial_cmp(&y.val3).unwrap()).unwrap();
            let obc = kinds1.get_mut(&lc.kind1_id);
            let bc = obc.unwrap_or_else(|| {
                panic!("No assignation able")
            });
            let b_c_id = (*bc).id;
            let l_c_id = (*lc).id;

            let time = if lc.val2 < (2usize * p1 as usize) { lc.val2 } else { p1 as usize };
            let val = (*bc).val3 as i16 - time as i16;

            let assignation = Assignation { kind1_id: (*bc).id, kind2_id: lc.id, val3: k, val4: 0 };
            let assignation_id = assignation.id();//id() = fn concatenate first 3 values
            if !assignations.contains_key(&assignation_id) {
                assignations.insert(assignation.id(), assignation);
            }
            let mut assignation = assignations.get_mut(&assignation_id).unwrap_or_else(|| panic!("Assignation not found {}", assignation_id));

            if val >= 0 {
                assignation.val4 += time;
                lc.val2 -= time;
                (*bc).val3 -= time;
            } else {
                assignation.val4 += (*bc).val3;
                lc.val2 -= (*bc).val3;
                (*bc).val3 = 0;
            }

            if (*bc).val3 < p4 {
                &kinds1.remove(&b_c_id);
            }

            if lc.val2 == 0 {
                &kinds2_sub.remove(&l_c_id);
            }

            opt_kind1_id = Some(b_c_id);
        }
    }


    let assignations_values = assignations.iter().map(|(_, v)| v.clone()).collect();

    EvaluatedElement::evaluation(assignations_values)
}

Have you tried a profiler? e.g. perf or Xcode instruments?

The code seems to be hashmap-heavy. Rust standard library uses a hasher that is safe against collision attacks, but it's not the fastest. Try using hashmaps with ahash.

Also try building with RUSTFLAGS="-Ctarget-cpu=native"

1 Like

Here you mention 3 things

  • Profiler: where can I find info to use it?
  • Hasher from ahash, I read other topic about fxhash what is the difference between both?
  • RUSTFLAG: How can I set only for this project, I can set it on {home}/.cargo/config

/path/to/project/.cargo/config

1 Like
cargo build
warning: unused config key `build.RUSTFLAGS` in `C:\Users\...\V3\backend\.cargo\config`

Maybe is in lowercase RUSTFLAGS

Needs to be lower case:

[build]
rustflags = "-Ctarget-cpu=native"

Since you're on Windows, you should try Windows Performance Analyzer.

Profilers depend on what platform you use. In general, profiling Rust is the same as profiling C, so if you find instructions for profiling a C executable, you can do the same thing with a Rust executable.

For example on macOS you launch Instruments.app, select time profiler, select Rust binary to run, and hit record.
For Linux there's perf which requires some fiddling with command line and cachegrind for visualization.


ahash is basically the same as fxhash.

For profiling I've found [cargo-flamegraph] (https://github.com/flamegraph-rs/flamegraph) to be incredibly easy to use. I'm not sure how good Windows support is, but on macOS I've found it works well and is generally quite useful. It doesn't give information on things like memory usage or where allocations are occuring, but it can at least help you determine which functions are taking up the most time, and therefore point you to where you should focus your optimization efforts.

1 Like

Update:

  • flamegraph has some kind of problem and I can't run
  • I replace hashMap with fxHash, without improvements
  • I download WPR & WPA (profiler 4 Windows) but I still trying to use it
  • I read abour targetCpu it is used in order to optimize binaries to my CPU :smiley: (no improvements)
  • I push the code, If you want I think just download and cargo run :confused:

I will try vec_map since my key are short integers.
then I will borrow a MacOS to try a Profiler, I didnt understand WPA, it show my process but without many information

Did you confirm that the Rust version does the same computation as the Java version? e.g. by comparing the number of times of the innermost loop is executed.
I read both versions and I think the difference of code is too little to justify "3x performance difference" while there are things can be tuned. If things doing are different, a comparison of languages is misleading.

1 Like

I think Java is using 90 - 100% CPU but Rust 30 - 50 %
The code is the same I just translate it, you can read it in the same repo, V2 folder

Then: I read some user had performance issues on windows

Like others say learn to read profiles. Expensive hash iterator sticks out.
For 40%

let v = bahias.values().collect::<Vec<_>>();
for mut l in kinds2_sub.values_mut() {
...
    for b in &v {
    //for b in kinds1.values_mut() {

(I would not recommend targetting development using nightly other than for future compatibility testing.)

Im here from a Mac :smiley:

  • Here is the Flamegraph <- What is the meaning of Samples? (9,400 samples 1.40%)
  • It performs a bit better but not enough 10% - 15% better than the Windows one
  • I will check bahias.values iteration
  • Nightly is because I have Rocket on the project, Do you know an easy way to listen http port?
  • Now, a doubt comes to me:
fn best_improve(s: SolucionEvaluada) -> SolucionEvaluada {
    let mut timer = crate::common::timer::Timer::start();

    let _n = 1;//properties::prop().n;
    let mut _a = s;
    for i in 0.._n {//25sec
        _a = _a.vns1swap().par_iter().map(super::evaluador::evaluate).min_by(|x, y| x.cost().partial_cmp(&y.cost()).unwrap()).unwrap();
        println!("\t\t\t\t\tBI {}: {}\t{}\t{}", i, _a.cost(), _a.count(), timer.lap());
    }
    _a
}

The previous way to find the best neighbor in parallel stream could be introducing an overhead?
Because, in single thread I have almost the same time.

The flame graph is hard to analyse because you are in rayon in different contexts calling into the evaluate function.

To optimise the evaluate function, try removing the rayon par_iter work so the flame graph is just showing evaluador being called in a loop. Also, break the evaluator into functions for create_kind2_sub and the individual bits in your function so you can see what's being done on each bit. (flame graph won't tell you about each line like many tools will so it breaks down in scientific code that is often large complicated functions.

Eventually when you make it parallel again you probably want to make the for b in kinds1.values_mut() loop parallel but if you're updating values in place this won't be possible without locks (which will prevent good performance).

A superior strategy would be to stick all the data into a kdtree and ask it what the nearest value is for each point.

At that level you look at annotated assembly.

No idea why you get underperforming CPU usage. You could (probably) use par_chunks() but that will only give an improvement if each step takes so little computation.

Here times on avanced step:

        Step  Value   Time  Used Stores (Bay)
RUST -> BI 90 2672540 28057 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
Java -> BI 90 2672625  4704 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

As you see it is 6X worst, I will split evaluate fn in order to get more information, but I need the mac again :upside_down_face:

YEAH!!!! My god! I never give up :stuck_out_tongue:

        Step  Value   Time  Used Stores (Bay)
RUST -> BI 90 2672540 28057 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
Java -> BI 90 2672625  4704 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
NEW: -> BI 90 2672540  1093 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

Now I have a 4X increase, you can't imagine how I feel Right now :smiley:

  • The "bug"
match opt_kind1_id {
     None => (),
     Some(id) => if !l.kind1_id == id { continue; },
}
  • The fix
    if let Some(id) = opt_kind1_id {
        if l.kind1_id != id {
            continue;
        }
    }

This continue skips 90% of finding a new value

Thank you so much guys I learn a lot this days :smiley:

6 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.