Strange Performance and Memory Issues


#1

I am writing an algorithm (poorly) which has somehow incurred extreme memory consumption and slowness.

The algorithm started out in JavaScript which was then ported to Dart. I have also tried porting it to Swift (but was slower than Dart). I finally ported it to Rust which is having severe issues.

The algorithm is setup as stages. The stage which has issues is compute_matches_b. I have a descending iteration counter implemented for that stage so the iterations can be viewed in the terminal. There is only one variable to change: players. Players has to be set at intervals of 4 (4, 8, 12, etc). The problem is for 20 players there is 11,884 iterations with each having a varying workload.

The implementation for Dart and Rust are almost identical. With Rust, I split the matches stage into matches_a and matches_b due to lifetime issues (I am still learning the language).

When running in Dart with 16 players, compute_matches:
• takes around 1.5s

When running in Rust with 16 players, compute_matches_a + compute_matches_b:
• takes around 1.4s

When running in Dart with 20 players, compute_matches:
• takes around two hours
• memory consumption maxes out at 20GB around the last quarter (< 1,000 mark)
• gets down to 5,000 in a short amount of time

When running in Rust with 20 players, compute_matches_b:
• takes around ?? hours or ?? days
• memory consumption maxes out at around 70GB around the 7,500 mark
• gets down to 7,440 after a long time and exits (signal: 9)

I ran them both side by side and Dart absolutely trounces Rust during this stage for 20 players. By the time Rust gets down to 11,500, Dart is already at around 5,500. For <= 16 players – and all other stages of the algorithm – it is the opposite.

I am using version 1.1.0 on OS X 10.10, but it also exhibits that same behaviour with version 1.0.0.

As you can see, this behaviour definitely seems odd.

OS X has memory compression which allows such massive amounts of memory beyond the hardware’s limits (16GB physical memory) to to be used without resorting to a swapfile. This still may be the cause of the slowness as OS X may not be able to keep up with the demand, but there is still the problem of the excessive memory consumption which the Dart version does not have.

Below is the terminal output for 20 players (ellipses denote omission for brevity)

cargo run --release --verbose
    Fresh rustc-serialize v0.3.14
    Fresh gcc v0.3.5
    Fresh libc v0.1.8
Compiling rust v0.1.0 ...
  Running 'rustc src/main.rs --crate-name rust --crate-type bin -C opt-level=3...'
    Fresh time v0.1.25
  Running 'target/release/rust'
11884
...
7440
Process didn't exit successfully: 'target/release/rust' (signal: 9)

And below is the Rust code (apologies for the quality of the code being substandard). Scroll down for the problem compute_matches_b function:

use std::collections::*;

fn has_all_ranks(round: &Vec<&Vec<&Vec<i8>>>, rank_tally: &mut Vec<bool>) -> bool {
    for teams in round.iter() {
        for team in *teams {
            for rank in *team {
                if rank_tally[(*rank - 1) as usize] {
                    for mark in rank_tally.iter_mut() {
                        *mark = false;
                    }
                    return false;
                }
                else {
                    rank_tally[(*rank - 1) as usize] = true;
                }
            }
        }
    }
    for mark in rank_tally.iter_mut() {
        *mark = false;
    }
    true
}

fn compute_pairs(ranks: &Vec<i8>) -> Vec<Vec<i8>> {
    let count = ranks.len();
    let mut pairs: Vec<Vec<i8>> = Vec::with_capacity(count * count / 2 - count / 2);
    for outer in 0..count {
      for inner in outer + 1..count {
          pairs.push(vec![ranks[outer], ranks[inner]]);
      }
    }
    pairs
}

fn compute_evens(ranks: &Vec<i8>) -> HashSet<Vec<i32>> {
    let largest_even: i32 = ranks[ranks.len() - 4 .. ranks.len()]
        .iter()
        .fold(0, |total, &value| total + value as i32);

    let total: i32 = ranks
        .iter()
        .fold(0, |total, &value| total + value as i32);

    let matches = ranks.len() as i32 / 4;
    let minimum = 10;
    let maximum = total / 2;

    let mut result = vec![0; matches as usize];
    let mut results: HashSet<Vec<i32>> = HashSet::new();

    fn recurse(remainder: i32, minimum: i32, maximum: i32, matches: i32, largest_even: i32,
        depth: i32, result: &mut Vec<i32>, results: &mut HashSet<Vec<i32>>) {

        if depth + 1 == matches {
            if remainder < minimum || remainder > largest_even {
              return;
            }

            result[depth as usize] = remainder;

            if result.iter().fold(0, |x, &y| x + y) / 2 != maximum {
                return;
            }

            let mut copy = result.clone();
            copy.sort();

            if results.contains(&copy) {
                return;
            }

            results.insert(copy);
            return;
        }

        let mut number = minimum;

        while number <= maximum {
          result[depth as usize] = number;

          // minimum as row number from previous player quantity
          recurse(remainder - number, minimum + 8, maximum, matches, largest_even, depth + 1, result, results);
          number += 2;
        }
    }
    recurse(total, minimum, maximum, matches, largest_even, 0, &mut result, &mut results);
    results
}

fn compute_teams<'a>(pairs: &'a Vec<Vec<i8>>, evens: &HashSet<Vec<i32>>) -> Vec<Vec<Vec<&'a Vec<i8>>>> {
    let mut results: Vec<Vec<Vec<&Vec<i8>>>> = Vec::with_capacity(evens.len());
    for even in evens.iter() {
        let mut current = Vec::with_capacity(even.len());
        for match_rank in even.iter() {
            let team_rank = match_rank / 2;
            let teams: Vec<&Vec<i8>> = pairs.iter().filter(|pair| ((pair[0] + pair[1]) as i32) == team_rank).collect();
            current.push(teams);
        }
        results.push(current);
    }
    results
}

fn compute_matches_a<'a>(teams : &Vec<Vec<Vec<&'a Vec<i8>>>>) -> Vec<Vec<Vec<Vec<&'a Vec<i8>>>>> {
    let mut to_return : Vec<Vec<Vec<Vec<&Vec<i8>>>>> = Vec::with_capacity(teams.len());
    for matches in teams {
        let transitive_a = {
            let mut temp : Vec<Vec<Vec<&Vec<i8>>>> = Vec::with_capacity(matches.len());
            for teams in matches {
                let count = teams.len();
                let mut results: Vec<Vec<&Vec<i8>>> = Vec::with_capacity(count * count / 2 - count / 2);
                for outer in 0..count {
                  for inner in outer + 1..count {
                      results.push(vec![teams[outer], teams[inner]]);
                  }
                }
                temp.push(results);
            }
            temp
        };
        to_return.push(transitive_a);
    }
    to_return
}
// NOTE: This is the problem function
fn compute_matches_b<'a>(matches_a: &'a Vec<Vec<Vec<Vec<&'a Vec<i8>>>>>, players: i8) -> Vec<HashSet<Vec<&'a Vec<&'a Vec<i8>>>>> {

    let mut count = matches_a.len();

    let mut rank_tally = vec![false; players as usize];
    let mut to_return : Vec<HashSet<Vec<&Vec<&Vec<i8>>>>> = Vec::with_capacity(matches_a.len());

    // Finds the cartesian product
    for matches in matches_a {
        let length = matches.len();
        let solutions = {
            let mut solutions = 1;
            for array in matches {
                solutions *= array.len();
            }
            solutions
        };
        let mut results = HashSet::with_capacity(solutions);
        let mut current_solution = 0;
        while current_solution < solutions {
            let mut j = 1;
            let mut array = 0;
            let mut solution: Vec<&Vec<&Vec<i8>>> = Vec::with_capacity(length);
            while array < length {
                let length = matches[array].len();
                solution.push(&matches[array][(current_solution / j) % length]);
                j *= length;
                array += 1;
            }

            // Only add to results if all ranks are present
            if has_all_ranks(&solution, &mut rank_tally) {
                // Sort so duplicates can be distinguished by set
                solution.sort_by(|left, right| {
                    let ordering = left[0][0] - right[0][0];
                    if ordering < 0 {
                        std::cmp::Ordering::Less
                    }
                    else if ordering > 0 {
                        std::cmp::Ordering::Greater
                    }
                    else {
                        std::cmp::Ordering::Equal
                    }
                });
                results.insert(solution);
            }
            current_solution += 1;
        }

        if !results.is_empty() {
            to_return.push(results);
        }

        count = count - 1;
        println!("{}", count);

    }
    to_return
}

fn main() {
    let players = 16;
    let ranks = (1..players + 1).collect::<Vec<i8>>();
    let pairs = compute_pairs(&ranks);
    let evens = compute_evens(&ranks);
    let teams = compute_teams(&pairs, &evens);
    let matches_a = compute_matches_a(&teams);
    let matches_b = compute_matches_b(&matches_a, players);
}

I am using nested vectors as the structure helps me reason about the algorithm during development. I know I could probably vastly improve performance by dealing with a flat vector, but doing that may be beyond my abilities (I am not a CS kind of developer). I am also just generally curious about what is going on.

Any thoughts will greatly appreciated. Thanks!


#2

Oh, dear… vectors of vectors of vectors of vectors of vectors of vectors…

You can get away with things like this in Dart since it’s garbage collected. GC’d languages have little-to-no concept of ownership and allow shared mutable aliases to be passed around willy-nilly, so you almost never have to explicitly clone them unless you want two independent sets of the same data. However, every time you create a new vector in Rust, it has to use a new, unique heap allocation, and vectors in Rust allocate space in powers of 2.

On top of that, I’m seeing a lot of wild exponential growth in this program: deeply nested loops performing allocations in their innermost bodies. If you did the math and added up the heap usage for every new vector, you’d probably see where all your memory’s disappearing off to.


#3

Hmm… I was under the impression I wasn’t doing much cloning as each stage borrows data from the previous using references. If I remove the references and changed it to cloning, it goes about five times slower for 16 players.

My understanding of your reply is that creating new vectors generally – not cloning – is more costly in Rust than GC languages. I guess I have to either use a flat vector or C/C++.

And yes, it is definitely an exponential algorithm. I am very well aware. :smile:


#4

I guess I have to either use a flat vector or C/C++.

Actually, I’m interested in seeing how you’d implement it in C++ because that program would probably be closer to what you’d want to create in Rust than one created in Dart, simply because the memory management models are closer.

Even vectors of references take heap allocations because you have to store the pointers somewhere. Cloning will, obviously, only make the problem worse.

You may not necessarily need to create a flat vector, but I’m betting you can improve the situation quite a bit with some careful consideration.


#5

I don’t have that much insight into what’s causing the performance issues, but here are a few other tips for the code:

If you know how many elements there are in a vec, it’s generally better to replace it with a fixed-sized array (if it’s a fairly small number). You can replace a few of the instances of Vec<X> with [X; 2] for a fixed-sized array of 2 values.

For instance, if there’s an Vec<i8> which only ever has two values, and is created by vec![..., ...], replace it with [i8; 2] (created with just [..., ...]). A fixed-size array doesn’t allocate any memory on the heap, and thus should at least be a bit of an increase in performance (negligible compared to how much the program is actually going, but a slight increase none-the-less).

One other note about style: where ever you use &Vec<T>, it’s generally preferred to use &[T] instead. The main difference being that a slice can represent part of a vec (instead of all of it), and although you don’t need that in this case, it is the generally accepted style choice. Vec dereferences to [T], so the only things you would need to change are function signatures and type declarations.


#6

Thanks for the tips.

My main reason for using a vector is that they are supported by rustc_serialize for JSON encoding. Apparently, I can convert arrays to slices for automatic JSON encoding. I could also implement my own encoding for arrays.