BFS is encountering OOM in Rust but not Python

So, the goal of my script is to identify all of the nodes that lie on any path between any source-sink pair. The issue is that my graph can be pretty wide. I prototyped a solution in Python first, got that working, and translated it over to Rust (as a sanity check). Unsurprisingly, the Python code is slow but works just fine, while (very surprisingly) the Rust version actually runs out of memory.

This is my current version; it still OOMs, just much more slowly after adding an Arc (very much a skill issue). For completeness, I've listed my original Rust and Python code below.

fn find_paths_with_progress(
    start: CMatIdx,
    end: CMatIdx,
    adjacency: &HashMap<CMatIdx, HashSet<CMatIdx>>,
    nodes_on_path: &mut HashSet<CMatIdx>,
    max_depth: usize,
) {
    let mut queue = VecDeque::new();
    
    let start_visited = Arc::new(HashSet::new());
    let start_path = Arc::new(Vec::new());
    queue.push_back((start, start_path, start_visited));

    while let Some((current, path_arc, visited_arc)) = queue.pop_front() {
        // Clone and update both visited and path ONCE at the top
        let mut new_visited = visited_arc.as_ref().clone();
        new_visited.insert(current);
        let new_visited_arc = Arc::new(new_visited);
        
        let mut new_path = path_arc.as_ref().clone();
        new_path.push(current);
        let new_path_arc = Arc::new(new_path);

        if current == end {
            for node in new_path_arc.iter() {
                nodes_on_path.insert(*node);
            }
            continue;
        }

        if new_path_arc.len() >= max_depth {
            continue;
        }

        // All neighbors share the same Arc instances
        for &neighbor in adjacency.get(&current).unwrap_or(&HashSet::new()) {
            if !new_visited_arc.contains(&neighbor) {
                let shared_visited = Arc::clone(&new_visited_arc);
                let shared_path = Arc::clone(&new_path_arc);
                queue.push_back((neighbor, shared_path, shared_visited));
            }
        }
    }
}

For completeness:

use std::collections::VecDeque;
use std::collections::{HashMap, HashSet};
use tqdm::pbar;

pub(crate) type NeuronID = i64;
pub(crate) type CMatIdx = i64;

fn find_paths_with_progress(
    start: CMatIdx,
    end: CMatIdx,
    adjacency: &HashMap<CMatIdx, HashSet<CMatIdx>>,
    nodes_on_path: &mut HashSet<CMatIdx>,
    max_depth: usize,
) {
    let mut queue = VecDeque::new();

    let mut start_visited = HashSet::new();
    start_visited.insert(start);
    queue.push_back((start, vec![start], start_visited));

    while !queue.is_empty() {
        let (current, path, visited) = queue.pop_front().unwrap();

        if current == end {
            for node in path.iter() {
                nodes_on_path.insert(*node);
            }
            continue;
        }

        if path.len() >= max_depth {
            continue;
        }

        for neighbor in adjacency.get(&current).unwrap_or(&HashSet::new()) {
            if !visited.contains(neighbor) {
                let mut new_visited = visited.clone();
                new_visited.insert(*neighbor);
                let mut new_path = path.clone();
                new_path.push(*neighbor);
                queue.push_back((*neighbor, new_path, new_visited));
            }
        }
    }
}

What's wild to me is that the Python version naively stores all of the paths between the start and end before returning, but works just fine.

def find_paths_with_progress(
    start: int, end: int, adjacency: dict[int, set[int]], max_depth: int
) -> list[list[int]]:
    """Find all simple paths from start to end with depth limit."""
    paths = []
    queue = deque([(start, [start], {start})])

    while queue:
        current, path, visited = queue.popleft()

        if current == end:
            paths.append(path)
            continue

        if len(path) >= max_depth:
            continue

        for neighbor in adjacency.get(current, set()):
            if neighbor not in visited:
                new_visited = visited | {neighbor}
                queue.append((neighbor, path + [neighbor], new_visited))

    return paths

Note: I'm exposing the Rust implementation to Python via maturin and PyO3, so the function call is essentially a simple swap from:

find_paths_with_progress(start, end, adjacency, max_depth)

# Over to 

rust_pkg_name.find_paths_with_progress(start, end, adjacency, max_depth)

so all of the parameters and such are exactly the same.

It seems like you don't care about path order anymore, just the nodes along some path? The only thing you're doing with the Vecs is extending the nodes_on_path HashSet with them, but they have the same contents as the visited HashSets.

So just don't have the Vecs?

(Non-Arc version.)
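A rough sketch of what I mean, reusing your signature and the CMatIdx alias (untested, so take it as an illustration rather than a drop-in):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

type CMatIdx = i64;

fn find_paths_with_progress(
    start: CMatIdx,
    end: CMatIdx,
    adjacency: &HashMap<CMatIdx, HashSet<CMatIdx>>,
    nodes_on_path: &mut HashSet<CMatIdx>,
    max_depth: usize,
) {
    let mut queue = VecDeque::new();

    let mut start_visited = HashSet::new();
    start_visited.insert(start);
    queue.push_back((start, start_visited));

    while let Some((current, visited)) = queue.pop_front() {
        if current == end {
            // `visited` holds exactly the nodes of this path, so the Vec is redundant.
            nodes_on_path.extend(visited.iter().copied());
            continue;
        }

        // `visited.len()` equals the path length, so it doubles as the depth check.
        if visited.len() >= max_depth {
            continue;
        }

        if let Some(neighbors) = adjacency.get(&current) {
            for &neighbor in neighbors {
                if !visited.contains(&neighbor) {
                    let mut new_visited = visited.clone();
                    new_visited.insert(neighbor);
                    queue.push_back((neighbor, new_visited));
                }
            }
        }
    }
}
```

That halves the per-entry cloning (one HashSet instead of a HashSet plus a Vec) without changing the result set.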

You seem to not have attached the Python code being discussed.

That's a really good observation! That'll cut down on the memory consumption quite a bit, so thank you! Do you think the Arc is worth it? And am I using it correctly, with the goal of sharing the sets between all the neighbors so they don't each have their own copy?

I did (the second-to-last code block), but I'll paste it below for convenience:

def find_paths_with_progress(
    start: int, end: int, adjacency: dict[int, set[int]], max_depth: int
) -> list[list[int]]:
    """Find all simple paths from start to end with depth limit."""
    paths = []
    queue = deque([(start, [start], {start})])

    while queue:
        current, path, visited = queue.popleft()

        if current == end:
            paths.append(path)
            continue

        if len(path) >= max_depth:
            continue

        for neighbor in adjacency.get(current, set()):
            if neighbor not in visited:
                new_visited = visited | {neighbor}
                queue.append((neighbor, path + [neighbor], new_visited))

    return paths

Apparently it was useful to you. However, after some mulling, I think it may be possible to avoid the Arc but get similar savings by storing a different sort of data structure in your queue: a Vec of unvisited neighbors and one visited HashSet that they all share.

Here's an attempt at summarizing my thinking.

Key:

  • v: overhead ratio of Vec, probably 1..=2
  • h: overhead ratio of HashSet, bigger than Vec's but probably a low constant
  • a: allocation size of an Arc<HashSet<Idx>>, 64 I believe
  • s: size of an index (8, apparently)
  • k: number of unvisited neighbors
  Approach                  | Allocation count | Approx. allocation size
  k of (Idx, HashSet)       | k                | k * h * depth * s
  k of (Idx, Arc<HashSet>)  | k + 1            | k * a + h * (depth - 1) * s
  1 of (Vec<Idx>, HashSet)  | 2                | k * v * s + h * (depth - 1) * s

If the second row is an improvement, we can intuit that k * a must come in under roughly (k - 1) * h * depth * s, even with one more allocation / a bunch of tiny allocations.

And my intuition for why the third row is better than the second is that v * s < a probably, plus you get a lot fewer individual allocations.

(Of course this is all back-of-an-envelope reasoning. Implement and measure is the only real disciplined approach.)

Passing side-note: You could have used Rc<_> since you're not multi-threading here, but that doesn't change the memory footprint AFAIK.

Below is a sketch of the idea, but be warned that I did zero testing, so there may be logic errors etc.
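Something along these lines (the function name and exact shape are just placeholders; each queue entry is one batch of sibling nodes plus the single visited set they all share, so there's one Vec and at most one HashSet clone per expanded node rather than one clone per queued neighbor):

```rust
use std::collections::{HashMap, HashSet, VecDeque};

type CMatIdx = i64;

fn find_path_nodes_batched(
    start: CMatIdx,
    end: CMatIdx,
    adjacency: &HashMap<CMatIdx, HashSet<CMatIdx>>,
    nodes_on_path: &mut HashSet<CMatIdx>,
    max_depth: usize,
) {
    let empty = HashSet::new();
    // Each entry: a batch of not-yet-visited sibling nodes plus the ONE visited
    // set they all share (the path so far, excluding the batch nodes themselves).
    let mut queue: VecDeque<(Vec<CMatIdx>, HashSet<CMatIdx>)> = VecDeque::new();
    queue.push_back((vec![start], HashSet::new()));

    while let Some((batch, visited)) = queue.pop_front() {
        for &current in &batch {
            if current == end {
                nodes_on_path.extend(visited.iter().copied());
                nodes_on_path.insert(current);
                continue;
            }
            // Path length including `current`:
            if visited.len() + 1 >= max_depth {
                continue;
            }
            let unvisited: Vec<CMatIdx> = adjacency
                .get(&current)
                .unwrap_or(&empty)
                .iter()
                .copied()
                .filter(|&n| n != current && !visited.contains(&n))
                .collect();
            if !unvisited.is_empty() {
                // One clone per expanded node, shared by all of its children.
                let mut new_visited = visited.clone();
                new_visited.insert(current);
                queue.push_back((unvisited, new_visited));
            }
        }
    }
}
```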

Reference cycles are the archenemy of reference counters, while GCs are designed to handle cycles. You cannot simply naively translate code from a GC-managed language to an ownership-based language; at the bare minimum, you need to carefully use Weak to break cycles.

However, the BFS algorithm doesn't need Arc (or Rc) at all. In your Rust implementation, I don't understand why you used Arc in the first place, and it makes the code very distracting to read.

@quinedot already showed how to do it without Arc. On the other hand, you mentioned your graph can be quite wide; I wonder, did you try DFS and compare its memory consumption to BFS?

With BFS, you store the partial paths along with the next node in the queue. The length of this queue depends on the width of the graph, and some (maybe most) of these partial paths are discarded because they never reach the target node. This is not particularly memory efficient.

With DFS, you don't need to store the partial paths explicitly: the path is implicitly on the call stack and can be reconstructed during backtracking. Only the paths that reach the target node need to be constructed and put into the final result; no "partial" paths are allocated.

DFS is very straightforward to implement with recursion, but an iterative implementation isn't very hard either.
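For example, a recursive sketch (untested; the name dfs_paths and the exact signature are arbitrary) where backtracking undoes the mutation in place instead of cloning per branch:

```rust
use std::collections::{HashMap, HashSet};

type CMatIdx = i64;

fn dfs_paths(
    current: CMatIdx,
    end: CMatIdx,
    adjacency: &HashMap<CMatIdx, HashSet<CMatIdx>>,
    path: &mut Vec<CMatIdx>,
    on_path: &mut HashSet<CMatIdx>, // doubles as the visited set
    nodes_on_path: &mut HashSet<CMatIdx>,
    max_depth: usize,
) {
    path.push(current);
    on_path.insert(current);

    if current == end {
        nodes_on_path.extend(path.iter().copied());
    } else if path.len() < max_depth {
        if let Some(neighbors) = adjacency.get(&current) {
            for &n in neighbors {
                if !on_path.contains(&n) {
                    dfs_paths(n, end, adjacency, path, on_path, nodes_on_path, max_depth);
                }
            }
        }
    }

    // Backtrack: undo the in-place mutation instead of cloning per branch.
    on_path.remove(&current);
    path.pop();
}
```

Memory then scales with the current path depth rather than the frontier width, at the cost of recursion depth up to max_depth.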

Thanks for the feedback and great questions :slight_smile:

In your Rust implementation, I don't understand why you used Arc in the first place, and it makes the code very distracting to read.

I did it because I was doing a lot of .clone() and it was suggested to me by someone who (presumably) knows better than I do. I think it was in response to this snippet of mine:

for neighbor in adjacency.get(&current).unwrap_or(&HashSet::new()) {
    if !visited.contains(neighbor) {
        let mut new_visited = visited.clone();
        new_visited.insert(*neighbor);
        let mut new_path = path.clone();
        new_path.push(*neighbor);
        queue.push_back((*neighbor, new_path, new_visited));
    }
}

did you try DFS

I actually did try DFS after my Rust BFS crapped out, but it's obnoxiously slow, even with some optimizations. After that I revisited my BFS solution and came here to ask :person_shrugging:

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.