Parallelising game tree search

I am writing a perfect Connect 4 AI using a minimax game tree search with alpha-beta pruning and a transposition table. I would like to make it faster by parallelising the search, but I am not finding much success.

Originally a single-threaded search of one specific position took ~10s. Changing the transposition table to a lock-free, atomic-based Arc<Vec> of entries rather than an Rc<RefCell<Vec>> increased this to ~11s. So far my attempts to parallelise this have made it 6-7x slower (60-70s) and I'm not sure why.

My parallel version takes a given position and spawns a thread for each possible move to search one level down (max. 6 threads), then returns the best score found among them. Given that the transposition table is lock-free, I wouldn't suspect contention.
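Roughly, the shape of that root split looks like this (with a toy take-away game standing in for Connect 4, since the real types are long):

```rust
use std::thread;

// Toy game: from a count `n`, the player to move removes 1..=3 items;
// whoever takes the last item wins. Negamax score from the side to move:
// +1 for a win, -1 for a loss.
fn negamax(n: u32) -> i32 {
    if n == 0 {
        return -1; // the previous player took the last item: we lost
    }
    let mut best = -1;
    for take in 1..=n.min(3) {
        let score = -negamax(n - take);
        best = best.max(score);
    }
    best
}

// Root-level parallel split: evaluate each root move on its own scoped
// thread, then take the best score over the join handles.
fn par_negamax_root(n: u32) -> i32 {
    if n == 0 {
        return -1;
    }
    thread::scope(|s| {
        let handles: Vec<_> = (1..=n.min(3))
            .map(|take| s.spawn(move || -negamax(n - take)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .max()
            .unwrap()
    })
}

fn main() {
    // Multiples of 4 are losing positions in this toy game.
    println!("{} {}", par_negamax_root(4), par_negamax_root(5)); // -1 1
}
```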

I have also tried using rayon and async-std to put each search operation into a thread pool, but wasn't able to make it work due to the recursive nature of the minimax tree search.

Does anyone have any ideas why the parallel version would be so much slower?

What troubles did you run into? A quick look at the rayon docs suggests that join() could be used at each branching point to schedule the two subsearches on the thread pool.

join() didn't seem useful to me since each branching position can have anywhere from 1 to 6 children. The current implementation looks like:

fn minimax(board: Board, alpha: i32, beta: i32) -> i32 {
    // run some heuristics, check the transposition table, etc.
    for mv in board.possible_moves() {
        let new_board = board.play(mv);
        let score = -minimax(new_board, -beta, -alpha);
        // check the new score against alpha and beta for pruning,
        // and compare it to previous calculations
    }
    // return the actual score of the position
}

With rayon I used a scope() at the top level and spawn() calls in the search to try to add new nodes, but I couldn't figure out how to get the scores out of the spawn() calls properly, given that the current call would have to block on those values. I have no idea how rayon would perform with so many tasks blocked waiting for lower ones to finish. This seemed like an ideal use case for async, but converting it over gave a cryptic error about cyclic Futures that wouldn't let it compile.

If you have any suggestions on good ways to thread-pool a tree search I'm all ears!

For join, you could try something like this. Obviously it's a bit contrived as an example since it doesn't have your data types. I'm not sure if it will actually be more efficient, and I think there's a risk of overflowing the stack if your search is too deep, but you could give it a shot.

Playground Link

If neither that nor your rayon::scope version is faster than the single-threaded version, then another possibility is that in the process of parallelizing it, you're copying a lot of data.

You probably don't want async. Async is primarily valuable when you are waiting for things out of your control (i.e. networking). In this case, you have full control over your calculation, so async likely won't help in any special way other than as a mechanism for distributing calculations.


@Algorhythm would you mind sharing your code? This way we can make more accurate suggestions to optimize it :slightly_smiling_face:

I just cleaned up the code and put it in a git repo. Here is a link to the test that I get the 11s number from. The functions you will be interested in are Solver::negamax, Solver::solve and Solver::par_solve. You can change the test to use the parallel solve by replacing line 417 with let (calc, _) = solver.par_solve();

Can you add the test_data folder? It seems to be missing from the repo.

Ofc. Done.

First question: are you sure your par_solve is doing the same thing? I removed the threads to run it serially and it's taking much longer, which might explain the poor comparison to the original solve.

Since each thread is doing its own search, probably not, since they can't prune their trees based on other threads' results. I just found some good resources over on the chess programming wiki. I'll try out implementing one of their suggested parallel searches and see how that goes.

N.B. It looks like running the benchmark as a test makes it another ~10x slower (110s) than copying the same code into main, which is what I had been doing previously. Not something I expected.

Cargo has separate dev and test profiles, and you only set the optimization level for dev. The tests are being compiled and run unoptimized. You probably also want to disable concurrent test execution so that multiple tests aren’t competing for the same set of processors.
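For example (assuming the dev override looks like the first section below — I haven't checked the repo's Cargo.toml), mirroring the setting in the test profile would look like:

```toml
# Cargo.toml: each profile has its own settings, so an opt-level set
# only for dev leaves `cargo test` builds unoptimized.
[profile.dev]
opt-level = 3

[profile.test]
opt-level = 3
```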

For that, you can get it to run at the same speed with the following, which uses the release profile. --nocapture is needed to see the output:

cargo test --release begin_hard -- --nocapture

I poked around at it this morning. I’m not too familiar with search optimization like this. It seems tricky though because I think you would need the tasks to have a cancellation mechanism or you need to be more careful about which paths to search before getting into a recursive call so that you can avoid going down paths altogether. The naive approach of just running things in parallel only really works if you want to follow substantially all search paths.
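A minimal shape for that kind of cancellation (hypothetical, not tied to the repo's types) would be a shared flag that each task polls between nodes:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

// Toy recursive search that polls a shared cancellation flag at every node.
// Returns None if it was cancelled, Some(score) if it ran to completion.
fn search(depth: u32, cancelled: &AtomicBool) -> Option<i32> {
    if cancelled.load(Ordering::Relaxed) {
        return None; // abandoned: e.g. a sibling task found a cutoff
    }
    if depth == 0 {
        return Some(1); // stub leaf evaluation
    }
    let mut best = i32::MIN;
    for _child in 0..3 {
        match search(depth - 1, cancelled) {
            Some(score) => best = best.max(score),
            None => return None, // propagate cancellation upward
        }
    }
    Some(best)
}

fn main() {
    let cancelled = Arc::new(AtomicBool::new(false));

    // Uncancelled: the task runs to completion.
    let flag = Arc::clone(&cancelled);
    let handle = thread::spawn(move || search(6, &flag));
    assert_eq!(handle.join().unwrap(), Some(1));

    // Pretend another task found a beta cutoff and cancels its siblings:
    // a task started afterwards bails out immediately.
    cancelled.store(true, Ordering::Relaxed);
    let flag = Arc::clone(&cancelled);
    let handle = thread::spawn(move || search(6, &flag));
    assert_eq!(handle.join().unwrap(), None);

    println!("done");
}
```

In a real alpha-beta search you'd likely want one flag per subtree rather than a single global one, so a cutoff only cancels the sibling searches it actually invalidates.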

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.