My program runs slow on windows

I'm facing a performance issue on my laptop.
my othello(aka reversi) AI program takes around 2.6 seconds to solve a move on kde neon, while it takes 8.8 seconds on windows (both program runs on the same laptop).
I'm pretty sure that it wasn't the antivirus program or windows defender interfering my program, because my program start only once and have a loop input, I think it will be faster on the second input if it was the antivirus scanning the program, but it didn't.
I've tried both cargo build --release and rustc -C opt-level=3, but it performed as a same result.
This is my program https://github.com/lemon37564/othello-board/blob/main/builtinai/ai6.rs if needed.
The input will be "++++++++++++++OX++++XO++++++++++++++ 1"
I have to submit the executable since this is my project in university, can anyone help me?

It looks like your program spends a lot of time allocating memory (the call to ntdll.dll!RtlpAllocateHeapInternal near the top left of the flamegraph):

The system allocators on Windows and Linux have different performance characteristics and since your program runtime is dominated by memory allocation, it's likely the reason for the performance difference as well.

On a hunch, I guessed the issue was the call to Vec::with_capacity(32) on line 91 which is called from sorted_valid_nodes() on line 401. After changing this to simply Vec::new(), the program ran noticeably faster.

2 Likes

Thanks for replying!
I've just tried the Vec::new() and the speed has a notably difference!
But I'm little bit curious about the reason. Theoretically, I think it should be slower if my vector is appended dynamically than allocated the space at first.
Why would this happened?
BTW, may I ask what's the profiling tool in the picture? I liked the pprof in golang a lot, but I didn't found a similar tool on rust (I am new to rust).

You're welcome!

I don't have a good grasp of your program so these are just general guesses that may not be valid for this particular situation:

  • If nothing is ever actually pushed to the Vec, Vec::new() does not actually allocate any memory and so it will be much faster than than Vec::with_capacity().

  • Many allocators have optimizations for small sized allocations. 32 * sizeof<Node>() might be large enough that it by passes that fast path. If you only ever end up pushing a few items to the Vec in most instances, this could make a difference.

  • Depending on the distribution of data, it could be that you never need any more than the initial allocation Vec::new() gives you once an item is pushed into the Vec. In that case, there are no reallocations going on which is a situation where Vec::with_capacity() makes a lot of sense.

The tool is Windows Performance Analyzer. It's part of the Windows SDK and is free to use. It works pretty well with Rust code but you do need to make sure you always tell it to Load Symbols which will resolve function symbols from the PDBs (under the Trace menu) since Windows does not know how to demangle Rust symbols.

1 Like

It seems to be the reason in my opinion. I didn't thought of the condition that the content of vector will be a small size or even empty. I should modify the code to avoid problems like this.

Cool. I'll give it a try. Thanks a lot.

If the allocator dominated the performance, why don't you use faster allocator instead? You can replace the global allocator with the mimalloc. But never optimize without measuring, as there's no silver bullet which is guaranteed to make everything actually faster.

1 Like

Interesting, This is the first time I investigate the allocator that deep.
I think the details in allocator are too hard to me for now, but I'll try to simply use it and compare the performance.
But I think what I need to do now is to improve my awful algorithm, it should impact the performance the most.(lol)

If you have a lot of small vecs, consider using a stack-based vec variant. There are crates that provide this. That way you can avoid heap allocations.

2 Likes

I've search stack-based vec for a while, but it seems that they all need a constant capacity when declared.(It makes sense because basically it is an array?)
So I just tried to use a larger capacity when initialized, and it is 5-10% faster than the original one.
(might be a measurement error, idk)
But this is also very helpful, Thanks!

smallvec, for example, will store up to a certain amount of items on the stack before spilling to the heap.

Well, I'm using arrayvec right now but I also tested for smallvec. After some experimentation, I cannot found significant speed difference between these two.
I don't know how they implemented and which one will be better.(maybe they have a similar approach?)
For me this is a kind of dilemma.

Arrayvec has an upper limit on the array size and will never use heap allocations. Smallvec uses the stack up to the specified limit and then switches to a heap allocation once the limit is exceeded. Use arrayvec when there is a hard limit on the size and use smallvec when the size is unbounded but below a small amount most of the time.

That's a clear explanation, thanks!
Now I think smallvec will fit better in my case, I'll consider change to use that later.