I have code in my metaverse viewer where, typically, 18 threads (1.5x the number of CPUs) are fetching and decoding content from the network. To prevent this from soaking up all available CPU time when they all hit the decode stage at once, they go through a lock which allows no more than 6 threads (half the number of CPUs) into the compute-heavy section at the same time.
The lock code is here. It's just a classic Dijkstra P and V, built on top of parking_lot's condvar.
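For readers who don't want to chase the link, the general shape of a P/V gate on parking_lot looks roughly like this (a minimal sketch, not the actual linked code, and without the fairness handling the real code needs):

use parking_lot::{Condvar, Mutex};

// Classic counting semaphore (Dijkstra's P and V); sketch only.
struct Semaphore {
    free: Mutex<usize>, // free slots in the compute-heavy section
    cond: Condvar,
}

impl Semaphore {
    fn new(slots: usize) -> Self {
        Semaphore { free: Mutex::new(slots), cond: Condvar::new() }
    }

    // P: block until a slot is free, then claim it.
    fn acquire(&self) {
        let mut free = self.free.lock();
        while *free == 0 {
            self.cond.wait(&mut free);
        }
        *free -= 1;
    }

    // V: return the slot and wake one waiter.
    fn release(&self) {
        *self.free.lock() += 1;
        self.cond.notify_one();
    }
}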
This works great on Linux. It works on Windows. But running the Windows executable under Wine hits some strange bug where the Wine Windows libraries sit in a spinlock instead of blocking. I end up with all 12 CPUs mostly running spinlocks while performance drops to a crawl. After a minute or two, the asset fetcher finally does get the work done, and performance returns to normal.
This appears to be some obscure Wine/MinGW bug. I've been able to make the problem happen under "winedbg", and have backtraces showing all those threads stuck in a spinlock way down in a Windows DLL. (Link to Wine forums won't work until a moderator gets around to approving the topic, so that link is not live yet.)
So I'd like to try another approach. I need a fair condvar, or a P and V, other than the one in parking_lot, which is what I'm using now. Something less clever and more reliable than parking_lot. Fairness is needed because otherwise some requests might be stalled for a minute or so, stuck behind newer requests. So std::sync::Condvar is out. Performance isn't a big issue; this gets hit maybe 40 times per second.
Using higher-level primitives, you could write something like this:
(Just a draft; untested)
use std::collections::VecDeque;
use std::sync::mpsc::{sync_channel, SyncSender};
use std::sync::Mutex;

/// At most MAX callers may be inside `run` at once; waiters are woken in FIFO order.
struct CriticalSection<const MAX: usize>(Mutex<(usize, VecDeque<SyncSender<()>>)>);

impl<const MAX: usize> CriticalSection<MAX> {
    fn run<T>(&self, f: impl FnOnce() -> T) -> T {
        let (send, recv) = sync_channel(1);
        {
            let mut lock = self.0.lock().unwrap();
            if lock.0 < MAX {
                // Space available; run immediately
                lock.0 += 1;
                send.send(()).unwrap();
            } else {
                // Get in the queue
                lock.1.push_back(send);
            }
        }
        recv.recv().unwrap();
        let result = f();
        // NB: Better to do this in a drop guard, in case f panics
        let mut lock = self.0.lock().unwrap();
        if let Some(send) = lock.1.pop_front() {
            // Start the next thread in the queue (it inherits our slot)
            send.send(()).unwrap();
        } else {
            // We're done and haven't given our execution slot to another thread
            lock.0 -= 1;
        }
        result
    }
}
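For what it's worth, a call site might look something like this (the 18 workers and the limit of 6 are just the numbers from the post above; the inner closure stands in for the compute-heavy decode):

use std::sync::Arc;

fn main() {
    // Hypothetical usage: 18 worker threads, at most 6 inside the gated section at once.
    let gate = Arc::new(CriticalSection::<6>(Mutex::new((0, VecDeque::new()))));
    let handles: Vec<_> = (0..18)
        .map(|i| {
            let gate = Arc::clone(&gate);
            std::thread::spawn(move || gate.run(|| i * 2 /* placeholder for decoding */))
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}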
Thanks. It's beginning to look like the spinlock isn't in that code. It may be in memory deallocation. Here's a backtrace of one of the stuck threads, from "winedbg".
Wine-dbg>bt
Backtrace:
=>0 0x0000017000eba4 in ntdll (+0xeba4) (0x000000003400e0)
  1 0x00000170063994 _InterlockedCompareExchange(addr=00000000003400E0, cmp=0000000170082D68, size=0x4, timeout=0000000006EDE638) [Z:\usr\src\packages\BUILD\include\winnt.h:6630] in ntdll (0x000000003400e0)
  2 0x00000170063994 spin_lock(addr=<register RBP not accessible in this frame>, cmp=<register R13 not accessible in this frame>, size=<register RSI not accessible in this frame>, timeout=<register R12 not accessible in this frame>) [Z:\usr\src\packages\BUILD\dlls\ntdll\sync.c:937] in ntdll (0x000000003400e0)
  3 0x00000170063994 RtlWaitOnAddress+0x164(addr=<register RBP not accessible in this frame>, cmp=<register R13 not accessible in this frame>, size=<register RSI not accessible in this frame>, timeout=<register R12 not accessible in this frame>) [Z:\usr\src\packages\BUILD\dlls\ntdll\sync.c:937] in ntdll (0x000000003400e0)
  4 0x00000170063c53 wait_semaphore+0x43(timeout=<internal error>, crit=<internal error>) [Z:\usr\src\packages\BUILD\dlls\ntdll\sync.c:197] in ntdll (0x00000006ede638)
  5 0x00000170063c53 RtlpWaitForCriticalSection+0xa3(crit=<register RBX not accessible in this frame>) [Z:\usr\src\packages\BUILD\dlls\ntdll\sync.c:303] in ntdll (0x00000006ede638)
  6 0x00000170064761 RtlEnterCriticalSection+0x91(crit=<register RBX not accessible in this frame>) [Z:\usr\src\packages\BUILD\dlls\ntdll\sync.c:412] in ntdll (0x000000183301b0)
  7 0x0000017002bf84 heap_lock+0x15(flags=<internal error>, heap=<internal error>) [Z:\usr\src\packages\BUILD\dlls\ntdll\heap.c:506] in ntdll (0x000000183301b0)
  8 0x0000017002bf84 RtlFreeHeap+0x40c(handle=<register RDI not accessible in this frame>, flags=<register R12 not accessible in this frame>, ptr=<register RBX not accessible in this frame>) [Z:\usr\src\packages\BUILD\dlls\ntdll\heap.c:512] in ntdll (0x000000183301b0)
  9 0x0000017002d763 RtlFreeHeap+0xe(ptr=<internal error>, flags=<internal error>, handle=<internal error>) [Z:\usr\src\packages\BUILD\dlls\ntdll\heap.c:1573] in ntdll (0x000000183301b0)
  10 0x0000017002d763 RtlReAllocateHeap+0x423(handle=<register RDI not accessible in this frame>, flags=<register R13 not accessible in this frame>, ptr=<register RBX not accessible in this frame>, size=<register RSI not accessible in this frame>) [Z:\usr\src\packages\BUILD\dlls\ntdll\heap.c:1700] in ntdll (0x000000183301b0)
  11 0x0000014048c115 in sharpview (+0x48c115) (0x00000000000045)
  12 0x0000014048d721 in sharpview (+0x48d721) (0x00000000000045)
So it's stuck in a spinlock in [Z:\usr\src\packages\BUILD\dlls\ntdll\sync.c:937] in ntdll, which is something provided by Wine.
This is tough to debug. Winedbg doesn't seem to be able to decode Rust-generated debug info in the executable. I've asked for help over in Wine land; someone there may be able to sort it out. Wine has a huge number of options for dealing with uncooperative programs, because it's mostly used to port games where the source isn't available.
(I'm building a program for multiple platforms, with all the development on Linux. This works surprisingly well, until something obscure like this comes along and you need all the heavy-duty debug tools.)
My Rust code does a lot of "push" operations on vectors, from multiple threads. This results in many memory allocations. On Linux, this seems to be fine. On real Windows, I'm not getting bug complaints. But under Wine, the locking around the memory allocator causes problems.
Wine has its own memory allocator inside its substitute versions of the Windows DLLs. It uses spinlocks for short waits and bounded spinlocks (with a maximum spin count) for longer ones, falling back to kernel locks after that, except that, for fairness, some list manipulation is done inside a spinlock.
All this is optimized for the fast, uncontended case, not so much for heavy contention. Under overload it has pathologically bad performance.
If you have a lot of threads growing vectors via .push, there's contention for the locks around memory allocation. The end result is 100% CPU utilization on 12 CPUs, most of that going into spinlocks. The program isn't hung, just wasting over 99% of the CPU time spinning.
So it's not condvar at the Rust level that's the problem. It's a contention problem at internal locks way down in the Wine library that emulates Windows.
I may do some pre-allocation around some of the places that do .push on vectors to work around this.
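For example, the change is just reserving once up front instead of letting .push grow the buffer repeatedly (a sketch with made-up data; the real hot spots are wherever the decoder grows vectors):

// Each reallocation during push goes through the allocator (and, under Wine, its lock).
fn doubled_naive(input: &[u32]) -> Vec<u32> {
    let mut out = Vec::new();
    for &x in input {
        out.push(x * 2); // may reallocate several times as the Vec grows
    }
    out
}

// One allocation up front; the pushes then never touch the allocator again.
fn doubled_prealloc(input: &[u32]) -> Vec<u32> {
    let mut out = Vec::with_capacity(input.len());
    for &x in input {
        out.push(x * 2);
    }
    out
}

// For a Vec that already exists, Vec::reserve(n) does the same thing before a burst of pushes.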
If you're already doing context switches to coordinate threads, wouldn't it be better to separate IO and compute into distinct thread pools, so the compute pool can be limited to half the CPUs?
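Something along these lines, assuming rayon for the compute pool (the decode call is a placeholder):

use std::thread;

// Sketch: a compute pool capped at half the CPUs, built with the rayon crate.
fn build_compute_pool() -> rayon::ThreadPool {
    let cpus = thread::available_parallelism().map(|n| n.get()).unwrap_or(2);
    rayon::ThreadPoolBuilder::new()
        .num_threads((cpus / 2).max(1))
        .build()
        .expect("failed to build compute pool")
}

// A fetch thread then hands off the CPU-heavy part and blocks until it has run:
// let decoded = compute_pool.install(|| decode(bytes));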
Replacing the global allocator so it doesn't go through the system (Wine's) malloc may help too.
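For instance, swapping in the mimalloc crate (one option; the jemalloc-style crates are wired up the same way) takes only a couple of lines:

// In main.rs: route Rust's allocations through mimalloc instead of the system heap
// (which under Wine is the ntdll heap shown in the backtrace above).
use mimalloc::MiMalloc;

#[global_allocator]
static GLOBAL: MiMalloc = MiMalloc;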