Hi,
I have a highly parallel problem running on a two-socket system that exposes 4 NUMA nodes to the OS (Windows). It seems to be latency bound, so I’m looking to make it NUMA aware.
The memory used falls roughly into two parts: a working set for each thread and a large global result Vec.
The plan is roughly to pin each of the rayon threads to a node at process start. Each thread's working memory will then be allocated from within that thread, which should make Windows place it on the right NUMA node.
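A rough sketch of that plan, using plain `std::thread` instead of rayon so it stays self-contained. `pin_current_thread_to_node` is a hypothetical placeholder (a no-op here), and `run_workers` and the 4096-byte touch stride are made up for illustration. With rayon, the pinning would presumably go in the pool's `start_handler`. Whether Windows really places the pages on the pinned thread's node is the assumption the plan rests on, not something this code verifies.

```rust
use std::thread;

/// Hypothetical placeholder: real code would ask the OS to pin the calling
/// thread to the given NUMA node. It is a no-op in this sketch.
fn pin_current_thread_to_node(_node: usize) {}

/// Spawns `workers` threads spread round-robin over `nodes` nodes. Each thread
/// allocates and touches its own working set after pinning itself.
/// Returns `(node, touched_pages)` per worker, in spawn order.
fn run_workers(workers: usize, nodes: usize, working_set_bytes: usize) -> Vec<(usize, usize)> {
    let handles: Vec<_> = (0..workers)
        .map(|i| {
            let node = i % nodes;
            thread::spawn(move || {
                pin_current_thread_to_node(node);
                // Allocate inside the worker, after pinning, and write one byte
                // per assumed 4 KiB page so this thread is the first to touch it.
                let mut working = vec![0u8; working_set_bytes];
                for byte in working.iter_mut().step_by(4096) {
                    *byte = 1;
                }
                let touched = working.iter().filter(|&&b| b == 1).count();
                (node, touched)
            })
        })
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

For example, `run_workers(8, 4, 8192)` assigns two workers to each of the four nodes, and every worker touches two pages of its own buffer.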
The problem I have is that each worker thread needs to allocate lots of relatively small objects and occasionally reset them.
So I found bumpalo, which looks perfect for this use case.
Now to my question: I would like to use huge pages (I can’t find any crate for that on Windows, so I will be using VirtualAlloc2 directly), but I can’t seem to find any way in bumpalo to provide the backing array myself. Is it possible to stop bumpalo from allocating the backing array from the standard allocator (which I assume will use normal pages)? Or is there some other smart solution I’m not thinking of, apart from forking bumpalo and modifying the allocation code myself?
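To show what "providing the backing array myself" would mean, here is a minimal sketch of a hand-rolled bump arena over a caller-supplied byte region. This is not bumpalo's API. `RegionArena` and its methods are hypothetical names, and the `&mut [u8]` is assumed to come from wherever the huge-page mapping lives (for example, a slice built over the VirtualAlloc2 result). Like bumpalo, it never runs destructors for the values it holds.

```rust
use std::alloc::Layout;
use std::cell::Cell;
use std::marker::PhantomData;

/// Bump arena over memory the caller provides (hypothetical, not bumpalo).
struct RegionArena<'a> {
    start: *mut u8,
    len: usize,
    used: Cell<usize>,
    // Keeps the backing region mutably borrowed for the arena's lifetime.
    _backing: PhantomData<&'a mut [u8]>,
}

impl<'a> RegionArena<'a> {
    fn new(backing: &'a mut [u8]) -> Self {
        Self {
            start: backing.as_mut_ptr(),
            len: backing.len(),
            used: Cell::new(0),
            _backing: PhantomData,
        }
    }

    /// Moves `value` into the region, or returns `None` when it does not fit.
    /// Destructors of allocated values are never run.
    fn alloc<T>(&self, value: T) -> Option<&mut T> {
        let layout = Layout::new::<T>();
        let base = self.start as usize;
        let cursor = base.checked_add(self.used.get())?;
        // Round the cursor up to the alignment of T (always a power of two).
        let aligned = cursor.checked_add(layout.align() - 1)? & !(layout.align() - 1);
        let offset = aligned - base;
        let end = offset.checked_add(layout.size())?;
        if end > self.len {
            return None;
        }
        self.used.set(end);
        // SAFETY: offset..end lies inside the backing region, is aligned for T,
        // and does not overlap any allocation handed out since the last reset.
        unsafe {
            let ptr = self.start.add(offset) as *mut T;
            ptr.write(value);
            Some(&mut *ptr)
        }
    }

    /// Forgets all allocations. Taking `&mut self` means the borrow checker
    /// rejects any reference from `alloc` that is still alive at this point.
    fn reset(&mut self) {
        self.used.set(0);
    }

    /// Bytes consumed so far, including alignment padding.
    fn used(&self) -> usize {
        self.used.get()
    }
}
```

Each worker would create one arena over its own node-local region, call `alloc` for the small objects, and call `reset` between phases. A plain `Vec<u8>` works as the backing for testing. It does not grow, so the region has to be sized up front.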
Any pointers appreciated.