What's the best way to optimize a **very** large number of small, disjoint computations without using multithreading?


Right now, I have a scenario where I have a very large number (on the order of 10,000) of computations to do; however, they’re all disjoint and all very small. Each computational “unit” is really just a table lookup (actually a tree lookup, because it’s multidimensional): if the input vector to the unit is [2, 3], it starts at the tree’s root, looks for the branch labeled 2, and “walks” there; from that node, it looks for the branch labeled 3 and “walks” there. I’m operating in a resource-constrained environment with 1 CPU and 1 core, so I don’t believe multithreading will actually boost speed (it’ll probably hurt it). Is there any way to optimize this without resorting to parallelization? High-level suggestions are perfectly fine.
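
To make the walk concrete, here is a minimal sketch of that lookup; `Node`, its fields, and the sample data are made-up names for illustration, not the actual implementation:

```rust
// Minimal sketch of the lookup described above; `Node` and its fields
// are made-up names, not the actual implementation.
struct Node {
    value: i32,
    children: Vec<Node>, // index = branch label
}

// Walk one key at a time: [2, 3] means "take branch 2, then branch 3".
fn lookup(root: &Node, keys: &[usize]) -> i32 {
    let mut node = root;
    for &k in keys {
        node = &node.children[k];
    }
    node.value
}

// Tiny 2-level example tree: 3 branches, each with 4 leaves.
fn sample_tree() -> Node {
    Node {
        value: 0,
        children: (0..3)
            .map(|i| Node {
                value: 0,
                children: (0..4)
                    .map(|j| Node { value: i * 4 + j, children: vec![] })
                    .collect(),
            })
            .collect(),
    }
}

fn main() {
    let root = sample_tree();
    assert_eq!(lookup(&root, &[2, 3]), 11);
}
```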



A few questions coming to mind:

  1. Is there a GPU in the system?
  2. How many times (or how often) must this set of computations be done?
  3. Is there a more efficient data model for your lookup? Like a serial data structure that makes it easier to solve from there (by using some slicing, perhaps?)


  1. GPUs are very efficient at parallel operations (even more so than a CPU) because they routinely run a small algorithm (kind of like your “unit”, but even more complex — think Perlin noise or lighting) on every pixel on the screen. At 1080p that’s about 2,073,600 pixels, making the GPU a parallelization beast.
  2. If your scenario is, for example, a website with WASM and the client’s PC for some reason has only 1 core, then it doesn’t really matter, because they will only click a button once, not 60 times a second (unless they have a macro… but why?). But if you are running it 60 times a second in a game or a real-time animation, then it really does matter.
  3. As is the case with binary trees and their cache misses, a 2D structure like this that isn’t an array can have a similar problem (from what I understand). You could turn it into a Vec<Vec<T>> (usually called a jagged array, from what I’ve heard) and make it more efficient if your lookups are “ordered” in some way (hence the slicing part).
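
The jagged-array version of point 3 might look like this sketch; `jagged_lookup` and the table contents are made up for illustration:

```rust
// Sketch of the jagged-array idea from point 3: each inner Vec is one
// contiguous "row". `jagged_lookup` and the data are made up.
fn jagged_lookup(table: &[Vec<i32>], keys: [usize; 2]) -> i32 {
    // The two-level tree walk becomes plain indexing.
    table[keys[0]][keys[1]]
}

fn main() {
    let table = vec![
        vec![10, 11, 12, 13],
        vec![20, 21], // rows may differ in length, hence "jagged"
        vec![30, 31, 32],
    ];
    assert_eq!(jagged_lookup(&table, [2, 1]), 31);
}
```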


No, there is no GPU. For the second question, each computational “cycle” involves computing 10,000 or so “units”, and the cycle may be run up to 40–50 times. As for the lookup part, I’m really just doing integer table lookup, so the key is an integer. Right now, I have an enum to represent the tree. Each enum keeps a Vec of Boxes that contain the same enum, so something like this: [Box<TheEnum>, Box<TheEnum>, Box<TheEnum>, Box<TheEnum>]. If the integer key is, let’s say, 2, it goes to the 3rd item of the Vec (0-indexed) and sets that as the new “root”, then fetches the next key and goes on like that.
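
For reference, a sketch of that boxed-enum representation: `TheEnum` is the name from the post, while `walk` and the sample data are invented here:

```rust
// Sketch of the structure described above: each node owns a Vec of boxed
// children, and an integer key indexes into it. `walk` is a made-up helper.
enum TheEnum {
    Branch(Vec<Box<TheEnum>>),
    Leaf(i32),
}

// Follow one key per level; None if a key is out of range or too deep.
fn walk(mut node: &TheEnum, keys: &[usize]) -> Option<i32> {
    for &k in keys {
        match node {
            // &**: go from &Box<TheEnum> back to &TheEnum.
            TheEnum::Branch(children) => node = &**children.get(k)?,
            TheEnum::Leaf(_) => return None,
        }
    }
    match node {
        TheEnum::Leaf(v) => Some(*v),
        TheEnum::Branch(_) => None,
    }
}

fn main() {
    use TheEnum::*;
    let tree = Branch(vec![
        Box::new(Leaf(0)),
        Box::new(Leaf(1)),
        Box::new(Branch(vec![Box::new(Leaf(42))])),
    ]);
    // Key 2 picks the 3rd child; key 0 picks its first child.
    assert_eq!(walk(&tree, &[2, 0]), Some(42));
}
```

Each `Box` is a separate heap allocation, which is where the locality problem discussed below comes from.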



Sounds like you could improve memory locality by avoiding boxing the enums. That should speed things up, unless the enum is huge, in which case it could result in the Vec allocating way too much memory.

Is the multidimensional data of fixed dimension? Are there bounds on the indices at each level of the dimension? Turning a tree into an array can be a much bigger win…
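
If both answers are yes, the whole tree collapses into one contiguous array with computed offsets, something like this sketch (the bounds `DIM0`/`DIM1` and the data are made up):

```rust
// If the dimension count is fixed and each index is bounded, the tree
// collapses into one contiguous array with computed offsets.
// The bounds DIM0/DIM1 here are made up.
const DIM0: usize = 3;
const DIM1: usize = 4;

// Row-major layout: one multiply-add instead of two pointer chases.
fn flat_lookup(table: &[i32], i: usize, j: usize) -> i32 {
    table[i * DIM1 + j]
}

fn main() {
    let table: Vec<i32> = (0..(DIM0 * DIM1) as i32).collect();
    // The [2, 3] lookup from the original example:
    assert_eq!(flat_lookup(&table, 2, 3), 11);
}
```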



Does the CPU you’re running on have any way to schedule multiple loads in flight (out-of-order cores do this automatically)?

If so, you could rewrite the code to have multiple table lookups happening simultaneously, akin to software pipelining. This might effectively let you run multiple table lookups in parallel if memory latency is the bottleneck. Even on an out-of-order core, it still helps a lot to make this structure explicit instead of hoping the CPU will reorder across individual table lookups.

If your system has a cache and prefetching, you could get the same sort of benefit just from prefetching.
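
One way to sketch the software-pipelining idea: advance several independent lookups one level per loop iteration, so their loads don't wait on each other. The level-by-level table layout (one index-to-index map per level) is invented for illustration:

```rust
// Sketch of the software-pipelining idea: advance 4 independent lookups
// one level per iteration so their loads can be in flight together.
// The table layout (one index -> index map per level) is made up.
fn lookup_batch4(table: &[Vec<usize>], starts: [usize; 4], depth: usize) -> [usize; 4] {
    let mut cur = starts;
    for level in 0..depth {
        // Four independent loads per level: no lookup waits on another,
        // so an out-of-order core can overlap the cache misses.
        for lane in 0..4 {
            cur[lane] = table[level][cur[lane]];
        }
    }
    cur
}

fn main() {
    // Two levels of made-up index mappings.
    let table = vec![vec![1, 2, 0], vec![0, 2, 1]];
    assert_eq!(lookup_batch4(&table, [0, 1, 2, 0], 2), [2, 1, 0, 2]);
}
```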



Yeah, it’s multidimensional data, with a fixed dimension.



Can you schedule the work more like a visitor pattern, so that nodes are found and calculated as you walk the tree, to cut down on repeated walks?

Is there some element of the per-node calculation (derived from the coordinates, for instance) that can be precomputed and reused, say along the same row?

Just out of curiosity: if the access has to be random, and especially if the tree is sparse or uneven (though it sounds like it isn’t), try a HashMap keyed on the full coordinate (with a cheap hash algorithm). You might be surprised.
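
A sketch of that HashMap variant, keying on the full coordinate tuple; the entries are made up, and `build_table` is a hypothetical helper:

```rust
use std::collections::HashMap;

// Sketch of the HashMap variant: key on the full coordinate tuple so one
// hash replaces the whole walk. The entries here are made up.
fn build_table() -> HashMap<(u8, u8), i32> {
    let mut table = HashMap::new();
    table.insert((0, 0), 7);
    table.insert((2, 3), 42);
    table
}

fn main() {
    let table = build_table();
    // One probe instead of one pointer chase per dimension.
    assert_eq!(table.get(&(2, 3)), Some(&42));
    // std's default hasher (SipHash) is DoS-resistant but not the fastest;
    // swapping in a cheaper one (e.g. FxHashMap from the rustc-hash crate)
    // is the "cheap hash algorithm" part of the suggestion.
}
```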



Yeah, I switched from a “linked tree” to a “flattened array” and shaved 2,674 nanoseconds off each call, so 1.337 seconds less lmao (this function is called 10,000 * 50 times in early use; eventually, it might be called up to 70,000 * 400 = 28,000,000 times in one run!, so that would save about 74 seconds per run). Nice.

Yeah, given the massive number of computations, MAYBE using a platform with only 1 thread isn’t a good idea lmao. It’s a computer vision task, FYI.