Here's a simple problem - find the limits of all the points in a point cloud, efficiently.
The playground above shows the obvious solution, but it is procedural and has mutable variables. Is there some iterator-based solution that isn't slower?
There's a standard iterator function for max or min value of an array, but here I want both max and min for each element of each point, and those standard functions don't help.
That's very elegant. I should look at the generated code for that.
It's one of those things which can be very fast if the compiler does all the possible optimizations, and really slow if it doesn't.
Some terminology that might help in searching the internet for algorithms:
- You're trying to find the axis-aligned minimum bounding box of the point cloud.
- A related, but much harder, problem is finding the vertices of the convex hull of the point cloud. Unlike the bounding box corners, these will all be members of the input set.
Out of interest:
This sounds like a 3D version of the Convex Hull problem. Is it?
For maximum efficiency you would have to use SoA for storing your points and probably work with SIMD intrinsics directly. For example, in this code generated assembly processes points one by one, while a more efficient approach would be to process 8 points in parallel (number of
f32s fitting into a single YMM register) and you would extract final min/max values from result registers only after all points have been processed.
I am not sure why compiler is unable to do it itself, changing
i32 results in roughly the same loop.
UPD: Here how SIMD-based code may look: Compiler Explorer. It's a bit simplified (e.g. it does not handle tail points) and untested, but should convey the main idea. Note that the main loop (see
.LBB0_10) in the generated assembly processes not 8, but 16 points per iteration. It allows CPU to use instruction-based parallelism, since
_mm256_max_ps have latency of 4 cycles, but throughput of 0.5, i.e. you can execute up to 8 of such instructions in parallel. This number depends on CPU, e.g. by using
-C target-cpu=native on godbolt the main loop will process 32 points per iteration.
This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.