Here's a simple problem - find the limits of all the points in a point cloud, efficiently.
The playground above shows the obvious solution, but it is procedural and has mutable variables. Is there some iterator-based solution that isn't slower?
There's a standard iterator function for max or min value of an array, but here I want both max and min for each element of each point, and those standard functions don't help.
A related, but much harder, problem is finding the vertices of the convex hull of the point cloud. Unlike the bounding box corners, these will all be members of the input set.
For maximum efficiency you would have to use SoA for storing your points and probably work with SIMD intrinsics directly. For example, in this code generated assembly processes points one by one, while a more efficient approach would be to process 8 points in parallel (number of f32s fitting into a single YMM register) and you would extract final min/max values from result registers only after all points have been processed.
I am not sure why compiler is unable to do it itself, changing f32 to i32 results in roughly the same loop.
UPD: Here how SIMD-based code may look: Compiler Explorer. It's a bit simplified (e.g. it does not handle tail points) and untested, but should convey the main idea. Note that the main loop (see .LBB0_10) in the generated assembly processes not 8, but 16 points per iteration. It allows CPU to use instruction-based parallelism, since _mm256_min_ps and _mm256_max_ps have latency of 4 cycles, but throughput of 0.5, i.e. you can execute up to 8 of such instructions in parallel. This number depends on CPU, e.g. by using -C target-cpu=native on godbolt the main loop will process 32 points per iteration.