Data Separation


#1

I have an API design problem. I have objects which can stand alone (physics bodies and collision shapes), but often are used together (physics bodies will likely have collision shapes attached to them).

I need to be able to get a users object’s position and then update it if needed. For individual objects this is easy. But for doing the broad phase of the collision detection system I need a list of all the objects and their positions, velocity, bounding box and a few other things.

In a perfect world, what would the ideal API for this look like?

My idea is possibly to have an array of the data all the systems need and provide a simple set of functions for the user to get the data they might need via a hash? Doing this would allow a single set of positions and bounding boxes and other stuff and the user can have whatever they want and only have to store a hash to get any of the data they might need.

Here is an example:

struct CustomObj {
    id: u32,             // this would be the hash
    texture: Texture2D,  // this is some user data
}

let mut obj = CustomObj::new();
obj.id = sim_engine.new_rigidbody();

sim_engine.set_collision_shape(obj.id, Shape::Rectangle(100, 50));

let position = sim_engine.get_position(obj.id);
let bounding_box = sim_engine.get_bbox(obj.id);
// do stuff with the position and bounding box

Obviously the above is somewhat simplified. I would probably provide functions to return mutable references to the user if needed/possible.

Any suggestions on this are much appreciated.


#2

First, let’s review the data layout a bit. Why are your precise motivations for avoiding the usual array-of-struct data layout, where you have a single RigidBody struct that contains positions, bounding boxes, etc, and the client can access the full struct in a single query?

There can be several ones (reducing alignment-induced memory waste, allowing efficient vectorization, improving cache locality when only a subset of an object is accessed…), and I am interested in hearing what yours are, because some of these goals may call for a slightly different API design. For example, efficient vectorization may require individual position coordinates to be stored in separate arrays, which your current API does not expose.

In your current interface, you separate the actions of creating a rigid body and setting its collision shape. What is the collision shape of a rigid body between the moment where new_rigidbody is called and the moment where set_collision_shape is called, and does it make sense? If not, you may want to redesign the interface so that people cannot forget to set the collision shape when creating a rigid body (e.g. by making the collision shape a parameter of new_rigidbody).

I would also be interested in hearing more about your hash/ids. Since the sim_engine basically generates them out of nowhere, I assume that they must come from some kind of counter or random number generator, buf if so, why are they called hashes? Do they reflect something about the underlying data structure (e.g. are objects with neighbouring IDs close to one another in memory)? Are they just used as keys into a HashMap-ish data structure? Do you think they should be hidden behind an opaque type so that you have freedom to change the data type later on if needed, and if not, why?


#3

The motivation for separation of data is that both the collision and physics crates should not have copies of the shared data and instead share a single copy of the position, bounding box, etc… I certainly will have methods for retrieving the full structure, but often only position & orientation are needed for most things (rendering, sound, etc…).

I have given thought to attempting a Structure of Arrays approach as that may lead to significant improvement in cache locality. SOA can also greatly increase SIMD performance, but only if designed correctly. Most time in a collision detection engine is spent checking bounding boxes and then testing shapes and collecting contact info. Most time in a physics simulation is spent integrating forces, resolving contacts and joints, and updating positions.

The current interface is just an example and will be improved greatly once I choose a certain path. Rust can back you into a corner if you design your system like you would in an OOP language. I really think the builder style of creation would work well in this API, I’ll reply tonight with an example of how that might look.

RigidBody doesn’t have to have a Shape and Shape doesn’t have to have a RigidBody.

The ID/Hash system is just an idea. If I go with SOA approach, an object’s ID would just be its index into the SOA. I was thinking of using a hash as an index into HashMap, but I feel this is way to complex for what I’m trying to accomplish.

Hiding the underlying structures behind an API allows easy reorganization for better performance and SIMD design without the user having to change their design.

I really appreciate the feedback. Most API design seems like an afterthought and I really want something that feels like it was designed for Rust, but isn’t overly complicated. nAlgebra, nPhysics, and nCollision are great crates, but they seem a little busy and just to advanced for a simple 2D game. I like many of the design choices and hate others. I get huge errors about types because they are layered so deep. Those crates are industrial simulation grade and I’m looking to create something more simple and not so configurable.


#4

Thanks for the clarifications! Some aspects of your use case, such as the fact that the collision detection and physics code should be in separate crates, definitely escaped my attention.

If you are eyeing the SoA approach for cache locality, then do keep in mind that on its own, it will “only” help with iteration over individual objects. To speed up tasks which involve interactions between neighboring objects, such as collision detection or computation of physical interactions between objects, you may additionally want to arrange your data in memory so that objects which are close to each other in the simulated 2D or 3D space are also close to each other in RAM. Which is itself made more complicated by the fact that when objects are allowed to move in the simulated space, they must move in memory as well to keep locality high, creating a tradeoff between keeping RAM bandwidth low and cache locality high…

If you decide to go in that direction, then you will need to arrange for the fact that objects are moving in memory, which means that there must be an indirection between user identifiers and memory addresses, and that the optimal object iteration order changes over time (as when the two considerations mismatch, objects should be iterated in memory order, not the order of user identifiers). To account for this, you would need to make sure that the user does not own the loop that iterates between physical objects, but that the API instead provides high-level methods that iterate over objects in an unspecified order. Which is anyway something that you may also want when introducing parallel computations.

It’s not easy, and it may be overkill for your purpose, but it’s doable, and some raytracers and physics simulation use this kind of technique to get the last drops of performance out of a machine. So it is one possible input for your interface design: are you ready to go this far for better cache locality? If so, your API design will need to plan for it, by e.g. encouraging the user away from explicit loops over the object identifiers in favor of library-controlled iteration.


Do the two usually go together (say, more than 90% of the time), or do you think that you will frequently end up in situations where you have one, but not the other?

In the latter case, your current API design direction looks good, although as you mention, a Builder-style API would work as well (and possibly allow for more optimizations under the hood, and greater robustness in the face of failure during object creation).

In the former case, you can improve ergonomics by requesting that users explicitly specify when a RigidBody does NOT have a Shape, which guarantees that they won’t forget to specify the Shape when there is one. The idea is to pass an Option to the RigidBody constructor, like this:

// Has a shape, usual case
sim_engine.new_rigidbody(Some(Shape::Rectangle(100, 50)));

// Does not have a shape, edge case
sim_engine.new_rigidbody(None);

This obviously only works if the user rarely needs to specify a None, i.e. the “body without shape” case is rare. Otherwise, it would get annoying quickly.


Thanks for clarifying that this is just about giving the user a unique object identifier. The “hash” naming confused me a bit, and raised ultimately irrelevant concerns of collision avoidance and deterministic generation, so if I’m not alone, you may want to keep this nomenclature away from the final API and documentation. The “ID/Identifier” naming that you already used in most of your example code is better.


This is true, however there is a trade-off between building the underlying data structures for SIMD performance and keeping things simple in the user API.

Scalability to wide SIMD units may require storing vector coordinates in separate arrays, especially when they play different roles in the computation (as in cross products). Whereas user friendliness requires accessing vectors as a whole, rather than coordinate by coordinate.

You can get both by providing the user with a proxy object that looks like a position/speed vector, but actually indexes into multiple coordinate arrays under the hood, however that’s a fairly invasive API design that must be planned for early on.


#5

You can use soa-derive to experiment with SoA/AoS differences. You still need to change parts of the code when switching, but it already helps with cutting down the repetitive code. (full disclosure, I wrote this crate ^^)

Could you expand on this? What would be the difference between

struct SoA1 {
    positions: Vec<[f64; 3]>,
    velocities: Vec<[f64; 3]>,
}

struct SoA2 {
    x: Vec<f64>,
    y: Vec<f64>,
    z: Vec<f64>,
    vx: Vec<f64>,
    vy: Vec<f64>,
    vz: Vec<f64>,
}

#6

First, most SIMD implementations are quite picky about alignment, and will only perform at their maximum speed if vector data is loaded contiguously from a RAM address that is a power of 2. A practical consequence is that 3D vectors must be expanded into 4D vectors even if you don’t need them, leading to a waste of memory bandwidth and vector lanes.

For example, here’s how you would load two 3D position vectors into a vector register of size 8:

| X1 | Y1 | Z1 | ## | X2 | Y2 | Z2 | ## |     (## is padding)

With one array per coordinate, you can fill 3 vector registers perfectly instead:

| X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |

| Y1 | Y2 | Y3 | Y4 | Y5 | Y6 | Y7 | Y8 |

| Z1 | Z2 | Z3 | Z4 | Z5 | Z6 | Z7 | Z8 |

This first concern is specific to 3D vectors, 2D and 4D vectors will be fine.


Then, when it comes to actually doing stuff with the vector, you will find that most “scalar algorithms” will transparently translate into a vectorized version when using array-per-coordinate storage, whereas the array-of-vector storage will often require a specific, and sometimes less performant algorithm.

For example, let’s assume that you want to compute the norm of 3D vectors: n = sqrt(x² + y² + z²).

With array-per-coordinate storage, the implementation is very simple:

Load X in R1:   | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
Square R1:      | X1² | X2² | X3² | X4² | X5² | X6² | X7² | X8² |
Load Y in R2:   | Y1 | Y2 | Y3 | Y4 | Y5 | Y6 | Y7 | Y8 |
Square R2:      | Y1² | Y2² | Y3² | Y4² | Y5² | Y6² | Y7² | Y8² |
Add R2 to R1:   | X1² + Y1² | X2² + Y2² | X3² + Y3² | X4² + Y4² | X5² + Y5² | X6² + Y6² | X7² + Y7² | X8² + Y8² |
Load Z in R2:   | Z1 | Z2 | Z3 | Z4 | Z5 | Z6 | Z7 | Z8 |
Square R2:      | Z1² | Z2² | Z3² | Z4² | Z5² | Z6² | Z7² | Z8² |
Add R2 to R1:   | N1² | N2² | N3² | N4² | N5² | N6² | N7² | N8² |
Square root R1: | N1 | N2 | N3 | N4 | N5 | N6 | N7 | N8 |

That’s pretty much the algorithm that you’d have written for the scalar algorithm in assembly. It takes 3 vector loads, 5 arithmetic operations and 1 square root to do the computation on 8 three-dimensional vectors, and it’s likely to be optimal on pretty much every architecture (save for possible use of FMA, when it’s available and its different precision characteristics do not matter). Now, let’s see what kind of algorithm we would need when using the array-of-vector layout:

Load vectors:          | X1 | Y1 | Z1 | ## | X2 | Y2 | Z2 | ## | 
Set padding to 0:      | X1 | Y1 | Z1 | 0 | X2 | Y2 | Z2 | 0 | 
Square:                | X1² | Y1² | Z1² | 0 | X2² | Y2² | Z2² | 0 | 
Add pairwise and pack: | X1² + Y1² | Z1² | X2² + Y2² | Z2² | $$ | $$ | $$ | $$ |    ($$ = irrelevant)
Add pairwise and pack: | N1² | N2² | $$ | $$ | $$ | $$ | $$ | $$ |
Square root:           | N1 | N2 | $$ | $$ | $$ | $$ | $$ | $$ |

Here, we used 1 vector load, 1 masked operation, 1 arithmetic operation, 2 special-purpose operations, and 1 square root, in order to compute the norm of two vectors. So for 8 vectors, we would need 4 vector loads, 4 masked operations, 4 arithmetic operations, 8 special-purpose operations, and 4 square roots, unless we would be ready to unroll the resulting loop and get clever with the pairwise adds and square roots (it’s quite intuitive that given enough inter-register data shuffling, we can get down to 1 single square root).

Moreover, the special-purpose operations may not be available on all SIMD architectures, and may have much lower throughput than basic arithmetic SIMD operations. For example, if I remember correctly, Intel’s SIMD instruction set does not actually have a “horizontal add and pack” operation for floats (although it has one for integers, because, well, who needs consistency?), so you would need to build it yourself using a combination of other primitives.

Now, I don’t have very high faith in my SIMD coding skills, and this may not be the best possible algorithm. For example, depending on the architecture, zeroing the vector then doing a masked load may be better than loading and then doing a masked assignment. But you can already see that this approach requires more CPU instructions, depends more on the characteristics of the underlying SIMD instruction set and implementation, and uses a much more situational approach. It is thus much less likely to be automatically generated from a scalar loop “for free” by an optimizing compiler.

If you want to have some fun, you may now ponder how you would implement a vector cross product in the array-of-vector approach. Whereas it is again as easy as a scalar implementation in the array-of-coordinate approach…


#7

Very nice! I will definitely look into using this! Looks like there is almost no performance loss for small stuff. I really like the simple way to iterate over multiple values.


#8

Thanks for the great explanation of SIMD!

My engine will focus on 2D physics, but I may work on 3D in the future. Honestly, I feel like if I hide all the data and keep a clean API for the user it will be much easier to change things around under the so called curtains and replace and test small parts of my code with SIMD implementations.

Currently I’m still very early in the design phase of this project and really want to get everything structured. I have worked on physics engines in the past, Farseer Physics 3.0 was mostly my effort and was a port of Box2D to XNA. Later I rewrote Jitter to work in 2D and have a few videos of my efforts.

Jitter2D Joints
Jitter2D Stress Test

One main goal of this project is to beat what I accomplished with Jitter2D by a factor of 2 at least. Rust is very fast and designing things with the cache in mind can greatly improve performance. Currently I have a Core i3 6100 with 1.5MB of L3 cache per core. If I was able to do just my broad phase with bounding boxes only, they are 16 bytes each, I could fit ~50,000 per core into my cache. Obviously I will need more info then that but breaking down problems into minimal chunks like that and optimizing for SIMD can really boost performance even on slow machines.


#9

Well I’ve done quite a bit of SOA vs AOS testing and I’m not seeing any performance gain. I guess most of the gain would come from optimizing certain routines with SIMD. I’m going to focus on version one just working and then focus on SIMD improvements.


#10

For this kind of question, I would suggest giving a chance to profilers based on hardware performance monitoring units, such as Linux perf or Intel VTune. They can tell you whether your program’s performance is cache-limited, which is the scenario in which I would expect SoA to be helpful in scalar code.


#11

I just found out about the faster crate and I’m definitely gonna give that a try. Messed around with it a bit last night and it wasn’t working, but I’ll try again tonight and see if I can get it working.

I have a fairly good separation layer going right now and hope to have a nice interface to the broad phase collision detection finished in a few days and start benchmarking some different techniques.