Which of the two data structures is more efficient?

There is a perennial question about how to organize data in scientific computing, but I have never put it to a serious test to see which layout is more efficient. I would like to hear your opinions.

For example:

Computational fluid dynamics splits the domain into many small cells. In each cell we have 5 variables to store and compute: p for pressure, (u, v, w) for velocity, and t for temperature. The total number of cells ranges from millions to billions.

There are usually two ways to store the data:

1: store them separately:

let mut p: Vec<f64> = Vec::new();
let mut u: Vec<f64> = Vec::new();
let mut v: Vec<f64> = Vec::new();
let mut w: Vec<f64> = Vec::new();
let mut t: Vec<f64> = Vec::new();

2: store them together:

struct Cell {
    p: f64,
    u: f64,
    v: f64,
    w: f64,
    t: f64,
}
let mut cell_value: Vec<Cell> = Vec::new();

Which way do you think would be more efficient? Or do you have other suggestions?

When you index them, the amount of jumping around in memory may be different. Do you think this is the major influencing factor? If not, which factors do you think matter most?

Thanks.

I think that either is fine in general, and which is more appropriate depends on the operations you plan to perform on it.

If, for example, most calculations involve just 1 out of the 5 variables and have to be done over many sequential cells, then separate Vecs might be useful, as that layout is more cache-friendly for that particular computation.

If, however, most calculations involve several variables per cell, then it might be more cache-friendly to represent a cell with a struct and put the variables in there.
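
To make that concrete, here is a minimal sketch of the two access patterns. The update rules are made-up placeholders, not real CFD:

// Struct-of-arrays: a sweep over a single field walks one dense
// array, so every cache line fetched is fully used.
fn scale_pressure_soa(p: &mut [f64], factor: f64) {
    for pi in p.iter_mut() {
        *pi *= factor;
    }
}

// Array-of-structs: a per-cell update that reads all five fields
// finds them adjacent in memory, often within a single cache line.
struct Cell {
    p: f64,
    u: f64,
    v: f64,
    w: f64,
    t: f64,
}

fn update_cell_aos(cells: &mut [Cell], dt: f64) {
    for cell in cells.iter_mut() {
        // Made-up update rule, just to touch every field.
        cell.p += dt * (cell.u + cell.v + cell.w) * cell.t;
    }
}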

Thanks.

So you do think the amount of jumping around in memory is the major factor?

Usually the calculation on each cell needs all 5 pieces of information from several nearby cells, but those cells' indices may not be close to each other. Ideally you would reorder the cells so that neighbors sit as close together as possible before the calculation, but that is not guaranteed, and sometimes not even possible.

The program is usually quite big to write, so I would be satisfied with any single debugged and runnable solution; there is no time to build a second version just to compare. I haven't found a good way to answer this question yet.

How is the CPU cache related to this question? My understanding is: all data live in main memory, and when the CPU needs some of it, it jumps around in memory and pulls only the useful parts into the cache. Is that right?

Let's suppose there are three typical cases of reading and writing memory (the numbers inside [ ] are positions):
1: consecutive: [1, 2, 3, 4, 5]
2: arithmetic progression: [1, 11, 21, 31, 41]
3: random: [1, 20, 43, 73, 84]
I guess #3 will be slower than #1 and #2, but would #2 be slower than #1?

Sorry about the mess, I have no clue...

For this kind of performance problem, where you're not replacing one algorithm with another but care about how the CPU executes your code, it is wise to actually benchmark both versions. Not just a toy example, but your actual fluid dynamics (or whatever) code.

That is the only way to know for sure which is better. There are lots of factors affecting performance, and theorizing about what the CPU (and compiler) does is often wrong.
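
For example, a harness like the criterion crate makes such a comparison easy to set up. A minimal sketch, assuming criterion as a dev-dependency; the two summation kernels are stand-ins for a real solver step:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

struct Cell {
    p: f64,
    u: f64,
    v: f64,
    w: f64,
    t: f64,
}

fn bench_layouts(c: &mut Criterion) {
    let n = 1_000_000;
    let p = vec![1.0_f64; n];
    let cells: Vec<Cell> = (0..n)
        .map(|_| Cell { p: 1.0, u: 0.0, v: 0.0, w: 0.0, t: 300.0 })
        .collect();

    // Single-field sweep: the pattern that favors separate Vecs (SoA).
    c.bench_function("soa_sum_p", |b| {
        b.iter(|| black_box(p.iter().sum::<f64>()))
    });

    // All-fields-per-cell sweep: the pattern that favors the struct (AoS).
    c.bench_function("aos_sum_all", |b| {
        b.iter(|| {
            black_box(
                cells
                    .iter()
                    .map(|cell| cell.p + cell.u + cell.v + cell.w + cell.t)
                    .sum::<f64>(),
            )
        })
    });
}

criterion_group!(benches, bench_layouts);
criterion_main!(benches);

The important part is to benchmark each layout under the access pattern your real code actually uses; that is what decides the winner.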

Yeah, I think so, thanks.

But can you answer this simplified question: if we only care about reading and writing the following memory positions, which would be faster:

1: consecutive: [1, 2, 3, 4, 5]
2: arithmetic progression: [1, 11, 21, 31, 41]
3: random: [1, 20, 43, 73, 84]
4: larger arithmetic progression: [1, 101, 201, 301, 401]

I guess #3 will be slower than #1 and #2.
Would #2 be slower than #1? Would #4 be slower than #2?
Would #3 be slower than #4?
Or is only #1 fast, with all the others equally slow?

Thanks.

I would guess #1 < #2 < #4 < #3 (in run time), because even though #4 has a longer stride the CPU will be able to predict and pre-fetch the access pattern. And I might be right. But, in a real algorithm, there are many other factors based on what else is happening. You cannot say that just because #1 is faster in isolation, #1 is definitely the right data structure to use. That's why you must benchmark variations of your real algorithm — there is no other reliable way to know.
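
If you want to see the stride effect in isolation, here is a quick standard-library-only sketch. The array size and strides are arbitrary choices, and a proper harness would give more trustworthy numbers:

use std::time::Instant;

fn main() {
    // ~128 MiB of f64s, far larger than any CPU cache.
    const N: usize = 1 << 24;
    let data = vec![1.0_f64; N];

    // Strides 1, 10, 100 correspond to patterns #1, #2, #4 above.
    // Each run touches N / stride elements, so compare the time
    // per element touched, not the total time.
    for stride in [1usize, 10, 100] {
        let start = Instant::now();
        let mut sum = 0.0;
        let mut i = 0;
        while i < N {
            sum += data[i];
            i += stride;
        }
        // Printing the sum keeps the loop from being optimized away.
        println!("stride {:>3}: {:?} (sum = {})", stride, start.elapsed(), sum);
    }
}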

Thanks!

The way to improve memory locality here is to perform multiple steps of the computation in a relatively small area, so that all of that area's data stays in the cache throughout those steps, before moving on to another area.

This can be done in a recursive fashion, by recursively splitting squares into smaller squares and so on, which makes it work well for any cache size and for multiple levels of cache.

Details are somewhat tricky.
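
For a flavor of the non-recursive version, here is a loop-tiling sketch over a 2-D grid stored row-major in one Vec. BLOCK is a tunable placeholder and the per-cell update is a stand-in for a real solver step; the recursive variant, and running several time steps per block, are where the tricky parts (dependencies along block edges) come in:

// BLOCK is a tuning knob: pick it so one block of cells (plus its
// neighbors) fits comfortably in L1 or L2 cache.
const BLOCK: usize = 64;

fn step_blocked(grid: &mut [f64], nx: usize, ny: usize) {
    for bj in (0..ny).step_by(BLOCK) {
        for bi in (0..nx).step_by(BLOCK) {
            // Finish all work inside this block before moving on,
            // so the block's data stays cache-resident.
            for j in bj..(bj + BLOCK).min(ny) {
                for i in bi..(bi + BLOCK).min(nx) {
                    let idx = j * nx + i; // row-major indexing
                    grid[idx] *= 0.99; // stand-in for the real per-cell update
                }
            }
        }
    }
}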
