Performance optimization of a Vec of structs


In an application, I have a struct holding 320 bytes of data, and many instances of it are stored in a flat vector. Even though opt-level=3 and target-cpu=native are set, I am seeing performance bottlenecks on simple push operations.

This code roughly reproduces my problem:

```rust
struct Data {
    a: u32,
    b: u32,
    c: [u8; 32],
}

let mut storage: Vec<Data> = new Vec::with_capacity(500_000_000);

for i in 0..500_000_000 {
    storage.push(Data { a: 0, b: 0, c: [0; 32] });
}
```

Execution time for this loop is over 6 seconds on my development machine.
Parallelizing it behind a mutex is even slower, presumably due to extreme lock contention.
Are there any suggestions on how to speed up this seemingly trivial scenario?
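For completeness, the mutex-based parallel attempt looked roughly like this (a reconstructed sketch with a much smaller n, not my actual code):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

#[derive(Clone, Default)]
struct Data {
    a: u32,
    b: u32,
    c: [u8; 32],
}

// Fill a shared Vec from `threads` worker threads, taking the lock per push.
fn fill_with_mutex(n: usize, threads: usize) -> usize {
    let storage = Arc::new(Mutex::new(Vec::with_capacity(n)));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let storage = Arc::clone(&storage);
            thread::spawn(move || {
                for _ in 0..n / threads {
                    // Every push takes the lock, so the threads serialize here.
                    storage.lock().unwrap().push(Data::default());
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let len = storage.lock().unwrap().len();
    len
}

fn main() {
    // Much smaller n than the real workload, just to show the shape.
    println!("filled {} elements", fill_with_mutex(100_000, 4));
}
```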

Are you sure about the `new`?

Have you done a flamegraph of the code? As far as I can tell, you are requesting 500_000_000 * 320 = 160 GB of heap memory; almost any malloc implementation will choke on such a large request. I suspect the bottleneck is in the malloc call (via with_capacity). You would also incur a huge number of cache misses during the pushes (to be precise, the writes go through a write-back buffer, but even that would eventually overflow).
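To put numbers on that, here is a quick size check using the 40-byte example struct from the post (the real struct is 320 bytes per element, hence the 160 GB figure):

```rust
use std::mem::size_of;

struct Data {
    a: u32,
    b: u32,
    c: [u8; 32],
}

fn main() {
    let n: usize = 500_000_000;
    // The example struct is 4 + 4 + 32 = 40 bytes, with no padding.
    println!("example struct: {} bytes total", n * size_of::<Data>());
    // The real struct is stated to be 320 bytes per element.
    println!("real struct:    {} bytes total", n * 320);
}
```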

> Are you sure about the `new`?

Scratch the `new`, you are correct. My apologies: I typed that code up to abstract the problem and didn't actually copy it out of my editor.

> Have you done a flamegraph of the code?

To be honest, I have not (yet). The code looked so trivial that profiling it wasn't my first step; I figured someone could perhaps point me in a direction I should be heading otherwise. As far as requesting heap memory goes, the vector initialization completes immediately; I suppose the virtual memory can be mapped quickly and is backed by transparent hugepages. The actual slow path is definitely the push operations, and how that could be sped up is outside my knowledge. Since I figured the compiler would emit a capacity check in this scenario, as a test I added an assert!(i < storage.capacity()) before the push just to be sure, but that changed nothing.

I am aware that this is quite a lot of data; unfortunately, there is no way to structure it differently. A Merkle tree is calculated over those 500_000_000 elements, and organizing them in anything other than a flat vector would probably end up even slower.
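One alternative I plan to benchmark is letting the vec! macro build the storage in one expression instead of pushing in a loop; that at least reserves once and skips the per-push capacity check (a sketch with a smaller n; whether it also lowers to a single zeroed allocation for this struct is something I still need to verify):

```rust
#[derive(Clone, Default)]
struct Data {
    a: u32,
    b: u32,
    c: [u8; 32],
}

// Build the whole vector in one go instead of a push loop; the element
// is cloned n times, but the allocation and length bookkeeping happen once.
fn build(n: usize) -> Vec<Data> {
    vec![Data::default(); n]
}

fn main() {
    let storage = build(1_000_000); // smaller n so the sketch runs anywhere
    assert_eq!(storage.len(), 1_000_000);
}
```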

Which means that the cache misses during write-backs become the bottleneck.
Profiling is definitely a good thing to do; let us know what you find.
Also, I would try measuring the time taken by the following C code:

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    int32_t a;
    int32_t b;
    int8_t c[32];
} Data;

int main(void) {
    Data *storage = malloc(500000000 * sizeof(*storage));
    memset(storage, 0, 500000000 * sizeof(*storage));
}
```

I am pretty confident that you are right with:

> Which means that the cache misses during write-backs become the bottleneck.

Especially the fact that specifying #[repr(packed)] is slightly faster confirms that suspicion: better cache utilization due to the removed padding. I am by no means a professional Rust developer; do you have any ideas what could be done about this? I will do a flamegraph and try the C variant on my machine to have some data to compare against. Perhaps I should give jemalloc a try, though I doubt it will change anything significantly.
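To sanity-check the padding reasoning, here is a quick size_of experiment with a made-up field layout (not my real struct):

```rust
use std::mem::size_of;

#[repr(C)]
struct Padded {
    a: u8, // 7 bytes of padding follow so that `b` is 8-byte aligned
    b: u64,
}

#[repr(C, packed)]
struct Packed {
    a: u8, // no padding, but accesses to `b` become unaligned
    b: u64,
}

fn main() {
    assert_eq!(size_of::<Padded>(), 16);
    assert_eq!(size_of::<Packed>(), 9);
}
```

Packing trades smaller, more cache-friendly elements against slower unaligned field accesses, so it is worth benchmarking rather than assuming.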
