Why does my data structure allocate so much memory?

Hello,
I have a data structure like this one:

enum SemanticSurface {
    RoofSurface,
    GroundSurface,
    WallSurface,
}

struct Material {
    name: String,
    ambient_intensity: Option<f32>,
    diffuse_color: Option<[f32; 3]>,
    emissive_color: Option<[f32; 3]>,
    specular_color: Option<[f32; 3]>,
    shininess: Option<f32>,
    transparency: Option<f32>,
    is_smooth: Option<bool>,
}

struct Texture {
    image: String,
}

type Point = [f64; 3];

type LineString = Vec<Point>; // There are many Points in a LineString.

struct Surface {
    boundaries: Vec<LineString>, // Usually, there is one LineString in a Surface boundary, but there can be more.
    semantics: Option<SemanticSurface>,
    material: Option<Material>,
    texture: Option<Texture>,
}

type Shell = Vec<Surface>; // There are many Surfaces in a Shell.

enum Geometry {
    Solid {
        lod: String,
        boundaries: Vec<Shell> // Usually, there is one Shell in a Solid boundary, but there can be more.
    }
}

I know the exact size of the vectors, but only at runtime. The data in the vectors comes from an intermediary container that was deserialized from a JSON file.
Since I know the size, I create each vector .with_capacity().

The data itself is in the Points. The other fields, Surface.semantics, Surface.material, Surface.texture does not contain any data for now. I just left it there, because maybe it matters? Similarly, the Solid.lod field is a string of a few characters only.

When I measure the allocated size of these structures I see that there is significantly more memory allocated for the vector in Geometry::Solid.boundaries than what I can explain from multiplying the number of Points by 24 bytes.

I measure the heap allocation size with the datasize crate. I also check the peak memory use with /usr/bin/time -v and valgrind massif. They confirm the memory usage that I measure for Geometry::Solid.boundaries, and since I'm in the GB-range, I don't really care about a few Mb difference...

But to be specific, these are the numbers that I measured:

  • total nr. points: 34543002 (calculated from this, there is about 829 Mb of Point data on the stack, right?)
  • total nr. of Surface: 11514334
  • total size of Surface-s: 1105 Mb (I think this is heap allocation only, since I measured it with datasize::data_size.
  • total nr. of Geometry-s : 16865
  • total size of the boundary vector of all Geometry: 2948 Mb (measured the same way as the Surface, with datasize::data_size)

I checked if the vectors are over-allocated with Vec::capacity() - Vec::len(), and this is 0 for each vector.

Could someone help me to find out why is there nearly 3x heap mem allocated for the Geometry.boundary than for the Surfaces? Since the Geometry.boundary is just a Vec<Vec<Surface>> and it doesn't have unused capacity (capacity - len = 0), I would expect that the difference in the allocated mem would be negligible.

Or am I mixing concepts and measuring the wrong thing?

The data isn’t on the stack. It’s on the heap, because the points are inside of some Vecs. The amount of heap data coming from the points alone being about 829 Mb should be accurate though.

In case that wasn’t clear, the calculation done by datasize::data_size is recursive, i.e. the size for the whole boundaty vector should contain all the data all the way down to the points.


The size of each Surface isn’t clear from the code you provided because SemanticSurface, Material, and Texture are not presented. Dividing the 1105 Mb number you gave, minus the 829 Mb of the points, by the number of surfaces, gives 24 bytes, which corresponds to mem::size_of::Vec<Point> from the LineStrings. Again, you seem to have added up all “datasize::data_size” values for all the surfaces here, but the data that would be on the stack if you had just a single Surface is also going to go on the heap once the surfaces themself go into a Vec.

(I’m assuming one LineString per Surface like you described in the comments.)


So 24 bytes per Point, and an additional 24 bytes per LineString (the size of a Vec is 24, from 3 words for the pointer, length, and capacity), one LineString per Surface, that’s the

34543002 * 24 + 11514334 * 24 = 1105376064 bytes (1105 MB)

so far, now the heap data of the whole boundary is 2948 MB, you say. The additional memory is going to be coming from each of the surfaces, where per surface we’ll have 24 bytes for the boundaries: Vec<LineString> field, and an unknown number from the remaining three fields. Calculating

(2948000000 - 1105376064 - 24 * 11514334) / 11514334 = 136.0287

suggests 136 bytes for an Option<SemanticSurface>, Option<Material>, and Option<Texture>, combined (plus potentially up to 7 bytes of padding). Assuming this 136 bytes size gives an overall heap usage of your Gemetry’s boundarys of

34543002 * 24 + 11514334 * (24 + 24 + 136) + 24 = 2948074264

Assuming all of the geometries are yet again part of some Vec, that adds another (24 + 24) bytes per Geometry for the log and boundaries, so data_size of a Vec containing all the Gemotrys would be

2948074264 + 16865 * (24 + 24) = 2948883784


Tip: If you want the whole size of a value including heap data and stack data, you would need to add std::mem::size_of::<…>() of the type to its heap data (you can also use size_of_val for convenience, so it’s data_size(&x) + size_of_val(&x)). E.g. assuming my calculations are correct, std::mem::size_of::<Surface> should be 160. If a value is put in a Vec, then this combined size is the overall heap usage you get due to the value.

6 Likes

Thank you for your detailed reply!
I edited my question and added the definitions for the Option<SemanticSurface> , Option<Material> , and Option<Texture>.

I added size_of_val() to the measurements, so that I calculate with data_size(&x) + size_of_val(&x), and now everything adds up.

And yes, the std::mem::size_of::<Surface> is exactly 160 :slightly_smiling_face:
From which:

  • boundaries: 24
  • semantics: 1
  • texture: 24
  • material: 104
  • padding: 7

Now when I calculate the totals with data, I get:

Surfaces [Mb] 2948

  • boundaries [Mb] 1381
  • semantics [Mb] 11
  • texture [Mb] 276
  • material [Mb] 1197

Which surprised me I have to say, since only the boundaries and semantics contain actual data, texture and material are "just" empty initialization. But this is clearly because of my lack of understanding.

Since for my tests I created the Surfaces with material and texture set to None, I assumed that those fields somehow magically won't allocate memory until they are filled with data...

let srf = Surface {
    boundaries: <some value>
    semantics: <some value>,
    material: None,
    texture: None,
}

That's impossible. Rust is statically typed, and every type has its own size, determined by the compiler.

Well, more exactly, there are so-called "dynamically" sized types like [T] and dyn Trait, but even that doesn't mean that you can just change their size – it's not that their size is variable, it's more like it isn't known until runtime. A dyn Trait still has the constant size of the concrete type it was created from, and while a slice can be re-sliced behind a reference, you still can't reallocate it unless you own it and round-trip through an explicit Vec, at which point it's really not the size of a type that changes, but that of a heap-allocated buffer.

Rust structs are also just pure values. They don't imply heap allocation, and in particular, they are not hash tables from strings to arbitrary values. A struct has a layout with all of its fields, which is determined by the compiler once, and it is baked into the executable in the sense that all code the compiler emits relies on the size of the type and its field offsets. Fields don't get allocated dynamically if they are optional, and optionals aren't magic — an enum like Option takes as much space as its largest variant (Some), and some more for a tag ("discriminant") that says which variant is currently active. If this weren't implemented as it is, it would be impossible to put structs and enums into contiguous collections, for example.

7 Likes

Also thank you for the clear explanation!
Although I did work through the related sections of the rust book, it all crumbles when I'm facing practical challenges like this one :laughing:
Also, this why I'm so glad this forum exists!

2 Likes

The distinction between the “shallow size” of a value, the size returned by mem::size_of::<T>() for its type, compared to any (owned) heap data (whose size is what’s returned by the data_size method) is that the memory for the shallow size can be allocated on the stack, as long as the value is on the stack. This is the amount of data that needs to be copied whenever the value is moved around, and if you move a value onto the heap, e.g. by putting it into a Box or pushing it into a Vec, then this shallow size amount of bytes will be written to the heap, too. (So then everything is on the heap.)

A None value of an Option won’t take up any heap data as long as that Option is on the stack. If it’s moved into a Vec (or otherwise put onto the heap), heap memory of size mem::size_of::<Option<……>>() (e.g. mem::size_of::<Option<Material>>()) is necessary. For Option values of “large” types such as Material, if the Option is very often None, it can thus sometimes make sense to change Option<T> to Option<Box<T>>, so that in case of None only 1 word (8 bytes) are used. The downside is that it adds an additional allocation (the number of alocations is somewhat relevant for performance, since creating each allocation is some overhead, and separate allocations can be distant in memory, resulting in cache-unfriendlyness), and also it adds the additonal pointer (1 word) of memory usage for the Box on top, in case the Option wasn’t None.

(Feel free to try out yourself what effects using Option<Box<Material>> would have in your None-heavy [or exclusively None-using] test case.)

In case the same Material is used often for multiple surfaces, it could also make sense to, instead of Box, use a shared-ownership type like Arc (or – equivalently but slightly faster – Rc if you don’t need support for using it in multiple threads). With Option<Arc<Material>> you have the same 1 word / 8 bytes of shallow size for the Option<Arc<Material>> itself as you had with Box, but the heap size needs an additional 2 words / 16 bytes per Material; however this heap data can then be shared, so that two clones of the same Arc<Material> will only use up those 104 bytes plus also the additional 16 and the heap data of the String once. So this could make a lot of sense if you’d otherwise be cloning Materials a lot.

3 Likes

Looking into the Box, Rc or Arc types has been on my list so I'm glad that you recommend it. It confirms that it is worth investigating.

What I see already is that I need to rethink how I store the semantics, materials and textures for a Surface.
It is exactly as you say. The same Material is used for multiple surfaces. In fact, normally there are only a few materials for all the surfaces in a Geometry, so copying them onto each Surface is incredibly wasteful. And the memory usage clearly shows it.

So I'm thinking to have a material container per Geometry which stores those few Material instances, and then the Surface references the Material in the material-container. But I guess how this would work is a different topic already.

1 Like

In other words, there are no "stack types" and "heap types" in Rust. There are just types. And there are values of course, which have a type. And the type of the value doesn't influence whether it is on the heap or on the stack. Whether a value will go on the stack or the heap depends on you. More precisely, it depends on the declaration of the value. By default, if you just assign a new value to a variable, it will go on the stack (where else? – it would be wasteful and in most of the cases unnecessary to automatically heap-allocate it).

If you specifically ask for the creation of a container, and the container uses the heap, and then you put your values into the container, then of course the values will go on the heap. This is nothing special – values go wherever you want them to go.

Consequently (and relatedly), references don't always point to the heap, either. You take a reference to a local variable? It will point to the stack. You take a reference to the contents of a Box or a Vec or a BTreeMap? It will point to the heap. This distinction usually doesn't really matter anyway – trying to avoid heap allocations is an optimization, and sometimes it can't be avoided. But values on the stack and on the heap don't behave in different ways. It's all just memory, after all.

(There are cases when this matters, e.g. moving a Box won't change the address of its heap-allocated referent, and sometimes this is required for the soundness of unsafe code. But I reckon unless you are dealing with unsafe, you don't need to worry about these things at all.)

3 Likes

So sounds like you're well on your way to success - 1) understanding :slight_smile: 2) deduping (here with Rc).

I sometimes do 2 things with 2 isn't enough. An Rc is 8 bytes (local struct) + 8 (counter) + 24 bytes (String struct) plus the payload size (POSSIBLY with an 8 or 16 byte dynamic memory prefix; depending on the system allocator). If you have 1 million strings, you can do a few things to shore this down.

  1. create a super-string with all the deduped strings concatenated.. Then only store offset+length (note you don't need terminating strings like zeros).

  2. You COULD use a slice to the shared super-string.. This would be 16 bytes.. But if you KNEW with bible-truths that it would be under 4GB, you could use a (u32,u32) and create the string slice on the fly.

  3. If you knew the strings could be put in-order.. you could push 1 extra item onto the slice stack then use ptr = offset[i] ; len = offset[i+1] - offset[i]. So now you're down to 4 byte per string slice. If items can't be listed in order (including if there are dups), then you'd need a second array (the re-order array - but this only needs as many bits as you have elements, whereas the offset array needs as many bits as you have string bytes)

  4. if you really, god honestly knew, and had all the might of Zeus behind you.. That all strings were under 256 bytes. You could construct an 8bit array and use some very fancy math and calculate a virtual offset(i), offset(i+1) (and by step 3 calculate the length).

  5. If you were still unsatisfied, you could organize all the strings by length, sort them, then for every bundle of 1024 of a given string size span, extrude the first letter of each string (gives 25% redux of size for 4 letter names, 12.5% redux for 8 letter names). Really pushing it at this point.

Then, you can note that everything I said about strings applies to floats..
Then step 5 becomes.. make all your 4 byte doubles only 3 bytes (25% redux)

  1. super-duper-fancy-math, reduce the float array to 32 - log_2(num_elements) + 2 bits each. :slight_smile:
2 Likes

There’s 2 counters, one for the number of strong references and one for the number of weak references. So there’s 16 bytes for the counters. (Those 2 counters are the “additional 2 words / 16 bytes per Material” I mentioned in my previous answer.)

When the strong counter hits zero, the payload is dropped. (Which fill free any heap memory owned by [fields of] the payload value, but does not yet free the “shallow” memory it requires.) Only when both counters hits zero, the allocation of the payload’s shallow size plus the two counters is freed.

4 Likes

Thanks for the clarification. Was hoping they'd assume you couldn't have 4 billion dangling objects; as other languages I've used have a 16bit weak-ref counter. Oh well, guess we're ready for gangdum style dependency-graph trees.

1 Like

Thank you for the tips!
I admit that many of them goes above my head. Would you mind explaining the super-duper-fancy-math :slightly_smiling_face: in point 6? How does it reduce my array of [f64;3]?

And I think it is reasonable to expect a super-string will be under 4GB (famous last words...), so I can use u32 as the two indices, but I don't understand what you mean in point 3. How does the it work with the slice stack and the re-order array?

Or if you could point me to some material where these optimizations are explained, that would be great help as well.

You're building a renderer of some sort, so this is your legally mandated suggestion to look into using an ECS (Entity-Component-System) of some sort.

The basic idea is you do something like:

#[derive(Eq, Ord)]
struct Entity(u64);
struct ComponentSet<T>(BTreeMap<Entity, T>);

struct World {
  next_entity: u64,
  solids: ComponentSet<Solid>,
  shells: ComponentSet<Shell>,
  surfaces: ComponentSet<Surface>,
  ...
  materials: ComponentSet<Material>,
  textures: ComponentSet<Texture>,
  ...
}

Where these collections are effectively optionally attaching components to entities without taking space in every entity, and it's trivial to share other entities just by using the same entity id, but importantly in being ordered by entity you can iterate over all entities with their components you care about easily, and the functions doing this are called systems, completing the ECS name.

Bevy uses a really nicely designed and pretty API for doing this, but it's pretty easy (but ugly) to do manually too.

I will say though, that it takes a while to get used to designing in this way: do you embed a Vec of children or a parent entity id, are assets little textures entities or is that separately handled, how do you handle stateful operations like culling etc., but the answer is generally "whatever is simplest in the system" and "add more components"

1 Like

That's a stab at the easy button version of what I was talking about. It removes the Rc and Vec overhead per item.. the Point itself was fine (using only 12 bytes). I hadn't actually done one of these in rust, and I'm actually getting into very similar (3D float vector stuff myself) so it was worth the exercize.

Note, many of my points above are mutually exclusive - I was purposefully vague because there are a LOT of "that depends" variants, and the above description was over-bloated as it was.

In generally you have two conflicting strategies.. Prefix-compress things that are in order (databases like LevelDB, Cassandra do this).. Whether they be string or floats doesn't matter - theyr'e &[u8] one way or another.. In any "sorted set" you can perform prefix removal. If your'e concerned about performance more than space, then you should stick with byte-aligned prefix removal.. typically adding a 1 byte prefix-dup len in front of each string is good enough.. BUT you only achieve meaningfully compression when you get above 1024 elements (DBs typically have millions of sorted records so they're fine). If you are willing to do some bit-twiddilng, you can still use byte-aligned structures, but the "prefix" can be compacted arithmatically, while the suffix says in nice fixed-length arrays of strings (bytes, tupples, etc).

The sorting requirement has the problem that your data doesn't WANT to be sorted, it wants it's own natural values (e.g. each float tends to have a slope, but wants to go up or down as needed). You have two options here.. If you had 1 float per element, then you can just reorder the elements into the float sorted order. This dramatiaclly affects the datamodel (probably doesn't apply to you). If you just used two arrays; one that's sorted, and one that's an index vector, then unless you have a LOT of de-duplication, you'll wind up just growing the data. Granted, any duplicated floats save 4 bytes, v.s. a 2 or 3 byte index.

GLTF uses the [(f32,f32,f32); N] with a separate [u32; N] to the above point - it just blanketly assumes the floats are duplicative (if every float was unique, this would have actually grown the data by 33% - for example, an extra 4 byte pointer for every tuple).

Finally, there is some deep layering you can do to have zero size-growth in the worst, case (unlike the above 1 byte prefix length), but massive compression in the case of high duplication and high ordering (e.g. runs of increasing or decreasing float values).

It has to do with constructing a CumulativeDistributionFunction with a max-negative bias + 1. If you have [ 2, 1, 5, 7 ] you can convert this into {-2} [2, 3, 9, 13] (my off the top of my head math might be off).. Key is that the 4 number input becomes a 5 number thing; a scalar (representing the max negative delta bias) and an increasing SORTED array. Thus you can apply prefix compression to the array part. Here I have integers, but you can do something similar for floats (or use some non-linear mapping from floats <-> ints). This takes a lot more cpu, but, again, depends on the use case. if the GPU is physically full, it might be worth burning cpu in the shader to have larger field arrays.

I won't go into the bit-level prefix compression, except to say, shannon entropy characterizations suggest you CAN compress any dataset with a known set of probabilities to (re-arranged because I find Shannons formulas not directly useful) log_2(n) - log_2(k) + err_bias per entry.. I spent a better part of a decade on this theory, looking at all the various compression techniques; most are close to this shannon limit (ANS is the currently trending approach) - but either have MASSIVE cpu costs, or can't decompress random elements without first decompressing preceeding elements (which a GPU shader very much would like to do). To put in perspective, if I have 16 million sorted floats (using the above technique) and due to the C.D.F. accumulation the adjusted sum is 18 billion (35bits), then the per-entry storage requirement is `35 - 24 + 1.486 => 12.486 bits' Or better than 2-to-1 compression for RANDOM float data.

3 Likes

Very minor nitpick: shouldn't Option<Box<T>> have the same size as Option<T> due to being able to use the null pointer as Option::None?*

  • there might be targets that do not have this property, but my understanding for mainstream platforms was the behavior as described.

Maybe you meant that Option might only use one byte for the discriminant where Box requires a full pointer? But then it stillt wouldn't be one full extra word on top (only one extra word - 1 byte)

1 Like

Of course not, since T might be much larger then Box<T>. But Option<Box<T>> will have the same size as Box<T>, yes (this is not dependent on platform, since Box<T> is always non-null).

2 Likes

If you have some value x: T, then the total memory usage…

  • …of None: Option<T>
    is size_of_val(&x)
    = size_of::<T>()
    plus some space needed for the enum tag/discriminant, but this might be zero for many types T, including Material[1]
  • …of Some(x): Option<T>
    is size_of_val(&x) + data_size(&x)
    = size_of::<T>() + data_size(&x)
    plus some space needed for the enum tag/discriminant, but this might be zero for many types T, including Material[2]
  • …of None: Option<Box<T>>
    is size_of::<usize>()
    = 8 (on 64bit systems)
  • …of Some(Box::new(x)): Option<Box<T>>
    is size_of::<usize>() + size_of_val(&x) + data_size(&x)
    = 8 + size_of::<T>() + data_size(&x)

So by using Option<Box<T>> instead of Option<T>, you

  • save size_of::<T>() - 8 bytes when the value is None
  • need additional 8 bytes when the value is Some
    (and the fact that there’s an additional allocation is involved in this Some case has some performance overhead and probably also memory overhead (depending on how the system allocator works), too)

  1. because Material contains String, Option<float>, Option<[float; 3]>, and Option<bool> and each of those has some impossible values/bit-patterns that can be used to mark the None case (one of them would have been enough) ↩︎

  2. because Material contains String, Option<float>, Option<[float; 3]>, and Option<bool> and each of those has some impossible values/bit-patterns that can be used to mark the None case (one of them would have been enough) ↩︎

5 Likes

cleaned up code, now Cow for all returns

1 Like