Blog Post: Memory Management in Rust and Swift

I wrote a high-level blog post about how memory is managed in Rust and in Swift. It may not be super informative if you're already familiar with both languages, but if not, hopefully it's a good overview of how the two languages tackle this issue in sometimes similar, sometimes different ways. Hope you like it!


This approach works well and can be much more efficient than more traditional garbage collection approaches, because it does not require some mechanism to traverse the entire heap searching for memory that is no longer referenced.

I have not tried Swift specifically, but so far all reference-counting implementations I've seen have been slower than good tracing, copying, generational garbage collectors. The reason is that a managed heap makes allocation and deallocation more efficient, which makes up for much of the time spent tracing, while on the other hand maintaining the reference counts is not free either.

The main advantages of reference counting over tracing are that it does not introduce random pauses, and that it calls destructors immediately when the last reference goes out of scope, not only at some arbitrary later time when the runtime gets around to a collection pass.
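
To make the second point concrete in Rust terms, here is a minimal sketch using std::rc::Rc (library reference counting, so only an analogue of ARC): the destructor runs at the exact moment the last reference goes away, not at some later collection pass.

use std::rc::Rc;

struct Resource;

impl Drop for Resource {
    fn drop(&mut self) {
        // Runs deterministically, the moment the last Rc clone is dropped,
        // not whenever a collector happens to run.
        println!("resource released");
    }
}

fn main() {
    let a = Rc::new(Resource);
    {
        let _b = Rc::clone(&a); // refcount = 2
    }                           // _b dropped here, refcount = 1
    drop(a);                    // refcount = 0: "resource released" prints now
    println!("after the last reference was dropped");
}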

Does Swift's ARC break reference cycles the way CPython does?

Interesting. I've heard that this can be the case. What implementations are you referring to specifically? Why is (de)allocation slower? Is it because the managed heap has a better picture of the heap in its entirety and can therefore make better decisions about where to place the memory?

I'm not sure how Python handles this, sorry. But as you can see here, Swift uses the "weak" and "unowned" keywords to specify the relationship between references. So ultimately it's up to the programmer to tell the compiler how the circular reference should be broken, and if they don't do this, then memory is leaked.
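
For comparison, the analogous pattern in Rust is a sketch like the following, using std::rc::{Rc, Weak}; here too the programmer has to opt into a weak back-reference, otherwise the parent/child cycle leaks:

use std::cell::RefCell;
use std::rc::{Rc, Weak};

struct Parent {
    children: RefCell<Vec<Rc<Child>>>,
}

struct Child {
    // A strong Rc<Parent> here would form a cycle and leak both objects;
    // the Weak reference does not keep the parent alive.
    parent: RefCell<Weak<Parent>>,
}

fn main() {
    let parent = Rc::new(Parent { children: RefCell::new(Vec::new()) });
    let child = Rc::new(Child { parent: RefCell::new(Rc::downgrade(&parent)) });
    parent.children.borrow_mut().push(Rc::clone(&child));

    // upgrade() returns Some only while the parent is still alive.
    assert!(child.parent.borrow().upgrade().is_some());
}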

Both Java and .NET have pretty fast collectors, but basically all runtimes with generational copying collectors are rather good.

For one, I remember a comparison between C# and Vala, where the latter uses reference counting and otherwise compiles to native code via C. The latter was slower on some mostly comparable code, and allocation and reference counting were cited as the major performance bottleneck.

A generational copying collector has a chunk of memory in which new objects are allocated. This chunk of memory is contiguous and free, so allocating an object is just a matter of incrementing a pointer.

Now when the chunk is full, a collection pass is run. This moves all the surviving objects to another memory chunk, designated for longer-term storage. When that is done, the chunk is simply considered free again, so all that needs to be done is resetting the pointer to the start.

So allocation and deallocation are essentially as cheap as stack allocation, and the cost is associated with objects that survive the first collection pass.
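
A deliberately toy sketch of that bump-pointer idea in Rust (no survivor copying, no object headers, just the allocation path), to show why allocating is a pointer increment and "freeing" the whole chunk is a reset:

/// Toy nursery: a fixed buffer plus a bump pointer.
struct Nursery {
    buf: Vec<u8>,
    next: usize,
}

impl Nursery {
    fn new(size: usize) -> Self {
        Nursery { buf: vec![0; size], next: 0 }
    }

    /// Allocation is just bumping the pointer. When the chunk is full,
    /// a real collector would run a minor collection here instead of
    /// returning None.
    fn alloc(&mut self, len: usize) -> Option<&mut [u8]> {
        if self.next + len > self.buf.len() {
            return None;
        }
        let start = self.next;
        self.next += len;
        Some(&mut self.buf[start..start + len])
    }

    /// After survivors have been copied to the older generation,
    /// the whole chunk is free again: just reset the pointer.
    fn reset(&mut self) {
        self.next = 0;
    }
}

fn main() {
    let mut nursery = Nursery::new(1024);
    {
        let _obj = nursery.alloc(16).expect("fits"); // allocation = pointer bump
    }
    nursery.reset(); // "deallocate" everything at once
}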

But explicit allocators can't do this. They have to manage lists of free chunks and walk them back and forth on every allocation and deallocation, and that takes time.

Now in object-oriented languages, the temporary objects usually significantly outnumber the more permanent ones. And for the managed runtime, all of these are basically free, while the explicit allocator can only put the fixed-size ones on the stack.

Now there is another bit to the story: while considerable research went into the Java and .NET collectors, the typical malloc in a typical standard C library is a complete disaster. And not that there was no research. There was, and it produced significantly improved allocators like umem and jemalloc (which Rust uses). It just never made it into most standard C runtimes.

Don't forget the typically higher memory usage of GC-based schemes.

This is true, but on today's memory-laden systems it doesn't make as much of a difference (except when it does; that is why the JVM, for example, has explicit options to size the heap).

Let's not forget that jemalloc also takes an upfront allocation of 6 MB for its arenas. Granted, that's much less than the typical JVM, but the point is: if you can spend memory to gain speed, why not?


I think @comex was referring to the fact that GC-based systems rely on the free space in the old generation to amortize the cost of old-generation marks across many allocations, so performance degrades significantly as memory use approaches 100%, while systems with free lists and explicit deallocation are unaffected by memory use until they stop working entirely (but you'd have to be very careful about fragmentation to benefit from this).

That's a naive question (and I realize the answer is probably "it's much more difficult than you think it is"), but couldn't the compiler optimize away a good part of the reference counting and simply pass the value by reference when it is sure some function isn't copying it?

The question isn't naive at all. For example, Java uses escape analysis to reduce heap allocations.

We actually have a lint in clippy that does something similar and allows us to flag things on the heap that could actually live on the stack.
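
For illustration, this is the kind of pattern such a lint can flag (a sketch; I believe the clippy lint in question is boxed_local, but the exact name and suggested fix may differ):

// The Box never escapes the function, so the value could just as well
// live on the stack; a lint can point that out.
fn sum(v: Box<Vec<i32>>) -> i32 {
    v.iter().sum()
}

// The fix is simply to drop the Box:
fn sum_unboxed(v: Vec<i32>) -> i32 {
    v.iter().sum()
}

fn main() {
    assert_eq!(sum(Box::new(vec![1, 2, 3])), 6);
    assert_eq!(sum_unboxed(vec![1, 2, 3]), 6);
}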


Escape analysis should go in the compiler, not in a lint.

I disagree, for two reasons:

  1. Rust makes things explicit. You have to do some work to put things on the heap, and silently putting them on the stack instead could break unsafe code that relies on things being heap-allocated. Also, the stack is fairly limited on all systems I care about. Putting too much data on the stack will at least no longer segfault your application, but IIRC it won't do what you want either.
  2. The lint makes it fairly easy to change your code to get rid of heap allocations (basically "remove the Box::new here"). So the cost to the programmer is fairly minimal, and the compiler doesn't have to do extra work. Furthermore, given idiomatic code, the lint is unlikely to ever be triggered, because explicit heap allocation just isn't that common in idiomatic Rust. So the lint not only improves performance, but also steers the programmer toward writing more idiomatic code.

This is example code from an answer by matklad:

use std::collections::HashMap;

fn main() {
    let tags_string = "KEY1=VAL1,KEY2=VAL2";
    let tags: HashMap<String, String> = tags_string.split(',')
        .map(|kv| kv.split('=').collect::<Vec<&str>>())
        .map(|vec| {
            assert_eq!(vec.len(), 2);
            (vec[0].to_string(), vec[1].to_string())
        })
        .collect();
}

It contains a .collect::<Vec<_>>(); that's a heap allocation of a two-element vector for each input record. I'd like a future Rust compiler to perform escape analysis on such code and remove all those heap allocations.

The problem here is that the compiler cannot know the size of the Vec beforehand, even with the assertion in place. And we unfortunately don't have stack-allocated arbitrary-length vectors yet (even generic_arrays have to have a compile-time length).

I'm with you on wanting the optimization, but I'd still like to have it as a lint, so I can see what's going on and where I can improve my code. Making this stuff explicit has the added benefit of knowing when it fails.

This discussion is about compiler optimizations, not user-level containers. So even if we don't yet have user-level stack-allocated variable-length arrays (and I think they will be quite useful to implement something similar to the Ada 2012 bounded containers), the compiler can override the regular heap allocation of the Vecs in that code and stack-allocate them using LLVM's alloca...

Even under otherwise favorable assumptions, the compiler can't replace the Vec<&str>'s allocation with a simple stack allocation, because it doesn't know the vector can't grow larger than 2. It can, of course, be larger if there is more than one '=' in a record, albeit the extra elements are immediately discarded. In theory you could write a specialized optimization to notice and take advantage of that (it would also have to rely on either &str's lack of a destructor or to_string's semi-purity to avoid changing semantics by reordering the to_string calls on the first two elements with the destruction of the others; complicated), but such an optimization would probably be hard to generalize and thus not very useful.

A somewhat more feasible optimization might be guessing based on the assertion that the length will be 2, starting with a stack allocation of 2, and changing to a heap allocation if necessary. But this would still require changing Vec from 'just a user type that happens to be in the stdlib' to one with some compiler magic, or else adding some sufficiently general mechanism for user code to introduce such optimizations, which would be very tricky. (Haskell's rewrite rules are pretty cool, but in an imperative language they wouldn't be expressive enough.) And it would still generate suboptimal code to deal with the allocation and deallocation we don't even want.
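
As a library-level approximation of that "assume two elements, spill to the heap only if needed" strategy, something like the third-party smallvec crate can be used today; this is an explicit opt-in in the code, not the compiler magic discussed above:

use smallvec::SmallVec; // third-party `smallvec` crate, not part of std

fn main() {
    let kv = "KEY1=VAL1";
    // Up to 2 elements are stored inline; only a longer split result
    // would fall back to a heap allocation.
    let parts: SmallVec<[&str; 2]> = kv.split('=').collect();
    assert_eq!(parts[0], "KEY1");
    assert_eq!(parts[1], "VAL1");
}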

...Or we could just go with Rust's philosophy of making things explicit. In this case, once Rust has integer generics, it'll be less painful to add impls to fixed-size arrays, so there could be an impl like

impl<T, const N: usize> FromIterator<T>
    for Result<[T; N], SomeError>

which would return a fixed-size array if the iterator in fact returned exactly that number of elements, or an error otherwise.
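
A rough sketch of that shape with today's const generics, written as a free function instead of a FromIterator impl (the leftover Vec plays the role of SomeError here, and a real std impl would presumably build the array in place without the intermediate Vec):

// Hypothetical helper: succeeds only if the iterator yields exactly N items;
// otherwise the collected items are handed back as the error.
fn collect_array<T, const N: usize>(
    iter: impl IntoIterator<Item = T>,
) -> Result<[T; N], Vec<T>> {
    // TryFrom<Vec<T>> for [T; N] succeeds only when the length is exactly N.
    iter.into_iter().collect::<Vec<T>>().try_into()
}

fn main() {
    let pair: [String; 2] = collect_array("KEY1=VAL1".split('=').map(|s| s.to_string()))
        .expect("expected exactly two fields");
    assert_eq!(pair[0], "KEY1");
    assert_eq!(pair[1], "VAL1");
}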

Or we could notice that the example code is probably 'wrong', in that if we have foo=bar=baz we probably want that to mean ("foo", "bar=baz") rather than crash the program, and go with:

// unwrap is bad, but the example also crashed with an 
// invalid input, so it's only fair
let idx = kv.find('=').unwrap();
(kv[..idx].to_string(), kv[idx+1..].to_string())

which is shorter anyway.


No, it shouldn't, in the case of Rust. In the case of Java or Swift, where heap allocation is part of the language and happens for every reference, sure, it makes sense.

Rust already gives you complete power to choose the type of allocation to use. Additionally, the compiler does not know anything about the heap; these are just API calls. Given that Rust prefers to be explicit, it may not necessarily be a good idea if suddenly Box, Vec, and String stopped allocating in certain places. Nor am I in favor of making these types language types. If the concept of an allocation which has a pure destructor (nothing other than deallocation) and only reallocates on mutation can be boiled down to a fundamental wrapper type (like NonZero), I'm a bit more open to this. Still, cases like the collect example have a chance of reallocation, and you may need to add additional annotations to collect(). It gets sticky.

A lint lets you notice cases where you're doing this, and clean it up by removing the heap allocation, if necessary. It doesn't force an optimization on you. In other languages this isn't possible since the user doesn't have control over the allocation type.


Actually, LLVM already has an optimization pass which removes unnecessary heap allocations. It currently isn't usable in Rust because it looks for certain hardcoded function names (malloc, free, operator new, operator delete), but it shouldn't be too hard to extend it.

The pass works through dead code elimination: if it can see both the allocation and the free (through inlining) and all stores to the allocated memory are dead then the allocation calls can be eliminated. For example, this code:

#include <stdlib.h>

int f(int x) {
    int* y = malloc(sizeof(int));
    *y = x;
    int z = *y;
    free(y);
    return z;
}

will be optimized down to:

int f(int x) {
    return x;
}
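
The Rust analogue of that example would be something like the following; today the Box allocation typically survives precisely because, as noted above, the pass only knows the hardcoded C/C++ allocator names:

fn f(x: i32) -> i32 {
    let y = Box::new(x); // heap allocation that is just as dead as the malloc above
    *y                   // the Box is freed at the end of the function; only x is needed
}

fn main() {
    assert_eq!(f(3), 3);
}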

This would be great (especially as it works after inlining), but a lint would still be good, because removing needless heap allocations not only makes the code faster, but usually also makes it more readable.