Why isn't Rust even faster than C++? Comparing heap vs stack allocation

So, I've been doing a lot of benchmarking of C++ functions recently to understand on a fine-grained level which functions are slow and which are fast. Using LLVM, so I assume it's roughly the same as if Rust was calling these low-level functions.

One thing that was interesting was that stack allocation took at an average of 0.15 nanoseconds, malloc took roughly the same amount of time, though free was a killer, taking between 0.25 and 1 microsecond (250 to 1000 nanoseconds) per free() on average.

Given most memory that is allocated needs to be freed and for stack allocation, this operation is pretty much free (move stack pointer.. needs to be done either way), why isn't Rust blazingly fast compared to C++? Is malloc rarely needed either way so it's not really a big deal performance-wise if you can do it faster?

All benchmarks and everything I've seen put them practically neck and neck.

We can't really discuss your results without seeing your benchmark code.

That is my experience as well. As a Rust noob with much C/C++ experience of course that comparison comes to my mind and I have recreated a bunch of my little C/C++ codes in Rust. So far Rust has matched and some times improved performance.

1 Like

Rust puts a big emphasis on using the stack where possible, so that may be why people don't complain as much about free() being less performant.

By default Rust uses the system allocator (i.e. malloc() and free() from the platform's libc). Do you see any difference when switching to jemalloc?

2 Likes

I am not sure what the question here is.

In C++ you can use the same amount of stack allocated objects as in Rust. The amount of available allocators is basically the same (libc allocator, jemalloc, etc)

So why would you expect any big differences?

4 Likes

Here is the code, by all means if I'm benchmarking it wrong, I'd love to hear suggestions:

I'm looking through the C++ programs on benchmarking game as I was curious if they were using alloca(stack alloc), they don't use a lot of alloc but the ones that run slower compared to Rust actually tend to. Stack allocation is not supported as a standard, which is why malloc is more typically used.

@Michael-F-Bryan
Where does Rust use malloc? I thought that was primarily only done for "Box", which isn't used often.

Vec are allocated on the heap. Hence malloc.

Unless the Vec is zero length of of zero sized elements: Vec in std::vec - Rust

4 Likes

Your question makes a false assumption and then asks why the assumption is true. (I think I finally found a case where use of "begs the question" would be correct :slight_smile: ).

All the comparisons I have seen and codes I have written show C++ and Rust to be equally "balzingly" fast.

I thought this sounded odd the first time I read it. It says "I assume" which implies you have not actually written any Rust to compare with the same C++ functionality.

This is now confirmed by the fact that I see no Rust code in the repository you have linked to.

What actually is the question here?

5 Likes

It's true, I'm going off of other benchmarks and comparisons I've seen to compare the two instead of doing it myself.

If anyone wants to whip up a similar test for Rust, it just might highlight operations where C++ wins.. which means that likely Rust could be optimized to at least match it. Would be interesting too. Note though that I'm running C++ in debug mode and with settings so it won't inline functions or optimize anything away. Probably still a fair comparison though if Rust is run the same way but there are a couple small things such as Rust checking for overflow in debug mode, not in release mode that C++ doesn't do.

Happy to reorganize this code so everything is in it's own little function instead of cramming all these tests into one function.

Do you have links to those other benchmarks and comparisons?

Take a look at the benchmark game: Rust vs C gcc - Which programs are fastest? which shows Rust matching C++, sometimes a bit faster, sometimes a bit slower.

Take a look at the work of Cliff Biffle with the same project for embedded micro-controller in C++ and Rust. Rust was quicker! Rewriting m4vgalib in Rust - Cliffle

Take a look at my experiments with creating simple cache friendly array processing in Rust and C: https://github.com/ZiCog/loop_blocking

It's very hard to make these kind of comparisons. So much depends on the compiler, GCC vs Clang vs Rust vs MSVC. Or even compiler versions.

I notice that whatever language you use small changes in your source, that look inconsequential, can make many percent difference in performance.

3 Likes

If you are comparing in debug mode all bets are off.

I have not measured anything here but my gut feeling is that turning off optimizations slows Rust down a lot more than C++.

Just a feeling I get.

2 Likes

All heap-allocated types in Rust use the system allocator (malloc/free) by default:

  • Box
  • Rc and Arc
  • Vec
  • String, OsString, CString, and PathBuf
  • HashMap and HashSet
  • The other types from alloc::collections (VecDeque, BTreeMap, etc.)
  • BufReader and BufWriter

Overall, well-optimized Rust code and well-optimized C++ code tend to use the heap in very similar ways. In some of the benchmarks game programs, we even tuned the code deliberately to use the same allocation patterns as some of the C and C++ entries for the sake of easy comparison.

7 Likes

@ZiCog I'm using Clang, same as Rust.

Will have to use something like this to force the compiler to not optimize away the very functions we are trying to use. Because I'm doing stuff like calling empty functions or additions and not using the results, it's very hard to test in release mode without worrying that the compiler will figure it's useless code and tell me it took 0 seconds to complete.
https://programfan.github.io/blog/2015/04/27/prevent-gcc-optimize-away-code/

@mbrubeck
I can understand in that circumstance but seems like as much as possible things should be done in a Rusty way as benchmarks are a huge way to show Rust's benefits. Perhaps you can throw up the results of the non-C tuned programs somewhere for a comparison more fair to Rust? Would have to spend a little time seeing if the C version could be easily tuned in similar ways. Does benchmark game require standard library only? In that case.. huge win for Rust.

The optimizations that were removed from the Rust programs were also removed from several C, C++, and other programs. They aren't unique to Rust, and wouldn't give any special advantage to Rust.

In general, almost any optimization that can be used in Rust can be used in C++, and vice-versa. Rust tries to make the process easier and safer, but it doesn't add radically new performance possibilities. Usually, highly-optimized Rust code and highly-optimized C++ code can both approach the limits of the underlying hardware and operating system. In terms of low-level performance, these two languages are much more similar than they are different.

No, although it discourages libraries that are written just for winning at benchmarks. General-purpose libraries like the Rust regex crate or the C pcre library are allowed.

4 Likes

I don't mean to be rude but I think your bench marking approach is futile.

Any useful program/function has input and output. The important thing to find is the time taken to produce the output from the input.

Nobody cares if an empty function is removed and the program exists in a single instruction, or even none.

A function with no input could well be optimized away and whatever output it should produce be computed at compile time and used instead.

Not only that, one should benchmark over a range of inputs. A compiler could well optimize a multiply into a shift or a division into a multiplication for any particular known value. All kind of tricks.

When bench marking with Rust there are suggestions for dealing with some of these issues:Benchmark tests

2 Likes

That was just an example and something I was curious about but the idea of knowing how much time is spent on calling the function vs actually executing what is inside of it.

It's not needed for most purposes I agree but for the project I'm working on where squeezing just a little bit of extra performance out of it is a big deal.. yes, I need it. I found some interesting things by performing benchmarks most people don't even think about.. for example, I found a way to call async code about 100x faster (more details coming soon).

Sounds like you need some differential timing.

Write some expression or a whole bunch of them and put timing around them. Then move that processing into a function and time that. The difference is the overhead of the function call and return.

Of course the compiler might in line your function anyway so you end up with the same time results.

Point here is that making a function call does not actually compute anything in itself. It's futile measuring it.

Benchmarks should deal with observables. The inputs and the outputs.

I'm all ears :slight_smile:

2 Likes

Worth noting the current version of those docs lives at test - The Rust Unstable Book (exact same content though, so eh)

1 Like

Regarding Rust allocating a lot of things on the heap.. is there a reason Rust doesn't use the stack by default in those scenarios where it can? Thought that was a big part of the reason for borrowing, I know memory safety is more important.

@ZiCog
Yup, already using differential timing in some places such as timing the for loop initially, then subtracting it from other measurements, but you're right.. might be some other ways to use it too, thanks.

Box, Rc, and Arc allocate on the heap because their purpose is to create owning pointers that aren't limited to the lifetime of any local stack frame. (The same is also true in a sense of the collection types: They are small values that act as owning pointers to larger blocks of memory, so you can pass around ownership without memcpy'ing the entire contents around everywhere.)

Vec, String, and the other collection types allocate on the heap because they hold arbitrarily-sized data. They allow data that is larger than the stack and/or has a size that varies at run-time. (This is also true for some use cases of Box/Rc/Arc, when they are used to store ownership of types like slices or trait objects.)

Rust provides plenty of ways to avoid the heap when it makes sense to do so, but typical programs have a lot of cases where it makes the most sense to put data on the heap. Especially since the stack is typically limited in size to a few MB.

2 Likes

Expanding on this some more: “The stack is fast, so it's better than the heap” is the wrong way to look at things. You need to look at the strengths and weaknesses of both allocation options.

The stack is fast because of its limitations. Stack frames are small, laid out statically at compile time, and everything in a stack frame gets destroyed at the same time when its scope ends. So the stack is a good place to store values that are small, fixed-width, and/or temporary.

The heap is huge, global, and managed at run-time. So the heap is a good place to store values that are large, dynamically-sized, and/or long-lived.

And you don't need to choose just one. The heap and stack are made to work together. When you allocate space on the heap, you get back a pointer. This pointer is small, so we often put it into a variable that lives on the stack. For example, in Rust when you have a variable of type String, some of the string's data is stored on the stack and some on the heap, working with the strengths of each.

10 Likes