Not quite zero-cost abstraction


#1

(co-authored by @bugaevc and @YaLTeR)

Zero-cost abstraction is one of the major selling points of Rust, along with memory safety and others. In fact, it is listed first in the Rust features list on the rust-lang.org frontpage.

Zero-cost abstraction means abstraction without overhead. This post in the official Rust blog states that zero-cost abstraction is a core principle of Rust and cites the following definition given by Bjarne Stroustrup:

What you don’t use, you don’t pay for. And further: What you do use, you couldn’t hand code any better.

There are many things in Rust the language (such as null pointer optimization, traits, closures and iterators) and its ecosystem that do indeed provide pleasant and easy-to-use abstractions without imposing any overhead.

But this is, unfortunately, not always the case.

Here’s an example of a high-level abstraction that does have a hidden cost:

#[macro_use]
extern crate serde_json;

fn main() {
    let obj = json!({
        "hello": "world",
        "nums": [42, 35]
    });
    serde_json::to_writer(std::io::stdout(), &obj)
        .expect("Failed to serialize json into stdout");
}

(playground link)

If the abstractions provided by json and serde were indeed zero-cost, we would expect this code to compile in release mode (with LTO additionally enabled) to something very similar to what this code compiles to:

use std::io::prelude::*;

fn main() {
    let message = b"{\"hello\":\"world\",\"nums\":[42,35]}";
    stdout().write_all(message)
        .expect("Failed to serialize json into stdout");
}

Furthermore, we would expect a simple

print!("a");

to have no additional cost over

let res = libc::write(1, b"a", 1);
if res < 0 {
    panic!("some error message");
}

None of this is the case. In fact, serde_json performs heap allocation, and accessing stdout has to deal with Arc<ReentrantMutex<RefCell<LineWriter<Maybe<StdoutRaw>>>>> and it is not optimized away even if we only output one line and never create a second thread.

As another example, something like

&format!("{} {} {} {} {}", 1, 2, 3, 4, 5)

keeps the whole formatting code intact to be run at runtime.

And these are just a few examples of how non-zero cost Rust frequently is. We do pay for what we don’t use (line buffering, runtime formatting or JSON encoding, concurrent stdout access), and we can easily hand-code it to be better (not that we would want to do that, because it leads to uglier code).

It seems that in most cases, these optimizations are prevented by the reluctance of Rust (or LLVM as its component) to inline function calls. Notably, many functions in Rust’s standard library (for example, iterator adapters) are marked #[inline], and that seems to be the reason why they are a zero-cost abstraction.

Other functions like printing could require a threading analysis by the compiler to determine if a mutex or a TLS variable is only used by a single thread.

This leads us to the following questions:
• What are the reasons for a compiler not to inline a function that is only called once? This neither increases the code size nor negatively impacts caching.
• Are these missed optimizations considered an issue for Rust? (They should be, since it’s a violation of its core principle.) Should they be documented?
• Is anybody planning to do anything about it?
• Why isn’t LTO enabled by default in release builds?
• When is it a good idea to manually mark methods as #[inline]?


#2

Consider:

f() compiles to 1KB of code.
g() compiles to 128KB of code.
f() rarely calls g().
f() is the only caller of g().
f() is called frequently in a performance-critical loop.

If the instruction cache were only 32KB, inlining g() into f() would blow the instruction cache and this would hurt performance.

However, inlining isn’t the only way to do dead code elimination. Even if g() isn’t inlined info f(), the compiler could use the information about how f() calls g() to cut out guaranteed-unused code paths.

When is it a good idea to manually mark methods as #inline(always)?

I mark functions #inline(always) when inlining seems essential for performance and/or when the function is so simple (e.g. it just forwards to another function adding one argument) that I expect that the inlining will have little or no effect on code size.


#3

“Zero cost abstractions” can never be an absolute in practice. So your “not quite” qualification is appropriate, or maybe “not always” is slightly better. C++ is in the same boat, despite Bjarne coining the term.

The reason is because it relies on the Sufficently Smart Compiler fallacy. Languages can make themselves more amenable to optimization, but they rely on compilers to peel abstractions away. They can guarantee certain things happen at the micro level (e.g. null pointer optimization, monomorphization, layout control, etc), but that’s part of making themselves amenable for optimizers.

There’s a good reason inlining is the “mother of all optimizations” - it’s what peels the abstractions away and lets the compiler reason about the otherwise “black box”. So, not really saying anything revolutionary here, but inlining must occur for zero cost to be even possible. Again, not Rust specific.

As to why inlining can fail (without force by user), there can be many reasons unfortunately, and they will be compiler specific. Generally, compilers use a bunch of heuristics to decide on online candidates - those heuristics are tuned by “common” code shapes but are otherwise just that - heuristics. Code size is one of them, usually. Then there’s also caller heuristics - is the callee inside a loop? If so, maybe give it an inlining bonus. But also need to be careful about icache size of a loop body. But maybe try inlining and see the result. But then keep inlining all other call sites. But then what if we don’t think the net result after all inlining is profitable? Do we back out all inlining? Only some and reassess? That’s going to take a while even if we want to spend time on it. But then users hate long compile times - will the net result definitely be faster? By how much? Will users think the extra compile time was worth it? And so on. Things like that get hairy real quick.

So the point being is inlining will fail from time to time, and you’ll need to force the compiler to do it because you know better.


#4

If you want without runtime processing cost, you should be looking at metaprogramming. There is no suggestion that the serde_json code would provide anything to remove processing steps when presented with static input.


#5

This is a good example. I’d just add though that it’s possible for f() to compile down much smaller than 1KB if g() was inlined, and it would be a net win. The problem is the compiler has to make this decision a priori (typically) and has to do it many times in a big program. PGO adds another dimension where block frequency is known and so compilers have a bit more data to guide their decisions. But, it’s still imperfect of course.


#6

I’d just add though that it’s possible for f() to compile down much smaller than 1KB if g() was inlined, and it would be a net win.

The obvious solution is in order to make inlining decision, always try inlining it, run optimization to the end, and compare size. The obvious problem is this is too slow. One solution is to replace inlining heuristics with try-inlining heuristics. That is, heuristics decide not whether or not to inline, but whether or not to try-inline. Now you arrived at Towards better inlining decisions using inlining trials (1994) by Jeff Dean. (Yes, it’s the same person.) I always wondered why this method isn’t more popular.


#7

I understand Rust’s “zero-cost abstractions” feature to be that the language allows creating such abstractions. Not that all abstractions it allows are zero cost.


#8

Yeah, of course you can write costly abstractions in any language. The point here is that it should be easy to write zero-cost abstractions using idiomatic Rust.


#9

On that note, here is a zero-cost abstraction for building JSON strings. We use this in the test suite for serde_json.

let obj = json_str!({
    "hello": "world",
    "nums": [42, 35]
}); 

// expands to
let obj = "{\"hello\":\"world\",\"nums\":[42,35]}";