The performance cost of returning multiple values (tuples)

I am developing a performance-sensitive data service using Rust. The goal is to perform multiple operations on a list of 200,000 elements within a few seconds, with a total computation volume reaching tens of billions of operations.

The process involves multiple linear steps, leaving little room for multithreading. As a result, I have conducted performance tests on almost every code block.

Currently, I need to make a change from the first case to the second case:

// Case 1
for element in data {
    let result = sub_task(element);
}

// Case 2
for element in data {
    let (result, state) = sub_task(element);
}

Instead of just returning a bool, I now need to retrieve an additional state to pass to the next step.

Due to performance concerns, I tested the impact of returning a tuple and found that it significantly affects performance. However, when conducting a similar test in C++, the performance impact of returning a tuple is relatively smaller.

Rust Code:

use std::hint::black_box;
use std::time::Instant;

use rand::{Rng, thread_rng};

// Define a simple structure
#[derive(Debug, Copy, Clone)]
struct RandomStruct {
    flag: bool,
    value: u8,
}

// First function: returns a u8 value
fn return_u8(rng: &mut impl Rng) -> u8 {
    let test: u8 = rng.gen();
    black_box(test);
    black_box(test > 10);
    test
}

// Second function: returns a tuple (bool, u8)
fn return_tuple(rng: &mut impl Rng) -> (bool, u8) {
    let test: u8 = rng.gen();
    black_box(test);
    (test > 10, test)
}

// Third function: returns a RandomStruct
fn return_struct(rng: &mut impl Rng) -> RandomStruct {
    let test: u8 = rng.gen();
    black_box(test);
    RandomStruct {
        flag: test > 10,
        value: test,
    }
}

// Benchmark the performance of random number generation
fn benchmark_rng(iterations: usize, rng: &mut impl Rng) {
    let start = Instant::now();
    for _ in 0..iterations {
        let value = black_box(rng.gen::<u8>());
        black_box(value); // Prevent compiler optimizations
    }
    let duration = start.elapsed();
    println!(
        "Random number generation: {:?} (for {} iterations)",
        duration, iterations
    );
}

fn main() {
    let iterations = 100_000_000; // Test 100 million iterations
    let mut rng = thread_rng(); // Initialize a global random number generator once

    // Benchmark: random number generation
    benchmark_rng(iterations, &mut rng);

    // Test: returning a u8 value
    let start = Instant::now();
    for _ in 0..iterations {
        let value = black_box(return_u8(&mut rng));
        black_box(value); // Prevent compiler optimizations
    }
    let duration_u8 = start.elapsed();

    // Test: returning a tuple (bool, u8)
    let start = Instant::now();
    for _ in 0..iterations {
        let value = black_box(return_tuple(&mut rng));
        black_box(value); // Prevent compiler optimizations
    }
    let duration_tuple = start.elapsed();

    // Test: returning a RandomStruct
    let start = Instant::now();
    for _ in 0..iterations {
        let value = black_box(return_struct(&mut rng));
        black_box(value); // Prevent compiler optimizations
    }
    let duration_struct = start.elapsed();

    // Print results
    println!(
        "return_u8: {:?}, return_tuple: {:?}, return_struct: {:?}",
        duration_u8, duration_tuple, duration_struct
    );
}

C++ Code:

#include <iostream>
#include <random>
#include <tuple>
#include <chrono>
#include <cstdint>

// Define a structure
struct RandomStruct {
    bool flag;
    uint8_t value;
};

// Random number generator
std::mt19937 rng(std::random_device{}()); // Mersenne Twister engine
std::uniform_int_distribution<int> dist(0, 255); // note: uint8_t is not a permitted IntType for uniform_int_distribution, so draw an int and narrow

// First function: returns uint8_t
uint8_t return_u8() {
    uint8_t test = dist(rng);
    volatile auto value = test; // Prevent optimization
    volatile auto flag = value > 10;
    return value;
}

// Second function: returns (bool, uint8_t)
std::tuple<bool, uint8_t> return_tuple() {
    uint8_t test = dist(rng);
    volatile auto value = test; // Prevent optimization
    return {value > 10, value};
}

// Third function: returns RandomStruct
RandomStruct return_struct() {
    uint8_t test = dist(rng);
    volatile auto value = test; // Prevent optimization
    return {value > 10, value};
}

// Benchmark random number generation
void test_rng_generation(size_t iterations) {
    auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < iterations; ++i) {
        volatile auto value = dist(rng); // Prevent optimization
    }
    auto duration = std::chrono::high_resolution_clock::now() - start;
    std::cout << "Random number generation (100M): "
              << std::chrono::duration_cast<std::chrono::milliseconds>(duration).count()
              << "ms" << std::endl;
}

int main() {
    const size_t iterations = 100000000; // Test 100 million iterations

    // Benchmark: random number generation
    test_rng_generation(iterations);

    // Test: return uint8_t
    auto start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < iterations; ++i) {
        volatile auto value = return_u8(); // Prevent optimization
    }
    auto duration_u8 = std::chrono::high_resolution_clock::now() - start;

    // Test: return (bool, uint8_t)
    start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < iterations; ++i) {
        volatile auto value = return_tuple(); // Prevent optimization
    }
    auto duration_tuple = std::chrono::high_resolution_clock::now() - start;

    // Test: return RandomStruct
    start = std::chrono::high_resolution_clock::now();
    for (size_t i = 0; i < iterations; ++i) {
        volatile auto value = return_struct(); // Prevent optimization
    }
    auto duration_struct = std::chrono::high_resolution_clock::now() - start;

    // Print results
    std::cout << "Results: "
              << "return_u8: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(duration_u8).count()
              << "ms, return_tuple: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(duration_tuple).count()
              << "ms, return_struct: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(duration_struct).count()
              << "ms" << std::endl;

    return 0;
}

Rust command:

RUSTFLAGS="-C opt-level=3 -C target-cpu=native" cargo run --release

Rust result:

Random number generation: 230.609456ms (for 100000000 iterations)
return_u8: 243.934695ms, return_tuple: 531.157848ms, return_struct: 524.711489ms

C++ command:

g++ test.cpp -o test -std=c++20 -O3
./test

C++ result:

Random number generation (100M): 226ms
Results: return_u8: 261ms, return_tuple: 313ms, return_struct: 336ms

I am not proficient in either Rust or C++. I compared it with C++ just to verify whether the performance impact of tuples is a universal issue across languages. The C++ code was generated with the help of AI.

I would like to ask for help: in addition to this case, tuples are also used extensively elsewhere in my code. Is there a way to optimize this?

2 Likes

Do you need to?

The generated code more than meets the goals you set of processing 200k samples within "a few seconds," by processing a hundred million (benchmark) iterations in half a second on your machine. I would be sorely tempted to call it good at that point unless you have a specific goal for which further time savings would be useful.


For what it's worth, with rust 1.82, I see ratios much closer to the ones you got from your C++ example:

% uname -moprsv
Darwin 23.6.0 Darwin Kernel Version 23.6.0: Thu Sep 12 23:36:55 PDT 2024; root:xnu-10063.141.1.701.1~1/RELEASE_ARM64_T8112 arm64 arm

% cargo run --release
[…]
warning: `jqxyz-bench` (bin "jqxyz-bench") generated 1 warning
    Finished `release` profile [optimized] target(s) in 0.40s
     Running `target/release/jqxyz-bench`
Random number generation: 490.563458ms (for 100000000 iterations)
return_u8: 468.986125ms, return_tuple: 566.48425ms, return_struct: 562.374208ms
4 Likes

I'm not sure how much effect the RNG is having on your test... Have you tried reordering the tests to verify that it's related to the return types (and not the RNG entropy buffer getting emptier)?

You may also look into criterion to get more rigorous statistical analysis of the test cases. I recall they have a lot built in to avoid cold/hot cache differences and they measure the variance between runs.

4 Likes

These microbenchmarks don't teach you anything about your final program performance. Things that are slower in such kinds of benchmarks might speed up the final program because they allow better compiler optimizations.

How large is sub_task, and how often is it used in other places of your program? If it is small and used only once, it will likely get inlined anyway, and the return type will not matter.

By the way: your C++ code does not inspect the individual tuple/struct elements in the outer loop. There is just a single volatile for the complete object, which will likely result in a single write. However, I never understood the volatile semantics of local variables...

3 Likes

"They're the same picture."

The benchmarking strategy chosen is probably at fault. criterion was already suggested, and that will help fix whatever unintended side effects you've introduced.

But, benchmarking is very difficult. It's impossible to say that you are measuring what you think you are measuring, unless you fully study the compiler output. And even then, there's no guarantee that a newer compiler will always generate the same output.

Also, the requirement to perform tens of billions of operations within a few seconds, though underspecified, is trivial for a modern CPU. We're in the future, our CPUs are in the millions of MIPS range [1]. That's a million-million instructions per second. A trillion. Per second. You might be prematurely optimizing.


  1. Multi-core MIPS. Single-core performance is on the order of 10 instructions per clock cycle per core. At 5 GHz, that's 50,000 MIPS per core. ↩︎

11 Likes
  • In fact, the actual computation requires approximately 50 billion iterations, with random number generation needed in about 1/10 of the cases. Due to the long runtime, the test code only uses 100 million iterations.
  • I tested it on three different machines, all using version 1.82, and the results were similar across all of them. Thanks for letting me know about the other cases.
  • Yes, I previously tested swapping the order, and I got the same results.
  • I used Criterion for benchmarking, and the results still showed a similar ratio.
    Since I only have basic knowledge of Rust, I’m not sure if this is correct. Could you please help review the code? Thank you!
random_number_generation
                        time:   [2.2938 ns 2.2987 ns 2.3039 ns]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

return_u8               time:   [2.3549 ns 2.3605 ns 2.3666 ns]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

return_tuple            time:   [5.3944 ns 5.4141 ns 5.4342 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

return_struct           time:   [5.4137 ns 5.4363 ns 5.4614 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

Code:

use criterion::{criterion_group, criterion_main, Criterion};
use std::hint::black_box;
use rand::{Rng, thread_rng};

// Define a simple structure
#[derive(Debug, Copy, Clone)]
struct RandomStruct {
    flag: bool,
    value: u8,
}

// First function: returns a u8 value
fn return_u8(rng: &mut impl Rng) -> u8 {
    let test: u8 = rng.gen();
    black_box(test);
    black_box(test > 10);
    test
}

// Second function: returns a tuple (bool, u8)
fn return_tuple(rng: &mut impl Rng) -> (bool, u8) {
    let test: u8 = rng.gen();
    black_box(test);
    (test > 10, test)
}

// Third function: returns a RandomStruct
fn return_struct(rng: &mut impl Rng) -> RandomStruct {
    let test: u8 = rng.gen();
    black_box(test);
    RandomStruct {
        flag: test > 10,
        value: test,
    }
}

// Benchmark: random number generation
fn benchmark_rng(c: &mut Criterion) {
    let mut rng = thread_rng();

    c.bench_function("random_number_generation", |b| {
        b.iter(|| {
            let value = black_box(rng.gen::<u8>());
            black_box(value);
        })
    });
}

// Benchmark: return_u8 function
fn benchmark_return_u8(c: &mut Criterion) {
    let mut rng = thread_rng();

    c.bench_function("return_u8", |b| {
        b.iter(|| {
            let value = black_box(return_u8(&mut rng));
            black_box(value);
        })
    });
}

// Benchmark: return_tuple function
fn benchmark_return_tuple(c: &mut Criterion) {
    let mut rng = thread_rng();

    c.bench_function("return_tuple", |b| {
        b.iter(|| {
            let value = black_box(return_tuple(&mut rng));
            black_box(value);
        })
    });
}

// Benchmark: return_struct function
fn benchmark_return_struct(c: &mut Criterion) {
    let mut rng = thread_rng();

    c.bench_function("return_struct", |b| {
        b.iter(|| {
            let value = black_box(return_struct(&mut rng));
            black_box(value);
        })
    });
}

// Criterion benchmark group
criterion_group!(
    benches,
    benchmark_rng,
    benchmark_return_u8,
    benchmark_return_tuple,
    benchmark_return_struct
);
criterion_main!(benches);

  • There are hundreds of child_task and dozens of parent_task, with each parent_task utilizing several dozen child_task.
  • The AI informed me that volatile tells the compiler not to optimize this part, which is roughly equivalent to black_box. I also conducted some small tests, and when I added such a declaration, the execution time increased slightly.

Initially, my case involved returning values like (true, 0), where the return value is a literal. In this situation, the performance of returning a tuple is the same as returning a u8. However, when I changed the data to random values, the difference between the two approaches started to show. Due to my limited knowledge of Rust, I had no choice but to seek help here.

Regarding the computation count: you were referring to instructions, while I was referring to each condition, if, and = operation in the for loop being executed once. I apologize for the misunderstanding.

A 50 billion iteration empty loop takes 11 seconds on my device without optimizations. On the client’s device, this could mean a significantly longer wait, especially in examples like mine. If I change the child tasks to return tuples, the final time could double or increase even further.

Thank you very much for your help!

1 Like

How was this situation measured? 11 seconds seems about right for incrementing a register 50 billion times. But increasing the return value from one register to two isn't going to double the time unless you are doing something like collecting them all into a vector, or otherwise blowing out the data cache.

Instruction level parallelism is capable of populating two registers in a single instruction cycle when there is no data hazard.

I'm not saying don't optimize. I'm suggesting you should optimize the expensive part. And you can't tell what's expensive without profiling (empirical measurements). Are you sure you're measuring tuple return values, and not something else?

1 Like

I was asking about the code size of the functions because the return type does not really matter if the functions get inlined.

The AI informed me that volatile is used to tell the compiler not to optimize this part, which is roughly equivalent to black_box. I also conducted some small tests, and when I added a declaration, the execution time increased slightly.

It is true that volatile inhibits optimizations. The problem is that you probably get different behaviour from both. What you are actually measuring is the difference in optimization inhibition.

I have the feeling that you could benefit a lot from learning more about how modern CPUs and compilers work. Returning the struct (a tuple is a regular struct in Rust; the fields just have the names 0, 1, and so on) is itself unlikely to be what determines the final performance. Also, the "time" an empty loop takes tells you very little. The biggest question is which optimizations the compiler can and will do. Modern compilers like Rust's LLVM backend have good heuristics about what is fast and what is slow, and with that knowledge they transform slow code into fast code when possible.

They will inline functions. This removes function call overhead and enables further local optimization. For example, the compiler can see that certain parts of a return value are not used at the call site and remove the calculation altogether.

They will unroll and/or vectorize loops. As already mentioned in this thread, modern CPUs have specific vector instructions to process more than one number at a time. On top of that, they can execute several instructions in parallel in the same thread: a modern CPU core has multiple arithmetic units that do work at the same time.

They will optimize out variables altogether, i.e. keep them in registers the whole time. This saves on what is typically the most important bottleneck: memory bandwidth.

You need to understand that all these optimizations are inhibited if you try to measure properties of the initial code. For example, if you print the address of a variable the compiler is forced to put it into memory.

1 Like

I ran your benchmarks and also saw the same issue. Looking at the generated assembly, I noticed the main difference seemed to be clearing some upper bits of stack memory for the return value. I modified the tuple function to return u32 instead of u8, and performance became the same as the function returning a single u8.

    Finished `release` profile [optimized] target(s) in 0.18s
     Running `target/release/performance-tuple-return`

rng Random number generation: 233.155785ms (for 100000000 iterations)
u8 Random number generation: 237.45502ms (for 100000000 iterations)
tuple1 Random number generation: 536.944549ms (for 100000000 iterations)
tuple2 Random number generation: 257.511156ms (for 100000000 iterations)
struct Random number generation: 527.3874ms (for 100000000 iterations)

rustc 1.82.0 (f6e511eec 2024-10-15)

You might think that returning a u8 instead of a u32 would use less space or something like that, but time is being spent clearing the upper bits of the returned value.

fn return_tuple1(rng: &mut impl Rng) -> (bool, u8) {
    let test = rng.gen();
    (test > 10, test)
}

fn return_tuple2(rng: &mut impl Rng) -> (bool, u32) {
    let test = rng.gen();
    (test > 10, test)
}

I wouldn't change the code to return a u32 to improve performance if u8 is what makes sense in your code.

6 Likes

This once again highlights how tricky and misleading microbenchmarks can be.

You might want to explore the topic of flame graphs for performance measurement and optimization. It can feel a bit overwhelming at first, especially if you're new to it, and the tools may not always be as polished as one would hope. However, investing time to gain hands-on experience can be incredibly rewarding.

A good starting point is flamegraph-rs. Also worth noting is Firefox’s excellent tool for analyzing flame graphs: https://profiler.firefox.com/.

3 Likes

Currently, the codebase is about 10K lines before expansion and 4.5K after expansion. Overall, about one-third of the functionality has been completed.

I agree with your perspective. If I were proficient in compiler principles, I could solve many of these issues. Most of my prior experience is with Python, primarily writing small scripts. In addition to Rust, I’m also juggling responsibilities for product, design, and frontend work. Learning compiler principles has a significant learning curve, and I am not a very experienced programmer. Learning Rust has already demanded a significant amount of effort from me. Before this project, I had never even encountered LLVM. Given these objective reasons, I have no choice but to seek help here. I hope you can understand.

The reason I chose Rust is that it ensures both performance and safety while allowing me to focus on business logic without worrying too much about low-level programming issues. Rust has performed excellently in this regard, and I'm glad I chose it.

C++ is only being used to illustrate the point—it might not be a universal problem, or perhaps there’s a solution I’m not aware of. I have no other intentions; otherwise, I wouldn’t have chosen Rust.

In any case, thank you for your help.

4 Likes

Edited to avoid misleading others.

Thank you all for your answers; the problem has been perfectly resolved.

I understand that "microbenchmarking" might be something that many people frown upon, especially after reading numerous discussions within the community. Before this, I didn’t even know the term existed. I apologize if I unintentionally broke any taboo.

But I feel very strange now. I’m simply a regular user, following the documentation to use a standard feature, and I encountered a small issue while testing performance. I searched through Stack Overflow, the Rust forum, and Google, but couldn’t find the reason for this behavior.

When I came here to seek a solution, I was repeatedly told that my testing method was wrong. But this is exactly how I’m using it in my code, and this is exactly how the documentation teaches it—why can’t I test it this way?

The issue was eventually resolved with the help of Richardscollin. Thank you so much for helping me, rather than trying to prove once again that my testing method was wrong or that optimization wasn’t necessary.

In reality, even if I had learned all the available tools, it wouldn’t have mattered, as the final solution wasn’t something that could be addressed by the tests themselves—it was beyond my knowledge. Does this mean that only those who have studied compiler theory, computer architecture, operating systems, and algorithms, or have completed a computer science degree, are qualified to seek help here?

It's because you're not testing what you think you're testing, and the suggestion to return u32 doesn't "fix" it.

Here's what the result of your benchmark gives me without any changes:

Random number generation: 377.111708ms (for 100000000 iterations)
return_u8: 348.655834ms, return_tuple: 437.488292ms, return_struct: 439.341625ms

And here's what happens when I use criterion::black_box instead of std::hint::black_box:

Random number generation: 335.7585ms (for 100000000 iterations)
return_u8: 350.37925ms, return_tuple: 368.364583ms, return_struct: 368.573459ms

There are good reasons to trust that you may be measuring incorrectly when many people are informing you that you might be measuring incorrectly. And in this case the group is also providing technical reasons so that you may make a more informed decision.

4 Likes

Not at all. After all, you received help from numerous members of the community multiple times, didn't you?

I think you are confusing the outcomes and what it entails. You expected outcome X and instead got numerous replies telling you that wasn't right, and in the end you got outcome Y. Reaching a different outcome than the one you expected and putting the blame on the community for not helping you to reach that outcome are two different things.

1 Like

I copied your benchmark code into a criterion harness:

use criterion::{black_box, criterion_group, criterion_main, Criterion};
use rand::{rngs::ThreadRng, thread_rng, Rng};
use std::time::Instant;

// Define a simple structure
#[derive(Debug, Copy, Clone)]
struct RandomStruct {
    _flag: bool,
    _value: u8,
}

// First function: returns a u8 value
fn return_u8(rng: &mut impl Rng) -> u8 {
    let test: u8 = rng.gen();
    black_box(test);
    black_box(test > 10);
    test
}

// Second function: returns a tuple (bool, u8)
fn return_tuple(rng: &mut impl Rng) -> (bool, u8) {
    let test: u8 = rng.gen();
    black_box(test);
    (test > 10, test)
}

// Second function: returns a tuple (bool, u32)
fn return_tuple_u32(rng: &mut impl Rng) -> (bool, u32) {
    let test: u32 = rng.gen();
    black_box(test);
    (test > 10, test)
}

// Second function: returns a tuple (bool, u8)
fn return_tuple_u32_to_u8(rng: &mut impl Rng) -> (bool, u8) {
    let test: u32 = rng.gen();
    black_box(test);
    (test > 10, test as u8)
}

// Third function: returns a RandomStruct
fn return_struct(rng: &mut impl Rng) -> RandomStruct {
    let test: u8 = rng.gen();
    black_box(test);
    RandomStruct {
        _flag: test > 10,
        _value: test,
    }
}

// Benchmark the performance of random number generation
fn benchmark_rng(iterations: usize, rng: &mut impl Rng) {
    let start = Instant::now();
    for _ in 0..iterations {
        let value = black_box(rng.gen::<u8>());
        black_box(value); // Prevent compiler optimizations
    }
    let duration = start.elapsed();
    println!(
        "Random number generation: {:?} (for {} iterations)",
        duration, iterations
    );
}

fn setup() -> ThreadRng {
    let iterations = 100_000_000; // Test 100 million iterations
    let mut rng = thread_rng(); // Initialize a global random number generator once

    // Benchmark: random number generation
    benchmark_rng(iterations, &mut rng);

    rng
}

fn bench_u8(c: &mut Criterion) {
    let mut rng = setup();
    c.bench_function("return_u8", |b| {
        b.iter(|| {
            let value = black_box(return_u8(&mut rng));
            black_box(value); // Prevent compiler optimizations
        })
    });
}

fn bench_tuple(c: &mut Criterion) {
    let mut rng = setup();
    c.bench_function("return_tuple", |b| {
        b.iter(|| {
            let value = black_box(return_tuple(&mut rng));
            black_box(value); // Prevent compiler optimizations
        })
    });
}

fn bench_tuple_u32(c: &mut Criterion) {
    let mut rng = setup();
    c.bench_function("return_tuple_u32", |b| {
        b.iter(|| {
            let value = black_box(return_tuple_u32(&mut rng));
            black_box(value); // Prevent compiler optimizations
        })
    });
}

fn bench_tuple_u32_to_u8(c: &mut Criterion) {
    let mut rng = setup();
    c.bench_function("return_tuple_u32_to_u8", |b| {
        b.iter(|| {
            let value = black_box(return_tuple_u32_to_u8(&mut rng));
            black_box(value); // Prevent compiler optimizations
        })
    });
}

fn bench_struct(c: &mut Criterion) {
    let mut rng = setup();
    c.bench_function("return_struct", |b| {
        b.iter(|| {
            let value = black_box(return_struct(&mut rng));
            black_box(value); // Prevent compiler optimizations
        })
    });
}

criterion_group!(
    benches,
    bench_u8,
    bench_tuple,
    bench_tuple_u32,
    bench_tuple_u32_to_u8,
    bench_struct,
);
criterion_main!(benches);

The results on my machine may not be what you expect:

Random number generation: 328.380167ms (for 100000000 iterations)
return_u8               time:   [3.5309 ns 3.5337 ns 3.5372 ns]
                        change: [+0.0394% +0.1780% +0.3497%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 21 outliers among 100 measurements (21.00%)
  7 (7.00%) low mild
  2 (2.00%) high mild
  12 (12.00%) high severe

Random number generation: 325.45975ms (for 100000000 iterations)
return_tuple            time:   [3.7244 ns 3.7291 ns 3.7347 ns]
                        change: [+0.3814% +0.5782% +0.8084%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 10 outliers among 100 measurements (10.00%)
  3 (3.00%) high mild
  7 (7.00%) high severe

Random number generation: 325.577625ms (for 100000000 iterations)
return_tuple_u32        time:   [4.2203 ns 4.2287 ns 4.2386 ns]
                        change: [+0.2249% +0.4546% +0.7134%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

Random number generation: 325.15825ms (for 100000000 iterations)
return_tuple_u32_to_u8  time:   [3.7163 ns 3.7196 ns 3.7237 ns]
                        change: [+0.3899% +0.5740% +0.7698%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 15 outliers among 100 measurements (15.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild
  9 (9.00%) high severe

Random number generation: 325.301625ms (for 100000000 iterations)
return_struct           time:   [3.7226 ns 3.7283 ns 3.7359 ns]
                        change: [+0.0310% +0.2217% +0.3989%] (p = 0.02 < 0.05)
                        Change within noise threshold.
Found 13 outliers among 100 measurements (13.00%)
  3 (3.00%) low mild
  3 (3.00%) high mild
  7 (7.00%) high severe

It shows that returning (bool, u32) is actually the slowest for me. About 20% slower than u8. Returning (bool, u8) is only about 5% slower than u8.

That might seem like a lot, but benchmarking truly is hard to do! A significant portion of your application's runtime would need to be spent doing nothing but returning tuples to see that 5% hit. If only 10% of the application runtime is spent returning tuples, then you will only see a 0.5% hit.

Be very careful about what you are measuring and how.

This is a good point. Confirmation bias is no good for anybody.

5 Likes

The differences you're seeing in the benchmarks are caused by different things being passed into black_box. For a better benchmark, reorganize the code so that identical things are passed into black_box the same number of times.

"Returning tuples" is not important: the functions are inlined in the generated code into main and thus there is no explicit passing of return values.

2 Likes