The performance cost of returning multiple values (tuples)

Does this mean that only those who have studied compiler theory, computer architecture, operating systems, and algorithms, or have completed a computer science degree, are qualified to seek help here?

Definitely not. Most people here, myself included, are trying to be helpful and to explain why things are a bit more complicated when optimizing a compiled language. You mentioned that your prior experience was mostly with Python, and with the CPython interpreter it is often true that you can optimize isolated small parts of your program.

I remember coming from a similar angle myself, and I also remember being puzzled by my findings: changes that produced faster microbenchmarks slowed down my program! This does not mean that such benchmarks are useless, but interpreting the results requires looking at more parameters than just wall time.

One thing I would recommend is to have a look at profile-guided optimization and BOLT.

If you are interested in how the CPU works and how to optimize performance I can recommend this talk: Understanding CPU Microarchitecture to Increase Performance.


Let me clarify: I am truly grateful for the community's help and even feel ashamed for benefiting from it without contributing in return.

I did not mean to convey, "Reaching a different outcome than the one you expected and putting the blame on the community for not helping you to reach that outcome are two different things." I was simply confused and frustrated by the issues I encountered during the process. If the translation made it seem otherwise, I sincerely apologize.

What I want to express is that, regardless of how "correct" or "wrong" my approach may have been, I just wanted to understand how to avoid performance issues, or whether my usage pattern could impact performance. That's why I used a method that closely resembled a real-world scenario for testing: whatever the performance issue was, I needed to identify and resolve it, rather than test with a more rigorous scientific method. This "microbenchmark" was already very close to my actual code, which merely has more functions and loops. I wasn't trying to prove that returning tuples affects performance, nor was I trying to claim that C++ is better than Rust. I just wanted to know what to do if I encountered similar issues in my own code. At that point, I instinctively thought I could find an answer online or, perhaps shamefully, seek help from the community. But when I was repeatedly told that my testing method was incorrect, or that assembly-level knowledge was required to solve the issue, I couldn't help but feel frustrated...

Regardless, I deeply appreciate the time and effort you've spent addressing my question. I am in no position to blame anyone, nor was it ever my intention.

The latest replies have provided new insights, and based on those suggestions I removed the impact of black_box (it is now applied only to the final sums, outside the timed loops). After doing so, I observed similar performance across all approaches.

This is the latest code:

use std::time::Instant;
use criterion::black_box;
use rand::{Rng, thread_rng};

// Define a simple structure
#[derive(Debug, Copy, Clone)]
struct RandomStruct {
    flag: bool,
    value: u8,
}

// First function: returns a u8 value
fn return_u8(rng: &mut impl Rng) -> u8 {
    let test: u8 = rng.gen();
    test
}

// Second function: returns a tuple (bool, u8)
fn return_tuple(rng: &mut impl Rng) -> (bool, u8) {
    let test: u8 = rng.gen();
    (test > 10, test)
}

// Third function: returns a RandomStruct
fn return_struct(rng: &mut impl Rng) -> RandomStruct {
    let test: u8 = rng.gen();
    RandomStruct {
        flag: test > 10,
        value: test,
    }
}

// Benchmark the performance of random number generation
fn benchmark_rng(iterations: usize, rng: &mut impl Rng) -> u32 {
    let mut result: u32 = 0;
    let start = Instant::now();
    for _ in 0..iterations {
        let value: u8 = rng.gen();
        result += value as u32; // Accumulate values as u32
    }
    let duration = start.elapsed();
    println!(
        "Random number generation: {:?} (for {} iterations)",
        duration, iterations
    );
    result // Return result to ensure it is used
}

fn main() {
    let iterations = 1_000_000_000; // Test 1 billion iterations
    let mut rng = thread_rng(); // Initialize a global random number generator once

    // Benchmark: random number generation
    let rng_result = benchmark_rng(iterations, &mut rng);
    black_box(rng_result); // Use black_box after the test to prevent optimization

    // Test: returning a RandomStruct
    let mut struct_sum: u32 = 0;
    let start = Instant::now();
    for _ in 0..iterations {
        let result = return_struct(&mut rng);
        struct_sum += if result.flag { 1 } else { 2 }; // Add 1 or 2 based on `flag`
    }
    let duration_struct = start.elapsed();
    black_box(struct_sum);

    // Test: returning a u8 value
    let mut u8_sum: u32 = 0;
    let start = Instant::now();
    for _ in 0..iterations {
        let result = return_u8(&mut rng);
        let flag = result > 10;
        u8_sum += if flag { 1 } else { 2 }; // Add 1 or 2 based on `>10`
    }
    let duration_u8 = start.elapsed();
    black_box(u8_sum);

    // Test: returning a tuple (bool, u8)
    let mut tuple_sum: u32 = 0;
    let start = Instant::now();
    for _ in 0..iterations {
        let result = return_tuple(&mut rng);
        tuple_sum += if result.0 { 1 } else { 2 }; // Add 1 or 2 based on `result.0`
    }
    let duration_tuple = start.elapsed();
    black_box(tuple_sum);

    // Print results
    println!("\nResults:");
    println!("RandomStruct sum: {}", struct_sum);
    println!("u8 sum: {}", u8_sum);
    println!("Tuple sum: {}", tuple_sum);

    println!(
        "Durations - return_u8: {:?}, return_tuple: {:?}, return_struct: {:?}",
        duration_u8, duration_tuple, duration_struct
    );
}

Result:

Random number generation: 1.066217523s (for 1000000000 iterations)

Results:
RandomStruct sum: 1042965858
u8 sum: 1042976607
Tuple sum: 1042966964
Durations - return_u8: 1.264039853s, return_tuple: 1.264772937s, return_struct: 1.264900587s

Thanks for the clarification. I'm also a self-taught developer, and I know how hard it is to deal with problems that trigger the impostor syndrome in me and make me question if my knowledge gaps are insurmountable.

The truth is that they aren't. Most concepts only take a couple of days to a week to learn (plus some days of reinforcement). It's just that there are so many things to learn that we aren't able to predict how long the learning journey will be. I've been a member of this community for around 4 years and I'm still learning every day!

Now to the point of microbenchmarking. Yes, people frown upon it because:

  • It can be misleading. If you want to benchmark something, do so with more meaningful units of code, preferably up to a unit of domain logic.
  • It can lead to unidiomatic code being written for very small or, in the worst case, negligible gains.

As developers we need to choose our battles. Some are not worth the effort.


You did contribute. You asked a good question and I got to learn from the answers.


I'd say an experienced engineer working with low-level / performance-sensitive code would have acquired this knowledge over time either way, even without formal education or theory, just by solving this kind of problem again and again over the years and learning from other people's issues, blog posts, documentation, and experimentation. Since you're new to Rust you don't have all of that knowledge yet, so you'll still have to acquire it; that's normal.
And you don't need to understand it in full detail like a compiler developer. You need some superficial familiarity to have an idea of how the system behaves.

Rust isn't magic, so you have to work with the compiler and develop some mechanical sympathy to get good results.

For example, I don't write assembly (outside a few toy programs way back). I don't even try to read every single instruction or build a mental model of all the program state in a function. I mostly look at the branches and loops and try to spot the hot parts (looking at a profiler and source-annotated assembly can help here), whether the code uses SIMD instructions (where applicable), and whether there are any unexpected function calls, allocations, panics, and things like that.


The most powerful tool you have is profiling your code in situ. It will tell you which areas are most expensive, and working on those first will have the most significant impact on performance. I have been trying to help you refocus so that your optimization efforts will be successful. And I think others have been hoping to do the same.

I appreciate the kind words, and want to thank you for being understanding. I personally hope that my tone in comments did not appear to be apathetic. I want every Rustacean to be successful.


I understand that you're trying to help me truly optimize a program, and I really appreciate the time you've put in.

After careful thought, I realize that I might have described my question in a way that misrepresented my true intent. My original thought was: can I use certain statements safely and without burden to achieve the best performance, so as to avoid major refactoring later on? In fact, I've already refactored twice. For example, one reason was that using HashMap extensively caused bad memory usage and performance in the completed parts, even after modifying the hashing algorithm, so I switched to a Vec and used the index directly as the ID. In Python, array and dict are nearly indistinguishable in performance.

I just want to find the optimal usage patterns to avoid heavy refactoring later. When problems arise, I want to understand their causes and how to optimize them. When I removed black_box as suggested, I observed the same performance across approaches, which aligns with the theoretical results. At that point I was able to identify that the issue was caused by black_box, and I could confidently use tuples and structs, rather than actually testing how fast tuples are.

I also understand that the skills you mentioned can help me understand the problem, but they feel too distant. I would spend years learning them, but the current issue is that any misuse in the project will lead to refactoring later, and there isn't enough time for that. Right now I am still struggling with borrowing, ownership, and unsafe, which are almost overwhelming me.

I'm sorry if my question ended up being misleading. And thank you for your help.


Ah yes. Python does not have arrays. People use lists as arrays. I suspect that under the hood a Python list and dictionary are pretty much the same thing.

I suspect it would help people here to help you if you said more about the data in those 200,000 element arrays you have to deal with and the operations you have to do on it. Likely that is where your time is going.

If development time is short it's likely better to start there rather than wasting time on micro-benchmarking this and that.

No, Python lists are the equivalent of Vec, i.e. they are dynamically resizeable arrays. The reason the performance difference isn't noticeable in Python is because all Python operations on variables have large interpreter overhead that masks it (a hash map lookup to find a variable).

That is not my understanding of how Python lists work. The distinguishing feature of an array or Vec is that it consists of multiple elements, all the same size and type, stored consecutively in memory and accessed via an integer index. Which facilitates fast access.

Python lists on the other hand only store references to objects in the elements, which presumably are allocated somewhere at random, the objects need not be of the same type. Not really an array and not exactly designed for speed.

For sure the overhead of interpretation does not help.

Well it's an array of dynamically typed references -- not the same representation as a dict, which is a hash table of dynamically typed references.
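In Rust terms, a Python list is closer to a Vec of boxed values than to a flat Vec<T>. A small sketch of the two layouts (illustrative only, not a benchmark):

```rust
fn main() {
    // A flat Vec<u64>: the values themselves are stored contiguously,
    // so iteration walks a single cache-friendly allocation.
    let flat: Vec<u64> = (0..4).collect();

    // A Vec<Box<u64>>: the Vec stores pointers, and every value lives in
    // its own heap allocation -- roughly like a Python list of objects.
    let boxed: Vec<Box<u64>> = (0..4).map(Box::new).collect();

    let flat_sum: u64 = flat.iter().sum();
    let boxed_sum: u64 = boxed.iter().map(|b| **b).sum();

    // Same values either way; only the memory layout (and the pointer
    // chasing needed to reach each value) differs.
    assert_eq!(flat_sum, boxed_sum);
    println!("sum = {flat_sum}");
}
```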

True. To be honest I did not know that until I looked it up after making my initial statement about dictionaries and lists.

Hi, jqxyz.

The goal is to perform multiple operations on a list of 200,000 elements within a few seconds, with a total computation volume reaching tens of billions of operations.

What do the operations mean? How complicated are they?

for element in data

That means data.len() == 200,000, right?

And where is the real performance bottleneck: in sub_task or somewhere else?

  • If sub_task is small, it will be inlined into the for loop, so there is no need to worry about the cost of returning a tuple.

  • If sub_task is not inlined, the tuple may just be a stack variable.

  • More details here: Compiler Explorer
    I added a noinline attribute to watch the behavior.

I think the cost of returning a tuple is a low priority for performance.
If you don't know how to start, perf would help.

The project involves processing a list of about 200,000 elements, performing approximately 10,000 tasks.

Each task consists of two steps: filtering and processing.

Filtering:

The filtering conditions are straightforward:

  • element.attr1 > 10
  • element.attr2 == true
  • element.attr3 == elements[n].attr_x

There are usually 10-20 conditions per task.

Processing:

The processing steps are also simple:

  • element.attr4 = 11
  • element.attr5 = rng.gen_range(0..param)
    The number of steps is uncertain but generally less than 20.

Requirements:

  • The ~10,000 tasks cannot be hardcoded; they must be generated dynamically from a configuration file.
  • The tasks should complete within 5 seconds on typical devices.
  • The number of elements in the list is currently 200,000 but could increase to 1,000,000.
  • Task performance is the most critical part of the project.
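Since the tasks must be generated dynamically from a configuration file, one way to represent the filter conditions is as plain data that gets interpreted per element. This is only a sketch under my own assumptions about your data model: the `Element` struct and `Condition` enum below are made up for illustration, not taken from your project.

```rust
// Hypothetical element type; the field names are invented for this sketch.
struct Element {
    attr1: i32,
    attr2: bool,
}

// Conditions parsed from the configuration file become plain data,
// which avoids hardcoding the ~10,000 tasks.
enum Condition {
    Attr1GreaterThan(i32), // element.attr1 > n
    Attr2Equals(bool),     // element.attr2 == b
}

impl Condition {
    fn matches(&self, e: &Element) -> bool {
        match self {
            Condition::Attr1GreaterThan(n) => e.attr1 > *n,
            Condition::Attr2Equals(b) => e.attr2 == *b,
        }
    }
}

fn main() {
    let elements = vec![
        Element { attr1: 15, attr2: true },
        Element { attr1: 5, attr2: true },
    ];
    // One task's filter: all conditions must hold (10-20 in practice).
    let filter = vec![
        Condition::Attr1GreaterThan(10),
        Condition::Attr2Equals(true),
    ];
    let matched = elements
        .iter()
        .filter(|e| filter.iter().all(|c| c.matches(e)))
        .count();
    assert_eq!(matched, 1); // only the first element passes both conditions
    println!("{matched} element(s) matched");
}
```

Matching on an enum like this keeps the per-element check a plain branch, rather than a dynamic call through something like `Box<dyn Fn>`, which can matter in a loop that runs tens of billions of times.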

Current Performance:

In the test, processing 40 billion operations already exceeds 3 seconds on my device, and on client devices, it can even take more than 10 seconds.

This is my initial test code. In fact, it doesn't meet the criteria of a so-called micro-benchmark, as my code is almost built step by step on top of this. My real code will definitely be slower than this.

use rand::Rng;
use std::time::Instant;

struct Data {
    value: i32,
}

fn main() {
    let list_size = 200_000;
    let iterations = 200_000;

    // Generate a list of 200,000 structs with random `value`
    let mut rng = rand::thread_rng();
    let data_list: Vec<Data> = (0..list_size)
        .map(|_| Data {
            value: rng.gen_range(0..10000),
        })
        .collect();

    let mut total_sum = 0; // To store the sum of all valid `i`
    let start_time = Instant::now();

    // Process each element
    for data in &data_list {
        for i in 9990..iterations {
            if data.value > i {
                total_sum += i;
            }
        }
    }

    let duration = start_time.elapsed();
    println!("Time taken to process: {:.2?}", duration);
    println!("Total sum of valid i: {}", total_sum);
}

Because of the previous use of HashMap, the single execution time of a task exceeded the threshold by a lot. I had to start carefully choosing any syntax/function to keep the performance almost unaffected.

I think what some here are trying to state regarding the term micro-benchmark is that

benchmark(JOB_A) + benchmark(JOB_B) == benchmark(JOB_A + JOB_B)

is not guaranteed and, in their experience, mostly does not hold in practice.

I understand what you're saying.

What I mean is, this code is not a micro-benchmark; it's the demo for my initial project. I'm worried you might think it is one.

What I'm trying to say is that this is already in its minimal state, yet it still takes 3 seconds. Knowing the performance of each function/method at this point is meaningful. If I wait until the project is complete to optimize, the code will already be 10,000 lines long (progress is at 1/3), and it would be extremely difficult to modify everything in bulk.

And thanks to parasyte's help, I'm trying to learn to use criterion.

Translation tools always manage to translate the words, but they rarely express what I want to convey properly. It feels terrible.

I don't know if my words are bad after the translator translated them. But I want to say that I'm grateful to those who helped me.


For the given example you've provided, some noticeable performance gains can be achieved just by changing compiler flags. On my machine, allowing native CPU features decreases the elapsed time on the example from 3 seconds to 1.7 seconds. This may not be an option if you don't know what hardware your users have.

# .cargo/config.toml 
[target.x86_64-unknown-linux-gnu]
rustflags = ["-Ctarget-cpu=native"]

The next thing that can improve speed is doing things in parallel. rayon is a great crate for this. How you parallelize things depends greatly on the specific logic you want to parallelize. I recommend watching this talk as a good introduction to using the crate.

I've done this on your code example and have gone down from 3 seconds to ~500ms on my machine with some minor tweaks to the code to make it work with the library.

use std::time::Instant;

use rand::Rng;
use rayon::prelude::*;

struct Data {
    value: i32,
}

fn main() {
    let list_size = 200_000;
    let iterations = 200_000;

    // Generate a list of 200,000 structs with random `value`
    let mut rng = rand::thread_rng();
    let data_list: Vec<Data> = (0..list_size)
        .map(|_| Data {
            value: rng.gen_range(0..10000),
        })
        .collect();

    let start_time = Instant::now();


    let total_sum: i32 = data_list
        .par_iter()  // changing for loop to par_iter
        .map(|data| {
            let mut sub_sum: i32 = 0;
            for i in 9990..iterations {
                if data.value > i {
                    // I changed this to wrapping_add to prevent the
                    // possibility of the compiler inserting a panic point.
                    // It won't help much, but could have a positive performance impact
                    sub_sum = sub_sum.wrapping_add(i);
                }
            }
            sub_sum // return intermediate sum of the task
        })
        .sum(); // the parallel sum done by rayon will be more efficient

    let duration = start_time.elapsed();
    println!("Time taken to process: {:.2?}", duration);
    println!("Total sum of valid i: {}", total_sum);
}
cargo run --release
    Finished `release` profile [optimized] target(s) in 0.14s
     Running `target/release/performance-tuple-return-2`
Time taken to process: 428.91ms
Total sum of valid i: 9353115

This improvement is stacked with the native cpu flags, without the native cpu flags it's around 1 second.

Note: I'm assuming that the inner for loop (for i in 9990..iterations) is simulating slow tasks. If it isn't, you could also modify it to improve the way you're doing the computation even further.

Of course now that we are using multiple cores the speed you see will depend on the number of cores available.


Thank you for your help.

I am aware of the rustflags = ["-Ctarget-cpu=native"], but the device is unknown.

I have tried using Rayon, but since individual tasks modify elements, the 10,000 tasks are forced to run sequentially. Only within a single task, where there are 10-20 filtering conditions, can multithreading be applied. Due to thread overhead, only a small portion of the devices I tested showed performance improvements; the others did not.

I modified the demo to be closer to the current state.

use rand::Rng;
use std::time::Instant;

struct Data {
    value: i32,
}

fn main() {
    let list_size = 200_000;
    let iterations = 10000;

    // Generate a list of 200,000 structs with random `value`
    let mut rng = rand::thread_rng();
    let data_list: Vec<Data> = (0..list_size)
        .map(|_| Data {
            value: rng.gen_range(0..1000),
        })
        .collect();

    let mut total_sum: i32 = 0; // To store the sum of all valid `i`
    let start_time = Instant::now();

    // 10000 tasks
    for i in 0..iterations {
        // Filter
        for data in &data_list {
            // Simulate condition
            for j in 0..20 {
                if data.value < i + j {
                    total_sum = total_sum.wrapping_add(i + j);
                }
            }
        }
        // Process the filtered elements here (omitted in this demo).
    }
    let duration = start_time.elapsed();
    println!("Time taken to process: {:.2?}", duration);
    println!("Total sum of valid i: {}", total_sum);
}

The conclusion does not follow from the statement. You can often use fine-grained locking or algorithms that allow merging partial results. Yes, sometimes algorithms are completely serial, but addition, for example, is not: it's commutative and associative, and can thus be parallelized fairly easily.

It's not clear from your reduced example what a real application would do and which approach would be appropriate.

I am aware of the rustflags = ["-Ctarget-cpu=native"], but the device is unknown.

Well, if performance is paramount you can specify minimum hardware requirements for your users. For x86 there are broad microarchitecture levels (feature profiles) that are forwards-compatible. The default is the v1 baseline, which is quite ancient.
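For example, a sketch of raising the baseline to the second microarchitecture level (assuming an x86_64 Linux target; `x86-64-v2` is the rustc name for that level, and you'd need to confirm your users' hardware supports it):

```toml
# .cargo/config.toml -- a forwards-compatible baseline instead of `native`
[target.x86_64-unknown-linux-gnu]
rustflags = ["-Ctarget-cpu=x86-64-v2"]
```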

The point I alluded to earlier about needing to know context about the inner loop is quite important.

For example this logic:

            for j in 0..20 {
                if data.value < i + j {
                    total_sum = total_sum.wrapping_add(i + j);
                }
            }

could be greatly improved, but I suspect that the code you have looks nothing like this.

If it happens to be similar, you could convert repeated additions of the form of summing elements from n..m into a closed formula with a couple of additions and a multiplication.
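To illustrate: if the inner loop really is adding up every i in a range that falls below some threshold, the whole loop collapses to an arithmetic-series formula. A minimal sketch under that assumption (the numbers are taken from the earlier demo, not from the real project):

```rust
// Sum of the half-open range n..m, i.e. n + (n+1) + ... + (m-1),
// via the closed formula (n + m - 1) * (m - n) / 2.
fn range_sum(n: i64, m: i64) -> i64 {
    if m <= n {
        return 0;
    }
    (n + m - 1) * (m - n) / 2
}

fn main() {
    let (lo, hi, value) = (9990i64, 200_000i64, 9997i64);

    // Loop version: add every i in lo..hi that is below `value`.
    let mut looped = 0i64;
    for i in lo..hi {
        if value > i {
            looped += i;
        }
    }

    // Closed-form version: the qualifying i are exactly lo..min(value, hi),
    // so one multiplication replaces up to ~190,000 iterations.
    let direct = range_sum(lo, value.min(hi));

    assert_eq!(looped, direct);
    println!("{direct}");
}
```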

Also note that needing to modify the elements doesn't rule out parallelism by itself: there is par_iter_mut. Did you try making use of that function?