~300x more efficient flattening of Vec<Result<Vec<T>, E>> in the worst case

Story

I just created an account to reply to Elegantly flatten Vec<Result<Vec<T>, E>> into Result<Vec<T>, E>, but then I found that the thread was closed, so I'm posting here instead :smiley:

The two methods that were suggested are the following:

fn flatten_fold<T, E>(outer: Vec<Result<Vec<T>, E>>) -> Result<Vec<T>, E> {
    outer.into_iter().try_fold(vec![], |mut unrolled, result| {
        unrolled.extend(result?);
        Ok(unrolled)
    })
}

fn flatten_itertools<T, E>(outer: Vec<Result<Vec<T>, E>>) -> Result<Vec<T>, E> {
    itertools::process_results(outer, |i| i.flatten().collect())
}

I didn't dive deep into how itertools::process_results works and only include it in the benchmark results.

But the problems of flatten_fold are the following:

  1. No pre-allocation, so the accumulator reallocates repeatedly as it grows
  2. It fails too late: if the first error sits near the end of the outer vector, everything before it is moved into the accumulator and then thrown away
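Problem 2 is easy to see directly. In this sketch (using the flatten_fold from above), all 100,000 elements are moved into the accumulator before the fold ever reaches the error:

```rust
fn flatten_fold<T, E>(outer: Vec<Result<Vec<T>, E>>) -> Result<Vec<T>, E> {
    outer.into_iter().try_fold(vec![], |mut unrolled, result| {
        unrolled.extend(result?);
        Ok(unrolled)
    })
}

fn main() {
    // 1000 Ok vectors of 100 elements each, then one Err at the end:
    // try_fold moves all 100_000 elements into the accumulator
    // (reallocating it several times along the way) before it ever
    // sees the error, and then throws all of that work away.
    let mut outer: Vec<Result<Vec<u64>, ()>> = vec![Ok(vec![0; 100]); 1000];
    outer.push(Err(()));
    assert_eq!(flatten_fold(outer), Err(()));
}
```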

Therefore, I pulled out my magic box unsafe and wrote a better function that is ~300x faster in the worst case, i.e. when the last element of the outer vector is an error:

use std::hint::unreachable_unchecked;
use std::ptr;

fn flatten_unsafe<T, E>(outer: Vec<Result<Vec<T>, E>>) -> Result<Vec<T>, E> {
    let mut len = 0;
    let mut err = false;

    for result in &outer {
        match result {
            Ok(inner) => len += inner.len(),
            Err(_) => {
                err = true;
                break;
            }
        }
    }

    if err {
        for result in outer {
            result?;
        }

        // Safety: Can't be reached since we found an error in the vector `outer`.
        unsafe { unreachable_unchecked() };
    }

    let mut flat_v = Vec::with_capacity(len);

    let mut ptr = flat_v.as_mut_ptr();
    for result in outer {
        // Safety: If any element of `outer` is an error, we would have found it above.
        for value in unsafe { result.unwrap_unchecked() } {
            // Safety: we reserved capacity for `len` elements above and write at most that many.
            unsafe { ptr::write(ptr, value) };
            ptr = unsafe { ptr.add(1) };
        }
    }

    // Safety: exactly `len` elements were initialized above.
    unsafe { flat_v.set_len(len) };

    Ok(flat_v)
}
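For anyone who wants to try it, here is a quick standalone sanity check of both paths (the function body is copied from above so the snippet compiles on its own):

```rust
use std::hint::unreachable_unchecked;
use std::ptr;

fn flatten_unsafe<T, E>(outer: Vec<Result<Vec<T>, E>>) -> Result<Vec<T>, E> {
    let mut len = 0;
    let mut err = false;

    for result in &outer {
        match result {
            Ok(inner) => len += inner.len(),
            Err(_) => {
                err = true;
                break;
            }
        }
    }

    if err {
        // Return the first error by value.
        for result in outer {
            result?;
        }
        // Safety: the loop above returned on the first error.
        unsafe { unreachable_unchecked() };
    }

    let mut flat_v = Vec::with_capacity(len);
    let mut p = flat_v.as_mut_ptr();
    for result in outer {
        // Safety: any Err would have been found in the first loop.
        for value in unsafe { result.unwrap_unchecked() } {
            // Safety: capacity for `len` elements was reserved above.
            unsafe { ptr::write(p, value) };
            p = unsafe { p.add(1) };
        }
    }
    // Safety: exactly `len` elements were written.
    unsafe { flat_v.set_len(len) };

    Ok(flat_v)
}

fn main() {
    // Happy path: everything is flattened in order.
    assert_eq!(
        flatten_unsafe::<u64, ()>(vec![Ok(vec![1, 2]), Ok(vec![3])]),
        Ok(vec![1, 2, 3])
    );
    // Error path: the error is returned without allocating the output.
    assert_eq!(flatten_unsafe(vec![Ok(vec![1u64]), Err("late error")]), Err("late error"));
}
```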

BTW: Do you know of a better way to return the error in the first loop if it doesn't implement Clone?

Benchmarks

I use divan for the benchmarks.

Let's start with the mentioned worst case:

use divan::{bench, Bencher};
use flatten::{flatten_fold, flatten_itertools, flatten_unsafe};
use std::hint::black_box;

const LENS: &[usize] = &[4, 16, 64, 246, 1024, 4096];

fn main() {
    divan::main()
}

fn bench_v(len: usize) -> Vec<Result<Vec<u64>, ()>> {
    let mut v = vec![Ok(vec![0; 1024]); len];
    v[len - 1] = Err(());

    v
}

#[bench(consts = LENS)]
fn bench_flatten_fold<const N: usize>(bencher: Bencher) {
    bencher
        .with_inputs(|| bench_v(N))
        .bench_values(|v| black_box(flatten_fold(black_box(v))))
}

#[bench(consts = LENS)]
fn bench_flatten_itertools<const N: usize>(bencher: Bencher) {
    bencher
        .with_inputs(|| bench_v(N))
        .bench_values(|v| black_box(flatten_itertools(black_box(v))))
}

#[bench(consts = LENS)]
fn bench_flatten_unsafe<const N: usize>(bencher: Bencher) {
    bencher
        .with_inputs(|| bench_v(N))
        .bench_values(|v| black_box(flatten_unsafe(black_box(v))))
}

Results:

bench                       fastest       β”‚ slowest       β”‚ median        β”‚ mean          β”‚ samples β”‚ iters
β”œβ”€ bench_flatten_fold                     β”‚               β”‚               β”‚               β”‚         β”‚
β”‚  β”œβ”€ 4                     2.604 Β΅s      β”‚ 13.27 Β΅s      β”‚ 2.665 Β΅s      β”‚ 2.833 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 16                    52.08 Β΅s      β”‚ 123.7 Β΅s      β”‚ 56.5 Β΅s       β”‚ 58.46 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 64                    14.14 Β΅s      β”‚ 331.8 Β΅s      β”‚ 19.82 Β΅s      β”‚ 24.43 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 246                   858.1 Β΅s      β”‚ 1.34 ms       β”‚ 866.5 Β΅s      β”‚ 879 Β΅s        β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 1024                  4.103 ms      β”‚ 6.132 ms      β”‚ 4.189 ms      β”‚ 4.248 ms      β”‚ 100     β”‚ 100
β”‚  ╰─ 4096                  21.67 ms      β”‚ 23.48 ms      β”‚ 21.86 ms      β”‚ 21.92 ms      β”‚ 100     β”‚ 100
β”œβ”€ bench_flatten_itertools                β”‚               β”‚               β”‚               β”‚         β”‚
β”‚  β”œβ”€ 4                     9.497 Β΅s      β”‚ 17.16 Β΅s      β”‚ 9.517 Β΅s      β”‚ 9.954 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 16                    47.77 Β΅s      β”‚ 93.53 Β΅s      β”‚ 48.28 Β΅s      β”‚ 50.79 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 64                    201.3 Β΅s      β”‚ 408.3 Β΅s      β”‚ 203.8 Β΅s      β”‚ 208.7 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 246                   746 Β΅s        β”‚ 1.573 ms      β”‚ 754.4 Β΅s      β”‚ 776.1 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 1024                  6.674 ms      β”‚ 7.772 ms      β”‚ 6.704 ms      β”‚ 6.778 ms      β”‚ 100     β”‚ 100
β”‚  ╰─ 4096                  31.29 ms      β”‚ 33.21 ms      β”‚ 31.52 ms      β”‚ 31.69 ms      β”‚ 100     β”‚ 100
╰─ bench_flatten_unsafe                   β”‚               β”‚               β”‚               β”‚         β”‚
   β”œβ”€ 4                     63 ns         β”‚ 145.9 ns      β”‚ 66.13 ns      β”‚ 67.1 ns       β”‚ 100     β”‚ 3200
   β”œβ”€ 16                    212.7 ns      β”‚ 683.5 ns      β”‚ 218.8 ns      β”‚ 254.5 ns      β”‚ 100     β”‚ 800
   β”œβ”€ 64                    833.7 ns      β”‚ 2.349 Β΅s      β”‚ 856.3 ns      β”‚ 983.3 ns      β”‚ 100     β”‚ 400
   β”œβ”€ 246                   3.034 Β΅s      β”‚ 6.201 Β΅s      β”‚ 3.135 Β΅s      β”‚ 3.257 Β΅s      β”‚ 100     β”‚ 100
   β”œβ”€ 1024                  12.29 Β΅s      β”‚ 19.69 Β΅s      β”‚ 12.79 Β΅s      β”‚ 13.35 Β΅s      β”‚ 100     β”‚ 100
   ╰─ 4096                  1.988 ms      β”‚ 3.045 ms      β”‚ 2.027 ms      β”‚ 2.052 ms      β”‚ 100     β”‚ 100

Looking for the clickbait number? Compare the mean values for N=1024: 4248 Β΅s / 13.35 Β΅s β‰ˆ 318x.

I know, this is obviously the worst case. How about the best case? (no errors)

Well, just comment out the line that sets the last element to Err(()) in the bench_v function:

bench                       fastest       β”‚ slowest       β”‚ median        β”‚ mean          β”‚ samples β”‚ iters
β”œβ”€ bench_flatten_fold                     β”‚               β”‚               β”‚               β”‚         β”‚
β”‚  β”œβ”€ 4                     2.644 Β΅s      β”‚ 14.16 Β΅s      β”‚ 2.704 Β΅s      β”‚ 2.968 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 16                    43.48 Β΅s      β”‚ 101.3 Β΅s      β”‚ 48.13 Β΅s      β”‚ 48.81 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 64                    15.04 Β΅s      β”‚ 339.5 Β΅s      β”‚ 21.68 Β΅s      β”‚ 27.39 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 246                   702.1 Β΅s      β”‚ 1.321 ms      β”‚ 714.7 Β΅s      β”‚ 735.7 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 1024                  3.321 ms      β”‚ 6.055 ms      β”‚ 3.372 ms      β”‚ 3.44 ms       β”‚ 100     β”‚ 100
β”‚  ╰─ 4096                  20.09 ms      β”‚ 21.88 ms      β”‚ 20.25 ms      β”‚ 20.37 ms      β”‚ 100     β”‚ 100
β”œβ”€ bench_flatten_itertools                β”‚               β”‚               β”‚               β”‚         β”‚
β”‚  β”œβ”€ 4                     11.62 Β΅s      β”‚ 12.28 Β΅s      β”‚ 12.19 Β΅s      β”‚ 11.95 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 16                    48.55 Β΅s      β”‚ 63.11 Β΅s      β”‚ 48.88 Β΅s      β”‚ 49.28 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 64                    195.7 Β΅s      β”‚ 410 Β΅s        β”‚ 197.2 Β΅s      β”‚ 203.7 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 246                   733.1 Β΅s      β”‚ 1.52 ms       β”‚ 738.4 Β΅s      β”‚ 751 Β΅s        β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 1024                  5.764 ms      β”‚ 6.672 ms      β”‚ 5.794 ms      β”‚ 5.842 ms      β”‚ 100     β”‚ 100
β”‚  ╰─ 4096                  29.5 ms       β”‚ 31.7 ms       β”‚ 29.64 ms      β”‚ 29.76 ms      β”‚ 100     β”‚ 100
╰─ bench_flatten_unsafe                   β”‚               β”‚               β”‚               β”‚         β”‚
   β”œβ”€ 4                     615.7 ns      β”‚ 1.863 Β΅s      β”‚ 638.5 ns      β”‚ 655.9 ns      β”‚ 100     β”‚ 200
   β”œβ”€ 16                    2.504 Β΅s      β”‚ 18.14 Β΅s      β”‚ 2.534 Β΅s      β”‚ 3.191 Β΅s      β”‚ 100     β”‚ 100
   β”œβ”€ 64                    10.83 Β΅s      β”‚ 267.2 Β΅s      β”‚ 11.29 Β΅s      β”‚ 15.07 Β΅s      β”‚ 100     β”‚ 100
   β”œβ”€ 246                   40.17 Β΅s      β”‚ 848.5 Β΅s      β”‚ 42.15 Β΅s      β”‚ 61.47 Β΅s      β”‚ 100     β”‚ 100
   β”œβ”€ 1024                  3.133 ms      β”‚ 3.825 ms      β”‚ 3.158 ms      β”‚ 3.208 ms      β”‚ 100     β”‚ 100
   ╰─ 4096                  15.59 ms      β”‚ 17.26 ms      β”‚ 15.78 ms      β”‚ 15.87 ms      β”‚ 100     β”‚ 100

The implementation with unsafe is always faster (up to 15x)!

Actions?

Should we implement something similar in the std or itertools?

Especially because itertools is slower in every benchmark…

Pinging some people from the last thread in case they are interested: @eee @cuviper


Two clear problems with this one:

  1. It should use with_capacity for a fair comparison with the alternative below
  2. Since you don't need ownership of the accumulator, you can ΞΌoptimize it by using try_for_each instead. (This helps because LLVM doesn't always manage to realize that wrapping the Vec up into a Result and pulling it out again isn't actually doing anything, and by just not doing that, it's more obvious what's happening in the loop.)

So for safe versions, try something like

fn flatten_fold_preallocate<T, E>(outer: Vec<Result<Vec<T>, E>>) -> Result<Vec<T>, E> {
    use std::ops::ControlFlow::*;

    let (Continue(cap) | Break(cap)) = outer.iter().try_fold(0, |s, r| match r {
        Ok(v) => Continue(s + v.len()),
        Err(_) => Break(s),
    });

    let mut unrolled = Vec::with_capacity(cap);
    outer.into_iter().try_for_each(|result| {
        unrolled.extend(result?);
        Ok(())
    })?;
    Ok(unrolled)
}

which is optimized for assuming that there aren't going to be any errors.
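For anyone unfamiliar with the `let (Continue(cap) | Break(cap)) = …` line: try_fold with ControlFlow returns whichever variant ended the fold, and since ControlFlow only has those two variants, the or-pattern is irrefutable and extracts the accumulated value either way. A small illustration of just that pattern (the data here is made up):

```rust
use std::ops::ControlFlow::*;

fn main() {
    let lens: Vec<Result<usize, ()>> = vec![Ok(3), Ok(4), Err(()), Ok(100)];
    // try_fold stops at the first Break; the or-pattern extracts the
    // accumulated sum from whichever variant ended the fold.
    let (Continue(cap) | Break(cap)) = lens.iter().try_fold(0usize, |s, r| match r {
        Ok(n) => Continue(s + n),
        Err(_) => Break(s),
    });
    // The trailing Ok(100) after the Err is never counted.
    assert_eq!(cap, 7);
}
```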

Or if the "don't allocate and copy if there's an error" behavior is worth keeping, I think a safe version would look something like this:

fn flatten_fold_precheck<T, E>(mut outer: Vec<Result<Vec<T>, E>>) -> Result<Vec<T>, E> {
    let r = outer.iter().enumerate().try_fold(0, |s, (i, r)| match r {
        Ok(v) => Ok(s + v.len()),
        Err(_) => Err(i),
    });
    let mut unrolled = match r {
        Err(i) => return outer.swap_remove(i).map(|_| unreachable!()),
        Ok(cap) => Vec::with_capacity(cap),
    };
    outer.into_iter().for_each(|result| {
        unrolled.extend(result.unwrap_or_else(|_| unreachable!()));
    });
    Ok(unrolled)
}

Up-levelling for a second: the reason itertools isn't going to help as much here is that if you have concrete types and can iterate multiple times, you can do better than if all you have is an iterator.

That's the difference between https://doc.rust-lang.org/nightly/std/primitive.slice.html#method.join and https://docs.rs/itertools/latest/itertools/trait.Itertools.html#method.join, for example.


On first glance, my only concern (in terms of correctness) is the (lack of) handling of overflows in

len += inner.len()

though maybe that can only happen for zero-sized T? (I’m not 100% sure if there’s a guarantee that the sum of all allocation sizes never exceeds usize::MAX.) For the important cases, like non-zero-sized T on 64bit architecture, this should be irrelevant though, so the benchmark shouldn’t suffer from any appropriately minimal fixes.
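A minimal sketch of an overflow-safe length pre-pass (the helper name is mine): saturating_add caps the sum at usize::MAX, so Vec::with_capacity would at worst fail to allocate rather than under-reserve.

```rust
// Hypothetical helper: compute the total length of all inner Ok
// vectors, or None if any element is an Err. The sum saturates
// instead of wrapping on overflow.
fn total_len<T, E>(outer: &[Result<Vec<T>, E>]) -> Option<usize> {
    let mut len: usize = 0;
    for result in outer {
        match result {
            Ok(inner) => len = len.saturating_add(inner.len()),
            Err(_) => return None,
        }
    }
    Some(len)
}

fn main() {
    let outer: Vec<Result<Vec<u8>, ()>> = vec![Ok(vec![1, 2]), Ok(vec![3])];
    assert_eq!(total_len(&outer), Some(3));
    assert_eq!(total_len::<u8, ()>(&[Err(())]), None);
}
```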

Also, the code structure around err confuses me: in principle the Err(_) branch could directly return the error, and I'd be somewhat surprised (but didn't test it) if that made performance worse compared to the err = true; break approach. Oh, it's about the Clone thing… got it! I didn't spot the remark.

Edit: If the enumerate doesn’t end up affecting performance, then

    for (i, result) in outer.iter().enumerate() {
        match result {
            Ok(inner) => len += inner.len(),
            Err(_) => {
                return outer.swap_remove(i);
            }
        }
    }

could be reasonable.


FWIW, a safe version of your code has nearly the same performance:

fn flatten_safe<T, E>(outer: Vec<Result<Vec<T>, E>>) -> Result<Vec<T>, E> {
    let mut len = 0;
    let mut err = false;

    for result in &outer {
        match result {
            Ok(inner) => len += inner.len(),
            Err(_) => {
                err = true;
                break;
            }
        }
    }

    if err {
        for result in outer {
            result?;
        }
        unreachable!();
    }

    let mut flat_v = Vec::with_capacity(len);
    for result in outer {
        flat_v.append(&mut result?);
    }
    Ok(flat_v)
}

@scottmcm @steffahn swap_remove is really smart! I forgot about it :sweat_smile:

@steffahn You are right, I should use saturating_add!

Here is the performance with the two new functions without errors:

bench                           fastest       β”‚ slowest       β”‚ median        β”‚ mean          β”‚ samples β”‚ iters
β”œβ”€ bench_flatten_fold_precheck                β”‚               β”‚               β”‚               β”‚         β”‚
β”‚  β”œβ”€ 4                         2.634 Β΅s      β”‚ 17.7 Β΅s       β”‚ 2.694 Β΅s      β”‚ 3.177 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 16                        45.15 Β΅s      β”‚ 70.85 Β΅s      β”‚ 47.73 Β΅s      β”‚ 48.82 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 64                        175.4 Β΅s      β”‚ 253.4 Β΅s      β”‚ 186.1 Β΅s      β”‚ 186.6 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 256                       692 Β΅s        β”‚ 1.048 ms      β”‚ 704.2 Β΅s      β”‚ 713.2 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 1024                      3.268 ms      β”‚ 4.966 ms      β”‚ 3.29 ms       β”‚ 3.349 ms      β”‚ 100     β”‚ 100
β”‚  ╰─ 4096                      15.74 ms      β”‚ 16.72 ms      β”‚ 15.89 ms      β”‚ 15.94 ms      β”‚ 100     β”‚ 100
β”œβ”€ bench_flatten_safe                         β”‚               β”‚               β”‚               β”‚         β”‚
β”‚  β”œβ”€ 4                         2.764 Β΅s      β”‚ 3.535 Β΅s      β”‚ 2.804 Β΅s      β”‚ 2.809 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 16                        4.738 Β΅s      β”‚ 21.2 Β΅s       β”‚ 4.809 Β΅s      β”‚ 5.052 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 64                        14.23 Β΅s      β”‚ 292 Β΅s        β”‚ 14.62 Β΅s      β”‚ 18.42 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 256                       48.92 Β΅s      β”‚ 1.046 ms      β”‚ 50.31 Β΅s      β”‚ 71.23 Β΅s      β”‚ 100     β”‚ 100
β”‚  β”œβ”€ 1024                      3.27 ms       β”‚ 4.089 ms      β”‚ 3.3 ms        β”‚ 3.344 ms      β”‚ 100     β”‚ 100
β”‚  ╰─ 4096                      15.82 ms      β”‚ 17.47 ms      β”‚ 15.94 ms      β”‚ 16.02 ms      β”‚ 100     β”‚ 100
╰─ bench_flatten_unsafe                       β”‚               β”‚               β”‚               β”‚         β”‚
   β”œβ”€ 4                         580.6 ns      β”‚ 1.928 Β΅s      β”‚ 590.6 ns      β”‚ 604.5 ns      β”‚ 100     β”‚ 200
   β”œβ”€ 16                        2.303 Β΅s      β”‚ 18.29 Β΅s      β”‚ 2.434 Β΅s      β”‚ 3.023 Β΅s      β”‚ 100     β”‚ 100
   β”œβ”€ 64                        10.73 Β΅s      β”‚ 215.5 Β΅s      β”‚ 11.21 Β΅s      β”‚ 14.77 Β΅s      β”‚ 100     β”‚ 100
   β”œβ”€ 256                       40.9 Β΅s       β”‚ 911.6 Β΅s      β”‚ 42.21 Β΅s      β”‚ 64.65 Β΅s      β”‚ 100     β”‚ 100
   β”œβ”€ 1024                      3.073 ms      β”‚ 3.729 ms      β”‚ 3.098 ms      β”‚ 3.149 ms      β”‚ 100     β”‚ 100
   ╰─ 4096                      15.23 ms      β”‚ 16.31 ms      β”‚ 15.38 ms      β”‚ 15.45 ms      β”‚ 100     β”‚ 100

The unsafe implementation is still faster than flatten_safe.

The performance of the worst case (only one error at the end) is almost the same for all three.

The weird thing, though, is that when I use copy_nonoverlapping (which append, used by flatten_safe, calls internally) I get the same performance as flatten_safe. Isn't it weird that copy_nonoverlapping is slower than the element-wise writes? Is something wrong here?

For other people interested in an efficient and safe implementation, here is the solution by @cuviper slightly modified to use swap_remove:

pub fn flatten_safe<T, E>(mut outer: Vec<Result<Vec<T>, E>>) -> Result<Vec<T>, E> {
    let mut len = 0;
    for (ind, result) in outer.iter().enumerate() {
        match result {
            Ok(inner) => len += inner.len(),
            Err(_) => {
                return outer.swap_remove(ind);
            }
        }
    }

    let mut flat_v = Vec::with_capacity(len);
    for result in outer {
        flat_v.append(&mut result?);
    }

    Ok(flat_v)
}

The difference between this and the unsafe implementation is not significant.
