Large enums: are they moved around during construction?

Clippy gave me a couple of warnings about iterator enums.

For example:

    --> src/collections/btree_set/mod.rs:2144:1
     |
2144 | /   enum DifferenceInner<'a, T: 'a, A: Tuning> {
2145 | | /     Stitch {
2146 | | |         // iterate all of `self` and some of `other`, spotting matches along the way
2147 | | |         self_iter: Iter<'a, T>,
2148 | | |         other_iter: Peekable<Iter<'a, T>>,
2149 | | |     },
     | | |_____- the largest variant contains at least 1760 bytes
2150 | | /     Search {
2151 | | |         // iterate `self`, look up in `other`
2152 | | |         self_iter: Iter<'a, T>,
2153 | | |         other_set: &'a BTreeSet<T, A>,
2154 | | |     },
     | | |_____- the second-largest variant contains at least 880 bytes
2155 | |       Iterate(Iter<'a, T>), // simply produce all elements in `self`
2156 | |   }
     | |___^ the entire enum is at least 1760 bytes
     |
     = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.93.0/index.html#large_enum_variant
help: consider boxing the large fields or introducing indirection in some other way to reduce the total size of the enum
     |
2148 -         other_iter: Peekable<Iter<'a, T>>,
2148 +

That got me wondering: does it matter how large these are? Do they (or their component structs) get moved around during construction of an iterator? I would hope not, but I haven't checked the code. Any idea what the compiler "normally" does?

The clippy warning just tells you that you might waste a lot of memory, since the size of an enum is always at least the size of its largest variant, similar to a union type in C. You can make a large field a Box<T>, so that the variant holds just a pointer to heap-allocated data.
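To see the layout effect concretely, here is a small sketch. The types and the 864-byte payload are invented to mimic the warning above, not the real pstd types:

```rust
use std::mem::size_of;

// Hypothetical stand-ins for the large iterator payloads; the exact
// sizes are made up, but the layout effect is the same.
struct BigIter([u8; 864]);
struct SmallRef(usize);

// Every variant occupies at least the size of the largest one
// (plus the discriminant), like a tagged union in C.
enum Unboxed {
    Stitch(BigIter, BigIter),
    Search(BigIter, SmallRef),
    Iterate(BigIter),
}

// Boxing the large fields shrinks each variant to pointer-sized data.
enum Boxed {
    Stitch(Box<BigIter>, Box<BigIter>),
    Search(Box<BigIter>, SmallRef),
    Iterate(Box<BigIter>),
}

fn main() {
    // The unboxed enum pays for its largest variant (2 * 864 bytes);
    // the boxed one fits in a few machine words.
    assert!(size_of::<Unboxed>() >= 2 * 864);
    assert!(size_of::<Boxed>() <= 3 * size_of::<usize>());
    println!("unboxed: {} bytes, boxed: {} bytes",
             size_of::<Unboxed>(), size_of::<Boxed>());
}
```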

Yes, I get that. What I was wondering is whether they get moved around in simple cases, or whether the compiler is smart enough to avoid "most" moves. I mean, if you simply do, say:

```
let m = BTreeSet::new();
let x = m.iter();
```

Will the result of m.iter() be moved into x? I guess not if the call is inlined, but what if it isn't? Would the caller pass the address where the result is to be stored, or would the result be pushed onto the stack and then have to be moved to x? Maybe I need to do some experiments to find out.
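One rough experiment along those lines: compare the address a large return value has inside the callee with the address of the caller's binding. Equal addresses suggest in-place construction. This only observes what the optimiser happened to do for one function, it is not a language guarantee:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Records the address the array had inside the callee.
static INNER_ADDR: AtomicUsize = AtomicUsize::new(0);

#[inline(never)]
fn make_big() -> [u64; 256] {
    let big = [7u64; 256];
    INNER_ADDR.store(big.as_ptr() as usize, Ordering::Relaxed);
    big
}

fn main() {
    let x = make_big();
    // With optimisations the addresses usually match: the caller passes
    // a destination pointer, much like C++ copy elision. In debug
    // builds an extra copy may well appear.
    let in_place = INNER_ADDR.load(Ordering::Relaxed) == x.as_ptr() as usize;
    println!("constructed in place: {}", in_place);
    assert_eq!(x.iter().sum::<u64>(), 7 * 256);
}
```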

LLVM is built to handle C++, which was designed to avoid such moves (called copy elision there).

But that doesn't mean copies are always eliminated (even in C++, the rules for so-called "guaranteed copy elision" are pretty non-trivial).


I just did some benchmarking, playing around with the fixed size of the stack I use, and it does seem to make a significant difference. That doesn't exactly prove that moving is going on, but it suggests it may well be happening.

At any rate, if I iterate over a BTreeSet with just one element, my implementation is 10 times slower than std (quite disappointing...), but for 1,000 elements it is only 1.2 times slower (still a little disappointing, but not too bad). This suggests it is the initialisation of the stacks that is taking the time. I presume MaybeUninit::uninit doesn't actually do anything, i.e. it leaves the memory truly uninitialised? Well, I think it does initialise it in debug mode... hmm, so maybe my assumption is wrong. The documentation does imply it is uninitialised.
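For reference, the kind of fixed-size stack over MaybeUninit storage being discussed might look something like this sketch (illustrative, not the actual pstd implementation; it also leaks unpopped elements, which a real version would handle in Drop):

```rust
use std::mem::MaybeUninit;

// A minimal fixed-capacity stack over uninitialised storage, the
// pattern a stack-of-iterators B-tree iterator typically uses.
struct FixedStack<T, const N: usize> {
    buf: [MaybeUninit<T>; N],
    len: usize,
}

impl<T, const N: usize> FixedStack<T, N> {
    fn new() -> Self {
        // Each slot starts as MaybeUninit::uninit(); no element is
        // claimed to be initialised, so this is safe code.
        Self { buf: std::array::from_fn(|_| MaybeUninit::uninit()), len: 0 }
    }

    fn push(&mut self, v: T) {
        assert!(self.len < N, "stack overflow");
        self.buf[self.len].write(v);
        self.len += 1;
    }

    fn pop(&mut self) -> Option<T> {
        if self.len == 0 {
            return None;
        }
        self.len -= 1;
        // Safety: every slot below `len` was initialised by `push`.
        Some(unsafe { self.buf[self.len].assume_init_read() })
    }
}

fn main() {
    let mut s: FixedStack<u32, 16> = FixedStack::new();
    s.push(1);
    s.push(2);
    assert_eq!(s.pop(), Some(2));
    assert_eq!(s.pop(), Some(1));
    assert_eq!(s.pop(), None);
}
```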

Well, for today I am a bit defeated. Whether iteration speed even matters for a BTreeVec, I am not sure.

After a bit more thought, I don't think iterator performance for small sets (or maps) is a big deal; it is of course a very fast operation in absolute terms (28 ns on my chromebook vs 2.8 ns for the std version).
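For anyone wanting to reproduce this kind of measurement, here is a crude sketch of such a micro-benchmark using the std BTreeSet (swap in another set type to compare; numbers this small are noisy, so a harness like criterion would be more trustworthy):

```rust
use std::collections::BTreeSet;
use std::time::Instant;

// Returns the rough cost per element, in nanoseconds, of iterating a
// set of `n` elements, averaged over `reps` passes.
fn bench(n: u64, reps: u32) -> f64 {
    let set: BTreeSet<u64> = (0..n).collect();
    let mut x = 0u64;
    let start = Instant::now();
    for _ in 0..reps {
        for e in &set {
            x = x.wrapping_add(*e);
        }
    }
    let total_ns = start.elapsed().as_nanos() as f64;
    // Keep the accumulator observable so the loop isn't optimised away.
    std::hint::black_box(x);
    total_ns / reps as f64 / n.max(1) as f64
}

fn main() {
    // Small sets are dominated by iterator setup cost, which is
    // exactly the effect discussed above.
    println!("1 element:     {:.1} ns/elem", bench(1, 1_000_000));
    println!("1000 elements: {:.2} ns/elem", bench(1000, 10_000));
}
```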

LLVM has passes to address it, which often work, but don't always.

See, for example, [MemCpyOpt] Allow memcpy elision for non-noalias arguments by nikic · Pull Request #107860 · llvm/llvm-project · GitHub


EDIT: For clarity, what this means is that you usually needn't worry about it, but sometimes it'll bite you in excessive stack usage.

Also, what is in those iterators that makes them so giant? Maybe don't have iterators that huge?

In my experience, every time a function returns a struct or enum it gets constructed in memory provided by the caller. This also seems to be true for nested initialisations, though it might have some limitations, I don't know.

When you post a block of text where indentation and whitespace are important, please use a multi-line code block like:

```
some lines
    of
    text with significant whitespace.
```

e.g. your warning messages look much nicer when using a multi-line code block:

warning: large size difference between variants
    --> src/collections/btree_set/mod.rs:2144:1
     |
2144 | /   enum DifferenceInner<'a, T: 'a, A: Tuning> {
2145 | | /     Stitch {
2146 | | |         // iterate all of `self` and some of `other`, spotting matches along the way
2147 | | |         self_iter: Iter<'a, T>,
2148 | | |         other_iter: Peekable<Iter<'a, T>>,
2149 | | |     },
     | | |_____- the largest variant contains at least 1760 bytes
2150 | | /     Search {
2151 | | |         // iterate `self`, look up in `other`
2152 | | |         self_iter: Iter<'a, T>,
2153 | | |         other_set: &'a BTreeSet<T, A>,
2154 | | |     },
     | | |_____- the second-largest variant contains at least 880 bytes
2155 | |       Iterate(Iter<'a, T>), // simply produce all elements in `self`
2156 | |   }
     | |___^ the entire enum is at least 1760 bytes
     |
     = help: for further information visit https://rust-lang.github.io/rust-clippy/rust-1.93.0/index.html#large_enum_variant
help: consider boxing the large fields or introducing indirection in some other way to reduce the total size of the enum
     |
2148 -         other_iter: Peekable<Iter<'a, T>>,
2148 +  

I have now had a look at the code for a test program, and I don't think large structs/enums are being moved around [edit: hmm, I changed my mind, see the reply].

I think I might be able to help the compiler avoid some calls in the inner loop by restructuring my code slightly, in particular the push_tree function, which has a simple case for a "small" (single-level) BTree: it could test for that inline and only make the call if the BTree is not single-level. A call is a relatively expensive operation. All that said, I think the performance is actually quite OK for practical purposes.
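The restructuring described might look something like this sketch (the names are illustrative, not the real pstd API): test the cheap single-level case inline and keep the general case out of line.

```rust
// Illustrative stand-in for a B-tree; `depth` is the number of levels.
struct Tree {
    depth: usize,
}

impl Tree {
    // The hot path: small enough that the compiler can inline it, so
    // the single-level case costs no call at all.
    #[inline]
    fn push_tree(&self, stack: &mut Vec<usize>) {
        if self.depth <= 1 {
            // Single-level tree: trivial iterator setup, no call needed.
            stack.push(0);
        } else {
            self.push_tree_cold(stack);
        }
    }

    // The cold path: a real call, kept out of the inlined code so it
    // doesn't bloat every call site.
    #[cold]
    #[inline(never)]
    fn push_tree_cold(&self, stack: &mut Vec<usize>) {
        for level in 0..self.depth {
            stack.push(level);
        }
    }
}

fn main() {
    let mut stack = Vec::new();
    Tree { depth: 1 }.push_tree(&mut stack);
    assert_eq!(stack.len(), 1);
    stack.clear();
    Tree { depth: 3 }.push_tree(&mut stack);
    assert_eq!(stack.len(), 3);
}
```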

My test program runs in about 2 seconds, which works out to about 20 ns per pass of the loop, which executes 100 million times:

```
use pstd::collections::BTreeSet;

fn main() {
    let set = BTreeSet::from([2, 4, 6]);
    let mut x = 1234;
    for _rep in 0..100_000_000 {
        for e in &set { x += e; }
    }
    println!("x={}", x);
}
```

The assembler for the main program:

.section .text.btest::main,"ax",@progbits
	.hidden	btest::main
	.globl	btest::main
	.p2align	4
.type	btest::main,@function
btest::main:
	.cfi_startproc
	.cfi_personality 155, DW.ref.rust_eh_personality
	.cfi_lsda 27, .Lexception3
	push rbp
	.cfi_def_cfa_offset 16
	push r15
	.cfi_def_cfa_offset 24
	push r14
	.cfi_def_cfa_offset 32
	push r13
	.cfi_def_cfa_offset 40
	push r12
	.cfi_def_cfa_offset 48
	push rbx
	.cfi_def_cfa_offset 56
	sub rsp, 1784
	.cfi_def_cfa_offset 1840
	.cfi_offset rbx, -56
	.cfi_offset r12, -48
	.cfi_offset r13, -40
	.cfi_offset r14, -32
	.cfi_offset r15, -24
	.cfi_offset rbp, -16
	mov qword ptr [rsp + 920], 0
	lea r12, [rsp + 928]
	mov qword ptr [rsp + 928], 1
	mov dword ptr [rsp + 936], 0
	lea r15, [rsp + 944]
	mov word ptr [rsp + 948], 0
	mov dword ptr [rsp + 944], 1048640
	mov byte ptr [rsp + 88], 1
	mov qword ptr [rsp + 48], 0
	mov qword ptr [rsp + 80], r15
	lea rdx, [rsp + 48]
	mov rdi, r12
	mov esi, 2
	call <pstd::collections::btree_map::Tree<i32, ()>>::insert::<pstd::collections::btree_map::CustomTuning>
	cmp dword ptr [rsp + 48], 1
	jne .LBB10_30
	mov ebx, dword ptr [rsp + 56]
	mov rbp, qword ptr [rsp + 64]
	mov r13, qword ptr [rsp + 72]
	call qword ptr [rip + __rustc::__rust_no_alloc_shim_is_unstable_v2@GOTPCREL]
	mov edi, 32
	mov esi, 8
	call qword ptr [rip + __rustc::__rust_alloc@GOTPCREL]
	test rax, rax
	je .LBB10_3
	mov r14, rax
	mov qword ptr [rax], 1
	mov dword ptr [rax + 8], 0
	mov qword ptr [rax + 16], 8
	mov dword ptr [rax + 24], 0
	mov esi, 1
	mov rdi, rax
	mov rdx, r15
	call <pstd::collections::btree_map::vecs::PairVec<i32, ()>>::set_alloc::<pstd::collections::btree_map::CustomTuning>
	movzx eax, word ptr [r14 + 8]
	cmp ax, word ptr [r14 + 10]
	jae .LBB10_13
	mov rdi, r14
	add rdi, 16
	mov rcx, qword ptr [r14]
	mov dword ptr [rcx + 4*rax], ebx
	inc eax
	mov word ptr [r14 + 8], ax
	mov esi, 2
	mov rdx, r15
	call <pstd::collections::btree_map::vecs::ShortVec<pstd::collections::btree_map::Tree<i32, ()>>>::set_alloc::<pstd::collections::btree_map::CustomTuning>
	mov rdi, r12
	mov r12, rbp
	mov rbp, r13
	mov rbx, qword ptr [rsp + 928]
	mov r13, qword ptr [rsp + 936]
	mov qword ptr [rsp + 928], 1
	mov dword ptr [rsp + 936], 0
	movzx eax, word ptr [r14 + 24]
	movzx edx, word ptr [r14 + 26]
	cmp ax, dx
	jae .LBB10_16
	mov rcx, qword ptr [r14 + 16]
	mov esi, eax
	shl esi, 4
	mov qword ptr [rcx + rsi], rbx
	mov qword ptr [rcx + rsi + 8], r13
	lea rsi, [rax + 1]
	mov word ptr [r14 + 24], si
	cmp si, dx
	jae .LBB10_21
	shl esi, 4
	mov qword ptr [rcx + rsi], r12
	mov qword ptr [rcx + rsi + 8], rbp
	add eax, 2
	mov word ptr [r14 + 24], ax
	mov qword ptr [rsp + 928], 0
	mov qword ptr [rsp + 936], r14
	mov r12, rdi
.LBB10_30:
	cmp byte ptr [rsp + 88], 0
	jne .LBB10_32
	inc qword ptr [rsp + 920]
.LBB10_32:
	mov byte ptr [rsp + 88], 1
	mov qword ptr [rsp + 48], 0
	mov qword ptr [rsp + 80], r15
	lea rdx, [rsp + 48]
	mov rdi, r12
	mov esi, 4
	call <pstd::collections::btree_map::Tree<i32, ()>>::insert::<pstd::collections::btree_map::CustomTuning>
	cmp dword ptr [rsp + 48], 1
	jne .LBB10_41
	mov ebx, dword ptr [rsp + 56]
	mov rbp, qword ptr [rsp + 64]
	mov r13, qword ptr [rsp + 72]
	call qword ptr [rip + __rustc::__rust_no_alloc_shim_is_unstable_v2@GOTPCREL]
	mov edi, 32
	mov esi, 8
	call qword ptr [rip + __rustc::__rust_alloc@GOTPCREL]
	test rax, rax
	je .LBB10_3
	mov r14, rax
	mov qword ptr [rax], 1
	mov dword ptr [rax + 8], 0
	mov qword ptr [rax + 16], 8
	mov dword ptr [rax + 24], 0
	mov esi, 1
	mov rdi, rax
	mov rdx, r15
	call <pstd::collections::btree_map::vecs::PairVec<i32, ()>>::set_alloc::<pstd::collections::btree_map::CustomTuning>
	movzx eax, word ptr [r14 + 8]
	cmp ax, word ptr [r14 + 10]
	jae .LBB10_13
	mov rdi, r14
	add rdi, 16
	mov rcx, qword ptr [r14]
	mov dword ptr [rcx + 4*rax], ebx
	inc eax
	mov word ptr [r14 + 8], ax
	mov esi, 2
	mov rdx, r15
	call <pstd::collections::btree_map::vecs::ShortVec<pstd::collections::btree_map::Tree<i32, ()>>>::set_alloc::<pstd::collections::btree_map::CustomTuning>
	mov rdi, r12
	mov r12, rbp
	mov rbp, r13
	mov rbx, qword ptr [rsp + 928]
	mov r13, qword ptr [rsp + 936]
	mov qword ptr [rsp + 928], 1
	mov dword ptr [rsp + 936], 0
	movzx eax, word ptr [r14 + 24]
	movzx edx, word ptr [r14 + 26]
	cmp ax, dx
	jae .LBB10_16
	mov rcx, qword ptr [r14 + 16]
	mov esi, eax
	shl esi, 4
	mov qword ptr [rcx + rsi], rbx
	mov qword ptr [rcx + rsi + 8], r13
	lea rsi, [rax + 1]
	mov word ptr [r14 + 24], si
	cmp si, dx
	jae .LBB10_21
	shl esi, 4
	mov qword ptr [rcx + rsi], r12
	mov qword ptr [rcx + rsi + 8], rbp
	add eax, 2
	mov word ptr [r14 + 24], ax
	mov qword ptr [rsp + 928], 0
	mov qword ptr [rsp + 936], r14
	mov r12, rdi
.LBB10_41:
	cmp byte ptr [rsp + 88], 0
	jne .LBB10_43
	inc qword ptr [rsp + 920]
.LBB10_43:
	mov byte ptr [rsp + 88], 1
	mov qword ptr [rsp + 48], 0
	mov qword ptr [rsp + 80], r15
	lea rdx, [rsp + 48]
	mov rdi, r12
	mov esi, 6
	call <pstd::collections::btree_map::Tree<i32, ()>>::insert::<pstd::collections::btree_map::CustomTuning>
	cmp dword ptr [rsp + 48], 1
	jne .LBB10_52
	mov ebx, dword ptr [rsp + 56]
	mov rbp, qword ptr [rsp + 64]
	mov r13, qword ptr [rsp + 72]
	call qword ptr [rip + __rustc::__rust_no_alloc_shim_is_unstable_v2@GOTPCREL]
	mov edi, 32
	mov esi, 8
	call qword ptr [rip + __rustc::__rust_alloc@GOTPCREL]
	test rax, rax
	je .LBB10_3
	mov r14, rax
	mov qword ptr [rax], 1
	mov dword ptr [rax + 8], 0
	mov qword ptr [rax + 16], 8
	mov dword ptr [rax + 24], 0
	mov esi, 1
	mov rdi, rax
	mov rdx, r15
	call <pstd::collections::btree_map::vecs::PairVec<i32, ()>>::set_alloc::<pstd::collections::btree_map::CustomTuning>
	movzx eax, word ptr [r14 + 8]
	cmp ax, word ptr [r14 + 10]
	jae .LBB10_13
	mov rdi, r14
	add rdi, 16
	mov rcx, qword ptr [r14]
	mov dword ptr [rcx + 4*rax], ebx
	inc eax
	mov word ptr [r14 + 8], ax
	mov esi, 2
	mov rdx, r15
	call <pstd::collections::btree_map::vecs::ShortVec<pstd::collections::btree_map::Tree<i32, ()>>>::set_alloc::<pstd::collections::btree_map::CustomTuning>
	mov r12, rbp
	mov rbp, r13
	mov rbx, qword ptr [rsp + 928]
	mov r13, qword ptr [rsp + 936]
	mov qword ptr [rsp + 928], 1
	mov dword ptr [rsp + 936], 0
	movzx eax, word ptr [r14 + 24]
	movzx edx, word ptr [r14 + 26]
	cmp ax, dx
	jae .LBB10_16
	mov rcx, qword ptr [r14 + 16]
	mov esi, eax
	shl esi, 4
	mov qword ptr [rcx + rsi], rbx
	mov qword ptr [rcx + rsi + 8], r13
	lea rsi, [rax + 1]
	mov word ptr [r14 + 24], si
	cmp si, dx
	jae .LBB10_21
	shl esi, 4
	mov qword ptr [rcx + rsi], r12
	mov qword ptr [rcx + rsi + 8], rbp
	add eax, 2
	mov word ptr [r14 + 24], ax
	mov qword ptr [rsp + 928], 0
	mov qword ptr [rsp + 936], r14
.LBB10_52:
	cmp byte ptr [rsp + 88], 0
	jne .LBB10_54
	inc qword ptr [rsp + 920]
.LBB10_54:
	movups xmm0, xmmword ptr [rsp + 920]
	movups xmm1, xmmword ptr [rsp + 936]
	movaps xmmword ptr [rsp + 16], xmm0
	movaps xmmword ptr [rsp + 32], xmm1
	mov dword ptr [rsp + 12], 1234
	mov r12, qword ptr [rsp + 16]
	test r12, r12
	je .LBB10_62
	xor ebp, ebp
	mov r13d, 1234
	lea r14, [rsp + 48]
	lea r15, [rsp + 920]
	mov rbx, qword ptr [rip + memcpy@GOTPCREL]
	jmp .LBB10_56
	.p2align	4
.LBB10_81:
	cmp ebp, 100000000
	je .LBB10_65
.LBB10_56:
	mov qword ptr [rsp + 496], 0
	mov qword ptr [rsp + 904], 0
	xorps xmm0, xmm0
	movaps xmmword ptr [rsp + 48], xmm0
	movaps xmmword ptr [rsp + 64], xmm0
	movaps xmmword ptr [rsp + 80], xmm0
	mov rdi, r14
	lea rsi, [rsp + 24]
	mov edx, 1
	call <pstd::collections::btree_map::Range<i32, ()>>::push_tree
	inc ebp
	mov edx, 864
	mov rdi, r15
	mov rsi, r14
	call rbx
	mov edx, 864
	mov rdi, r14
	mov rsi, r15
	call rbx
	mov rax, r12
	.p2align	4
.LBB10_58:
	dec rax
	mov qword ptr [rsp + 912], rax
	mov rcx, qword ptr [rsp + 56]
	cmp rcx, qword ptr [rsp + 64]
	je .LBB10_77
	mov rax, qword ptr [rsp + 48]
	movzx edx, word ptr [rax + 8]
	cmp rcx, rdx
	jae .LBB10_60
	mov rax, qword ptr [rax]
	lea rax, [rax + 4*rcx]
	inc rcx
	mov qword ptr [rsp + 56], rcx
.LBB10_79:
	test rax, rax
	je .LBB10_81
	add r13d, dword ptr [rax]
	mov dword ptr [rsp + 12], r13d
	mov rax, qword ptr [rsp + 912]
	test rax, rax
	jne .LBB10_58
	jmp .LBB10_81
.LBB10_77:
	mov rdi, r14
	call <pstd::collections::btree_map::Range<i32, ()>>::next_cold
	jmp .LBB10_79
.LBB10_62:
	mov ebx, 100000000
	lea r14, [rsp + 48]
	.p2align	4
.LBB10_63:
	mov qword ptr [rsp + 496], 0
	mov qword ptr [rsp + 904], 0
	xorps xmm0, xmm0
	movaps xmmword ptr [rsp + 48], xmm0
	movaps xmmword ptr [rsp + 64], xmm0
	movaps xmmword ptr [rsp + 80], xmm0
	mov rdi, r14
	lea rsi, [rsp + 24]
	mov edx, 1
	call <pstd::collections::btree_map::Range<i32, ()>>::push_tree
	dec ebx
	jne .LBB10_63
.LBB10_65:
	lea rax, [rsp + 12]
	mov qword ptr [rsp + 48], rax
	mov rax, qword ptr [rip + <i32 as core::fmt::Display>::fmt@GOTPCREL]
	mov qword ptr [rsp + 56], rax
	lea rdi, [rip + .Lanon.178299d487561156df92f126da52896f.31]
	lea rsi, [rsp + 48]
	call qword ptr [rip + std::io::stdio::_print@GOTPCREL]
	lea rsi, [rsp + 40]
	lea rdi, [rsp + 24]
	call <pstd::collections::btree_map::Tree<i32, ()>>::dealloc::<pstd::collections::btree_map::CustomTuning>
	cmp qword ptr [rsp + 24], 0
	jne .LBB10_69
	mov rdi, qword ptr [rsp + 32]
	mov esi, 32
	mov edx, 8
	call qword ptr [rip + __rustc::__rust_dealloc@GOTPCREL]
.LBB10_69:
	add rsp, 1784
	.cfi_def_cfa_offset 56
	pop rbx
	.cfi_def_cfa_offset 48
	pop r12
	.cfi_def_cfa_offset 40
	pop r13
	.cfi_def_cfa_offset 32
	pop r14
	.cfi_def_cfa_offset 24
	pop r15
	.cfi_def_cfa_offset 16
	pop rbp
	.cfi_def_cfa_offset 8
	ret
.LBB10_60:
	.cfi_def_cfa_offset 1840
	lea rdi, [rip + .Lanon.178299d487561156df92f126da52896f.26]
	lea rdx, [rip + .Lanon.178299d487561156df92f126da52896f.37]
	mov esi, 32
	call qword ptr [rip + core::panicking::panic@GOTPCREL]
	jmp .LBB10_4
.LBB10_3:
	mov edi, 8
	mov esi, 32
	call qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL]
	jmp .LBB10_4
.LBB10_13:
	lea rdi, [rip + .Lanon.178299d487561156df92f126da52896f.39]
	lea rdx, [rip + .Lanon.178299d487561156df92f126da52896f.40]
	mov esi, 39
	call qword ptr [rip + core::panicking::panic@GOTPCREL]
	jmp .LBB10_4
.LBB10_16:
	lea rdi, [rip + .Lanon.178299d487561156df92f126da52896f.39]
	lea rdx, [rip + .Lanon.178299d487561156df92f126da52896f.43]
	mov esi, 39
	call qword ptr [rip + core::panicking::panic@GOTPCREL]
	jmp .LBB10_4
.LBB10_21:
	lea rdi, [rip + .Lanon.178299d487561156df92f126da52896f.39]
	lea rdx, [rip + .Lanon.178299d487561156df92f126da52896f.43]
	mov esi, 39
	call qword ptr [rip + core::panicking::panic@GOTPCREL]
.LBB10_4:
	ud2
	jmp .LBB10_73
	mov r15, rax
	cmp qword ptr [rsp + 24], 0
	jne .LBB10_74
	mov rdi, qword ptr [rsp + 32]
	mov esi, 32
	mov edx, 8
	call qword ptr [rip + __rustc::__rust_dealloc@GOTPCREL]
	mov rdi, r15
	call _Unwind_Resume@PLT
	jmp .LBB10_25
	mov r15, rax
	cmp qword ptr [rsp + 48], 0
	je .LBB10_28
	cmp qword ptr [rsp + 64], 0
	jne .LBB10_28
	mov r14, qword ptr [rsp + 72]
	jmp .LBB10_8
	jmp .LBB10_73
	jmp .LBB10_73
	mov r15, rax
	test r12, r12
	jne .LBB10_8
	mov esi, 32
	mov edx, 8
	mov rdi, rbp
	call qword ptr [rip + __rustc::__rust_dealloc@GOTPCREL]
	jmp .LBB10_8
	mov r15, rax
	test rbx, rbx
	jne .LBB10_18
	mov esi, 32
	mov edx, 8
	mov rdi, r13
	call qword ptr [rip + __rustc::__rust_dealloc@GOTPCREL]
.LBB10_18:
	mov r13, rbp
	mov rbp, r12
	jmp .LBB10_26
.LBB10_25:
	mov r15, rax
.LBB10_26:
	mov esi, 32
	mov edx, 8
	mov rdi, r14
	call qword ptr [rip + __rustc::__rust_dealloc@GOTPCREL]
	jmp .LBB10_27
	mov r15, rax
.LBB10_27:
	mov r14, r13
	test rbp, rbp
	jne .LBB10_28
.LBB10_8:
	mov esi, 32
	mov edx, 8
	mov rdi, r14
	call qword ptr [rip + __rustc::__rust_dealloc@GOTPCREL]
.LBB10_28:
	lea rdi, [rsp + 920]
	call core::ptr::drop_in_place::<pstd::collections::btree_set::BTreeSet<i32>>
	jmp .LBB10_74
	call qword ptr [rip + core::panicking::panic_in_cleanup@GOTPCREL]
.LBB10_73:
	mov r15, rax
	lea rdi, [rsp + 16]
	call core::ptr::drop_in_place::<pstd::collections::btree_set::BTreeSet<i32>>
.LBB10_74:
	mov rdi, r15
	call _Unwind_Resume@PLT
	call qword ptr [rip + core::panicking::panic_in_cleanup@GOTPCREL]

Although...

mov rbx, qword ptr [rip + memcpy@GOTPCREL]

does seem to be copying something; I think the size is 864 bytes.

It looks a bit strange, because it seems to copy, and then copy back again...??

	mov edx, 864
	mov rdi, r15
	mov rsi, r14
	call rbx
	mov edx, 864
	mov rdi, r14
	mov rsi, r15
	call rbx

The struct is defined here:

There is a leaf iterator for iterating the leaf level of the BTree, another for iterating backwards (DoubleEndedIterator), and then a stack of iterators for iterating at each level of the BTree (both child pointers and keys, again both forwards and backwards). It allows for up to 15 levels in the published code, although that is probably more than is needed; I am looking to reduce it to maybe 8 or 10.

So the iterator size is quite significant. The std version has parent pointers in the nodes, which means a stack isn't needed. I think the stack method is quite elegant, but ideally you don't want to be moving/copying the iterator struct in memory: that setup overhead can dwarf the cost of actually iterating through a small number of elements.
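A rough size model shows why the depth limit matters (the types and field sizes here are illustrative, not the real ones): per-level state times maximum depth dominates the iterator's footprint.

```rust
use std::mem::size_of;

// Invented per-level iterator state: a node pointer plus forward and
// backward positions within the node.
struct LevelIter {
    node: *const u8,
    fwd: u16,
    bwd: u16,
}

// A stack-based iterator sized for a fixed maximum tree depth.
struct StackIter<const MAX_DEPTH: usize> {
    leaf: LevelIter,
    stack: [LevelIter; MAX_DEPTH],
    depth: usize,
}

fn main() {
    // Reducing the maximum depth shrinks the iterator roughly in
    // proportion, which is why going from 15 levels to 8 helps.
    println!("depth 15: {} bytes", size_of::<StackIter<15>>());
    println!("depth  8: {} bytes", size_of::<StackIter<8>>());
    assert!(size_of::<StackIter<8>>() < size_of::<StackIter<15>>());
}
```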

Well, after studying the assembler code, I believe I have two problems. Firstly, MaybeUninit::uninit() actually zeroes the memory here (even though there is also a zeroed alternative that is meant to do that). Perhaps the documentation could mention that the current cost is not zero but proportional to the size of the MaybeUninit. Or maybe this could be fixed.

Secondly, the call to push_tree, which completes the initialisation of the iterator, generates an unnecessary move/copy of the entire iterator structure.

So… I am contemplating what to do, if anything. Maybe a heap-allocated vector will work better; the idea was to avoid heap allocation, but with a Vec the allocation won't happen for a small (single-level) BTree anyway.

Edit: well, I tried using the standard Vec for the stack, and it is significantly (5x) faster at iterating small sets. It seems that with the current compiler there are hidden costs associated with using fixed-size vectors like arrayvec::ArrayVec.
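A sketch of why the Vec-backed stack can win on setup cost (the per-level payload size is illustrative): the iterator itself stays three words, so zero-initialising or moving it is cheap, and Vec::new() performs no allocation until the first push.

```rust
use std::mem::{size_of, MaybeUninit};

// Invented per-level state, sized to make the point visible.
struct Level([u8; 56]);

// Inline fixed-capacity stack: the whole buffer lives in the iterator,
// so constructing or moving the iterator touches ~850 bytes.
struct InlineStack {
    buf: [MaybeUninit<Level>; 15],
    len: usize,
}

// Vec-backed stack: the iterator holds only (ptr, cap, len).
struct VecStack {
    buf: Vec<Level>,
}

fn main() {
    assert!(size_of::<InlineStack>() > 800);
    assert_eq!(size_of::<VecStack>(), 3 * size_of::<usize>());

    // A fresh Vec has not allocated yet, so a single-level tree never
    // touches the heap.
    let s = VecStack { buf: Vec::new() };
    assert_eq!(s.buf.capacity(), 0);
    println!("inline: {} bytes, vec-backed: {} bytes",
             size_of::<InlineStack>(), size_of::<VecStack>());
}
```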