Overall, the rayon approach heavily depends on how much deduplication there will actually be. If a lot of items survive, then you spend a lot more time merging and reallocating in the reduce
step. We have some benchmarks in the rayon repo of different collect
strategies we tried:
test map_collect::i_mod_10_to_i::with_collect ... bench: 2,158,359 ns/iter (+/- 64,941)
test map_collect::i_mod_10_to_i::with_fold ... bench: 456,081 ns/iter (+/- 16,461)
test map_collect::i_mod_10_to_i::with_fold_vec ... bench: 575,579 ns/iter (+/- 29,454)
test map_collect::i_mod_10_to_i::with_linked_list_collect ... bench: 4,327,805 ns/iter (+/- 261,511)
test map_collect::i_mod_10_to_i::with_linked_list_collect_vec ... bench: 2,350,412 ns/iter (+/- 101,515)
test map_collect::i_mod_10_to_i::with_linked_list_collect_vec_sized ... bench: 2,413,770 ns/iter (+/- 119,020)
test map_collect::i_mod_10_to_i::with_linked_list_map_reduce_vec_sized ... bench: 2,382,592 ns/iter (+/- 83,305)
test map_collect::i_mod_10_to_i::with_mutex ... bench: 28,778,848 ns/iter (+/- 2,216,573)
test map_collect::i_mod_10_to_i::with_mutex_vec ... bench: 6,243,400 ns/iter (+/- 1,033,037)
test map_collect::i_mod_10_to_i::with_vec_vec_sized ... bench: 2,442,997 ns/iter (+/- 149,437)
test map_collect::i_to_i::with_collect ... bench: 3,875,987 ns/iter (+/- 127,681)
test map_collect::i_to_i::with_fold ... bench: 12,839,940 ns/iter (+/- 1,007,096)
test map_collect::i_to_i::with_fold_vec ... bench: 12,852,246 ns/iter (+/- 1,583,942)
test map_collect::i_to_i::with_linked_list_collect ... bench: 7,552,594 ns/iter (+/- 595,211)
test map_collect::i_to_i::with_linked_list_collect_vec ... bench: 7,704,584 ns/iter (+/- 368,471)
test map_collect::i_to_i::with_linked_list_collect_vec_sized ... bench: 4,332,625 ns/iter (+/- 274,807)
test map_collect::i_to_i::with_linked_list_map_reduce_vec_sized ... bench: 4,344,150 ns/iter (+/- 191,931)
test map_collect::i_to_i::with_mutex ... bench: 45,221,873 ns/iter (+/- 365,509)
test map_collect::i_to_i::with_mutex_vec ... bench: 19,704,084 ns/iter (+/- 2,728,435)
test map_collect::i_to_i::with_vec_vec_sized ... bench: 4,373,531 ns/iter (+/- 270,386)
The i_mod_10
case has keys generated with i % 10
, so extreme duplication, and that's where fold
wins. But for the general FromParallelIterator for HashMap<K, V>
, we used the implementation that collects LinkedList<Vec<(K, V)>>
and moves that into a HashMap
afterward, sequentially.
If you're not producing these items with parallel computation, it probably doesn't make much sense to involve rayon at all.