Collect_with_capacity(usize)

While optimizing my regex engine (I made another thread for it) I have realized that I were losing performance on the collect.

	machine.replace_vec_all(&seq,rep.iter().map(&|&x|x)).collect()

Now I have that rewritten as

	let mut new=Vec::with_capacity(2*seq.len());
	for elem in machine.replace_vec_all(&seq,rep.iter().map(&|&x|x))
	{
		new.push(elem);
	}
	new

The whole benchmark goes from 7.43 seconds to 6.57 seconds. So it is a relevant part. It would be nice to have a collect_with_capacity function equivalent.

	machine.replace_vec_all(&seq,rep.iter().map(&|&x|x)).collect_with_capacity(2*seq.len())

Is anyone interest in this proposal?

You can already control the capacity.

collect() relies on FromIterator, which for Vec is implemented as such:

https://doc.rust-lang.org/src/alloc/vec.rs.html#1804

Note the call to size_hint, which will then be used as the capacity for the vector.

I'd check if machine.replace_vec_all(..) returns a vector with a properly working size hint and if not, make it so ;). If there's any questions along the way, please ask :).

3 Likes

Currently replace_vec_all returns a ReplacerVecIt, which just implements Iterator. Should ReplacerVecIt implement also FromIterator?

Even if I can implement that it could make sense to have this collect_with_capacity. For the programmer to have more control and to do not depend on how an external iterator is implemented.

ReplacerVecIt should then implement the default method size_hint using the size_hints of the inner vectors. If you don't implement it, it will be automatically implemented as (0, None).

Even if I can implement that it could make sense to have this collect_with_capacity. For the programmer to have more control and to do not depend on how an external iterator is implemented.

I don't think collect_with_capacity makes sense, as capacity is not a concept common to everything you can collect into.

Also, you can always wrap an external interator into one that gives a different size_hint, if need be.

1 Like

Ok, that makes sense. However, the method size_hint is too strict for my case, since I only know the approximate size and not any actual bound.

For example if I replace the regular expression "ab*c" by "pqrst" in an unknown text, then the length may decrease or increase, depending on the number of matched 'b's. Thus, size_hint cannot be defined, but collect_with_capacity could give a good enough guess.

I understand that it makes no sense for generic collections, just by ones with capacity. Perhaps it should be Iterator::to_vec_with_capacity.

I would change your rewritten version to new.extend(iter), and there's a proposal to make this even easier:
https://github.com/rust-lang/rust/issues/45840

3 Likes

Good to hear that there are people working on a similar function.