I am trying my hand at implementing some string-matching in Rust and I'm running into an issue with nested loops.
To avoid indexing into a string using chars().nth(n) (which would be an expensive operation) I am looping over the chars to access to them and their index through an enumerate.
The problem is, that when I try to enumerate over chars_b in the inner loop, it complains that it is a moved value.
Any thoughts on how I could address this without doing a .clone()?
Thank you in advance!
fn fuzzywordmatch(word_a: &String, word_b: &String) -> u16 {
let chars_a = word_a.chars();
let chars_b = word_b.chars();
let size_a = word_a.len();
let size_b = word_b.len();
let mut previous_dists: Vec<u16> = (0..(size_b + 1)).map(|i| i as u16).collect();
let mut current_dists: Vec<u16> = Vec::with_capacity(size_b + 1);
for (i, char_a) in chars_a.enumerate() {
current_dists[0] = i as u16 + 1;
for (ii, char_b) in chars_b.enumerate() {
let cost_delete = previous_dists[ii + 1] + 1;
let cost_insert = current_dists[ii] + 1;
let cost_substs = previous_dists[ii] + (char_a == char_b) as u16;
previous_dists[ii + 1] = cost_delete.min(cost_insert).min(cost_substs);
}
std::mem::swap(&mut previous_dists, &mut current_dists);
}
return previous_dists[size_b];
}
Note, the error happens specifically at: for (ii, char_b) in chars_b.enumerate()
This is the complete compiler diagnostic:
Compiler Diagnostic
error[E0382]: use of moved value: `chars_b`
--> src\main.rs:42:29
|
31 | let chars_b = word_b.chars();
| ------- move occurs because `chars_b` has type `Chars<'_>`, which does not implement the `Copy` trait
...
39 | for (i, char_a) in chars_a.enumerate() {
| -------------------------------------- inside of this loop
...
42 | for (ii, char_b) in chars_b.enumerate() {
| ^^^^^^^ ----------- `chars_b` moved due to this method call, in previous iteration of loop
|
note: `enumerate` takes ownership of the receiver `self`, which moves `chars_b`
|
1016 | fn enumerate(self) -> Enumerate<Self>
| ^^^^
help: you can `clone` the value and consume it, but this might not be your desired behavior
|
42 | for (ii, char_b) in chars_b.clone().enumerate() {
|
(There's nothing inherently wrong with a clone but) you could create it every time.
for (ii, char_b) in word_b.chars().enumerate() {
I didn't analyze your algorithm at all, but note that chars don't always correspond to a human language character. For that you need a crate or to restrict your inputs in some way.
Why do you want to avoid that clone? It's trivial (since it doesn't need to clone the string itself, quite obviously).
Also, your code is very suboptimal in other ways.
&String as a literal type is useless and harmful in a function parameter. It doesn't have any more capabilities than &str but it forces an allocation on the caller if they don't have a String (only an &str).
You are misusing the length of the string. Strings are fundamentally UTF-8 in Rust, which is a byte-based encoding. Thus, the len() method returns the number of bytes, not the number of code points (which chars are). Since .len() >= .chars().count() (by as much as a factor of 4), the last element (that you return) may be entirely nonsense after your loops, as it has not been initialized correctly.
You are indexing into the current_dists vector before it has non-zero length. Your code will unconditionally panic upon the very first iteration.
If you are trying to perform sequence alignment, as it is apparent from your code, don't roll your own. There are many, much more optimized algorithms and libraries available that you should use instead.
Rust iterators are always single-use only. Once you iterate over the elements, the iterator becomes permanently useless. There's no reuse, no rewinding. You need to create a fresh one every time you want to iterate from the beginning.
Rust iterators are not collections, they are pointers into data. Recycling for _ in chars_b without replacing it is like trying to recycle i in for(; i < 10; i++) without resetting i to 0.
Besides that, it's much faster to work on bytes, not chars (Rust's char type is a relatively expensive Unicode scalar value, not a byte). There's str.as_bytes(). For algorithms that involve substring search, it's usually safe to work on UTF-8 bytes directly.
Ah okay! I was not aware of that. Given the wording of clone() I assumed it would copy the entire data. That's good to know, the better solution might just be to use clone() then!
@quinedot Reading more into the docs on char and the byte form of strings, I'll try out both with some functional code first, but it's good to know there's alternatives for things like this too.
@paramagnetic Like I mentioned in the other reply, I didn't know cloning a string doesn't copy the underlying data, just the pointers. I'll be using clone from here on out!
As you said though, I probably should use &str instead of &string, I'm still getting my head around all the string types, the previous language I used only had one string type, so it's some getting used to! The len() vs chars().count() thing is the same, thank you for pointing it out!
To answer the last point, I'm aware I will not write the best algorithms, this is a small test project, so I'm just exploring different things. Once I do start using it somewhere I'll definitely look at more optimal solutions though!
@kornel I will try a test using bytes, I think for this usecase it might work?
With the code functional now, I'll need to do some tests to see what works. Thank you!
That's not what was stated, however. Cloning the Stringdoes copy the underlying data (otherwise we'd have two pointers responsible for freeing the same memory). But Chars is not String - it's essentially a pointer into an independently existingString, a view, and a view can be duplicated cheaply.