How to find a substring starting at a specific index?


#1

Hi. I am writing a program where I need to find a substring in a larger string, and I also need to look at the characters immediately to the left and right of each match.

For example, I have this large string: “xfooy bar ufoov” and I am looking for “foo”.
First match gives me: ‘x’, “foo”, ‘y’ at index 1
Second match gives me: ‘u’, “foo”, ‘v’ at index 11

I started writing an algorithm with a loop and calls to str::find, but I cannot figure out how to make find start at a specific byte index: I want to feed the byte index of the previous match back in to find the next match.

Something like str::find_start_at or str::find_from taking an additional argument (the starting byte index) would be perfect for my case, but I was very disappointed to find an issue about it that was closed without approval to add such a function: https://github.com/rust-lang/rust/issues/11986

I also looked on Stack Overflow for a solution. The suggestion there is to slice the input string. I started doing that, but in my case it becomes very difficult to keep track of the character before the match. Indeed, I just found a bug in my code because slicing at the right place was getting complicated.

So, is there any hope that a str::find_start_at function may appear someday? Meanwhile, what is the simplest way to iterate over the matches in a string (without doing memory allocations)?

For now the only option I can think of is to write a str::find_start_at function myself, but that feels a bit like reinventing the wheel.


#2

I think it might actually be instructive to write this out. If I were to write it, it might look like this:

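// The Pattern trait is unstable (nightly only), so this stable version takes a &str pattern.
fn find_start_at(slice: &str, at: usize, pat: &str) -> Option<usize> {
    // get() yields None rather than panicking when `at` is out of range or splits a character.
    slice.get(at..)?.find(pat).map(|i| at + i)
}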

In other words, all it’s doing is string slicing internally and then calling find. Is there some additional complexity that I’ve missed that’s specific to your problem? Showing some code might help!
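To make the "re-inject the matched index" loop from the question concrete, here is a minimal sketch built on that kind of helper. Both find_start_at and for_each_match are hypothetical names (neither exists in std), and the sketch assumes the pattern is a plain &str and that matches should not overlap. The left and right characters are read from the full string using the absolute index, so nothing has to be tracked across slices.

```rust
/// Hypothetical helper (not part of std): like `str::find`, but the search
/// starts at byte index `at` and the result is relative to the whole string.
fn find_start_at(haystack: &str, at: usize, needle: &str) -> Option<usize> {
    // `get` returns None if `at` is out of range or not on a char boundary.
    haystack.get(at..)?.find(needle).map(|i| at + i)
}

/// Hypothetical driver: calls `visit(index, left, right)` for every
/// non-overlapping match of `needle`, without allocating.
fn for_each_match<F>(haystack: &str, needle: &str, mut visit: F)
where
    F: FnMut(usize, Option<char>, Option<char>),
{
    if needle.is_empty() {
        // An empty needle would match at the same index forever.
        return;
    }
    let mut at = 0;
    while let Some(i) = find_start_at(haystack, at, needle) {
        let end = i + needle.len();
        // Character just before the match, if any.
        let left = haystack[..i].chars().next_back();
        // Character just after the match, if any.
        let right = haystack[end..].chars().next();
        visit(i, left, right);
        // Resume the search right after this match.
        at = end;
    }
}
```

Calling for_each_match("xfooy bar ufoov", "foo", |i, l, r| println!("{i}: {l:?} {r:?}")) visits the matches at byte indices 1 and 11. Because the neighbours are looked up in the original string, back-to-back matches such as “foofoo” still see the preceding character.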

It seems like either str.matches or str.match_indices is what you want?
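As a rough illustration of the match_indices route, the sketch below wraps it in a hypothetical function, matches_with_neighbors (my own name, not a std API), that yields each match's byte index together with the characters on either side. It assumes a &str pattern; match_indices already reports indices relative to the whole string, so no slicing bookkeeping is needed.

```rust
/// Hypothetical wrapper around `str::match_indices`: lazily yields
/// `(byte_index, char_before, char_after)` for each non-overlapping match.
fn matches_with_neighbors<'a>(
    haystack: &'a str,
    needle: &'a str,
) -> impl Iterator<Item = (usize, Option<char>, Option<char>)> + 'a {
    haystack.match_indices(needle).map(move |(i, m)| {
        // Last character of everything before the match.
        let left = haystack[..i].chars().next_back();
        // First character of everything after the match.
        let right = haystack[i + m.len()..].chars().next();
        (i, left, right)
    })
}
```

The returned iterator is lazy, so a caller can simply write for (i, left, right) in matches_with_neighbors(text, "foo") { ... } and handle each match as it is found.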


#3

Thanks for your reply, BurntSushi.

Here is my current code, which contains a bug when two matches occur back to back (e.g. “foofoo”, in terms of the example from my question): https://github.com/hadrien-psydk/fixsrt/blob/master/src/main.rs#L42

It’s a program that fixes spelling mistakes in srt files. At line 84 there is the find() call on a slice that is rebuilt in each iteration. For “foofoo”, I cannot get the character before the second match, because the rebuilt slice starts exactly at that match: find() returns index 0 and the previous character is no longer part of the slice.

Thank you for your find_start_at proposal, I will study that.

“It seems like either str.matches or str.match_indices is what you want?”

match_indices looked nice, but I was afraid it was doing dynamic memory allocations. However, maybe it does not allocate until collect() is called? In that case it could be another good option for me.


#4

collect will store the remaining elements of the iterator in whatever collection type you ask for. If you collect into a Vec, then that will certainly allocate. However, simply iterating over the elements of matches or match_indices should not allocate at all.
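A small sketch of that difference, using two hypothetical helper names of my own (count_matches and collect_match_indices): the first only walks the iterator, while the second builds a Vec and therefore has to allocate once there is at least one match to store. str::match_indices itself lives in core, which has no heap allocator available.

```rust
/// Consumes the iterator directly; nothing is stored, so no heap allocation
/// is needed just to walk over the matches.
fn count_matches(haystack: &str, needle: &str) -> usize {
    haystack.match_indices(needle).count()
}

/// Collects the match positions into a Vec; the Vec is what allocates,
/// not the iteration itself.
fn collect_match_indices(haystack: &str, needle: &str) -> Vec<usize> {
    haystack.match_indices(needle).map(|(i, _)| i).collect()
}
```

If the matches are only needed one at a time, a plain for loop over match_indices is enough and the Vec can be skipped entirely.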