Parameters for generic "diff": slices with iterators?

Hi there. I'm trying to write a Rust version of the diff algorithm. Diff is typically used by programmers to show which lines of a source code file have changed, but I'd like to make my function as general as possible. In particular, I want to be able to pass in UTF-8 strings and diff them by grapheme.

The question is, what should the parameters to my function look like? I started off using slices:

fn diff<T: PartialEq> (a: &[T], b: &[T]) -> ... 

This works for vectors of enums, and for bytestrings ("abc".as_bytes()), but to diff by grapheme I need a way to tell the function to iterate over the slices using an iterator like GraphemeIndices. (This seems to be unstable at the moment but CharIndices will work well enough as a substitute if necessary.)

The idea that seems most promising is to add a trait similar to IntoIterator specifying a function that takes a slice and returns an iterator over it (although since I don't want to mutate or consume, it should return a reference, like iter()).

I feel that a trait like this would be useful in other contexts, but I haven't found it. That makes me think that it's time to ask the community for advice. Have I overlooked something, should I be taking a different approach, or am I trying to do something that isn't possible and should be a feature request? Is there a good idiomatic approach to this situation? What function parameters should I use?

Shouldn't Iterator<Item=&T> or Iterator<Item=&[T]> work for this purpose?

Btw try out UnicodeSegmenation::grapheme_indices() for a stable alternative that works today.

Thanks for the reply! And thanks for the tip about grapheme_indices(). It's good to know there's a stable version.

When you suggest using Iterator<Item=&T>, do you mean replacing the slices completely with iterators, as follows?

fn diff<T: PartialEq> (a: Iterator<Item=&T>, b: Iterator<Item=&T>)

I thought about doing this but I think there are problems involved. I didn't go into much detail in my question above but the algorithm I'm basing my code on uses a lookup table of offsets to record the locations of matching subsequences in the two sequences being compared.

So if I use iterators instead of slices, there'd be a lot of using nth(offset) inside performance-sensitive loops, which would probably be slow and not an ideal coding style. Or else, instead of storing offsets in the lookup table, another way might be to store copies of iterators. But that would probably be unnecessarily complicated and take up more memory than needed.

In fact, another possibility I thought of would be to keep the function as it is, using slices, and collect strings into vectors of char or vectors of graphemes in order to pass them it. That would probably be a little slow and memory inefficient but would at least work.

But I feel like neither of these options - switching totally to iterators or sticking totally to slices - can be good idiomatic Rust. I'm just a beginner but I think I'm right in saying that in Rust, String and &str are UTF-8 strings represented as Vec<u8>. Presumably that is an optimal choice for performance and memory efficiency, and because it's close to their low-level representation.

But if I have to switch to another type - Chars or Vec<&str> - when I use generic functions, I get the feeling I'm fighting against the type system. If Rust uses UTF-8 &str for strings, encourages the use of generic functions, and even has functions like chars() and grapheme_indices() that let you convert to iterators from slices, I feel it would be most natural to have a way of using UTF-8 &str with generic functions.

I've written a lot in my reply, but I hope I don't come over as ranting or being dogmatic. I'm open to any possible solutions, but I want to be confident that I'm learning to write natural, idiomatic code. If there was something in the Rust tutorial saying "when you write generic functions, don't use &str - use only iterators or UTF-32 in Vec<u32> (or whatever)" I'd happily accept that, but as far as I've seen it's UTF-8 &str all the way. Are there any examples of avoiding &str/String and opting for a different string type?

In this case I probably don't understand what you need. You earlier wrote:

a trait similar to IntoIterator specifying a function that takes a slice and returns an iterator over it.

What would be the difference between this trait and Iterator<Item=&T> (or the same with &[T])?

I think the exact opposite is true: both can be idiomatic, it just depends on how you want the interface to be used. Taking an Iterator is more general, but gives you, the implementor, less power. Taking a slice is more specific, but gives you more power, e.g. the ability to perform efficient random access.

By the way, slice::Iter::nth() is as efficient as slice indexing itself.

What would be the difference between this trait and Iterator<Item=&T> (or the same with &[T] )?

The answer is, as you point out:

Taking an Iterator is more general, but gives you, the implementor, less power. Taking a slice is more specific, but gives you more power, e.g. the ability to perform efficient random access.

I want to write a general function, but I also want efficient random access. So maybe I'm trying to have my cake and eat it too?

I think the exact opposite is true: both can be idiomatic, it just depends on how you want the interface to be used.

I was a little unclear... in fact I think I agree with this. Slices and iterators are both features of the language, and each is appropriate in some situations. I wanted to say that trying to use only one of them all the time, even when the other one would be a better fit, would be a bad idea.

The problem is that within the function, because of the nature of the algorithm I'm using, I want to use both slices and iterators: slices for random access, iterators for cases like graphemes. It's good that slice::Iter::nth() is efficient, but my reason for wanting iterators is to accommodate iterators like Graphemes which step over a variable number of bytes with each next(). I imagine that for Graphemes, almost certainly nth() will be slow, because there's no easy way to convert the instruction "jump 10 graphemes ahead" into a byte offset.

Initially, I was trying to be concise, but I think I erred on the side of giving too little information. In the name of clarity, I've decide to risk giving too much information instead. Hopefully it won't be necessary to read through everything to understand what I'm trying to do, but just skip-read enough to get a gist. So, for reference, below are all the details.

I'm basically adapting/copying from C code which I found from the Wikipedia article on diff, and the algorithm in the C code is itself based on this research paper. In the paper, Figure 2 on page 6 contains the relevant pseudocode. Fig. 1 on page 3 and Fig. 3 on page 7 together give visual illustrations of the strategy the code is using.

I was reluctant to post chunks of my own version of the code out of embarrassment about the possible poor quality but maybe it would be clearer to just paste the first half of the loop in the relevant function for context. The important thing to note is how the splice index variable x gets stored in the memoization array fv[] and then retrieved - this is the part that needs random access. The (very C-like) while loop is the only place where the contents of the two sequences a and b being compared are actually examined.

let mut fv = [mid * 2 + 1; 0];
let mut rv = [mid * 2 + 1; 0];

fv[1 + mid] = 0;
rv[delta - 1 + mid] = n;

for d in 0..=mid {
	let mut x: usize;
	let mut y: usize;
	for k in (0-d..=d).rev().step_by(2) {
		let km = k + mid;
		// Find the end of the furthest reaching 
		// forward D-path in diagonal k.
		// Greedy algorithm in Figure 2
		x = if k == 0-d || k != d && fv[km - 1] < fv[km + 1] {
			fv[km + 1]
		} else {
			fv[km - 1] + 1
		}
		y = x - k;

	// Start point of middle snake
		let startx = x;
		let starty = y;

		while x < n && y < m && a[x] == b[y] {
				x += 1; y += 1;
		}
		fv[km] = x; // end Figure 2

		if odd && k >= delta - (d - 1) && k <= delta + (d - 1) {
			// Check for overlap with reverse path
			if x >= rv[km] {
				return MiddleSnake {
					x: startx, 
					y: starty, 
					u: x, 
					v: y, 
					d: 2 * d - 1
				};
			}
		// for loop continues...

I hope this makes things clearer. At the moment it looks like what I'm trying to do may not be possible, or at least not advisable. Is there any way to have random access but also account for complications like variable-length graphemes?

It does look like that to me, to be honest. Iterators are in the language because you can't have efficient random access in general; only specific data structures, for example arrays and hash maps, can provide it.

From the rest of your description, I'm not 100% sure, but do you want to perform two different traversals on the same string? If you need random access in one of these, I could imagine two approaches that would work:

  • Take a slice because you need it, and use it for what you need it. Then (or first, or while you are using the slice - depends on what order you want ), ask for an iterator over the grapheme clusters and step over it one by one. As long as you have immutable refs and iterators, you can use them at the same time.
  • Alternatively, traverse the grapheme clusters with the appropriate iterator (so take an iterator as your function argument), then cache their start-end indices as ranges or subslices in a vector that you can then index directly. This way your baseline complexity remains O(n), and after the initial pass through the string/slice, you can access whichever grapheme you need.

Thanks for your suggestions for approaches. Certainly either of them will be good for a function that takes &str and iterates over graphemes. But my ultimate aim, if at all possible, is to have a single generic function that will handle either generic slices iterated over one item at a time, or strings iterated over by grapheme.

Considering the first approach:

Take a slice because you need it, and use it for what you need it. Then (or first, or while you are using the slice - depends on what order you want ), ask for an iterator over the grapheme clusters and step over it one by one. As long as you have immutable refs and iterators, you can use them at the same time.

In a generic function taking this approach, the difficult part is asking for the iterator. The function will need to be able to determine whether it should use iter() to iterate over a slice or grapheme() to iterate over a string. What is the easiest way to do this, is the question I was trying to ask in the title.

Considering the second approach:

Alternatively, traverse the grapheme clusters with the appropriate iterator (so take an iterator as your function argument), then cache their start-end indices as ranges or subslices in a vector that you can then index directly. This way your baseline complexity remains O(n), and after the initial pass through the string/slice, you can access whichever grapheme you need.

It seems like this approach is more likely to be applicable to the generic case. To iterate over slices one item at a time, the caller could pass in [[0..=1]..=[(n-1)..=n]] (note: this is just pseudocode) where n is the length of the slice) as the list of indices. I have a couple of suggestions for variations on this approach:

  • Instead of passing in a list of indices, how about passing in a function or closure instead? That would potentially make things easier in the case where all items are one length.

  • How about collecting graphemes in a vector of graphemes? I mean doing something like:
    let vector_of_graphemes: Vec<&str> = input_string.graphemes().collect();
    Would this be slow or wasteful?

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.