String processing best practices


#1

I need to parse a string. I don’t want to pull in regexes, a grammar generator, or some similarly big dependency, because the format is relatively simple: just a series of easily recognizable fields.

Normally in C++, and in C before it, I would just use iterators. But there, iterators are just pointers into the string and ranges are just pairs of pointers.

So I tried to use iterators in Rust as well, but they are too restricted. They can basically only be used for one-at-a-time processing, because there seems to be no way to get the slice that remains to be read by the iterator, the slice up to the iterator's position, or similar things that are often needed to pass the parsed bits to functions.

Obviously it can be done with indices. But I hate indices. Even in plain C I always preferred iterating with pointers over indices, because pointers can be dereferenced directly without having to carry the start of the string around everywhere, and because they provide at least some type checking. Indices are just numbers, which are much easier to mix up.

So what should I be using? Do I really have to use indices?

A slice is internally a pair of pointers and so is an iterator. And two pointers can be converted to the offset of the second from the first, so they should be convertible to each other. So is it at least possible, given a string (slice) s and a subslice x, to calculate a slice a that is the part of s before x and a slice b that is the part of s after x? Without resorting to unsafe; there is nothing unsafe about these operations (or at least nothing more unsafe than indexing).


#2

An Iterator is not necessarily convertible to a slice. (Note that in Rust, they don’t pretend to be pointers like in C++.) I don’t imagine it would be hard to add slicing methods to Chars… but then the instant you compose the iterator, that method would disappear, making it kinda slightly useless. In any case, you can use char_indices, whose byte offsets let you convert back to a slice.
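To illustrate the char_indices suggestion, here is a minimal sketch. The helper name `split_before` is made up for this example; the only std calls used are `str::char_indices` and ordinary range slicing.

```rust
/// Splits `s` at the first occurrence of `delim`, returning the text before
/// the delimiter and the text from the delimiter onward.
/// (`split_before` is a hypothetical helper name, not a std API.)
fn split_before(s: &str, delim: char) -> (&str, &str) {
    for (i, c) in s.char_indices() {
        if c == delim {
            // `i` is a byte offset on a char boundary, so slicing cannot panic.
            return (&s[..i], &s[i..]);
        }
    }
    // Delimiter not found: everything is "before", nothing is left.
    (s, "")
}
```

Calling `split_before("key=value", '=')` should give `("key", "=value")`. The index `i` never leaves the function, which keeps the index bookkeeping in one place.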

Honestly, just use slices. It’s so much simpler than trying to wrangle iterators into shape. If you’re paranoid about mixing up indices, you could wrap the slices in a type that only defines a more restricted Index implementation (i.e. it only takes ByteIndex rather than usize).
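One possible shape for such a wrapper, as a sketch only: `ByteIndex` is the name used above, while `Text` and its `find` method are names invented for this example. Only `RangeTo` and `RangeFrom` indexing are implemented here.

```rust
use std::ops::{Index, RangeFrom, RangeTo};

/// A byte offset into a string; a distinct type so it cannot be confused
/// with an arbitrary `usize`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct ByteIndex(usize);

/// A string slice that can only be indexed with `ByteIndex` ranges.
/// (`Text` is a hypothetical name for this sketch.)
#[derive(Clone, Copy, Debug)]
struct Text<'a>(&'a str);

impl<'a> Text<'a> {
    /// Returns the position of the first `c`, if any, as a `ByteIndex`.
    fn find(&self, c: char) -> Option<ByteIndex> {
        self.0.find(c).map(ByteIndex)
    }
}

impl<'a> Index<RangeTo<ByteIndex>> for Text<'a> {
    type Output = str;
    fn index(&self, r: RangeTo<ByteIndex>) -> &str {
        &self.0[..r.end.0]
    }
}

impl<'a> Index<RangeFrom<ByteIndex>> for Text<'a> {
    type Output = str;
    fn index(&self, r: RangeFrom<ByteIndex>) -> &str {
        &self.0[r.start.0..]
    }
}
```

With `let t = Text("name,age"); let i = t.find(',').unwrap();`, the expressions `&t[..i]` and `&t[i..]` compile, while `&t[..4]` does not, because a plain `usize` is not a `ByteIndex`. Note that nothing ties a `ByteIndex` to the particular string it came from.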

Finally, strcursor might be useful. It gives you a way to step forward and backward through a string without having to deal with indices and also grab slices.


#3

A plain Iterator is not necessarily a slice nor convertible to one. But a slice could easily implement Iterator.

That would be somewhat helpful. In fact, in the current 1.4.0-nightly I see methods on str like std::str::slice_shift_char that do almost the same thing as std::iter::Iterator::next for str would (except that it returns the remaining slice instead of mutating the invocant).
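For readers without the unstable method at hand, the behaviour described here (first char plus remaining slice, without mutating anything) can be approximated with stable APIs. This is a sketch; `shift_char` is a made-up name, and it is only assumed to match what slice_shift_char does.

```rust
/// Returns the first char of `s` and the remainder, or `None` if `s` is empty.
/// (`shift_char` is a hypothetical stand-in written with stable APIs.)
fn shift_char(s: &str) -> Option<(char, &str)> {
    let c = s.chars().next()?;
    // `len_utf8` is the number of bytes `c` occupies, so the cut lands
    // exactly on the next char boundary.
    Some((c, &s[c.len_utf8()..]))
}
```

A parsing loop can then thread the remainder through: `while let Some((c, rest)) = shift_char(s) { /* use c */ s = rest; }`.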

Yes, but there are cases where it would be helpful to pass the slice to something that works on an iterator and then get the slice it iterated over and/or the slice that is left to be processed.
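As a sketch of this use case: in current Rust, `Chars` has an `as_str` method that returns the unread remainder, so the consumed part can be recovered by subtracting lengths. The function names `read_n` and `split_after_chars` are invented for this example.

```rust
/// A generic algorithm that only knows about `Iterator<Item = char>`:
/// it consumes up to `n` chars and reports how many it actually got.
fn read_n<I: Iterator<Item = char>>(it: &mut I, n: usize) -> usize {
    it.by_ref().take(n).count()
}

/// Runs `read_n` on `s` and recovers the consumed and remaining slices.
fn split_after_chars(s: &str, n: usize) -> (&str, &str) {
    let mut chars = s.chars();
    read_n(&mut chars, n);
    // The part of `s` that the iterator has not yielded yet.
    let rest = chars.as_str();
    // Everything before `rest` is what the algorithm consumed.
    let consumed = &s[..s.len() - rest.len()];
    (consumed, rest)
}
```

This only works while the concrete `Chars` value is still accessible; once it is wrapped in an adaptor and moved, `as_str` is out of reach.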

C++ also has different classes of iterators; some allow certain operations and others don’t.

… except it complicates matters quite a bit, because then I need to wrap it in something that returns only the char in order to pass it to any algorithm that wants to work with the chars only.

Yes, that’s what I realized. I still need to at least be able to convert the slice to an index, or use indices anyway.

ByteIndex still does not have any relation to the string. What I would like most would be

impl Index<RangeTo<&str>> for &str

and

impl Index<RangeFrom<&str>> for &str

so I could take a subslice, consume characters from it in an iterator-like fashion, and then just ask for the covered part as a slice.

And then it would be cool if this could be statically checked, only allowing it if the indexing slice is, perhaps indirectly, borrowed from the indexed slice.

Yes, that looks nice.

In fact, some time before 1.0 there was something called RandomAccessIterator (I don’t remember the details), and this would be a reintroduction of something like that.


#4

The problem is answering “what does next do?”. There are at least two different, relatively valid answers; a similar reason is why str has chars and bytes, not merely into_iter like, say, Vec does.
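A small sketch of why there is no single obvious `next` for a string: the two standard views disagree as soon as the text is not pure ASCII. The helper name `counts` is made up.

```rust
/// Counts the items yielded by the two standard views of a string:
/// Unicode scalar values (`chars`) and raw UTF-8 bytes (`bytes`).
fn counts(s: &str) -> (usize, usize) {
    (s.chars().count(), s.bytes().count())
}
```

For `"abc"` both counts are 3, but for `"\u{e9}"` (a single "é") `chars` yields one item and `bytes` yields two.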

You can do that with a little pointer arithmetic… which isn’t great, but it does work.
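A sketch of what that pointer arithmetic might look like without `unsafe`: casting a pointer to `usize` is safe, and the resulting offset is bounds-checked before slicing. The helper name `split_around` is hypothetical.

```rust
/// Splits `s` around `x`, which is expected to be a subslice of `s`.
/// Returns `None` if `x` does not lie within `s`.
/// (`split_around` is a hypothetical helper, not a std API.)
fn split_around<'a>(s: &'a str, x: &str) -> Option<(&'a str, &'a str)> {
    let s_start = s.as_ptr() as usize;
    let x_start = x.as_ptr() as usize;
    // Offset of `x` from the start of `s`; fails if `x` starts before `s`.
    let begin = x_start.checked_sub(s_start)?;
    let end = begin.checked_add(x.len())?;
    if end > s.len() {
        return None;
    }
    // `get` additionally rejects offsets that are not on char boundaries.
    Some((s.get(..begin)?, s.get(end..)?))
}
```

The check is purely address-based, so it cannot prove that `x` was actually borrowed from `s`; it only verifies that the addresses fall inside `s`.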

Rust doesn’t have dependent types, and isn’t likely to any time soon. Sorry. :)

It’s not. You cannot do O(1) code point or grapheme cluster indexing on a string, which is what RAI was about.


#5

I have a similar issue in my learning project, and the solution I came up with was to use Iterator + Copy: with Copy, I can save the iterator state before trying to match one of the parsing rules, and then restore it if the rule fails in order to try another one. I do not know whether the same solution applies to your case.
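A sketch of this save-and-restore idea using the standard `Chars` iterator. Note one deviation: `Chars` implements `Clone` but not `Copy`, so the state is saved with an explicit `clone()` here. The names `eat` and `parse_bool` are invented for this example.

```rust
use std::str::Chars;

/// Tries to consume `word` from the front of `it`.
/// On failure the iterator is restored to where it was.
fn eat(it: &mut Chars<'_>, word: &str) -> bool {
    let saved = it.clone(); // save the iterator state
    for expected in word.chars() {
        if it.next() != Some(expected) {
            *it = saved; // rule failed: rewind
            return false;
        }
    }
    true
}

/// Parses "true" or "false" from the front of `s`,
/// returning the value and the unparsed rest.
fn parse_bool(s: &str) -> Option<(bool, &str)> {
    let mut it = s.chars();
    if eat(&mut it, "true") {
        Some((true, it.as_str()))
    } else if eat(&mut it, "false") {
        Some((false, it.as_str()))
    } else {
        None
    }
}
```

Cloning `Chars` is cheap, since it only copies the iterator's position in the string, not the string data.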

And it is still not perfect when I want a slice of the iterator for functions that want slices, like string→integer conversion. If the iterator is known to be based on a slice, some kind of Slice_iterator (#![allow(non_camel_case_types)] rulz ;)) with fn get_mark(&self), fn get_slice(&self, start, end), and of course fn prev(&mut self) would be nice. Of course, these methods are lost if iterator adapters are stacked on top of it, but for a parsing task that seems acceptable.
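Such an iterator could look roughly like the following sketch. The method names come from the description above; the type names `SliceIter` and `Mark`, the field layout, and the `parse_number` example are assumptions made for illustration.

```rust
/// A position inside a `SliceIter`; a separate type so it is not
/// mistaken for a plain number.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Mark(usize);

/// A char iterator that remembers the string it walks over.
struct SliceIter<'a> {
    src: &'a str,
    pos: usize, // byte offset, always on a char boundary
}

impl<'a> SliceIter<'a> {
    fn new(src: &'a str) -> Self {
        SliceIter { src, pos: 0 }
    }
    /// Records the current position.
    fn get_mark(&self) -> Mark {
        Mark(self.pos)
    }
    /// Returns the text between two marks.
    fn get_slice(&self, start: Mark, end: Mark) -> &'a str {
        &self.src[start.0..end.0]
    }
    /// Steps back one char and returns it.
    fn prev(&mut self) -> Option<char> {
        let c = self.src[..self.pos].chars().next_back()?;
        self.pos -= c.len_utf8();
        Some(c)
    }
}

impl<'a> Iterator for SliceIter<'a> {
    type Item = char;
    fn next(&mut self) -> Option<char> {
        let c = self.src[self.pos..].chars().next()?;
        self.pos += c.len_utf8();
        Some(c)
    }
}

/// Reads a leading run of ASCII digits and converts it with `str::parse`.
fn parse_number(s: &str) -> Option<(u32, &str)> {
    let mut it = SliceIter::new(s);
    let start = it.get_mark();
    while let Some(c) = it.next() {
        if !c.is_ascii_digit() {
            it.prev(); // un-read the char that ended the number
            break;
        }
    }
    let end = it.get_mark();
    let n = it.get_slice(start, end).parse::<u32>().ok()?;
    Some((n, &s[end.0..]))
}
```

Internally this is still an index, but it is hidden behind `Mark`, so calling code never handles a raw `usize`.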


#6

Yes, I will probably just make some helper for that.

It would be nice if the standard library had something for directly asking for the distance between (the starts of) two slices. That, together with the currently unstable slice_shift_char, would make the kind of string processing I have in mind quite reasonable with slices.

Hm, I guess I am overestimating what the borrow checker actually knows (it only knows that when a reference is created with lifetime 'a, the referent is read-only for the rest of 'a, right?).

True.

In C++ there are single-pass iterators (that can do exactly what Iterator can in Rust), multi-pass iterators, bidirectional iterators and random-access iterators.

And Chars is a multi-pass iterator. Or could be, but Rust currently does not have any definition of what that means. At the least it could allow some optimizations; for example, peek could be implemented directly instead of needing the Peekable wrapper.
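In practice, the multi-pass property can be approximated today through `Clone`: a clone of `Chars` is an independent iterator over the same remaining text. A minimal sketch of a direct peek built on that (`peek_char` is a made-up helper name):

```rust
use std::str::Chars;

/// Looks at the next char without consuming it, relying on `Chars`
/// being cheap to clone. (`peek_char` is a hypothetical helper.)
fn peek_char(it: &Chars<'_>) -> Option<char> {
    // Advancing the clone leaves the original iterator untouched.
    it.clone().next()
}
```

Unlike `Peekable`, this keeps the concrete `Chars` type, so slice-related methods on it remain usable.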


#7

Yes. That’s why I am more inclined to simply add a next_char() method to slices, taking a mutable slice as the invocant, that would work like Chars.next() but directly on the slice.
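A sketch of how such a method could be prototyped outside the standard library with an extension trait. The trait name `NextChar` is invented; `next_char` is the name proposed above.

```rust
/// Extension trait adding an iterator-like `next_char` to string slices.
/// (Hypothetical; not part of std.)
trait NextChar {
    fn next_char(&mut self) -> Option<char>;
}

impl<'a> NextChar for &'a str {
    fn next_char(&mut self) -> Option<char> {
        // Copy the slice out first so the iterator borrows the
        // underlying text rather than `self`.
        let s: &'a str = *self;
        let mut chars = s.chars();
        let c = chars.next()?;
        // Shrink the slice to the unread remainder.
        *self = chars.as_str();
        Some(c)
    }
}
```

With `let mut s = "abc";`, calling `s.next_char()` returns `Some('a')` and leaves `s == "bc"`, so the remaining input is always available as an ordinary slice.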