What is the best way to iterate over lines of a string with offset?

There's a method called .lines() but it doesn't let you know how far the first character of each line is from the start of the original string.

I was wrong below. See the correct answer from @jofas.

You can take advantage of the fact that a single '\n' is consumed between lines and add one to the length of each substring to calculate the next offset.

    let s = "one\ntwo\nthree\n\nfour";
    let mut offset = 0;
    let lines = s.lines().map(|s| {
        let i = offset;
        offset += s.len() + 1;
        (s, i)
    });
    for (line, offset) in lines {
        println!("{offset} {line}");
    }

EDIT: This produces the wrong result if a '\r\n' is added.

'\r\n' is also considered a line break (thanks, Windows). One could simply check what comes after the line, i.e. '\n' or '\r\n', and add one or two to the byte offset. Alternatively, if using nightly is a possibility, there is Lines::remainder. It returns the part of the string that has not been iterated yet, so subtracting its length from the length of the whole string gives the byte offset of the next line (one would need to store the previously observed offset to get the offset of the current line):

#![feature(str_lines_remainder)]

fn main() {
    let x = "a\nb\nc\r\nd";

    let mut lines = x.lines();

    let byte_offset = x.len() - lines.remainder().unwrap().len();
    assert_eq!(byte_offset, 0);

    assert_eq!("a", lines.next().unwrap());

    let byte_offset = x.len() - lines.remainder().unwrap().len();
    assert_eq!(byte_offset, 2);

    assert_eq!("b", lines.next().unwrap());

    let byte_offset = x.len() - lines.remainder().unwrap().len();
    assert_eq!(byte_offset, 4);

    assert_eq!("c", lines.next().unwrap());

    let byte_offset = x.len() - lines.remainder().unwrap().len();
    assert_eq!(byte_offset, 7);

    assert_eq!("d", lines.next().unwrap());
}
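
For stable Rust, the first option mentioned above (checking whether '\n' or '\r\n' follows each line) could be sketched roughly as follows. The helper name lines_with_offsets is made up for illustration, and it collects into a Vec only to keep the sketch short:

```rust
/// Hypothetical helper: pairs each line yielded by `str::lines` with the
/// byte offset of its first character in `text`.
fn lines_with_offsets(text: &str) -> Vec<(usize, &str)> {
    let mut offset = 0;
    let mut out = Vec::new();
    for line in text.lines() {
        out.push((offset, line));
        // Look at what follows the line to learn how long its terminator is.
        let rest = &text[offset + line.len()..];
        let terminator = if rest.starts_with("\r\n") {
            2
        } else if rest.starts_with('\n') {
            1
        } else {
            0 // last line without a terminator
        };
        offset += line.len() + terminator;
    }
    out
}

fn main() {
    for (offset, line) in lines_with_offsets("a\nb\nc\r\nd") {
        println!("{offset} {line}");
    }
}
```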

Yes, but the '\r' is not consumed by lines, it is present in the iterated value.

Oh, I was misinterpreting the docs; based on testing, it seems I was wrong.

You're right.

It is consumed; you can see so in my example (note how the third line is "c" and not "c\r"). A carriage return not immediately followed by a line feed is not considered a line break, and only then is the carriage return present in the substring representing the line's content.
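
A small check of both cases, as a sketch (split_lines is just a hypothetical wrapper around .lines() so the results are easy to compare):

```rust
/// Hypothetical wrapper that collects the lines of `s` into a Vec.
fn split_lines(s: &str) -> Vec<&str> {
    s.lines().collect()
}

fn main() {
    // "\r\n" counts as one line ending and is removed from the yielded line.
    assert_eq!(split_lines("c\r\nd"), ["c", "d"]);

    // A '\r' that is not followed by '\n' stays in the line's content.
    assert_eq!(split_lines("a\rb\nc"), ["a\rb", "c"]);
}
```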

Thanks for the correction! :slight_smile:

It looks like it is possible to do with BufRead::read_line, since it keeps the line terminator in the buffer and returns the number of bytes read.

    use std::io::BufRead;
    use std::io::Cursor;

    let input = "one\r\ntwo\nthree\n\nfour";
    let mut cursor = Cursor::new(input);

    let mut offset = 0;
    let mut line = String::new();
    loop {
        line.clear();
        let n = match cursor.read_line(&mut line) {
            Ok(0) | Err(_) => break,
            Ok(n) => n,
        };
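        // Strip only the terminator; a plain trim() would also eat leading whitespace.
        let line = line.trim_end_matches(|c: char| c == '\r' || c == '\n');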
        println!("{offset} {line}");
        offset += n;
    }

Rather than capturing a local mut offset, there's an iterator method called .scan(), and rather than .lines(), I used a combination of .split('\n') and .strip_prefix('\r'). The resulting code is a little bit more complicated than yours, but correct.

If anyone posts an answer that is entirely lazy and doesn't capture local variables, I will mark it as the solution. For my part, I have solved the problem on my own, but the requirements for my code are a little bit different, so I won't post it here.

Here is an iterator version of this, however:

  • It allocates and returns a new String for every line.
  • I only approximated the trimming of the terminator and just trimmed all \n and \r.

EDIT: Here's a version with a generic AsRef<[u8]> input. Also removed the extraneous line.clear().

EDIT2: This version doesn't allocate a String per line, but doesn't use the Iterator interface.

You drive a hard bargain.

This is zero-cost:

string.lines().map(|s| {
    // Distance of the line's first byte from the start of the original string.
    let offset = s.as_ptr() as usize - string.as_ptr() as usize;
    (offset, s)
});

Substrings are pointers to the original string, so the offset is always known exactly without having to count anything.

I haven't used ptr::offset_from since that is unsafe, and theoretically if lines() gave you unrelated lines that aren't in your string (which it won't but the API doesn't forbid it), it would be UB instead of a nonsense offset.
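
For completeness, here is a self-contained sketch of the pointer-subtraction idea wrapped in a function. The name lines_with_ptr_offsets is hypothetical; the only assumption is that every line really is a subslice of text, which is how .lines() behaves:

```rust
/// Hypothetical helper: yields (byte offset, line) pairs lazily, with the
/// offset derived from the address of each subslice.
fn lines_with_ptr_offsets(text: &str) -> impl Iterator<Item = (usize, &'_ str)> {
    let base = text.as_ptr() as usize;
    text.lines()
        .map(move |line| (line.as_ptr() as usize - base, line))
}

fn main() {
    let text = "one\ntwo\nthree\n\nfour";
    for (offset, line) in lines_with_ptr_offsets(text) {
        // The offset indexes straight back into the original string.
        assert_eq!(&text[offset..offset + line.len()], line);
        println!("{offset} {line}");
    }
}
```

Since no running counter is kept, the width of the terminator ('\n' or '\r\n') never has to be considered.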

I think this is the first time I've seen raw pointers used, without unsafe, to do something useful. This never would have occurred to me, so I guess it's been too long since I wrote C.

There's also str::split_inclusive('\n') (great for indenting!), but I generally do this by repeated input[pos..].find('\n')
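
Both of those could look roughly like this. The function names are made up, the results are collected into a Vec for brevity, and stripping the "\r" of a "\r\n" terminator is my assumption about what is wanted:

```rust
/// Hypothetical helper: with `split_inclusive`, each piece keeps its '\n',
/// so summing the piece lengths gives the offsets.
fn offsets_split_inclusive(text: &str) -> Vec<(usize, &str)> {
    let mut offset = 0;
    text.split_inclusive('\n')
        .map(|piece| {
            let start = offset;
            offset += piece.len();
            // Remove the terminator ("\n" or "\r\n") from the reported line only.
            let line = match piece.strip_suffix('\n') {
                Some(l) => l.strip_suffix('\r').unwrap_or(l),
                None => piece,
            };
            (start, line)
        })
        .collect()
}

/// Hypothetical helper: repeated `text[pos..].find('\n')`.
fn offsets_find(text: &str) -> Vec<(usize, &str)> {
    let mut out = Vec::new();
    let mut pos = 0;
    while pos < text.len() {
        let (line, next) = match text[pos..].find('\n') {
            Some(i) => {
                let raw = &text[pos..pos + i];
                (raw.strip_suffix('\r').unwrap_or(raw), pos + i + 1)
            }
            // No more line feeds: the rest of the string is the last line.
            None => (&text[pos..], text.len()),
        };
        out.push((pos, line));
        pos = next;
    }
    out
}

fn main() {
    let text = "one\r\ntwo\nthree\n\nfour";
    assert_eq!(offsets_split_inclusive(text), offsets_find(text));
    for (offset, line) in offsets_find(text) {
        println!("{offset} {line}");
    }
}
```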

This is the solution I sought. Entirely lazy, no capturing of local variables, no allocations.

// we can even have no_std if we remove dbg!
// #![no_std]

fn lines_with_offset(text: &str) -> impl Iterator<Item = (usize, &'_ str)> {
    text.split('\n')
        .scan((0, ""), |(prev_end_offset, _), line| {
            let start_offset = *prev_end_offset;
            *prev_end_offset += '\n'.len_utf8() + line.len();
            Some((start_offset, line))
        })
        .map(|(start_offset, line)| match line.strip_prefix('\r') {
            Some(line) => (start_offset + '\r'.len_utf8(), line),
            None => (start_offset, line),
        })
        .map(|(start_offset, line)| (start_offset, line.strip_suffix('\r').unwrap_or(line)))
}

fn main() {
    let text = "hello\nworld\nabc\ndef\n\rghi\r\njkl\r\nmno\n\rfinal line";
    for (start_offset, line) in lines_with_offset(text) {
        let end_offset = start_offset + line.len();
        let line_ = &text[start_offset..end_offset];
        dbg!((start_offset, end_offset, line, line_)); // this line may be removed to get no_std
        assert_eq!(line, line_);
    }
}

Playground doesn't work now for some reason (probably because gist API changed?), so you may run it locally.


NOTE: The Rust std's .scan() sucks ass, so in my actual solution, I used IterScan::scan_copy.