Split by Paragraphs

I wanted to split text into paragraphs the same way lines splits it into lines. I did some quick googling, found nothing, and decided to write it myself.

My implementation wraps Lines and treats all-whitespace lines as paragraph separators. After finding the first and last non-empty line of a paragraph, I stitch them back into a single slice using unsafe code.

It appears to work, but I am not 100% confident about the unsafe part. I would appreciate someone having a look at it, and any general feedback as well.

In a lot of cases you could probably get away with a simple call to split on "\n\n". Something like this.

Your use of unsafe code doesn't seem warranted to me, especially since this can be done without any unsafe functionality.

To me it looks like a shortcut for s.split("\n\n"), unless you were to implement an algorithm that also understands \r.
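For reference, a minimal sketch of that suggestion. The function name split_naive is made up for illustration and is not taken from the linked playground.

```rust
/// Naive paragraph split: every exact "\n\n" is treated as a separator.
/// Hypothetical helper, named here only for illustration.
fn split_naive(text: &str) -> Vec<&str> {
    text.split("\n\n").collect()
}
```

Note that "a\r\n\r\nb" contains no "\n\n" at all, so this version returns it as a single paragraph. The same happens when the separating line contains spaces or tabs.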

I also wanted to split at lines that consist only of whitespace characters, so splitting at \n\n does not cover it.
(Playground)

The other simple option would be to use a regex, but I did not like that idea.

What I did in my playground was to trim each paragraph and check whether it is empty. If it is, I skip straight to the next one. That should have the effect of ignoring whitespace.

I modified your example by adding some spaces between the second and third paragraph, so that instead of \n\n\n there's \n \n \n. With that input the approach fails to detect the "empty" (whitespace-only) lines and outputs:
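Roughly, the trim-and-skip idea can be sketched like this. This is a guess at the shape of the approach, not the actual playground code, and split_trimmed is a hypothetical name.

```rust
/// Split on "\n\n", trim each piece, and drop pieces that end up empty.
/// Hypothetical helper, named here only for illustration.
fn split_trimmed(text: &str) -> Vec<&str> {
    text.split("\n\n")
        .map(str::trim)
        .filter(|paragraph| !paragraph.is_empty())
        .collect()
}
```

This copes with runs of consecutive newlines, but it still only ever splits at an exact "\n\n".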

0: This is the first paragraph.
1: This is the second paragraph, which is separated by two newline characters.
  
  
  
And here is the third paragraph.

Here's a version of your code in safe Rust. (One slight behavior difference: it includes the final newline as part of the returned paragraph.) Implementing next_back is left as an exercise. Playground.

/// Iterator over the paragraphs of a string, separated by blank
/// (empty or all-whitespace) lines.
pub struct Paragraphs<'a> {
    s: &'a str,
}

impl<'a> Iterator for Paragraphs<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<Self::Item> {
        // Find the start of the paragraph.
        let mut pos = loop {
            if self.s.is_empty() {
                return None
            }
            let (line, rest) = split_first_line(self.s);
            if line.chars().all(char::is_whitespace) {
                // Discard blank line.
                self.s = rest;
            } else {
                // Found a non-blank line.
                break line.len();
            }
        };

        // Find the end of the paragraph.
        loop {
            let (line, rest) = split_first_line(&self.s[pos..]);
            if line.chars().all(char::is_whitespace) {
                // Blank line or empty line: end of paragraph.
                let result = &self.s[..pos];
                self.s = rest;
                return Some(result);
            }
            // Non-blank line: continue looping.
            pos += line.len();
        }
    }
}

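/// Splits `s` after its first '\n' (the newline stays with the first line),
/// or returns (s, "") if there is no newline.
fn split_first_line(s: &str) -> (&str, &str) {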
    let len = match s.find('\n') {
        Some(i) => i + 1,
        None => s.len(),
    };
    s.split_at(len)
}

Edit: Made various improvements to readability.

This implementation fails some of my tests that check it mirrors the behavior of lines. (Playground)

I wanted to match the behavior of lines as closely as possible, to avoid any surprises when migrating from it. That's also why I opted to wrap it (apart from laziness).

In this case I am not sure whether it's better to use a little unsafe code or to re-implement the underlying lines logic.


Test file: split-paragraphs/tests/tests.rs in lubomirkurcak/split-paragraphs on GitHub

You should lean towards re-implementing the underlying logic over using unsafe. In general, you're better off with machine-verified safety than with human-verified safety: a mistake made by whoever verifies the safety leads to UB, and the machine won't make the kind of one-off mistakes a human will.

Plus, if the machine verifier has bugs in it, causing it to consistently make mistakes, those can be fixed so that when you upgrade the toolchain, you get told about the safety hole; human verification tends not to be rechecked except on changes.
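A possible middle ground, sketched under assumptions: keep wrapping lines, but recover each line's byte offset from its pointer and re-slice the original string with ordinary indexing. The names offset_in and paragraphs are hypothetical, this is not the crate's code, and it collects into a Vec rather than being a lazy iterator. A wrong offset here would cause a panic, not UB.

```rust
/// Byte offset of `part` within `whole`.
/// Assumes `part` is a subslice of `whole` (true for slices yielded by `whole.lines()`).
fn offset_in(whole: &str, part: &str) -> usize {
    part.as_ptr() as usize - whole.as_ptr() as usize
}

/// Collects paragraphs separated by empty or all-whitespace lines.
/// Line endings are handled by `lines`, so both "\n" and "\r\n" work.
fn paragraphs(text: &str) -> Vec<&str> {
    let mut out = Vec::new();
    // Byte range (start, end) of the paragraph currently being built.
    let mut current: Option<(usize, usize)> = None;

    for line in text.lines() {
        if line.chars().all(char::is_whitespace) {
            // Blank line: close the open paragraph, if any.
            if let Some((start, end)) = current.take() {
                out.push(&text[start..end]);
            }
        } else {
            // Non-blank line: start a paragraph or extend the open one.
            let start = offset_in(text, line);
            let end = start + line.len();
            current = Some(match current {
                Some((first, _)) => (first, end),
                None => (start, end),
            });
        }
    }

    // Flush the last paragraph if the text did not end with a blank line.
    if let Some((start, end)) = current {
        out.push(&text[start..end]);
    }
    out
}
```

Because each range ends where its last line ends, the trailing line terminator is not part of the returned paragraph, which matches how lines strips it.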
