Split a bytes vec by a sequence of chars

I want to extract the payload of an HTTP request that I receive as a vector of bytes.
In the request, the payload is separated from the headers by the sequence \r\n\r\n, so I want to split my vec at that position and take the second part.

My current solution is to use the following function I wrote.

fn find_payload_index(buffer: &Vec<u8>) -> usize {
    for (pos, e) in buffer.iter().enumerate() {
        if pos < 3 {
            continue
        }
        if buffer[pos - 3] == 13 && buffer[pos - 2] == 10 && buffer[pos - 1] == 13 && buffer[pos] == 10 {
            return pos + 1;
        }
    }
    0
}

13 is the ASCII value of \r and 10 is the value of \n. I then split at the returned index. This solution works, but it feels unclean, and I was wondering how to do it more elegantly.

Perhaps something like

fn find_payload_index(buffer: &[u8]) -> Option<usize> {
    buffer
        .windows(4)
        .enumerate()
        .find(|&(_, w)| matches!(w, b"\r\n\r\n"))
        .map(|(ix, _)| ix + 4)
// or if you want to have the "if nothing found: return 0" behavior, you can instead
// use a `.map_or(0, |(ix, _)| ix + 4)` call, and change the return type back
}

looks cleaner.

With the itertools crate, you can be even more concise

use itertools::Itertools;
fn find_payload_index(buffer: &[u8]) -> Option<usize> {
    buffer
        .windows(4)
        .find_position(|&w| matches!(w, b"\r\n\r\n"))
        .map(|(ix, _)| ix + 4)
}

You don't need itertools.

fn find_payload_index(buffer: &[u8]) -> Option<usize> {
    buffer
        .windows(4)
        .position(|w| matches!(w, b"\r\n\r\n"))
        .map(|ix| ix + 4)
}
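
To get the payload itself rather than just the index, the result can be passed to `split_at`. Here is a minimal sketch using only std; `split_payload` is a hypothetical helper name, not something from the posts above, and it returns `None` when no separator is found instead of falling back to index 0.

```rust
/// Index of the first byte after the `\r\n\r\n` separator, if present.
fn find_payload_index(buffer: &[u8]) -> Option<usize> {
    buffer
        .windows(4)
        .position(|w| matches!(w, b"\r\n\r\n"))
        .map(|ix| ix + 4)
}

/// Hypothetical helper: splits a raw request into (head, payload).
/// `head` still includes the trailing `\r\n\r\n` separator.
fn split_payload(buffer: &[u8]) -> Option<(&[u8], &[u8])> {
    find_payload_index(buffer).map(|ix| buffer.split_at(ix))
}
```

Calling `split_payload(&request_bytes)` on a `Vec<u8>` works as well, because `&Vec<u8>` coerces to `&[u8]`.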

I've seen this solution several times for this kind of question. I'm still curious whether there are any KMP-based implementations of split/find in Rust's std (probably not) or among popular crates, since KMP has better asymptotics (n + m vs. n*m).
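
For reference, a hand-rolled KMP search over byte slices is fairly short. The following is only a sketch under my own assumptions: `failure_table` and `kmp_find` are hypothetical names, not functions from std or any crate. It runs in O(n + m) time; for a fixed 4-byte needle like \r\n\r\n the `windows` approach is likely fine in practice.

```rust
/// Builds the KMP failure table: `table[i]` is the length of the longest
/// proper prefix of `needle[..=i]` that is also a suffix of it.
fn failure_table(needle: &[u8]) -> Vec<usize> {
    let mut table = vec![0; needle.len()];
    let mut k = 0;
    for i in 1..needle.len() {
        // Fall back to shorter borders until the next byte matches.
        while k > 0 && needle[i] != needle[k] {
            k = table[k - 1];
        }
        if needle[i] == needle[k] {
            k += 1;
        }
        table[i] = k;
    }
    table
}

/// Returns the start index of the first occurrence of `needle` in `haystack`.
/// An empty needle is treated as matching at index 0.
fn kmp_find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return Some(0);
    }
    let table = failure_table(needle);
    let mut k = 0; // number of needle bytes currently matched
    for (i, &byte) in haystack.iter().enumerate() {
        while k > 0 && byte != needle[k] {
            k = table[k - 1];
        }
        if byte == needle[k] {
            k += 1;
        }
        if k == needle.len() {
            return Some(i + 1 - k);
        }
    }
    None
}
```

The payload index from the earlier answers would then be `kmp_find(buffer, b"\r\n\r\n").map(|ix| ix + 4)`.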

aww, so many methods in Iterator.. they could've mentioned that position is also a thing in the docs of find_position.

I'm not sure exactly how it's implemented, but the regex crate supports searching slices of u8.


There's a state-of-the-art implementation of the Aho-Corasick algorithm in Rust.


I was just finishing up an example of that; it's a finite-automaton-based implementation, so it doesn't ever scan an input character more than once:

/// Returns a tuple (header, body)
fn split_request<'a>(req: &'a [u8]) -> (&'a [u8], Option<&'a [u8]>) {
    use regex::bytes::Regex;
    let re = Regex::new(r"\r\n\r\n").unwrap();
    let mut split = re.splitn(req, 2);
    (split.next().unwrap(), split.next())
}

