Something like MagicalBufferDelimiter

Hello,

Is there a crate that can do the job of MagicalBufferDelimiter in the algorithm below?

    let mut file = std::fs::File::open("a_large_text_file_with_delimiters").unwrap();
    let delim = "--delimited text--\r\n";
    let mut reader = MagicalBufferDelimiter::new(&mut file, delim);
    let mut buf = [0; 1024];
    let mut list_of_delimited_text = Vec::new();
    let mut working_string = String::new();
    
    let mut res = reader.read(&mut buf);
    loop {
        match res {
            Err(err) => {
                panic!("Error while reading : {err}");
            }
            Ok(sz) if sz == 0 => {
                break;
            }
            Ok(sz) => {
                working_string.push_str( from_utf8(&buf[0..sz]).unwrap() );
                if reader.has_delimiter_or_eof() {
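                    // mem::take moves the finished chunk out and leaves an empty String,
                    // so working_string stays usable on the next iteration.
                    list_of_delimited_text.push(std::mem::take(&mut working_string));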
                }
            }
        }
        res = reader.read(&mut buf);
    }

I'm trying to implement this myself, but it's giving me a hard time.

Did you check str::split_terminator()? Does that suit your use case?

Unless it's literally gigabytes, it's faster to simply read the whole content to a string and split that, but otherwise it's somewhat tough.
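For the read-everything case, a minimal sketch could look like this. The function names split_whole_file and split_text are made up for illustration; fs::read_to_string and str::split_terminator are standard library calls:

```rust
use std::fs;
use std::io;

/// Hypothetical helper: reads the whole file into memory, then splits it.
/// Only reasonable when the file comfortably fits in RAM.
fn split_whole_file(path: &str, delim: &str) -> io::Result<Vec<String>> {
    let content = fs::read_to_string(path)?;
    Ok(split_text(&content, delim))
}

/// Splits already-loaded text on `delim`.
/// split_terminator does not yield a trailing empty piece
/// when the text ends with the delimiter.
fn split_text(content: &str, delim: &str) -> Vec<String> {
    content.split_terminator(delim).map(str::to_owned).collect()
}
```

Note that read_to_string validates UTF-8 for the whole file up front, so there is no split-code-point problem with this approach.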

Implementing this as a method currently requires manually implementing Iterator, which is a pain. You'd probably want to wrap the reader in std::io::BufReader so that it handles the buffering, then peek into its buffer to look for the separator, but keeping all the state straight would still be a pain.

The crate you're looking for seems to be:

But it seems unmaintained, so you might want to just use it as a reference.

Since you just collect data until the delimiter is found, you can make use of BufRead to extend one blob at a time until the delimiter is found, as opposed to a fixed-size buffer that may get filled with half of the delimiter.
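One way to sketch that idea (this is an illustration, not the code of any crate; read_chunk and collect_chunks are made-up names): call BufRead::read_until on the last byte of the delimiter, and after each call check whether the accumulated chunk ends with the full delimiter.

```rust
use std::io::{self, BufRead};

/// Reads one delimited chunk into `chunk`, with the delimiter stripped.
/// Returns Ok(true) if the delimiter was found, Ok(false) if EOF came first
/// (in which case `chunk` holds whatever trailing data was left).
fn read_chunk<R: BufRead>(reader: &mut R, delim: &[u8], chunk: &mut Vec<u8>) -> io::Result<bool> {
    chunk.clear();
    let last = *delim.last().expect("delimiter must not be empty");
    loop {
        // Appends bytes up to and including `last`, or up to EOF.
        let n = reader.read_until(last, chunk)?;
        if n == 0 {
            return Ok(false); // EOF
        }
        if chunk.ends_with(delim) {
            chunk.truncate(chunk.len() - delim.len());
            return Ok(true);
        }
        // `last` showed up outside a full delimiter: keep extending the chunk.
    }
}

/// Collects every chunk as a String. UTF-8 is validated once per complete
/// chunk, so a code point split across two reads is never a problem.
fn collect_chunks<R: BufRead>(mut reader: R, delim: &[u8]) -> io::Result<Vec<String>> {
    let mut out = Vec::new();
    let mut chunk = Vec::new();
    loop {
        let found = read_chunk(&mut reader, delim, &mut chunk)?;
        if !found && chunk.is_empty() {
            break; // nothing left after the last delimiter
        }
        let text = String::from_utf8(std::mem::take(&mut chunk))
            .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
        out.push(text);
        if !found {
            break;
        }
    }
    Ok(out)
}
```

The trade-off is that a chunk grows until its delimiter is found, so memory use is bounded by the largest chunk rather than by a fixed buffer size.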

Oh, and something else occurred to me.

    working_string.push_str( from_utf8(&buf[0..sz]).unwrap() );

This will panic if a read happens to end in the middle of a multi-byte UTF-8 code point, even if the input as a whole is valid UTF-8. Dealing with split code points is a similar challenge to dealing with split delimiters (albeit limited to a few bytes in length).

(In contrast the playground I posted defers UTF-8 checks until an entire delimited chunk has been read.)
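If you do want to keep a fixed-size buffer, one option is to carry the incomplete tail over to the next read. A minimal sketch, assuming a hypothetical helper named drain_valid_utf8 (it is not the playground code); it relies on Utf8Error::valid_up_to and Utf8Error::error_len from the standard library:

```rust
use std::str;

/// Moves the valid UTF-8 prefix of `pending` into `out`.
/// An incomplete code point at the very end stays in `pending`, so the caller
/// can append the next read to it and call this again.
/// Bytes that can never become valid UTF-8 produce an error.
fn drain_valid_utf8(pending: &mut Vec<u8>, out: &mut String) -> Result<(), str::Utf8Error> {
    let valid = match str::from_utf8(&pending[..]) {
        Ok(_) => pending.len(),
        // error_len() == None means "unexpected end of input":
        // the tail may be completed by the next read.
        Err(e) if e.error_len().is_none() => e.valid_up_to(),
        Err(e) => return Err(e),
    };
    // The prefix was validated just above; checking it again keeps the sketch
    // free of unsafe code at the cost of a second pass.
    out.push_str(str::from_utf8(&pending[..valid]).expect("prefix already validated"));
    pending.drain(..valid);
    Ok(())
}
```

In the read loop you would extend `pending` with `&buf[..sz]` after each read and then call the helper, instead of converting `&buf[..sz]` directly. At most 3 bytes are ever carried over, since a UTF-8 code point is at most 4 bytes long.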

Thanks! I missed it! It's a little bit old, but it may do the job. I will have a look!

Ha, it's a really good algorithm that works perfectly for my example. But my example isn't representative of my exact need. A big thank you, and sorry for the misunderstanding; it's my fault.
In fact, my input file looks more like this:

fn seed() {
    std::fs::write(
        FILE,
        "-- FILE DELIMITER --\r\n
        name: MyStream1.bigtext\r\n
        Some massive text to treat.....
        [...]
        -- FILE DELIMITER --\r\n
        name: MyStream2.bigtext\r\n
        Some massive text here too .....
        [...] 
    ",
    )
    .unwrap();
}

In this case:
(1) The buffer size has to be bounded (it's a multithreaded app, so memory use has to be limited).
(2) If possible, the delimiter should not appear in the returned buffer (so that it doesn't have to be matched a second time, for performance reasons).
(3) Because of (2), we need to know whether or not the delimiter was detected in the last buffer read.
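To make the three constraints concrete, here is a rough sketch of the kind of reader I have in mind. All names (DelimSplitter, Stop, fill) are hypothetical and not taken from any crate. It holds back at most delimiter-length bytes in a sliding window, so a delimiter split across two reads is still detected and never reaches the output:

```rust
use std::collections::VecDeque;
use std::io::{self, BufRead, Read};

/// Why `fill` returned.
#[derive(Debug, PartialEq)]
enum Stop {
    /// The delimiter was consumed (and dropped); the current part is complete.
    Delimiter,
    /// `out` holds `max` bytes; the current part continues on the next call.
    Full,
    /// End of input.
    Eof,
}

/// Hypothetical helper for illustration only.
struct DelimSplitter<R> {
    bytes: io::Bytes<R>,
    delim: Vec<u8>,
    /// Bytes held back because they may still turn out to be a delimiter.
    window: VecDeque<u8>,
}

impl<R: BufRead> DelimSplitter<R> {
    fn new(inner: R, delim: &[u8]) -> Self {
        assert!(!delim.is_empty(), "delimiter must not be empty");
        Self {
            // Byte-by-byte reads are only cheap because `inner` is buffered.
            bytes: inner.bytes(),
            delim: delim.to_vec(),
            window: VecDeque::with_capacity(delim.len()),
        }
    }

    /// Clears `out`, then fills it with at most `max` bytes of the current part.
    fn fill(&mut self, out: &mut Vec<u8>, max: usize) -> io::Result<Stop> {
        assert!(max > 0, "max must be at least 1");
        out.clear();
        while out.len() < max {
            match self.bytes.next() {
                None => {
                    // End of input: the held-back bytes are plain data after all.
                    while out.len() < max {
                        match self.window.pop_front() {
                            Some(b) => out.push(b),
                            None => return Ok(Stop::Eof),
                        }
                    }
                }
                Some(Err(e)) => return Err(e),
                Some(Ok(b)) => {
                    self.window.push_back(b);
                    if self.window.len() == self.delim.len() {
                        if self.window.iter().eq(self.delim.iter()) {
                            self.window.clear();
                            return Ok(Stop::Delimiter);
                        }
                        // The oldest byte cannot start a delimiter: release it.
                        out.push(self.window.pop_front().unwrap());
                    }
                }
            }
        }
        Ok(Stop::Full)
    }
}
```

Memory use is bounded by `max` plus the delimiter length, which covers (1). The delimiter is never copied into `out`, which covers (2), and the returned Stop value answers (3). The window comparison is deliberately naive (one comparison per byte once the window is full); a real implementation would probably scan the BufRead buffer in bulk instead.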

That almost looks like a MIME message; see e.g. the mailparse crate (on Lib.rs).

From what I read, mailparse doesn't return a stream over the body; it loads everything into memory directly. But your hint led me to look at multipart, which returns the body part in a buffer. It contains the mechanism I need!

Thank you all!

Hello,

I've written BufReadSplitter, which solves the problem:

   let mut reader = BufReadSplitter::new(10, &mut input_reader, "<SEP>".as_bytes());
   loop {
       // Output buffer
       let mut buf = [0u8; 100];
       loop {
           // Read the input buffer
           let sz = reader.read(&mut buf).unwrap();

           if sz > 0 {
              /* ... do something with reader ... */
           } else {
               break;
           }
       }
       match reader.next() {
           Some(Ok(_)) => {/* ... next buffer ... */}
           Some(Err(err)) => panic!("Error : {err}"),
           None => break,
       }
   }

The separator can also be changed afterward by calling reader.next_split_on("<OTHER_SEP>".as_bytes()) instead of reader.next().

https://crates.io/crates/buf_read_splitter