Need help/suggestion to implement iterator/reader

Hi,

I'm struggling to find a good pattern for the following problem: I implemented a line-based parser built on the bstr and nom crates, which processes a byte line to create a struct. For performance reasons the struct contains borrowed data. The struct looks like this:

#[derive(Debug, PartialEq, Serialize)]
pub struct Record<'a> {
    pub(crate) fields: Vec<Field<'a>>,
}

The code to process a file looks like this:

for result in reader.byte_lines() {
    let line = result?;

    if let Ok(record) = Record::from_bytes(&line) {
        // ...
    }
}

Now, I thought it would be a good idea to provide a Reader which hides the reading details and provides an error type. But that's not possible because of the lifetimes.

Are there other alternatives/patterns to an iterator?

Your mistake is probably in the use of references inside Field. If it's an actual field that stores data, not a FieldView that is a temporary reprojection of an actual field, then it shouldn't use any temporary references inside.

Temporary references in structs in 99% of cases are a horrible idea that will haunt you and make everything in Rust 10x harder or outright impossible.

In your case the temporarily borrowing Record is doomed to exist only within a single iteration of the loop. You could read the whole file ahead of time and then split by lines. Then the Record would only be doomed to live within the function that loaded the file. As long as it's chained by a lifetime, it will not be freely usable outside of the functions/scopes that allocated the data it borrows.
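
A minimal sketch of the "read everything first, then split" variant, using only std. The tab-separated field format and the `parse_all` helper are assumptions for illustration, and `Record` here holds plain byte slices rather than the `Field` type from the question:

```rust
/// Borrowed record: every field is a slice into the caller's buffer.
#[derive(Debug, PartialEq)]
pub struct Record<'a> {
    pub fields: Vec<&'a [u8]>,
}

/// Hypothetical helper: splits `data` into lines and each line into
/// tab-separated fields. The returned records borrow from `data`, so they
/// cannot outlive the buffer they were parsed from.
pub fn parse_all(data: &[u8]) -> Vec<Record<'_>> {
    data.split(|b| *b == b'\n')
        .filter(|line| !line.is_empty()) // skip the empty slice after a trailing newline
        .map(|line| Record {
            fields: line.split(|b| *b == b'\t').collect(),
        })
        .collect()
}
```

The caller would own the buffer (for example the result of `std::fs::read`) and keep it alive for as long as the records are in use. This only works when the whole input fits in memory.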

What should your Reader encapsulate?

While I agree with the challenges of lifetimes, when it comes time to rip through large data sets, borrowing needs to be considered. That may not be the case here. But in general, for any pre-processing that involves ephemeral types, think borrow. Once you have completed the processing, take ownership.

Thanks for your replies!

@kornel

I know that the references are the culprits, but I need the references for performance reasons. What I want is a Record struct which takes ownership of a byte line (Vec<u8>) and provides the parsed (nested) data (Field, Subfield, etc.). The parsed data lives as long as the record exists. But I'm not sure if that's possible; all my attempts to achieve this failed.

@EdmundsEcho

The reader should encapsulate the file processing (file, gzip or stdin), provide an iterator over the parsed records, handle flags (like skipping of invalid records) and provide an error type (parsing errors, io errors, etc.).

Most of the time a record is used read-only, so I thought using borrowed types was a reasonable trade-off. I created a second, owned type of the record for the other use cases (N.B. same problem here: I can't use the ToOwned trait because of the lifetimes).

What I want to achieve is something like this:

let reader = Reader::from_file("filename.dat.gz").unwrap();
// let reader = Reader::from_stdin().unwrap();

for record in reader.records() {
    // do something with record
}

Problem: I can't collect all records because the vector would be too large.

Borrowing doesn't solve that problem, though. If you can collect all the Records, it doesn't matter whether they own the data or refer to it elsewhere -- it still has to exist somewhere. So that problem is with collect, and by extension with implementing Iterator for Reader, not with Record owning its contents.

If Record borrows from a Vec<u8>, it should be possible to do what you want without moving the temporary line into Record. But this will be more verbose than for record in reader because what you're designing is basically a streaming iterator, which is more limited in what it can do than a regular Iterator (relevant: streaming iterators don't support collect).
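
A rough sketch of what such a streaming-style reader could look like, using only std instead of bstr/nom. The `Reader` layout, the `next_record` method and the tab-separated format are assumptions for illustration, not the original code:

```rust
use std::io::{self, BufRead};

/// Borrowed record: fields point into the line buffer owned by the reader.
#[derive(Debug, PartialEq)]
pub struct Record<'a> {
    pub fields: Vec<&'a [u8]>,
}

impl<'a> Record<'a> {
    /// Hypothetical stand-in for the real parser: splits on tab bytes.
    pub fn from_bytes(line: &'a [u8]) -> Record<'a> {
        Record {
            fields: line.split(|b| *b == b'\t').collect(),
        }
    }
}

/// Owns the input and a reusable line buffer.
pub struct Reader<R> {
    inner: R,
    buf: Vec<u8>,
}

impl<R: BufRead> Reader<R> {
    pub fn new(inner: R) -> Self {
        Reader { inner, buf: Vec::new() }
    }

    /// Returns the next record. The record borrows from `self`, so it must
    /// be dropped before `next_record` can be called again.
    pub fn next_record(&mut self) -> io::Result<Option<Record<'_>>> {
        self.buf.clear();
        if self.inner.read_until(b'\n', &mut self.buf)? == 0 {
            return Ok(None); // end of input
        }
        if self.buf.last() == Some(&b'\n') {
            self.buf.pop(); // strip the line terminator
        }
        Ok(Some(Record::from_bytes(&self.buf)))
    }
}

/// Usage: a `while let` loop takes the place of `for record in reader`.
pub fn count_fields<R: BufRead>(input: R) -> io::Result<usize> {
    let mut reader = Reader::new(input);
    let mut total = 0;
    while let Some(record) = reader.next_record()? {
        total += record.fields.len();
    }
    Ok(total)
}
```

Because each record borrows the reader's buffer, only one record can be alive at a time, which is exactly why this shape cannot implement Iterator or support collect.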

On the other hand, if Record owns its contents, then you can implement Iterator<Item = Record> for Reader, but Record shouldn't have a lifetime parameter because it doesn't borrow from anything. If you need internal self-referential pointers for "performance reasons", you'll have to use unsafe and promise to manually uphold all the aliasing / mutability rules whenever you create a reference.
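
For comparison, a sketch of the owning variant with a regular Iterator, without any self-references. `OwnedRecord`, `Records` and the tab-separated format are hypothetical names and assumptions, and each field is copied into its own Vec<u8>, which costs the allocations the borrowed version avoids:

```rust
use std::io::{self, BufRead};

/// Owned record: no lifetime parameter, every field is its own allocation.
#[derive(Debug, PartialEq)]
pub struct OwnedRecord {
    pub fields: Vec<Vec<u8>>,
}

/// Iterator adapter over any buffered input.
pub struct Records<R> {
    inner: R,
}

impl<R: BufRead> Records<R> {
    pub fn new(inner: R) -> Self {
        Records { inner }
    }
}

impl<R: BufRead> Iterator for Records<R> {
    type Item = io::Result<OwnedRecord>;

    fn next(&mut self) -> Option<Self::Item> {
        let mut line = Vec::new();
        match self.inner.read_until(b'\n', &mut line) {
            Ok(0) => None, // end of input
            Ok(_) => {
                if line.last() == Some(&b'\n') {
                    line.pop(); // strip the line terminator
                }
                // Copy each tab-separated field out of the line.
                let fields = line
                    .split(|b| *b == b'\t')
                    .map(|field| field.to_vec())
                    .collect();
                Some(Ok(OwnedRecord { fields }))
            }
            Err(e) => Some(Err(e)),
        }
    }
}
```

With this shape, `for record in Records::new(input)` works and the records can be moved, stored or collected freely.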

What are you doing with the nested fields/struct? Specifically, is there a reducing computation or are you sending the struct to another process? Knowing how the struct is consumed and the relative size compared to the input will help inform the design.

An alternative to self-references is to store ranges in Field, and reify those into short-lived references to the owned buffer on demand.
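
A minimal sketch of that idea, assuming tab-separated fields. Here `Record` owns the line, stores one index range per field, and hands out slices only when asked (`from_line`, `field` and `len` are hypothetical names):

```rust
use std::ops::Range;

/// Owns the raw line plus the byte range of each field within it.
pub struct Record {
    line: Vec<u8>,
    fields: Vec<Range<usize>>,
}

impl Record {
    /// Takes ownership of the line and records where each field starts and ends.
    pub fn from_line(line: Vec<u8>) -> Record {
        let mut fields = Vec::new();
        let mut start = 0;
        for (i, b) in line.iter().enumerate() {
            if *b == b'\t' {
                fields.push(start..i);
                start = i + 1;
            }
        }
        fields.push(start..line.len()); // last field, possibly empty
        Record { line, fields }
    }

    /// Number of fields in this record.
    pub fn len(&self) -> usize {
        self.fields.len()
    }

    /// Turns the stored range into a short-lived slice of the owned buffer.
    pub fn field(&self, index: usize) -> Option<&[u8]> {
        let range = self.fields.get(index)?;
        Some(&self.line[range.clone()])
    }
}
```

Since this `Record` has no lifetime parameter, it can be the Item of an ordinary Iterator, and the only per-record allocations are the line itself and the vector of ranges.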