Custom iterator over binary file

Hello,

I found this beautiful example of creating an iterator from a Vec or a slice.

I'm working with a binary file that may be larger than memory, so it seems I need an iterator over the file that returns one item/record at a time. I can then use map or similar to process each item/record.

So, without loading the whole file in memory, how can I do this?
The binary file layout is as follows:

len ....... len ......... len ........ etc.
Each item/record starts with 4 bytes holding its length (the length includes those 4 bytes).

Please help with suggestions, ideas.
I've made a lot of progress on other aspects of my project but I'm stuck with the creation of this custom binary file iterator.

You can probably write a serde deserializer for that.

Here's a possible implementation.
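The linked code isn't reproduced in this thread, so what follows is not the original, just a minimal sketch of what such an iterator might look like. The `RecordIter` name and its `reader` field are taken from the snippets later in the thread; the little-endian byte order of the length is an assumption, since the layout description doesn't say.

```rust
use std::io::{self, Read};

/// Iterates over length-prefixed records from any reader.
/// Assumed layout: a 4-byte length (counting itself), then the payload.
pub struct RecordIter<R> {
    pub reader: R,
}

impl<R: Read> Iterator for RecordIter<R> {
    type Item = io::Result<Vec<u8>>;

    fn next(&mut self) -> Option<Self::Item> {
        // Fill the 4-byte prefix by hand so that a clean end of input
        // (zero bytes available) can be told apart from a truncated prefix.
        let mut len_bytes = [0u8; 4];
        let mut filled = 0;
        while filled < 4 {
            match self.reader.read(&mut len_bytes[filled..]) {
                Ok(0) if filled == 0 => return None,
                Ok(0) => return Some(Err(io::ErrorKind::UnexpectedEof.into())),
                Ok(n) => filled += n,
                Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
                Err(e) => return Some(Err(e)),
            }
        }

        // Assumption: the length is stored little-endian.
        let total = u32::from_le_bytes(len_bytes) as usize;
        let payload_len = match total.checked_sub(4) {
            Some(n) => n,
            None => {
                return Some(Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    "record length is smaller than its own 4-byte prefix",
                )))
            }
        };

        // Only one record's payload is held in memory at a time.
        let mut payload = vec![0u8; payload_len];
        match self.reader.read_exact(&mut payload) {
            Ok(()) => Some(Ok(payload)),
            Err(e) => Some(Err(e)),
        }
    }
}

fn main() {
    // Two records with total lengths 6 and 5; a byte slice stands in for a file.
    let data: &[u8] = &[6, 0, 0, 0, 0xAA, 0xBB, 5, 0, 0, 0, 0xCC];
    let records = RecordIter { reader: data };
    for record in records {
        match record {
            Ok(bytes) => println!("{:?}", bytes),
            Err(e) => {
                eprintln!("read error: {e}");
                break;
            }
        }
    }
}
```

Each item is an `io::Result`, so a caller can decide whether to stop or skip on a read error. Note that the returned `Vec` holds only the payload, not the 4 length bytes.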


I'm afraid not; I'm already using a parser for the data.
I just need to split the binary file into items I can iterate over, so that I can support larger-than-memory files. I expect most of the files I encounter to be hundreds of gigabytes.

@H2CO3 I don't understand how/what I must pass to bytes in this example.

In my main function...

    let mut file = File::open(FILENAME)?;

    let mut two_bytes: [u8; 2] = [0; 2];
    file.read_exact(&mut two_bytes).expect("first two bytes error");

    let file_struct = RecordIter { reader: two_bytes };

    for item in file_struct {
      dbg!(item);
    };

It says:

the trait bound `[u8; 2]: std::io::Read` is not satisfied [E0277]
Help: the trait `std::io::Read` is implemented for `&[u8]`
Note: required for `impls::RecordIter<[u8; 2]>` to implement `std::iter::Iterator`
Note: required for `impls::RecordIter<[u8; 2]>` to implement `std::iter::IntoIterator`
Help: convert the array to a `&[u8]` slice instead

impls:: here refers to my impls.rs, where I have placed the Iterator impl.

EDIT: Is this right... bytes is a slice containing the whole file?
How do I make a File into a &[u8] without having to load the whole file in memory?

You can pass any reader (type that implements the Read trait). You don't need to pass a slice. That was just an example. In reality, you'd simply do

    let file = File::open(FILENAME)?;
    let file_struct = RecordIter { reader: file };

But this could already have been inferred from the fact that the reader field has type R and there's a Read bound on it.

The whole point of this iterator is that you don't need to manually perform any reading upfront.
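To illustrate what the `Read` bound allows, here is a small sketch. `read_len_prefix` is a hypothetical helper, not something from the thread, and the little-endian length and the temp-file name are assumptions. The same generic function accepts a `File`, a `BufReader<File>`, a `Cursor`, or a `&[u8]`, while a plain array such as `[u8; 2]` is rejected because it doesn't implement `Read`.

```rust
use std::fs::File;
use std::io::{self, BufReader, Cursor, Read};

// Hypothetical helper: reads one 4-byte length prefix from any reader.
// Assumption: the length is stored little-endian.
fn read_len_prefix<R: Read>(reader: &mut R) -> io::Result<u32> {
    let mut len_bytes = [0u8; 4];
    reader.read_exact(&mut len_bytes)?;
    Ok(u32::from_le_bytes(len_bytes))
}

fn main() -> io::Result<()> {
    // Write a tiny sample file so the example runs on its own.
    let path = std::env::temp_dir().join("read_bound_demo.bin");
    std::fs::write(&path, [6u8, 0, 0, 0, 0xAA, 0xBB])?;

    // A File is a reader; nothing is loaded up front.
    let mut file = File::open(&path)?;
    println!("File: {}", read_len_prefix(&mut file)?);

    // BufReader is also a reader and cuts down on small system calls.
    let mut buffered = BufReader::new(File::open(&path)?);
    println!("BufReader: {}", read_len_prefix(&mut buffered)?);

    // In-memory readers, handy for tests and the playground.
    let mut cursor = Cursor::new(vec![6u8, 0, 0, 0]);
    println!("Cursor: {}", read_len_prefix(&mut cursor)?);
    let mut slice: &[u8] = &[6, 0, 0, 0];
    println!("slice: {}", read_len_prefix(&mut slice)?);

    std::fs::remove_file(&path)
}
```

For files in the hundreds of gigabytes, wrapping the `File` in a `BufReader` before handing it to the iterator is probably worth it, since each record otherwise costs at least two small reads.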

Yes, absolutely.
Thanks again for the explanation.
I realized that you showed a slice because it's usable in the Rust playground... after I posted the question :slight_smile:

Truly, truly... thank you! Your ten minutes means a whole lot to me.


It seems the returned Vec is missing the first four length bytes themselves.
Trying to figure it out (my brain's CPU is spinning...) :slight_smile:

The length bytes have already been read (that is, consumed from the reader) by the time the buffer is constructed. They have to be, otherwise it would be impossible to know the length.

Your problem was not precisely specified, but if you need those bytes, you can simply make a bigger vector and prepend those 4 bytes to the front of the buffer. (Like this.)
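Since the linked sample isn't shown here, this is only a sketch of the idea, not the original code. `read_record_with_prefix` is a hypothetical name, and the little-endian length is again an assumption. The buffer is allocated at the full record length, the 4 length bytes are copied to the front, and the payload is read into the rest.

```rust
use std::io::{self, Read};

// Hypothetical helper: returns one whole record, length prefix included,
// or None once the input is exhausted.
fn read_record_with_prefix<R: Read>(reader: &mut R) -> io::Result<Option<Vec<u8>>> {
    let mut len_bytes = [0u8; 4];
    match reader.read_exact(&mut len_bytes) {
        Ok(()) => {}
        // Simplification: any EOF while reading the prefix counts as end of input,
        // so a truncated prefix is not reported as an error here.
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }

    // Assumption: little-endian length that includes the 4 prefix bytes.
    let total = u32::from_le_bytes(len_bytes) as usize;
    if total < 4 {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "record length is smaller than its own 4-byte prefix",
        ));
    }

    // Allocate the full record, put the prefix at the front,
    // then read the payload directly into the remainder.
    let mut record = vec![0u8; total];
    record[..4].copy_from_slice(&len_bytes);
    reader.read_exact(&mut record[4..])?;
    Ok(Some(record))
}

fn main() -> io::Result<()> {
    let mut input: &[u8] = &[6, 0, 0, 0, 0xAA, 0xBB, 5, 0, 0, 0, 0xCC];
    while let Some(record) = read_record_with_prefix(&mut input)? {
        println!("{:?}", record);
    }
    Ok(())
}
```

Reading straight into `record[4..]` avoids building a payload vector first and then copying it into a bigger one.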

Think I got it working (also just noticed your edit with a sample).
Trust me, your help is invaluable.
May you always be healthy & content in life :pray:

Almost forgot this... so at the other end, after something crunches these Items, is there a way to write/flush the output into a file (appending as it goes), rather than having to collect everything into a Vec (memory)?

Is Write::write_all() not sufficient?
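A rough sketch of how that could look, assuming each processed item ends up as a `Vec<u8>`. `write_all_items` and the output file name are hypothetical. Items are written one at a time through a `BufWriter`, so nothing needs to be collected first, and `OpenOptions` with `append(true)` makes the file grow at the end rather than being overwritten.

```rust
use std::fs::OpenOptions;
use std::io::{self, BufWriter, Write};

// Hypothetical helper: streams each item to the writer as it arrives.
fn write_all_items<W, I>(out: W, items: I) -> io::Result<()>
where
    W: Write,
    I: IntoIterator<Item = Vec<u8>>,
{
    // Buffering batches many small writes into fewer system calls.
    let mut out = BufWriter::new(out);
    for item in items {
        out.write_all(&item)?;
    }
    // Flush explicitly so a write error is reported instead of lost on drop.
    out.flush()
}

fn main() -> io::Result<()> {
    // Hypothetical output location; append mode adds to the end of the file.
    let path = std::env::temp_dir().join("processed_records.bin");
    let file = OpenOptions::new().create(true).append(true).open(&path)?;

    // Stand-in for the processed items; in practice this would be an
    // iterator adapter chain over the record iterator.
    let processed = vec![vec![1u8, 2, 3], vec![4, 5]].into_iter();
    write_all_items(file, processed)?;
    Ok(())
}
```

Because `items` is any `IntoIterator`, a lazy chain such as a `map` over the record iterator can be passed directly, and only one item is in memory at a time.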

It probably will be... just that I didn't know about it. Will try it out.
I've found the docs.rs site layout quite confusing.
So I've never really learnt to properly understand what capabilities are available in any given crate for example.
