Iterate over varying size chunks of binary data

I want to parse a big binary file (tens to potentially hundreds of gigabytes) efficiently, so I'm looking at parallelism (order of execution and completion doesn't matter).

The structure of the file is as follows:
The first 2 bytes hold the length of a logical record, including those 2 bytes themselves.
At position '0 + length of the first logical record', take 2 bytes to find the length of the next logical record,
and so on.

Example data:
UPDATE: included a few more bytes to show where the record type info is.

```
00 12 00 00 nn ...  record runs to 18 bytes total (0x12 = 18); 'nn' is the record type
7F F0 00 00 nn ...  record runs to 32_752 bytes total (0x7FF0 = 32752); 'nn' is the record type
```

Does the below pseudocode make sense? Ideally, I want this to be iterable, so that I can use Rayon to handle the parallelisation.

```
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

let mut f = File::open("file.bin")?;
let file_len = f.metadata()?.len();
let mut pos = 0u64;
let mut buf = [0u8; 2];
while pos < file_len {
    f.seek(SeekFrom::Start(pos))?;
    f.read_exact(&mut buf)?;
    // the length prefix is big-endian and includes its own 2 bytes
    let recsize = u16::from_be_bytes(buf);
    let mut recbuf = vec![0u8; recsize as usize - 2];
    f.read_exact(&mut recbuf)?;
    // parallelize below block somehow?
    // because I will be parsing the data + serde_json-ing the parsed data
    let record = Record::read_from(&mut recbuf.as_slice());
    pos += recsize as u64;
}
```

If you are reading a single file, I would not try to parallelize that. The only exception is if processing the data takes longer than the I/O itself, and that is rarely the case. Even when it is, I would not parallelize the I/O itself, but the processing of that I/O.

I would also not use seek unless there are large parts of the file you don't need at all.
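Reading sequentially, every read_exact already advances the cursor, so the seek is redundant. A minimal sketch of the same loop without it (assuming, as you said, the 2-byte length includes itself; error handling for malformed lengths omitted):

```
use std::fs::File;
use std::io::{BufReader, Read};

let f = File::open("file.bin")?;
let mut reader = BufReader::new(f);
let mut len_buf = [0u8; 2];
// read_exact fails with UnexpectedEof at the end of the file
while reader.read_exact(&mut len_buf).is_ok() {
    let rec_len = u16::from_be_bytes(len_buf) as usize;
    // the length includes the 2-byte prefix itself
    let mut body = vec![0u8; rec_len - 2];
    reader.read_exact(&mut body)?;
    // hand `body` off for parsing here
}
```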

I recommend using three backticks for code blocks, as they have syntax highlighting:

```
// your code here
```

In particular, please don't let stray curly braces fall outside the code block.

Hey Alice,

Whoops, sorry about the stray bracket. I'll use the three backticks from now on.

Ok, I'll do the I/O sequentially.
The read_from function within the Record struct reads some values (u32, u8, etc.) as-is from the input BufRead; however, there are slices on which I will be performing transformations (say codepage conversions, or unpacking some packed field).
Currently, read_from returns a struct. I'm looking to add some code at the end of the function to convert that struct into externally tagged JSON. So... will Record::read_from be worth parallelizing at least then/now? We're talking about a million-plus logical records.
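For context, by externally tagged I mean roughly this shape (field names here are just placeholders; serde's default enum representation is already externally tagged):

```
use serde::Serialize;

// serde's default enum representation is externally tagged,
// i.e. the variant name wraps the payload
#[derive(Serialize)]
enum Record {
    TypeA { some_field: u32, text: String },
    TypeB { packed_value: u8 },
}

let rec = Record::TypeA { some_field: 42, text: "hi".into() };
println!("{}", serde_json::to_string(&rec).unwrap());
// prints: {"TypeA":{"some_field":42,"text":"hi"}}
```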

Also, any comments on the above method of splitting out logical records from a single binary file?
Ok, so I can't use seek. What else can I use? I have to split the single binary file into logical records, and then do a match to parse each different logical record type differently.

One option is to use memory-mapped files (e.g. with memmap, sketch below); it's one of the most efficient ways to process large files. With this approach your file will look like a simple memory buffer, and the OS will handle caching and reading data for you. But it has several major disadvantages:

  • If the file is no longer available (e.g. the drive with the file got disconnected), your application will simply abort execution instead of giving you an error (it's possible to catch and process the failure signal, but it's really difficult to do correctly, and borderline impossible for a library).
  • Data can change under your nose (e.g. if another process writes to the memory-mapped file), which does not play well with Rust's aliasing guarantees. For example, if you have converted a &[u8] to a &str by checking that it's indeed a valid UTF-8 string, strictly speaking you can't return this &str safely, since the underlying buffer may change in the future.

So this approach works poorly for a library crate, but sometimes can be quite convenient for an application.
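If you do want to try it, the sketch is short. This assumes the memmap crate; the unsafe block exists precisely because of the aliasing caveat above:

```
use std::fs::File;
use memmap::Mmap;

let file = File::open("file.bin")?;
// Safety: we must assume no other process mutates the file while mapped
let mmap = unsafe { Mmap::map(&file)? };
let bytes: &[u8] = &mmap;
// `bytes` can now be sliced into records like any in-memory buffer
println!("mapped {} bytes", bytes.len());
```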

Hi,

Thanks for the suggestion. At the moment, I'm looking to keep it simple, as I'm really a beginner.
memmap would be going hardcore too soon, before I even have an MVP.

I was able to achieve the logic I intended above with this. Vec is iterable, so I made a Vec of Vecs.

```
use rayon::prelude::*;

// smf is an array of the file's bytes - [u8; number]
let smf_bytes = smf.len();
let mut pos = 0;
let mut file_vec: Vec<Vec<u8>> = Vec::new();
while pos < smf_bytes {
    let rec_len = u16::from_be_bytes(smf[pos..pos + 2].try_into().unwrap());
    let vec_size = rec_len as usize;
    // println!("pos:{} rec_len:{} vec_size:{}", pos, rec_len, vec_size);
    let rec_vec = smf[pos..pos + vec_size].to_vec();
    file_vec.push(rec_vec);
    pos += vec_size;
}
file_vec.par_iter().for_each(|rec| do_it(rec));
```

This assumes the whole file fits in memory. I will have to refactor this to fill file_vec up to a specific size, call par_iter, wait for it to return, empty file_vec, and then proceed with the next batch, roughly as sketched below.
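Roughly what I have in mind for the batching (BATCH_BYTES is an arbitrary cap I made up; records is whatever serial source yields the record Vecs):

```
use rayon::prelude::*;

const BATCH_BYTES: usize = 256 * 1024 * 1024; // arbitrary cap on batch size
let mut batch: Vec<Vec<u8>> = Vec::new();
let mut batch_size = 0;
for rec in records {
    batch_size += rec.len();
    batch.push(rec);
    if batch_size >= BATCH_BYTES {
        // process this batch in parallel, then reuse the allocation
        batch.par_iter().for_each(|r| do_it(r));
        batch.clear();
        batch_size = 0;
    }
}
// leftover records from the last partial batch
batch.par_iter().for_each(|r| do_it(r));
```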

Also, I guess I can avoid reading the whole file into file_vec, and instead use f.read_exact_at(target_vec, offset) when productionizing, since I know my program will be running in *nix environments.
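Something like this, if I understand the FileExt API right (Unix-only; read_record is just a helper name I made up):

```
use std::fs::File;
use std::os::unix::fs::FileExt;

fn read_record(f: &File, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; len];
    // read_exact_at takes &self and an absolute offset, so no seek is needed
    f.read_exact_at(&mut buf, offset)?;
    Ok(buf)
}
```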

Thoughts?
If this is hacky, please let me know if there's a more elegant way.
Thank you!

I can think of a few options, but in all of them you need to hope that your computation has some weight compared to I/O, or it won't help you much.

  • Serially skim the file to collect just (offset, length) of each record, then use par_iter() to read and process them.
  • Serially read the records and spawn them for further processing. Wrap everything in a scope first if you need to borrow shared data.
  • Create a serial Iterator producing records, and use par_bridge() to parallelize further processing (see the sketch after this list).
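For instance, option 3 can stay very close to a plain serial loop. A sketch, assuming reader is some BufRead over the file and do_it is your processing function:

```
use std::io::Read;
use rayon::prelude::*; // brings ParallelBridge into scope

// records are produced serially by this iterator...
let records = std::iter::from_fn(move || {
    let mut len_buf = [0u8; 2];
    reader.read_exact(&mut len_buf).ok()?;
    let len = u16::from_be_bytes(len_buf) as usize;
    let mut rec = vec![0u8; len - 2];
    reader.read_exact(&mut rec).ok()?;
    Some(rec)
});

// ...but processed in parallel
records.par_bridge().for_each(|rec| do_it(&rec));
```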

Okay, option 1 seems closest to what I already have.
Can you please show an example of how I should go about implementing it?
Skim the whole file for offsets in a while loop (like above); then, once the whole file has been scanned for offsets, pass them (assuming I give it a Vec of (offset, length) tuples) to a par_iter?

I'm leaning towards the fewest modifications to what I already have, because at the moment it's just a hunch that the processing will be heavy (each logical record, max 32 kbytes in length, is a mish-mash of codepage, binary, and hex data). Hundreds of different logical record types, and tens of thousands of such logical records...

Yes, and then each pair can use read_exact_at to get its actual data.
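A sketch of the whole two-pass shape (process_record is a placeholder for your parsing + JSON step, assumed to return io::Result<()>):

```
use std::fs::File;
use std::os::unix::fs::FileExt;
use rayon::prelude::*;

let f = File::open("file.bin")?;
let file_len = f.metadata()?.len();

// Pass 1: serially skim the file for (offset, length) pairs
let mut index: Vec<(u64, usize)> = Vec::new();
let mut pos = 0u64;
let mut len_buf = [0u8; 2];
while pos < file_len {
    f.read_exact_at(&mut len_buf, pos)?;
    let rec_len = u16::from_be_bytes(len_buf) as usize;
    index.push((pos, rec_len));
    pos += rec_len as u64;
}

// Pass 2: read and process the records in parallel;
// read_exact_at takes &self, so sharing &f across threads is fine
index.par_iter().try_for_each(|&(offset, len)| {
    let mut rec = vec![0u8; len];
    f.read_exact_at(&mut rec, offset)?;
    process_record(&rec)
})?;
```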

It's best if you can get some measurements to support your hunch. Do you already have a working single-threaded program? If you `time your-program`, how do the real/user/sys times look? Parallelizing can only distribute the user time. If the real/sys times are much larger, that's probably I/O.

I'm currently testing on Windows, so I can't try either of these (read_exact_at or time).

EDIT: Please suggest, @cuviper :slight_smile:

Hey, a quick question -

So I open a file in, let's say, main(), and then I can call par_iter on the Vec of (offset, length) tuples and pass f to the function? (Or should it be &f?)
If you have time, an example will be really helpful.
Thank you!
