Iterate over varying size chunks of binary data

I want to parse a big binary file (tens to potentially hundreds of gigabytes) efficiently, so I'm looking at parallelism (order of execution and completion doesn't matter).

The structure of the file is as follows:
The first 2 bytes hold the length of a logical record, including those 2 bytes themselves.
At position '0 + length of the first logical record', take 2 bytes to find the length of the next logical record,
and so on.

Example data:
UPDATE: included a few more bytes to show where the record type info is.

```
00 12 00 00 nn ...  record runs to 18 bytes total (0x12 = 18); 'nn' is the record type
7F F0 00 00 nn ...  record runs to 32_752 bytes total (0x7FF0 = 32752); 'nn' is the record type
```

Does the below pseudocode make sense? Ideally, I want this to be iterable, so that I can use Rayon to handle the parallelisation.

```
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

let mut f = File::open("file.bin")?;
let file_len = f.metadata()?.len();
let mut pos = 0u64;
let mut buf = [0u8; 2];
while pos < file_len {
    f.seek(SeekFrom::Start(pos))?;
    f.read_exact(&mut buf)?;
    // the length prefix is big-endian and includes its own 2 bytes
    let recsize = u16::from_be_bytes(buf);
    let mut recbuf = vec![0u8; recsize as usize - 2];
    f.read_exact(&mut recbuf)?;
    // parallelize below block somehow?
    // because I will be parsing the data + serde_json-ing the parsed data
    let record = Record::read_from(&mut recbuf.as_slice());
    pos += recsize as u64;
}
```

If you are reading a single file, I would not try to parallelize that. The only exception is if processing the data takes longer than the I/O itself, and that is rarely the case. Even when it is, I would not parallelize the I/O itself, but the processing of that I/O.

I would also not use seek unless there are large parts of the file you don't need at all.
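Reading sequentially, every read_exact already advances the cursor, so the seek is redundant. A minimal sketch of the same loop without it (assuming, as you said, the 2-byte length includes itself; error handling for malformed lengths omitted):

```
use std::fs::File;
use std::io::{BufReader, Read};

let f = File::open("file.bin")?;
let mut reader = BufReader::new(f);
let mut len_buf = [0u8; 2];
// read_exact fails with UnexpectedEof at the end of the file
while reader.read_exact(&mut len_buf).is_ok() {
    let rec_len = u16::from_be_bytes(len_buf) as usize;
    // the length includes the 2-byte prefix itself
    let mut body = vec![0u8; rec_len - 2];
    reader.read_exact(&mut body)?;
    // hand `body` off for parsing here
}
```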

I recommend using three backticks for code blocks, as they have syntax highlighting:

```
// your code here
```

In particular, please don't let stray curly braces fall outside the code block.

Hey Alice,

Whoops, sorry about the stray bracket. I'll use the three backticks from now on.

Ok, I'll do the I/O sequentially.
The read_from function within the Record struct reads some values (u32, u8, etc.) as-is from the input BufRead; however, there are slices on which I will be performing transformations (say codepage conversions, or unpacking some packed field).
Currently, read_from returns a struct. I'm looking to add some code at the end of the function to convert that struct into externally tagged JSON. So... will Record::read_from be worth parallelizing at least then/now? We're talking about a million-plus logical records.
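For context, by externally tagged I mean roughly this shape (field names here are just placeholders; serde's default enum representation is already externally tagged):

```
use serde::Serialize;

// serde's default enum representation is externally tagged,
// i.e. the variant name wraps the payload
#[derive(Serialize)]
enum Record {
    TypeA { some_field: u32, text: String },
    TypeB { packed_value: u8 },
}

let rec = Record::TypeA { some_field: 42, text: "hi".into() };
println!("{}", serde_json::to_string(&rec).unwrap());
// prints: {"TypeA":{"some_field":42,"text":"hi"}}
```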

Also, any comments on the above method of splitting out logical records from a single binary file?
Ok, so I can't use seek. What else can I use? I have to split the single binary file into logical records, and then do a match to parse each different logical record type differently.

One option is to use memory-mapped files (e.g. with memmap, sketch below); it's one of the most efficient ways to process large files. With this approach your file will look like a simple memory buffer, and the OS will handle caching and reading data for you. But it has several major disadvantages:

  • If the file is no longer available (e.g. the drive with the file got disconnected), your application will simply abort execution instead of giving you an error (it's possible to catch and process the failure signal, but it's really difficult to do correctly, and borderline impossible for a library).
  • Data can change under your nose (e.g. if another process writes to the memory-mapped file), which does not play well with Rust's aliasing guarantees. For example, if you have converted a &[u8] to a &str by checking that it's indeed a valid UTF-8 string, strictly speaking you can't return this &str safely, since the underlying buffer may change in the future.

So this approach works poorly for a library crate, but sometimes can be quite convenient for an application.
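If you do want to try it, the sketch is short. This assumes the memmap crate; the unsafe block exists precisely because of the aliasing caveat above:

```
use std::fs::File;
use memmap::Mmap;

let file = File::open("file.bin")?;
// Safety: we must assume no other process mutates the file while mapped
let mmap = unsafe { Mmap::map(&file)? };
let bytes: &[u8] = &mmap;
// `bytes` can now be sliced into records like any in-memory buffer
println!("mapped {} bytes", bytes.len());
```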

Hi,

Thanks for the suggestion. At the moment, I'm looking to keep it simple, as I'm really a beginner.
memmap would be going hardcore too soon, before I even have an MVP.

I was able to achieve the logic I intended above with this. Vec is iterable, so I made a Vec of Vecs.

```
use rayon::prelude::*;

// smf is an array of the file's bytes - [u8; number]
let smf_bytes = smf.len();
let mut pos = 0;
let mut file_vec: Vec<Vec<u8>> = Vec::new();
while pos < smf_bytes {
    let rec_len = u16::from_be_bytes(smf[pos..pos + 2].try_into().unwrap());
    let vec_size = rec_len as usize;
    // println!("pos:{} rec_len:{} vec_size:{}", pos, rec_len, vec_size);
    let rec_vec = smf[pos..pos + vec_size].to_vec();
    file_vec.push(rec_vec);
    pos += vec_size;
}
file_vec.par_iter().for_each(|rec| do_it(rec));
```

This assumes the whole file fits in memory. I will have to refactor this to fill file_vec up to a specific size, call par_iter, wait for it to return, empty file_vec, and then proceed with the next batch, roughly as sketched below.
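Roughly what I have in mind for the batching (BATCH_BYTES is an arbitrary cap I made up; records is whatever serial source yields the record Vecs):

```
use rayon::prelude::*;

const BATCH_BYTES: usize = 256 * 1024 * 1024; // arbitrary cap on batch size
let mut batch: Vec<Vec<u8>> = Vec::new();
let mut batch_size = 0;
for rec in records {
    batch_size += rec.len();
    batch.push(rec);
    if batch_size >= BATCH_BYTES {
        // process this batch in parallel, then reuse the allocation
        batch.par_iter().for_each(|r| do_it(r));
        batch.clear();
        batch_size = 0;
    }
}
// leftover records from the last partial batch
batch.par_iter().for_each(|r| do_it(r));
```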

Also, I guess I can avoid reading the whole file into file_vec, and instead use f.read_exact_at(target_vec, offset) when productionizing, since I know my program will be running in *nix environments.
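Something like this, if I understand the FileExt API right (Unix-only; read_record is just a helper name I made up):

```
use std::fs::File;
use std::os::unix::fs::FileExt;

fn read_record(f: &File, offset: u64, len: usize) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; len];
    // read_exact_at takes &self and an absolute offset, so no seek is needed
    f.read_exact_at(&mut buf, offset)?;
    Ok(buf)
}
```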

Thoughts?
If this is hacky, please let me know if there's a more elegant way.
Thank you!

I can think of a few options, but in all of them you need to hope that your computation has some weight compared to I/O, or it won't help you much.

  • Serially skim the file to collect just (offset, length) of each record, then use par_iter() to read and process them.
  • Serially read the records and spawn them for further processing. Wrap everything in a scope first if you need to borrow shared data.
  • Create a serial Iterator producing records, and use par_bridge() to parallelize further processing (see the sketch after this list).
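For instance, option 3 can stay very close to a plain serial loop. A sketch, assuming reader is some BufRead over the file and do_it is your processing function:

```
use std::io::Read;
use rayon::prelude::*; // brings ParallelBridge into scope

// records are produced serially by this iterator...
let records = std::iter::from_fn(move || {
    let mut len_buf = [0u8; 2];
    reader.read_exact(&mut len_buf).ok()?;
    let len = u16::from_be_bytes(len_buf) as usize;
    let mut rec = vec![0u8; len - 2];
    reader.read_exact(&mut rec).ok()?;
    Some(rec)
});

// ...but processed in parallel
records.par_bridge().for_each(|rec| do_it(&rec));
```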

Okay, option 1 seems closest to what I already have.
Can you please show an example of how I should go about implementing it?
Skim the whole file for offsets in a while loop (like above); then, once the whole file has been scanned for offsets, pass them (assuming I give it a Vec of (offset, length) tuples) to a par_iter?

I'm leaning towards the fewest modifications to what I already have, because at the moment it's just a hunch that the processing will be heavy (each logical record, max 32 kbytes in length, is a mish-mash of codepage, binary, and hex data). Hundreds of different logical record types, and tens of thousands of such logical records...

Yes, and then each pair can use read_exact_at to get its actual data.
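A sketch of the whole two-pass shape (process_record is a placeholder for your parsing + JSON step, assumed to return io::Result<()>):

```
use std::fs::File;
use std::os::unix::fs::FileExt;
use rayon::prelude::*;

let f = File::open("file.bin")?;
let file_len = f.metadata()?.len();

// Pass 1: serially skim the file for (offset, length) pairs
let mut index: Vec<(u64, usize)> = Vec::new();
let mut pos = 0u64;
let mut len_buf = [0u8; 2];
while pos < file_len {
    f.read_exact_at(&mut len_buf, pos)?;
    let rec_len = u16::from_be_bytes(len_buf) as usize;
    index.push((pos, rec_len));
    pos += rec_len as u64;
}

// Pass 2: read and process the records in parallel;
// read_exact_at takes &self, so sharing &f across threads is fine
index.par_iter().try_for_each(|&(offset, len)| {
    let mut rec = vec![0u8; len];
    f.read_exact_at(&mut rec, offset)?;
    process_record(&rec)
})?;
```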

It's best if you can get some measurements to support your hunch. Do you already have a working single-threaded program? If you `time your-program`, how do the real/user/sys times look? Parallelizing can only distribute the user time. If the real/sys times are much larger, that's probably I/O.

I'm currently testing on Windows, so I can't try either of these (read_exact_at or time).

EDIT: Please suggest, @cuviper :slight_smile:

Hey, a quick question -

So I open a file in, let's say, main(), and then I can call par_iter on the Vec of (offset, length) tuples and pass f to the function? (Or should it be &f?)
If you have time, an example will be really helpful.
Thank you!
