Iterate over varying-size chunks of binary data


If you are reading a single file, I would not try to parallelize that. The only exception is if processing the data takes longer than the I/O itself, and that is rarely the case. Even when it is, I would not parallelize the I/O itself, but the processing of the data it yields.

I would also not use seek unless there are large parts of the file you don't need at all.

I recommend using three backticks for code blocks, since they get syntax highlighting:

```
// your code here
```

In particular, please don't let stray curly braces fall outside the code block.

One option is to use memory-mapped files (e.g. with memmap); it's one of the most efficient ways to process large files. With this approach your file looks like a simple memory buffer, and the OS handles caching and reading the data for you. But it has several major disadvantages:

  • If the file is no longer available (e.g. the drive with the file got disconnected), your application will simply abort execution instead of giving you an error (it's possible to catch and handle the failure signal, but it's really difficult to do correctly, and borderline impossible for a library).
  • Data can change under your nose (e.g. if another process writes to the memory-mapped file), which does not play well with Rust's aliasing guarantees. For example, if you have converted a &[u8] to a &str by checking that it is indeed valid UTF-8, strictly speaking you can't return that &str safely, since the underlying buffer may change in the future.

So this approach works poorly for a library crate, but it can sometimes be quite convenient for an application.
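
To make this concrete, here is a minimal sketch using the memmap2 crate (the maintained successor of memmap); the file name and the 16-byte slice are made up for illustration:

```
use memmap2::Mmap;
use std::fs::File;

fn main() -> std::io::Result<()> {
    let file = File::open("records.bin")?; // hypothetical input file
    // SAFETY: we assume no other process truncates or mutates the file
    // while it is mapped; as discussed above, Rust cannot enforce this.
    let mmap = unsafe { Mmap::map(&file)? };
    // The mapping derefs to &[u8], so it can be sliced like any buffer,
    // and the OS pages data in on demand.
    let head = &mmap[..mmap.len().min(16)];
    println!("first bytes: {:02x?}", head);
    Ok(())
}
```

Once mapped, subslices of the buffer can be handed out like any other &[u8], subject to the caveats above.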

I can think of a few options, but in all of them you need to hope that your computation has some weight compared to I/O, or it won't help you much.

  • Serially skim the file to collect just the (offset, length) of each record, then use par_iter() to read and process them.
  • Serially read the records and spawn them for further processing. Wrap everything in a scope first if you need to borrow shared data.
  • Create a serial Iterator producing records, and use par_bridge() to parallelize further processing (see the sketch after this list).
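
For the last option, here is a minimal par_bridge() sketch; the length-prefixed record format and the read_record helper are hypothetical, purely for illustration:

```
use rayon::prelude::*;
use std::fs::File;
use std::io::{self, BufReader, Read};

// Hypothetical record format: a little-endian u32 length prefix
// followed by that many bytes of payload.
fn read_record(reader: &mut impl Read) -> io::Result<Option<Vec<u8>>> {
    let mut len_buf = [0u8; 4];
    match reader.read_exact(&mut len_buf) {
        Ok(()) => {}
        // A clean EOF before the next header means we are done.
        Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(None),
        Err(e) => return Err(e),
    }
    let mut data = vec![0u8; u32::from_le_bytes(len_buf) as usize];
    reader.read_exact(&mut data)?;
    Ok(Some(data))
}

fn main() -> io::Result<()> {
    let mut reader = BufReader::new(File::open("records.bin")?);
    // A serial iterator of records, bridged onto rayon's thread pool:
    // reading stays sequential, processing fans out.
    std::iter::from_fn(move || read_record(&mut reader).transpose())
        .par_bridge()
        .try_for_each(|record| {
            let record = record?;
            // ... expensive per-record processing goes here ...
            let _ = record.len();
            Ok(())
        })
}
```

Note that par_bridge() does not preserve order, so records may be processed out of sequence.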

Yes, and then each pair can be passed to read_exact_at to fetch its actual data.
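
A minimal sketch of that approach, assuming an index of (offset, length) pairs built during the serial skim; read_exact_at comes from std::os::unix::fs::FileExt, so this variant is Unix-only (Windows has seek_read instead):

```
use rayon::prelude::*;
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt;

// `index` is assumed to come from a serial skim of the file:
// one (offset, length) pair per record.
fn process_all(file: &File, index: &[(u64, usize)]) -> io::Result<()> {
    index.par_iter().try_for_each(|&(offset, len)| {
        let mut buf = vec![0u8; len];
        // read_exact_at takes an absolute offset and &self, so threads
        // can share one File handle without coordinating seeks.
        file.read_exact_at(&mut buf, offset)?;
        // ... expensive processing of `buf` goes here ...
        Ok(())
    })
}
```

Whether this beats a single serial pass depends, as noted above, on how heavy the per-record processing is.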

It's best if you can get some measurements to support your hunch. Do you already have a working single-threaded program? If you run time your-program, how do the real/user/sys times look? Parallelizing can only distribute the user time. If the real/sys times are much larger, that's probably I/O.
