Tips on how to optimize a file parser (std, diesel, serde)

I have this app that reads all the .chr files (they're similar to .txt files) inside of a folder, pulls 3 values (elv, exp, gld), then turns them into SQL data & inserts them in a table (PostgreSQL), and also serializes the data into .json & saves it in a file called characters_db.json.

When I have 12000 .chr files it takes ≈8 seconds to complete the whole process (Dual core Intel Celeron). I've used Instant::now(); to check which part takes the most time and it's the reading process by far (99% of the total time). Originally I was using the regex crate but once I removed it, the performance got much better. So how could I further optimize the reading .chr files part of the code?

The code is a bit long so I'm sharing it as a gist here, and also here's an example of a .chr file just in case: Example

You would probably benefit from parallelizing the file reading since you have a fairly large number of files.
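A rough sketch of what that could look like with only the standard library, in case it helps. The function name `read_all_parallel` and the `workers` parameter are made up for illustration, and it assumes you have already collected the `.chr` paths into a slice. It splits the paths into chunks and reads each chunk on its own scoped thread:

```rust
use std::fs;
use std::io;
use std::path::PathBuf;
use std::thread;

/// Hypothetical helper: reads every file in `paths` into a String, splitting
/// the work across at most `workers` threads. Results are returned in the
/// same order as `paths`.
fn read_all_parallel(paths: &[PathBuf], workers: usize) -> Vec<io::Result<String>> {
    if paths.is_empty() {
        return Vec::new();
    }
    let workers = workers.max(1);
    // Ceiling division so every path lands in exactly one chunk.
    let chunk_len = (paths.len() + workers - 1) / workers;

    thread::scope(|scope| {
        // Spawn one thread per chunk; scoped threads are allowed to borrow `paths`.
        // Collecting the handles first makes sure all threads are started
        // before we block on any of them.
        let handles: Vec<_> = paths
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk
                        .iter()
                        .map(|p| fs::read_to_string(p))
                        .collect::<Vec<_>>()
                })
            })
            .collect();

        // Joining in spawn order preserves the input order.
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("reader thread panicked"))
            .collect()
    })
}
```

On a dual core machine you would probably call it with `workers` set to 2 and measure from there. The rayon crate is the usual way to get this with less code, if adding a dependency is fine for you.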

lines and split are convenient, but they necessarily work by reading ahead in the string to find the character they divide the string with. You can often get at least a minor speedup[1] by going character by character and deciding what to do next by matching on the current character. In your case the files are short and you're mostly ignoring the initial part of the string you're splitting, so it may not have as much of an effect here.

  1. assuming you're careful about what you do in the loop
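To make the character-by-character idea concrete, here is a minimal single-pass sketch. I don't know the exact .chr layout, so this assumes lines of the form `key = value` with non-negative integer values; the `Stats` struct and `parse_stats` function are hypothetical names, not taken from your gist:

```rust
/// Hypothetical container for the three values of interest.
#[derive(Debug, Default, PartialEq)]
struct Stats {
    elv: Option<u64>,
    exp: Option<u64>,
    gld: Option<u64>,
}

/// Walks the input once, byte by byte. ASSUMPTION: each interesting line
/// looks like `key = 123`; the real .chr format may differ.
fn parse_stats(input: &str) -> Stats {
    let bytes = input.as_bytes();
    let mut stats = Stats::default();
    let mut i = 0;

    while i < bytes.len() {
        // Read the key: everything up to '=' or the end of the line.
        let key_start = i;
        while i < bytes.len() && !matches!(bytes[i], b'=' | b'\n') {
            i += 1;
        }
        // Slicing is safe here: both ends sit on ASCII bytes or the string ends.
        let key = input[key_start..i].trim();

        if i < bytes.len() && bytes[i] == b'=' {
            i += 1; // skip '='
            // Skip spaces and tabs before the value.
            while i < bytes.len() && matches!(bytes[i], b' ' | b'\t') {
                i += 1;
            }
            // Accumulate decimal digits without allocating.
            let mut value: u64 = 0;
            let mut saw_digit = false;
            while i < bytes.len() && bytes[i].is_ascii_digit() {
                value = value
                    .saturating_mul(10)
                    .saturating_add(u64::from(bytes[i] - b'0'));
                saw_digit = true;
                i += 1;
            }
            if saw_digit {
                match key {
                    "elv" => stats.elv = Some(value),
                    "exp" => stats.exp = Some(value),
                    "gld" => stats.gld = Some(value),
                    _ => {} // some other key, ignore it
                }
            }
        }

        // Skip whatever is left of the line, then the newline itself.
        while i < bytes.len() && bytes[i] != b'\n' {
            i += 1;
        }
        i += 1;
    }
    stats
}
```

Whether this actually beats lines plus split on files this small is something you would have to measure.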


Instead of reading the entire file to a string and then iterating it line by line, you could try using BufReader and BufRead::lines.

BufReader's docs say "It does not help when reading very large amounts at once", but File's docs say "It can be more efficient to read the contents of a file with a buffered Reader. This can be accomplished with BufReader<R>" with an example using read_to_string, so I'm not sure what to think, but maybe worth a try?
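For reference, a minimal sketch of the BufReader plus BufRead::lines variant. The `find_value` name and the `key = value` line format are assumptions on my part, since I'm guessing at what a .chr file looks like:

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader};
use std::path::Path;

/// Hypothetical helper: returns the trimmed value of the first `key = value`
/// line whose key matches, reading the file line by line through a BufReader.
/// ASSUMPTION: the file uses `key = value` lines.
fn find_value(path: &Path, key: &str) -> io::Result<Option<String>> {
    let reader = BufReader::new(File::open(path)?);
    for line in reader.lines() {
        // Each item is an io::Result<String>; propagate read errors.
        let line = line?;
        if let Some((k, v)) = line.split_once('=') {
            if k.trim() == key {
                // Stop early: the rest of the file is never read or scanned.
                return Ok(Some(v.trim().to_string()));
            }
        }
    }
    Ok(None)
}
```

Note that BufRead::lines allocates a new String for every line, so for short files this may well end up no faster than read_to_string. Worth timing both.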


Obligatory question: you're running in --release, right?

(regex in particular slows down a ton in debug.)

Have you run a profiler to get a flamegraph of where the time is being spent more specifically? Reading files (once they're OS-cached, at least) should be so much faster than DB calls that 99% makes me suspicious that there's something weird going on.

The "at once" is critical there.

BufReader's advantage comes when you want to make lots of little reads in your Rust code, like when doing an "is it a newline yet?" loop. Then the BufReader will request one large chunk of the file from the OS, and give it out to you in small pieces without needing to go back to the OS every time.

But if you're calling Read::read with a very large buffer already, then the BufReader isn't helpful since it'll have to go ask the OS every time anyway. So there's no need for something like std::fs::read to use a buffered reader, since it's basically just reading into a buffer anyway.
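A small sketch of the two situations, using newline counting as a stand-in task (both function names are made up for illustration). The first makes many tiny reads, which is where BufReader earns its keep; the second reads the whole file in one go, where a BufReader would add nothing:

```rust
use std::fs::{self, File};
use std::io::{self, BufReader, Read};
use std::path::Path;

/// Many tiny reads: `bytes()` asks the reader for one byte at a time, so the
/// BufReader in between is what keeps this from hitting the OS per byte.
fn count_newlines_buffered(path: &Path) -> io::Result<usize> {
    let reader = BufReader::new(File::open(path)?);
    let mut count = 0;
    for byte in reader.bytes() {
        if byte? == b'\n' {
            count += 1;
        }
    }
    Ok(count)
}

/// One large read: `fs::read` pulls the whole file into a Vec<u8>, so there
/// is nothing left for a BufReader to batch up.
fn count_newlines_whole(path: &Path) -> io::Result<usize> {
    let data = fs::read(path)?;
    Ok(data.iter().filter(|&&b| b == b'\n').count())
}
```

For 12000 short files, the second shape (which is what read_to_string already does) is probably the right default.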