After benchmarking with flamegraph I can see that most of the time is spent in the Split iterator's next method (45% of execution time), and within that method most of the time goes to TwoWaySearcher (35% of execution time).
Using the bstr crate I get better results, since I can work with bytes directly without converting to str, but around 45% of execution time still goes to the split_str function.
I can also see that a significant portion of the time in both split functions goes to the swapgs_restore_regs_and_return_to_usermode function.
Do you know what could make string splitting so slow? Should I look at a different crate for faster string splitting?
Besides something silly like accidentally making your algorithm O(n²) and splitting the input more than once, could it simply be that splitting takes the most time because your program is quite simple and that's where all the processing happens?
You might also find that writing your own splitting routine with something like the memchr crate will help, because you can make assumptions about your input that a general-purpose routine like the standard library's can't.
TwoWaySearcher should only be needed for patterns longer than one byte, not for finding a comma. If you're searching for a one-byte pattern in a release build, I would expect the iterator to be inlined to the point that it doesn't even show up in the flamegraph.
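To illustrate the single-byte case: splitting directly on bytes avoids both TwoWaySearcher and any UTF-8 handling. This is a minimal sketch using std's slice::split; `split_fields` is a hypothetical helper name, and the memchr crate's `memchr_iter` could replace the inner byte scan with a SIMD-accelerated one of the same shape.

```rust
// Hypothetical helper: split a raw row on a single byte, no &str involved.
// slice::split does a plain byte comparison, so no Unicode machinery runs.
fn split_fields(row: &[u8]) -> Vec<&[u8]> {
    row.split(|&b| b == b',').collect()
}

fn main() {
    let row = b"id,name,value";
    let fields = split_fields(row);
    assert_eq!(fields, vec![&b"id"[..], &b"name"[..], &b"value"[..]]);
    println!("{} fields", fields.len());
}
```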
It may be that my program is quite simple, but I believe I am doing other, more costly operations than splitting, and yet they take less time than this part.
A more detailed workflow is:
The file is composed of N blocks of LZ4-compressed data.
I use a BufReader to read the file in chunks of the block size.
I decompress the data into a 1MB array on the stack.
I read every row from the decompressed data buffer.
I split every row as if it were a CSV.
I write the fields I am interested in to a CSV file.
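The last three steps above can be sketched as a single pass over a decompressed block. This assumes `block` already holds the decompressed bytes (the LZ4 step is elided); the function name and the selected column indices are placeholders, not from the original program.

```rust
use std::io::Write;

// Hypothetical per-block step: split each row on commas and write only the
// columns listed in `keep` to the output CSV.
fn write_selected(block: &[u8], keep: &[usize], out: &mut impl Write) -> std::io::Result<()> {
    for row in block.split(|&b| b == b'\n').filter(|r| !r.is_empty()) {
        let fields: Vec<&[u8]> = row.split(|&b| b == b',').collect();
        let mut first = true;
        for &i in keep {
            if let Some(f) = fields.get(i) {
                if !first {
                    out.write_all(b",")?;
                }
                out.write_all(f)?;
                first = false;
            }
        }
        out.write_all(b"\n")?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let block = b"a,b,c\nd,e,f\n";
    let mut out = Vec::new();
    write_selected(block, &[0, 2], &mut out)?;
    assert_eq!(out, b"a,c\nd,f\n");
    Ok(())
}
```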
I would expect the decompression and writing parts to be slower than the splitting, but I found that splitting is far more demanding than either of those two.
Splitting takes 30% with memmem::find_iter, writing takes 10%, and decompression about 10% too.
No, I am using a BufReader to read the file, reading chunks of the block size. Mmap caused too many page faults, so I replaced it with BufReader and got better results.
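For reference, fixed-size block reads with BufReader look roughly like this. The helper name and the 8-byte block size are illustrative only (the real program uses the LZ4 block size and a 1MB buffer); `read_exact` either fills the whole buffer or fails with UnexpectedEof at the end of the file.

```rust
use std::fs::File;
use std::io::{BufReader, Read, Write};

// Hypothetical helper: count how many full blocks a reader yields.
// In the real pipeline, each filled `block` would be LZ4-decompressed here.
fn count_full_blocks<R: Read>(mut reader: R, block_size: usize) -> usize {
    let mut block = vec![0u8; block_size];
    let mut n = 0;
    while reader.read_exact(&mut block).is_ok() {
        n += 1;
    }
    n
}

fn main() -> std::io::Result<()> {
    let path = std::env::temp_dir().join("blocks_demo.bin");
    File::create(&path)?.write_all(&[0u8; 8 * 3])?;

    let reader = BufReader::new(File::open(&path)?);
    assert_eq!(count_full_blocks(reader, 8), 3);

    std::fs::remove_file(&path)?;
    Ok(())
}
```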
Yes, that is my last resort if I cannot find a crate that already implements this efficiently for my needs.
Changing the iterator, I can see that SmallVec::extend is now the part that takes most of the time, and the total is reduced to 26.5% of execution time, which is a good improvement.
I am not sure if this can be improved even more.
PS: The split iterator is passed directly to SmallVec::extend, so chances are that the long time is still due to the split iterator, which is now inlined.
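One thing that might be worth trying: instead of feeding the split iterator to SmallVec::extend, collect field boundaries as index ranges with a hand-rolled scan, reusing the same buffer across rows. This is only a sketch under those assumptions; `field_ranges` is a hypothetical name, it uses a plain Vec for illustration, and with the smallvec crate a `SmallVec<[Range<usize>; N]>` would keep short rows off the heap entirely.

```rust
use std::ops::Range;

// Hypothetical alternative: record each field as a start..end range into the
// row instead of storing subslices. The output Vec is cleared and reused, so
// after warm-up no reallocation happens inside the hot loop.
fn field_ranges(row: &[u8], out: &mut Vec<Range<usize>>) {
    out.clear();
    let mut start = 0;
    for (i, &b) in row.iter().enumerate() {
        if b == b',' {
            out.push(start..i);
            start = i + 1;
        }
    }
    out.push(start..row.len()); // final field (possibly empty)
}

fn main() {
    let row = b"a,bc,,d";
    let mut ranges = Vec::with_capacity(8); // reused across rows
    field_ranges(row, &mut ranges);
    assert_eq!(ranges, vec![0..1, 2..4, 5..5, 6..7]);
    assert_eq!(&row[ranges[1].clone()], b"bc");
}
```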