Why is splitting a slice by a comma so slow?

I am developing a program for reading CSV-like files. They are not strictly CSV, but we can assume they are some kind of delimiter-separated values files.

The workflow will be as follows:

  • Get one block of data.
  • Get a row from the block. The row is given as raw bytes where a comma is used to separate the different field values.
  • Split the row by comma to get an iterator over the different fields.
  • Get only the desired fields.
  • Write the fields to a CSV file.
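The split, select and write steps above can be sketched with the standard library alone. This is only an illustration under assumptions: the function name `write_selected` and the `wanted` index list are hypothetical, the indices are assumed to be sorted in ascending order, and no quoting or escaping is handled.

```rust
use std::io::{self, Write};

/// Writes the fields of `row` whose indices are listed in `wanted`
/// (hypothetical parameter, assumed sorted ascending) to `out`,
/// comma-separated and terminated by a newline.
fn write_selected<W: Write>(row: &[u8], wanted: &[usize], out: &mut W) -> io::Result<()> {
    let mut wanted = wanted.iter().copied().peekable();
    let mut first = true;
    for (idx, field) in row.split(|&b| b == b',').enumerate() {
        // Stop scanning the row once every desired field has been written.
        let next_wanted = match wanted.peek() {
            Some(&w) => w,
            None => break,
        };
        if next_wanted == idx {
            if !first {
                out.write_all(b",")?;
            }
            out.write_all(field)?;
            first = false;
            wanted.next();
        }
    }
    out.write_all(b"\n")
}

fn main() -> io::Result<()> {
    let mut out = Vec::new();
    write_selected(b"a,b,c,d,e", &[1, 3], &mut out)?;
    assert_eq!(out, b"b,d\n".to_vec());
    Ok(())
}
```

Stopping after the last desired index means the tail of a long row is never searched, which may matter when only a few early columns are needed.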

Surprisingly, the slowest part of the program is the split step; it is even slower than writing the contents to a file.

After profiling with flamegraph, I can see that most of the time is spent in the Split iterator's next method (45% of execution time), and inside that method most of it goes to TwoWaySearcher (35% of execution time).

Using the bstr crate I get better results, since I can work with bytes directly without converting to str, but around 45% of execution time is still spent in the split_str function.

I can also see that a significant portion of the time in both split functions is spent in the swapgs_restore_regs_and_return_to_usermode function.

Do you know what can make string splitting so slow? Should I look at a different crate for faster string splitting?

Thank you! :smiley:

Besides something silly like accidentally making your algorithm O(n²) and needing to split input more than once, could it just be that splitting takes the most time because your program is quite simple and that's where all the processing is spent?

You might also find that writing your own splitting routine with something like the memchr crate will help because you can make assumptions about your input that a general-purpose crate like the standard library can't.

Are you using mmap to read the CSV file?

Maybe it takes so much time because the file is being read during the execution of memchr.

TwoWaySearcher should only be needed for patterns longer than one byte, not for finding a comma. If you're searching for a one byte long pattern, and it's a release build, I would expect the iterator to be inlined until it won't even show up in the flamegraph.
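To illustrate the point about pattern types: as far as I understand the standard library, a `char` pattern such as `','` is handled by a dedicated single-character searcher, while a `&str` pattern such as `","` goes through the general substring searcher, which may be how TwoWaySearcher ends up in a profile even with a one-byte delimiter. The sketch below only shows that the three spellings produce the same fields; it does not measure anything, and the function names are made up for the example.

```rust
/// Splits with a `char` pattern (single-character search).
fn split_by_char(row: &str) -> Vec<&str> {
    row.split(',').collect()
}

/// Splits with a `&str` pattern (general substring search).
fn split_by_str(row: &str) -> Vec<&str> {
    row.split(",").collect()
}

/// Splits raw bytes with a per-byte predicate, no UTF-8 validation needed.
fn split_by_byte(row: &[u8]) -> Vec<&[u8]> {
    row.split(|&b| b == b',').collect()
}

fn main() {
    let row = "alpha,beta,,delta";
    let by_char = split_by_char(row);
    assert_eq!(by_char, ["alpha", "beta", "", "delta"]);
    assert_eq!(by_char, split_by_str(row));
    assert_eq!(split_by_byte(row.as_bytes()).len(), by_char.len());
}
```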

It may be that my program is quite simple, but I think I am doing other, more costly operations than splitting, and they take less time than this part.

A more detailed workflow is:

  • The file is composed of N blocks of compressed data (in LZ4 format).
  • I use a BufReader to read the file in chunks of the block size.
  • I decompress the data into a 1MB array located on the stack.
  • I read every row from the decompressed data buffer.
  • I split every row as if it were a CSV.
  • I write the fields I am interested in to a CSV file.

I would expect the decompression and writing parts to be slower than the splitting, but I found that splitting is far more demanding than those two.

Splitting takes 30% with memmem::find_iter, writing takes 10% and decompression about 10% too.

No, I am using BufReader to read the file in chunks of the block size. Mmap caused too many page faults, so I replaced it with BufReader and got better results.

Yes, that is my last option if I cannot find something already implemented that is efficient enough for my needs.

I have implemented my own iterator using memchr as @Michael-F-Bryan said and I can see a performance gain.

After changing the iterator, I can see that SmallVec::extend is now the part that takes most of the time, and it is down to 26.5% of execution time, which is a good improvement.

I am not sure whether this can be improved any further.

PS: The split iterator is passed directly to SmallVec::extend, so there is a good chance that the time is actually spent in the split iterator, which is now inlined.


Just to verify, do you use release mode (--release)?

Yes, I do. Otherwise the time spent for processing a file goes from 10 seconds to several minutes.

NOTE: Just to add information, my input file is around 2GB, with 171 fields per row and around 2 million records.

I am adding here a snippet of the ByteSplit iterator I created using memchr, in case it is useful to someone else.
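A minimal sketch of what such an iterator might look like, not the code originally posted. The struct layout and the `new` constructor are assumptions, and `find_byte` is a std-only stand-in so the example builds without dependencies; in a real project it could be swapped for `memchr::memchr(needle, haystack)`, which has the same shape.

```rust
/// Std-only stand-in for `memchr::memchr(needle, haystack)`.
fn find_byte(needle: u8, haystack: &[u8]) -> Option<usize> {
    haystack.iter().position(|&b| b == needle)
}

/// Iterator over the fields of a row separated by a single delimiter byte.
struct ByteSplit<'a> {
    /// Unconsumed part of the row; `None` once the last field was yielded.
    rest: Option<&'a [u8]>,
    delim: u8,
}

impl<'a> ByteSplit<'a> {
    fn new(row: &'a [u8], delim: u8) -> Self {
        ByteSplit { rest: Some(row), delim }
    }
}

impl<'a> Iterator for ByteSplit<'a> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<&'a [u8]> {
        let rest = self.rest?;
        match find_byte(self.delim, rest) {
            Some(i) => {
                // Yield everything before the delimiter and skip past it.
                self.rest = Some(&rest[i + 1..]);
                Some(&rest[..i])
            }
            None => {
                // No delimiter left: the remainder is the final field.
                self.rest = None;
                Some(rest)
            }
        }
    }
}

fn main() {
    let fields: Vec<&[u8]> = ByteSplit::new(b"a,,bc", b',').collect();
    assert_eq!(fields, [&b"a"[..], &b""[..], &b"bc"[..]]);
}
```

Like the standard `split` on slices, this sketch yields a single empty field for an empty row and an empty trailing field when the row ends with the delimiter.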
