Suggestions on approach for a preprocessor

I’m working on a challenge project, trying to figure out the fastest way to search through 10,000,000,000 u64s and return a specific 1,000,000 of them. This amounts to around 74GB, which is too big to fit in RAM. I don’t know if this is relevant, but the u64s are a stand-in for flags that would be used to filter items into separate lists for other processors later down the line, but that’s outside the scope of the challenge.

The most obvious approach to me is to build a data structure inside a file, stacking all the u64s next to each other and then using “The Maths™” to figure out where in the file to jump, using std::io::Seek to get there.
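Roughly what I’m picturing, as a minimal sketch (untested; the file name and the index value are just placeholders):

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

/// Read the n-th u64 from a file of tightly packed little-endian u64s.
fn read_flag(file: &mut File, index: u64) -> io::Result<u64> {
    // Each entry is exactly 8 bytes, so the byte offset is just index * 8.
    file.seek(SeekFrom::Start(index * 8))?;
    let mut buf = [0u8; 8];
    file.read_exact(&mut buf)?;
    Ok(u64::from_le_bytes(buf))
}

fn main() -> io::Result<()> {
    let mut file = File::open("flags.bin")?;
    let value = read_flag(&mut file, 1_234_567_890)?;
    println!("flag = {value}");
    Ok(())
}
```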

Is this a reasonable approach, or am I missing something simpler/better? Are there any gotchas I’m missing? I initially thought file fragmentation might be an issue, but someone pointed out that fragmentation is irrelevant on an SSD, so as long as I specify that as a constraint it shouldn’t be an issue.

[edit] I messed up on my math on the size, I forgot to translate bits to bytes :roll_eyes:

You may want to look into mmap (or CreateFileMapping on Windows) for random access (probably unsafe). But seeking a file stream might be good as well (it’s safe)
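For example, with the memmap2 crate the random-access path could look something like this (a sketch only; `flags.bin` is a placeholder, and the map call is unsafe because the file must not be modified or truncated while it’s mapped):

```rust
use memmap2::Mmap;
use std::fs::File;

fn main() -> std::io::Result<()> {
    let file = File::open("flags.bin")?;
    // Safety: the file must not change underneath the mapping.
    let map = unsafe { Mmap::map(&file)? };

    // Grab the u64 at an arbitrary index without any explicit seek.
    let index = 1_234_567_890usize;
    let bytes: [u8; 8] = map[index * 8..index * 8 + 8].try_into().unwrap();
    println!("flag = {}", u64::from_le_bytes(bytes));
    Ok(())
}
```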

serde + bincode will help you with parsing the file.
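For instance, a quick sketch with serde + bincode 1.x, using a made-up record (a value plus a checksum field) purely to show the round trip:

```rust
use serde::{Deserialize, Serialize};

// Hypothetical record layout, just for illustration.
#[derive(Serialize, Deserialize, Debug, PartialEq)]
struct Record {
    value: u64,
    checksum: u64,
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let record = Record { value: 42, checksum: 42 ^ 0xDEAD_BEEF };

    // bincode 1.x encodes integers as fixed-width little-endian by default,
    // so every Record serializes to the same number of bytes (16 here),
    // which keeps any "offset = index * record size" math intact.
    let bytes = bincode::serialize(&record)?;
    assert_eq!(bytes.len(), 16);

    let decoded: Record = bincode::deserialize(&bytes)?;
    assert_eq!(decoded, record);
    Ok(())
}
```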

I don’t think that fragmentation should be of concern to you if you’re writing a program (SSD or not).

That may be entirely I/O bound, so think about efficient disk usage.

The results can fit in RAM.

For the data it could be useful to have an index. If you just bisect naively, you’ll do lots of seeks. With an index you could just grab the right “page” from the file.
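Rough sketch of what I mean, assuming the u64s are written to the file in sorted order and you keep the first key of each 4KB page in RAM (all names made up):

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

const PAGE_BYTES: usize = 4096;          // one 4KB page
const PER_PAGE: usize = PAGE_BYTES / 8;  // 512 u64s per page

/// In-RAM index: the first (smallest) key stored in each page of the sorted file.
struct SparseIndex {
    first_keys: Vec<u64>,
}

impl SparseIndex {
    /// Locate `key` with a single disk read: pick the page from the in-RAM
    /// index, read just that page, then binary search inside it.
    /// Returns the key's global position in the file, if present.
    fn lookup(&self, file: &mut File, key: u64) -> io::Result<Option<usize>> {
        // First page whose smallest key is > key; the page before it is the
        // only one that can contain `key`.
        let page = self.first_keys.partition_point(|&k| k <= key).saturating_sub(1);

        let mut buf = vec![0u8; PAGE_BYTES];
        file.seek(SeekFrom::Start((page * PAGE_BYTES) as u64))?;
        file.read_exact(&mut buf)?;

        let values: Vec<u64> = buf
            .chunks_exact(8)
            .map(|c| u64::from_le_bytes(c.try_into().unwrap()))
            .collect();
        Ok(values.binary_search(&key).ok().map(|i| page * PER_PAGE + i))
    }
}
```

(This pretends the file length is an exact multiple of the page size; the last partial page would need special handling.)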

Thanks for the replies

I was thinking about having them be referenced by a 10-digit index and making the data structures a fixed size, so I could just stack them together and multiply the length of the data structure by the index (with any necessary adjustments) to jump right to that location. Maybe also add start and end codes bounding the data so I can verify I didn’t land somewhere unexpected, and maybe some sort of hash to check against corruption.

That way I don’t have to search the data; I can just jump to the right place. It sounds like std::io::Seek is the right thing for jumping to a known location in a file, so now I just have to decide how I’m going to build the data structure and what (if any) extra checks I want to do in case of a bad jump or file corruption.
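Something like this is what I’m imagining for the framed records (sketch only and untested; the magic value and checksum function are placeholders for whatever I end up using):

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

const MAGIC: u16 = 0xBEEF;           // start-of-record marker (placeholder value)
const RECORD_BYTES: u64 = 2 + 8 + 8; // magic + value + checksum

/// Placeholder integrity check; a real one might be a CRC32 or xxHash of the value.
fn checksum(value: u64) -> u64 {
    value.rotate_left(17) ^ 0xA5A5_A5A5_A5A5_A5A5
}

/// Jump straight to record `index`, then verify we landed where we expected.
fn read_record(file: &mut File, index: u64) -> io::Result<u64> {
    file.seek(SeekFrom::Start(index * RECORD_BYTES))?;

    let mut buf = [0u8; RECORD_BYTES as usize];
    file.read_exact(&mut buf)?;

    let magic = u16::from_le_bytes(buf[0..2].try_into().unwrap());
    let value = u64::from_le_bytes(buf[2..10].try_into().unwrap());
    let stored = u64::from_le_bytes(buf[10..18].try_into().unwrap());

    if magic != MAGIC {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "bad jump: start code mismatch"));
    }
    if stored != checksum(value) {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "corruption: checksum mismatch"));
    }
    Ok(value)
}
```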

Also, this isn’t expected to be the native home for the data, so in the case of corruption it can just kick back an error, ask for the data again, and rebuild the table. It’s just a preprocessor for sifting the data so the real processors don’t each have to swallow 1,000,000 or so items every time a request comes in, which is several times a day. (If I remember correctly, in the original situation they had set up a controller that worked with the preprocessor and then doled out the different lists to instanced cloud processors for scalability, but I don’t have to deal with that part [yet].)

Oh, are you thinking the throughput on the bus/drive controller is going to be the bottleneck here? I hadn’t considered that. Maybe RAID would help give more bandwidth, even though it’s solid state?

Mmmm… after looking into it, it seems solid-state drives should be able to handle upwards of 100,000 IOPS, and the resulting data is likely to be (much) less than 100MB, so I’m hoping bandwidth won’t be a problem so long as I thread the requests.
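The threaded requests I have in mind look roughly like this (untested sketch, Unix-only because it leans on read_exact_at, i.e. pread, so the threads don’t fight over a shared seek cursor):

```rust
use std::fs::File;
use std::io;
use std::os::unix::fs::FileExt; // read_exact_at: positioned reads, no shared cursor
use std::sync::Arc;
use std::thread;

/// Fetch the u64 at each requested index, spreading the reads across threads
/// so the SSD sees enough in-flight requests to use its IOPS budget.
fn fetch_all(file: Arc<File>, indices: Vec<u64>) -> io::Result<Vec<u64>> {
    const THREADS: usize = 8;
    let chunk = indices.len().div_ceil(THREADS).max(1);

    let mut handles = Vec::new();
    for batch in indices.chunks(chunk).map(|c| c.to_vec()) {
        let file = Arc::clone(&file);
        handles.push(thread::spawn(move || -> io::Result<Vec<u64>> {
            let mut out = Vec::with_capacity(batch.len());
            for idx in batch {
                let mut buf = [0u8; 8];
                file.read_exact_at(&mut buf, idx * 8)?;
                out.push(u64::from_le_bytes(buf));
            }
            Ok(out)
        }));
    }

    let mut results = Vec::with_capacity(indices.len());
    for h in handles {
        results.extend(h.join().expect("worker thread panicked")?);
    }
    Ok(results)
}
```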

I’m thinking that if it doesn’t fit in RAM, and isn’t computationally expensive (searching uncompressed data generally isn’t), then it’s likely to be I/O bound, simply because even the fastest SSD is orders of magnitude slower than RAM.

Things you can do when working with large files with random access:

  • Is a certain subset of the data “hot”, e.g. are some of your numbers searched more frequently than others? Put them together in one section of the file (or in a separate file). This way the filesystem cache will be more efficient.

  • If the data compresses well, and you use a compression scheme that decompresses faster than the disk can read, you may get higher throughput (there are even special compression algorithms just for integers). For compression you may need to split the data into pages/blocks (e.g. 4KB at a time) and decompress a whole block even if you read just 8 bytes out of it, but it may still be faster, since you save a lot of time on I/O and compressed data makes better use of the disk cache. See the sketch after this list.

  • You can get that much RAM for ~$800.
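To make the block idea concrete, a sketch along these lines (lz4_flex is just an example codec, and the per-block offset/length table is assumed to have been built at write time and loaded into RAM):

```rust
use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom};

const BLOCK_VALUES: usize = 512; // 512 u64s == 4KB of uncompressed data per block

/// Byte offset and compressed length of each block, kept in RAM.
struct BlockTable {
    entries: Vec<(u64, usize)>,
}

/// Read the n-th u64 by decompressing only the block that contains it.
fn read_value(file: &mut File, table: &BlockTable, index: usize) -> io::Result<u64> {
    let (offset, len) = table.entries[index / BLOCK_VALUES];

    let mut compressed = vec![0u8; len];
    file.seek(SeekFrom::Start(offset))?;
    file.read_exact(&mut compressed)?;

    // Decompress the whole 4KB block even though only 8 bytes of it are needed.
    let block = lz4_flex::decompress_size_prepended(&compressed)
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e.to_string()))?;

    let start = (index % BLOCK_VALUES) * 8;
    Ok(u64::from_le_bytes(block[start..start + 8].try_into().unwrap()))
}
```

The corresponding writer would compress each 4KB chunk with lz4_flex::compress_prepend_size and record where it landed; integer-specific codecs (delta coding plus bit-packing and friends) would slot into the same block layout.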

These are good suggestions, I’ll look more into them.

A large amount of the data (about 70%) belongs to one list and only one list, so I may start with something basic like caching in RAM a list of the references that don’t belong to that set, and testing that first.

In the end it’s probably going to boil down to whether it’s faster to search through a 25GB list of 3 billion items 1 million times in RAM, vs. reading 8 bytes from predictable file locations 1 million times.
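For the in-RAM side of that comparison I’m picturing something like this (sketch only, names made up; it assumes the ~3 billion exceptions are kept sorted so a binary search stands in for “search through a 25GB list”):

```rust
/// References that do NOT belong to the dominant (~70%) list, kept sorted in RAM.
/// At ~3 billion u64s this is roughly 24GB, so it only works on a big-memory box.
struct ExceptionCache {
    sorted_refs: Vec<u64>,
}

impl ExceptionCache {
    /// True if `reference` belongs to the dominant list (i.e. it is not an exception).
    fn is_in_main_list(&self, reference: u64) -> bool {
        self.sorted_refs.binary_search(&reference).is_err()
    }
}

fn main() {
    // Tiny stand-in data; the real cache would be built from the full data set.
    let cache = ExceptionCache { sorted_refs: vec![3, 7, 42, 1_000_000_007] };
    assert!(cache.is_in_main_list(5));   // not an exception -> dominant list
    assert!(!cache.is_in_main_list(42)); // exception -> one of the other lists
}
```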

Also, it’s supposed to be geared towards a cloud environment, so lots of RAM may end up being dramatically more expensive (for now). Maybe I’ll end up designing it to support all 3 options (RAM only, disk only, disk with RAM cache) to allow for adaptability.