Hello, I would like to ask advice on architecture for reading millions of small files from a directory tree.
Millions as in 10-80 million; small as in a few kB; files as in JSON.
There is already a working software, the link is below. My question is about the possibilities for optimizing the part that walks the directory and reads the files. Both theoretical and practical advice are welcome.
The software takes geographical objects (features) that are stored individually in json files. Thus each feature in its own file. These files are stored in a directory tree of unknown structure.
Firstly, the total spatial extent of the input is computed as the union of the extent of each feature. In order to do this, each feature is visited, the json file is read and the feature extent is computed.
My first implementation used walkdir to walk the directory tree recursively and visit each feature.
My second, and current, implementation finds the first-level subdirectories (if any) and does the directory walking for each in a rayon parallel iterator. Finally, it processes any loose features that might also be in the root directory.
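The per-subdirectory approach could be sketched like this. This is a hypothetical, dependency-free stand-in, not the actual code: the real implementation uses walkdir and rayon, which are replaced here with `std::fs::read_dir` and scoped threads so the example is self-contained:

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::thread;

// Recursively collect regular files under `dir` (serial walk).
fn collect_files(dir: &Path, out: &mut Vec<PathBuf>) {
    if let Ok(entries) = fs::read_dir(dir) {
        for entry in entries.flatten() {
            let path = entry.path();
            if path.is_dir() {
                collect_files(&path, out);
            } else {
                out.push(path);
            }
        }
    }
}

// Walk each first-level subdirectory on its own thread, then pick up
// any loose files sitting directly in the root.
fn parallel_walk(root: &Path) -> Vec<PathBuf> {
    let mut subdirs = Vec::new();
    let mut files = Vec::new();
    for entry in fs::read_dir(root).unwrap().flatten() {
        let path = entry.path();
        if path.is_dir() {
            subdirs.push(path);
        } else {
            files.push(path);
        }
    }
    // One scoped thread per first-level subdirectory.
    let results: Vec<Vec<PathBuf>> = thread::scope(|s| {
        let handles: Vec<_> = subdirs
            .iter()
            .map(|d| {
                s.spawn(move || {
                    let mut v = Vec::new();
                    collect_files(d, &mut v);
                    v
                })
            })
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    });
    for v in results {
        files.extend(v);
    }
    files
}

fn main() {
    // Tiny demo tree in a temp directory.
    let root = std::env::temp_dir().join(format!("walk_demo_{}", std::process::id()));
    let sub = root.join("a");
    fs::create_dir_all(&sub).unwrap();
    fs::write(root.join("loose.json"), "{}").unwrap();
    fs::write(sub.join("feature.json"), "{}").unwrap();
    println!("found {} files", parallel_walk(&root).len());
}
```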
The problem with both approaches is that they are very slow. To the point that it is not really feasible to run on a HDD. An SSD can manage, but I'm not very happy about the fact that the software requires an SSD to run with sizable input.
In this response I've seen someone recommend walking the directory and spinning up a thread per file to process it. That already sounds like good advice. But maybe there are other things I could do as well?
You could try the parallel walker in the ignore crate. I don't have any guesses as to whether it will be better than your bespoke rayon variant though.
You're saying you have 10-80 million small files? If so, then I don't understand why needing an SSD concerns you. That is a very extreme use case. Your post here also doesn't make it clear whether you specifically chose to have 80 million small files as an architectural decision or whether it has been foisted upon you. If it's the former, then maybe revisit that choice.
Many small files will likely not be stored contiguously on your hard disk, so you're effectively doing millions of small random reads, which mechanical hard disks are very bad at. The only thing your software could do is try reading them in the order they are stored on disk, but I don't know of APIs for that. Another option would be changing how your data is stored to avoid this problem (e.g. bundling the files into a database, or into some archive format, so that they're more likely to be contiguous on disk).
Let's say you have 10-80 million files of 5kB each. That's 50-400GB of data.
A decent data transfer speed for a mechanical HDD is around 100 MB/sec, so even if you have no data fragmentation on disk and you use the most efficient possible way to read the files, it will still take 500-4000 seconds (i.e. between 8 and 67 minutes) just to read the data.
Is this too slow for you? A SATA SSD could be around 5 times faster than this, and a very fast NVMe SSD maybe about 50 times faster. Obviously, these figures are all very speculative and could vary widely depending on your exact combination of hardware and software, but the point is to set realistic expectations. Large datasets typically take hours to process. Reducing this time often requires changing the input data format, as other posters have already suggested.
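The back-of-the-envelope above can be written down as a tiny sketch (the file count, file size, and 100 MB/s figure are this post's assumptions, not measurements):

```rust
// Best-case sequential transfer time: total bytes divided by the
// drive's sustained throughput, ignoring seeks entirely.
fn transfer_secs(n_files: u64, file_bytes: u64, mb_per_sec: u64) -> u64 {
    n_files * file_bytes / (mb_per_sec * 1_000_000)
}

fn main() {
    // 10-80 million files of 5 kB each at ~100 MB/s on an HDD:
    let low = transfer_secs(10_000_000, 5_000, 100); // 500 s (~8 min)
    let high = transfer_secs(80_000_000, 5_000, 100); // 4000 s (~67 min)
    println!("HDD best case: {low}-{high} seconds");
}
```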
That's more an artifact of the combination of 1. the size of your problem and 2. the inherent slowness of random hard drive reads.
Assuming you don't want to throw away perfectly good data, there's no software on Earth that can change those fundamental characteristics, or the tension between them.
Also note that while walking the file system in parallel on a SSD should be faster than serial walks to some degree, on spinning rust (heh) hard drives it is not a guaranteed speed boost. It would be in your interest to see whether serial or parallel walks are faster, as the random layout of the data on the HDD could just yield suboptimal scanning for said data.
Hours are fine for that task. But with an HDD you can only access around 100 files per second (rough back-of-the-envelope calculation: 7200 rpm is 120 revolutions per second, so on the order of 120 disk accesses per second). And with 80 million files… now we are talking days, not hours.
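Spelling out that seek-bound estimate (the ~120 accesses/second figure is the rotational-speed approximation above, not a measurement):

```rust
// Seek-limited estimate: each small file costs roughly one random
// access, so time is files divided by accesses per second.
fn seek_limited_secs(n_files: u64, accesses_per_sec: u64) -> u64 {
    n_files / accesses_per_sec
}

fn main() {
    // 80 million files at ~120 accesses/second (7200 rpm):
    let secs = seek_limited_secs(80_000_000, 120);
    println!("~{} seconds, i.e. ~{:.1} days", secs, secs as f64 / 86_400.0);
}
```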
If you're targeting HDDs you'll want to heavily optimize disk traversal order and readaheads to cooperate with the physical constraints of the drive. I have written the platter-walk and reapfrog crates for this purpose. You can use fastar which integrates both to get an idea how long it would take to just slurp in the data. Probably not production-grade, but they might be useful for experiments or as inspiration.
The idea is to use a filesystem that supports FIEMAP to get the physical disk offsets and then open multiple files in physical order, issuing readaheads across multiple files that can be coalesced into larger background reads with fewer disk head seeks by the IO subsystem.
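FIEMAP itself needs an ioctl that isn't exposed by the standard library, but a cheap, commonly used approximation of physical ordering is to sort paths by inode number; on many filesystems inode order correlates loosely with on-disk placement. A minimal Unix-only sketch of that substitute (not the FIEMAP approach itself):

```rust
use std::fs;
use std::os::unix::fs::MetadataExt; // Unix-only: exposes the inode number
use std::path::PathBuf;

// Sort paths by inode number as a cheap stand-in for FIEMAP-based
// physical ordering; files that fail to stat sort last.
fn sort_by_inode(paths: &mut Vec<PathBuf>) {
    paths.sort_by_key(|p| fs::metadata(p).map(|m| m.ino()).unwrap_or(u64::MAX));
}

fn main() {
    // Sort whatever paths are passed on the command line.
    let mut paths: Vec<PathBuf> = std::env::args().skip(1).map(PathBuf::from).collect();
    sort_by_inode(&mut paths);
    for p in &paths {
        println!("{}", p.display());
    }
}
```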
HDDs generally don't like wildly parallel access because it means more seeks, not fewer. A little parallelism can help in cases where the IO queues would otherwise be mostly empty. But QD=0 is better solved by readaheads, not by parallelism. SSDs are a very different story because they have actual parallelism built into the hardware and don't have mechanical parts that need to move.
There are further possible optimizations which I haven't implemented yet:
using preadv2(..., RWF_NOWAIT) to probe the readiness of the prefetched data across multiple files to avoid some blocking
Thank you for all the responses so far, they are full of insight as always on this forum.
The many small files are given, I cannot do much about it at this stage. I shied away from converting to another format, because of the data duplication and processing time it would require (I think). But this is all just me theorizing...
My reasoning was that if I want to convert to faster storage format first, I would need to walk the files anyway. So I thought it is better to skip the conversion and process them right away.
Regarding processing times, a couple of hours is perfectly acceptable. Day(s) are not really. But to be more specific, the current parallel implementation takes about 1h for 10 million files on an NVMe SSD. This is acceptable, I think. The same setup with the previous sequential walk took 3.5h.
I started a run on 80 million files with the parallel walk on an HDD yesterday afternoon. It is about halfway through after roughly 24 hours. This is not acceptable, I think.
So as I understand from your responses, there is not much I can do in the software itself when it comes to HDDs. The true solution would be to restructure the data.
If most of the contents of the HDD are these files, then it might be faster to copy a disk image of it (which is linear rather than random access) to an SSD (or possibly even RAM if the data set is small enough), then read the files from that copy of the file system.
Hm, I cannot tell to be honest.
I always assumed that it would be a convert-once, use-once situation, but this might not be true.
I would like to make the software useful for others too. It is not meant to be a single-project tool. Thus I cannot really foresee how others would process their data.
The goal of the software is to take these many small files and merge them into chunks (tiles). In this particular case, the target format is 3D Tiles, but other export formats will follow.
In my case, the small files are a result of another processing step which generates the data, the 3D models in the files.
Windows, for example, is just particularly bad at dealing with lots of little files. (See things like Set up a Dev Drive on Windows 11 | Microsoft Learn for how the default filesystem configuration in Windows is just bad for many-small-things scenarios.) And any OS doing live virus scanning on file open will penalize reading millions of things.
One thing you'll see in games, for example, is that many of them support separate files -- for convenience during development -- but always ship with a package format of some sort because it makes such a difference in runtime performance to read one physical file with lots of logical files inside.
If at all possible, stuffing them into even something simple like an uncompressed tar/zip/etc file as part of the processing step would probably make things much faster.
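To make the bundling idea concrete, here is a minimal sketch of a hypothetical length-prefixed bundle format (not tar/zip, just the simplest thing that turns millions of opens into one sequential stream):

```rust
use std::fs::File;
use std::io::{self, BufReader, BufWriter, Read, Write};
use std::path::Path;

// Append each small file to one big bundle as
// [u64 little-endian length][bytes], so a later pass can stream the
// whole dataset with a single sequential read.
fn bundle(paths: &[&Path], out: &Path) -> io::Result<()> {
    let mut w = BufWriter::new(File::create(out)?);
    for p in paths {
        let data = std::fs::read(p)?;
        w.write_all(&(data.len() as u64).to_le_bytes())?;
        w.write_all(&data)?;
    }
    w.flush()
}

// Stream the bundle back, yielding each record in order.
fn read_bundle(path: &Path) -> io::Result<Vec<Vec<u8>>> {
    let mut r = BufReader::new(File::open(path)?);
    let mut records = Vec::new();
    let mut len_buf = [0u8; 8];
    loop {
        match r.read_exact(&mut len_buf) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
        let mut data = vec![0u8; u64::from_le_bytes(len_buf) as usize];
        r.read_exact(&mut data)?;
        records.push(data);
    }
    Ok(records)
}

fn main() -> io::Result<()> {
    // Round-trip two tiny JSON files through a bundle in a temp dir.
    let dir = std::env::temp_dir().join(format!("bundle_demo_{}", std::process::id()));
    std::fs::create_dir_all(&dir)?;
    let a = dir.join("a.json");
    let b = dir.join("b.json");
    std::fs::write(&a, br#"{"id":1}"#)?;
    std::fs::write(&b, br#"{"id":2}"#)?;
    let out = dir.join("bundle.bin");
    bundle(&[a.as_path(), b.as_path()], &out)?;
    println!("records: {}", read_bundle(&out)?.len());
    Ok(())
}
```

A real archive format would also store names and checksums; the point is only that the read side becomes one linear scan.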
Linux, mostly. At least for the amount of input that I described.
A couple users ran it under Windows, but only for a much smaller set of input.
If it runs well on Linux for large input, but it can also run on Windows, macOS even if sub-optimally, that's fine with me.
Just to add a rather theoretical approach with a lot of ifs: if you have exclusive access to the medium, and if it's using a file system with a supporting crate, you could open the drive as a block device and re-order file accesses to overcome the mechanical drawbacks of an HDD.
Another approach relies on your OS re-ordering concurrent file accesses (see Elevator algorithm - Wikipedia). The idea is to just open thousands of files in parallel and hope that the OS disk scheduler makes the best of it. Ideally it would notice when the read-write head is near a file that has been requested.
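A minimal std-only sketch of that idea (a hypothetical helper; a real implementation would cap the number of in-flight threads rather than spawn one per file):

```rust
use std::fs;
use std::path::PathBuf;
use std::thread;

// Issue many reads concurrently and let the kernel's I/O scheduler
// (the elevator) reorder the requests: with enough reads in flight,
// the drive can service whichever one is closest to the head.
fn read_all_concurrent(paths: Vec<PathBuf>) -> Vec<Vec<u8>> {
    let handles: Vec<_> = paths
        .into_iter()
        .map(|p| thread::spawn(move || fs::read(&p).unwrap_or_default()))
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    // Read whatever paths are passed on the command line.
    let paths: Vec<PathBuf> = std::env::args().skip(1).map(PathBuf::from).collect();
    let contents = read_all_concurrent(paths);
    println!("read {} files", contents.len());
}
```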
Third idea: use an SSD as transparent cache-layer to access the HDD. This will help to speed up accesses to nearby regions.
Otherwise (as already mentioned multiple times) I highly suggest somehow moving the data over to an SSD.