Hi there!
I'm currently writing an application where I have, at one point, to iterate recursively through a LOT of directories (let's say something around 100k).
Problem is, I also have to follow symbolic links, which some of them may point to directories already visited by the algorithm. But due to the treatment performed by the said algorithm, I have to make sure the treatment is not applied twice - and there is also the performance aspect, traversing directories multiple times makes the whole program go a lot slower.
Which means I have to check, for each directory, if it has already been visited by the program. For that, the dumb and naive method I think about would be to build a Vec<PathBuf>
and put each directory inside. The problem is that 1) it takes quite a bit of memory (although this not a major problem) but also that 2) it would be quite slow to compare each directory against the history.
Let's say we're at the middle of the analysis, we have an history of 50k item. The remaining 50k items will have to be matched against a 50k+ history, which means more than 2.5 billions comparison, each requiring to compare character by character. So that's clearly out of the question.
I've also though about hashing the path, but the default hashing method of Rust only works on 8 bytes, which gives a risk of collision that isn't neligeable (I currently have a base of 100k items but it may grow a lot to get in the millions in the near future).
Longest hash algorithms tend to be slower as well, and there would still be a lot of comparisons involved as these would give string results instead of simple numbers, so that's a no as well.
So, is there a method (even if not simple at all) to check if a directory has already been visited? I can't get to figure out a solution for now...
Thanks in advance for your help!