Search optimization in a very (100k+) large paths history

Hi there!

I'm currently writing an application where I have, at one point, to iterate recursively through a LOT of directories (let's say something around 100k).

Problem is, I also have to follow symbolic links, which some of them may point to directories already visited by the algorithm. But due to the treatment performed by the said algorithm, I have to make sure the treatment is not applied twice - and there is also the performance aspect, traversing directories multiple times makes the whole program go a lot slower.

Which means I have to check, for each directory, if it has already been visited by the program. For that, the dumb and naive method I think about would be to build a Vec<PathBuf> and put each directory inside. The problem is that 1) it takes quite a bit of memory (although this not a major problem) but also that 2) it would be quite slow to compare each directory against the history.

Let's say we're at the middle of the analysis, we have an history of 50k item. The remaining 50k items will have to be matched against a 50k+ history, which means more than 2.5 billions comparison, each requiring to compare character by character. So that's clearly out of the question.

I've also though about hashing the path, but the default hashing method of Rust only works on 8 bytes, which gives a risk of collision that isn't neligeable (I currently have a base of 100k items but it may grow a lot to get in the millions in the near future).

Longest hash algorithms tend to be slower as well, and there would still be a lot of comparisons involved as these would give string results instead of simple numbers, so that's a no as well.

So, is there a method (even if not simple at all) to check if a directory has already been visited? I can't get to figure out a solution for now... :confused:

Thanks in advance for your help! :slight_smile:

A BTreeSet can do it in a logarithmic number of operations.

2 Likes

Another option that is specific to strings such as paths would be a trie, but I'm not familiar with the trie implementations available in Rust, so you would have to research that on your own.

1 Like

I didn't see that, thought that it wasn't possible with the way keys are hashed.
I'll try that out, thanks :slight_smile:

A BTreeSet does not hash the keys. It works by storing them in alphabetical order.

2 Likes

Ok that seems more logical, I've just tested in my program and it's only a 1.5x~2x performance penalty, which is a LOT less than I expected.

Thanks a lot for your help :smiley:

You're welcome!

1.5x~2.x compared to what?

Compared to my base algorithm, which simply walks through directories and performs a few very light checkings (edit in release mode of course, debug build has an higher performance penalty due to low code optimization).

One recommendation I'd have is to construct your set with a certain capacity, as it sounds like you already know roughly how many items will be input. With 100k+ items, this will significantly reduce allocations.

A BTreeSet cannot be initialized to a certain capacity because it is a tree-based structure that stores things across several allocations.

1 Like

The problem is also that the program will sometimes be run on very small filesystem structures, so allocating 100k+ items in such case would be a waste.

What would be really useful though is to be able to bump the allocation by a thousand items each time for instance, but as @alice indicated this is not possible for BTreeSet.

If you use a HashSet with the resolved path as the key, you don't need to worry about collisions from a safety perspective since if there is a collision, it will just use the next slot for the duplicate hash and check that the contents match when it a hash already exists (or at least that's often how that works). Whether that's more performant would depend on multiple factors, but its worth a benchmark if performance is important.

2 Likes

I just tested it and it's actually a lot faster xD

Now there's only a 20% performance panelty (but I do a bit more work in my algorithm now), compared to almost 100% with a BTreeSet. The results seem to be the same.

Maybe because it's not sorted?

The HashSet has O(1) cost, so is in general preferred for speed.

1 Like

Try creating the HashSet with a high capacity so it doesn’t need to reallocate as suggested above. Also that will reduce collisions which will improve performance generally.

1 Like

I just tried that, even by pre-allocating just the right number of slots (dozens of thousands) this doesn't make much of a difference (less than 5% in total).

I guess this is because it holds PathBuf values which, just like String, need to allocate, and I bet that most of the allocations actually happen in there.

But thanks for the suggestion :slight_smile:

On linux at least, you could use the inode and device id to identify unique directories (though you would lose the path they were accessed via).

4 Likes

Oh that's a really good idea, this program is mostly meant to be used on Linux so this could be nice. Would the std::os::linux::fs::MetadataExt::st_ino method do the trick?

If your symlinks can traverse multiple devices, you'll need both st_ino and st_dev, and is the safer approach. If you're completely certain it's all on one device, just st_ino would work.

1 Like

Right. That's what I get for not bothering to check docs before commenting.