Hi,
I tried to build a script that loads a CSV and identifies and maps connected components. For example, this input:
"id" | "phone number"
1 | 123
2 | 123
2 | 234
3 | 234
4 | 555
5 | 555
5 | 666
6 | 777
would return:
"id" | "shared_ids"
1 | [1, 2, 3]
2 | [1, 2, 3]
3 | [1, 2, 3]
4 | [4, 5]
5 | [4, 5]
So for every id I want to show the full chain it belongs to.
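To make the grouping concrete, here is a minimal sketch of the idea using a small union-find over the example rows (hard-coded here; the real script reads them from the CSV, and this is only an illustration of the technique, not my actual code):

```rust
use std::collections::{BTreeSet, HashMap};

/// Minimal union-find (disjoint set) keyed by id.
struct DisjointSet {
    parent: HashMap<u32, u32>,
}

impl DisjointSet {
    fn new() -> Self {
        Self { parent: HashMap::new() }
    }

    fn find(&mut self, x: u32) -> u32 {
        let p = *self.parent.entry(x).or_insert(x);
        if p == x {
            x
        } else {
            let root = self.find(p);
            self.parent.insert(x, root); // path compression
            root
        }
    }

    fn union(&mut self, a: u32, b: u32) {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra != rb {
            self.parent.insert(ra, rb);
        }
    }
}

fn main() {
    // (id, phone) pairs as in the example; in the real script these come from the CSV.
    let rows: &[(u32, &str)] = &[
        (1, "123"), (2, "123"), (2, "234"), (3, "234"),
        (4, "555"), (5, "555"), (5, "666"), (6, "777"),
    ];

    // Union every id with the first id seen for the same phone number.
    let mut dsu = DisjointSet::new();
    let mut first_id_for_phone: HashMap<&str, u32> = HashMap::new();
    for &(id, phone) in rows {
        dsu.find(id); // register the id even if its phone number is unique
        if let Some(&other) = first_id_for_phone.get(phone) {
            dsu.union(id, other);
        } else {
            first_id_for_phone.insert(phone, id);
        }
    }

    // Group ids by their component root.
    let mut components: HashMap<u32, BTreeSet<u32>> = HashMap::new();
    let ids: Vec<u32> = dsu.parent.keys().copied().collect();
    for id in ids {
        let root = dsu.find(id);
        components.entry(root).or_default().insert(id);
    }

    // For each id, print all ids in the same component.
    for members in components.values() {
        for &id in members {
            println!("{} | {:?}", id, members);
        }
    }
}
```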
With quite a bit of help from ChatGPT (at some points I was pretty stuck), I created this script: GitHub
The idea was to load the CSV into vectors that hold the values, and then use HashMaps and HashSets in the rest of the script for speed, while also using references as much as possible to keep memory consumption low.
The speed looks good; it ran faster than my Python equivalent (see Medium, if interested). However, the memory consumption looks wasteful: the source file is 120 MB, but the program uses about 3 GB (Python used roughly 2.1 GB).
I am wondering where so much memory is being wasted. I suspect there is too much cloning, but I could not see how to avoid it.
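One idea I have been considering for cutting the cloning (again just a sketch, not my actual script; the `Interner` type and all names here are made up for illustration) is to intern each phone number once and pass small integer indices around instead of cloned Strings:

```rust
use std::collections::HashMap;

/// Hypothetical interner: each distinct phone number is stored exactly once;
/// everywhere else only a small integer index is passed around.
#[derive(Default)]
struct Interner {
    index_of: HashMap<String, u32>,
}

impl Interner {
    fn intern(&mut self, s: &str) -> u32 {
        if let Some(&idx) = self.index_of.get(s) {
            return idx; // repeated value: no allocation, no clone
        }
        let idx = self.index_of.len() as u32;
        self.index_of.insert(s.to_owned(), idx); // single owned copy
        idx
    }
}

fn main() {
    let mut phones = Interner::default();
    // While parsing the CSV, keep only (id, phone_index) pairs instead of
    // cloning the phone String into every map and set.
    let edges: Vec<(u32, u32)> = vec![
        (1, phones.intern("123")),
        (2, phones.intern("123")), // reuses index 0, no new String allocated
        (2, phones.intern("234")),
    ];
    println!("{:?}", edges); // [(1, 0), (2, 0), (2, 1)]
}
```

With something like this, the maps and sets would hold u32 indices instead of owned Strings, which I would expect to shrink the per-entry footprint considerably, but I am not sure whether this is the main source of the waste.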
I would be super happy to get some feedback on this.