I am required to create an internal mirror of crates.io for a closed environment.
Effectively, I need to stand up an internal mirror that has no connection to the web and is purely standalone and self-contained, in what is called an "air-gapped closed environment". I literally can download something and burn it to a CD/DVD, then I must "sneakernet" the disc into that area and load it onto the machine.
I've got most of this figured out.
The basic technique I am using is this:
a) Clone the GitHub repo: crates.io-index
b) Loop through it to get all names, all versions, and their status.
That gives:
188K crate names
or 1.6 million unique NAME + VERSION pairs
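A rough sketch of step b), assuming a local clone of crates.io-index in which each crate has one file of JSON lines carrying `name`, `vers`, and `yanked` fields. The helper names `iter_index_entries` and `summarize` are made up for illustration.

```python
import json
from pathlib import Path

def iter_index_entries(index_root):
    """Yield one dict per published version found in an index checkout."""
    root = Path(index_root)
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        if not path.is_file():
            continue
        # Skip .git and other hidden directories.
        if any(part.startswith(".") for part in rel.parts):
            continue
        # Crate names cannot contain dots, so this skips config.json and similar.
        if "." in path.name:
            continue
        with path.open(encoding="utf-8") as fh:
            for line in fh:
                line = line.strip()
                if line:
                    yield json.loads(line)

def summarize(entries):
    """Return (crate names, all versions, non-yanked versions)."""
    names = set()
    total = 0
    live = 0
    for entry in entries:
        names.add(entry["name"])
        total += 1
        if not entry.get("yanked", False):
            live += 1
    return len(names), total, live
```

Calling `summarize(iter_index_entries("crates.io-index"))` on the clone should produce the kind of counts quoted above.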
Next, I can use the "api/v1" interface to fetch the files.
So I am about to download 188K crates
(if I expand this to all non-yanked versions, the count is 1.6 million).
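The fetch step could look roughly like this. Treat it as a sketch under assumptions: the download path under "api/v1" is how I understand the endpoint to be shaped and should be checked against the current crates.io documentation, the `cksum` value is assumed to be the SHA-256 hex digest recorded for each version in the index, and the User-Agent string is a placeholder.

```python
import hashlib
import urllib.request

# Assumed endpoint shape; verify against the current crates.io docs.
API_BASE = "https://crates.io/api/v1/crates"
# Placeholder; replace with something that identifies you and how to reach you.
USER_AGENT = "airgap-mirror-sketch (contact: you@example.invalid)"

def download_url(name, vers):
    """Build the download URL for one crate version."""
    return f"{API_BASE}/{name}/{vers}/download"

def checksum_ok(data, cksum):
    """Compare the bytes against the SHA-256 hex digest from the index."""
    return hashlib.sha256(data).hexdigest() == cksum.lower()

def fetch_crate(name, vers, cksum, timeout=60):
    """Download one .crate file and refuse it if the checksum differs."""
    req = urllib.request.Request(
        download_url(name, vers),
        headers={"User-Agent": USER_AGENT},
    )
    # urlopen follows redirects by default.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = resp.read()
    if not checksum_ok(data, cksum):
        raise ValueError(f"checksum mismatch for {name} {vers}")
    return data
```

Checking every file against the index checksum matters more than usual here, since a corrupt download is painful to replace once it has been carried across the air gap.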
Yeah, I can write a Python tool to scrape these.
The for() loop does not care; it will just continue to pull things until the loop ends.
I can throttle: there are 86,400 seconds in a day, so if I limit to 2 requests per second (172,800 per day), it takes a little over a day to download the basic list but close to 10 days for all versions (1.6 million). Or I can skip throttling and just go for it.
I would rather be a "good citizen" and not create anything like a DoS attack.
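For the throttled option, a minimal sketch of the arithmetic and of a fixed-rate limiter. `days_needed` and `Throttle` are illustrative names, and 2 requests per second is just the figure from this post, not a limit published by crates.io.

```python
import time

SECONDS_PER_DAY = 86_400

def days_needed(n_requests, per_second):
    """How many days n_requests take at a steady request rate."""
    return n_requests / (per_second * SECONDS_PER_DAY)

class Throttle:
    """Allow at most `per_second` calls to wait() per second, on average."""

    def __init__(self, per_second, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / per_second
        self.clock = clock
        self.sleep = sleep
        self.next_ok = clock()

    def wait(self):
        now = self.clock()
        if now < self.next_ok:
            # Too early: sleep until the next slot opens.
            self.sleep(self.next_ok - now)
            now = self.next_ok
        self.next_ok = now + self.interval
```

In the download loop, call `throttle.wait()` just before each request. `days_needed(188_000, 2)` and `days_needed(1_600_000, 2)` give the two estimates for the basic list and for all versions.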
So my ask is this:
Is there a better way (preferred way) for me to pull all of these?
Or do I just throttle to a few requests per second?
I am ready to "release the hounds" upon the beast, and prefer to politely ask first.
Totally agree, and it's a shame a tool to do this does not exist, but my needs are:
1) First and foremost, a standalone air-gapped solution.
2) The ability to whitelist/blacklist crates. I can populate/seed that list with a popularity…
But in the end I need to force items, hence the blacklist/whitelist is important.
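One way the whitelist/blacklist could be applied, as a sketch only. The precedence (blacklist always wins, whitelist forces a crate in, everything else must be in the seed list) is my assumption about what "force items" should mean, and `select_crates` is a made-up name.

```python
def select_crates(candidates, seed, whitelist, blacklist):
    """Pick the crates to mirror, keeping the order of `candidates`.

    candidates: every crate name found in the index
    seed:       names chosen automatically (for example by popularity)
    whitelist:  names forced in even if they are not in the seed
    blacklist:  names forced out no matter what
    """
    chosen = []
    for name in candidates:
        if name in blacklist:
            continue  # forced out
        if name in whitelist or name in seed:
            chosen.append(name)
    return chosen
```

Passing sets for `seed`, `whitelist`, and `blacklist` keeps each lookup cheap even with 188K candidate names.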
And I do not care if I let it run over the weekend.
Yeah, I have seen a number of things that act as a caching server, but nothing aimed at the closed/air-gapped world or that clearly supports that use case.
My ability to adapt an existing solution is not good right now. Most everything Rust is written in Rust, and I am a Rust newbie (super newbie). Python is easier to get going with.
Example: I think this would be a great addition to cargo, i.e.
cargo --create-mirror DIRNAME
But I can't do that in the time allotted, so I do it with Python.