A better way to scrape crates.io

I am required to create an internal mirror of crates.io for a closed environment.

Effectively, I need to stand up a standalone, self-contained internal mirror with no connection to the web, in what is called an "air-gapped closed environment." I can literally download something, burn it to a CD/DVD, then "sneakernet" the disc into that area and load it onto the machine.

I've got most of this figured out.

The basic technique I am using is this:
a) Clone the GitHub repo crates.io-index.
b) Loop through it to get all names, all versions, and the yank status of each.
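The index-walking step above can be sketched in Python, assuming a local checkout of the crates.io-index repo, where each file holds one JSON object per line with `name`, `vers`, and `yanked` fields (the function name is mine):

```python
import json
import os

def iter_index(index_dir):
    """Yield (name, version, yanked) for every entry in a crates.io-index checkout."""
    for root, dirs, files in os.walk(index_dir):
        # Skip .git and any other hidden directories.
        dirs[:] = [d for d in dirs if not d.startswith(".")]
        for fname in files:
            # config.json describes the registry itself, not a crate.
            if fname == "config.json":
                continue
            with open(os.path.join(root, fname), encoding="utf-8") as fh:
                for line in fh:
                    entry = json.loads(line)
                    yield entry["name"], entry["vers"], entry["yanked"]
```

Counting the results of this generator gives the crate-name and name+version totals mentioned below.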

That gives:
~188K crate names,
or ~1.6 million unique name + version pairs.

Next, I can use the "api/v1" interface to fetch the files.

So I am about to download 188K crates (if I expand this to all non-yanked versions, the count is 1.6 million).
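For reference, the per-crate fetch can be wrapped like this (a minimal sketch: the api/v1 download endpoint redirects to the actual tarball, and the helper names here are mine):

```python
import urllib.request

# crates.io's per-version download endpoint; it redirects to the CDN.
API_TEMPLATE = "https://crates.io/api/v1/crates/{name}/{version}/download"

def crate_url(name, version):
    """Build the download URL for one name + version pair."""
    return API_TEMPLATE.format(name=name, version=version)

def fetch_crate(name, version, dest_path):
    """Follow the redirect and write the .crate tarball to dest_path."""
    urllib.request.urlretrieve(crate_url(name, version), dest_path)
```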

Yes, I can write a Python tool to scrape these; the for loop does not care, it will just continue to pull things until the loop ends.

I can throttle (there are 86,400 seconds in a day, so if I limit to 2 requests per second that is 172,800 per day: about a day to download the basic list, but about 10 days for all 1.6 million versions), or I can not throttle the requests and just go for it.
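A simple way to enforce a 2-per-second cap is a rate-limited generator wrapped around whatever produces the download jobs (a sketch; the helper name is mine):

```python
import time

def throttled(iterable, per_second=2.0):
    """Yield items from iterable no faster than per_second."""
    interval = 1.0 / per_second
    last = 0.0
    for item in iterable:
        # Sleep off whatever remains of the minimum gap since the last item.
        wait = interval - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        yield item
```

The download loop then just iterates `throttled(jobs)` instead of `jobs`, and the cap holds regardless of how fast the server answers.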

I would rather be a "good citizen" and not create anything that looks like a DoS attack.

So my ask is this:
Is there a better (preferred) way for me to pull all of these,
or should I just throttle to a few requests per second?

I am ready to "release the hounds" upon the beast, and prefer to politely ask first.

Thanks.


It's linked from the crates.io home page.

https://crates.io/data-access

That gets you access to any public information that's in crates.io's database. You'll still need to download the individual .crate files.

Yes, and so there are 1.6 million individual .crate files.

My question is: is there a better or more preferred way to mirror those 1.6 million data files?
I can do that, but I want to ask first.

Some third-party registries support mirroring crates.io. (I have no opinion or recommendation between them.)

It would be a waste to download all versions of all crates. There are a lot of dead and squatted crates there.

The db dump has download numbers. Download only the couple of most popular versions of the top crates.
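Selecting by popularity from the dump can be sketched like this (an assumption-laden sketch: it presumes the dump's `crates.csv` exposes `name` and `downloads` columns, and the function name is mine):

```python
import csv

def top_crates(crates_csv_path, limit=1000):
    """Return the `limit` most-downloaded crate names from the dump's crates.csv."""
    with open(crates_csv_path, newline="", encoding="utf-8") as fh:
        rows = list(csv.DictReader(fh))
    # Rank by all-time download count, highest first.
    rows.sort(key=lambda r: int(r["downloads"]), reverse=True)
    return [r["name"] for r in rows[:limit]]
```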


Totally agree, and it's a shame a tool to do this does not exist, but my needs are:

  1. First and foremost, a standalone air-gapped solution.

  2. The ability to maintain a whitelist/blacklist of crates; I can seed that list by popularity.

  3. In the end I need to be able to force items in, hence the blacklist/whitelist is important.

  4. I do not care if I let it run over the weekend.

  5. I have seen a number of things that act as a caching server, but nothing aimed at the closed/air-gapped world, or that clearly supports that use case.

  6. My ability to adapt an existing solution is not good right now: almost everything Rust-related is written in Rust, and I am a Rust noobie (super noobie); Python is easier for me to get going with.
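The whitelist/blacklist requirement above amounts to a small set operation (a minimal sketch; every function and parameter name here is mine):

```python
def select_crates(index_names, whitelist=(), blacklist=(), popular=()):
    """Seed the mirror set from a popularity list, force whitelisted
    crates in, and drop anything blacklisted."""
    available = set(index_names)
    # Popularity seeds the set; the whitelist forces additional items in.
    chosen = (set(popular) | set(whitelist)) & available
    # The blacklist always wins, even against the popularity seed.
    return sorted(chosen - set(blacklist))
```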

Example: I think this would be a great addition to cargo, i.e.

     cargo --create-mirror DIRNAME

but I can't do that in the time allotted, so I do it with Python.
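In the meantime, the whole Python workflow can be condensed into one small script (a sketch built on the assumptions above: the one-JSON-object-per-line index format, the api/v1 download URL, and a `DIRNAME/name/name-version.crate` layout of my own choosing):

```python
import json
import os
import time
import urllib.request

def mirror(index_dir, dest_dir, per_second=2.0):
    """Walk a crates.io-index checkout and download every non-yanked
    version into dest_dir, throttled to per_second requests."""
    interval = 1.0 / per_second
    for root, dirs, files in os.walk(index_dir):
        dirs[:] = [d for d in dirs if not d.startswith(".")]  # skip .git
        for fname in files:
            if fname == "config.json":
                continue
            with open(os.path.join(root, fname), encoding="utf-8") as fh:
                for line in fh:
                    entry = json.loads(line)
                    if entry["yanked"]:
                        continue
                    name, vers = entry["name"], entry["vers"]
                    dest = os.path.join(dest_dir, name, f"{name}-{vers}.crate")
                    os.makedirs(os.path.dirname(dest), exist_ok=True)
                    url = f"https://crates.io/api/v1/crates/{name}/{vers}/download"
                    urllib.request.urlretrieve(url, dest)
                    time.sleep(interval)
```

This is the unfiltered version; in practice the whitelist/blacklist selection would sit between parsing the entry and downloading it.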
