Pre-caching cargo index and crates


#1

In CI builds, especially builds based on Docker, Cargo often ends up re-downloading the whole index, and all crates, on every build (because ~/.cargo is thrown away). This is problematic, because clones from GitHub are slow, and the index is quite large. Re-downloading crate files isn’t fast either.

The obvious solution seems to be to pre-load and cache the index, and maybe also to preload the crates needed by the build, or at least some of the most popular ones.

What would be the best way to do it?

I’ve found a few ways, but the best ones are a bit too fiddly to fit nicely into the one-liners of Dockerfiles or the YAML definitions that CI platforms use.

  • Running cargo search whatever populates the index as a side effect. It feels a bit hacky, and could stop working if Cargo changed its implementation.

  • Taking Cargo.toml + Cargo.lock from a project, replacing lib.rs with an empty file, and running cargo build has the nice effect of pre-building all required dependencies. The downside is that it is a bit fiddly to do with a bash one-liner. It also makes the container itself depend on the particular project built in it, which seems like a layering violation.

  • Putting ~/.cargo in a custom distro package, e.g. apt-get install my-cargo-index. That integrates neatly with existing build systems that install and cache system dependencies anyway. However, it is an extra hassle to maintain a package for the index.
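
The first two ideas could be wrapped in small shell functions so that a Dockerfile or CI definition only needs to call them. This is just a rough sketch: the names warm_index and warm_deps are made up, serde is an arbitrary search term, the project is assumed to be a library crate with a committed Cargo.lock, and whether cargo search really fills the index depends on the Cargo version in use.

```shell
#!/bin/sh
# Hypothetical helpers for warming Cargo's caches while building an image.

# Populate the registry index as a side effect of a search.
# Relies on Cargo behaviour that is not guaranteed and could change.
warm_index() {
    cargo search --limit 1 serde >/dev/null
}

# Build only the dependencies of the project in $1, using a scratch copy
# of its manifest and lockfile plus an empty library target.
# Downloads land in CARGO_HOME; compiled artifacts are only reused later
# if CARGO_TARGET_DIR points at a directory shared with the real build.
warm_deps() {
    project_dir=$1
    scratch=$(mktemp -d) || return 1
    cp "$project_dir/Cargo.toml" "$project_dir/Cargo.lock" "$scratch/" || return 1
    mkdir -p "$scratch/src"
    : > "$scratch/src/lib.rs"      # empty stand-in for the real sources
    (cd "$scratch" && cargo build --release)
}
```

Usage would be something like warm_index followed by warm_deps /path/to/project in an image build step. For a binary crate the stand-in would need to be a src/main.rs containing fn main() {} instead of an empty lib.rs.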

Are there other approaches? How do you solve this problem?


#2

What CI system are you using? It should have a built-in ability to cache directories across builds.


#3

I have run into the same problem in a couple of different projects. One uses Travis with Docker (and Travis appears to ignore its standard cache directories when Docker is used). The other is home-grown and mounts persistent directories using a network filesystem that isn’t supported by libgit2, so Cargo can’t clone there.

So in both cases the obvious solutions for preserving CARGO_HOME aren’t so obvious :)


#4

For Travis, you should just need to volume-mount the cache directories into the container.
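
Roughly, the mount could look like the sketch below. The function name build_in_container is made up, and it assumes the official rust image, where CARGO_HOME is /usr/local/cargo; other images may keep the registry somewhere else.

```shell
#!/bin/sh
# Hypothetical wrapper: run the build inside a container while keeping
# Cargo's registry in a host directory that the CI system caches.
# Assumes the official `rust` image (CARGO_HOME=/usr/local/cargo).
build_in_container() {
    cache_dir=$1     # host directory listed in the CI cache settings
    src_dir=$2       # project checkout on the host
    mkdir -p "$cache_dir/registry" || return 1
    docker run --rm \
        -v "$cache_dir/registry:/usr/local/cargo/registry" \
        -v "$src_dir:/app" \
        -w /app \
        rust:1 cargo build --release
}
```

Here cache_dir would be one of the directories the CI system is told to cache. Files written from inside the container may end up owned by root on the host, which could matter when the CI system packs the cache afterwards.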


#5

I’ve tried that, and I was unable to make it work. Regardless of what Travis can do, the question still remains for CI solutions that rely on Docker layers for caching.


#6

Travis supports caching build artifacts for this purpose. You can combine it with the docker volume mount (mentioned above) to actually cache stuff for reals.

The Travis cache works by creating a tarball of the cache directories after a build completes, and sending it to permanent storage (S3, etc). When a build starts, the latest tarball is pulled from permanent storage and extracted to your build directory. What’s really nice is that this complexity is all handled by Travis itself; all you have to do is enable the feature and give it some directory paths to cache.

There is one little problem you’ll run into if the build system relies on timestamps for cache invalidation, like GNU make and similar. Because the artifact is extracted from a tarball before every build, all of the cache files will always have an older timestamp than the source code from git, which of course means that the cache is never used by systems like GNU make. (You have to do some silly hacks to get make to use file content hashes instead of timestamps to get it working.)
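
One possible shape for such a hack, as a sketch only: keep a stamp file per source file in the cached directory, and rewrite it only when the content hash changes. The name update_stamp is made up, and sha256sum is assumed to be available.

```shell
#!/bin/sh
# Hypothetical helper: refresh STAMP only when the content of SRC changes,
# so the stamp's timestamp stays old across cache restores and checkouts.
update_stamp() {
    src=$1
    stamp=$2
    new_hash=$(sha256sum "$src" | cut -d ' ' -f 1)
    [ -n "$new_hash" ] || return 1                 # hashing failed
    old_hash=$(cat "$stamp" 2>/dev/null || true)   # empty on first run
    if [ "$new_hash" != "$old_hash" ]; then
        printf '%s\n' "$new_hash" > "$stamp"
    fi
}
```

The Makefile rules would then depend on the stamp files rather than on the sources directly, with update_stamp run over the sources before make is invoked. A fresh checkout changes the source timestamps but leaves the stamps alone unless the content really differs.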

annnnyway, I don’t think I’ve ever touched any file in the cargo cache. And I don’t actually know whether it cares about timestamps or not.


#7

  1. It works exactly as you describe for Travis targets that don’t have services: docker. For targets with services: docker the cached directories just aren’t there on the next build (whether I mount them or not). Maybe it’s just due to a mistake in my setup, but I’ve wasted days poking Travis and trying different paths and hacks. In the end it was easier to give up on Travis’ cache that-is-supposed-to-work-but-doesn’t and rely on baking the cache into the Docker image.

  2. My biggest problem is with my company’s in-house CI, which has only two options: a) Docker cache, b) a network mount unsupported by libgit2/cargo. So I’m very interested in non-Travis solutions.


#8

That’s really strange. We use it with services: docker and it’s been great. Here’s a snippet of our .travis.yml if it helps?

sudo: required
dist: trusty

language: go

go:
- "1.9.1"

services:
- docker

# A place to store cached images
cache:
  directories:
  - $TRAVIS_BUILD_DIR/cache/
  - $TRAVIS_BUILD_DIR/vendor/

before_install:
- docker login -u "$DOCKER_USERNAME" -p "$DOCKER_PASSWORD" ...

script:
- make && make test && ./dockerize

Yep, there’s that pesky gnu-make, about to implode the cache if not for stupid stuff hidden in Makefile. Anyway, the dockerize script is basically just a few niceties wrapped around docker build and docker push.

Perhaps what you’re missing is that you aren’t able to mount a volume as part of the docker build step, which means that you can’t pull cached resources into the image (aside from the obvious base image). We do our compilation outside of docker, and simply copy the binary to the image as one of the [very few] build steps in Dockerfile.

Speaking of caching base images, you can also do some hacks with multi-stage builds if you structure your Dockerfile appropriately. We have also used this strategy in some other projects, especially with Node.js and Python, since dependency management is a nightmare and we don’t want to end up with compilers and build dependencies in our production containers. This article describes multi-stage builds in more detail.
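
For a Rust project, the multi-stage idea could be combined with the dependency pre-build trick from the first post. Below is a minimal sketch, written as a shell function that emits a Dockerfile so it can be dropped into a build script. Everything specific in it is an assumption: the function name write_dockerfile, the binary name myapp, the image tags, and a crate that consists of a single binary with its sources in src/.

```shell
#!/bin/sh
# Hypothetical generator for a two-stage Dockerfile. `myapp` and the image
# tags are placeholders; the crate is assumed to be a single binary.
write_dockerfile() {
    cat > "$1" <<'EOF'
FROM rust:1 AS builder
WORKDIR /app
# Dependency layer: rebuilt only when the manifest or lockfile changes.
COPY Cargo.toml Cargo.lock ./
RUN mkdir src && echo 'fn main() {}' > src/main.rs && cargo build --release
# Application layer: the real sources replace the stub.
COPY src ./src
RUN touch src/main.rs && cargo build --release

# Runtime stage: no compiler or build dependencies.
FROM debian:stable-slim
COPY --from=builder /app/target/release/myapp /usr/local/bin/myapp
CMD ["myapp"]
EOF
}
```

Calling write_dockerfile Dockerfile and then docker build . should let Docker reuse the dependency layer as long as Cargo.toml and Cargo.lock are unchanged. The touch is there so that Cargo sees the real main.rs as newer than the stub it compiled earlier.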


#9

Random idea: I don’t know if it runs early enough, or can update the cache in time for cargo to use the result, but maybe a build.rs that pulls down a cache tarball (or similar) if the cargo directory is (more or less) empty?

Otherwise it might be a cargo addition/plugin to do the same, earlier.
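
As a stand-in for such a plugin, the same check could be a plain shell step that runs before cargo is invoked. A minimal sketch, assuming the tarball is already available as a local file (fetching it is left out) and was created relative to CARGO_HOME; the name restore_cargo_cache is made up.

```shell
#!/bin/sh
# Hypothetical pre-build step: unpack a cache tarball into CARGO_HOME when
# the registry directory is missing or empty.
restore_cargo_cache() {
    tarball=$1
    cargo_home=${CARGO_HOME:-$HOME/.cargo}
    if [ -n "$(ls -A "$cargo_home/registry" 2>/dev/null)" ]; then
        return 0                   # cache already populated, nothing to do
    fi
    mkdir -p "$cargo_home" || return 1
    tar -xzf "$tarball" -C "$cargo_home"
}
```

A CI job would call restore_cargo_cache /path/to/cache.tar.gz before the first cargo command. An existing, non-empty registry is left untouched, so the step is harmless when a real cache is already in place.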