PSA: Please use absolute URLs in crate READMEs

Many READMEs of crates contain relative URLs such as [the LICENSE file](./LICENSE) or embedded images ![logo](../../assets/logo.png).

The problem is that these URLs are meant to be relative to the repository, but when they're published in a crate, they lose their original context. Crates are published as tar.gz files, and they don't have URLs for individual files. Some crates use paths to parent directories (../LICENSE) which don't even refer to files in the crate tarball. READMEs in published crates are technically full of broken links.

I assume most crate authors expect these URLs to be fixed by resolving them to the same URLs that GitHub (and other hosts) produce. This is surprisingly hard to do correctly:

  • Sites like GitHub and GitLab rewrite URLs in Markdown in non-trivial ways, because they have their own URL scheme for linked documents, and another scheme for a CDN that serves images. Resolving links for GitHub/GitLab-dependent READMEs requires reverse-engineering what these sites do.

  • URLs need to contain a git commit hash (not all crates have this info, and it's not guaranteed to be correct) or the name of the main branch. The main branch can be renamed, and finding the correct name requires API calls that query the repo.

  • Cargo.toml only contains the URL of the repository, not the path of the crate within it (if it's a monorepo). Correct support for repo-relative URLs requires cloning the repository, scanning its directory structure, and parsing all Cargo.toml files in it to find where the crate lived in the repo. The mapping from a path in the repo to a URL is again not a simple repo-relative URL, but a custom directory scheme that varies between code hosts.

  • There are plenty of edge cases: Cargo.toml containing readme = "../README", symlinks, relative paths in the README (../../assets/logo.png), and absolute paths such as /logo.png that can't simply be interpreted as absolute paths per the URL spec.

crates.io has some GitHub-specific rewriting code that works in the most common cases, but everyone with a non-default configuration, or a different code host, is out of luck. Sadly, there's no standard for mapping in-repository paths to their public web URLs, so crate READMEs require all these fiddly fixups, and every code host does it slightly differently.
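
To make the fiddliness concrete, here is a rough sketch of what resolving one README link for GitHub might look like. This is not the crates.io code. The helper name github_url and its parameters are made up for illustration, and it assumes the README sits in the crate's directory, that the branch or commit (ref) and the crate's path in the repo are already known, and that GitHub's usual /blob/ and raw.githubusercontent.com layouts apply.

```python
import posixpath


def github_url(owner, repo, ref, crate_dir, link, is_image=False):
    """Map a README-relative link to a GitHub URL (hypothetical helper).

    crate_dir is the crate's path inside the repository, e.g. "crates/foo",
    or "" when the crate lives at the repository root.
    """
    if link.startswith("/"):
        # Assumption: a leading slash means "repository root", not the
        # root of the web site as the URL spec would have it.
        path = posixpath.normpath(link.lstrip("/"))
    else:
        # Resolve ./ and ../ segments relative to the README's directory.
        path = posixpath.normpath(posixpath.join(crate_dir, link))

    if path == ".." or path.startswith("../"):
        raise ValueError(f"{link!r} points outside the repository")

    if is_image:
        # Images are served from a separate host with a different layout.
        return f"https://raw.githubusercontent.com/{owner}/{repo}/{ref}/{path}"
    return f"https://github.com/{owner}/{repo}/blob/{ref}/{path}"
```

Even this toy version needs four pieces of information (owner, repo, ref, crate path) that the published tarball does not reliably carry, and it covers only one code host.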

Would it work to have cargo package and cargo publish handle, or warn about, links that point outside of the crate folder?

Either warn or fail on it, effectively do not allow it, and have the user make sure they are only referencing files inside the crate folder.

Or have the commands copy over the file from the folder and update the link to not be outside of the crate folder.

Of course this would only solve the problem for newly published crates and crate versions, not for existing ones. But making sure at least new crates work properly here would be good.
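
As a rough illustration of the "warn" option, a check like this could flag the problematic links. It is only a sketch, not anything Cargo does today. The function name links_outside_crate is invented, the regex only understands inline [text](target) and ![alt](target) links, and it assumes the README sits at the root of the crate folder.

```python
import posixpath
import re

# Inline Markdown links and images: [text](target) or ![alt](target).
LINK_RE = re.compile(r"!?\[[^\]]*\]\(\s*([^)\s]+)")
# Anything that starts with a URL scheme such as https: or mailto:.
SCHEME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*:")


def links_outside_crate(readme_text):
    """Return link targets that cannot refer to a file inside the crate."""
    escaping = []
    for target in LINK_RE.findall(readme_text):
        if SCHEME_RE.match(target) or target.startswith(("#", "//")):
            continue  # absolute URL, protocol-relative URL, or in-page anchor
        path = target.split("#", 1)[0].split("?", 1)[0]
        if path.startswith("/"):
            # Root-relative paths have no meaning inside a tarball.
            escaping.append(target)
            continue
        normalized = posixpath.normpath(path)
        if normalized == ".." or normalized.startswith("../"):
            escaping.append(target)
    return escaping
```

Run on the examples from the first post, it would leave ./LICENSE alone and report ../../assets/logo.png.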

I don't know about GitLab's support, but at least on GitHub, they set HEAD correctly, so you can use git ls-remote to get the commit SHA for the HEAD at time of resolution (and probably git symbolic-ref to determine if HEAD is just pointing at a branch).
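
For remote repos, one way to get both pieces in a single call is git ls-remote --symref <url> HEAD, which prints the branch HEAD points at alongside the commit SHA. A small sketch follows; the function names are made up, and it assumes the usual tab-separated ls-remote output.

```python
import subprocess


def parse_symref_head(output):
    """Extract (branch, sha) for HEAD from `git ls-remote --symref` output.

    branch is None when the remote does not report a symbolic HEAD.
    """
    branch = None
    sha = None
    for line in output.splitlines():
        left, _, name = line.partition("\t")
        if name != "HEAD":
            continue
        if left.startswith("ref: "):
            ref = left[len("ref: "):]
            prefix = "refs/heads/"
            branch = ref[len(prefix):] if ref.startswith(prefix) else ref
        else:
            sha = left
    return branch, sha


def remote_head(repo_url):
    """Query the remote (needs network access and git on PATH)."""
    result = subprocess.run(
        ["git", "ls-remote", "--symref", repo_url, "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return parse_symref_head(result.stdout)
```

This only tells you where HEAD is at the time of the query, not where it was when the crate was published.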

It would be nice if URLs were resolved before publishing. That's probably the best moment to do this, since Cargo has access to the local repository, commit hash, all the paths, etc.
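
Something along these lines is what I have in mind, as a sketch only. The function name absolutize and the base URL are placeholders, a single base for both links and images is a simplification (real hosts serve images from a different scheme), and root-relative /paths are left untouched because plain URL joining would resolve them against the site root rather than the repo.

```python
import re
from urllib.parse import urljoin

# Captures the "[text](" or "![alt](" prefix and the link target separately.
LINK_RE = re.compile(r"(!?\[[^\]]*\]\()([^)\s]+)")


def absolutize(readme_text, base_url):
    """Rewrite relative inline Markdown links against base_url.

    base_url should end with "/" and point at the README's directory
    for a fixed commit.
    """
    def fix(match):
        prefix, target = match.group(1), match.group(2)
        if target.startswith(("#", "/")):
            return match.group(0)  # in-page anchor or root-relative path
        # urljoin leaves absolute URLs (https:, mailto:, ...) unchanged.
        return prefix + urljoin(base_url, target)

    return LINK_RE.sub(fix, readme_text)
```

Done at publish time, the base URL could be built from the local commit hash and the crate's path in the repo, so nothing has to be guessed later.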

I'd emit it as a warning, because relative URLs are so common that a hard error would break the release process for a large chunk of the ecosystem.

As part of mdbook-linkcheck I wrote a general linkchecking crate that cargo could use if it needs more advanced logic than "look for all the links starting with ./".

I've opened a Cargo issue for this:
