Specifying multiple git mirrors for Cargo git dependencies

You don't have to agree on what counts as the "official" one. You just have to agree on the one you want to use. (And be able to trust the hash function, but we digress.)

(We don't consider our own GAnarchy development repo as the "official" one, although nobody else has contributed their own GAnarchy development repos to any of our instances. It's just one development repo in a sea of decentralized development repos.)

But who knows when our repos are going to go down? And who knows who else depends on stuff we worked on? Single points of failure are no good for anyone.


Yes, I do. The "official one" is a consensus point for code reviewers and contributors to converge on using the exact same version, instead of having everyone use a slightly different fork of the code, except for the few unlucky people who picked the version with malware in it.

If someone slips malware into serde, I want everyone to catch it, so that people will actually notice and fix it.

“Don’t put all your eggs in one basket” is all wrong. I tell you “put all your eggs in one basket, and then watch that basket.” Look round you and take notice; men who do that do not often fail. It is easy to watch and carry the one basket. It is trying to carry too many baskets that breaks most eggs in this country. He who carries three baskets must put one on his head, which is apt to tumble and trip him up.

— Andrew Carnegie, according to https://quoteinvestigator.com/2017/02/16/eggs/

Note that I am not trying to argue against:

  • Content mirroring. Once you have a SHA hash that's considered "official", you should use as many mirrors as you can. The reliability requirements of file servers vs the attention economics of open source code review are qualitatively different, so it shouldn't be a surprise that they have different needs.

  • Cryptographic hashing. Your Cargo.lock file already has a content hash. This makes using mirrors easier and more reliable, too.
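
To make the content-hash point concrete, here is roughly the shape of a Cargo.lock entry for a registry dependency today; the version and checksum values below are placeholders, not real ones:

    [[package]]
    name = "serde"
    version = "1.0.0"
    source = "registry+https://github.com/rust-lang/crates.io-index"
    # Content hash of the downloaded .crate file (placeholder value).
    checksum = "0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef"

Because the hash is checked after download, it shouldn't matter which mirror actually served the file.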


Note that the idea includes pinning to a particular git commit, so (setting aside preimage attacks, or assuming some mechanism like commit-signing is used to prevent them) everyone is still building the exact same code, even if they retrieve it from different locations.
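
For reference, pinning a git dependency to a particular commit is already expressible in Cargo.toml; the URL and commit hash below are placeholders:

    [dependencies]
    # Pin to one exact commit; any mirror serving this commit yields identical code.
    some-dep = { git = "https://example.org/some-dep.git", rev = "9f2c1e4a7b3d5e6f8a9b0c1d2e3f4a5b6c7d8e9f" }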

We want people to be using a fork from someone they actually trust, so they're unlikely to get anywhere near the malware in the first place. That's the thing with chains of trust: they allow the whole chain to re-review things over and over. You're far more likely to catch malware that way than by having a handful of people review a thing while everyone else assumes it's safe.

You can use something like IPFS. Several projects exist for using Git over IPFS, or the packaged crate could simply be published on IPFS. The address of something in IPFS is a hash, so it's safe as long as you have the right address.

However, the addresses of the crates in Cargo.toml would look like http://127.0.0.1:8080/ipfs/..., and you would need an IPFS node installed, or you would have to use a public IPFS gateway, which would be centralized.
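
As a sketch of what that would mean in practice, such a declaration might look like the following; the gateway port, CID, and repository name are placeholders:

    [dependencies]
    # Git repository fetched through a local IPFS gateway (illustrative only).
    some-dep = { git = "http://127.0.0.1:8080/ipfs/QmExampleCid/some-dep.git", rev = "9f2c1e4a7b3d5e6f8a9b0c1d2e3f4a5b6c7d8e9f" }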

Sadly, IPFS does not seem ready for real-life use yet... (it is slow, and complicated to use, understand, and interface with).

Everyone should be able to build your crate, even if they don't have a DHT node installed; in that case they could fall back to the centralized default repo. Cargo's mission is not to be a DHT client, so we would need a standard for using DHTs at the OS level, plus a way to (optionally) put the crates' hashes in Cargo.toml. The only problem is that DHTs like IPFS claim to be universal, but today we don't have the tools to use them universally.
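
Purely as a hypothetical sketch of that last idea -- neither the hash key nor any OS-level DHT integration exists in Cargo today:

    [dependencies]
    # Hypothetical syntax: an optional content hash that an external,
    # DHT-aware fetcher could use to locate and verify the crate.
    some-dep = { version = "0.1.0", hash = "sha256:0123456789abcdef" }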

I don't disagree with this, but I also don't think it actually contradicts my claim that good open source software usually has an "official version", and that the clearer it is which one is official, and the more collaboration there is between the official version and its downstreams, the better off everyone is.

Prove it.

Are people in the middle of the chain of custody allowed to modify the code? Because if so, none of the reviews upstream of them count.

Thinking of a specific example to back my point: the Debian Linux distribution famously* injected a security flaw into their version of openssl while trying to fix a more minor fault. If more people had been reviewing not just "a version of openssl" but the exact code that Debian was distributing, that fault might have been caught earlier.

* at least, if you follow LWN


Not if they want to keep commit-compatibility, at least. (They can merge and add their own changes on top, but that still includes the original commit, so you can generally assume it's been more-or-less reviewed.)

Rewriting Git history is not what I'm worried about. I'm worried about new commits that silently break everything being introduced by downstream without anybody noticing, because obscure downstreams don't get much public review.


It really seems like this whole idea of decentralizing a git repository on a blockchain (which really seems like a waste of computational resources... why not just decentralize GitHub?) hasn't been thought out very well. Git was never designed to be put on a blockchain.

SHAttered doesn't affect git because git only uses SHA-1 for commits, not for anything else. For commit signing, public-key cryptography (GPG keys and such) is used. For file hashing -- if Git even supports that -- I'm pretty positive it uses SHA-2 or SHA-3, and thus far no one has managed to create collisions in SHA-2 or SHA-3 consistently. Furthermore, creating a collision with a SHA-1 commit hash has no benefit, so no one has actually done it; there is nothing to gain from it. You get no special superpowers by colliding with an existing commit hash in the repository, and in such an instance I assume git is more likely to regenerate the hash using some extra factor and create an entirely unique hash.

Changing to a more secure hash algorithm for commit hashes is nonsensical and a waste of computational resources (not to mention it would break every git repository in use today unless there were a compatibility mechanism -- and that's assuming such a change were even accepted to begin with, which is unlikely). Git does not use SHA-1 in a security-sensitive context anywhere.
If you can prove to me that SHAttered does in fact affect actual git usage today (not just this theoretical blockchain implementation), and that it affects it in a security-sensitive context where git still, for whatever reason, uses SHA-1, then I will accept your premise that SHAttered affects git and that the algorithm should be changed. However, I can't imagine a scenario where the hash algorithm would ever become a problem, especially considering that it's only used for commit hashes and nowhere else. But if you've got proof, please show us all, because a change to the hash algorithm would affect everyone here.
Furthermore, I'm still confused about how your supposed "consensus" algorithm would work. There is no true "consensus" algorithm in git, because there has never been a need for one: git sends requests to a git server, and the server sends files and information back, possibly signed and hashed with an actually secure hash function like SHA-2/3 or BLAKE2/3. Perhaps git verifies the signature, perhaps it doesn't (I'm not very clear on that). It undoubtedly verifies the hash of each file it gets, but beyond that it doesn't need to "agree" with the server on anything other than "Hey, does my copy of the repository match yours?"
Furthermore, I get the feeling that there's a critical misunderstanding of the blockchain in play here. Data on a blockchain is permanent: once stored, it cannot be erased. But Git history is mutable: I can squash two commits, and both will be merged into one, and one of them is going to vanish. This is a good thing: you've merged both commits, so there's no point in keeping the extra one around, and the same applies to every commit in a squash. A blockchain would make squashing impossible, or would lead to a bloated git history along the lines of "Oh, there's this squashed set of commits here, but then here are all the individual commits that make up the merged one, too." The entire point of a squash is to eliminate unnecessary commits.

No one wants to see that I updated a project's documentation a hundred times in 24 hours, adding a large amount of information and then going back to fix small typos and other grammatical errors and inconsistencies; they only want to see that I altered the documentation in some way. And though this example might be exaggerated, it applies to everything: if I add a driver to an OS and then forget to format the code, no one wants to see a separate commit just to format the code; they just want to see that I added the driver. Perhaps Git will tell them "Hey, there was this squashed commit where the code was formatted," but people certainly don't want it appearing as an entirely separate commit.

I think you have misunderstood. Soni was only observing that Git already operates analogously to a blockchain in one specific way: Both git and blockchain protocols use Merkle trees to ensure that if we have the same hash value for our current state, then (in the absence of hash collisions – yes, this is a major caveat) we must agree on the entire history leading up to that state.

Cargo.lock already pins git dependencies to specific commit hashes, so rewriting the history of a repository (e.g. by squashing) already has the potential to break downstream Cargo projects that use that repo as a git dependency. People who publish crates as git dependencies are therefore already constrained in their ability to throw away old commits.
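
For concreteness, a Cargo.lock entry for a git dependency records the exact commit it was resolved to; the URL and hash below are placeholders:

    [[package]]
    name = "some-dep"
    version = "0.1.0"
    # The fragment after '#' is the resolved commit hash.
    source = "git+https://example.org/some-dep.git#9f2c1e4a7b3d5e6f8a9b0c1d2e3f4a5b6c7d8e9f"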

Again, all of this above is just describing how git and Cargo already work today. None of it would change by adding the ability for Cargo to fetch the same commit from multiple git mirrors.

This topic seems to be generating more heat than light. As a moderator, I'd like to ask everyone to read what's already been discussed carefully, and comment only if you are adding more information or asking specific questions. Or, just skip the thread if it doesn't address a problem you are interested in.


There's lots of debate to be had about the security of using git hashing as a security mechanism and also about webs of trust and distributed development.

But overall I think what Soni is asking for is not unreasonable at all. Apparently others have thought so too. I found this: https://github.com/rust-lang/cargo/issues/7497


I apologize if my post added to the heat. I'm just failing to understand precisely what Soni is actually trying to get at. It seems like they go in one direction and then completely switch tracks and start going down another, and I'm not seeing the connection between any of the things they're talking about.
As for the issue referenced above, couldn't that just be solved with a commit reference in the dependency table itself? I thought cargo already allowed you to pin git deps to commits. Maybe that was branches.
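
Something like this is what I have in mind -- if I'm reading the manifest format right, branch, tag, and rev are all accepted keys (URLs and hash are placeholders):

    [dependencies]
    # Track a branch (moves over time) vs. pin an exact commit (immutable).
    dep-a = { git = "https://example.org/dep-a.git", branch = "main" }
    dep-b = { git = "https://example.org/dep-b.git", rev = "9f2c1e4a7b3d5e6f8a9b0c1d2e3f4a5b6c7d8e9f" }
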
(Also, might I suggest renaming this topic? I find the title kind of clickbait-ish; it doesn't adequately describe what the OP is trying to get at, the original post is really confusing to me, and further posts aren't clarifying much, unfortunately.)

Good idea. Done.

Pretty much, if you have two commits with the same hash, then (in the absence of hash collisions) they are the exact same, regardless of where you get them from.

And while you can edit a repo, you can't edit a commit. You can throw away a commit, but then you break downstream users - unless those users have someplace else they can fetch that commit from.

All we were asking for was a way to tell cargo about many of those someplace-elses.
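
Purely as a sketch of the idea -- none of this is real Cargo syntax today -- it could look something like listing several equivalent locations for one pinned commit:

    [dependencies]
    # Hypothetical syntax (not supported by Cargo today): any listed mirror
    # may serve the pinned commit, and the rev guarantees identical code.
    some-dep = { git = [
        "https://example.org/some-dep.git",
        "https://mirror.example.net/some-dep.git",
    ], rev = "9f2c1e4a7b3d5e6f8a9b0c1d2e3f4a5b6c7d8e9f" }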


This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.