Etiquette of flagging crates as typos/confusables

I was thinking about adding some protection against typosquatting on lib.rs: detect when crates have confusingly similar names, which could mislead users into installing the wrong crate, and suggest using another crate instead. This matters most from a security perspective, but it could also help when crates have unusual spellings, or when a crate name matches a common search keyword but isn't the best choice for that keyword (e.g. an abandoned request crate vs the creatively spelled reqwest).

However, this problem overall turns out to be quite icky.

First, it's really hard to define what is "confusingly similar":

  • By an edit distance function ttf and fft are typos, but one could argue that to users interested in either of these things the difference is clear. There's also a bunch of embedded hal crates for chipsets with gibberish model names that differ only by a digit.
  • Is -rs/-rust/cargo- prefix/suffix too similar? Singular vs plural forms? Words swapped?
  • There are also crates that are intentionally named similarly, because they have a similar purpose (e.g. fasta/fastq/fastx, or variations of sdl2).
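To make the first bullet concrete, here is a minimal sketch of a plain Levenshtein distance in Rust. The function name `levenshtein` is made up for illustration, and this is not what lib.rs actually runs. The choice of metric matters: plain Levenshtein counts two swapped letters as two edits, while Damerau-style variants count an adjacent swap as one.

```rust
/// Plain Levenshtein distance: the minimum number of single-character
/// insertions, deletions, and substitutions needed to turn `a` into `b`.
/// (Illustrative helper, not an existing lib.rs or crates.io function.)
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();

    // `prev[j]` is the distance between the first `i` chars of `a` and the
    // first `j` chars of `b`; only the previous row has to be kept around.
    let mut prev: Vec<usize> = (0..=b.len()).collect();

    for (i, &ca) in a.iter().enumerate() {
        let mut curr = Vec::with_capacity(b.len() + 1);
        curr.push(i + 1); // distance from a[..=i] to the empty string
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr.push(substitution.min(deletion).min(insertion));
        }
        prev = curr;
    }
    prev[b.len()]
}
```

Under this metric request/reqwest and fasta/fastq both come out at distance 1, which is why a bare threshold cannot tell an intentional family of names from a confusable one.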

There's a trade-off between the rate of false positives and the ability to effectively detect typosquatting. With 150K+ crates, getting this right is a daunting task. The line is blurry and subjective, so even manual moderation can't guarantee that everyone will be happy.

Secondly, any action based on this has very unpleasant implications.

There has been a real typosquatting attack on crates.io by the rustdecimal crate squatting rust_decimal, so at least a difference of _ between words could be considered a problem.
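A difference that consists only of separators is at least easy to check mechanically. A minimal sketch, assuming a made-up normalisation that lowercases the name and drops `-` and `_` (the helper names `separator_key` and `differs_only_by_separators` are hypothetical, not an existing API):

```rust
/// Hypothetical normalisation: lowercase the name and drop `-` and `_`,
/// so that names differing only in word separators get the same key.
fn separator_key(name: &str) -> String {
    name.chars()
        .filter(|c| *c != '-' && *c != '_')
        .flat_map(|c| c.to_lowercase())
        .collect()
}

/// True when two distinct names collapse to the same key.
/// This only compares spelling; it says nothing about intent.
fn differs_only_by_separators(a: &str, b: &str) -> bool {
    a != b && separator_key(a) == separator_key(b)
}
```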

But there's the iter_tools crate. It adds a small tweak over itertools. It has basically just one user, but it seems to be maintained, and doesn't show any malicious intent. Would it be appropriate to mark this crate as having a too-similar name?

In the abstract, people quote Sturgeon's law. Many Rust users, and probably even more non-users, express a sentiment that from a supply-chain-security standpoint they don't want to rely on crates from "some random person". However, turning that into any policy is easily going to insult 90% of crate authors.

An actual typosquatting attack would likely first publish a legitimate-looking crate and wait until there are enough users, or until a particular target has been tricked into using it, before turning the crate malicious. But this means that even legitimate-looking crates need to be flagged as a potential problem. And that can easily be interpreted as a serious accusation based on a subjective rule.

So I don't see how I can do anything about typosquatting without causing a shitstorm.

Yes. And "has a potentially confusing name similar to {whatever}" seems like a great addition to lib.rs. I would appreciate that so I can choose names that don't confuse my audience. In other words, that seems useful for non-malicious crate developers.

Would something Bayesian with a flagging tool work for identifying "similar confusing name"? Sort of like the way the original spam filters worked?

Just a warning about similar names would be good as a first iteration. Within a certain edit distance threshold: "{x} crates have similar names. {popular_crate} has the most downloads among those." All crates would show such a warning, assuming x is non-zero, irrespective of the popularity of the crate. As an example, both tokio and tokyo would show the warning, not only tokyo. That should at least step around the problem of insulting people, but it might generate more warnings than you would like.
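A rough sketch of that first iteration, under some loud assumptions: plain Levenshtein as the distance, a made-up `CrateInfo` struct, and the most-downloaded crate picked from the whole group including the page's own crate. The function names are hypothetical and none of this is lib.rs code.

```rust
/// Hypothetical record; a real index would carry far more metadata.
struct CrateInfo {
    name: &'static str,
    downloads: u64,
}

/// Plain Levenshtein distance (insertions, deletions, substitutions).
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    for (i, &ca) in a.iter().enumerate() {
        let mut curr = vec![i + 1];
        for (j, &cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            curr.push(substitution.min(prev[j + 1] + 1).min(curr[j] + 1));
        }
        prev = curr;
    }
    prev[b.len()]
}

/// Builds the same neutral notice for every crate in a similar-name group,
/// regardless of which crate in the group is the popular one.
fn similar_name_notice(
    current: &str,
    registry: &[CrateInfo],
    max_distance: usize,
) -> Option<String> {
    // The group includes `current` itself (distance 0) if it is in the registry.
    let group: Vec<&CrateInfo> = registry
        .iter()
        .filter(|c| levenshtein(c.name, current) <= max_distance)
        .collect();
    let others = group.iter().filter(|c| c.name != current).count();
    if others == 0 {
        return None; // x is zero, so no notice is shown
    }
    let top = group.iter().max_by_key(|c| c.downloads)?;
    Some(format!(
        "{} crates have similar names. {} has the most downloads among those.",
        others, top.name
    ))
}
```

With a threshold of 1, the tokio and tokyo pages would both get the same sentence naming whichever of the two has more downloads, and a crate with no near neighbours gets nothing.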

One signal that could be used is that if they're distinct things, the public API of each will be almost completely different. That criterion won't do by itself for flagging typosquatting, though, because it will be a false positive on "fork and make an improvement", or "that's C bindings, this is a reimplementation in Rust"; and a false negative on malware that is attempting to catch typos at the cargo add point only and doesn't bother to resemble the real thing.

I wonder if you could run some sort of machine learning (by which I do not mean just “ask an LLM”, but perhaps some form of clustering) on the public API/docs and get interesting results that aren’t solely malware detection. For example, someone might be actively interested in discovering forks or reimplementations if they’re unsatisfied with the library they're currently using.

This is probably an entire research project, though, so more practically, I like the idea of reporting on similar crates without necessarily having a “warning” presentation, just “FYI you might be looking for this”.

It's certainly a difficult problem. I've mistakenly installed the wrong package by running cargo install nextest instead of cargo install cargo-nextest. This is a case of self-typosquatting as prevention, though.

It would be interesting if there was a variant of cargo, crates.io, etc. which would only allow adding packages that are "trusted". I could log into crates.io and whitelist a small set of authors and crates for direct or indirect dependencies. You could achieve this by self-hosting your registry, but that seems like its own rabbit hole.

I'm afraid that it would be actively dangerous, because potential typosquatters and spammers could get free advertising on the pages of popular crates.

When there is a significant difference in factors like downloads, rev deps, age, owners' popularity, etc., it's relatively easy to decide which crate is the expected one. When both are similarly unpopular then maybe both could show "did you mean 'the other one'?".
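As a sketch of that decision, assuming downloads are the only factor (rev deps, age and owners' popularity are left out for brevity) and a made-up `dominance_ratio` parameter that should be well above 1. The `Hint` type and `hint_for_pair` function are hypothetical names:

```rust
/// Which way a "did you mean" hint should point for a pair of similar names.
#[derive(Debug, PartialEq)]
enum Hint<'a> {
    /// One crate clearly dominates: only the other crate's page points to it.
    PointTo(&'a str),
    /// Neither stands out: both pages may ask "did you mean the other one?".
    Mutual,
}

/// Each argument is a `(name, downloads)` pair.
fn hint_for_pair<'a>(
    a: (&'a str, u64),
    b: (&'a str, u64),
    dominance_ratio: u64,
) -> Hint<'a> {
    let (name_a, downloads_a) = a;
    let (name_b, downloads_b) = b;

    // Zero downloads are treated as one, so that a handful of downloads
    // does not count as "dominating" a crate nobody has used yet.
    let dominates = |x: u64, y: u64| x >= y.max(1).saturating_mul(dominance_ratio);

    if dominates(downloads_a, downloads_b) {
        Hint::PointTo(name_a)
    } else if dominates(downloads_b, downloads_a) {
        Hint::PointTo(name_b)
    } else {
        Hint::Mutual
    }
}
```

The point of the one-directional case is that the popular crate's page never advertises the obscure one.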

Without having checked I assume the 1-3 letter space to be somewhat densely packed. So there'll be low edit distances between a lot of things in general. Edit distance should become more useful in a more sparse space.

Maybe a more neutral "in case you're here due to a typo: top N similarly-named crates" list + whatever other reputation indicators. Without explicitly saying "this isn't the crate you're looking for".

Maybe take inspiration from Wikipedia and use "Not to be confused with":

Article on Xeon:

Not to be confused with Xenon or Intel Xe.

Not true for those hal crates either. They likely have very similar APIs.

For the hal crates, are the most similar crates maintained by the same people? If so, you could ignore similar crates with the same set of owners.
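A minimal sketch of that filter, assuming owners are available as plain login strings (the function names and the owner names in the usage are made up):

```rust
use std::collections::BTreeSet;

/// True when both crates have exactly the same non-empty set of owners,
/// regardless of the order in which the owners are listed.
fn same_owners(a: &[&str], b: &[&str]) -> bool {
    let a: BTreeSet<&str> = a.iter().copied().collect();
    let b: BTreeSet<&str> = b.iter().copied().collect();
    !a.is_empty() && a == b
}

/// A pair is only worth flagging when the names are similar AND the
/// crates are not owned by exactly the same people.
fn should_flag_pair(names_similar: bool, owners_a: &[&str], owners_b: &[&str]) -> bool {
    names_similar && !same_owners(owners_a, owners_b)
}
```

Requiring an exact match is deliberately strict here; treating any overlap as "same people" would let an attacker who co-owns one unrelated crate slip through less easily, but exact equality is the simpler rule to reason about.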

Sorry, I spoke a little imprecisely; I meant that crates for distinct fields, like ttf vs fft, would likely have different APIs. Many crates will of course be very similar while not being substitutable for each other.

But please also note I said this is just one possible signal. It is certainly not a sufficient condition for "not a typosquat".

They could, but the question is whether it would actually be dangerous or helpful. For an average user, what is going to be the reaction to that warning? Either

I should verify that I really meant to look for this crate, perhaps take the downloads, authors, owners etc into consideration. (1)

OR

Ah, a different crate name in the warning, I should definitely use the crate indicated in the warning. (2)

If (2) is the most likely reaction, then I agree it is dangerous. But if (1) is more likely, I think it might bring malicious crates to the community's attention more often.

Npm.js worked around the squatting problem by adding namespaces. @john/charts is distinct from @harry/charts. I think it's a good mitigation. Personally I would migrate all crate names over and add a redirect mechanism from the old global crate name, but I get that that might be tricky to pull off.

Adding namespaces allows squatting namespaces instead, and also makes it less clear which namespace is the official one for, say, serde.

Isn't this what cargo vet does?

Without namespaces crates that do similar things are likely to have similar names. Developers have two choices:

  1. use minor variations of names to provide information about the topic that the crate deals with, while not colliding with existing crates e.g. tuple, tupl, tuples, typle.
  2. use a more creative name that doesn't directly indicate the crate's purpose, and lose the searchability by name. e.g. nom, winnow, chumsky.

Using namespaces would allow both: a "brand" and a clear "purpose". nom::parser, winnow::parser, chumsky::parser.

You could just name your crate <brand>_<purpose> if you were worried about your crate name being too similar to existing crates.

It sucks, but namespaces have been discussed to death and crates.io still hasn't done it. I don't expect they ever will.

Namespaces only and no namespaces both have good points, but allowing both like npm is kind of the worst of both worlds.

I think cargo could maybe introduce some sort of grouping mechanism for discovery and attribution purposes (eg, these are what serde.rs thinks the "cool" serde formats are, vs these are the RustCrypto officially supported crates), but not as part of the canonical package name.

Please don't discuss namespaces here — this is off-topic. I'm asking here about the registry as it exists today, not what it should have been instead.

There's already an RFC for namespaces, where all of these arguments have already been made many many times.

Crates.io will have per-project namespaces, but the top-level shared namespace will remain open, and names of the namespaces can be squatted or misleading as well.

Dunno if it helps, but PyPI, without having namespaces, rejects uploads of new Python packages with names too similar to existing names. It certainly gets some people angry from time to time ("but why was my package rejected, I checked the name was available when I started it months ago and now it's all over the code!!!"), however it is clearly safer than any measures you can take on the side of the package index browsing site.

The precise rules they use to compute similarity are probably buried somewhere in the code of pypi/warehouse (The Python Package Index) on GitHub (sorry, I don't have the time to search where exactly right now).