TL;DR: Crate listings and search could be better, but I need you to help create test/evaluation data!
Ranking by downloads is meh
- The number is noisy, as it includes downloads by bots and duplicate hits from continuous integration, rebuilding of Docker images, etc.
- It’s heavily biased against applications. Installs outside `cargo install` obviously aren’t counted. Dependencies tend to be redownloaded and updated much more frequently. Even app installs themselves create download hits for their own dependencies, so apps are doomed to lag behind libraries in the rankings.
- It makes deprecated and abandoned crates look too good, since they keep getting hits from their old users (it takes a while to migrate away from a deprecated crate), even if they’re getting zero new users.
- Some crates get a free ride if they’re a dependency (…of a dependency of a dependency…) of some popular crate. There are single-purpose helpers, at v0.1.0 and last updated in 2015, that aren’t generally useful outside their parent crate, yet are among the top crates by download numbers.
- It’s tough for new crates that start from zero.
So I’m developing a better ranking algorithm.
Gaming of the algorithm
Of course, it will be possible to game the algorithm, but I’m not worried about it. It can’t be worse than the current one, which basically has an API for gaming it:
```sh
while true; do curl <your crate's tarball URL>; done
```
(but please don’t do that).
More seriously, I think the “gaming” problem can be divided into two categories:
- People who have a real crate, and want it used, and want a low-effort shortcut to popularity.
- People who don’t have a real crate, but want a free advertising space, SEO link juice, or to troll and deface the website.
In the first category, I expect there will be natural limits to how far authors will game the system. For example, someone could game the rankings by filling all of the crate’s metadata fields with dummy information and generating fake documentation. Even if that boosted rankings, it would confuse and scare away potential users, so there’s no point doing it if the author wants the crate to be used. And if the author “games” the system by writing actual complete metadata and real documentation, that’s not too bad!
I think the second category is solvable with things like filtering and a ban-hammer. Crates that are straight spam will be identifiable by the nature of promoted keywords, links and persistence of the spammers pushing them.
Fortunately, so far I haven’t seen examples of either kind of spamming. The Rust community is nice.
I’ve implemented an alternative ranking algorithm that is a combination of many factors:
- How big and complete is the README, does it have code examples? (according to user survey done for crates-io ranking, this was very important)
- Does the crate have links to repo, docs and homepage? (users want to see the code, issue tracker, documentation, etc.)
- Does it have tests and CI?
- Does it have a 1.0 release or a long release history? (crates at v0.1.0 look dodgy, and most of them are abandoned. If a crate is actually stable, its author should say so by releasing 1.0.0).
- Are there co-owners, co-authors/contributors? (crates with bus factor == 1 are a liability. Having contributors proves the crate has been useful to more than one person).
- How long has it been since the last release? (if a crate looks stable, I allow it to go 2 years without any release, but even stable crates need to update their dependencies from time to time. The expected update frequency is shorter for nightly and young 0.x crates).
- Whether dependencies are outdated or deprecated (sign of a long-abandoned crate)
- How many crates used it in the past and have since stopped? (a sign that users are migrating away from it)
- Number of reverse dependencies
- Number of downloads, filtered to remove some noise and downloads by closely related crates, such as “foo” inflating downloads for its “foo_derive” dependency (this prevents internal helper crates from ranking above their “parent” crates).
All of these factors are flawed in some way and have exceptions. However, I hope that on average they correlate with the usefulness of crates, and that very few crates are unlucky enough to be an exception to more than a few of them.
I plan to add more factors, such as:
- cargo-crev reviews,
- use of deprecated methods, clippy issues (but only serious ones),
- ratio of comments to code, number of undocumented public methods,
- author’s reputation based on other crates (but I’m not entirely sure about that one. I don’t want to create cliques that prevent new authors from getting in),
- whether number of downloads is falling over time,
- maybe compilation time and produced code size? Users care about that, but I’m not sure how to make that comparable across all the crates.
- I’d like to reward portability (e.g. crates that also work on Windows, or cross-compile to ARM), but I need to figure out how to do that without penalizing crates that intentionally provide platform-specific functionality.
I don’t plan to include GitHub stars. They’re GitHub-specific, and I don’t want to penalize anyone for using their own hosting. The number itself also has lots of problems (e.g. some people star to bookmark projects, not to endorse them. And people generally don’t unstar projects they no longer use).
Here’s where you can help. To know whether the ranking is actually good, fine tune it, and monitor its performance over time, I need data. So please submit what crates you think should be highly ranked, and what results you expect for search queries.
Crates matching a search query: https://forms.gle/HfbvBSryNk19exUm7 What should the search find, and what mistakes could it make? For example, a search for “compile c” should find the “cc” crate, not the deprecated “gcc” crate.
Relative ranking of pairs of crates: https://forms.gle/SFntxLhGJB7xzFy19 Which one should be higher in a category listing? You can pit your favorites and underdogs against old and deprecated crates. For example, “chrono” > “time”.
It’s totally fine if these are just your subjective preferences. I need hundreds of data points, so please submit multiple answers! If you need inspiration, browse or search https://crates.rs and spot cases where things seem out of order: irrelevant crates ranked high, good crates shown too low.