Improving ranking and crate search

TL;DR: Crate listings and search could be better, and I need your help creating test/evaluation data!

Ranking by downloads is meh

On various listing pages (apart from search), crates.io and crates.rs sort crates mostly by the raw number of recent downloads. While this is better than nothing, it has some problems:

  • The number is noisy, as it includes downloads by bots and duplicate hits from continuous integration, rebuilding of Docker images, etc.
  • It's heavily biased against applications. Installs outside cargo install obviously aren't counted. Dependencies tend to be redownloaded and updated much more frequently. Even app installs themselves create download hits for their own dependencies, so apps are doomed to lag behind libraries in the rankings.
  • It makes deprecated and abandoned crates look too good, since they keep getting hits from their old users (it takes a while to migrate away from a deprecated crate), even if they're getting zero new users.
  • Some crates get a free ride if they're a dependency (…of a dependency of a dependency…) of some popular crate. There are single-purpose helpers, still at v0.1.0 and last updated in 2015, that aren't generally useful outside their parent crate, yet sit among the top crates by download numbers.
  • It's tough for new crates that start from zero.

So I'm developing a better ranking algorithm.

Gaming of the algorithm

Of course, it will be possible to game the algorithm, but I'm not worried about it. It can't be worse than the current one, which basically has an API for gaming it: while true; do curl <your crate's tarball URL>; done (but please don't do that).

More seriously, I think the "gaming" problem can be divided into two categories:

  1. People who have a real crate, and want it used, and want a low-effort shortcut to popularity.
  2. People who don't have a real crate, but want a free advertising space, SEO link juice, or to troll and deface the website.

In the first category, I expect there will be natural limits to how far authors will game the system. For example, someone could game the rankings by filling in all of the crate's metadata with dummy information and generating fake documentation. Even if that increased the ranking, it would confuse and scare away potential users, so there's no point doing it if the author wants the crate to be used. And if the author "games" the system by writing actually complete metadata and real documentation, that's not too bad!

I think the second category is solvable with things like filtering and a ban-hammer. Crates that are straight spam will be identifiable by the nature of the promoted keywords and links, and by the persistence of the spammers pushing them.

Fortunately, so far, I haven't seen examples of either kind of spamming. The Rust community is nice :slight_smile:

Alternative rankings

I've implemented an alternative ranking algorithm that is a combination of many factors (a rough sketch of how such factors could be combined follows the list):

  • How big and complete is the README, and does it have code examples? (according to a user survey done for crates.io ranking, this was very important)
  • Does the crate have links to repo, docs and homepage? (users want to see the code, issue tracker, documentation, etc.)
  • Does it have tests and CI?
  • Does it have a 1.0 release or a long release history? (crates at v0.1.0 look dodgy, and most of them are abandoned. If a crate is actually stable, its author should say so by releasing 1.0.0).
  • Are there co-owners, co-authors/contributors? (crates with bus factor == 1 are a liability. Having contributors proves the crate has been useful to more than one person).
  • How long has it been since the last release? (if a crate looks stable, I allow it to go 2 years without any release, but even stable crates need to update their dependencies from time to time. The expected update frequency is shorter for nightly and young 0.x crates).
  • Whether dependencies are outdated or deprecated (sign of a long-abandoned crate)
  • How many crates have used it in the past and stopped using it? (e.g. users migrating from the gcc crate to the cc crate).
  • Number of reverse dependencies
  • Number of downloads, but filtered to remove some of the noise and downloads from closely related crates, such as "foo" inflating downloads for its "foo_derive" dependency (this prevents internal crates from showing up before their "parent" crates).
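
For illustration, here's a minimal sketch of how signals like these could be folded into a single score. Every field name, weight, and curve below is invented for the example; these are not the actual values used by crates.rs:

    // Hypothetical inputs; the real site derives more factors than this.
    struct CrateSignals {
        readme_quality: f64,       // 0.0..=1.0, from README size and code examples
        has_repo_and_docs: bool,   // links to repository and documentation
        release_maturity: f64,     // 0.0..=1.0, high for a stable 1.x+ history
        contributors: u32,         // owners + contributors (bus factor)
        months_since_release: f64,
        filtered_downloads: u64,   // downloads with bot/CI noise removed
        reverse_deps: u64,
    }

    fn score(s: &CrateSignals) -> f64 {
        let mut total = 0.0;
        total += 0.20 * s.readme_quality;
        total += if s.has_repo_and_docs { 0.10 } else { 0.0 };
        total += 0.15 * s.release_maturity;
        // A second contributor matters far more than a tenth one.
        total += 0.10 * (1.0 - 1.0 / s.contributors.max(1) as f64);
        // Staleness decays linearly, bottoming out after ~24 months.
        total += 0.15 * (1.0 - (s.months_since_release / 24.0).min(1.0));
        // Log scales keep one runaway popularity factor from dominating.
        total += 0.20 * (((1 + s.filtered_downloads) as f64).ln() / 20.0).min(1.0);
        total += 0.10 * (((1 + s.reverse_deps) as f64).ln() / 10.0).min(1.0);
        total // in 0.0..=1.0, used to sort a listing
    }

    fn main() {
        let sig = CrateSignals {
            readme_quality: 0.8,
            has_repo_and_docs: true,
            release_maturity: 1.0,
            contributors: 3,
            months_since_release: 6.0,
            filtered_downloads: 50_000,
            reverse_deps: 120,
        };
        println!("score: {:.3}", score(&sig));
    }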

All these factors are flawed in some ways and have exceptions. However, I hope that on average they correlate with the usefulness of crates, and that very few crates are unlucky enough to be an exception in more than a few of these factors.

I plan to add more factors, such as:

  • cargo-crev reviews,
  • use of deprecated methods, clippy issues (but only serious ones),
  • ratio of comments to code, number of undocumented public methods,
  • author's reputation based on other crates (but I'm not entirely sure about that one. I don't want to create cliques that prevent new authors from getting in),
  • whether the number of downloads is falling over time,
  • maybe compilation time and produced code size? Users care about these, but I'm not sure how to make them comparable across all crates.
  • I'd like to reward portability (e.g. crates that also work on Windows, or cross-compile to ARM), but I need to figure out how to do that without penalizing crates that are intentionally platform-specific.

I don't plan to include GitHub stars. They're GitHub-specific, and I don't want to penalize anyone for using their own hosting. The number itself also has lots of problems (e.g. some people star to bookmark projects, not to endorse them. And people generally don't unstar projects they no longer use).

Test data

Here's where you can help. To know whether the ranking is actually good, to fine-tune it, and to monitor its performance over time, I need data. So please submit which crates you think should be ranked highly, and what results you expect for search queries.

  • Crates matching a search query: https://forms.gle/HfbvBSryNk19exUm7 What should the search be finding, and what mistakes could it make? For example, a search for "compile c" is supposed to find the "cc" crate, and not the deprecated "gcc" crate.

  • Relative ranking of pairs of crates: https://forms.gle/SFntxLhGJB7xzFy19 Which one should be higher in a category listing? You can list your favorites and underdogs versus old and deprecated ones. For example, "chrono" > "time".

It's totally fine if these are just your subjective preferences. I need hundreds of data points, so please submit multiple answers! If you need inspiration, you can browse or search https://crates.rs and spot cases where things seem out of order: irrelevant crates are ranked high, and good crates are shown too low.

  1. Apologies if this is off topic. I'm not sure where to post this, and your post seemed to be a "nearest neighbor."

  2. I wish crates had a 'trust score', where "1.0" = "this crate is great + useful" and "0.0" = "no one uses it."

  3. We combine trust scores as combine(t1, t2) = t1 + t2 - t1 * t2 (see the sketch after this list). Note that this guarantees, for 0 <= t1, t2 <= 1:
    t1 <= combine(t1, t2)
    t2 <= combine(t1, t2)
    combine(t1, t2) <= 1
    combine is commutative, associative

  4. Each user manually specifies a list of "trusted root crates" (each of which gets assigned score "1.0").

  5. If crateA has trust tA and depends on crateB, then crateB gets a trust contribution of 0.1 * tA.

  6. For each user, "frequent dependencies of crates I use" get high trust scores, while crates that are infrequently used get low trust scores.

  7. Suppose each user published their list of "crates I trust"; then when searching, I can say something like: @kornel, @OptimisticPeach, @RustyYato, @parasyte, @vitalyd, @cuviper all answered a bunch of my questions on the Rust forum -- so go ahead and give weight to crates they trust/use.

  8. Intuitively, this seems to capture this notion of trust:

I trust crates used by crates I use.
I trust crates used by people answering Rust beginner questions.
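
Here's a tiny runnable sketch of points 3-5 (the names are invented; only the combine formula and the 0.1 factor come from the points above):

    /// combine(t1, t2) = t1 + t2 - t1*t2: the chance that at least one of
    /// two independent endorsements holds, so the result never exceeds 1
    /// and never drops below either input.
    fn combine(t1: f64, t2: f64) -> f64 {
        t1 + t2 - t1 * t2
    }

    fn main() {
        let root = 1.0;       // point 4: a user-designated trusted root crate
        let dep = 0.1 * root; // point 5: one dependency edge contributes 0.1 * tA
        let via_two_roots = combine(dep, dep);
        // Point 3's guarantees hold: bounded below by the inputs, above by 1.0.
        assert!(via_two_roots >= dep && via_two_roots <= 1.0);
        println!("{dep} vs {via_two_roots}"); // roughly 0.1 vs 0.19
    }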


I've already done this, discussed in another thread: What makes a trustworthy crate maintainer? - #3 by kornel

Do look at cargo-crev (https://github.com/crev-dev/cargo-crev): a cryptographically verifiable code review system for the cargo (Rust) package manager.

Thanks for the initiative. A better crate suggestion would be awesome. Is this intended for crates.io or only for crates.rs?

I think it's worth being careful not to overdo it with assumptions. Applications most certainly don't need code examples in the readme; some crate might be better known to the masses, but the oddball might actually be the one that gets it right, or innovates, or is really good for some niche problem. In general, I think serious libraries have a good readme, good documentation, unit tests, compile without warnings, etc., so there is definitely room for improvement over the number of downloads. Just beware of a potentially unjust punishing effect, or of rewarding gaming too much (search rank optimization clearly springs to mind).

Also, reviews are great, when they're reviews of the current version. Maybe the author fixed the issues pointed out, and the crate deserves a second chance.

There is one thing I always know beforehand when I search, and that is whether I'm looking for a library or an application. Maybe it would be more appropriate to show them separately, like a tabbed interface... Crates that provide both should obviously show up in both.

And as a last note, it's important that the highest-weighted factor should always be relevance. It's not much good that a crate is awesome if it doesn't do what you need...

@kornel : It looks like you implemented a scoring per person whereas I proposed a method for scoring per crate.

I really like your system of trusting crate co-owners, trusting dependencies, and trusting core groups (Mozilla, Rust).

Is there a way to use the "maintainer trust score" you have developed in a package search? I'd like some way to balance "relevancy" together with "trustworthiness of the crate maintainer".

@dylan.dpc : This appears to be more work, as it requires actual code review -- instead of just looking at the graph of who's working with whom / who's using whose packages.

However, I do agree that having cryptographically signed "I reviewed this code base & vouch for it" messages would be nice.

The code does the same per crate, too, but this is off-topic.

Good idea. I've tweaked that ranking.

Is this algorithm already running on crates.rs? In any case, I just did 2 searches, and I think a couple of things could be improved:

  • showing an exact match first is good I think
  • some fuzzy string matching for typos would be cool too

You know, I didn't even realise applications were indexed. It didn't even occur to me for some reason, and I can't say I've ever seen one when doing a search. It's a weird blind spot, considering that now that I'm aware, I've noticed crates.rs explicitly mentions applications.

I increase the ranking for an exact match, but don't always put one first, because there are some squatted or abandoned crates. For example, I'm super happy that juniper comes out as the first result for a "graphql" search.

Fuzzy matching would be nice indeed. Tantivy doesn't do it itself, so I'll have to… find a crate for it.

I suppose it's good to weed out landgrabbers, but when searching for "slog" I would expect it to show up first. Its score for the other factors can't be that bad... It's just that it's common to search for crates you already know (at least I do it a lot), and having to weed through results in that case would be quite a regression, in my opinion, compared to what crates.io does.


Presumably tantivy will allow you to override the tokenizer to use n-grams? That should get you some level of fuzzy search that helps with typos.
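
For illustration, here's a dependency-free sketch of why n-grams tolerate typos (this shows the general idea only, not tantivy's actual NgramTokenizer API; "serdr" is a made-up misspelling):

    use std::collections::HashSet;

    // Split a string into character trigrams, i.e. the kind of tokens an
    // n-gram tokenizer would put into the index.
    fn trigrams(s: &str) -> HashSet<String> {
        let chars: Vec<char> = s.chars().collect();
        chars.windows(3).map(|w| w.iter().collect()).collect()
    }

    fn main() {
        let indexed = trigrams("serde"); // {"ser", "erd", "rde"}
        let query = trigrams("serdr");   // typo'd query: {"ser", "erd", "rdr"}
        // Two of the three query trigrams still match, so an n-gram index
        // can still surface "serde" despite the typo.
        let shared = indexed.intersection(&query).count();
        println!("shared trigrams: {shared}"); // 2
    }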

I'm not sure if it makes sense to have libraries and applications in the same ranking at all. I don't think you can reasonably compare a library and an application - they're just too different!

I'd say visitors usually search for either libraries or applications, but not for both at the same time. So it would probably be fine to just add an option to the search form and create two separate rankings. If a crate provides both a library and a binary, it should appear in both rankings.


I just wanted to provide some clarification: for searches, crates.io ranks by relevance to the search query by default. In categories and keywords, recent downloads (last 90 days) are indeed the default sort. There are certainly improvements that can be made to the relevance calculation, and some of the factors mentioned in this thread could be worked into a relevance score, but I think it's an entirely separate problem from ranking within a category or keyword, and I'm seeing some conflation of the two issues here.

I also think there's a third category of the "gaming" problem: people who want to make any system unusable for anyone to make the point that they disagree with the current system, or for "teh lulz". These can also be handled with a ban-hammer, but it takes people time to identify and handle these cases. Ask me how I know.

Overall, I applaud your efforts in attempting to tackle this problem and I wish you luck! It's definitely tricky!


Yeah, sorry for the confusion. I had sorting in categories and similar listings in mind. In my implementation the two are more related, since even search results mix query relevance with the other scores.

Definitely let us know how this experiment goes, improving discoverability on crates.io is definitely a goal we have for 2019.

For general search, I've been questioning if we can ever really give results that are better than googling "rust thing_you_want". If the input to the search is "thing I want to do", and what you want is "the crates people are using to do that thing", ultimately the metric we need there is what people clicked on when searching for a given term. We're just never going to beat dedicated search engines for that.

That isn't to say that I think we shouldn't have search or that it's never useful, but I think we may want to focus more on searches that allow us to use additional information we have, such as searching within a category/keyword, filtering to only no_std or wasm-compatible crates, etc. (aka advanced search). There's also the use case of "I know the crate I want to visit, and this is slightly more convenient than having to remember that the URL is /crate/:name", or when folks can't quite remember the exact name of what they were looking for.


now you've jinxed it :wink: Other than that: thanks for making this effort. This could potentially lessen the burden for new people who're not used to Rust's "low-fat-std + crates" approach.

I have to say that I never really use anything besides crates.io's own search to find stuff. I even have a search keyword in Firefox just for crates.io (and also docs & rust-std). I just don't expect any general search engine to be up to date with Rust crates, or to have caught on to what exactly I want (not a game, duh).