I've run clippy on 100K+ crates

Here are the results (a 200MB sqlite db) from running clippy on every crate that I managed to build.

They're from clippy v1.80-v1.83 with almost everything enabled, including the pedantic, suspicious, and nursery lint groups, plus some rustc lints and warnings. I've allowed crates to silence their warnings, and the extra lints were enabled only for crates that did not have a [lints] table in Cargo.toml.
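For context, a [lints] table in Cargo.toml looks something like this (a hypothetical example; the specific lint names and levels are illustrative, not what any particular crate used):

```toml
# Hypothetical [lints] table; crates that already had one kept their
# own configuration and did not get the extra lints.
[lints.rust]
unused_qualifications = "warn"

[lints.clippy]
pedantic = { level = "warn", priority = -1 }
missing_errors_doc = "allow"
```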

The data is broken down by crate, so you can check what clippy thinks about each crate. I've kept only one message per error code per crate. Some codes have extra text appended after a space — these are added by me, extracted from the messages in an attempt to make the codes more specific; e.g. the deprecated code has the deprecated feature's name appended, like deprecated try.

Most common lints I hit:

  1. 59833× unused_imports
  2. 56975× clippy::missing_errors_doc
  3. 53225× unused_qualifications
  4. 48305× missing_docs crate
  5. 43795× elided_lifetimes_in_paths
  6. 43360× clippy::missing_panics_doc
  7. 41601× clippy::semicolon_if_nothing_returned
  8. 38447× missing_docs struct
  9. 37365× missing_docs method
  10. 36692× unused_crate_dependencies

Least common lints in this dataset:

(that's mostly because I didn't enable the restriction group of lints, but some crates did)

  1. clippy::cast_nan_to_int
  2. clippy::if_then_some_else_none
  3. clippy::manual_hash_one
  4. clippy::manual_is_finite
  5. clippy::manual_rotate
  6. clippy::panic
  7. clippy::result_map_unit_fn
  8. clippy::should_panic_without_expect
  9. clippy::unsound_collection_transmute
26 Likes

This is very helpful. I thought I was being an exemplary crate maintainer by running clippy before publishing crates and telling it to treat warnings as errors (and refusing to publish unless it passes without any errors). But apparently -W clippy::all is needed to achieve maximum chest hairs.

I'm guessing unused_imports comes from people writing code with all features enabled while this (I'm assuming) runs with default features. unused_qualifications I don't understand. The rest are all pretty normal.

It just doesn't warn by default, right? So it's super easy to just never notice. It's easy to add a use for an item to a file that already has some qualified usages of it, right? Similarly, a refactor can move code to a place where something was already imported via use.

This is an annoying one to discover that I'm guilty of. Petition to make cargo enable this lint by default. (Not really, I'm sure there's a reason for it, but I feel a little stupid for publishing crates that trigger this).

I suppose it's just weird that it's above all the other ones.

This may be caused by 1.80 adding size_of to the prelude, making std::mem::size_of redundant.

It doesn't seem to trigger on prelude stuff so I don't think so.

If you want to see what Clippy can do, try:

cargo clippy --all --fix -- -Wclippy::all -Wclippy::pedantic -Wclippy::nursery

There's also --broken-code to apply some not-always-perfect lints :slight_smile:

5 Likes

It might be interesting to split the data between crates that appear to be trying to work with clippy (using clippy:: lint names anywhere, or having configured {"rust-analyzer.check.command": "clippy"}, or for that matter having the string cargo clippy anywhere), and those that don't.

We might expect the former to be trying to have zero clippy warnings, and failing due to maintainer error or due to new lints being introduced, whereas the latter will have many more warnings[1] and a different distribution, so the differences might be interesting.


  1. especially because clippy::pedantic has quite a few lints that have the character “do this thing this arbitrary way rather than that one” (e.g. explicit_iter_loop) ↩︎

4 Likes

The elided lifetimes lint makes sense; it's one I try to make sure I enable.

I should absolutely get better around enabling doc lints, it's such a low bar to better interface design.

2 Likes

Dang, that made some work for me.

2 Likes

It's quite interesting!

Have I done something wrong, or are crates missing in the crates table? I was trying to partition the messages based on crates with a version that reached 1.x or not, then I realized there was a big mismatch between my join and clippy_results alone. For example, the first crates with code = unused_imports have id = 109, 110, 214, which are all missing from the crates table.

I suppose that including all the crates would have made the file too big, though; maybe that's the reason.

On a side note, I have a crate with "missing_errors_doc" because the docs are only on the trait methods' declarations, not on all of their implementations (the methods operate on integers, and the implementations are generated for each integer type by a declarative macro, which I don't like but is unavoidable). I don't know if that's considered a problem.

Ah, that's a bug. I used INSERT OR REPLACE to add crates, which assigns a new id each time, so checking the same crate twice left the previous results orphaned. The latest results are okay, though; you can just delete the orphaned rows.

1 Like

Sorry to be a bother. The clippy_results table and the most common hits are the most valuable information anyway.

unused_crate_dependencies appears to have false positives. I haven't dug into whether this is known yet, nor tried to make a minimal reproducer.

EDIT: Seems to have several reported bugs about false positives. I think I may have hit yet another case though.

EDIT: Nope, it was a variant of an already reported issue on closer inspection.

Out of curiosity roughly how much disk space, cpu, and network resources does this consume?

I'm wondering if I did a similar analysis across all crates how much it might end up costing.

The Rust project regularly uses Crater/Rustwide to run checks across ~all publicly reachable Rust code. According to the docs, that runs on an AWS c5.2xlarge machine with 2Tb storage[1], and the cargobomb machine used for beta regression test runs has 30GB RAM.

I don't recall where to check how long a Crater run generally takes, but the answer is a good long while.

If you do happen to do this, make sure to do so on a sandboxed machine. Even if you only run clippy, you're running untrusted code during the build process.


  1. I'm certain they mean 2TB, since 2Tb is only 250GB. (Byte vs bit) ↩︎

1 Like

It took me 4 months, but I wasn't running it full time, only as an ad-hoc background job. Assuming 30s per crate, that's 50 CPU-days to try them all.

It doesn't need anything special. Building Rust on a server is the same as building on your own machine locally. I've been using Hetzner's ARM machine, giving builds 8 cores, 16GB RAM, and ~300GB disk space. Disk space needs to be monitored and purged regularly. Rust/Cargo will eat all the disk space you give it.

I use some tricks that help a lot:

  1. I'm using a git index and unstable no-index-update for instant dependency resolution.

  2. Disabling debug and incremental builds (they massively balloon disk space and I/O)

  3. Grouping crates by similarities of their dependencies, and checking them in batches as workspace members (of about 20). Sometimes dependencies conflict and batches need to be reshuffled.

  4. Creating a giant Cargo.lock with everything that has built successfully so far. This helps Cargo skip a lot of deps as Fresh, and avoids big rebuilds due to tiny version variations.

With the last two I don't have to recompile syn a million times.
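A sketch of what tricks 2 and 3 might look like in a batch workspace's Cargo.toml (the member paths are hypothetical):

```toml
# Hypothetical batch workspace: ~20 crates with similar dependency
# trees checked together (trick 3), with debug info and incremental
# compilation off to keep disk usage and I/O down (trick 2).
[workspace]
members = ["batch/crate-a", "batch/crate-b"] # ... about 20 per batch
resolver = "2"

[profile.dev]
debug = false
incremental = false
```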

8 Likes

Thanks for these tips. They're really interesting and helpful to me, because I happen to write os-checker, a dedicated tool[1] for running checkers like clippy on a bunch of OS-related Rust codebases[2].

I ran into a storage problem a few days ago when the GitHub Actions disk limit was hit, and I had to run the checks in batches. But my strategy is naive: I just split them into fixed-size chunks.

However, the biggest problem for os-checker is learning which target triples a codebase should be checked on. This means a checker run on a repo will emit multiple results depending on the compilation conditions/flags. With the wrong compilation conditions, we might see thousands of errors due to a missing stdlib and get large JSON outputs, like this.

// results on wrong targets; these targets are detected in the repo's scripts though
database $ ll ui/repos/kern-crates/rcore-tutorial-v3-with-hal-component/
total 84M
-rw-r--r-- 1 root root 8.3M Sep  7 11:14 aarch64-unknown-none.json
-rw-r--r-- 1 root root  42M Sep  7 11:14 All-Targets.json
-rw-r--r-- 1 root root  621 Sep  7 11:14 basic.json
-rw-r--r-- 1 root root 8.2M Sep  7 11:14 loongarch64-unknown-none.json
-rw-r--r-- 1 root root  18M Sep  7 11:14 riscv64gc-unknown-none-elf.json
-rw-r--r-- 1 root root 1.3K Sep  7 11:14 x86_64-unknown-linux-gnu.json
-rw-r--r-- 1 root root 8.2M Sep  7 11:14 x86_64-unknown-none.json

// fixed by forcing checking on x86_64-unknown-linux-gnu
database $ ll ui/repos/kern-crates/rcore-tutorial-v3-with-hal-component/
total 108K
-rw-r--r-- 1 root root 52K Sep  7 15:42 All-Targets.json
-rw-r--r-- 1 root root 421 Sep  7 15:42 basic.json
-rw-r--r-- 1 root root 52K Sep  7 15:42 x86_64-unknown-linux-gnu.json

  1. os-checker is definitely not specific to OS crates; it can be used on any Rust repo. But currently I write it mainly for checking OS crates. ↩︎

  2. most of them are OS lib components / reusable modules. ↩︎

3 Likes