Keywords for crate search — help with synonyms and canonical keywords

Keywords used in crates.io crates, like any folksonomy, are a bit messy and inconsistent. Please help me clean them up!

I'm collecting keyword synonyms/aliases that can be used to normalize spellings of crate keywords, which will improve search, keyword pages, automatic categorization and finding of similar crates.

I've started from StackOverflow's list of tag aliases, but SO has its own peculiar naming schemes and a bias towards assuming that abbreviations are JS/.Net/IBM product names, so the data has some questionable aliases that don't make sense for Rust.

Can you help me review the list? Delete confusing aliases. Add new terms and spelling corrections specific to Rust. Swap to/from columns if the other spelling is more common (although for technical reasons I slightly prefer tags with words separated by hyphens, e.g. file-type rather than filetype).

You can just edit the spreadsheet here:

I'm going to review changes anyway, so don't worry about messing it up. If you want to be extra careful, you can make a copy and send me a link to it. If you prefer working with real files, you can edit:

and make a PR on GitLab.


The aliases are searched exhaustively, so it's OK to have multi-step replacements:

foo,bar,5
bar,baz,5

will replace foo -> baz.
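
For anyone who wants to check a chain locally, here is a minimal sketch of how multi-step replacement could work. It assumes the `from,to,score` row format shown above; `parse_aliases` and `canonicalize` are hypothetical names made up for this illustration, not the code the site actually runs. The cycle guard is my own addition, in case the list ever ends up with something like `a -> b -> a`.

```rust
use std::collections::{HashMap, HashSet};

/// Parse `from,to,score` rows into a lookup map (the score column is ignored here).
/// Later rows overwrite earlier ones with the same `from` key.
fn parse_aliases(csv: &str) -> HashMap<String, String> {
    let mut map = HashMap::new();
    for line in csv.lines() {
        let mut cols = line.split(',').map(str::trim);
        if let (Some(from), Some(to)) = (cols.next(), cols.next()) {
            if !from.is_empty() && !to.is_empty() {
                map.insert(from.to_string(), to.to_string());
            }
        }
    }
    map
}

/// Follow aliases until no further replacement applies.
/// Stops if a keyword is visited twice, so a cycle cannot loop forever.
fn canonicalize<'a>(aliases: &'a HashMap<String, String>, keyword: &'a str) -> &'a str {
    let mut current = keyword;
    let mut seen = HashSet::new();
    while seen.insert(current) {
        match aliases.get(current) {
            Some(next) => current = next.as_str(),
            None => break,
        }
    }
    current
}
```

With the two rows above loaded, `canonicalize(&aliases, "foo")` follows `foo -> bar -> baz` and stops there, while a keyword that has no alias is returned unchanged.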


Thanks!


Wow, how much work have you put into this? I've been scrolling around randomly and am not spotting a lot to fix. It's already in really good shape. I expected the data to be a lot rawer than it is. I'm impressed.


What does "hidden" mean? Why are these 0?

FIND    REPLACE   0 = hidden, 1-5 similarity (5 = same thing)
32bit   32-bit    0
32bit   32-bit    4
64bit   64-bit    0
7bit    7-bit     0
7z      7zip      0

It might be helpful to switch the format so that we can more easily see what the canonical tags are, e.g.:

Canonical   Synonyms
arrays      array, byte-array, char-array, character-arrays, javascript-array, jsonarray, mongodb-arrays, static-array, string-array, sub-arrays, swift-array
fun         humor, humour
html        div, div-layouts, divs, html-attributes, html-comments, html-syntax, html-tag, nested-divs, span, webpage
sorting     array-sorting, date-sorting, sort, sorted, sorting-algorithm
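
A view like this could probably be generated from the existing FIND/REPLACE rows instead of being maintained by hand. A rough sketch, assuming the rows are already parsed into `(find, replace)` pairs; `group_by_canonical` is a hypothetical helper name, not something from the project:

```rust
use std::collections::BTreeMap;

/// Group FIND keywords under their REPLACE (canonical) keyword.
/// A BTreeMap keeps the canonical tags in alphabetical order.
fn group_by_canonical<'a>(pairs: &[(&'a str, &'a str)]) -> BTreeMap<&'a str, Vec<&'a str>> {
    let mut groups: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for &(find, replace) in pairs {
        groups.entry(replace).or_default().push(find);
    }
    // Sort each synonym list and drop repeats coming from duplicated rows.
    for synonyms in groups.values_mut() {
        synonyms.sort_unstable();
        synonyms.dedup();
    }
    groups
}
```

Each entry can then be printed as one line, e.g. with `format!("{} {}", canonical, synonyms.join(", "))`.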

What do you think about tag splitting?

FIND                          REPLACE
algorithmsanddatastructures   algorithms, data-structures
android-performance           android, performance
html-dom                      html, dom
javascript-array              javascript, arrays

Hidden means the keyword is not displayed on the website, but it is used for finding similar crates.

The numeric column is taken from StackOverflow votes, so there isn't a deep meaning to it. Some are 0 only because nobody bothered to vote for it yet.

There are duplicated rows. Currently the last row takes precedence, but the dupes should be deleted.
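
If it helps with the cleanup, here is a small sketch for spotting the rows that are currently overridden. It assumes the comma-separated export with the FIND keyword in the first column; `shadowed_rows` is a hypothetical name for this illustration.

```rust
use std::collections::HashMap;

/// Return the 1-based line numbers of rows whose FIND keyword appears again
/// on a later line, i.e. rows that a later duplicate overrides.
fn shadowed_rows(csv: &str) -> Vec<usize> {
    let mut last_seen: HashMap<&str, usize> = HashMap::new();
    let mut shadowed = Vec::new();
    for (idx, line) in csv.lines().enumerate() {
        let key = line.split(',').next().unwrap_or("").trim();
        if key.is_empty() {
            continue;
        }
        // `insert` returns the previous line number if the key was seen before.
        if let Some(previous) = last_seen.insert(key, idx + 1) {
            shadowed.push(previous);
        }
    }
    shadowed.sort_unstable();
    shadowed
}
```

The returned line numbers are the candidates for deletion, since the last occurrence of each keyword is the one that wins.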

I'd prefer the opposite of tag splitting. Longer tags are more precise. I can always split them trivially with str.split('-').
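
As a sketch of what I mean (`split_tag` is just an illustrative name):

```rust
/// Split a hyphenated tag into its component words, skipping empty parts
/// that would come from doubled or trailing hyphens.
fn split_tag(tag: &str) -> Vec<&str> {
    tag.split('-').filter(|part| !part.is_empty()).collect()
}
```

Note that this only works when the words are already separated by hyphens: `javascript-array` splits cleanly, but `algorithmsanddatastructures` would stay in one piece, which is another reason to prefer the hyphenated spelling.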