(Solved) Is it possible to make this sample name anagrams program smaller?

I came across an old blog post about generating name anagrams from a file: Naming twins in Python & Perl - brad's life — LiveJournal

Since I'm familiar with Python, here is a simple Python version of this that shows what I'm trying to do:

from collections import defaultdict

anagrams = defaultdict(list)

for line in open('dist.male.first'):
    name = line.split()[0]
    sorted_name = "".join(sorted(name))
    anagrams[sorted_name].append(name)

for sorted_name, names in anagrams.items():
    if len(names) > 1:
        print(anagrams[sorted_name])

This takes about 100 ms to run on my machine. You can download the file being opened at: https://www2.census.gov/topics/genealogy/1990surnames/dist.male.first .

I thought it would be a good way to practice working with strings and hashmaps in Rust, so here's what I came up with:

use std::fs::File;
use std::collections::HashMap;
use std::iter::{Iterator, FromIterator};
use std::io::{BufRead, BufReader, Result};

fn main() -> Result<()> {
    let file = File::open("dist.male.first")?;
    let mut anagrams: HashMap<String, Vec<String>> = HashMap::new();

    for line in BufReader::new(file).lines() {
        let line = line.unwrap();
        let words: Vec<&str> = line.split(" ").collect();
        let name = String::from(words[0]);
        let mut chars: Vec<char> = name.chars().collect();
        chars.sort();
        let sorted_name = String::from_iter(chars);
        anagrams.entry(sorted_name).or_insert_with(Vec::new).push(name);
    }
    for names in anagrams.values() {
        if names.len() > 1 {
            println!("{:?}", names);
        }
    }
    Ok(())
}

As you can see, the Rust version is longer, but on my machine, it runs noticeably faster: about 10 ms or so (I didn't do rigorous timing on either, I just ran it with time several times) when run in release mode. Python's ~50 ms startup time is actually noticeable after the more or less instant run of the Rust version, so I think the additional effort is worth it.

My question is: is there a way to make simple data munging tasks like this more concise in Rust?

Thank you for taking the time to read this. I hope you're doing OK.

There are a few things that you can do in order to improve the current performances:

  • Don't collect words. You can just do something like
let name = line.split(' ').next().unwrap().to_string();
  • You can lock the io::stdout() and use the writeln! macro in order to save time in the output

I tried to use criterion to benchmark your code, and with the suggestions I made I went from 1.26 ms to 744 us.

There are some other things you could do, but there are a bit harder. The first thing is you are performing an allocation to get the chars of the name, then another allocation to collect the sorted vec into a string. Depending on your real use case, you could just store the Vec and avoid the allocation. Or you could create a temporary char vec outside the loop and reuse it in order do dramatically decrease the number of allocations. Last but not least, if you expect to have ASCII strings, you could use bstr and perform the sort directly on the buffer (you will also save a lot of memory).

EDIT: now that I re-read your question, I realized that you would like less lines of code. IMHO, this can be extremely relative -- do you want less code in your main or less code in general? For the former, you can theoretically have a crate that performs all the work for you, and you will need just a few LOC. For the latter... How many LOCs are Vec and HashMap? Even in Python we have a nice and small syntax, but there is a lot of code behind the scenes... And after all, it is just a matter of two or three lines of code more in Rust, not a big amount.

1 Like

Well, here's where I ended up, using some more iterator things, more type inference, and more specific methods:

use std::{collections::HashMap, fs::File, io::{BufRead, BufReader, Result}};
use itertools::Itertools;

fn main() -> Result<()> {
    let file = File::open("dist.male.first")?;
    let mut anagrams: HashMap<String, Vec<String>> = HashMap::new();

    for line in BufReader::new(file).lines() {
        let name = line?.split_whitespace().next().unwrap().to_owned();
        let sorted_name = name.chars().sorted();
        anagrams.entry(sorted_name.collect()).or_default().push(name);
    }
    for names in anagrams.values().filter(|x| x.len() > 1) {
        println!("{:?}", names);
    }
    Ok(())
}

(tested only that it compiles, not that it still works)

Or, if you like pipes, you could rearrange it to something like this:

use std::{fs::File, io::{BufRead, BufReader}};
use itertools::Itertools;

fn main() {
    let file = File::open("dist.male.first").unwrap();
    BufReader::new(file)
        .lines()
        .filter_map(|x| Some(x.ok()?.split_whitespace().next()?.to_owned()))
        .map(|x| (x.chars().sorted().collect::<String>(), x))
        .into_group_map()
        .values()
        .filter(|x| x.len() > 1)
        .for_each(|names| println!("{:?}", names));
}

(That might be overkill on the chaining, though.)

4 Likes

Thanks for the responses. I didn't now the itertools crate existed.

I wonder how I go about learning these library functions. One mistake I made was assuming that VS Code would complete what was available. That is not the case. I just have to read through API docs.

Suppose I wanted to sort the output. By myself, I can't find anything better than:

    let mut interesting_names = anagrams
        .values()
        .filter(|x| x.len() > 1)
        .map(|a| a.to_owned())
        .into_iter()
        .collect::<Vec<Vec<String>>>();
    interesting_names.sort_by_key(|v| v[0].to_owned());

It's not clear to me when I want to call to_owned() and when I want to call clone(). Lots of things like that.

I guess it's a matter of reading the fabulous manual and spending time with the language.

If you need to understand how to use a specific crate, nothing is better than docs.rs.
About searching for a feature that could be implemented by an unknown crate, it is mostly a matter of experience and things like this. You can always try to search inside crates.io or lib.rs for specific keywords, but it takes a bit more time.

You can use sort_by instead of sort_by_key and avoid the allocation:

interesting_names.sort_by(|a, b| a.get(0).cmp(&b.get(0)));

(This works because Option<T> implements Ord if T: Ord)

If you come from a high language like Python, it is pretty normal: generally you don't have to think about an owned object and the references to it. When you use to_owned() is because you are handling some sort of borrowing object (like a slice or a str) and you need to get something you can own ( like a Vec or a String). On the other hand, when you clone, you are just copying (well, cloning :smile:) the object. If you have got a &T, you will get a new T, but if you have got a slice, your clone will be just another slice (which still refers to initial owned data).

Yup! The Rust book is an awesome lecture. If you really understand what is in there, you will have a clearer idea of the concepts of ownership, borrowing, lifetime and so on. :blush: It takes a bit of time but it is really worth!

1 Like

Minor note: I am not a big fan of shadowing prelude items such as Result.

Thus, instead of

use ::std::{
    fs::File,
    io::{
        BufRead,
        BufReader,
        Result,
    },
};

fn main() -> Result<()>

I would use

use ::std::{
    fs::File,
    io::{
        self, // use ::std::io;
        BufRead,
        BufReader,
    },
};

fn main() -> io::Result<()>
1 Like

For iterators, take a look at https://danielkeep.github.io/itercheat_baked.html. (People seem to either like it or hate it, so who knows.)

But yes, it's knowing things like "you can just use .nth(i) instead of .collect::<Vec<_>>()[i] if you only need one element" on iterators. clippy can help you with some of these things.

The way I think about that one:

  • If I own a T and want another T, I use .clone()
  • If I have something (immutably) borrowed (&T) and I want to own it or change it, I use .to_owned()

(You can often use .clone() for the latter as well, but I prefer seeing the difference by the function name.)

There's some additional string-specific discussion in

2 Likes

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.