While working on the nucleotide count exercise in Exercism, I noticed that there were two types of solutions.

In Option 1, the counting logic lives in the `count` function, and the `nucleotide_counts` function calls `count` four times, once for each nucleotide.
```rust
use std::collections::HashMap;

pub fn count(to_find: char, strand: &str) -> usize {
    strand.chars().filter(|&c| c == to_find).count()
}

pub fn nucleotide_counts(strand: &str) -> HashMap<char, usize> {
    let mut h = HashMap::new();
    for acid in "ATCG".chars() {
        h.insert(acid, count(acid, strand));
    }
    h
}
```
In Option 2, the counting logic lives in `nucleotide_counts`, and `count` uses `nucleotide_counts` to find the count for a single nucleotide. The argument for this approach is that `nucleotide_counts` only has to traverse the entire strand once, so it should be faster.
```rust
use std::collections::HashMap;
use std::collections::hash_map::Entry;

pub fn count(to_find: char, strand: &str) -> usize {
    if let Some(x) = nucleotide_counts(strand).get(&to_find) {
        *x
    } else {
        0
    }
}

pub fn nucleotide_counts(strand: &str) -> HashMap<char, usize> {
    let mut h: HashMap<char, usize> = "ACGT".chars().map(|c| (c, 0)).collect();
    for acid in strand.chars() {
        match h.entry(acid) {
            Entry::Occupied(entry) => {
                *entry.into_mut() += 1;
            }
            Entry::Vacant(_) => {}
        }
    }
    h
}
```
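(As an aside, the `Occupied`/`Vacant` match above can be written more compactly with `get_mut`, which should behave identically for this map: since only the four nucleotides are pre-seeded, any other character is still ignored. This is just a sketch of an equivalent formulation, not part of either original solution.)

```rust
use std::collections::HashMap;

// Behaviorally equivalent single-pass count: `get_mut` only succeeds for
// keys pre-seeded from "ACGT", so other characters are silently skipped,
// matching the Entry::Vacant(_) => {} arm above.
pub fn nucleotide_counts(strand: &str) -> HashMap<char, usize> {
    let mut h: HashMap<char, usize> = "ACGT".chars().map(|c| (c, 0)).collect();
    for acid in strand.chars() {
        if let Some(n) = h.get_mut(&acid) {
            *n += 1;
        }
    }
    h
}
```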
But none of the tests really exercise performance; the longest strand provided is 70 characters. So I changed my test suite to test `nucleotide_counts` with a 10-million-character strand, and then ran the tests against both implementations with `time cargo test` a few times.
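The large-strand run looked roughly like the following standalone sketch (the strand construction and timing harness here are my own; the original used the test suite, and either implementation of `nucleotide_counts` can be dropped in):

```rust
use std::collections::HashMap;
use std::time::Instant;

// Stand-in implementation; swap in Option 1 or Option 2 to compare.
pub fn nucleotide_counts(strand: &str) -> HashMap<char, usize> {
    let mut h: HashMap<char, usize> = "ACGT".chars().map(|c| (c, 0)).collect();
    for acid in strand.chars() {
        if let Some(n) = h.get_mut(&acid) {
            *n += 1;
        }
    }
    h
}

fn main() {
    // Build a 10-million-character strand by cycling a short pattern.
    let strand: String = "ACGT".chars().cycle().take(10_000_000).collect();

    let start = Instant::now();
    let counts = nucleotide_counts(&strand);
    println!("{:?} in {:?}", counts, start.elapsed());
}
```

(For the timings below, `cargo test`/`cargo run` without `--release` builds unoptimized code, which is worth keeping in mind when comparing absolute numbers.)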
Option 1, which should traverse the string four times, takes about 3.1-3.3 seconds. Option 2, which should traverse the string only once, takes 7.4-8.0 seconds.

Any ideas why Option 2, which seems like it should be more performant, is actually slower?