How far to take iterators to avoid `for` loops

I'm still very new to Rust, but I've read through the vast majority of the book. I just started working on a bioinformatics crate to leverage toward my day job. I'm working on a simple biological sequence (i.e. DNA, etc.) parser, one function of which changes a DNA sequence into its complementary sequence (e.g. A -> T). Based on my prior Python knowledge, my first thought was a set of nested for loops, and they do in fact do the trick...

Note self.sequence is just a String such as "ATCGGGCAT"

    fn complement(&self) -> String {
        let pairs = vec![['A', 'T'], ['T', 'A'], ['C', 'G'], ['G', 'C'], ['N', 'N']];
        let mut out = String::new();
        for base in self.sequence.to_uppercase().chars() {
            for pair in pairs.iter() {
                if base == pair[0] {
                    out.push(pair[1]);
                }
            }
        }
        out
    }

However, the book and the community are always emphasizing the awesome power of Rust iterators, and I want to get on board! So... I wrote a chain of nested iterators that also does the job...

fn complement(&self) -> String {
    let pairs = vec![['A', 'T'], ['T', 'A'], ['C', 'G'], ['G', 'C'], ['N', 'N']];
    let out = self
        .sequence
        .to_uppercase()
        .chars()
        .map(|base| {
            pairs
                .iter()
                .filter(|[l, _c]| l == &base)
                .map(|[_l, c]| c)
                .collect::<String>()
        })
        .collect::<String>();
    out
}

But when I look at this, it looks dangerously close to violating K.I.S.S. (i.e. Keep It Simple, Stupid), and my central question is... Are the iterators worth it in this case? That is, is there a performance increase? Or really, is it just better Rust? On a related note, I might be completely overdoing the iterator method calls, and maybe there's a far more concise way to do this with iterators. Thanks for your help and insight.

Well, your two cases don't do the same thing, I think

let out = self.sequence.to_uppercase()

should be

let mut out = String::new();

and

sequence
        .to_uppercase()
        .chars()
        .map(|base| { ... })
        .collect::<String>()

should be

sequence
        .to_uppercase()
        .chars()
        .flat_map(|base| { ... })
        .collect::<String>()

To answer your question: it depends. For simple loops it may be clearer to just use a for loop, but as things get more complex, it becomes clearer to break the work into iterators. The iterator combinators make all of the transformations explicit and easier to follow than a spaghetti mess of a for loop.

@RustyYato Thanks for the quick answer. And thanks for noticing that - it was a transcription error. I had commented out the for loops to work on the iterator version and missed bringing some of the code back in line with its current state. I think I've fixed it; they should do the same thing now.

I really like the idea of doing this with iterators - it's just that, comparing the two, one looks far, far more verbose. But I agree with you that with the iterator syntax it's easier to see what is happening at each step, whereas the for loop has to rely on unfortunate vector indices which constantly require thinking back to the vector structure, albeit a simple one in this case. Do you know if iterators generally scale well compared to for loops performance-wise? I haven't benchmarked any of this yet.

Thanks again for your response.

Generally iterators are more optimizer-friendly than hand-rolled loops. For example, iterators can allow LLVM to elide bounds checks when it can prove that they are unnecessary; this is harder to do in hand-rolled loops. Also, if you are using fancier iterators, like iter_a.chain(iter_b), using Iterator::for_each can be more performant because it can just run through all of iter_a and then all of iter_b. Using a for loop here would mean that Chain would need to check whether it is still on iter_a on every iteration, and LLVM may not be able to optimize that out.
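
To make that concrete, here is a minimal sketch of the two shapes (the data is just a placeholder, not from any real crate):

fn main() {
    // Placeholder data, only to make the sketch self-contained.
    let a = vec![1u32, 2, 3];
    let b = vec![4u32, 5, 6];

    // Internal iteration: for_each lets the Chain run through all of a's
    // iterator first and then all of b's, without re-checking per item.
    let mut sum = 0;
    a.iter().chain(b.iter()).for_each(|x| sum += *x);

    // External iteration: every next() call on the Chain has to check
    // whether the first iterator is already exhausted.
    let mut sum2 = 0;
    for x in a.iter().chain(b.iter()) {
        sum2 += *x;
    }

    assert_eq!(sum, sum2);
}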

Awesome, thanks for the outline on performance - it helps my perception of what's going on under the hood a bit.

I'll look into those other fancier iterators as well to see how I can start to use them!

Thanks!

Just a tidbit that many people miss: on the side of the docs you'll see a [src] link, which takes you to the source code of whatever you are looking at. I have found this very informative and helpful for seeing how things are done.

Thanks for the tip! I have used this quite a bit to take a peek into the inner working of extern crates and core rust - alas I'm still working up the literacy to really get a feel for the big picture of what's going on. But it's a process!

The approach of building up an inner String is a bit wasteful, since there can only ever be one output char for each incoming char. So you'd rather write

fn complement(sequence: &str) -> String {
    let pairs = vec![['A', 'T'], ['T', 'A'], ['C', 'G'], ['G', 'C'], ['N', 'N']];
    sequence
        .to_uppercase()
        .chars()
        .filter_map(|base| {
            pairs
                .iter()
                .find(|[l, _c]| l == &base)
                .map(|[_l, c]| c)
        })
        .collect()
}

The next question is, of course, whether you really need Strings to represent the bases. A small and easy change would be to go to Vec<u8>, which lets you keep incoming "junk" characters but gets rid of the UTF-8 de/encoding overhead when you do chars() and collect().

Working on bytes, you could also get rid of the linear search by creating a 256-entry array like
[None, ..., None, Some('T'), ...] (where the T is at index 65 etc.) and doing a lookup there. But the list for the linear search is so small that it probably doesn't matter...
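
For what it's worth, a minimal sketch of that lookup-table idea on bytes (the function names here are made up for illustration):

// Build the 256-entry table; each slot holds the complement byte, if any.
fn complement_table() -> [Option<u8>; 256] {
    let mut table = [None; 256];
    for &(from, to) in &[(b'A', b'T'), (b'T', b'A'), (b'C', b'G'), (b'G', b'C'), (b'N', b'N')] {
        table[from as usize] = Some(to);
    }
    table
}

fn complement_bytes(sequence: &[u8]) -> Vec<u8> {
    let table = complement_table();
    sequence
        .iter()
        // Uppercase each byte, look it up, and drop anything with no entry.
        .filter_map(|&b| table[b.to_ascii_uppercase() as usize])
        .collect()
}

fn main() {
    assert_eq!(complement_bytes(b"atcgggcat"), b"TAGCCCGTA".to_vec());
}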

@birkenfeld Thank you for your advice. I've changed my function to reflect your code chunk, and the iterator workflow looks and feels more direct and a lot less clunky. Also, thanks for the introduction to filter_map in a personal use case. After seeing it in action and reading the docs again, I can see its utility much more clearly.

You're completely right regarding the Vec<u8> vs String argument. Coming from higher-level languages where strings are artificially made more approachable, they've always been my go-to for handling sequence data. However, the downsides you note, plus the type juggling I've had to do in my crate so far, make it clear to me now that I have some worthwhile refactoring to do. I'm not as comfortable dealing with bytes (yet), but the whole reason I wanted to implement these functions in Rust (and learn Rust at all) was to get closer to the system and decrease execution times, so thanks for the ideas.

You're probably right that the linear search is fine in this small-scale case, but I really like the 256-entry array approach - I had never thought of this before. Would this array be best generated using a declarative macro?

Thanks again.

If you want to go a more Rust-y way, you could consider using an enum for the Base and some conversion impls. I've written something for you (example).
I think this should even be faster once you've converted from u8/str to Base, as complementary() is guaranteed to succeed, which should be better for a CPU's pipeline (probably).
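
The gist is roughly along these lines (a sketch, not the exact playground code):

use std::convert::TryFrom;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Base {
    A,
    T,
    C,
    G,
    N,
}

impl Base {
    // Total function: every Base has a complement, so no Option/Result is needed.
    fn complementary(self) -> Base {
        match self {
            Base::A => Base::T,
            Base::T => Base::A,
            Base::C => Base::G,
            Base::G => Base::C,
            Base::N => Base::N,
        }
    }
}

impl TryFrom<u8> for Base {
    type Error = u8;

    // Fallible conversion: an arbitrary byte may not be a valid base.
    fn try_from(byte: u8) -> Result<Self, Self::Error> {
        match byte.to_ascii_uppercase() {
            b'A' => Ok(Base::A),
            b'T' => Ok(Base::T),
            b'C' => Ok(Base::C),
            b'G' => Ok(Base::G),
            b'N' => Ok(Base::N),
            other => Err(other),
        }
    }
}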

This is great. I'd also change the original signature from

fn complement(&self) -> String;

into

fn complement(&self) -> impl Iterator<Item = Base>;

That might, depending on what OP wants to do with the result, save a large allocation.
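
On a type that stores its bases in a Vec, that could look roughly like this (the Sequence struct here is hypothetical, and Base is the enum sketched above):

struct Sequence {
    bases: Vec<Base>,
}

impl Sequence {
    // The `+ '_` ties the returned iterator's lifetime to the borrow of self.
    fn complement(&self) -> impl Iterator<Item = Base> + '_ {
        self.bases.iter().copied().map(Base::complementary)
    }
}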

@farnbams Thanks for your take and for writing an example up for me. I like your approach and it does feel a whole lot more rusty, which is great. I've been defining types to hold sequence data in String, and am now moving toward Vec<u8>, following input from @birkenfeld.

I think this enum will make a nice addition, especially when it comes to converting bases to other bases and base-associated types to other base-associated types. There are a lot more sequence conversion functions I need to write, and this type of architecture looks like ideal, idiomatic Rust, which is how I'm continuing to try to think.

PS I've also never implemented TryFrom or Into for a type before, so thank you for demystifying that for me!

@ThomasdenH I've seen this syntax before, but am still uncomfortable with it. Just to clarify: it's a kind of generic indicating that fn complement(&self) will return something that implements Iterator (i.e. some kind of iterator) which iterates over elements of type Base. Do I have that right?

Edit: So String would no longer be a valid return type? (which is fine)

Yes, exactly. In other languages you would be able to simply specify an interface here, but in Rust you have to specify how you want to return it:

  • impl means you will return the same type every time, but the exact type is hidden from the caller.
  • dyn means you can return different types implementing the same trait, but this adds some overhead. Since it can be anything, the size isn't known, so you'd have to use Box<dyn T> or use a reference.

You could convert a sequence back to a String by implementing Display on Base.
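
A minimal sketch of such a Display impl (again assuming the Base enum from above):

use std::fmt;

impl fmt::Display for Base {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Map each variant back to its text representation.
        let c = match self {
            Base::A => 'A',
            Base::T => 'T',
            Base::C => 'C',
            Base::G => 'G',
            Base::N => 'N',
        };
        write!(f, "{}", c)
    }
}

With that in place, collecting e.g. iter.map(|b| b.to_string()) into a String gets you back to text.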

EDIT: The function could look like this:

fn complement(sequence: impl Iterator<Item = Base>) -> impl Iterator<Item = Base> {
    sequence.map(|base| base.complementary())
}

@ThomasdenH Perfect, that makes sense - thanks for the clarification. In this last example of yours, my sequence contained in a Vec would first have to be converted to an iterator over the corresponding Base enum variants - which is exactly what @farnbams' first std::iter::Iterator::map() call does in the example playground.

This one...

// <snip>
.map(|byte| Base::try_from(*byte).expect("Should in this example not happen"))
// <snip>

That's right. Although maybe the code is cleaner if it is put inline, since it is only one line anyway.

I just wanted to say thanks again to everyone for their insight into my question and advice on code. In case anyone's interested, the byte search iterator suggested by @birkenfeld outpaces the Base enum approach suggested by @farnbams by 5 milliseconds (6 ms vs 11 ms, respectively) when tasked with performing the complement method on an iterator over 938 sequences and a total of 26.94 million bases - both very quick. Thanks for all of your help, I'm keeping them both for now.

One last question I had, tangentially related to this (but let me know if I should start a new topic), is that in refactoring sequences to be contained in Vec<u8> instead of String, I noticed that I no longer have to Box my sequence types. This doesn't make total sense to me, since I'm reading the sequences in at runtime from separate files.

How can the compiler know the size of Vec<u8> when it doesn't know how many bytes are in each sequence prior to runtime?

I'm not certain where you were using boxes to begin with!

Vec<u8> is a type that's exactly 24 bytes large (assuming a 64-bit system); its representation is basically (pointer to [T], length, capacity). Similar is true for &[u8] and Box<[u8]> (these are 16 bytes large; fat pointers that store the length of the slice in their metadata).

The types whose size can't be known are [T] or str. I.e. the type that those pointers are pointing to.

@ExpHP Ohhhh, reading your question I now know that I should have known that! Thanks for bringing me back down to earth. So the only reason that the size of Vec<String> cannot be known at compile time is because the size of String cannot be known at compile time.

I was boxing at first because I was reading arrays of sequences in as Vec<String> - transitioning to Vec<u8> now because of good advice above.

No, String has a known size (it's nothing more than a wrapper around Vec<u8>, so 24 bytes). As does Vec<String>; still 24 bytes.

use std::mem;

fn main() {
    assert_eq!(mem::size_of::<Vec<u8>>(), 24);
    assert_eq!(mem::size_of::<Vec<String>>(), 24);
    assert_eq!(mem::size_of::<String>(), 24);
}

Can you show the code that uses Box?
