Reproducing the best of classical inheritance (from Python to Rust)

Here is some Python code for a typical bioinformatics library.

class Alphabet:
    symbols = set()

    def __init__(self, symbols: str, complement: str):
        self.symbols = set(symbols)
        self.cmplmnt = {s : c for s,c in zip(symbols, complement)}

    def is_word(self, sequence: str):
        return all(s in self.symbols for s in sequence)

    def complement(self, sequence: str):
        return sequence.translate(str.maketrans(self.cmplmnt))

DNA_ALPHABET = Alphabet("ATGC", "TACG")
RNA_ALPHABET = Alphabet("AUGC", "UACG")
PRT_ALPHABET = Alphabet("WXYZ", "WXYZ")

class Sequence:
    sequence : str | None
    alphabet : Alphabet

    def __init__(self, sequence: str, alphabet: Alphabet):
        self.sequence = sequence if alphabet.is_word(sequence) else None
        self.alphabet = alphabet
    
    def is_valid(self):
        return self.sequence is not None
    
    def reverse(self):
        return Sequence(self.sequence[::-1], self.alphabet)

    def complement(self):
        return Sequence(self.alphabet.complement(self.sequence), self.alphabet)

    def reverse_complement(self):
        return Sequence(self.alphabet.complement(self.sequence[::-1]), self.alphabet)

class DNA(Sequence):
    quality: list[int]

    def __init__(self, sequence: str):
        super().__init__(sequence, DNA_ALPHABET)

    def to_rna(self):
        pass

    def to_protein(self):
        pass


class RNA(Sequence):
    quality: list[int]

    def __init__(self, sequence: str):
        super().__init__(sequence, RNA_ALPHABET)

    def to_dna(self):
        pass

    def to_protein(self):
        pass

class Protein(Sequence):

    def __init__(self, sequence: str):
        super().__init__(sequence, PRT_ALPHABET)

What is good about this code:

  1. It is very concise.
  2. I can create both predefined types of sequences (DNA, RNA, Protein) and custom ones (Sequence) with an arbitrary Alphabet.
  3. I only need to change the Sequence class in case I need to add more functionality to work with self.sequence, e.g., find a substring, extract a substring etc... This would then apply to all child classes.
  4. I can add features (e.g., quality) and methods (e.g., to_rna) to specific types of sequences.

So what is the best strategy for porting this type of OOP setup to Rust? Here are my thoughts...

  1. If I make a struct Sequence and then use it as a field in another struct, e.g., struct DNA { sequence: Sequence }, then the sequence field would be accessed differently in the two types, i.e., self.sequence in Sequence but self.sequence.sequence in DNA. This complicates sharing methods between these two structs.
  2. I can declare struct Sequence { sequence : String } and struct DNA { sequence : String }. I would then need to declare a trait, e.g. Seq, that works with self.sequence. However, this trait would have to be implemented separately for every predefined type of sequence, i.e., DNA, RNA, Protein and Sequence. That's too much.
  3. I can use a blanket implementation: implement some dummy trait SeqDummy for each sequence type, then declare trait Seq { ... } and implement it for all types with the dummy trait at once, i.e., impl<T: SeqDummy> Seq for T { fn complement(&self) -> T { ... } }. This approach has its own overhead too, but not that much.

I personally could not come up with a better solution than the 3rd one, so I would like to hear what the community suggests.

Can you give some examples of how you'd use this code?

Sure. In Python, it would be used like this.

dna = DNA("AAGATAGCT")
rna = dna.to_rna()
prt = rna.to_protein()
rev_prt = prt.reverse()

alphabet = Alphabet("QWE", "QWE")
seq = Sequence("QQQ", alphabet)
rev_seq = seq.reverse()

Here, DNA and Sequence have a big chunk of shared functionality when it comes to working with self.sequence, e.g., reversing the string. The only difference is that DNA already has a predefined Alphabet (and maybe some specific functions like to_rna) while we should specify Alphabet for Sequence each time.

You should probably create a trait with default-implemented fns, and define the alphabet as a type parameter or as an associated type/constant.

My initial thought would be to organize things along these lines:

trait Alphabet: Copy + Eq + TryFrom<char> { 
    fn complement(self) -> Self;
}

struct Sequence<A> {
    seq: Vec<A>
}

#[derive(Copy,Clone,Eq,PartialEq)]
enum Dna { A, T, G, C }

impl Alphabet for Dna {
    fn complement(self) -> Self {
        match self {
            Self::A => Self::T,
            Self::T => Self::A,
            Self::G => Self::C,
            Self::C => Self::G,
        }
    }
}

impl<A:Alphabet> Sequence<A> {
    // Generic sequence fns here
}

impl Sequence<Dna> {
    // Dna-specific fns here
}

Yeah, this kind of optimization is also possible; however, my main concern is not the Alphabet.

Could you please provide a sketch for this solution? Making default implementations would require accessing the field self.sequence, which a trait method cannot do, i.e.,

pub trait Seq<T>
{
    fn sequence(&self) -> T { self.sequence }
}

impl Seq<DNA> for DNA {}

wouldn't work because it is not known in advance that the field sequence is a part of T.

The usual trick is to make the sequence() method non-default.

My point wasn't really about optimizing the alphabet implementation, but rather pointing out another option for how you handle sequences. If you make your struct Sequence generic over an alphabet type, then you can implement some methods that are generic over all possible alphabets and others that are only valid for specific alphabets, like Dna. The key part unfortunately ended up at the bottom of the code block: the two impl blocks, impl<A: Alphabet> Sequence<A> for the generic functions and impl Sequence<Dna> for the Dna-specific ones.

With this approach, your original DNA type would be spelled Sequence<Dna>.
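To make that concrete, here is a minimal runnable sketch of the same layout with the placeholders filled in. The `parse`, `reverse_complement` and `gc_count` functions and the choice of `char` as the conversion error are my own assumptions for illustration, not something from the original outline.

```rust
use std::convert::TryFrom;

pub trait Alphabet: Copy + Eq + TryFrom<char> {
    fn complement(self) -> Self;
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Sequence<A> {
    seq: Vec<A>,
}

#[derive(Debug, Copy, Clone, Eq, PartialEq)]
pub enum Dna { A, T, G, C }

impl TryFrom<char> for Dna {
    // Assumption: report the offending character as the error.
    type Error = char;

    fn try_from(c: char) -> Result<Self, Self::Error> {
        match c {
            'A' => Ok(Dna::A),
            'T' => Ok(Dna::T),
            'G' => Ok(Dna::G),
            'C' => Ok(Dna::C),
            other => Err(other),
        }
    }
}

impl Alphabet for Dna {
    fn complement(self) -> Self {
        match self {
            Self::A => Self::T,
            Self::T => Self::A,
            Self::G => Self::C,
            Self::C => Self::G,
        }
    }
}

// Generic functions: available for every alphabet.
impl<A: Alphabet> Sequence<A> {
    /// Parses a string, failing on the first symbol outside the alphabet.
    pub fn parse(s: &str) -> Result<Self, <A as TryFrom<char>>::Error> {
        let seq = s
            .chars()
            .map(|c| A::try_from(c))
            .collect::<Result<Vec<A>, _>>()?;
        Ok(Sequence { seq })
    }

    pub fn reverse_complement(&self) -> Self {
        Sequence { seq: self.seq.iter().rev().map(|&a| a.complement()).collect() }
    }
}

// Dna-specific functions: only callable on Sequence<Dna>.
impl Sequence<Dna> {
    pub fn gc_count(&self) -> usize {
        self.seq.iter().filter(|&&b| b == Dna::G || b == Dna::C).count()
    }
}
```

Because invalid symbols are rejected while parsing, a `Sequence<Dna>` that exists is always valid, so there is no need for an `Option` inside the struct or an `is_valid` check.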


Can you please provide an example?

Here's a sketch.

// Separate trait for things that are more concrete-type specific
pub trait NewSequential: Sized + AsRef<Sequence> {
    fn new_using(&self, sequence: Option<String>) -> Self;
}

// Everything else that can be defaulted from `NewSequential`
pub trait Sequential: NewSequential {
    fn is_valid(&self) -> bool;
    fn reverse(&self) -> Self;
    fn complement(&self) -> Self;
    fn reverse_complement(&self) -> Self;
}

impl<Seq: NewSequential> Sequential for Seq {
    fn reverse(&self) -> Self {
        let this = self.as_ref();
        let sequence = this.sequence.as_ref().map(|word| {
            word.chars().rev().collect()
        });
        
        self.new_using(sequence)
    }
   // ...
}

The AsRef<Sequence> supertrait bound is basically "know[ing] in advance that the field sequence is a part of T".

Then everything can share the default implementation if they implement AsRef<Sequence> and NewSequential... including Sequence itself.

#[derive(Debug, Clone)]
pub struct Sequence {
    sequence: Option<String>,
    alphabet: Alphabet,
}

impl NewSequential for Sequence {
    fn new_using(&self, sequence: Option<String>) -> Self {
        Self { sequence, alphabet: self.alphabet.clone() }
    }
}

impl AsRef<Sequence> for Sequence {
    fn as_ref(&self) -> &Sequence {
        self
    }
}

You could have a generic wrapper for fixed-alphabet sequence types too...

#[derive(Clone, Debug)]
pub struct TypedSequence<Ty> {
    marker: Ty,
    sequence: Sequence,
}

impl<Ty: Clone> NewSequential for TypedSequence<Ty> {
    fn new_using(&self, sequence: Option<String>) -> Self {
        let sequence = self.sequence.new_using(sequence);
        Self { sequence, marker: self.marker.clone() }
    }
}

impl<Ty> AsRef<Sequence> for TypedSequence<Ty> {
    fn as_ref(&self) -> &Sequence {
        &self.sequence
    }
}

And why not, how about a way to construct them and reduce a little more boilerplate...

pub trait ConstructableTypedSequence: Sized {
    const ALPHABET: [&'static str; 2];
    fn new<S: Into<String>>(self, sequence: S) -> TypedSequence<Self> {
        let alphabet = Alphabet::new(Self::ALPHABET[0], Self::ALPHABET[1]);
        let sequence = Sequence::new(sequence, alphabet);
        TypedSequence { marker: self, sequence }
    }
}

Then to make a new constructable typed sequence, you need

// This is all you need to define for the `Sequential` stuff
// and the constructor
#[derive(Copy, Clone, Debug)]
pub struct Dna;
impl ConstructableTypedSequence for Dna {
    const ALPHABET: [&'static str; 2] = ["ATGC", "TACG"];
}

// Extra stuff goes here
impl TypedSequence<Dna> {
    pub fn to_rna(&self) -> TypedSequence<Rna> {
        // todo...
        Rna.new("")
    }
}

If TypedSequence is in a different crate, you'll need to use your own traits instead of native implementations for the "extra stuff", but it's still doable.
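A rough sketch of that last case, using an extension trait. The module `seqlib` stands in for the other crate, and `DnaExt`, the `String` payload and the public fields are hypothetical simplifications to keep the example short.

```rust
// Stand-in for the external crate (hypothetical name).
mod seqlib {
    #[derive(Debug, Clone)]
    pub struct TypedSequence<Ty> {
        pub marker: Ty,
        pub sequence: String,
    }
}

#[derive(Debug, Copy, Clone)]
pub struct Dna;

#[derive(Debug, Copy, Clone)]
pub struct Rna;

// If TypedSequence really lived in another crate, an inherent
// `impl TypedSequence<Dna> { ... }` would be rejected here.
// A trait defined locally can still be implemented for it.
pub trait DnaExt {
    fn to_rna(&self) -> seqlib::TypedSequence<Rna>;
}

impl DnaExt for seqlib::TypedSequence<Dna> {
    fn to_rna(&self) -> seqlib::TypedSequence<Rna> {
        seqlib::TypedSequence {
            marker: Rna,
            sequence: self.sequence.replace('T', "U"),
        }
    }
}
```

Callers then need `DnaExt` in scope to call `to_rna`, which is the main cost compared with an inherent method.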



Oh, thanks!

I was trying to use Sequence<Dna> myself, but got stuck with the impl part. Now I see that impl<A: Alphabet> Sequence<A> solves this. Will try that one.

You literally just don't provide a body for it, so the type implementing the trait has to. E.g. this compiles:

pub trait Seq<T> {
    fn sequence(&self) -> T;

    fn reverse(&self) -> T
    where
        T: IntoIterator + FromIterator<T::Item>,
        T::IntoIter: DoubleEndedIterator,
    {
        self.sequence().into_iter().rev().collect()
    }
}
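For completeness, a small usage sketch of that trait. The `Dna` struct and its `bases` field are made up for the example. Note that `String` does not implement `IntoIterator`, so with `T = String` the `where` clause on `reverse` would not be satisfied; a `Vec<char>` works.

```rust
use std::iter::FromIterator;

pub trait Seq<T> {
    // Required: each implementing type says how to produce its data.
    fn sequence(&self) -> T;

    // Provided: written once in terms of `sequence()`.
    fn reverse(&self) -> T
    where
        T: IntoIterator + FromIterator<T::Item>,
        T::IntoIter: DoubleEndedIterator,
    {
        self.sequence().into_iter().rev().collect()
    }
}

// Hypothetical concrete type storing its symbols as chars.
pub struct Dna {
    pub bases: Vec<char>,
}

impl Seq<Vec<char>> for Dna {
    fn sequence(&self) -> Vec<char> {
        self.bases.clone()
    }
}
```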

But at this point, @2e71828's solution is cleaner and you should probably just use that.

Thanks. That's a lot of code, will study it soon.

Thanks for the trick anyway, didn't know that.

Using the ideas from the discussion above:
Rust Playground

Thanks everyone for assistance. I was able to design structs as follows:

// This macro creates a boolean map for a given alphabet
const ALPHABET_SIZE: usize = 128;
macro_rules! alphabet_map 
{
    ($($symbol:expr),*) => {
    {
        let mut arr = [false; ALPHABET_SIZE];
        $(
            arr[$symbol as usize] = true;
        )*
        arr
    }};
}


pub trait Alphabet
{ 
    const SYMBOLS: [bool; ALPHABET_SIZE];
    const ALLOWED: [bool; ALPHABET_SIZE];

    fn is_word(sequence: &String) -> bool
    {
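        // Characters outside the table (e.g. non-ASCII) are rejected instead of panicking.
        sequence.chars().all(|s| Self::SYMBOLS.get(s as usize).copied().unwrap_or(false))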
    }
}

pub struct Sequence<A: Alphabet>
{
    pub sequence : String,
    seq_type     : A,
}

pub struct DNA {}
impl Default for DNA 
{
    fn default() -> Self { DNA {} }
}

pub struct RNA {}
impl Default for RNA 
{
    fn default() -> Self { RNA {} }
}

impl Alphabet for DNA
{
    const SYMBOLS: [bool; ALPHABET_SIZE] = alphabet_map!['A', 'T', 'G', 'C'];
    const ALLOWED: [bool; ALPHABET_SIZE] = alphabet_map!['N'];
}

impl Alphabet for RNA
{
    const SYMBOLS: [bool; ALPHABET_SIZE] = alphabet_map!['A', 'U', 'G', 'C'];
    const ALLOWED: [bool; ALPHABET_SIZE] = alphabet_map!['N'];
}

impl<A: Alphabet + Default> Sequence<A> 
{
    pub fn new(sequence: String) -> Option<Self>
    {
        if A::is_word(&sequence)
        {
            Some( Sequence { sequence : sequence, seq_type : A::default() } )
        }
        else { None }
    }

}

impl Sequence<DNA>
{
    pub fn to_rna(&self) -> Sequence<RNA>
    {
        Sequence::<RNA> { sequence: self.sequence.replace("T", "U"), seq_type: RNA::default() }
    }
}

impl Sequence<RNA>
{
    pub fn to_dna(&self) -> Sequence<DNA>
    {
        Sequence::<DNA> { sequence: self.sequence.replace("U", "T"), seq_type: DNA::default() }
    }
}

This runs as:

fn main() 
{
    let dna = Sequence::<DNA>::new("ATGC".to_owned()).unwrap();
    println!("DNA: {}", dna.sequence);
    let rna = dna.to_rna();
    println!("RNA: {}", rna.sequence);
    let dna = rna.to_dna();
    println!("DNA: {}", dna.sequence);
}

and prints

DNA: ATGC
RNA: AUGC
DNA: ATGC

as expected.

I'm almost comfortable with this solution. I prefer to have sequence as String because I'll be heavily using string methods. Also, I need two sets of characters for each Alphabet, one set for valid symbols (e.g., ATGC for DNA) and another set for allowed characters (e.g., N). Thus, I decided to store the alphabet using boolean presence/absence maps that are filled with a macro.
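As an aside, the same presence/absence map can probably be built without a macro. The following is only a sketch of that alternative, swapping the macro for a `const fn`; the `alphabet_map` function taking a byte string is my own variation, not the code from the playground.

```rust
const ALPHABET_SIZE: usize = 128;

/// Builds the presence map at compile time from a byte string of symbols.
const fn alphabet_map(symbols: &[u8]) -> [bool; ALPHABET_SIZE] {
    let mut arr = [false; ALPHABET_SIZE];
    let mut i = 0;
    // `for` loops are not available in const fn, so use `while`.
    while i < symbols.len() {
        arr[symbols[i] as usize] = true;
        i += 1;
    }
    arr
}

pub trait Alphabet {
    const SYMBOLS: [bool; ALPHABET_SIZE];
    const ALLOWED: [bool; ALPHABET_SIZE];
}

pub struct DNA;

impl Alphabet for DNA {
    const SYMBOLS: [bool; ALPHABET_SIZE] = alphabet_map(b"ATGC");
    const ALLOWED: [bool; ALPHABET_SIZE] = alphabet_map(b"N");
}
```

The call sites read almost the same as with the macro, and a byte that does not fit in the table should be caught when the constant is evaluated rather than at run time.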

What really bothers me is the introduction of a dummy field, seq_type: A, into pub struct Sequence<A: Alphabet>. Without this field, there was an error due to the "unused parameter A". To fill this new field, I implemented the Default trait, which definitely adds some overhead and makes the code less concise.

Well, this seq_type might serve the purpose of storing the sequence type (DNA, RNA, Protein, etc.); however, I'm not sure what the best way to implement this is. Should I keep it as a struct, use an enum, or something else?

Any suggestions on how to improve the code?


The other option would be std::marker::PhantomData<A>, which is guaranteed to be zero-sized.
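A minimal sketch of the PhantomData variant. To keep it short, the alphabet here is a plain `&'static str` constant instead of the boolean maps; that simplification is mine.

```rust
use std::marker::PhantomData;

pub trait Alphabet {
    // Simplified for the sketch: the valid symbols as a string.
    const SYMBOLS: &'static str;
}

pub struct DNA;

impl Alphabet for DNA {
    const SYMBOLS: &'static str = "ATGC";
}

pub struct Sequence<A: Alphabet> {
    pub sequence: String,
    // Zero-sized marker: "uses" the type parameter without storing an A.
    seq_type: PhantomData<A>,
}

impl<A: Alphabet> Sequence<A> {
    pub fn new(sequence: String) -> Option<Self> {
        if sequence.chars().all(|c| A::SYMBOLS.contains(c)) {
            Some(Sequence { sequence, seq_type: PhantomData })
        } else {
            None
        }
    }
}
```

With this, the alphabet types no longer need `Default` at all, since no value of `A` is ever constructed.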

Alternatively, you could lean into the idea of storing the alphabet by moving the consts to method returns and adding &self to the trait methods. This would let you define custom alphabets at runtime in addition to the specialty ones DNA, RNA, etc:

pub trait Alphabet
{ 
    fn symbols(&self)->[bool; ALPHABET_SIZE];
    fn allowed(&self)->[bool; ALPHABET_SIZE];

    fn is_word(&self, sequence: &String) -> bool
    {
        let symbols = self.symbols();
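        // Characters outside the table (e.g. non-ASCII) are rejected instead of panicking.
        sequence.chars().all(|s| symbols.get(s as usize).copied().unwrap_or(false))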
    }
}
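Roughly, a runtime-defined alphabet could then sit next to the built-in ones like this. `CustomAlphabet` and the two-argument `Sequence::new` are hypothetical names for the sketch, and `allowed()` is left out for brevity.

```rust
pub const ALPHABET_SIZE: usize = 128;

pub trait Alphabet {
    fn symbols(&self) -> [bool; ALPHABET_SIZE];

    fn is_word(&self, sequence: &str) -> bool {
        let symbols = self.symbols();
        sequence.chars().all(|s| symbols.get(s as usize).copied().unwrap_or(false))
    }
}

// Fixed alphabet: a unit struct, nothing stored at run time.
pub struct Dna;

impl Alphabet for Dna {
    fn symbols(&self) -> [bool; ALPHABET_SIZE] {
        let mut map = [false; ALPHABET_SIZE];
        for &b in b"ATGC".iter() {
            map[b as usize] = true;
        }
        map
    }
}

// Alphabet defined at run time from any ASCII string.
pub struct CustomAlphabet {
    map: [bool; ALPHABET_SIZE],
}

impl CustomAlphabet {
    /// Returns None if a symbol does not fit in the 128-entry table.
    pub fn new(symbols: &str) -> Option<Self> {
        let mut map = [false; ALPHABET_SIZE];
        for c in symbols.chars() {
            *map.get_mut(c as usize)? = true;
        }
        Some(CustomAlphabet { map })
    }
}

impl Alphabet for CustomAlphabet {
    fn symbols(&self) -> [bool; ALPHABET_SIZE] {
        self.map
    }
}

pub struct Sequence<A: Alphabet> {
    pub sequence: String,
    pub alphabet: A, // no longer a dummy: it is the actual alphabet value
}

impl<A: Alphabet> Sequence<A> {
    pub fn new(sequence: String, alphabet: A) -> Option<Self> {
        if alphabet.is_word(&sequence) {
            Some(Sequence { sequence, alphabet })
        } else {
            None
        }
    }
}
```

This mirrors the Python version, where `Sequence("QQQ", alphabet)` takes an arbitrary alphabet, while `Sequence<Dna>` stays a distinct type with its own methods.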

Why don't you just #[derive(Default)]?
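Something like the following, I mean. It is a sketch with the `Alphabet` bound dropped to keep it short; a unit struct is zero-sized, so the field costs nothing at run time.

```rust
// The derive replaces the hand-written `impl Default for DNA`.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct DNA;

pub struct Sequence<A> {
    pub sequence: String,
    pub seq_type: A,
}

impl<A: Default> Sequence<A> {
    pub fn new(sequence: String) -> Self {
        Sequence { sequence, seq_type: A::default() }
    }
}
```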

Yes, that's a good idea; however, the dummy field is still there and I don't know how to properly use it.