Here is some Python code for a typical bioinformatics library.
class Alphabet:
    symbols = set()

    def __init__(self, symbols: str, complement: str):
        self.symbols = set(symbols)
        self.cmplmnt = {s: c for s, c in zip(symbols, complement)}

    def is_word(self, sequence: str):
        return all(s in self.symbols for s in sequence)

    def complement(self, sequence: str):
        return sequence.translate(str.maketrans(self.cmplmnt))


DNA_ALPHABET = Alphabet("ATGC", "TACG")
RNA_ALPHABET = Alphabet("AUGC", "UACG")
PRT_ALPHABET = Alphabet("WXYZ", "WXYZ")


class Sequence:
    sequence: str | None
    alphabet: Alphabet

    def __init__(self, sequence: str, alphabet: Alphabet):
        self.sequence = sequence if alphabet.is_word(sequence) else None
        self.alphabet = alphabet

    def is_valid(self):
        return self.sequence is not None

    def reverse(self):
        return Sequence(self.sequence[::-1], self.alphabet)

    def complement(self):
        return Sequence(self.alphabet.complement(self.sequence), self.alphabet)

    def reverse_complement(self):
        return Sequence(self.alphabet.complement(self.sequence[::-1]), self.alphabet)


class DNA(Sequence):
    quality: list[int]

    def __init__(self, sequence: str):
        super().__init__(sequence, DNA_ALPHABET)

    def to_rna(self):
        pass

    def to_protein(self):
        pass


class RNA(Sequence):
    quality: list[int]

    def __init__(self, sequence: str):
        super().__init__(sequence, RNA_ALPHABET)

    def to_dna(self):
        pass

    def to_protein(self):
        pass


class Protein(Sequence):
    def __init__(self, sequence: str):
        super().__init__(sequence, PRT_ALPHABET)

What is good about this code:
- It is very concise.
- I can create both predefined types of sequences (`DNA`, `RNA`, `Protein`) and custom ones (`Sequence`) with an arbitrary `Alphabet`.
- I only need to change the `Sequence` class in case I need to add more functionality to work with `self.sequence`, e.g., find a substring, extract a substring, etc. This would then apply to all child classes.
- I can add features (e.g., `quality`) and methods (e.g., `to_rna`) to specific types of sequences.

So what is the best strategy for porting this type of OOP setup to Rust? Here are my thoughts:
- If I make a `struct Sequence` and then use it as a field in another struct, e.g., `struct DNA { sequence: Sequence }`, then the `sequence` field would be accessed differently in `Sequence` and `DNA`, i.e., `Sequence::sequence` and `DNA::Sequence::sequence`. This complicates sharing methods between these two structs.
- I can declare `struct Sequence { sequence: String }` and `struct DNA { sequence: String }`. I would then need to declare a trait, e.g., `Seq`, that works with `self.sequence`. However, this trait would have to be implemented separately for every predefined type of sequence, i.e., `DNA`, `RNA`, `Protein` and `Sequence`. That's too much.
- I can use a blanket implementation, namely, implementing some dummy trait `SeqDummy` for each sequence type, then declaring `trait Seq { ... }` and implementing it for all types with the dummy trait at once, i.e., `impl<T: SeqDummy> Seq for T { fn complement(&self) -> T { ... } }`. This approach has its own overhead too, though not that much.

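To make the third option concrete, here is a minimal sketch of how it might look. Only `SeqDummy`, `Seq` and `complement` come from the description above; the accessor names (`symbols`, `from_symbols`, `complement_symbol`), the `Dna`/`Rna` structs and the body of `to_rna` are hypothetical and chosen just for illustration. Validation against the alphabet is left out to keep it short.

```rust
// Hypothetical sketch of option 3; names not used in the post are illustrative.

/// The "dummy" trait: the few things each sequence type must supply itself.
trait SeqDummy: Sized {
    /// Borrow the raw symbols.
    fn symbols(&self) -> &str;
    /// Wrap symbols in the concrete type (no validation in this sketch).
    fn from_symbols(symbols: String) -> Self;
    /// Alphabet-specific complement of a single symbol.
    fn complement_symbol(c: char) -> char;
}

/// Shared behaviour, available on every type that implements `SeqDummy`.
trait Seq: Sized {
    fn reverse(&self) -> Self;
    fn complement(&self) -> Self;
    fn reverse_complement(&self) -> Self;
}

// The blanket implementation: written once, applies to all `SeqDummy` types.
impl<T: SeqDummy> Seq for T {
    fn reverse(&self) -> T {
        T::from_symbols(self.symbols().chars().rev().collect())
    }

    fn complement(&self) -> T {
        T::from_symbols(self.symbols().chars().map(T::complement_symbol).collect())
    }

    fn reverse_complement(&self) -> T {
        T::from_symbols(
            self.symbols().chars().rev().map(T::complement_symbol).collect(),
        )
    }
}

#[derive(Debug, PartialEq)]
struct Dna {
    sequence: String,
}

#[derive(Debug, PartialEq)]
struct Rna {
    sequence: String,
}

impl SeqDummy for Dna {
    fn symbols(&self) -> &str {
        &self.sequence
    }
    fn from_symbols(symbols: String) -> Self {
        Dna { sequence: symbols }
    }
    fn complement_symbol(c: char) -> char {
        // Same pairs as DNA_ALPHABET in the Python code; other symbols pass through.
        match c {
            'A' => 'T',
            'T' => 'A',
            'G' => 'C',
            'C' => 'G',
            other => other,
        }
    }
}

impl SeqDummy for Rna {
    fn symbols(&self) -> &str {
        &self.sequence
    }
    fn from_symbols(symbols: String) -> Self {
        Rna { sequence: symbols }
    }
    fn complement_symbol(c: char) -> char {
        // Same pairs as RNA_ALPHABET in the Python code; other symbols pass through.
        match c {
            'A' => 'U',
            'U' => 'A',
            'G' => 'C',
            'C' => 'G',
            other => other,
        }
    }
}

impl Dna {
    /// Type-specific method, analogous to `to_rna` in the Python version.
    /// Assumption: transcription is modelled simply as replacing T with U.
    fn to_rna(&self) -> Rna {
        Rna { sequence: self.sequence.replace('T', "U") }
    }
}
```

With these definitions, calling `reverse_complement()` on a `Dna` holding `ATGC` yields a `Dna` holding `GCAT`, and `to_rna()` stays available only on `Dna`. The per-type cost is the three small `SeqDummy` methods; everything in `Seq` is written once.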
I personally could not come up with a better solution than the 3rd one, so I would like to hear what the community suggests.