Mutually dependent factories

Hello.

I'm experimenting with certain aspects of Rust to choose the best API for a bioinformatics library. I've stopped at the following interface to work with sequences:

// the arguments are alphabets and transcription tables, see below
let (DNA, RNA) = make_nuc_pair(...); 

let dna: Sequence<'_> = DNA("ATGC");
let rna: Sequence<'_> = dna.transcribe();

Here, DNA and RNA are factories that create instances of Sequence. In conjunction with a sequence string, it is obligatory for each Sequence to be associated with an alphabet and also with the information on how to transcribe each Sequence into its pair, i.e., DNA to RNA and vice versa. In order to hide these details, I'm using factories that take &str as an input and return a Sequence object with all fields being set according to the recipe that was established during the creation of these factories when make_nuc_pair was called.

So how does the Sequence look like?

struct Sequence<'a> 
{
    sequence: String,
    alphabet: &'a Alphabet,
    tsc_data: &'a TranscriptionData<'a>,
}
struct Alphabet { /* details are not important */ }

The transcription data field tsc_data refers to the following structure

struct TranscriptionData<'a>
{
    factory : Option<Box<dyn Fn(String) -> Sequence<'a>>>,
    table   : &'a TranscriptionTable,
}
struct TranscriptionTable { /* details are not important */ }

Namely, when .transcribe() method is called the sequence is processed using the table and then passed to factory to create a new Sequence object that is a result of transcription

impl<'a> Sequence<'a>
{
    fn new(sequence: &str, alphbaet: &'a Alphabet, tsc_data: &'a TranscriptionData<'a>) -> Self
    { 
        Sequence { sequence : sequence.to_string(), alphabet : alphbaet, tsc_data : tsc_data }
    }

    fn transcribe(&self) -> Sequence<'a> 
    {
        let sequence = self.tsc_data.table.transcribe(&self.sequence);
        (self.tsc_data.factory.unwrap())(&sequence)
    }
}

And there is a problem with this approach. Namely, a sequence with a DNA alphabet should refer to a corresponding RNA factory and vice versa. This a classical headache in Rust.

Here is a (non-working) sketch of logic that initializes factories:

pub fn make_nuc_pair<'a>(dna_alphabet : &'a Alphabet, 
                         rna_alphabet : &'a Alphabet,
                         dna_to_rna   : &'a TranscriptionTable,
                         rna_to_dna   : &'a TranscriptionTable) -> (impl Fn(String) -> Sequence<'a>, impl Fn(String) -> Sequence<'static>) 
{
    let mut tsc_data_dna = TranscriptionData { factory : None, table : dna_to_rna };
    let mut tsc_data_rna = TranscriptionData { factory : None, table : rna_to_dna };

    let DNA = Box::new(move |seq: String| Sequence::new(&seq, dna_alphabet, &tsc_data_dna));
    let RNA = Box::new(move |seq: String| Sequence::new(&seq, rna_alphabet, &tsc_data_rna));

    tsc_data_dna.factory = Some(RNA); // this is the place where
    tsc_data_rna.factory = Some(DNA); // everything breaks

    (RNA, DNA)
}

So the idea is to first create TranscriptionData objects with factory set to None, then create the corresponding factories (they are closures), and finally to cross-reference them. This works in Python, but as we know, not in safe Rust.

What I would like to have is some way (most likely, with the unsafe block) to cross-reference these factories such that the resulting structure is (1) thread safe (Sequence objects are expected to be sent between threads) and (2) is not a weird Option<Arc<RefCell<&'a ... type of thing. I'm OK to work with raw pointers.

  • What is the range of possible solutions and their pros and cons?
  • Is there a way to change closures to functions?
  • Is there a way to get rid of Option and initialize with null pointer?

P.S. All of this is needed because I want to work with dynamic (i.e., created at run-time) alphabets and translation tables.

I looked at this for a while, but I'm still not sure what the goal is. Is there something more complex to transcription than replacing each character with its corresponding character in the other alphabet?

Regardless, you should start with the following:

  • By default, assign every stored reference its own lifetime.
  • Derive Copy (and Clone) for every struct that is capable of it.
  • Don't make references to Copy types that store references.
1 Like

Closures may have state (captured variables), but functions may not. To turn a closure into a function you'd have to remove its state, e.g. you could pass &Alphabet and &TranscriptionData as arguments every time you call DNA and RNA. At that point, they'd lose any advantage over calling Sequence::new directly. Once you do that, problems start to disappear.

1 Like

Sometimes stacking these types is quite useful. In other cases they're a warning that you might be abstracting too far, like JavaBeans.

I get a feel of OO overengineering, as @tbfleming indicates, you are building a interconnected web of "objects", the excessive usage of lifetimes is a clear sign. it's probably translated from other language like python you mentioned. rewriting a piece of code in different language should not be done at the syntax level, but instead at the semantic level.

let's take a step back and have a look. since you mentioned the alphabets and transcription table need to be runtime information, let's not look at solutions involving complex type level encoding. there's ton of redundant information in your code.

from your description and the snippets, I think the information needed to transcribe a sequence is all encoded in the TranscriptionTable, is that correct? if so, the TranscriptionData type serves no real purpose, other than "abstractions for the sake of abstractions".

then, Sequence already has a constructor, the "factory" closures add no value other than "why not because we can".

next, the TranscriptionTable, I would assume you can calculate the inverse of a table on demand, so you only need two Alphabets and one TranscriptionTable. if this assumption doesn't hold, you can extend the TranscriptionTable, e.g. to have two different transcribe methods, or you can simply bundle two TranscriptionTables together as a tuple or struct.

in summary, when defining a data type, think what data you own, not what data you might reference later in some operation, and don't use lifetime if you don't need them. in other words, think data, don't think "objects", this is often referred as data oriented design.

in my mind, I would imagine the API can be reduced to something like:

/// I assume Alphabet is relative small, just make it `Copy`
#[derive(Clone, Copy)]
struct Alphabet {}

struct TranscribeTable {
	from: Alphabet,
	to: Alphabet,
}
impl TranscribeTable {
	fn new(from: Alphabet, to: Alphabet) -> Self {
 		TranscribeTable { from, to }
	}
	fn transcribe(&self, sequence: &str) -> String { todo!() }
}

struct Sequence<'a> {
	sequence: String,
	table: &'a TranscribeTable,
}
impl<'a> Sequence<'a> {
	fn new(sequence: String, table: &'a TranscribeTable) -> Self {
		Sequence { sequence, table }
	}
	fn transcribe(&self) -> Sequence<'a> {
		Sequence::new(self.table.transcribe(&self.sequence), self.table)
	}
}

// you can return a struct instead of tuple
fn make_table_pair(dna_alphabet: Alphabet, rna_alphabet: Alphabet) -> (TranscribeTable, TranscribeTable) {
	let dna_to_rna = TranscribeTable::new(dna_alphabet, rna_alphabet);
	let rna_to_dna = TranscribeTable::new(rna_alphabet, dna_alphabet);
	(dna_to_rna, rna_to_dna)
}

if you insist to use a pair of "factories", you can do it on top of the pair of tables, something like:

fn make_factory_pair<'a>(
	dna_to_rna: &'a TranscribeTable,
	rna_to_dna: &'a TranscribeTable,
) -> (
	impl Fn(String) -> Sequence<'a> + 'a,
	impl Fn(String) -> Sequence<'a> + 'a,
) {
	let dna_factory = |sequence| Sequence::new(sequence, dna_to_rna);
	let rna_factory = |sequence| Sequence::new(sequence, rna_to_dna);
	(dna_factory, rna_factory)
}
2 Likes

The goal is to develop a bioinformatics library in Rust and Python (written in Rust) with roughly the same API. So I'm experimenting on what the common denominator between these two (apparently, different) languages could be.

Wrapper factories such as DNA and RNA are needed to make the API look "good" and compact. It's my choice.

I intentionally use TranscriptionData, because other levels of complexity are envisioned.

It is imperative not to clone/copy alphabets and transcription tables because

  1. The number of sequences can be quite large (millions)
  2. These sequences can be as short as the alphabet itself (short k-mers)
  3. Other data structures (larger than alphabets) are envisioned where the same design would apply.

I appreciate the effort to make me learn "correct Rust design", but I would still insist on my original question, namely, unsafe cross-referencing of two lifetimed objects. I can make everything be "correct" Rust, however, this will be "incorrect" Python. So I have to make trade-offs, and for the purpose of current round of experimentation my choice is to go unsafe (the questions is, what's the "best" way?).

If you try to make identical APIs in Rust and Python, both will suffer. Don't; let them differ where the differences are natural.

Cases like this are often a good fit for using Arc (rather than a reference or Box). This way the data is not copied, but you also don't end up with lifetimes on your structs.

7 Likes

I'm perfectly aware that both will suffer. Actually, it's me who is suffering :slight_smile: Should that stop me from experimenting on what the common denominator would look like? It can easily be some core library with two different wrappers (one for Rust, one for PyO3/Python), for example. I just want to explore my options, including the ones that consists in doing basically C/C++ (i.e., raw pointers).

Speaking of Arc, if the same alphabet (or other structure) is accessed via Arc (for reading purposes only) from multiple threads (this is expected to happen very frequently), would it decrease the performance significantly if compared with plain & reference?

For reading memory, you're only going to pay a performance hit for Arc when you update the reference count.

A couple of observations:

  • why is the transcription table from dna to rna and from rna to dna separate if you'll always need them together?
  • why is the dna and rna alphabets provided separately from the transcription table?
  • why should transcription data store a closure? Can't you just make factory a method on it? A closure is just going to hide its references/dependencies.

I would argue that these kind of "cross-references" are a bad practice in every language, since they make it difficult to reason about what references what due to the presence of multiple cyclic paths between them.

The best way is "don't use unsafe", at least until you can provide an explanation of why your usage of it would be fully safe and can't possibly result in Undefined Behaviour. unsafe is not a magic wand that will fix your program, rather it will actually create more problems and harder to debug.

If you plan to use PyO3 then you'll have to avoid having any lifetime parameter in user facing types, since PyO3 doesn't allow to expose them to python.

If you use Arc though you'll have to be careful not to introduce cycles that aren't broken by Weaks, otherwise you'll leak memory.

8 Likes

Not to mention that OP is definitely misunderstanding something, because the combination of Arc and RefCell is useless. RefCell is not threadsafe, so Arc<RefCell<T>> isn't, either – which makes the Arc no more than an unnecessary synchronization overhead.

Unrelated, but never once in my life did I have a problem to solve and thought, "I know! I'll use a Factory!"

1 Like

It's embarrassing to admit this, but I once did. Same with the other stuff in the Gang of Four book.

1 Like

is not a weird Option<Arc<RefCell<&'a ... type of thing

That's why I mentioned it as "type of thing" (and not the actual solution I had in mind with unsafe refs/pointers) by putting just a random sequence of wrappers to demonstrate the type of construction I don't want.

why is the transcription table from dna to rna and from rna to dna separate if you'll always need them together?

For generality. Once again, it's a REASEARCH project. I'm EXPLORING my options, not LIMITING them.

why is the dna and rna alphabets provided separately from the transcription table?

Sometimes, TranscriptionData will be missing (it'll be Optioned later).

why should transcription data store a closure? Can't you just make factory a method on it? A closure is just going to hide its references/dependencies.

Can you show an example? What I want is something called DNA/RNA that looks like a function, i.e., used like DNA("ATGC"), that generates a Sequence object with a defined Alphabet and TranscriptionData,

I would argue that these kind of "cross-references" are a bad practice in every language, since they make it difficult to reason about what references what due to the presence of multiple cyclic paths between them.

Good luck with graphs. Anyway, in this case, all of this is limited to a single cross-reference. Both alphabets and transcription tables are not mutable. And, in the majority of use cases, they are expected to be instantiated in the beginning of the program.

If you plan to use PyO3 then you'll have to avoid having any lifetime parameter in user facing types, since PyO3 doesn't allow to expose them to python.

For now, I need a working Rust version according to my goals. Later, we'll see. And, as it frequently happens with research, goals may change in the process.

P.S. Why is so hard just to show the best unsafe implementation?

Because unsafe is never the best implementation.

5 Likes

Because using unsafe isn't a magical button to make the problems go away and your problem isn't necessarily one that unsafe could even solve.


Reading through this thread, it sounds like there may be a bit of a design mis-match where the current approach doesn't line up with the way Rust pushes people to write their code. That can feel really frustrating because you want to write your solution one way and the compiler or other Rustaceans keep rejecting you.

You may have heard this referred to as "fighting the borrow checker".

Usually, when I design something in Rust I'll think about what my major concepts are and the way data changes as it flows through. Maybe even sketch out an example of how you'd like end users to use your code and see how different objects need to interact.

This is a massive over-simplification, but generally some guidelines are:

  • Make sure you have a clear ownership story where A owns B, which owns C, and so on
    • Avoid reference cycles or webs because they make the ownership story less clear
    • Avoid putting references in structs because it's easy to tie yourself in knots with lifetimes
    • Avoid interior mutability (RefCell) if you can because it's often a device used to muddy the ownership story
  • Try to write your code in a functional style and stick to immutability
  • Avoid GoF design patterns - they were created for a very different kind of programming language and a literal translation will cause you a lot of frustration
  • unsafe is not the answer - Rust Koans​​​​​

Regarding Alphabet and other things that could potentially be quite large, Arc is the thing you want to reach for because it lets you create a single Alphabet instance and have a bunch of other things reuse it. For more advanced situations, the im crate might be helpful because it contains types which allow you to do structural sharing.

2 Likes

unsafe in rust comes with a tough challenge: don't allow safe code to cause UB. C and C++ don't have that. In C, it's common for a library's doc to list a bunch of esoteric rules that clients must follow or kaboom. C++ started to abandon that approach (the Modern C++ movement), and came up with a bunch of practices which separate ergonomic libraries from non-ergonomic ones. It's tough to fit your intended design into the Modern C++ ergonomic model without adding a couple more layers of complexity to sort out ownership. Once you do that, it's likely possible to reproduce it in Rust without using unsafe.

Even Python has an issue: your model has circular references which cause leaks.

This might be an acceptable solution for you. Rc everything and accept that it will leak.

Python does collect cycles.

2 Likes