Module for Python using PyO3

I'm writing a module for Python using PyO3. One of the basic pieces of functionality I would like to have is the following.

from module import Alphabet, Sequence

alph  = Alphabet(["a", "b", "c"])
seq_1 = Sequence("aaa", alph)
seq_2 = Sequence("bbb", alph)
...
seq_n = Sequence("ccc", alph)

Namely, I want to define an alphabet and then use it to create sequences. Obviously, I want only one instance of Alphabet to exist, so it should be accessed by reference. Both Sequence and Alphabet are #[pyclass] structs. They should never be converted into Python objects and should stay inside Rust's memory.

I tried to define them as follows:

#[pyclass]
pub struct Alphabet { ... }

#[pymethods]
impl Alphabet
{
    #[new]
    pub fn new(...) -> Self
    {
        Alphabet { ... }
    }
}

#[pyclass]
pub struct Sequence<'a> 
{
    alphabet: &'a Alphabet,
    sequence: String,
}
  
#[pymethods]
impl<'a> Sequence<'a> 
{
    #[new]
    pub fn new(sequence: String, alphabet: &'a Alphabet) -> Self 
    {
        Sequence { alphabet : alphabet, sequence : sequence }
    }
}

The problem here is with the 'a lifetime. PyO3 explicitly bans lifetimes (and generics as well) on #[pyclass] structs.

What options do I have to fix this? Can it be saved with Arc? I definitely don't want to export it to Python's object and then import back.

Since Python can't track lifetimes statically, there's no chance it could enforce the same kind of safe usage patterns that Rust does. So naturally, one can't just expose references to Python – the validity of references must be upheld dynamically, hence yes, reference counting is needed. That's exactly what Python does internally as well.
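To make the reference-counting point concrete, here is a minimal sketch in plain Rust (no PyO3 involved) where Sequence holds an Arc<Alphabet> instead of &'a Alphabet, so the struct needs no lifetime parameter. The simplified single-set Alphabet, the Option-returning constructor, and the alphabet() accessor are assumptions made for illustration; they are not part of the original code.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Simplified stand-in for the thread's `Alphabet` (assumption: one character set).
pub struct Alphabet {
    allowed: HashSet<char>,
}

impl Alphabet {
    pub fn new(allowed: &[char]) -> Self {
        Alphabet { allowed: allowed.iter().copied().collect() }
    }

    pub fn is_allowed_seq(&self, sequence: &str) -> bool {
        sequence.chars().all(|c| self.allowed.contains(&c))
    }
}

/// Holds a reference-counted handle rather than `&'a Alphabet`,
/// so the struct carries no lifetime parameter.
pub struct Sequence {
    alphabet: Arc<Alphabet>,
    sequence: String,
}

impl Sequence {
    /// Returns `None` if the sequence contains characters outside the alphabet.
    pub fn new(sequence: &str, alphabet: Arc<Alphabet>) -> Option<Self> {
        if alphabet.is_allowed_seq(sequence) {
            Some(Sequence { alphabet, sequence: sequence.to_owned() })
        } else {
            None
        }
    }

    /// Hypothetical accessor, used here only to show that the handle is shared.
    pub fn alphabet(&self) -> &Arc<Alphabet> {
        &self.alphabet
    }

    pub fn seq(&self) -> &str {
        &self.sequence
    }
}
```

Cloning the Arc only increments a counter: every Sequence built from the same handle points at the same Alphabet allocation, which is freed once the last handle is dropped.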


You can use the Py type to hold a Python reference directly inside Sequence, like this:

#[pyclass]
pub struct Sequence
{
    alphabet: Py<Alphabet>,
    sequence: String,
}
  
#[pymethods]
impl Sequence
{
    #[new]
    pub fn new(sequence: String, alphabet: Py<Alphabet>) -> Self 
    {
        Sequence { alphabet : alphabet, sequence : sequence }
    }
}

This will use Python's native reference-counted pointer.


Yeah, but I want an instance of Alphabet to be a Rust object, not a Python object.
Will alphabet be stored in Rust's memory?

I was able to make it work with Py<...>. Now my question is how to access the Alphabet object from Sequence inside Rust.

The alphabet is now a Python reference to a Rust object. What is the best strategy in such cases? Should I acquire the GIL through Python::with_gil and then call as_ref? Are there other options?

#[pyclass]
pub struct Alphabet
{
    correct: HashSet<char>,
    allowed: HashSet<char>,
}

#[pymethods]
impl Alphabet
{
    #[new]
    pub fn new(correct: Vec<char>, allowed: Vec<char>) -> Self
    {
        Alphabet 
        {
            correct: correct.into_iter().collect(),
            allowed: allowed.into_iter().collect(),
        }
    }

    pub fn is_allowed_seq(&self, sequence: &str) -> bool
    {
        sequence.chars().all(|char| self.is_allowed_char(char))
    }

    pub fn is_allowed_char(&self, character: char) -> bool
    {
        self.allowed.contains(&character) || self.correct.contains(&character)
    }
}

#[pyclass]
pub struct Sequence
{
    alphabet: Py<Alphabet>,
    sequence: String,
}
  
#[pymethods]
impl Sequence
{
    #[new]
    pub fn new(sequence: String, alphabet: Py<Alphabet>) -> PyResult<Self>
    {
        // [!!!] HERE IT FAILS
        if !alphabet.is_allowed_seq(&sequence)
        {
            return Err(PyValueError::new_err("Incorrect sequence"));
        }

        let seq = Sequence 
        {
            sequence: sequence,
            alphabet: alphabet,
        };
        Ok(seq)
    }

    pub fn seq(&self) -> String 
    {
        self.sequence.clone()
    }
}

There isn't separate memory for Rust and Python. There may be separate allocators, but that doesn't matter in the least when merely accessing a value through its address.

That's one option. You can also take a Python parameter, like so:

    pub fn new(py: Python, sequence: String, alphabet: Py<Alphabet>) -> PyResult<Self>
    {
        // Borrow the Alphabet behind the Py<...> handle; `py` proves the GIL is held
        if !alphabet.borrow(py).is_allowed_seq(&sequence)
        {
            return Err(PyValueError::new_err("Incorrect sequence"));
        }
    ...

On the Python side, the Python parameter is ignored: PyO3 supplies it, and callers still write Sequence(sequence, alphabet).
Either approach will work.

I understand this. I distinguish Rust's and Python's "memory" on the basis of which piece of code is responsible for deallocation.

So, from what I understand, once I return an Alphabet object, it instantly becomes wrapped by PyO3/Python into an Rc-type of smart pointer, i.e., Alphabet(["A", "T", "G", "C"], ["N"]) is <Alphabet at 0x7f25997a5830>. Here, 0x7f25997a5830 is the address of Python's smart pointer that points to the Alphabet object created by Rust code.

And so when inside Python's code I pass it to the constructor of Sequence, this will be the wrapped object (i.e., Rc<Alphabet>), thus, dereferencing is needed. But given that Python objects are GIL-protected, one should dereference them only after acquiring the lock.

Am I getting the full picture right?

Btw, how costly is this whole business of using Py<...> wrappers? To me, it looks like one additional indirection, i.e., relatively inexpensive.

Well, my final goal is to make both a Rust library and a Python library (written in Rust) with roughly the same API and roughly the same codebase. So the option with Python::with_gil looks more attractive.

There is also another problem. Some alphabets (and some other objects also) need to be initialized just once and be globally available from both Rust and Python code. This is needed to use shortcuts for several use cases.

For example, DNA sequences can be initialized (in Python) the following way:

DNA_ALPHABET = Alphabet(["A", "T", "G", "C"], ["N"])
dna_sequence = Sequence("ATGC", DNA_ALPHABET)

However, for typical alphabets (e.g., DNA) I want to have a shortcut like

dna_sequence = DNA("ATGC")

Here, the DNA function is defined within the Rust codebase:

#[pyfunction]
pub fn DNA(sequence: String) -> PyResult<Sequence>
{
    let seq = Python::with_gil(|py| 
    {
        let alphabet = Alphabet::new(vec!['A', 'C', 'G', 'T'], vec!['N']);
        let alphabet: Py<Alphabet> = Py::new(py, alphabet).unwrap();
        Sequence::new(sequence, alphabet)
    });
    seq
}

The problem here is that each time a new Alphabet object is being created. But I want a single object DNA_ALPHABET that is shared between Rust and Python. Ideally, it should be registered as

#[pymodule]
fn module(py: Python<'_>, m: &PyModule) -> PyResult<()>
{
    let DNA_ALPHABET = Alphabet::new(vec!['A', 'C', 'G', 'T'], vec!['N']);    
    m.add("DNA_ALPHABET", DNA_ALPHABET)?;
    Ok(())
}

If I do it this way, the object is available in Python as module.DNA_ALPHABET; however, it is not in Rust's global scope (i.e., it is not available in the DNA function). Moreover, it's not a Py<Alphabet> object, meaning that Py::new has to be called for each new sequence, which is not good.

If I declare it using lazy_static, then m.add("DNA_ALPHABET", DNA_ALPHABET)?; fails with: the trait IntoPy<pyo3::Py<PyAny>> is not implemented for DNA_ALPHABET.

So what is the best strategy to work with global predefined objects that are shared between Python and Rust and are easily available from both Python and Rust?

Ok, that gives me a much better picture of what you're doing.

A few things to note before I present a possible solution.

  1. Every value passed to Python will be allocated on Python's heap, so using Py doesn't incur an additional cost from Python's perspective.
  2. The cost of an allocation is kind of overblown. Yes, in tight loops it can matter, but otherwise it's more than fast enough.
  3. Trying to keep a library ergonomic for both Rust and Python while keeping the same data structures is difficult, and will necessarily require some amount of duplicated work.

In order to minimize the duplication, I recommend you create a lower-level library which doesn't store the Alphabet inside the Sequence, and instead has it passed in to every method that needs it. Then write two thin wrappers around this core library: one for Python, which stores a Py<Alphabet> in its Sequence type, and one for Rust, which could store an Arc, a reference, or whatever you need.
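A rough sketch of that layering, in plain Rust only: the core exposes a free function that takes the alphabet by reference, and the Rust-side wrapper stores an Arc<Alphabet> and delegates to it. The free-function form of is_allowed_seq, the String error type, and the alphabet() accessor are assumptions chosen for illustration. The Python-side wrapper is not shown; it would hold a Py<Alphabet> and call the same core function after borrowing.

```rust
use std::collections::HashSet;
use std::sync::Arc;

// ---- core library: the alphabet is always passed in, never stored ----

pub struct Alphabet {
    correct: HashSet<char>,
    allowed: HashSet<char>,
}

impl Alphabet {
    pub fn new(correct: Vec<char>, allowed: Vec<char>) -> Self {
        Alphabet {
            correct: correct.into_iter().collect(),
            allowed: allowed.into_iter().collect(),
        }
    }

    pub fn is_allowed_char(&self, character: char) -> bool {
        self.allowed.contains(&character) || self.correct.contains(&character)
    }
}

/// Core check shared by every wrapper; it only needs `&Alphabet`.
pub fn is_allowed_seq(alphabet: &Alphabet, sequence: &str) -> bool {
    sequence.chars().all(|c| alphabet.is_allowed_char(c))
}

// ---- thin Rust wrapper: decides how the alphabet is owned ----

pub struct Sequence {
    alphabet: Arc<Alphabet>,
    sequence: String,
}

impl Sequence {
    /// Assumption: a plain `String` error stands in for a real error type.
    pub fn new(sequence: String, alphabet: Arc<Alphabet>) -> Result<Self, String> {
        // `&Arc<Alphabet>` coerces to `&Alphabet` at the call site.
        if !is_allowed_seq(&alphabet, &sequence) {
            return Err("Incorrect sequence".to_owned());
        }
        Ok(Sequence { alphabet, sequence })
    }

    pub fn alphabet(&self) -> &Alphabet {
        &self.alphabet
    }

    pub fn seq(&self) -> &str {
        &self.sequence
    }
}
```

Because the validation logic lives in the core function, neither wrapper needs to know how the other one owns its alphabet.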

This way there isn't any duplication in the core logic and each language can have a nice API.
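As for the shared DNA alphabet, the Rust-side wrapper could expose it through a lazily initialised static. The sketch below uses std::sync::OnceLock (available in the standard library since Rust 1.70) and covers plain Rust only; the dna_alphabet helper name is hypothetical, and how the Python wrapper obtains its own Py<Alphabet> handle is deliberately left out.

```rust
use std::collections::HashSet;
use std::sync::{Arc, OnceLock};

pub struct Alphabet {
    correct: HashSet<char>,
    allowed: HashSet<char>,
}

impl Alphabet {
    pub fn new(correct: Vec<char>, allowed: Vec<char>) -> Self {
        Alphabet {
            correct: correct.into_iter().collect(),
            allowed: allowed.into_iter().collect(),
        }
    }

    pub fn is_allowed_seq(&self, sequence: &str) -> bool {
        sequence
            .chars()
            .all(|c| self.allowed.contains(&c) || self.correct.contains(&c))
    }
}

/// Hypothetical helper: returns a handle to one process-wide DNA alphabet.
/// The alphabet is built on the first call and reused afterwards.
pub fn dna_alphabet() -> Arc<Alphabet> {
    static DNA_ALPHABET: OnceLock<Arc<Alphabet>> = OnceLock::new();
    Arc::clone(DNA_ALPHABET.get_or_init(|| {
        Arc::new(Alphabet::new(vec!['A', 'C', 'G', 'T'], vec!['N']))
    }))
}
```

Every call hands out a clone of the same Arc, so a shortcut such as a DNA constructor can reuse the alphabet without rebuilding it each time.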
