How do I model composable stochastic processes in rust?

Hi - I originally posted this question on Stack Overflow and was directed here.

I'm writing a family of Markov Chain Monte Carlo (MCMC) algorithms to use as inference methods for a program induction system by translating the MCMC algorithms implemented here in C++ into rust.

The rough dynamic of MCMC is that we have some process which randomly moves around a space of objects and generates a stream of the objects which it discovers. The randomness is essential. Also, this stream never terminates (though you can stop taking objects from the stream). More complex forms of MCMC can be defined as compositions of this basic process. For example, we might create a pool of threads that samples from each thread in a round-robin fashion.

My question: how do I translate these algorithms idiomatically? More specifically, how do I to think about this problem from a rustacean perspective? What are the right tools/techniques to bring to the job?

Here's what I've considered:

  1. A direct translation seemed awkward, because the C++ code makes heavy use of callbacks, and that doesn't seem particularly idiomatic compared to making a stream of samples lazily available to the user to use as they see fit.

  2. I then thought that iterators might make sense, but:

  • The objects I'm sampling are complex data structures representing mini DSLs. It'd be nice to make a reference available that can be cloned if the user decides it's a useful sample. Naively, the chain only keeps around a single sample at a time, and these samples are independent as opposed to being parts of some larger structure, so I'm not entirely sure how to do this. Is there a way?
  • These algorithms all need a source of randomness (i.e. &mut R where R: Rng), and access to a simple control structure for keeping track of various statistics. Baking these directly into whatever struct implements Iterator then means that I can't compose the algorithms in a way that would share the control structure or source of randomness, right? If not, how would I do this?
  1. Generators/Coroutines seem to solve these problems, but seem like a relatively fringe/new area of rust. I began to wonder if I was making things harder than they need to be.

  2. I currently provide the struct for each algorithm with a function similar to the following, where H is a hypothesis and C is the control structure:

    impl<C, H> MCMCChain<C, H> {
        // ...
        pub fn next_sample<R: Rng>(&mut self, control: &mut C, rng: &mut R) -> Option<Self>
        // ...
    }
    

    Ideally, next_sample would return Option<&H>, but this is what I have working right now.

How would you implement this?

I think the most essential feature of a Markov process is that it updates its own state and produces results. Seems like describing it using an Iterator is the best fit — and probably the simplest to get started with amongst the alternatives you enumerated.

As for sharing the source of randomness, you can wrap your (P)RNG into an Rc<RefCell<_>> (if you don't need to share it across threads) or an Arc<Mutex<_>> (if you do), and pass clones of the Arc around.

Thanks, this is also why I leaned toward Iterators.

In the past I've steered away from Rc<RefCell<_>> and its cousins since they incur runtime performance costs. Is the hit not really a reason to be concerned?

One other thing I'm not entirely sure how to get working with Iterators is passing the results back by reference to avoid unnecessary cloning.

Yeah, this will be difficult to do if you try to have the Iterator own the data and yield references into it. The alternative is to have a type X that owns the results, and then do this:

impl X {
    pub fn iter(&self) -> Iter<'_> {
        // ...
    }
}

where Iter<'a> is a type that holds a &'a X (or the moral equivalent) and implements Iterator<Item = &'a ResultData>. This is a common pattern, including in the standard library.

(I hope I'm understanding your issue correctly here -- apologies if I'm not.)

1 Like

My approach was to simply define a method that does one iteration of the MC:

    fn move_once(&mut self);

See code on GitHub

Then any accessors are separate methods. It's simple and flexible and easy to use.

1 Like

Bumping a reference count will increment an integer which, assuming you've used the item inside the Rc recently, will probably be in cache. Borrowing a RefCell will involve a normal integer increment/decrement and a branch. You are the only person who can say whether this is acceptable, but unless you are doing something extreme it'll probably blend into the noise.

Keep in mind that sharing your RNG between threads means it will no longer be deterministic, regardless of whether you seed the RNG or not.

Can't you just pass them by value (i.e. std::move() from C++)? Otherwise if each node in the stream is immutable but depends on the previous node, you can just wrap it in an Rc to get some sort of singly linked list.

You may want to take inspiration from rust-analyzer. Their needs will obviously be different to yours, but rust-analyzer needs to deal with lots of tree structures where memory usage and performance is a massive concern.

1 Like

I hadn't thought of this approach - this is a neat trick. I think you could tell a conceptually reasonable story about MCMC using this approach.

Yes - this is closest to what I'm doing now. It seems fairly flexible (e.g. I've figured out how to implement Iterator in terms of something like this), though it doesn't take advantage of any existing traits, etc.

Thank you for this - Having the details of Rc<RefCell<_>> spelled out concretely makes it easier to think about the tradeoffs of different approaches.

Thanks - that's a great idea!

1 Like

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.