Modeling embedded hardware in Rust: how to have multiple mutable references cleanly?


#1

I’m modeling a system that has a CPU, GPU, MMU, APU, and maybe other
things. The CPU will have mutable references to the GPU, MMU, and APU.
I’d also like the MMU to be able to call specific functions on the GPU
and APU. This comes into play when mapping memory to different
locations: the MMU takes care of that, and will dispatch to the GPU or
APU if the memory request falls inside those devices.

Here is how I modeled it using Arc and Mutex. I was wondering if there is a cleaner way of achieving what I did here, or if this is the correct method.

use std::sync::{Arc, Mutex};

trait MMU {
    fn read(&self, addr: usize) -> u8;
    fn write(&mut self, addr: usize, value: u8);
}

#[allow(dead_code)]
struct Cpu {
    apu: Arc<Mutex<Box<dyn MMU>>>,
    mmu: Box<Mmu>,
    gpu: Arc<Mutex<Box<dyn MMU>>>,
}

struct Mmu {
    // (start, end, device): an inclusive address range plus the shared
    // trait object that handles accesses in that range.
    map: Vec<(usize, usize, Arc<Mutex<Box<dyn MMU>>>)>,
}

impl Mmu {
    fn new() -> Mmu {
        Mmu { map: vec![] }
    }

    fn add_mapping(&mut self, start: usize, end: usize, cb: Arc<Mutex<Box<dyn MMU>>>) {
        self.map.push((start, end, cb));
    }

    fn read(&self, addr: usize) -> u8 {
        // See if the addr is in a mapped range, then call read on the
        // device that owns it.
        for (start, end, dev) in &self.map {
            if *start <= addr && addr <= *end {
                // No need to clone the Arc; locking through the shared
                // reference is enough.
                let d = dev.lock().unwrap();
                return d.read(addr);
            }
        }

        println!("Mmu.read: {}", addr);
        0
    }

    fn write(&mut self, addr: usize, value: u8) {
        // See if the addr is in a mapped range, then call write on the
        // device that owns it.
        for (start, end, dev) in &self.map {
            if *start <= addr && addr <= *end {
                let mut d = dev.lock().unwrap();
                d.write(addr, value);
                return;
            }
        }

        println!("Mmu.write: {} {}", addr, value);
    }
}

struct Gpu;
impl MMU for Gpu {
    fn read(&self, addr: usize) -> u8 {
        println!("Gpu.read: {}", addr);
        0
    }

    fn write(&mut self, addr: usize, value: u8) {
        println!("Gpu.write: {} {}", addr, value);
    }
}

struct Apu;
impl MMU for Apu {
    fn read(&self, addr: usize) -> u8 {
        println!("Apu.read: {}", addr);
        0
    }

    fn write(&mut self, addr: usize, value: u8) {
        println!("Apu.write: {} {}", addr, value);
    }
}

fn main() {
    let apu = Arc::new(Mutex::new(Box::new(Apu) as Box<dyn MMU>));
    let gpu = Arc::new(Mutex::new(Box::new(Gpu) as Box<dyn MMU>));
    let mut mmu = Box::new(Mmu::new());

    // If a memory read/write occurs at 0x300-0x400, then the
    // GPU should handle it.
    mmu.add_mapping(0x300, 0x400, gpu.clone());
    // If a memory read/write occurs at 0x100-0x200, then the
    // APU should handle it.
    mmu.add_mapping(0x100, 0x200, apu.clone());
    // Otherwise the MMU will handle it.

    let mut c = Cpu { apu, gpu, mmu };

    c.mmu.read(0);
    c.mmu.write(0, 5);

    c.mmu.read(0x150);
    c.mmu.write(0x150, 5);

    c.mmu.read(0x350);
    c.mmu.write(0x350, 5);
}

Rust play URL

I notice that my solution isn’t ideal because the apu and gpu stored in the Cpu are MMU trait objects, not the original Apu or Gpu structs. It would need to change so that the Cpu can call functions on Gpu or Apu that aren’t defined in the MMU trait.
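
One direction that could fix this (a minimal single-threaded sketch, so Rc/RefCell instead of Arc/Mutex; Gpu::render is a made-up stand-in for a non-trait method): keep the concrete handle and give the Mmu an unsized-coerced clone as a trait object:

use std::cell::RefCell;
use std::rc::Rc;

trait MMU {
    fn read(&self, addr: usize) -> u8;
    fn write(&mut self, addr: usize, value: u8);
}

struct Gpu;
impl Gpu {
    // A GPU-specific method that is NOT part of the MMU trait.
    fn render(&self) {
        println!("Gpu.render");
    }
}
impl MMU for Gpu {
    fn read(&self, addr: usize) -> u8 {
        println!("Gpu.read: {}", addr);
        0
    }
    fn write(&mut self, addr: usize, value: u8) {
        println!("Gpu.write: {} {}", addr, value);
    }
}

struct Mmu {
    map: Vec<(usize, usize, Rc<RefCell<dyn MMU>>)>,
}

fn main() {
    // Keep the concrete type around...
    let gpu: Rc<RefCell<Gpu>> = Rc::new(RefCell::new(Gpu));
    // ...and give the Mmu a clone coerced to a trait object.
    let mut mmu = Mmu { map: vec![] };
    mmu.map.push((0x300, 0x400, gpu.clone() as Rc<RefCell<dyn MMU>>));

    // Dispatch through the map still works...
    mmu.map[0].2.borrow_mut().write(0x350, 5);
    // ...and the concrete handle can still call non-trait methods.
    gpu.borrow().render();
}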


#2

Ha, I have nearly the same problem in my emulator. I have several chip emulators (Z80 CPU, PIO, CTC) which are owned by a “System” object, so I wanted to do this:

struct System {
    pub cpu: CPU,
    pub pio: PIO,
    pub ctc: CTC,
}

And there’s a trait called “Bus” (basic idea taken from the rustzx emulator), which would be implemented on System, since it knows all the chip objects. The methods on the Bus trait are called from inside the chip implementations when a chip needs to talk to the ‘outside world’ (e.g. other chips in the system). When I’m calling into the CPU, I’d like to pass &mut System, since it implements the Bus trait and the System object knows all the relevant child objects.

impl CPU {
  // ...
  pub fn step(&mut self, bus: &mut dyn Bus);
  // ...
}

But this would need to be called like this, and creates borrow checker errors:

  let mut system = System::new();
  ...
  system.cpu.step(&mut system);

As a workaround I’m putting the chip objects into RefCells and making all refs to the Bus non-mutable. In the Bus implementation I then need to .borrow_mut() the chip objects to manipulate them. If multiple mutable borrows happen, this now panics at runtime:

struct System {
    pub cpu: RefCell<CPU>,
    pub pio: RefCell<PIO>,
    pub ctc: RefCell<CTC>,
}
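
To make that concrete, here’s a minimal sketch of the workaround with hypothetical stand-in chip types (not the real rz80 code):

use std::cell::RefCell;

struct CPU { pc: u16 }
struct PIO { data: u8 }
struct CTC;

trait Bus {
    // Called from inside the chip implementations; note &self, not &mut self.
    fn pio_write(&self, data: u8);
}

struct System {
    cpu: RefCell<CPU>,
    pio: RefCell<PIO>,
    ctc: RefCell<CTC>,
}

impl Bus for System {
    fn pio_write(&self, data: u8) {
        // The exclusive-access check moves to runtime: this panics if
        // the PIO is already borrowed further up the call stack.
        self.pio.borrow_mut().data = data;
    }
}

impl CPU {
    fn step(&mut self, bus: &dyn Bus) {
        self.pc += 1;
        bus.pio_write(0x42); // talk to the outside world through the bus
    }
}

fn main() {
    let system = System {
        cpu: RefCell::new(CPU { pc: 0 }),
        pio: RefCell::new(PIO { data: 0 }),
        ctc: RefCell::new(CTC),
    };
    // Borrow only the CPU mutably; the rest of System stays shared.
    system.cpu.borrow_mut().step(&system);
}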

I think this has less overhead than your Arc<Mutex<Box< >>> approach, but it still has runtime overhead, and I’m still thinking that I’m missing something and that there must be a better solution.

PS: the work-in-progress code is here if you want to snoop around a bit: https://github.com/floooh/rz80


#3

Hm, so the core problem seems to be that there are several things, A, B, C, etc., which want to mutate each other? I think one possible approach is to introduce some kind of messaging infrastructure. That is, when A wants to mutate B, it just publishes a message “please mutate B”, and it’s the job of B to accept the message and mutate itself:

struct A;
struct B;

enum Message {
    TweakA, FrobB
}

trait Actor {
    fn act(&mut self, messages: &[Message]) -> Vec<Message>;
}

impl Actor for A {
    fn act(&mut self, _messages: &[Message]) -> Vec<Message> {
        // React to incoming messages, publish new ones.
        vec![Message::FrobB]
    }
}

impl Actor for B {
    fn act(&mut self, _messages: &[Message]) -> Vec<Message> {
        vec![Message::TweakA]
    }
}

struct System {
    a: A,
    b: B
}

impl System {
    fn run(&mut self) {
        let mut message_queue = Vec::new();
        // Runs forever: each tick delivers last tick's messages to
        // every actor and collects the messages they emit in turn.
        loop {
            let mut new_queue = Vec::new();
            new_queue.extend(self.a.act(&message_queue));
            new_queue.extend(self.b.act(&message_queue));
            message_queue = new_queue;
        }
    }
}

fn main() {
    let mut system = System {
        a: A,
        b: B,
    };

    system.run()
}

This also might be relevant: http://prog21.dadgum.com/189.html


#4

Almost, except that B doesn’t know how to react to a specific manipulation message; that must be user-provided code living outside of the A and B implementations (thus the Bus trait). This is basically the higher-level emulator stuff with the system-specific glue code (basically what differentiates a ZX Spectrum from a CPC, even though both are Z80 systems).

I think the core issue is that with such an ownership structure:

struct D {
  a: A,
  b: B,
  c: C,
}

impl Bus for D {
  // ...
}

D is the owner of A, B, and C, which need to be manipulated independently from each other. But D also implements the Bus trait, which on one hand contains all the code that manipulates A, B, and C, but must also be called from inside A, B, and C. So a method is called on a &mut of D (or rather one of its owned objects A, B, C), but it must also hand itself down to make the Bus trait implementation available. Thus, two &mut’s to the same thing.

RefCell works around this by moving the exclusive-access check to runtime, and given the alternatives I think it is probably the most efficient way apart from having everything checked at compile time. (As far as I understand it, nothing is on the heap, but there’s a smart-pointer-like guard object created at borrow_mut() that resets an internal borrow flag when the current scope is left.)
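
For illustration, a tiny sketch of that borrow-flag behavior (my reading of std’s RefCell, not code from this thread):

use std::cell::RefCell;

fn main() {
    let cell = RefCell::new(0u32);
    {
        // borrow_mut() returns a RefMut guard and sets the borrow flag...
        let mut guard = cell.borrow_mut();
        *guard += 1;
        // A second cell.borrow_mut() here would panic at runtime.
    } // ...and the flag is reset when the guard goes out of scope.

    // Nothing here lives on the heap; the cell sits wherever you put it.
    assert_eq!(*cell.borrow(), 1);
}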


#5

You can store a Box<dyn FnOnce(&mut B)> in the message.
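
For example (a minimal sketch; Message, B, and the field names are just illustrative):

struct B { value: u32 }

enum Message {
    // The message carries the mutation to apply to B.
    MutateB(Box<dyn FnOnce(&mut B)>),
}

fn main() {
    let mut b = B { value: 0 };
    let mut queue: Vec<Message> = Vec::new();

    // A publishes "please mutate B" without ever holding a reference to B.
    queue.push(Message::MutateB(Box::new(|b: &mut B| b.value += 1)));

    // The system drains the queue and applies each closure to B.
    for msg in queue.drain(..) {
        match msg {
            Message::MutateB(f) => f(&mut b),
        }
    }
    assert_eq!(b.value, 1);
}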


#6

PS: the ‘moving tank’ problem described in the link http://prog21.dadgum.com/189.html is usually solved in games not by processing game-object logic for each object autonomously, but by ‘inverting the problem’ and processing simple steps for all objects, e.g.:

  • all objects decide where to move (AI)
  • all objects move
  • collisions are checked for all objects
  • collision overlaps are resolved for all objects
  • all objects are rendered
  • etc…

This is much simpler (and much more performant, since those phases are tight loops over linear arrays and can be made very cache-friendly) than treating game objects as fully autonomous, individual actors with message processing.
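
A minimal sketch of such a phased loop (the types and field names are illustrative, not from anyone’s actual code):

struct Objects {
    pos: Vec<(f32, f32)>,
    vel: Vec<(f32, f32)>,
}

impl Objects {
    fn decide(&mut self) { /* AI: all objects pick new velocities */ }
    fn integrate(&mut self) {
        // One tight, cache-friendly loop per phase over linear arrays.
        for (p, v) in self.pos.iter_mut().zip(&self.vel) {
            p.0 += v.0;
            p.1 += v.1;
        }
    }
    fn resolve_collisions(&mut self) { /* check + resolve all overlaps */ }
    fn render(&self) { /* draw all objects */ }
}

fn main() {
    let mut objs = Objects {
        pos: vec![(0.0, 0.0)],
        vel: vec![(1.0, 0.0)],
    };
    // Each frame runs the phases in order, for all objects at once.
    objs.decide();
    objs.integrate();
    objs.resolve_collisions();
    objs.render();
}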


#7

I’m not exactly understanding how the tank problem applies to this. Could you give an example showing cpu/gpu/apu/whatever communicating with each other using this model?

As a side note, I changed Arc to Rc and Mutex to RefCell in my latest code.


#8

I’ve done some experiments here: https://github.com/floooh/rz80/pull/2


#9

My emulator code has now ground to a halt, and I was forced to think everything over. Thanks to all the different feedback and input (thanks @matklad et al), I’ve come up with an idea for how to untangle the processing and communication between emulated hardware components in an embarrassingly simple way (I’m expecting that I missed something); maybe this also helps the OP:


#10

Yes, this is what I had in mind when I was talking about “some kind of messaging infra”. And “snake bites its own tail” is a nice way to test whether a certain architecture is unrepresentable in Rust. This also reminds me of yet another blog post: http://250bpm.com/blog:24


#11

I’m still not convinced this will work well, because then the timings will be completely hairy to calculate.

The problem I see with just having a simple shared mutable data structure: let’s say we have a CPU instruction that writes to a memory address contained in register R. Say also this memory address could map to audio, video, or some other region of memory. The instruction expects to write to that memory address in a certain number of cycles, and expects it to be synchronous. I’m not entirely sure how having a shared struct solves this, because you would first need to write into the shared state that you want some memory address; then on the next CPU step, when it calls the memory mapper, it would get the memory and put it into the shared struct again; and it wouldn’t be until the following CPU step that you could resume the previous instruction.


#12

Yes, synchronizing the various components can be hairy. The way I solved this is probably not perfect but good enough for an 8-bit emulator:

  • the CPU has a step() function which executes 1 instruction and returns how many cycles it took, but it runs that instruction as fast as it can
  • this per-instruction cycle count is the basis for timing and synchronizing everything else in the system (for instance, CTC counters and timers are decremented by this cycle count and must carefully manage and preserve remainders)
  • for each complete emulator frame (usually around 1/60 second) the number of emulated CPU cycles is computed (for instance a 1.75 MHz Z80 runs around 30k cycles per frame), and the CPU and chips are stepped in a tight loop, as fast as possible, until this cycle count is reached; again the remainder is kept for the next frame so that no cycles ‘get lost’
  • the video decoding has to be synchronized line by line; again I’m not doing this in realtime, but whenever the CPU has executed enough cycles for one PAL line, I decode one line from the emulator video memory into the RGBA8 output texture
  • this was running fine, but I had problems synchronizing the audio, since this has to be cycle-perfect (the systems I’m emulating don’t have sound chips but change the output frequency through the CPU, with just a few cycles between frequency changes). I finally solved this by having the audio playback ‘rate-limit’ how fast the emulator runs: if the audio buffer playback is running slightly ahead or behind, the emulator will try to catch up or wait a few cycles

So the emulator itself isn’t synchronized with real-world ‘wallclock’ time cycle-by-cycle; instead it runs its 1/60 sec worth of per-frame cycles as fast as it can and then idles for the rest of the frame until the vsync kicks in. But every single component in the emulator is synchronized to the emulated system clock, and how fast this system clock runs in relation to the real-world clock isn’t really relevant.
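
For illustration, a rough sketch of that per-frame stepping loop (heavily simplified; Cpu::step() and the constants are stand-ins for the real code):

struct Cpu;
impl Cpu {
    // Execute one instruction as fast as possible; return its cycle cost.
    fn step(&mut self) -> i64 {
        4 // a stand-in value; real instructions vary in cost
    }
}

struct Emu {
    cpu: Cpu,
    leftover: i64, // cycle remainder carried between frames
}

impl Emu {
    fn run_frame(&mut self) {
        const CLOCK_HZ: i64 = 1_750_000; // e.g. a 1.75 MHz Z80
        // Cycle budget for one 1/60 sec frame, adjusted by last frame's
        // remainder so no cycles get lost.
        let budget = CLOCK_HZ / 60 + self.leftover;
        let mut spent = 0;
        while spent < budget {
            let cycles = self.cpu.step();
            spent += cycles;
            // ...decrement CTC counters/timers by `cycles`, decode one
            // PAL line once enough cycles have accumulated, etc...
        }
        self.leftover = budget - spent; // negative if we overshot
    }
}

fn main() {
    let mut emu = Emu { cpu: Cpu, leftover: 0 };
    emu.run_frame();
}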


#13

PS: I would probably handle memory accesses as a special case and let each chip read and write memory directly. Also, the Z80 doesn’t have memory-mapped I/O, so I may be missing some things there.


#14

I’ve just been reading this thread and haven’t looked at anyone’s code yet, but I thought I’d put out a couple of ideas. I think at least one of you is starting to get to the idea but isn’t there yet: the most important, and most overlooked, component in any computer is the clock signal, as it connects to everything and syncs everything else. Everything depends on it, including all the other busses you need to simulate. People like to think that the CPU is in charge due to the name, but it’s really just another component which has to do its thing to the beat of the clock cycles.

Once again, I haven’t looked at any code, but basically it seems like you are trying to emulate the hardware in software as closely as possible, which is a problem that has been solved before. I started out more as a hardware guy, so maybe that’s why it comes to me more readily than to others: basically you want to do what SPICE software was made to do. You probably aren’t looking to go all the way down to the transistor level, but good SPICE packages also let you work at the higher level of chips and busses, which is what you need. There’s open source SPICE software out there; you might want to look at how they solve some of these problems.

Also remember that Rust is made to make threaded things easier, so if you aren’t running each component as its own thread, you are missing out on letting it keep everything in sync for you.
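
For example, one way to wire that up (a hypothetical sketch using std::sync::mpsc; none of this is from the posted code, and thread-per-component is just the idea from this post, not an established emulator design):

use std::sync::mpsc;
use std::thread;

fn main() {
    // One channel per component; the clock fans ticks out to all of them.
    let (cpu_tx, cpu_rx) = mpsc::channel::<u64>();
    let (gpu_tx, gpu_rx) = mpsc::channel::<u64>();

    let cpu = thread::spawn(move || {
        for tick in cpu_rx {
            // advance the CPU by one clock cycle
            let _ = tick;
        }
    });
    let gpu = thread::spawn(move || {
        for tick in gpu_rx {
            // advance the GPU by one clock cycle
            let _ = tick;
        }
    });

    // The clock thread: every component does its work to this beat.
    for tick in 0..1_000 {
        cpu_tx.send(tick).unwrap();
        gpu_tx.send(tick).unwrap();
    }
    // Closing the channels lets the component threads finish.
    drop(cpu_tx);
    drop(gpu_tx);
    cpu.join().unwrap();
    gpu.join().unwrap();
}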

Anyway, hopefully that gives you guys some new ideas for how to solve the problems you are running into.


#15

Advancing the emulator by single cycles would make for a more correct emulation, but then you suddenly need to break the CPU’s (and every other chip’s) work down into single per-cycle steps, and that brings in a whole new set of complexity, since you need to store a lot more state in each chip emulation. This would be interesting if I wanted to do something like http://visual6502.org/ :)

The focus (at least for me) is to get a fast emulator first, and correctness only “as much as needed to run a system and as much software as possible”.

IMHO it’s better to treat one CPU instruction as one atomic thing from the outside and use this as the basis for how the rest of the system is advanced (MAME/MESS does the same, and I guess they put a lot of thought into finding the right middle ground between correctness and speed).


#16

I understand wanting to keep things simple and get them working; I’m just thinking about how to get the timing issues solved. Since the Z80 was CISC rather than RISC, different instructions took a different number of clock cycles to complete instead of just one, so by counting clock cycles you can solve a lot of the mut conflicts the same way the original chip designers did. You could also take advantage of design ideas like microcode, introduced later, which might give you a core that can be reused for other CPUs if you thought of doing something like that.

I’m mostly thinking that going one level deeper will turn some of the more complex problems into smaller, simpler problems to solve, even if the number of pieces ends up higher. Basic OOP idea: if a problem seems complex and hard to solve, your object model isn’t right and you need to break it down some more ;)
I also look at this as: you aren’t trying to solve the problem, the chip designers already did that; you are just making a different implementation of it using different tools.