Learning Rust by writing a Z80 emulator (blog post series)

Hi, I have started to dive into Rust (coming from C/C++) by writing a Z80 CPU emulator (which will hopefully turn into a full 8-bit home computer emulator one day).

The friendly person behind the @rustlang twitter handle suggested posting this here, so that there's a central place for gathering feedback. So here it goes:

The latest post is Z80 emulation in Rust, Milestone 1; this one is more about general CPU emulation and a bit light on Rust details.

This is the first post in the series: First Steps in Rust

I'm only allowed to post 2 links at once here, but if you scroll a bit down on the blog there's a "The Amazing Zilog Z80" post which might be useful as a general introduction to the Z80 CPU.

I'm looking for some feedback and Rust advice, mainly where I'm starting to run in the wrong direction, or where I have overlooked some obvious simple solution because my C/C++ brain gets in the way of Rust.

Thanks for reading and I'm looking forward to your suggestions :slight_smile:

-Floh.

10 Likes

@floooh Thanks for sharing. I'm learning Rust coming from Python. I'm not as far along as you, but I want to write an emulator in the future, probably a Z80 emulator, since it's the base for video-game systems like the SEGA Master System, Nintendo Game Boy (Color), SEGA Genesis (as a co-processor), Neo Geo, etc. I'll certainly be reading your posts.

So pumped to see this! A few random thoughts/comments:

The double cast ‘as u8 as char’ is a bit wacky, but Rust doesn’t seem
to allow casting directly from an i64 to a char.

This is because char is a Unicode scalar value, so not all i64 values are valid chars: only 0x0 to 0xD7FF and 0xE000 to 0x10FFFF inclusive. The code might be better written as

                let c = cpu.mem.r8(addr) as u8;
                addr = (addr + 1) & 0xFFFF;
                if c != b'$' {
                    print!("{}", c as char);
                }

Note the b there, which makes b'$' a byte literal (see the Rust Reference). That way you still get to see that it's a $, but you're just comparing the u8 values.
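If you do want to go from a wide integer to a char, a checked conversion makes the invalid ranges explicit. A minimal sketch using std::char::from_u32:

```rust
fn main() {
    // a valid scalar value converts cleanly
    let v: i64 = 0x41;
    let c = std::char::from_u32(v as u32).unwrap_or('?');
    assert_eq!(c, 'A');

    // the surrogate range 0xD800..=0xDFFF is not a valid char
    let bad: i64 = 0xD800;
    assert_eq!(std::char::from_u32(bad as u32), None);
}
```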

Also note the ‘manual 16-bit overflow fix’ in addr = (addr + 1) & 0xFFFF. By
default, integer overflow in Rust is a runtime error, and this is one possible
workaround (more on that later, this is basically also the reason why
registers are accessed via setter/getter methods, and not directly exposed
as data members).

I am not sure this fixes it; after all, it's the addr + 1 itself that will overflow, no? But then, reading below, addr isn't a u16, it's a larger integer type, so this isn't actually an overflow: you're treating an i64 as a u16? I found this stuff slightly hard to follow through just the snippets of code; I don't think that's a failing of the post, though, it's just something that's tough.
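A minimal sketch of what's going on (assuming addr is an i64, as clarified later in the thread): the addition happens in the wide type, so it can't overflow at 0xFFFF, and the mask just wraps the 16-bit address.

```rust
fn main() {
    let mut addr: i64 = 0xFFFF;
    // addr + 1 is 0x10000 in i64, nowhere near overflowing;
    // the mask wraps the 16-bit address space back to 0
    addr = (addr + 1) & 0xFFFF;
    assert_eq!(addr, 0);
}
```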

Rust doesn’t do cross-module inlining,

Rust should do cross-module inlining, but won't do cross-crate inlining without the annotation.
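A sketch of the annotation in question (step_pc is a hypothetical function name, not from the post): marking a small, hot function #[inline] makes it a candidate for inlining even when called from another crate, which can't be demonstrated in a single file, but the shape is simply:

```rust
// #[inline] emits the function body into the crate metadata so that
// downstream crates can inline it (without LTO, they otherwise can't).
#[inline]
pub fn step_pc(pc: u16) -> u16 {
    pc.wrapping_add(1)
}

fn main() {
    assert_eq!(step_pc(0xFFFF), 0);
}
```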

But all in all, the current speed (without much tweaking) of the Rust
implementation is really not too shabby, especially since there’s not
a single line of unsafe code in the whole thing.

:confetti_ball:

1 Like

@floooh Have you seen rustzx?

On the other hand, IBM's Z800 seems such a low hanging fruit :wink:

For a full dive into emulator tech, even if it is a bit overwhelming at first, I recommend studying MAME/MESS a bit (https://github.com/mamedev/mame), especially how they emulate different chips and connect them together via callbacks. All the hard work is in the chip emulation; once these work and are tested, building a complete system emulator is more or less connecting the right chips, adding input, and writing a video decoder function.
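The callback wiring idea could be sketched in Rust roughly like this (Pio and its port-write hook are illustrative names, not MAME's or the post's actual API):

```rust
use std::cell::Cell;
use std::rc::Rc;

// A chip exposes a hook; whoever builds the system installs a closure
// that routes the signal to the connected component.
struct Pio {
    out_cb: Box<dyn FnMut(u8)>, // called when the PIO writes its output port
}

impl Pio {
    fn write_port(&mut self, val: u8) {
        (self.out_cb)(val);
    }
}

fn main() {
    // stand-in for some other chip's input latch
    let latch = Rc::new(Cell::new(0u8));
    let sink = latch.clone();
    let mut pio = Pio {
        out_cb: Box::new(move |v| sink.set(v)),
    };
    pio.write_port(0x42);
    assert_eq!(latch.get(), 0x42);
}
```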

1 Like

Thanks a lot for the feedback! Yes, 'addr' is a wide integer (currently i64), so an integer overflow can't happen; the & 0xFFFF wraps it around at 64K. I should have mentioned that earlier; it's only late in the post that I talk about the integer type differences between the C++ and the Rust version.

I will also play around with the wrapping data types and see if this simplifies the code and how it affects performance (the C++ code has to widen and narrow integer types quite often, which probably also isn't very good).
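The wrapping data types mentioned here are presumably std::num::Wrapping; a minimal sketch of how an address register would look with it:

```rust
use std::num::Wrapping;

fn main() {
    let mut addr = Wrapping(0xFFFFu16);
    // arithmetic on Wrapping<u16> wraps at the type boundary
    // instead of panicking in debug builds
    addr += Wrapping(1);
    assert_eq!(addr.0, 0);
}
```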

I'll play around with the inlining and also LTO a bit. If small functions are already supposed to be inlined across modules, then the current behaviour is a bit strange :slight_smile:

I haven't looked at rustzx yet, but definitely will! My main inspiration was Krzysztof Kondrak's C64 emulator: https://github.com/kondrak/rust64

Writing an emulator to get into Rust seems to be quite popular :wink:

2 Likes

Are you using codegen-units? I believe that can prevent cross-module inlining.

If it's not on by default, then I'm not using codegen-units. Thanks for the hint though :slight_smile:

Note that if you want to use the native u16 type, you can use wrapping_add to force a wrapping addition that is well defined and will not panic at runtime. It's up to you whether you find that more readable, however, and I don't know if it has any runtime or optimization benefit.
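A minimal sketch of that:

```rust
fn main() {
    let addr: u16 = 0xFFFF;
    // wrapping_add is well defined at the type boundary and never panics
    let next = addr.wrapping_add(1);
    assert_eq!(next, 0);
}
```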

1 Like

This generally applies to any native integer type. I've been using it extensively for wrapped u8 operations in my emulator as well, and it will definitely come in handy for any oldschool emulator. wrapping_add (and wrapping_sub) is, as of now, the least obfuscated way to get safe overflow arithmetic.

1 Like

Wouldn't you be able to create a sugar macro to do the wrapping?
I can think of something like: wrapping!(addr + 1).

A sample of how to deal with operators:

macro_rules! expr_identity {
    ($e:expr) => { $e }
}
macro_rules! apply_op {
    ( $op:tt ) => { expr_identity!(10 $op 5)};
}

fn main() {
    println!("{}", apply_op!(/));
    println!("{}", apply_op!(*));
}
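A sketch of how the suggested wrapping! macro could look, using the same token-tree trick (note that an expr fragment can't be followed by +, so this matches single-token operands like addr and 1 only):

```rust
// Hypothetical sugar macro for the wrapping!(addr + 1) idea.
macro_rules! wrapping {
    ($a:tt + $b:tt) => { $a.wrapping_add($b) };
    ($a:tt - $b:tt) => { $a.wrapping_sub($b) };
}

fn main() {
    let addr: u16 = 0xFFFF;
    let next = wrapping!(addr + 1);
    assert_eq!(next, 0);
    let prev = wrapping!(next - 1);
    assert_eq!(prev, 0xFFFF);
}
```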

And about macros, take a look at this one, it may give you some ideas.

2 Likes

What I have done now is treat the 8-bit register bank as an array of u8's, and narrow and widen through casting in the setters and getters. 16-bit register accesses are still constructed with (r0 << 8) as i64 | r1 as i64 (that's where a union would be used in C++).
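A rough sketch of that getter/setter shape (the struct and method names are illustrative, not the post's actual code):

```rust
struct Registers {
    reg: [u8; 8], // 8-bit register bank, e.g. B, C, D, E, H, L, A, F
}

impl Registers {
    // widen on read, narrow on write
    fn r8(&self, i: usize) -> i64 { self.reg[i] as i64 }
    fn w8(&mut self, i: usize, v: i64) { self.reg[i] = v as u8; }

    // 16-bit access built from two 8-bit halves (hi, lo)
    fn r16(&self, hi: usize, lo: usize) -> i64 {
        ((self.reg[hi] as i64) << 8) | (self.reg[lo] as i64)
    }
    fn w16(&mut self, hi: usize, lo: usize, v: i64) {
        self.reg[hi] = (v >> 8) as u8;
        self.reg[lo] = v as u8;
    }
}

fn main() {
    let mut r = Registers { reg: [0; 8] };
    r.w16(0, 1, 0x1234);
    assert_eq!(r.r8(0), 0x12);
    assert_eq!(r.r8(1), 0x34);
    r.w8(2, 0x99);
    assert_eq!(r.r16(0, 1), 0x1234);
}
```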

This replaces the (& 0xFF)'s that I used before. I also don't need 'manual wrapping' anymore, since all operations that could overflow happen on i64.

This brought a little speedup. Together with some other minor tweaks (enabling LTO, moving some computations so they only happen where they are needed...), I'm now at about 900 MHz (vs. 850 before). I think this is now good enough that I can move on to the PIO and CTC chip emulation; the most interesting thing there will be the callbacks that connect the chips together (I've been running into some confusing problems with closure lifetimes, but I'll have to experiment and understand more).

That 6502 macro thing is pretty crazy, I'll reserve this for later :slight_smile:

I know that this may be pointless in practice for a Z80 emulator, but I would like more details about the performance costs in Rust. I think there is a valid comparison to be made, and I would appreciate more detailed insight into performance penalties when using Rust (even, maybe, whether something being idiomatic specifically causes a performance penalty or gain).

Personally I like (and need) this because I'm still not able to read Rust code and see the generated assembly. This knowledge takes time, lots of reading, and experimentation to build up, and it's something I think is missing in Rust community blogs, tutorials, etc.

So maybe a step after finishing your emulator would be a series of posts about optimizing it, trying to catch up with the C++ implementation. I may be biased, but the difference from 900 MHz to 1500 MHz is bigger than I hoped for.

1 Like

I found it best to experiment and output the generated assembly now and then, when I want to verify that it's doing what I expect it to do. If something is out of place I usually ask about it, like here: Code generation for iterator chains

1 Like

I agree, but reading about others' experiences helps a lot. See, your own account of it is a good example. :slight_smile:

Thanks for linking your post. Iterator chaining is not something I've used, and now I'm a bit more aware of how it performs (and that the issue may have been fixed).

A lot of the difference comes from the two different implementations (Rust is currently the slower "Algorithmic Decoder", and C++ is the fast-but-dumb Giant-Switch-Case). I'll try to get something more comparable in the end; at the moment I also find it a bit unsatisfactory that the two implementations can't be directly compared. I'll probably end up with a code-generation solution in Rust, where the 'Algorithmic Decoder' runs at build time and generates 'Giant Match' source code. This would be similar to what I do on the C++ side, where a Python script running at build time generates the C++ source with the bulk of the CPU instruction decoder in it.
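For reference, the 'Giant Match' shape such generated code would take might look roughly like this (hand-written here, with just two real Z80 opcodes, INC A and DEC A, as illustration):

```rust
// Tiny sketch of a match-based opcode decoder; a code generator would
// emit one arm per instruction instead of computing the decode at runtime.
fn decode(op: u8, a: &mut u8) {
    match op {
        0x00 => {}                        // NOP
        0x3C => *a = a.wrapping_add(1),   // INC A
        0x3D => *a = a.wrapping_sub(1),   // DEC A
        _ => panic!("unhandled opcode {:02X}", op),
    }
}

fn main() {
    let mut a: u8 = 0xFF;
    decode(0x3C, &mut a); // INC A wraps 0xFF -> 0x00
    assert_eq!(a, 0);
    decode(0x3D, &mut a); // DEC A wraps 0x00 -> 0xFF
    assert_eq!(a, 0xFF);
}
```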

What would be interesting to know is if and where Rust does hidden runtime checks (e.g. bounds checks for indexed array accesses? I think those can't all be done at compile time, right?).

Yes, but the compiler should optimize away the obvious cases, and there are tricks you can use where the compiler does not optimize them out.
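One such trick, as an unverified sketch: with a fixed power-of-two table size, masking the index can let the compiler prove the access is in range and elide the bounds check, which conveniently matches the 64K address space of a Z80.

```rust
fn main() {
    let mem = [0u8; 0x10000]; // 64 KiB address space, size known at compile time
    let addr: usize = 0x12345;
    // addr & 0xFFFF is provably < 0x10000, so the bounds check
    // can be optimized away
    let byte = mem[addr & 0xFFFF];
    assert_eq!(byte, 0);
}
```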

1 Like