Little Brainfuck interpreter in D/Rust for a benchmark


#1

I have found this page with various benchmarks, some of them are in Rust:

So I’ve written various improved versions of them, in Python, D and Rust.

Here I’ve just explained some of my improvements to the Brainfuck D version:

I have tried to port some of the D improvements to Rust, but I am not yet very good at Rust, and I think Rust offers fewer ways to improve this kind of code:

use std::io::{ stdout, Read, Write };

struct Tape {
    pos: usize,
    tape: Vec<u32>
}

impl Tape {
    fn new() -> Tape { Tape { pos: 0, tape: vec![0; 100_000] } }
    fn get(&self) -> u32 { self.tape[self.pos] }
    fn getc(&self) -> char { self.get() as u8 as char }
    fn inc(&mut self) { self.tape[self.pos] += 1; }
    fn dec(&mut self) { self.tape[self.pos] -= 1; }
    fn advance(&mut self) { self.pos += 1; }
    fn devance(&mut self) { if self.pos > 0 { self.pos -= 1; } }
}

struct Program {
    code: Vec<char>,
    bracket_map: Vec<usize>
}

impl Program {
    fn new(content: String) -> Program {
        let mut code = Vec::new();
        let mut bracket_map = vec![0; content.len()];
        let mut leftstack = Vec::new();
        let mut pc = 0;

        for c in content.chars() {
            match c {
                '+' | '-' | '.' | ',' | '<' | '>' => (),
                '[' => { leftstack.push(pc); },
                ']' => match leftstack.pop() {
                    Some(left) => { bracket_map[left] = pc;
                                    bracket_map[pc] = left; }
                    None => () // Error?
                },
                _ => { continue; } // Skip non-command characters (comments, whitespace).
            }
            code.push(c);
            pc += 1;
        }
        Program { code: code, bracket_map: bracket_map }
    }

    fn run(&self) {
        let mut pc = 0;
        let len = self.code.len();
        let mut tape = Tape::new();

        unsafe {
            while pc < len {
                match *self.code.get_unchecked(pc) {
                    '+' => tape.inc(),
                    '-' => tape.dec(),
                    '>' => tape.advance(),
                    '<' => tape.devance(),
                    '[' => { if tape.get() == 0 { pc = *self.bracket_map.get_unchecked(pc); } },
                    ']' => { if tape.get() != 0 { pc = *self.bracket_map.get_unchecked(pc); } },
                    '.' => { print!("{}", tape.getc()); stdout().flush().unwrap() },
                    _ => () // ',' ignored.
                }
                pc += 1;
            }
        }
    }
}

fn main() {
    use std::fs::{ File };
    use std::env::{ args };

    let file_name = args().nth(1).unwrap();
    let mut code = String::new();
    File::open(&file_name).unwrap().read_to_string(&mut code).unwrap();
    Program::new(code).run()
}

The D language currently (and unfortunately) doesn’t have computed gotos, but I’ve found a workaround. How do you write a fast interpreter in Rust (https://en.wikipedia.org/wiki/Threaded_code )? The final D version is much faster than this improved Rust version. Help and suggestions are welcome.

Edit: the main point of this thread is to ask for ideas for possible improvements to Rust for implementing simple but “fast enough” interpreters and FSMs. One possible idea is some restricted form of goto, like in my last D version, but more explicit, so the optimization is deterministic.


#2

One thing: you don’t need valid UTF-8 for a valid BF program, so using Vec&lt;u8&gt; instead of String seems appropriate.


#3

Take a closer look at the Rust code above: it actually uses a Vec&lt;char&gt; (32 bits for each Brainfuck command), which seems better for performance.


#4

I tried to optimize it further but didn’t get anything useful.

It is entirely CPU-bound, and the benefit of disabling bounds checking was minimal. I managed to write the code in a way that let the compiler use a branch table, but that was actually worse than before. It seems you need a branch for each case, not just one in the loop.


#5

Like in the last D version in my blog?

Thank you for your answer.

(What’s the chance of adding some kind of restricted jump tables to Rust to implement simple but sufficiently efficient interpreters?)


#6

Yes. The single branch actually doubled the execution time, from 15 seconds for the naive solution to 27 seconds, just by increasing the number of branch mispredictions.


#7

I remember seeing this code, which uses an enum instead of char (the enum should be only one byte in size, instead of 4 bytes for a char).


#8

I have done many benchmarks to produce the code in that blog post. Using 32 bits for each BF instruction seems to produce a faster interpreter compared to using 8 bits.


#9

Some ideas I would try:

  1. Look at the disassembly and see if something ‘weird’ is going on.
  2. Make sure the functions in Tape get inlined (mark them with #[inline]).
  3. Perhaps try a jump table instead of a match in the run code. That being said, I haven’t done that in Rust myself, so I’m not really sure how to do it.

#11

Those instructions are actually 16 bytes: 1 byte for the tag, 8 bytes for the jump position, and 7 bytes of padding. However, it is twice as fast because it doesn’t use a BTreeMap as a jump table.


#12

Right, I didn’t read the code closely.

The Rust code in my original post is faster, probably because handling tags at run time (and 16 bytes per instruction) is less efficient than the alternative of representing the jump table with a plain dynamic array.


#13

Sorry, I didn’t see your code. FYI, you’re using 12 bytes per instruction because of your bracket_map. And what do you mean by “handling tags at run-time”? Both are functionally equivalent.


#14

Right! I’ve tried to make bracket_map a Vec&lt;u32&gt;, but this doesn’t improve the performance of the program; it just decreases the memory used and forces me to add several casts in the code.

You’re right again. In the Rust code above there’s a match on chars from the Vec “code” of 32-bit chars, which is equivalent to the dynamic match on the enum tags.
The difference is that only the [ and ] instructions need to touch the bracket_map vector, while in the Rust program based on matching on the enum, the commands are larger, and this increased memory traffic slows down the interpretation. Apparently 32 bits is the best size for the commands.

Programmers who create Forth interpreters have found ways to increase performance by grouping the most common operations together. I have not yet tried this.


#15

This version with enumerated commands is about twice as slow; I don’t yet know why, and it seems weird. (If you replace all the commands with something like Increment([u8; 3]) to pad them, the performance is about the same as this version. But if you use a u8 for each command instead of a 32-bit char, the code gets only a little slower than the version in my original post.)

// Compile with: rustc -C opt-level=3 -C lto brainfuck.rs
use std::io::{ stdout, Read, Write };

#[allow(dead_code)]
enum Command { Increment, Decrement, Advance,
               Devance, JumpForward, JumpBack, Output, Input }

struct Tape {
    pos: usize,
    tape: Vec<u32>
}

impl Tape {
    fn new() -> Tape { Tape { pos: 0, tape: vec![0; 100_000] } }
    fn get(&self) -> u32 { self.tape[self.pos] }
    fn getc(&self) -> char { self.get() as u8 as char }
    fn inc(&mut self) { self.tape[self.pos] += 1; }
    fn dec(&mut self) { self.tape[self.pos] -= 1; }
    fn advance(&mut self) { self.pos += 1; }
    fn devance(&mut self) { if self.pos > 0 { self.pos -= 1; } }
}

struct Program {
    code: Vec<Command>,
    bracket_map: Vec<usize>
}

impl Program {
    fn new(content: String) -> Program {
        let mut code = Vec::new();
        let mut bracket_map = vec![0; content.len()];
        let mut leftstack = Vec::new();
        let mut pc = 0;

        for c in content.chars() {
            match c {
                '+' => code.push(Command::Increment),
                '-' => code.push(Command::Decrement),
                '>' => code.push(Command::Advance),
                '<' => code.push(Command::Devance),
                '[' => {
                    code.push(Command::JumpForward);
                    leftstack.push(pc);
                },
                ']' => {
                    code.push(Command::JumpBack);
                    match leftstack.pop() {
                        Some(left) => { bracket_map[left] = pc;
                                        bracket_map[pc] = left; }
                        None => () // Error?
                    }
                },
                '.' => code.push(Command::Output),
                ',' => (), // Command::Input ignored.
                _ => { continue; }
            }
            pc += 1;
        }
        Program { code: code, bracket_map: bracket_map }
    }

    fn run(&self) {
        use Command::*;
        let mut pc = 0;
        let len = self.code.len();
        let mut tape = Tape::new();

        unsafe {
            while pc < len {
                match *self.code.get_unchecked(pc) {
                    Increment => tape.inc(),
                    Decrement => tape.dec(),
                    Advance => tape.advance(),
                    Devance => tape.devance(),
                    JumpForward => { if tape.get() == 0 { pc = *self.bracket_map.get_unchecked(pc); } },
                    JumpBack => { if tape.get() != 0 { pc = *self.bracket_map.get_unchecked(pc); } },
                    Output => { print!("{}", tape.getc()); stdout().flush().unwrap() },
                    Input => ()
                }
                pc += 1;
            }
        }
    }
}

fn main() {
    use std::fs::{ File };
    use std::env::{ args };

    let file_name = args().nth(1).unwrap();
    let mut code = String::new();
    File::open(&file_name).unwrap().read_to_string(&mut code).unwrap();
    Program::new(code).run()
}

#16
 Performance counter stats for './brainfuck3 mandel.brainfuck':
     [...]
     1,083,271,106      branch-misses             #    2.43% of all branches        

The same problem I had yesterday, when I tried to optimize it.


#17

I don’t understand what I am seeing. Are you willing & able to explain better what’s going on?


#18

I profiled the code with perf. In the “optimized” version, LLVM lowers the match block to a branch table; in the naive implementation it uses some sort of binary search. The naive version is actually much faster, and when I looked for the cause I found that the naive code has just 0.3% branch mispredictions, whereas the optimized one has 2.4% or more.


#19

I know I read a good article about this; it’s not this one (http://eli.thegreenplace.net/2012/07/12/computed-goto-for-efficient-dispatch-tables), but it’s fairly close. It’s about C/C++, but the ideas may apply.


#20

Looks like we’ve hit an LLVM bug here.

I rewrote the same code in C and it was about as fast as the naive version. Then I added __builtin_unreachable() to tell LLVM that the default case is unreachable (which Rust does for exhaustive matches), and suddenly it was a lot slower.