Learn Assembly by Writing Entirely Too Many Brainfuck Compilers in Rust

Article: Learn Assembly by Writing Entirely Too Many Brainfuck Compilers in Rust
Source code: brainfuck_compilers

My motivations for writing the article, to quote a snippet from the intro:

I've procrastinated on learning assembly for a long time but after I recently picked up Rust and have been hanging out in a lot of online Rust communities it's given me the kick in the butt to dive in. Rustaceans use fancy words and acronyms like auto-vectorization, inlining, alignment, padding, linking, custom allocators, endianness, system calls, LLVM, SIMD, ABI, TLS and I feel bad for not being able to follow the discussions because I don't know what any of that stuff is. All I know is that it vaguely relates to low-level assembly code somehow so I decided I'd learn assembly by writing entirely too many brainfuck compilers in Rust. How many is too many? Four! My compile targets are going to be x86, ARM, WebAssembly, and LLVM.

It's undoubtedly my most ambitious article to date, as I sunk in over 100 hours doing all the research, coding, debugging, writing, and editing. I had a lot of fun and learned a lot. I hope you all enjoy it, and please feel welcome to ask any questions or share any feedback with me as I'm always looking for ways to improve the article. Thank you!


Nice article! I can answer some of the questions you have:

You may have noticed the new .p2align directive. Why does that need to be there? I have no clue, but if it's not there the program will segfault immediately.

ARM requires instructions to be aligned to 4 bytes. Your .p2align directive aligns it to 2^12 bytes which is more than enough. Without it you would get a 2 byte alignment as 30006 before it, which is not a multiple of 4.

The compiled aarch64 program is only a little faster than the interpreter, and much slower than the compiled x86_64 program, however that's to be expected since I'm running the aarch64 program on an x86_64 machine and using QEMU as an aarch64 CPU emulator. If I was running the aarch64 program on an aarch64 machine I imagine it'd be just as fast as the compiled x86_64 program.

Did you run the interpreter using QEMU too?

Re the WASM benchmark: It could be interesting to see the perf improvements of running wasm-opt.

Given how incredibly simplistic brainfuck programs are I'm surprised the LLVM IR optimizer still found so much room for improvement.

It was probably able to replace part of the memory accesses with register accesses. You could look at the optimized llvm ir to see if this is the case.

Apparently if the data in your program is aligned everything is faster and if it's unaligned it's either slow or completely unusable. But why? What is it with all this magical alignment stuff?

A processor divides several internal structures into fixed size bins. For example the cache consists of cache lines that on most x86 systems are 64 bytes big. These internal structures are always aligned to their own size. This means that it can avoid storing the least significant bits of the address and means that there is overlap if and only if the address is equal. If your data is not aligned, it would be necessary to access two cache lines to perform a single load or store. On x86 this is simply extra work, slowing things down, but on arm for simplicity reasons many processors don't support it at all and simply trap. The OS may be able to emulate support, but this is of course terrible for performance. x86 also doesn't allow it for vector loads/stores unless you explicitly use the unaligned instruction variant. Another reason that you must keep your data aligned is that LLVM exploits the alignment requirements rustc tells it to generate faster code. https://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html

If you have more questions, feel free to ask here or PM me on rust-lang.zulipchat.com. I have had to learn about these low-level concepts myself over the past two years as part of writing rustc_codegen_cranelift.


Tiny correction, but .p2align 12 aligns the section to 2^12 bits which I now understand is the same thing as .balign 4 which aligns the section to 4 bytes (since 2^12 bits = 4 bytes) so I think both would work. I think I'll update my article to use .balign 4 since it's easier to explain.

What I still don't get is why the ARM instructions must be aligned to 4 bytes and also why don't my program instructions begin at the address 30000 if I define a 30000 byte array before the instructions? Where is the extra 6 bytes in the 30006 coming from?

The interpreter is being compiled with the --release flag and is run locally on my machine. The compiled aarch64 programs are being run inside of a Linux docker container with QEMU.

I just tried using wasm-opt and surprisingly there was no improvement! Running time wasmtime mandelbrot.wasm and /binaryen/bin/wasm-opt -O4 mandelbrot.wasm -o mandelbrot4.wasm && time wasmtime mandelbrot4.wasm both consistently produce the same execution times, which come out to about ~1.5s on my machine. I convert the compiled .wat files to .wasm using wat2wasm but I don't think that tool performs any optimizations.

Ohh, you're probably right, that makes a lot of sense.

How does anyone learn this stuff? Is there like some secret book or website I don't know about or are people just straight-up reading the 5000 - 8000 page programming manuals on x86/ARM on Intel/AMD/ARM's websites?

Oops, sorry, I was wrong and you were right, .p2align 12 does actually align the section to 2^12 bytes and it's the same as .balign 4096 not .balign 4. However, the bad news is that I tried substituting .p2align 12 with .balign 4 and it didn't work... well, it kinda worked, but only for some of my test brainfuck programs but not for others, the only alignment that works for all the brainfuck programs in my test suite is .balign 4096 which is equivalent to .p2align 12. Now I'm more confused than ever... I'm using the same exact header boilerplate for all of the compiled aarch64 brainfuck programs and some work with .balign 4 but others require as much as .balign 4096 ???