Nice article! I can answer some of the questions you have:
You may have noticed the new .p2align directive. Why does that need to be there? I have no clue, but if it's not there the program will segfault immediately.
ARM requires instructions to be aligned to 4 bytes. Your .p2align directive aligns the code to 2^12 bytes, which is more than enough. Without it the code would only be 2-byte aligned, because the 30006 bytes before it are a multiple of 2 but not of 4.
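To make the arithmetic concrete, here is a minimal Rust sketch of the alignment check. The helper names `is_aligned` and `align_up` are made up for illustration, and I'm assuming the code simply starts at offset 30006; this is not how the assembler implements .p2align, just the same math.

```rust
/// Returns true if `offset` is a multiple of `align`.
/// `align` must be a power of two (hypothetical helper, not from the article).
fn is_aligned(offset: u64, align: u64) -> bool {
    assert!(align.is_power_of_two());
    offset & (align - 1) == 0
}

/// Rounds `offset` up to the next multiple of `align` (a power of two).
/// This is the padding effect an alignment directive has on the next item.
fn align_up(offset: u64, align: u64) -> u64 {
    assert!(align.is_power_of_two());
    (offset + align - 1) & !(align - 1)
}

fn main() {
    // Assumption: 30006 bytes precede the first instruction.
    let code_offset = 30006;

    println!("{}", is_aligned(code_offset, 2)); // true
    println!("{}", is_aligned(code_offset, 4)); // false, so not a valid instruction address

    println!("{}", align_up(code_offset, 4)); // 30008
    println!("{}", align_up(code_offset, 1 << 12)); // 32768
}
```

Aligning to 4 bytes would only need 2 bytes of padding here; aligning to 2^12 pads all the way up to 32768, which works but wastes some space.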
The compiled aarch64 program is only a little faster than the interpreter, and much slower than the compiled x86_64 program, however that's to be expected since I'm running the aarch64 program on an x86_64 machine and using QEMU as an aarch64 CPU emulator. If I was running the aarch64 program on an aarch64 machine I imagine it'd be just as fast as the compiled x86_64 program.
Did you run the interpreter using QEMU too?
Re the WASM benchmark: It could be interesting to see how much running wasm-opt improves performance.
Given how incredibly simplistic brainfuck programs are I'm surprised the LLVM IR optimizer still found so much room for improvement.
It was probably able to replace some of the memory accesses with register accesses. You could look at the optimized LLVM IR to see if this is the case.
Apparently if the data in your program is aligned everything is faster and if it's unaligned it's either slow or completely unusable. But why? What is it with all this magical alignment stuff?
A processor divides several of its internal structures into fixed-size bins. For example, the cache consists of cache lines, which are 64 bytes on most x86 systems. These internal structures are always aligned to their own size. That means the hardware can avoid storing the least significant bits of the address, and two entries overlap if and only if their addresses are equal. If your data is not aligned, a single load or store may straddle two cache lines, so both have to be accessed. On x86 this is simply extra work that slows things down, but many ARM processors don't support it at all, for simplicity reasons, and simply trap. The OS may be able to emulate support, but that is of course terrible for performance. x86 also doesn't allow it for vector loads/stores unless you explicitly use the unaligned instruction variant. Another reason to keep your data aligned is that LLVM exploits the alignment requirements rustc tells it about to generate faster code. https://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
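Here is a small Rust sketch of the cache line point. It assumes 64-byte cache lines, and the helper names (`cache_line_index`, `lines_touched`, `read_u32_le_at`) are made up for illustration. The last function shows one safe way to read a value from an offset that may not be aligned: copy the bytes out instead of dereferencing a misaligned pointer, which would break the alignment promise rustc makes to LLVM.

```rust
/// Assumed cache line size; 64 bytes is typical on x86 but not universal.
const CACHE_LINE: usize = 64;

/// Index of the cache line containing `addr`. Because lines are aligned to
/// their own size, this just drops the low 6 bits of the address.
fn cache_line_index(addr: usize) -> usize {
    addr / CACHE_LINE
}

/// Number of cache lines an access of `size` bytes starting at `addr` touches.
fn lines_touched(addr: usize, size: usize) -> usize {
    assert!(size > 0);
    cache_line_index(addr + size - 1) - cache_line_index(addr) + 1
}

/// Reads a little-endian u32 from any byte offset without requiring alignment,
/// by copying the four bytes into a properly aligned local array first.
fn read_u32_le_at(bytes: &[u8], offset: usize) -> u32 {
    let mut chunk = [0u8; 4];
    chunk.copy_from_slice(&bytes[offset..offset + 4]);
    u32::from_le_bytes(chunk)
}

fn main() {
    // An 8-byte access at an 8-byte-aligned address stays within one line.
    println!("{}", lines_touched(0x1038, 8)); // 1
    // The same access shifted by 4 bytes straddles two lines.
    println!("{}", lines_touched(0x103C, 8)); // 2

    // Offset 1 is not 4-byte aligned, but the byte-wise copy is always fine.
    let buf = [0x00, 0x78, 0x56, 0x34, 0x12];
    println!("{:#x}", read_u32_le_at(&buf, 1)); // 0x12345678
}
```

An access whose size is a power of two and whose address is aligned to that size can never cross a cache line, which is why natural alignment is the usual rule.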
If you have more questions, feel free to ask here or PM me on rust-lang.zulipchat.com. I have had to learn about these low-level concepts myself over the past two years as part of writing rustc_codegen_cranelift.