Pointer Micro-Optimization Help

I was hoping someone might have some thoughts on how to micro-optimize the following snippet. I'm working on a little programming language / VM, and one benchmark run through cargo flamegraph shows that reading the next bytecode instruction can account for up to 40% of the runtime. Here is a stripped-down example:

struct Vm(*const u8);

pub fn main() {
    // stand in for bytecode
    let v: Vec<u8> = vec![1, 2, 3, 4, 5, 6, 7];

    // the instruction pointer
    let ip: *const u8 = &v[0];

    // vm typically holds a bunch of other stuff
    let mut vm = Vm(ip);

    // stand in for execution loop
    loop {
        let instruction = vm.read_byte();
        if instruction == 4 {
            break;
        }
    }
}

impl Vm {
    /// in essence just read the current byte and move the pointer up
    /// one element and return the byte value
    #[inline]
    fn read_byte(&mut self) -> u8 {
        let byte = unsafe { *self.0 };
        self.update_ip(1);
        byte
    }

    /// just bump the pointer 1 byte
    #[inline]
    pub fn update_ip(&mut self, offset: isize) {
        unsafe { self.0 = self.0.offset(offset) };
    }
}

So in this example this could be replaced by an iterator, but in the real code we'll call update_ip with values other than 1, such as -15 or 23, when we encounter a jump instruction. I'm completely willing to accept that this is as fast as it's going to get, but I don't know x86 assembly well enough to tell whether the generated code could somehow be improved.

Some thoughts I had were to get rid of these functions and just write macros to ensure they are inlined at the call site. I also thought about adding another update_ip specialized to be explicitly self.0.offset(1). Any suggestions would be greatly appreciated 🙂
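
For concreteness, the specialized variant would be something like this (advance is a made-up name; ptr::add works here since the hot-path offset is always forward by one):

impl Vm {
    /// Hot-path special case: bump the pointer by exactly one byte.
    #[inline]
    fn advance(&mut self) {
        unsafe { self.0 = self.0.add(1) };
    }
}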

For further clarification: we can assume the bytecode is correct, as this functionality isn't exposed externally.


You can use https://rust.godbolt.org/ to see what these functions generate. Just remember to add -O and perhaps -C target-cpu=native to the flags.

#[inline] should be enough — no need for macros. You can use #[inline(always)] if you find a case where the compiler fails to inline.

These functions don't check bounds, so they are unsafe: update_ip(-9999); read_byte() would be undefined behavior and most likely crash. If they're not declared as unsafe, they create a loophole in Rust's safety guarantees.
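
Concretely, that would mean moving the unsafe keyword into the signatures and documenting the contract, something like (same bodies as above, just honest signatures):

struct Vm(*const u8);

impl Vm {
    /// # Safety
    /// The caller must guarantee the instruction pointer is valid for
    /// this read and that bumping it by one stays within the buffer.
    #[inline]
    unsafe fn read_byte(&mut self) -> u8 {
        let byte = *self.0;
        self.0 = self.0.add(1);
        byte
    }

    /// # Safety
    /// The caller must guarantee that ip + offset stays within the
    /// same bytecode allocation.
    #[inline]
    unsafe fn update_ip(&mut self, offset: isize) {
        self.0 = self.0.offset(offset);
    }
}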


Thanks for the tip about -O on godbolt, I'm definitely getting different results now. It seems my stripped-down example may be too simple. This is what is generated now: it looks like the compiler keeps the pointer in a register and just bumps it there, reading each byte and comparing it to 4.

example::main:
        push    rax
        mov     edi, 7
        mov     esi, 1
        call    qword ptr [rip + __rust_alloc@GOTPCREL]
        test    rax, rax
        je      .LBB0_4
        mov     dword ptr [rax], 101188101
        mov     rcx, rax
        add     rcx, 1
        mov     word ptr [rax + 4], 1285
        mov     byte ptr [rax + 6], 7
.LBB0_2:
        cmp     byte ptr [rcx], 4
        lea     rcx, [rcx + 1]
        jne     .LBB0_2
        mov     esi, 7
        mov     edx, 1
        mov     rdi, rax
        pop     rax
        jmp     qword ptr [rip + __rust_dealloc@GOTPCREL]
.LBB0_4:
        mov     edi, 7
        mov     esi, 1
        call    qword ptr [rip + alloc::alloc::handle_alloc_error@GOTPCREL]
        ud2

Is there a way to hint to the compiler that I want a field to be held in a register?

What is the source of the bytecode sequence? In particular, can the source be trusted, i.e. does the bytecode sequence always contain the necessary instruction to break out of the loop? When writing a minimal example, you shouldn't neglect to leave a comment explaining what your code does behind the scenes, most importantly any user-input validation performed before doing unsafe stuff, like skipping bounds checks.

So just for more context, here are links to the real implementations on GitHub:

update_ip: https://github.com/Laythe-lang/Laythe/blob/master/laythe_vm/src/vm.rs#L589-L591
read_byte: https://github.com/Laythe-lang/Laythe/blob/master/laythe_vm/src/vm.rs#L619-623
main execution loop: https://github.com/Laythe-lang/Laythe/blob/master/laythe_vm/src/vm.rs#L474-573

The bytecode is trusted, as it's produced by a compiler I've written for my language. Obviously there may be bugs in the compiler, but the assumption is that the bytecode it produces is free of errors. Here we avoid executing unless the compiler gives us an Ok: https://github.com/Laythe-lang/Laythe/blob/master/laythe_vm/src/vm.rs#L221-232

There is currently no public-facing way to feed in an arbitrary bytecode sequence and execute it. I only expose a repl method and a run method that takes a file path.
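
In sketch form the gate is roughly this (all names here are stand-ins, not the actual Laythe signatures):

struct CompileError;

// stand-in for the real compiler
fn compile(source: &str) -> Result<Vec<u8>, CompileError> {
    if source.is_empty() {
        Err(CompileError)
    } else {
        Ok(vec![4])
    }
}

// stand-in for the interpreter loop; it only ever sees bytecode
// that the compiler has already produced successfully
fn execute(bytecode: Vec<u8>) {
    let _ = bytecode;
}

fn run(source: &str) -> Result<(), CompileError> {
    let bytecode = compile(source)?;
    execute(bytecode);
    Ok(())
}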

Maybe I'm missing a point, but why are you reaching for pointers wrapped in "unsafe" simply to access the elements of a byte array?

Why not just use good old-fashioned array indexing:

let v: Vec<u8> = vec![1, 2, 3, 4, 5, 6, 7];
...
let mut pc: usize = 0;
...
let byte = v[pc];
pc += 1;

So another quick point I should make: currently each function holds its own bytecode. So over the course of an execution I'll be pointing into many different vectors, depending on which function I'm currently executing code in.

I didn't start with a raw pointer. I essentially started with what you have, pc: usize and normal indexing. After some benchmarking / profiling I thought, why not try get_unchecked, which gave me a speedup. After some more profiling I thought, why not go with a raw pointer, which again gave me a speedup. So I suppose the point is just maximum performance: if I can do my VM bookkeeping with less overhead, that leaves more CPU time for the actual script.
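
Roughly, the progression looked like this, simplified to free functions:

// 1. Plain indexing: bounds check on every read.
fn read_byte_indexed(code: &[u8], pc: &mut usize) -> u8 {
    let byte = code[*pc];
    *pc += 1;
    byte
}

// 2. get_unchecked: no bounds check, caller must keep pc in range.
unsafe fn read_byte_unchecked(code: &[u8], pc: &mut usize) -> u8 {
    let byte = *code.get_unchecked(*pc);
    *pc += 1;
    byte
}

// 3. Raw pointer: no slice at all, just read and bump the pointer.
unsafe fn read_byte_ptr(ip: &mut *const u8) -> u8 {
    let byte = **ip;
    *ip = ip.add(1);
    byte
}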

Personally I'd sacrifice a few percent in performance to keep use of "unsafe" out of my code and keep it more readable. But that is your call.

Am I to understand that each function has its own vector of bytecode? I can't see how that hangs together when calling from function to function. How does a call know which vector to select for further execution?


I definitely get that sentiment. If this were for work I'd have left it as an array index and never looked at it again. To answer your question, the steps are effectively these (sketched in code after the lists):

When we call a function

  • decode ByteCode::Call
  • load function constant
  • assert argument count
  • store current instruction pointer on current call frame
  • create a new closure object from the loaded constant, with its ip pointed at instructions[0]
  • push new call frame with new closure

When we return from a function

  • decode ByteCode::Return
  • save the value on top of the stack (the return value)
  • pop current call frame
  • load instruction pointer from what is now the current call frame
  • push return value onto stack
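
A rough sketch of that bookkeeping in code; the type and field names are illustrative, not the actual Laythe code:

struct Closure {
    instructions: Vec<u8>,
}

struct CallFrame {
    closure: Closure,
    ip: *const u8, // saved instruction pointer for this frame
}

struct Vm {
    ip: *const u8,
    frames: Vec<CallFrame>,
    stack: Vec<u64>, // stand-in for the value stack
}

impl Vm {
    fn call(&mut self, closure: Closure) {
        // store the current instruction pointer on the current call frame
        if let Some(frame) = self.frames.last_mut() {
            frame.ip = self.ip;
        }
        // point the ip at the callee's instructions[0]; moving the Vec
        // into the frame doesn't move its heap buffer, so this stays valid
        self.ip = closure.instructions.as_ptr();
        self.frames.push(CallFrame { closure, ip: self.ip });
    }

    fn ret(&mut self) {
        // grab the return value from the top of the stack
        let result = self.stack.pop().expect("return value on stack");
        // pop the current call frame
        self.frames.pop();
        // load the instruction pointer from what is now the current frame
        if let Some(frame) = self.frames.last() {
            self.ip = frame.ip;
        }
        // push the return value back onto the stack
        self.stack.push(result);
    }
}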

This can be a big performance issue, because switching vectors also means you may lose cache locality, i.e. your code may be full of cache misses, which can easily turn a 1 ns load (from the L1 data cache) into a 50+ ns load (from RAM). In addition, if you cross a memory page boundary, you can expect it to get even worse.

You basically want all your bytecode to live in a single contiguous memory region, just like the instructions of a regular compiled program. You can then take slices out of that big memory chunk to work with.
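
Something along these lines, where each function is just a range into one shared buffer (names are illustrative):

// Sketch: all bytecode lives in one buffer; each function is a range into it.
struct Program {
    code: Vec<u8>, // every function's bytecode, back to back
}

struct FunctionRef {
    start: usize,
    len: usize,
}

impl Program {
    fn function_code(&self, f: &FunctionRef) -> &[u8] {
        &self.code[f.start..f.start + f.len]
    }
}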


I don't see any big problem with using pointers and unsafe for the inner loop of a byte-code interpreter -- it could hardly get more time-critical. However, the core unsafe code needs to be kept as contained as possible. You'll also probably need #[inline] all over the place, and you need to be checking assembler output all the time to make sure that constructions and calls are optimising down as you expect (e.g. see cargo asm). As I'm sure you're aware, every single branch in critical sections needs examining to see if it can be eliminated.

One thing you could do is write a bytecode verifier to make sure that the assumptions made by the interpreter hold (e.g. all branches go to an instruction start, all instructions are complete, code doesn't run off the end, etc.). Perhaps you could have a big instruction-set spec that you pass to a macro that generates both the inner-loop instruction decoder and part of the bytecode verifier, to avoid copy/paste errors between the two sides. This could be a single-use macro_rules! macro that generates two function definitions at once: one that decodes and calls without checks, and one that decodes only and sanity-checks everything. That will help you be more confident that everything is sound despite the unsafe.
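
A minimal sketch of the two-functions-from-one-spec idea; the opcodes and handler names are made up for illustration:

struct Vm;

impl Vm {
    fn op_return(&mut self) {}
    fn op_call(&mut self) {}
    fn op_jump(&mut self) {}
}

macro_rules! instruction_set {
    ( $( $opcode:literal => $handler:ident ),* $(,)? ) => {
        /// Hot path: decode and dispatch without checks.
        fn dispatch(vm: &mut Vm, opcode: u8) {
            match opcode {
                $( $opcode => vm.$handler(), )*
                // sound only because the verifier already rejected
                // any bytecode containing unknown opcodes
                _ => unsafe { std::hint::unreachable_unchecked() },
            }
        }

        /// Verifier half: decode only and report validity.
        fn verify_opcode(opcode: u8) -> bool {
            matches!(opcode, $( $opcode )|*)
        }
    };
}

instruction_set! {
    0 => op_return,
    1 => op_call,
    2 => op_jump,
}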

I think the main concern some of us have is @jonnyboyC's incomplete knowledge about unsafe, in particular when it's safe to drop the unsafe keyword from function signatures.


Back to the optimization issue, now that I've looked at the original code:
I think the main cause of the performance issue is the inherent branch misprediction you get on every bytecode match. The CPU cannot possibly predict which branch is taken, as it is completely dependent on arbitrary user input, i.e. the CPU fails to prefetch the next instructions whenever they depend on the result of the match.

This is a catastrophe, because you have to imagine a CPU that doesn't ever prefetch instructions. A good real-life comparison is a car stuck in a traffic jam. The CPU, just like a car, needs time to accelerate until it hits maximum speed. Each time a branch is mispredicted, the CPU has to stop and wait until the previous instructions are finished. This costs a lot of time, because you match bytecodes at a huge frequency, so the CPU is constantly accelerating anew after each misprediction.

There is not much you can do about that, other than taking a similar approach to Java HotSpot, which is to compile the bytecode into native CPU instructions that don't need to be matched on. If the code is never executed more than once, you gain nothing from that, of course.

I didn't read everything, but I'll point out that this code:

    let ip: *const u8 = &v[0];

is problematic under the current candidate for Rust's formal memory model (Stacked Borrows). The correct way to obtain a pointer is v.as_ptr().
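
For the stripped-down example above, that would be:

    let ip: *const u8 = v.as_ptr();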


Thanks for the quick callout, good to know for the future. Is there a clippy lint to point this sort of issue out?

Good question. If there isn't one, it would be worth suggesting to clippy. That said, it's only incorrect if you use the pointer to access the other elements of the array; when you only access the first element, it is correct.

@jimuazu you would think I'd have known about cargo asm, but I actually didn't. Until now I've been passing flags to rustc to spit out asm, which is sort of useless since it's 2+ million lines. I went ahead and installed it because it looks super useful. I also like the idea of the extra verification tests to better assert that the generated code is actually safe. Right now I'm effectively doing black-box testing with a couple hundred scripts.


@Phlopsi do you know of any good tools or techniques to measure how frequently you're missing cache? From debugging I know that the Fun structs holding the bytecode tend to be allocated close together, but I'm not sure where exactly the Vecs holding the bytecode end up. It would definitely be good to get an idea of how much this separate-vec implementation costs me.

For instruction prefetching, I'd assume I essentially always get a miss at each top-level bytecode match, but inside each instruction I'd assume it's fairly predictable. Definitely something I should see if I can improve.

Valgrind (specifically cachegrind) does this.

I haven't had to use any measuring tools so far, so I'd prefer to let experienced users suggest tools for this specific task. You may also want to open a new thread specifically for this question. Perhaps someone has even already asked about this, and you can find a fitting thread with the help of the forum search.

As for techniques, the only way is to experiment and measure general performance.

I think the best way to ensure cache locality is to use a custom allocator that pre-allocates a big contiguous chunk of memory and hands that out. The advantage over the global allocator is that you know exactly who is getting that memory, giving you the control you need to avoid memory fragmentation.

One such custom allocation library is bumpalo, which uses bump allocation and might fit your needs. It comes with its own Vec type and is best suited for allocations that are usually deallocated together, because you can only deallocate the complete memory chunk at once. I guess that would be true for your VM, where the bytecode vectors are likely deallocated in immediate succession at the end of the application.
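
A minimal sketch of what that could look like with bumpalo (its collections module is behind the "collections" feature flag):

use bumpalo::collections::Vec as BumpVec;
use bumpalo::Bump;

fn main() {
    // One arena for all bytecode; everything allocated from it is
    // freed together when `bump` is dropped.
    let bump = Bump::new();

    // Per-function bytecode vectors carved out of the same chunk.
    let mut add_code: BumpVec<u8> = BumpVec::new_in(&bump);
    for byte in &[1u8, 2, 3] {
        add_code.push(*byte);
    }

    let mut main_code: BumpVec<u8> = BumpVec::new_in(&bump);
    for byte in &[4u8, 5, 6] {
        main_code.push(*byte);
    }
}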