Cached Interpreters

The most straightforward way to write a CPU emulator is with an interpreter: fetching, decoding, and executing each instruction as it is needed. The fastest way is with a dynamic recompiler, which translates blocks of guest code into native machine code.

A cached interpreter serves as an intermediate approach between these two extremes, and it turns out to be a decent springboard to writing a dynamic recompiler, as many of the same concepts apply to both. Cached interpreters are surprisingly fast and can offer significant speed gains.

A cached interpreter reduces the fetch->decode->execute sequence to just execute, removing most of the bottleneck of the traditional interpreter.

Overview

At a high level, a cached interpreter executes blocks of instructions at a time. A block is a linear sequence of instructions that ends either at a certain maximum number of instructions or at the next branch, whichever comes first.

Whenever a new, uncached address is encountered, the fetch and decode sequence is performed not only for the current instruction, but for every subsequent instruction in the block as well. The block is then executed. Very quickly, enough blocks accumulate that most code executes from cached blocks rather than having to fetch and decode instructions again. To support self-modifying code (or code that's copied to RAM and then executed), blocks must be marked dirty when their addresses are written to.

The reason blocks stop at a branch is that blocks are executed in linear order. You could, if you wanted, encode support for branching into your blocks, but in practice you get most of the benefit without the complexity by executing sequences of instructions linearly and stopping at any possible branch. The reason for a limit on the number of instructions to cache is also simple: if the interpreter runs past the end of a code region, it could get lost executing a sea of non-instructions unless some limit is placed on block length.

To implement such a system, you must have fast memory allocation for your blocks, O(1) lookup to see if a block exists for a given program counter (PC) address, and O(1) invalidation to mark blocks as dirty. I'll describe one possible approach to implementing these components that's effective for the Sony PlayStation, but note that there are many variations of this design you could use.

Bump Allocator

Having to call new or malloc each time we need a new block, and free each time we release an old block, would be far too slow.

The fastest and easiest memory allocator to use is a bump allocator. This involves allocating one very large region of memory up front (128MB - 512MB is ideal) and having the caching process request chunks of memory from it. Blocks aren't freed on invalidation, but simply marked as dirty. When the bump allocator runs out of memory, the entire cache is discarded and the allocator's free pointer is reset to the start, making its full capacity available again.

It may sound like a performance pain point to discard every cache block and start over, but in practice this rarely happens, and the emulation speed very quickly overtakes a standard interpreter again, to the point a framerate counter wouldn't even notice it happening.
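A minimal bump allocator along these lines might look like the following sketch. The names `acquire` and `reset` are illustrative, not from the original; a real implementation would also need to allocate the region with execute permission for the code-generation approach described later.

```cpp
#include <cstdint>
#include <cstdlib>

//sketch of a bump allocator; acquire() and reset() are illustrative names
//(a real code cache would allocate this region with execute permission)
struct BumpAllocator {
  uint8_t* base;      //start of the one large up-front allocation
  size_t   capacity;  //total size of the allocation
  size_t   offset;    //current high-water mark

  explicit BumpAllocator(size_t size)
  : base((uint8_t*)std::malloc(size)), capacity(size), offset(0) {}

  //hand out the next chunk; returns nullptr when the cache must be flushed
  void* acquire(size_t size) {
    if(offset + size > capacity) return nullptr;  //out of space
    void* result = base + offset;
    offset += size;
    return result;
  }

  //discard every cached block at once by rewinding the high-water mark
  void reset() { offset = 0; }
};
```

When `acquire` reports exhaustion, the caller clears the block lookup tables and calls `reset()`, exactly as described above: nothing is ever individually freed.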

Emitter

The next component needed is the emitter. When you fetch an instruction, you have to decode it and cache that decoded state. There are two ways you can do this.

Instruction Objects

If you want portable code, then create a class object that contains the decoded fields and a pointer to a handler of the instruction that should be executed.

struct Instruction {
  bool (*function)(Instruction*);  //handler; returns true when the PC changed
  u8 rd, rs, rt;                   //pre-decoded register fields
};

bool CPU_ADDI(Instruction *instruction);  //example handler declaration
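Executing a cached block then becomes a loop over the decoded objects, calling each handler until one reports that the PC changed. This is a sketch; the `executeBlock` name, its return value, and the stand-in handlers are illustrative:

```cpp
#include <cstdint>
using u8 = uint8_t;

struct Instruction {
  bool (*function)(Instruction*);  //decoded handler; true = PC changed
  u8 rd, rs, rt;                   //pre-decoded register fields
};

//execute one cached block of decoded instructions, stopping early
//as soon as a handler reports that the PC changed;
//returns the number of instructions executed (for illustration only)
int executeBlock(Instruction* block, int count) {
  int executed = 0;
  for(int i = 0; i < count; i++) {
    executed++;
    if(block[i].function(&block[i])) break;  //branch/exception: leave block
  }
  return executed;
}

//stand-in handlers; real ones would mutate the emulated CPU state
bool ordinaryOp(Instruction*) { return false; }  //fall through
bool branchOp(Instruction*) { return true; }     //PC changed: stop
```

The per-instruction indirect call is the remaining overhead this approach pays for portability.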

Code Generation

If you want more speed and don't mind your code being platform-specific, you can use code generation to build the cached blocks. This approach is nicer if you're planning to move to a dynamic recompiler later on.

Each time an instruction is fetched and decoded, native CPU code is generated and appended to the cached block in order to execute said instruction.

//this example is modeled on the Linux System V amd64 ABI
//a slightly different sequence is needed for the Windows x64 ABI
sub(rsp, imm8{0x08});  //amd64 requires 16-byte stack alignment
mov(rax, imm64{&CPU_ADDI});  //address of the handler we will call below
mov(rdi, imm8{rd});  //the parameters to pass to the function
mov(rsi, imm8{rs});
mov(rdx, imm8{rt});
call(rax);  //execute the opcode
add(rsp, imm8{0x08});  //fix the stack alignment
test(rax, rax);  //do we need to stop executing the block now?
jz(imm8{1});  //if not, skip over the ret to the next cached instruction
ret();  //the PC changed, so break out of the cached block of code
//the next instruction continues here
//the final instruction would just end with a ret() that is always executed

In the above example, an amd64 emitter class generates the equivalent machine code for each instruction as requested. The bump allocator's memory must be given execute permission (PROT_EXEC via mprotect on Linux, or PAGE_EXECUTE_READWRITE via VirtualProtect on Windows), after which you can jump directly into a block by casting it to a function pointer.

The bump allocator is especially nice here, because you may not know the exact number of bytes needed to emit code for a given block, and erring on the side of caution by rounding up would leave gaps in your cache that hurt performance. With a bump allocator, you can acquire a pointer to the start of the unused memory, and then mark the size of the emitted block as allocated after recompiling. Just treat the bump allocator as having 1MB less memory than it actually does, so that you don't have to keep checking for remaining space while emitting: the next call to cache a block will see that less than 1MB remains and can flush the entire cache and start over then. No cached block is going to be larger than 1MB.
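The emit-then-commit pattern described here might be sketched as follows. The interface names (`cursor`, `commit`, `shouldFlush`) are assumptions, not from the original:

```cpp
#include <cstdint>
#include <cstddef>

//sketch of the emit-then-commit pattern on top of a bump-allocated cache
struct CodeCache {
  uint8_t* base;      //start of the cache region
  size_t   capacity;  //total size of the region
  size_t   offset;    //bytes committed so far

  //point the emitter at the start of the unused memory
  uint8_t* cursor() { return base + offset; }

  //after emitting a block, mark the bytes actually used as allocated
  void commit(size_t bytes) { offset += bytes; }

  //pretend the cache is 1MB smaller than it is: flush when less than
  //1MB remains, so no single emitted block can ever overrun the region
  bool shouldFlush() const { return capacity - offset < 1024 * 1024; }
};
```

The emitter writes freely at `cursor()` without bounds checks, then `commit()`s exactly the bytes it produced, leaving no gaps between blocks.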

Block Termination

It's not enough to break out of cached blocks only on branches; normal instructions can also cause the program counter to jump somewhere else, for instance by raising exceptions such as arithmetic overflow. And if you want per-instruction interrupt testing (usually not required for a system as fast as the PSX), an interrupt that fires in the middle of a block could also cause the PC to branch to a new address.

As hinted at above, a simple way to do this is to have the instruction handler functions return a bool indicating whether we should stop executing further instructions. Branches always return true, and other instructions return true only if the handler raised an exception, or optionally if an interrupt should be serviced immediately.
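For example, a MIPS ADDI handler might look like the following sketch. The register file, the exception helper, and the simplified signature are all stand-ins (the real ADDI encodes a 16-bit immediate); on MIPS-I, ADDI raises an overflow exception and leaves rt unmodified when the signed addition overflows:

```cpp
#include <cstdint>

int32_t gpr[32];              //stand-in for the emulated register file
bool overflowRaised = false;  //stand-in for the CPU's exception path
void raiseOverflowException() { overflowRaised = true; }

//returns true if execution must leave the cached block
bool CPU_ADDI(uint8_t rt, uint8_t rs, int16_t immediate) {
  int32_t result;
  //signed-overflow check (GCC/Clang builtin)
  if(__builtin_add_overflow(gpr[rs], (int32_t)immediate, &result)) {
    raiseOverflowException();  //the PC now points at the exception vector
    return true;               //stop executing this cached block
  }
  gpr[rt] = result;  //rt is not written when the add overflows
  return false;      //fall through to the next cached instruction
}
```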

Trie Lookup

We're almost finished. The last piece needed is a fast O(1) lookup to see if a cached block exists for a given address, as well as a fast O(1) method to invalidate a block that's been written to. The PSX address bus is 32 bits wide. We can discard the upper 3 bits as part of the segment selector, and the lower 2 bits because MIPS instructions must all be word-aligned. But that still leaves 2^27 possible addresses that code could theoretically execute from, and a flat table of pointers that large would be infeasible.

The idea with a trie is to divide and conquer. Let's say we limit each block to a maximum of 64 instructions, and enforce that as a boundary so that the code emitter stops not only on branches, but also whenever the PC crosses a 64-instruction (256-byte) boundary, i.e. when (PC & 255) = 0. Now we can split the one large 2^27-entry array into two levels of 2^21 and 2^6 entries. The larger 2^21 array is allocated outside of the bump allocator and is always valid. Each of its entries points to a smaller 2^6 table, one slot per instruction address within that block. Most of the 2^21 array will be null pointers (meaning invalid), and only the areas where code executes will get filled in. On the PSX, this will mostly just be RAM from 0x000000-0x7fffff and the BIOS from 0x1fc00000-0x1fffffff.

Performing a lookup to execute a code block can now be done in O(1) time:

using Block = void (*)();
struct Pool { Block blocks[64]; };
u32 pc = PC & 0x1fffffff;  //remove the segment selector
auto &pool = pools[pc >> 8];
//if there's no 2^6 pointer table for this address yet, allocate one now
//(the acquired Pool memory must be zero-initialized)
if(!pool) pool = (Pool*)allocator.acquire(sizeof(Pool));
auto &block = pool->blocks[pc >> 2 & 63];
//if this exact PC address doesn't have a block cached yet, emit one now
if(!block) block = (Block)emit(pc);
block();  //begin executing the block

Similarly, invalidating a block on memory writes can be done in O(1) time:

pools[(address & 0x1fffffff) >> 8] = nullptr;

Since we have a bump allocator, we don't have to actually free the memory (and deal with the resulting fragmentation of the memory space to try to reuse it later); instead, we can just mark the entire block of 2^6 PC addresses as invalid at once.

This design has another very useful property: imagine a cached block that starts at PC=8 and runs to PC=48, and that memory at address 20 is then written to. With the write occurring in the middle of a cached block, it becomes hard to know which blocks the write may invalidate. Rather than tracking the length of each block and scanning all 2^6 slots on every memory write, we can take advantage of the forced 64-instruction alignment mentioned earlier in this section and simply mark the entire 2^6 pool as dirty by replacing its pointer with a null pointer.

Self-Modifying Blocks

There is one last edge case that will very rarely come up, but has been observed in the libcrypt code in Spyro the Dragon 2: what if, in a single block of 64 instructions, the block writes to itself to change an instruction, executes that instruction, and then writes to change it back? Invalidating the block doesn't help us because we're already inside the block executing the instructions.

Unfortunately, there's only one practical way to handle this case: instruction handlers return true not only for branches and raised exceptions (and possibly interrupts), but also when the current block has been invalidated.

if(exceptionRaised) return true;
if(pools[(PC & 0x1fffffff) >> 8] == nullptr) return true;  //our own block was invalidated
return false;

But we can do a little better by modifying our memory write invalidation line:

pools[(address & 0x1fffffff) >> 8] = nullptr;
if(((PC & 0x1fffffff) >> 8) == ((address & 0x1fffffff) >> 8)) cacheShouldBreak = true;

We share a cacheShouldBreak flag that is set not only for branches and exceptions/interrupts, but also when the currently executing block has been written to, so that the interpreter breaks out early and the block can be recompiled with the new instruction before it is executed.
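Putting the two pieces together, the memory-write path might look like the following sketch; the `invalidate` name and the global declarations are stand-ins for whatever your emulator already has:

```cpp
#include <cstdint>

struct Pool;                    //the 2^6 table of cached block pointers
Pool* pools[1 << 21];           //top-level lookup table (sketch)
uint32_t PC;                    //emulated program counter (stand-in)
bool cacheShouldBreak = false;  //checked after every cached instruction

//invalidate the cached block covering a written address in O(1)
void invalidate(uint32_t address) {
  pools[(address & 0x1fffffff) >> 8] = nullptr;
  //a write that lands inside the currently executing block must also
  //break out of it, so it can be recompiled before continuing
  if(((PC & 0x1fffffff) >> 8) == ((address & 0x1fffffff) >> 8)) {
    cacheShouldBreak = true;
  }
}
```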

Dynamic Recompilation

Now that you have a working cached interpreter, if you chose the native code emitter design, you can start inlining instructions. For instance, the example that calls ADDI(rd, rs, rt) could be changed to emit code that performs the equivalent of the ADDI instruction in place. Each block becomes quite a bit larger, but you remove the rather significant overhead of calling the ADDI handler with its three function arguments. You can do this piecemeal, one instruction at a time, perhaps even for only the hottest (most frequently executed) instructions.
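In the same emitter pseudocode style as earlier, the inlined version might look like this. This is a sketch: the assumption that rbp holds the base of the emulated register file, the dword_ptr addressing helper, and the omission of the overflow-exception path are all illustrative choices, not part of the original design.

```
//inline equivalent of the ADDI handler call shown earlier,
//ignoring the overflow-exception path for brevity
mov(eax, dword_ptr{rbp, rs * 4});  //load emulated rs (rbp = register file base)
add(eax, dword_ptr{rbp, rt * 4});  //add emulated rt
mov(dword_ptr{rbp, rd * 4}, eax);  //store the result into emulated rd
//no call, no stack fixup, no test/jz: execution falls straight through
```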

Once you've done this for all instructions, you can then start employing more sophisticated optimizations like register caching (keeping the emulated registers in native CPU registers) and unused flag elimination (for CPUs that use flags; the MIPS-I in the PSX does not).

Credit

Credit for most of this design goes to the Ares emulator by Near, which uses a similar design for its N64 core; I adapted it for the PSX and documented it here.