Delay Slots

The CPU has two types of delay slots: load delay slots and branch delay slots. These are an artifact of the 5-stage pipeline the MIPS-I architecture employs, and emulating the CPU in software without fully simulating the pipeline can be very nuanced.

Load Delay Slots

Whenever the LB, LBU, LH, LHU, LW, LWL or LWR instructions are executed, a fetch from the data bus is required. The fetched value is not written to the target register immediately. Instead, the write happens one cycle later. This causes side effects if the subsequent instruction accesses the register targeted by the load instruction.

Here is an example of a load operation followed by a direct register read:

liu  t0,1      ;load t0 with the value of 1
lw   t0,0(at)  ;load t0 with the value at memory address (at) (assume 0(at) = 2)
addu t1,t0,0   ;load t1 with the value of t0
addu t2,t0,0   ;load t2 with the value of t0

Without a load delay slot, you would expect both t1 and t2 to be loaded with the value 2. But because there is a delay slot after the lw instruction, t0 is not set to 2 until after the first addu instruction has completed. Thus, t1 is set to 1 here. The delayed load then writes t0 between the two addu instructions, and finally t2 is set to 2.

Here is an example of a load operation followed by a direct register write:

lw    t0,0(at)  ;load t0 with the value at memory address (at) (assume 0(at) = 1)
addiu t0,2,0    ;load t0 with the value of 2

Here, you might expect the delayed load that completes after the addiu instruction to overwrite t0 and change its value from 2 to 1, but that would be incorrect. The direct write to t0 in the addiu instruction cancels the pending delayed load, and the result after these two instructions is that t0 is set to 2.
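Both load delay examples above can be sketched in an emulator-style model. This is only a minimal illustration, not a description of any particular emulator: the class LoadDelayCPU, its step method and the idea of passing operands as small functions are all hypothetical names invented here. The sketch assumes a single pending load that is committed after the following instruction, unless that instruction writes the same register.

```python
class LoadDelayCPU:
    """Toy register file with a one-instruction load delay (hypothetical model)."""

    def __init__(self):
        self.regs = {}       # register name -> value
        self.pending = None  # (register, value) of a load not yet written back

    def step(self, dest, value, is_load=False):
        """Execute one instruction that writes `dest`.

        `value` is either a constant or a function of the register dict, so
        that source operands are read before the delayed load lands.
        """
        result = value(self.regs) if callable(value) else value
        in_flight, self.pending = self.pending, None
        if is_load:
            self.pending = (dest, result)   # becomes visible one step later
        else:
            self.regs[dest] = result        # ordinary writes land immediately
        # Commit the older load unless this instruction targeted the same register.
        if in_flight is not None and in_flight[0] != dest:
            self.regs[in_flight[0]] = in_flight[1]


# First example: the instruction after lw still sees the old t0.
read_case = LoadDelayCPU()
read_case.step("t0", 1)                     # t0 = 1
read_case.step("t0", 2, is_load=True)       # lw t0,0(at), memory holds 2
read_case.step("t1", lambda r: r["t0"])     # reads the stale value
read_case.step("t2", lambda r: r["t0"])     # reads the loaded value

# Second example: a direct write cancels the pending load.
write_case = LoadDelayCPU()
write_case.step("t0", 1, is_load=True)      # lw t0,0(at), memory holds 1
write_case.step("t0", 2)                    # addiu writes t0 directly
write_case.step("t1", 0)                    # any later instruction
```

In this model read_case ends with t1 holding 1 and t2 holding 2, while write_case keeps t0 at 2 because the pending load is discarded.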

Branch Delay Slots

Whenever the CPU performs a branch (whether conditional or not), the following instruction is executed whether the branch is taken or not. This is meant to avoid a bubble in the pipeline.

In the simplest case:

addiu t0,1,0   ;set t0 to 1
beq 1,1,.next  ;always branch to .next
addiu t0,t0,2  ;increment t0 by 2
.next:         ;t0 is now 3 here

Or with a jump instead:

addiu t0,1,0   ;set t0 to 1
j .next        ;always branch to .next
addiu t0,t0,2  ;increment t0 by 2
.next:         ;t0 is now 3 here

Or even with a conditional branch that is not taken:

addiu t1,0,0    ;set t1 to 0
addiu t0,1,0    ;set t0 to 1
beq t1,1,.next  ;t1 != 1, so the branch is not taken
addiu t0,t0,2   ;increment t0 by 2 even though the branch is not taken
.next:          ;t0 is now 3 here
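One common way to get this behavior without simulating the pipeline is to track two program counters, the instruction being executed and the one that follows it, and to let a branch only modify the second. The sketch below is an assumption-laden illustration rather than a reference implementation: run, the tuple-based instruction encoding, and the use of list indices in place of byte addresses are all made up for this example, and beq accepts an integer literal only to mirror the pseudo-assembly used above.

```python
def run(program):
    """Execute a list of pseudo-instructions with branch delay slots.

    Instructions are tuples: ("addiu", dest, src, imm), ("beq", a, b, target)
    or ("j", target). Targets are list indices (hypothetical encoding).
    """
    regs = {}

    def val(operand):
        # Integers are literals; strings are register names (unset reads as 0).
        return operand if isinstance(operand, int) else regs.get(operand, 0)

    pc, next_pc = 0, 1
    while pc < len(program):
        op, *args = program[pc]
        # Advance first: whatever sits at next_pc (the delay slot) always runs.
        pc, next_pc = next_pc, next_pc + 1
        if op == "addiu":
            dest, src, imm = args
            regs[dest] = (val(src) + imm) & 0xFFFFFFFF
        elif op == "beq":
            a, b, target = args
            if val(a) == val(b):
                next_pc = target   # takes effect after the delay slot
        elif op == "j":
            next_pc = args[0]      # takes effect after the delay slot
    return regs


# Index 3 plays the role of the .next label in the first two listings.
taken = run([("addiu", "t0", 0, 1), ("beq", 1, 1, 3), ("addiu", "t0", "t0", 2)])
jump = run([("addiu", "t0", 0, 1), ("j", 3), ("addiu", "t0", "t0", 2)])

# Index 4 plays the role of .next in the not-taken listing.
not_taken = run([
    ("addiu", "t1", 0, 0),
    ("addiu", "t0", 0, 1),
    ("beq", "t1", 1, 4),
    ("addiu", "t0", "t0", 2),
])
```

All three runs leave t0 at 3, matching the comments in the listings: the delay slot instruction executes regardless of the branch outcome.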

You might wonder what happens if a branch is placed inside the delay slot. Put simply, it works exactly the same way, with the end result being that the CPU executes one instruction at the first branch's target before jumping to the second branch's target.

An example is certainly warranted for this one:

addiu t0,1,0   ;set t0 to 1
j .next1       ;unconditionally branch to .next1
j .next2       ;branch within a delay slot
addiu t0,t0,2  ;this instruction won't be executed
nop            ;this instruction won't be executed
.next1:
addiu t0,t0,4  ;add 4 to t0 (runs in the delay slot of j .next2)
nop            ;this instruction won't be executed
.next2:        ;t0 is 5 at this point

When j .next1 is encountered, the CPU must still execute the instruction in its delay slot, which is j .next2. That second jump has a delay slot of its own, but because the first jump has already redirected the program counter to .next1, the instruction executed in that slot is addiu t0,t0,4 rather than the addiu t0,t0,2 that follows j .next2 in memory. Only then does the second jump take effect. The addiu t0,t0,2 instruction is never fetched, so we end with t0 set to 5, and execution resumes at .next2.

Exceptions and Interrupts

It is possible that exceptions or interrupts may occur inside branch delay slots, or even worse, that the delay slot in which an interrupt has triggered itself contains a branch. This information must be recorded so that the exception handler can react appropriately when returning. Some undocumented bits of the SCC come into play here.

The SCC Cause register (register 13) contains the following bit fields:

2-6 contain the exception code, or the event that triggered the exception
8-15 contain the bits for the interrupt(s) pending, if any
28-29 indicate which coprocessor, if any, triggered the exception
30 indicates if the previous instruction was in the delay slot of a taken branch
31 indicates if the previous instruction was in a branch delay slot (whether the branch was taken or not)

Bits 28-29 are not actually set based on which coprocessor triggered the exception. They are set from bits 26-27 of the last executed instruction, unless the exception type was a data bus error (code 7), in which case they are always set to zero.
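The bit fields listed above can be pulled apart with a few shifts and masks. The following is a minimal sketch; decode_cause and the dictionary keys are names chosen for illustration only and do not come from any official documentation.

```python
def decode_cause(cause):
    """Split a 32-bit Cause register value into the fields described above."""
    return {
        "exception_code": (cause >> 2) & 0x1F,      # bits 2-6
        "interrupts_pending": (cause >> 8) & 0xFF,  # bits 8-15
        "coprocessor": (cause >> 28) & 0x3,         # bits 28-29
        "branch_taken": bool((cause >> 30) & 1),    # bit 30
        "in_delay_slot": bool((cause >> 31) & 1),   # bit 31
    }


# A breakpoint (code 9) raised in the delay slot of a taken branch.
fields = decode_cause((1 << 31) | (1 << 30) | (9 << 2))
```

Here fields reports exception code 9 with both the delay slot and branch taken flags set, and no pending interrupts.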

The SCC Error PC (EPC) register (register 14) is set to the current program counter.

Next, if the previous instruction was in a branch delay slot (whether the branch was taken or not), 4 is subtracted from the EPC register. If that branch was taken, the undocumented SCC Target Address register is set to the address the branch was targeting. If it was not taken, the register is set to the next instruction to execute.

At this point, the program counter is set to 0xbfc00180 if the SCC Status register bit 22 (BEV) is set, or 0x80000080 if not. Next, if this was a breakpoint exception (code 9), then the program counter is overridden to 0x80000040. I am not presently aware whether this applies only when BEV is clear, or whether with BEV set it should be 0xbfc00140 (or some other value).
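The sequence described in this section can be summarized as a single function. This is a rough sketch under several assumptions, not a verified model of the hardware: enter_exception and all of its parameter names are hypothetical, the pending interrupt bits are left out, the Target Address value for the not-taken case is assumed to be pc + 4, and the breakpoint override is applied regardless of BEV simply because the correct behavior with BEV set is unknown.

```python
def enter_exception(code, pc, instruction, status,
                    in_delay_slot=False, branch_taken=False, branch_target=0):
    """Return the Cause, EPC, Target Address and new PC for an exception."""
    cause = (code & 0x1F) << 2                       # exception code, bits 2-6
    if code != 7:
        # Bits 28-29 mirror bits 26-27 of the last executed instruction,
        # except for a data bus error (code 7), where they stay zero.
        cause |= ((instruction >> 26) & 0x3) << 28

    epc = pc
    tar = (pc + 4) & 0xFFFFFFFF                      # assumed "next instruction"
    if in_delay_slot:
        cause |= 1 << 31                             # bit 31: in a delay slot
        epc = (pc - 4) & 0xFFFFFFFF                  # point back at the branch
        if branch_taken:
            cause |= 1 << 30                         # bit 30: branch was taken
            tar = branch_target

    bev = (status >> 22) & 1
    new_pc = 0xBFC00180 if bev else 0x80000080
    if code == 9:
        new_pc = 0x80000040                          # BEV=1 behavior is unknown
    return {"cause": cause, "epc": epc, "tar": tar, "pc": new_pc}


# A syscall (code 8) in the delay slot of a taken branch, BEV clear.
result = enter_exception(code=8, pc=0x80010008, instruction=0x0000000C,
                         status=0, in_delay_slot=True, branch_taken=True,
                         branch_target=0x80020000)
```

With these inputs the sketch yields an EPC of 0x80010004 (the address of the branch), a Target Address of 0x80020000, and a new program counter of 0x80000080.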