Computer Architecture© Tim Margush 2006
IJVM
As an example of a microarchitecture (where multiple instructions are used to interpret machine level instructions). we will study a subset of the Java Virtual Machine called IJVM. The task at hand is to create a machine that will execute Java Byte Codes (a bytecode is an IJV Machine Instruction consisting of an opcode, and sometimes an operand). It will do this by executing microinstructions that will drive the gates of the actual machine's hardware. A typical microprogram carries out a fetch execute cycle on the microinstructions under the direction of the machine instructions to be executed. The microprogram uses and modifies bits representing the state of the processor.
First we examine the CPU data path - that part of the CPU design that moves data through the ALU and registers. These components would be implemented using gates (or higher level components) as described earlier. The registers are 32-bit and are accessible only to the microarchitecture. Some of these correspond to registers at the ISA level.
Internally there are three 32-bit busses, A, B, and C. Most of the registers are able to output their contents onto B and accept the data on C. The A bus is very short, only connecting H (Holding register) to the ALU; we will be expanded this bus when we make revisions to this design. You can see that each register has an output enable and an input enable line. The output enable controls a tri-state buffer so the register contents can be "disconnected" or "connected" at will.
The ALU has 6 control lines, F0 and F1 select the function, ENA and ENB enable the A and B inputs, INVA inverts the A input, and INC provides an initial carry into the least significant bit during an add operation.
| F0 | F1 | Operation |
|---|---|---|
| 0 | 0 | A AND B |
| 0 | 1 | A OR B |
| 1 | 0 | ~B |
| 1 | 1 | A + B |
Since this system will use two's complement arithmetic, B-A can be computed by asserting INVA, ADD, and INC. The operation is B + (~A) + 1 (Subtracting is adding the negative, and the negative of A is calculated by flipping the bits and adding one). Various other combinations produce similar useful results.
The Holding register is loaded from any one of the other registers to setup a binary operation; By driving a register onto bus B, enabling the ALU B input and disabling the B input and ADDing (or ORing), the register value is passed unchanged through the ALU. It appears on the C bus and can then be stored in H.
The shifter unit can optionally shift the result from the ALU. The 2 control lines select one of the following functions, SLL8 (shift left logical 8 bits) and SRA1 (shift right arithmetic 1 bit).
The data path cycle, asserting one register onto bus A and Bus B, performing an ALU/Shifter operation, and storing the result into one or more registers, can be accomplished in one clock cycle. This is accomplished by splitting the clock cycle into discrete events or subcycles. These are not clocked, but each requires a positive time interval to be accomplished (due to delays associated with gate switching). The data path cycle is initiated on the falling edge of a short clock pulse. The events are described in sequence:
| Subcycle | Description of Activity | Time Interval |
|---|---|---|
| 1 | Control signals are asserted to drive the data path (ALU, shifter, etc) setting up the required data path. | delta w |
| 2 | Register data is asserted on A and B bus and travels to the ALU input . | delta x |
| 3 | ALU and shifter circuits produce results (shifter output stable). The adder may be the bottleneck here having a fairly long propagation time due to the carry computations. | delta y |
| 4 | Bus C data stabilizes and its data is available to be latched by one or more registers. | delta z |
| 5 | On the rising edge of the clock pulse, registers are loaded from C bus (and memory) and the registers asserting data on A and B are disconnected. Even though this may eventually cause the data on C to change, it will take some time for these changes to propagate through the system, and the correct data has already been latched by the registers receiving the results. | instantaneous |

Although this process is described as sequential, all of the data path signals are asserted at all times through the cycle. The time intervals indicate the actual time required for each of the activities to be completed once the inputs are stable, and may vary depending on the operation requested. The values on C will likely fluctuate as the outputs of the ALU and Shifter change. As long as the clock cycle is long enough to allow all of the subcycle steps to occur, a reliable circuit can be constructed. Components used in the construction of such a circuit have known propagation delays, and a careful analysis of the circuit can provide actual maximum delta values during the engineering process.
The only explicit timing signals are the falling and rising edges of the clock signal. The subcycle boundaries are determined by actual gate delays in the data path circuits.
The microarchitecture needs access to memory to get data for the operations and to store results. There are two interfaces to memory, one word oriented (32-bit) and one byte oriented. The registers involved in the word oriented transfer are MAR (Memory Address register) and MDR (Memory Data Register). The PC (Program Counter) register is used to read byte code instructions from memory. The byte is stored in the 8-bit MBR register.
The MAR always contains a word address (divisible by 4) while the PC is a byte address. Since the MAR's lower 2 bits are always zero, they are not actually stored in the register. Although byte addresses in the IJVM are 32 bits, the MAR needs only hold 30 bits. The lower 2 address bits are simply wired as 0. When the PC is used to fetch a byte however, all 32-bits of the address are needed.
The operands of instructions which will be read into MBR one byte at a time are sometimes part of a signed value, and other times part of an unsigned value. The microarchitecture shows two connections to the internal B bus. One is used for each of these situations. Enabling one of the connections zero-extends the byte to a 32-bit value; the other connection sign-extends the byte. The dotted box extending the MBR indicates this extension which is required when MBR is driven onto the 32-bit B bus.
The operation of the data path is controlled by a sequence of microinstructions. These will be fetched and executed by the microarchitecture in order to interpret the current ISA instruction. Each microinstruction will contain two parts. The first specifies the address of the next microinstruction. The second part encodes the control signals for the data path. Microinstructions are usually not arranged sequentially in memory. Implicit with every microinstruction is a JMP to the next microinstruction.
There are 29 control lines needed to control the data path. The value of each control line needs to be defined during each data path cycle. Each microinstruction must contain this information, either explicitly or in an encoded form. The microinstruction format we will use requires 36 bits. The first 12 are for microprogram control, the rest specify data path control signals. If a memory read is specified, the request is not issued until the end of the cycle, and the data will not be available until the end of the next cycle. The current cycle latches the address (or PC) value on the rising edge of the clock (end of cycle) and then asserts the read signal. This operation assumes the memory can deliver data within a single clock cycle, which may be true for an L1 cache hit, but probably not in the event of a miss. Even though the MDR and MBR require a whole clock cycle before use, their old values may be used while the read is taking place (during the intervening cycle). Thus a read can be initiated on every clock cycle, creating a two-stage pipeline. In addition, it is possible to utilize both word and byte reads simultaneously, however if a word is written and a byte read from one of its bytes, the results are undefined. The microinstruction includes three bits for memory control, WRITE, READ, and FETCH.

The first stage of the data cycle requires the selection of a register to be output on the B bus. Since only one should be selected at a time, a multiplexer is used to select one (of many) through an addressing scheme. The B bus section of the microinstruction is 4 bits long which, in conjunction with a multiplexer, allows the selection of one of 16 inputs for this bus, of which only 9 are needed. The other bits control the ALU/Shifter function and which registers latch data from the C bus on the rising clock edge.
The microcode processor (sequencer) is a mini CPU that fetches and executes microcode. It has an MPC (MicroProgram Counter) and MIR (MicroInstruction Register). In a microcode fetch-execute cycle, the microinstruction at MPC is fetched into MIR. The first 12 bits of the microcode instruction includes flow of control information. Microinstructions are not executed in physical sequential order, so each instruction contains the address of the next instruction. The 9-bit field indicates that the microcode storage (Control Store) can hold at most 512 K instructions. This is exactly the size of MPC (9-bits). The JAM bits are used to control how the next address data is used. A JAM value of 000 indicates that the 9-bits in the address field is in fact the actual address of the next microinstruction. In other words, MPC will be loaded with these 9 bits. The next fetch then gets the instruction at that address. Although the term "counter" is used, the MPC never counts.
The control store is functionally read-only memory. It can also be thought of as a combinational circuit. When given an address, it immediately asserts the data at that address as output.

Here is the sequence of operations, beginning at the start of a data path cycle.
The 2 flip-flops (N and Z) are needed to hold the ALU output past the rising clock signal (which removes the ALU inputs and might cause fluctuations in its output) until the address of the next microinstruction can be calculated.
The actual MPC "register" need not exist since the data needed to fetch the next instruction is completely determined by the logic connecting the JAM bits, N, Z, and MBR. A simple combinational circuit would suffice, thus the timing of latching the next address into MPC is really not important. Even though the register does not have to exist, documentation for programmers may show it as it is a useful concept in thinking about how the microarchitecture carries out its job. In this case, it would be a virtual register.
IJVM ISA
Next we turn our attention to implementing the ISA level through interpretative microcode. The macroarchitecture of the IJVM provides support for procedures which in turn support the concept of local variables. Each procedure call will reserve a block of storage on a stack for its local variables, called a Local Variable Frame. The LV register will point to the bottom of this frame on the stack. During the execution of the procedure code, the stack will often be used to hold operands and temporary results when computing expressions. For example, when computing a2+a3, the machine will place a2 and a3 on the stack, then pop both, add, and push the result. 
The part of the stack used for expression evaluation is called the operand stack. At any time, the stack may contain multiple operand stacks and multiple local variable frames.
IJVM Memory Model
Our model architecture is a machine capable of executing java byte code instructions. It uses a 4 GB memory. The instructions in the IJVM ISA do not address memory directly. Instead they access memory locations relative to a pointer. There are 4 memory areas
Note that CPP, LV, and SP are word addresses. Thus SP and SP+1 refer to two non-overlapping words. The byte addresses would be SP*4 and (SP+1)*4. PC, on the other hand, contains the address of a byte. PC+1 is the adjacent byte.
The IJVM Instruction Set (macroinstructions):

Each instruction has an opcode (shown in the column labeled Hex). The mnemonic might be considered an assembly language instruction for this architecture. The operands occupy one byte (specified as byte, const, or varnum), or two consecutive bytes (specified as disp, index, and offset). Logical and arithmetic operations always use the stack for operands and results. The branch instructions compute the new address by adding the displacement (signed) to the address of the current instruction.
These instructions are a subset of those used in the full JVM. The procedure call mechanism is simplified however to disallow dynamic linking. That is, a method call such as a.sample() will invoke the method in the class of the variable a only; it will never execute a method in a parent class. This allows the compiler to resolve the address of the actual procedure code. A procedure call/return consists of the following steps:
Push the reference to the object
The return address' location is calculated as ret_loc = stack[LV]These sequences of events that interpret the call and return ISA-level instructions will be carried out through a microprogram, as will all of the other IJVM instructions. The procedure call/return mechanism is an important characteristic of any processor architecture so it was covered in detail here.
The IJVM instructions listed above suggest an implicit assembly language for the IJVM ISA machine language. A Java program could be translated to IJVM Assembly Language and then assembled to byte code. Java compilers accomplish the translation in one step. The assembly language is only a virtual language (although versions of it have actually been implemented). Consider the following Java code and its possible translation to Virtual IJVM Assembly Language.
i = j + k; if (i==3) k = 0; else j = j - 1; |
![]() |
|---|
Tracing the execution of some of these statements illustrates how the stack is used to evaluate expressions. The ILOAD instruction places a variable's value on the stack. and ISTORE removes a word and stores it back into memory.
![]()
BIPUSH copies a constant onto the stack and the IF_ICMPEQ compares the two integers on the stack for equality. The operands are removed and status flags are set (N and Z) to indicate the result of the comparison. The next instruction will the ILOAD (if not equal) or the BIPUSH at L2 (if equal). The branch is taken when the values are equal so the assembly code shows the else part first.
![]()
The local variables, i, j, and k, are addressed as offsets into the Local Stack Frame. From the byte code versions of the instructions, it is apparent that i is at offset 1, j at offset 2, and k at offset 3. Note the displacements in the codes for the transfers. These are byte displacements calculated from the address of the opcode for the branch instruction. Verify that the correct displacements are 14 and 7 respectively.
Microprogramming
Each macroinstruction must be interpreted by a sequence of microinstructions. A MAL (MicroInstruction Language) can be used to describe these instructions. In fact, a MAL assemble can be designed to translate this language to the actual control store data.
A MAL statement indicates what must be done during a single clock cycle. It implicitly defines the state of all of the control signals for each cycle. Each instruction may specify which register is applied to bus B, the ALU/Shifter function, which registers will accept data from bus C, and what type of memory operations should be started.
Although we will write microinstructions in MAL in sequence, they may not actually be placed in the control store in physical sequential locations. Each microinstruction contains the address of the next microinstruction that should be executed. Unless a MAL statement specifies otherwise, the address field of the current instruction will be set to the address of the instruction sequentially following it in the MAL source code.
The assignment statement with a simple operation on the right will represent these actions. For example, SP = MDR or H = TOS = LV (indicating two receiving registers). The MBR register is specified by MBR (for the signed extension) and MBRU (for the unsigned extension). ALU operations can be specified in a simple expression on the right side of the assignment. For example: SP = SP + 1, TOS = H + MDR + 1, MDR = NOT H OR MDR, and H = MBRU << 8.
Memory operations can be specified as part of the same microinstruction by appending a simple memory operation: rd or wr for data memory, fetch for program memory. Remember that the memory operation is initiated at the end of the current cycle and the data (for read and fetch) will not be available until the start of the second cycle following.
To implement a flow of control that deviates from the default sequential ordering of the MAL statements, a goto command is introduced which explicitly specifies the address of the next microinstruction: goto label. Conditional goto's depending on N or Z are indicated by the familiar if-else syntax: if (N) goto label1; else goto label2. Because the hardware accomplishes this by OR'ing bit 9 of the Addr field with N, the following must be true: label1 == label2 + 256. Recall that N and Z are latched during the current instruction cycle, so some data path operation may also be specified as part of this instruction. To allow a value to be tested, but not stored, you may write N = register; or Z = register; along with the if clause. These two statements are equivalent, causing the indicated register to be run through the ALU setting both N and Z accordingly. No destination register is specified, so the C bus value is discarded. The other control flow possibility uses MBR to calculate the address of the next instruction using the value in MBR. The notation goto (MBR OR value), or simply goto (MBR) when value is 0, is used to indicate this operation. This address calculation could also involve the ORing of N or Z as well. The notation for this would look like goto (MBR OR value OR N). The value in MBR will have been determined by a fetch initiated at least two cycles earlier.
Implementation of IJVM using Mic-1
Microarchitecture Register Usage
| Register Name | Usage Convention |
|---|---|
| SP | Contains the address of the top word on the data stack |
| LV | Contains the address of the start of the current stack frame |
| CPP | Contains the address of the constant pool |
| TOS | At the end of each macroinstruction cycle, this contains a copy of the word found at the top of the stack It is also used as a temporary register |
| OPC | Temporary register |
| H | The holding register. Used for the left operand of an ALU operation. |
The heart of the microprogram is a loop to interpret the ISA fetch-execute cycle. This requires reading and interpreting a stream of bytes from the method area. The common element of this loop is PC = PC + 1; fetch. If we assume that this microinstruction is always executed with the opcode of the current instruction already in MBR, then it will be fetching the first byte of the current instruction's operand (or the next instruction's opcode) which will always be needed soon. This microinstruction also uses the computed goto (MBR) action to jump to a location in the control store after incrementing PC and initiating the fetch. Since MBR (at this time) contains an ISA instruction opcode, we can see that the opcode is actually the location of that instruction's interpreting micro(sub)program. Note that during the interpretation of an instruction, PC will not contain the address of the instruction. We will guarantee that when we return to Main1 that PC is the address of the opcode in MBR at that instant of time. This means that each section of the microprogram will need to initiate the proper fetch in time for this condition to be true. We will build a microinstruction to accomplish the above tasks and place it at a convenient location in the control store. For reference, attach the label Main1 to this instruction (the actual location will be determined later).
Main1: PC = PC+1; fetch; goto (MBR);
The first instruction to be interpreted has opcode 0x00; it is the nop instruction. Interpreting it is easy, just go back to Main1. This step will occupy one clock cycle allowing the next instruction opcode (whose fetch was initiated at the end of the previous cycle) to be latched into MBR. Thus, MBR will have the current instruction, and PC will contain its address when the microprogram returns to Main1.
nop1: goto Main1; #This instruction must be at address 0x000 in the control store.
To interpret the IADD instruction, we must add the top of stack values and place the result on the stack. Since this instruction is one byte, the value of PC and the fetched value that will shortly appear in MBR are exactly what they need to be when we return to Main1 - thus we can forget about that part of the task. Also, we guarantee that TOS already contains a copy of the top word on the stack, so we get both operands in one memory read operation. Here are the instructions needed to accomplish the addition:
| iadd1: MAR = SP = SP - 1; rd; | #Perform a POP and initiate the read of the next word on the stack. It will appear in MDR after the next cycle |
| iadd2: H = TOS; | #Operand 1 is transferred to H while second operand is read from memory |
| iadd3: MDR = TOS = MDR + H; wr; goto Main1; | #The sum is stored in TOS and MDR to be written to the top of the stack in memory. |
Note that iadd1 must be located at address 0x060 in the control store (the opcode of the IADD instruction). The subsequent instructions can be located anywhere. The other binary operations, ISUB, IOR, an IAND are similar. Next we examine the stack operations. Again, as these are single byte instructions, no more thought needs to be given to PC and MBR.
| dup1: MAR = SP = SP + 1 | #Increment stack pointer to new top and prepare address for writing top value |
| dup2: MDR = TOS; wr; goto Main1 | #TOS is a copy of the old top, so write it to the new top of stack location |
| pop1: MAR = SP = SP - 1; rd; | #Decrement stack pointer and initiate read of new top to synchronize TOS |
| pop2: | #Nothing to do; Must wait for read to complete |
| pop3: TOS = MDR; goto Main1 | #Copy the read value into TOS |
| swap1: MAR = SP - 1; rd; | #Read in the next to top value |
| swap2: MAR = SP; | #set address for the next write (to the top of stack location) |
| swap3: H = MDR; wr; | #Copy the new TOS value into H and write it out to the top of stack |
| swap4: MDR = TOS; | #prepare to write old TOS into the stack |
| swap5: MAR = SP - 1; wr | #Set the address for next to top location and write |
| swap6: TOS = H; goto Main1; | #Synchronize TOS |
The BIPUSH instruction is different from the previous instructions in that it is two bytes in length. The byte being fetched into MBR is the actual byte value that will be pushed onto the stack. We will need to increment PC and initiate the next fetch to comply with the preconditions that must be met at Main1.
| bipush1: MAR = SP = SP + 1; | #Get ready to write top of stack value |
| bipush2: PC = PC + 1; fetch | #Initiate fetch of next program byte (an opcode) |
| bipush3: MDR = TOS = MBR; wr; goto Main1 | #Sign extend the byte and write it into the stack, synchronizing TOS. The next instruction will be available in MBR at the end of this cycle |
The ILOAD instruction has two formats. The simpler uses a single byte index into the Local Variable Frame to specify the variable to be placed onto the stack. This allows only 256 variables (arguments and locals) to be named. The wide version allows a two byte index to address 65536 variables. Hopefully this will meet the needs of all procedures. The wide version is specified by a prefix operation code, WIDE. This indicates to the microarchitecture that 3 more bytes will be needed to process this instruction: opcode, indexHigh, and indexLow. Remember, the opcode is already on its way to MBR when the WIDE opcode is interpreted. Thus we need to examine how WIDE works.
| wide1: PC = PC + 1; fetch | #Initiate fetch of next program byte which will be indexHigh. We are waiting for the opcode of the instruction that will be followed by two index bytes instead of one. |
| wide2: goto (MBR OR 0x100) | #We implement the wide version of each opcode in a special area of the control store. i.e. 256+Opcode |
| iload1: H = LV; | #Get ready to calculate address of variable in the local variable frame |
| iload2: MAR = H + MBRU; rd | #MBR now has the index (and it is an unsigned byte); Initiate read of the word to be placed on the stack |
| iload3: MAR = SP = SP + 1; | #Update stack pointer and prepare to write to the new top of stack |
| iload4: PC = PC + 1; fetch; wr | #Initiate the fetching of the next opcode and since the data has been read into MBR, initiate the writing of it to the stack |
| iload5: TOS = MDR; goto Main1 | #Synchronize TOS |
The wide version looks like this:
| wide_iload1: PC = PC + 1; fetch; | #Waiting for high byte of index, initiate fetch of low byte |
| wide_iload2: H = MBRU<<8 | #Waiting for low byte of index, prepare to join the two bytes by shifting unsigned high byte to proper position. |
| wide_iload3: H = H OR MBRU | #Concatenate the two bytes of the index |
| wide_iload4: MAR = H + LV; rd; goto iload3 | #Calculate address of word and initiate the read. Since this step is equivalent to iload2, and the rest of the task is the same, we reuse the common code. |
The ISTORE and WIDE ISTORE instructions are similar. The LDC_W instruction has a 16-bit index, but this is added to CPP instead of LV. Otherwise, it is identical to WIDE ILOAD. It even reuses the final instructions of the ILOAD interpretation. The IINC instruction adds a constant to a variable in local storage. It is a three-byte instruction; the opcode is followed by an index and a constant.
| iinc1: H = LV; | #Prepare to calculate the address of the local variable while waiting for the index |
| iinc2: MAR = H + MBRU; rd | #Have index, calculate address of variable and initiate the read. |
| iinc3: PC = PC + 1; fetch | #Waiting for variable's value, initiate fetch of data to be added |
| iinc4: H = MDR | #Waiting for constant, copy variable's value into H in preparation for addition |
| iinc5: PC = PC + 1; fetch | #Get started fetching the next instruction |
| iinc6: MDR = H + MBR; wr; goto Main1 | #Calculate the sum (signed extension of MBR) and write it back to memory |
The next instructions are the branch instructions. Keep in mind that displacements (which are signed integers) are added to the address of the instruction's opcode. PC has already been incremented to the next byte when each of these microcode sections begin, so an adjustment must be made during the interpretation. The OPC (Old PC) register is used for this purpose.
| goto1: OPC = PC - 1 | #Waiting for high byte of displacement; Store the address of this instruction for later computation |
| goto2: PC = PC + 1; fetch; | #Initiate fetch of low byte of displacement |
| goto3: H = MBR << 8; | #Waiting for low byte; prepare for concatenation by sign extending high byte and shifting left one byte |
| goto4: H = H OR MBRU | #Assemble the 16-bit displacement (MBR is treated as unsigned here) |
| goto5: PC = H + OPC; fetch; | #Calculate address of next instruction and initiate the fetch of its opcode |
| goto6: goto Main1 | #We could not do this on the previous step as the opcode would not be ready in time, so this is a wait step. |
The three conditional jumps IFLT, IFEQ, and IFCMP_EQ are similar with additional details to determine whether the jump is taken.
| iflt1: MAR = SP = SP - 1; rd; | #We need to pop one word off an synchronize TOS to the new top; this initiates the read. Remember that TOS already contains a copy of the top of stack value. |
| iflt2: OPC = TOS; | #Here, OPC is used as a temp register to get the TOS value out of the way. |
| iflt3: TOS = MDR; | #This just synchronizes TOS to the new top of stack value |
| iflt4: N = OPC; if (N) goto T; else goto F; | #The old TOS value is run through the ALU to set N (and V) since they were destroyed when TOS was reset. If the value was negative, goto T(rue) where the transfer will be effected using the offset in the bytes following the opcode; The F(alse) section will skip over the offset. |
IFEQ is the same, except the Z flag is used. This next section of code is shared by the three conditional branches. It handles the TRUE and FALSE alternatives.
| T: OPC = PC - 1; goto goto2; | #This duplicates goto1, and then continues with the logic of the unconditional goto; the branch will be taken |
| F: PC = PC + 1; | #This sets PC to point to byte 2 of the displacement which we do not need so it is not fetched. |
| F2: PC = PC + 1; fetch | #Now PC is the address of the next instruction - the branch was not taken; the fetch is initiated |
| F3: goto Main1 | #Once again, we find an extra cycle is required to allow the fetch to complete before returning to Main1 |
Finally, the IFCMP_EQ instruction, which compares two words on the stack to decide to branch or not.
| ifcmp_eq1: MAR = SP = SP - 1; rd; | #This retrieves the second word from the stack. The first is already in TOS |
| ifcmp_eq2: MAR = SP = SP - 1 | #Decrement SP to pop the second word from the stack. We will need to read this to synchronize TOS, so MAR is also set. |
| ifcmp_eq3: H = MDR; rd; | #H is used to hold the first operand (just obtained from the stack). The second operand is still in TOS. Another read is initiated to be used to set the new TOS. |
| ifcmp_eq4: OPC = TOS; | #OPC is used here as a temp register. We need to set the new TOS value before making the conditional jump. |
| ifcmp_eq5: TOS = MDR; | #TOS is now synchronized with the top of stack value in memory |
| ifcmp_eq6: Z = OPC - H; if (Z) goto T; else goto F; | #This uses the ALU INVA and INC signals to subtract. The N and Z flags are set, allowing the conditional branch to be determined. |
All that remains are the procedure call and return statements.
| invokevirtual1: PC = PC + 1; fetch | #The high byte of the index is coming; this initiates the fetch of the next byte |
| invokevirtual2: H = MBRU << 8; | #MBRU is the high byte of the 16-bit unsigned index into the constant pool. |
| invokevirtual3: H = H OR MBRU; | #Concatenate the two bytes of the index into the constant pool |
| invokevirtual4: MAR = H + CPP; rd; | #Now we get the address of the method (header) and read it from memory |
| invokevirtual5: OPC = PC + 1 | #This saves the return address which will be stored in the stack a bit later |
| invokevirtual6: PC = MDR; fetch | #The address we read is the start of the method code (actually the 4 byte header). Since it need not be on a word boundary, we need to fetch it a byte at a time. This initiates the first fetch. |
| invokevirtual7: PC = PC + 1; fetch | #We need several bytes, so this overlaps the fetch. We are waiting for the high byte for the number of arguments. |
| invokevirtual8: H = MBRU << 8 | #In preparation for assembling the 16 bit argument count |
| invokevirtual9: H = H OR MBRU | #Now we know where LV must point (SP - H + 1) |
| invokevirtual10: PC = PC + 1; fetch | #Get started retrieving the next number - local variable count |
| invokevirtual11: TOS = SP - H | #TOS is used as a temp for the new LV value which is adjusted in the next step |
| invokevirtual12: TOS = MAR = TOS + 1 | #The MAR is also set as we will be writing a new value into this stack location (Link_ref) |
| invokevirtual13: PC = PC + 1; fetch | #Initiate the fetch of the last byte of the number_of_local variables. |
| invokevirtual14: H = MBRU << 8 | #Build another 16-bit unsigned value from two bytes |
| invokevirtual15: H = H OR MBRU | #H now is the number_of_local variables. This is used to calculate the end of the local variable frame |
| invokevirtual16: MDR = H + SP + 1 ; wr; | #This will be the stack address where the return address is stored and this is the Link_ref value so it is written to the first word of the local variable frame. MAR was set earlier in preparation for this action. |
| invokevirtual17: MAR = SP = MDR | #The new SP value is set to the location where the return address will be stored, effectively allocating local storage and a word for the return address. MAR is set in anticipation of writing the return address (from OPC) at this spot. |
| invokevirtual18: MDR = OPC; wr; | #Insert the return address just above local storage. |
| invokevirtual19: MAR = SP = SP + 1; | #Get ready to save the current LV just above the return address |
| invokevirtual20: MDR = LV; wr; | #Copy the current LV value into the stack; this will be the top of stack value when the method starts. |
| invokevirtual21: PC = PC + 1; fetch; | #Begin fetching the opcode of the first instruction in the method |
| invokevirtual22: LV = TOS; goto Main1; | #Set LV to point to the start of this methods local stack frame. We saved this value in TOS earlier. Note that we now go to Main1, but TOS does not have a copy of the current top of stack value. This is OK as the method starts with what is considered an empty stack, thus TOS can be undefined. |
| ireturn1: MAR = SP = LV; rd; | #This will read the Link_ref value (location of the return address) from the stack. It also deallocates the entire stack frame (except for the single extra word that is needed for the return value). |
| ireturn2: | |
| ireturn3: LV = MAR = MDR; rd; | #Now we are using the Link_ref pointer to read the return address. We also set LV to this value as a temporary register as we need to compute MAR + 1 (but MAR cannot output onto bus B) to get the old LV value from the stack. |
| ireturn4: MAR = LV + 1; | #Setup the address of the old LV value that is still in the stack storage area. |
| ireturn5: PC = MDR; read; fetch | #The return address was read and is now copied into PC so execution can continue at the instruction following the INVOKEVIRTUAL instruction. The fetch initiates the fetch of this instruction's opcode. Also initiate the reading of the old LV value. |
| ireturn6; MAR = SP; | #Prepare to store the return value which is already in TOS. Remember, it was the top of stack value just before the IRTEURN was executed. |
| ireturn7: LV = MDR | #Restore the original LV vale just read from the stack |
| ireturn8: MDR = TOS; write; goto Main1 | #Store the return value at the new top of stack location. TOS is already synchronized with it. |
That is all there is to implementing the ISA in interpretive microcode. Just a little over 100 instructions are needed to interpret 20 ISA-level instructions.
There are many design decisions that have to be made when deciding what microarchitecture is needed and how the microinstructions can work efficiently. To increase speed, we might make the hardware a little more sophisticated so we can do more in a single clock cycle, perhaps reducing the number of microinstructions needed to interpret many of the macroinstructions. We could also shorten the clock cycle if we optimize some of the circuitry, such as inserting a faster adder or removing the bus B decoder by adding individual control bits to the microinstruction format. One of the most effective ways to increase speed is to overlap processing of instructions. The microarchitecture could be executing instructions and fetching/reading/writing in parallel if some of the memory circuitry were separated from the main data path. Waiting for (and assembling) multibyte data on the program memory port is a significant bottleneck in the current architecture.
Unfortunately, adding hardware to increase speed results in more expense (sometimes). With integrated circuits, the cost increases when the technology is pushed to its limit, requiring higher density components. However design time and working out timing problems may also add cost.
Improving Performance
There are two ways to improve performance. One addresses implementation, the other, architecture. Implementation changes allow backward compatibility. Architectural changes may require new software to take advantage of the changes. These improvements can occur at different levels. Changing the microarchitecture may leave the ISA level unchanged, so this is an implementation change. Examples of architectural changes include adding new (ISA level) registers or adding new instructions. Either of these changes will require changes to software (at least recompilations with updated compilers) to be effective. The first improvements we will consider will all be at the microarchitecture level (implementation changes).
Optimization 1 - Tweak the microcode (Implementation)
Some of the interpreter sections have underutilized cycles. If we can resequence some of operations, and overlap some of the sections, we may be able to reduce the number of clock cycles needed to interpret a typical program. The easiest place to do this in our code is at the main loop. Rather than ending each of the sequences with a goto Main1, we try to jump ahead and transfer directly to the next section of code, to get started immediately on the next macroinstruction.
Main1: PC = PC + 1; fetch; goto (MBR);
In many cases, the next opcode is already in MBR before transferring to Main1. If we can issue the PC update and initiate the fetch before returning to Main1, we can replace that branch with the indirect branch using MBR. For example. Examine the two versions of POP:
| Original POP | POP merged with Main1 | ||
|---|---|---|---|
| pop1: MAR = SP = SP - 1; rd; | #Remove a word from the stack, read new top to synchronize TOS | pop1: MAR = SP = SP - 1; rd; | #Unchanged |
| pop2: | #This cycle is wasted | pop2: PC = PC + 1; fetch | #MBR already has the opcode for the next instruction. This initiates the fetch of the next next instruction (merging the actions previously done at Main1) |
| pop3: TOS = MDR; rd; goto Main1 | #MBR already contains the next instruction, but we return to main to update PC and initiate the next fetch, and perform the indirect branch | pop3: TOS = MDR; rd; goto (MBR) | #MBR will change at the end of the next cycle, after we have used it for the transfer |
If we could do this for every group of microinstructions, we could eliminate the Main1 label (and microinstruction) completely.
Optimization 2 - Add another bus (Implementation)
Having one bus connecting registers to the ALU makes the holding register (H) a bottleneck and sometimes requires extra shuffling of data. We can sometimes save cycles if we extend the A bus (making it a real bus) to allow any pair of registers to be input into the ALU. The ILOAD instruction can be shortened by one instruction using this approach.
| Original ILOAD | New ILOAD utilizing 3 busses | ||
|---|---|---|---|
| iload1: H = LV; | #H is used here as a temp register | iload1: PC = PC + 1; fetch | W#e must wait for MBR, so this is a chance to get started on the fetch |
| iload2: MAR = H + MBRU; rd | #MBR now has the index (and it is an unsigned byte); Initiate read of the word to be placed on the stack | iload2: MAR = LV + MBRU; rd | #LV is placed on the A bus avoiding the use of H |
| iload3: MAR = SP = SP + 1; | #Update stack pointer and prepare to write to the new top of stack | iload3: MAR = SP = SP + 1 | #No changes |
| iload4: PC = PC + 1; fetch; wr | #Initiate the fetching of the next opcode and since the data has been read into MBR, initiate the writing of it to the stack | iload4: TOS = MDR; wr; goto Main1 | #Done one cycle early! |
| iload5: TOS = MDR; goto Main1 | #Synchronize TOS | ||
We could also merge the Main1 instruction into this sequence, but it requires another microinstruction (there are no underutilized cycles), so does not affect the speed in any way. This approach requires a new microinstruction format. We need to add 4 more bits to control the decoder for the A bus (just as we controlled the B bus).
Optimization 3 - Add an instruction pre-fetch unit (MIC-2) (Implementation)
All of the instruction sequences require incrementing the PC and initiating a fetch; In some cases, this is repeated and delays are needed while the fetch is completed. An instruction pre-fetch unit would take over this repetitive task, autoincrementing the PC and fetching bytes from the instruction stream. A small shift register would be included as a queue for the prefetched data. Provision for building the 16-bit displacements from a pair of bytes in this queue would be included. This is a diagram of a possible IFU (Instruction preFetch Unit):

The prefetch unit keeps the original MBR and MBRU connections to the B bus (renaming them MBR1 and MBR1U), and adds 2 16-bit inputs, MBR2 and MBR2U. Note that this 16-bit access swaps the low and high bytes as required to correctly build the 16-bit value. It then sign extends (MBR2) or zero fills (MBR2U). Rather than fetching only one byte at a time (using PC to select an address), a second address register is added (IMAR) that provides word access; thus 4 bytes are fetched at a time. Both PC and IMAR are equipped with autoincrementers. PC can be incremented by 1 or 2 (to reflect the use of a16-bit operand). IMAR is incremented by 1 to select the next memory word.
Reading from MBR1 causes a shift by one byte in the instruction queue. Reading from MBR2 causes a shift of two bytes. When there are 4 empty bytes, the next word from memory is read into those bytes. As long as memory requests keep pace with requests, the IFU can fill sustained requests for data through MBR1 or MBR2 on consecutive clock cycles.
This table shows how ILOAD can be shortened when the IFU tasks are removed under MIC-2.
| iload1: MAR = LV + MBR1U; rd; | #Initiate the read of local variable using pre-fetched offset |
| iload2: MAR = SP = SP + 1 | #Update the stack pointer in preparation to store the word |
| iload3: TOS = MDR; wr; goto (MBR1); | #Sync TOS and use the pre-fetched opcode to start the next instruction |
The PC value can still be loaded from the C bus (required for transfers). It is apparent that IMAR is loaded in parallel with PC, however the lower two bits are always 0. The operation of the IFU and the task of keeping the byte stream synchronized with PC, especially when PC is loaded with a new value, is a fairly complex task. A FSM (Finite State Machine) is a good way to define its actions. The inputs to the FSM are the reads of MBR1 or MBR2, and the fetch of a word into the shift register. The state numbers correspond to the number of bytes left in the shift register.
When a word is fetched, the state of the machine would control which bytes of the register the word is deposited into. The diagram is incomplete in that it does not take into account the loading of PC with a new value. This could occur at any state, and should be indicated with a new arrow leaving each node. These arrows would all go to state 0 as the current stream would be invalid. In addition, when the new word is fetched, there would be another state change, to state1, 2, 3, or 4, depending on the last two bits of the new PC value. This would be indicated with 4 new arrows leaving state 0.
Note that an MBR2 request cannot be fulfilled when in state 0 or 1. The hardware may need to indicate availability of data to allow the rest of the system to wait until the request can be fulfilled, or careful timing of microinstructions to ensure sufficient time passes to fill the queue before using the data from MBR1 or 2.
Optimization 4 - Pipelined Design (MIC-3) (Implementation)
Adding an IFU created a basic parallel execution capability. A pipeline exploits parallelism during the decoding and execution of instructions. Pipelining does not decrease the time needed to execute a single instruction, but by starting on the next instruction before the previous one is finished, the throughput can be dramatically increased. The basic data path has three easily separated, and highly sequential components: the A and B busses, the ALU, and the C bus. Each of these needs to wait for the signals of the previous component to propagate through their circuitry before they can accomplish their task. We can separate these components logically using a set of latches to hold the output of each stage. Assuming each stage can be accomplished in roughly 1/3 of the original total path time, we can increase the clock speed by a factor of 3, and feed each instruction through this three stage pipeline. This utilizes each stage to its full potential, allowing it to process the result of the previous stage while the previous stage begins to process the next input.

We assume that the memory/IFU can keep pace with one instruction per clock cycle even with the clock speedup (possible if all of the data comes from L1 cache which is technically still part of the same circuitry as the CU), this design can realize a threefold increase in processing power. To implement this architecture, we can add control lines to the latches and call them registers. Each microinstruction will now be divided into three microsteps. The pipeline is effected by using three MIR's. At each clock pulse, the MIR forwards its value to the next MIR. MIR1 drives the A and B busses (the control bits for the ALU and C bus are ignored). MIR2 uses only the bits that control the ALU. MIR3 uses only the C bus control bits and memory signals. The address portion of the microinstruction would be used from MIR1 to enable the next microinstruction to be brought into the pipeline with each cycle.
| MIC-2 ILOAD code | Stage 1 (Load) | Stage 2 (ALU) | Stage 3 (Write back) |
|---|---|---|---|
| iload1: MAR = LV + MBR1U; rd; | B = MBRU1; A = LV; | ||
| iload2: MAR = SP = SP + 1 | B = SP; | C = A+B | |
| C = B+1 | MAR = C; rd; | ||
| (MDR = memory) | MAR = SP = C | ||
| iload3: TOS = MDR; wr; goto (MBR1) | B = MDR; | ||
| (memory = MDR) | C = B | ||
| TOS = C; wr; goto (MBR1) |
The first few cycles go smoothly, but in the third step of Stage 1, we encounter a problem. The memory read initiated at the end of iload1 (which occurs at microstep3) will not be available until after the next cycle. This situation is called true or RAW (Read After Write) dependence, and is sometimes referred to as a hazard. If nothing else can be done during these cycles, a stall occurs. The availability of data in the MBR (and the completion of the write operation) are indicated in italics to emphasize the need to wait for the completion of these tasks. In this pipelined model, either a stall in the pipeline occurs, or nop microinstructions are inserted into the microcode to introduce the necessary delay.
The pipelined implementation requires 7 microsteps to execute the 3 microinstructions. However, with an increased clock speed, this might be equivalent to 2.33 original microinstructions. Unfortunately, the stall in this sequence led to a fairly small speedup. Other instructions may see greater benefits with the pipelined approach.
We have also not considered the overlap with the previous and next instructions. The opcode will be available immediately after the third microstep, so stage 1 could begin executing that instruction assuming the TOS value is not immediately needed in that instruction. The uncertainty of what instruction comes next makes overlap rather difficult in this design. For example, what if the next microinstruction fetched started with A = TOS or B = MDR?
Optimization 5 - Deep Pipelining (MIC-4) (Implementation)
By breaking the pipeline into even smaller (and simpler) steps, the frequency of the clock signal can be increased even more. This type of design may realize greater efficiency over the simpler pipeline discussed above. The MIC-4 uses a seven-stage pipeline, and a different organization for the microcode. The main feature of this design is the use of an instruction decode unit. The decoder maps each opcode to a sequence of instructions in the micro-operation ROM. The decoder determines instruction length, relieving the microcode of this responsibility. Thus, offsets and displacements are immediately available to the data path. In addition, the structure of the microinstructions is simplified by assuming they always sequence (well, almost always). This design also separates the memory cycle to its own stage of the pipeline. Three steps are still used to drive the data path. Four MIR's are needed.

The decoding unit reads the Opcode into a register that selects the address lines of an internal ROM. The entry for that opcode tells the decoder if the instruction is 1, 2, or 3 bytes and the index in the micro-operation ROM where the sequence of microinstructions will be found. This index is forwarded to the third stage, the queuing unit. The microinstruction format now includes a final and goto bit in place of the longer (12-bit) address/JAM field. The layout and number of bits for each field might look like this.
| Final | Goto | ALU | C Bus | Memory | A Bus | B Bus |
| 1 | 1 | 8 | 9 | 2 | 4 | 4 |
These microinstructions are grouped in the queuing unit's ROM. The final bit is used to mark the last instruction in each sequence. The Goto bit is used for the conditional branches. The queuing unit looks up the first microinstruction (based on the index it receives) and copies it and all of the following (up to the one with the stop bit) into the instruction queue. The queuing unit may have to wait until there is room in the queue. When the copy is complete, it signals the decoding unit for the next index. One variation occurs when an instruction with a Goto flag is placed in the queue. The queuing unit must stall here until the branch is resolved. When the branch is resolved, the queue is reloaded. In some cases, this will require the IFU to fetch a different instruction which must feed back into the decode unit. Thus, branches taken will introduce a significant delay into the pipeline. This is an important topic that is treated in many different ways in attempts to reduce such delays.
Optimization 6 - Cache Memory (Implementation)
Memory performance is measured with two parameters. Latency is the time required to supply a value when a read operation occurs. Bandwidth measures the flow of data (bytes per time unit). Bandwidth can be increased by pipelining techniques, but this may increase latency since the requested value must pass through all stages of the pipeline to become available. Memories capable of fulfilling requests within one or two clock cycles are either small or very expensive. Cache memory attempts to minimize cost and maximize performance by providing a small, very fast memory to hold the most frequently accessed memory values.
A split cache provides separate areas for memory and instructions. This allows parallel access to the instructions and data. Processors taking advantage of this organization will have separate ports for each memory (two MAR's and two MDR's). Each cache also has independent access to main memory to write back changes and to supply requests for uncached data.
Current systems often utilize a multi-tiered memory system with two or more levels of cache memory. Level 1 cache is usually part of the processor chip. Other levels are provided through external connections, some of which are dedicated high-speed busses. In general, a level n cache will contain the full contents of the level n-1 cache's data.
Improvements in performance realized by cache memory is based on two principles of locality. Spatial locality refers to words with similar addresses. Temporal locality refers to memory locations accessed at similar times. Spatial locality is realized by instructions when sequentially executed or by fields in an object. Temporal locality is realized when procedure calls occur repetitively or a program utilizes structures such as parallel arrays. Spatial locality is handled by reading blocks of data; temporal locality is handled by discarding blocks that have not been recently used.
The common organizational tool used in cache storage is the cache line: 4-64 consecutive bytes of memory. Each line is numbered and a simple scheme exists for determining what bytes are in what line. For example, if 64 byte lines are used, cache line 5 would contain bytes 5*64 through 6*64-1. In other words, memory is organized as an array of cache lines. Data is transferred between cache and memory in line-sized blocks. The system must know what lines are in cache, and where they are located.
Direct-Mapped Cache
The cache is organized as an array of line objects. Each line object contains a valid bit, a tag field, and the line data. The valid bit indicates whether that line is being used. The tag identifies the location in memory corresponding to this line. The data is of course, the current data for that line of memory. If the line data changes, it must be stored back into memory before being discarded.
The term direct-mapped comes from the logical organization of this cache. Each byte of memory is mapped directly to one location in the cache. Each cache data byte corresponds to one of n different memory locations, scattered across memory at regular intervals (the cache size). As an example, consider the 32-bit address 0x00457824 and a direct-mapped cache of 128K organized as 4096 32-byte lines. The address is broken into fields:
| 31-17 (Tag value) | 16-5 (Cache line number) | 4-2 (Word offset within line) | 1-0 (Byte offset within word) |
| 0x0022 | 0xBC1 | 0x01 | 0x0 |
The tag line is determined by 12 bits of the address. This allows direct selection of one of the 4096 (212) cache lines. This line can correspond to 32768 (215) different blocks of memory. The tag field (15-bits) would tell which one this is. There are 8 words in this line, so 3 bits suffice to select the word. If the address is not a multiple of 4, the last two bits indicate the byte number within the word. Alternatively, you could view bits 4-0 as a byte address (0-31) in the selected line. When a memory request for this address is made of the cache, it is a hit if the selected line (line number 0xBC1) is valid and the tag value matches the tag field of the address (0x0022). If these conditions are met, the data (a word) is found in the line at word offset 1.

If a miss occurs (the line is not valid or has a different tag value) the cache controller must load the line from memory (storing the current line first if it is valid and if it has changed).
This cache organization is very simple and fast. Compilers can maximize the use of this type of cache by placing concurrently used data at specific offsets (not at multiples of the cache size). The cache may suffer performance degradation if a large array happens to be used in a way such that successive element accesses require the same physical cache line.
Set-Associative Cache
This uses a closed-table hash-based storage. The direct mapping technique above is a hash-based storage, where the Line number extracted from the address is used as the hash value. Collisions are handled by replacing the data (i.e. there is no collision resolution strategy - we simply store a new value in the table). An n-way set-associative cache provides n possible cache lines at each hash location. Checking for a hit requires a sequential search of the n lines at the selected location. Usually only a 2 or 4-way associative cache is needed to see maximum gains in performance and this does not appreciably slow down the access. However, the choice of which line to be discarded when a new line must be loaded presents more interesting variations. One common choice is the LRU (least recently used). This requires keeping track of the order of access, so the LRU line can be easily selected.
Memory update
Whatever cache organization is used, the problem of updating actual memory must be addressed. When a word is written to memory, it can be stored just in the cache (write deferred or write back), written to memory only (invalidating the associated cache line), or updated in both locations (write through). Of course, if the word written is not in cache, a choice to write directly to memory, or to load the required cache line (write allocation) must be made. None of these design choices have clear advantages over the others.
Optimization 7 - Branch Prediction (Implementation)
In a pipelined design, the pipeline can remain full (achieving maximum throughput) only as long as the sequence of instructions can be predicted in advance. This is clearly not possible in conditional branching situations which depend on a result that has not yet been calculated (it is still in the pipeline). Even an unconditional branch, the address of the branch would have to be determined at the beginning of the pipeline to keep things moving efficiently.
Some processor designs always execute the instruction following a branch. The reason is that this instruction will always be fetched, even if a branch is to be taken. This fetch is due to the nature of the pipeline in that the decoding of the instruction occurs while the next instruction is already being fetched. The instruction following a branch is called the delay slot. Organizing machine instructions to conform to this hardware requirement is complex (of course a simple solution is to insert a nop, but this defeats the purpose). The UltraSPARC III uses this branch convention. Compilers targeting this machine reorganize code to utilize this delay slot whenever possible.
The problem of conditional branches is worse, since the action may not be determined until late in the pipeline. Branch prediction refers to a strategy that predicts whether a branch will be taken or not, placing the best guess for the next instructions into the pipeline. Strategies vary, but branches to lower memory locations might be loops and be correct more often than forward branches. If the prediction is correct, the pipeline continues unhindered. If the prediction is wrong, there are multiple problems to consider. At all stages of the pipeline, changes have occurred to the state of the processor. All of these will have to be rolled back to correctly execute the next instruction when it is finally fetched and enters the pipeline. Two strategies are used. One is to use temporary registers that are used until the correctness of the prediction is known. These then replace the real registers if the prediction is correct. The alternative is to store all of the registers, and then use these values to restore them to their earlier state in case the prediction is wrong. If the pipeline holds multiple pending branches, the complexity of these solutions increases.
Dynamic Branch Prediction utilizes some type of history for each of the conditional branch instructions. Branch prediction can then utilize statistical results to improve performance. A hash table is used to store the branch instruction history. This organization is similar to the cache organization. The idea is to not waste too much time in looking up an instruction to determine which branch choice is more likely. One prediction model uses a finite state machine to predict the branch outcome. In this case, the hash table will store the state of the finite state machine.
Some branch instructions include a dynamically determined branch addresses (such as an indirect jump). The prediction must then guess not only will the branch be taken, but the address it will jump to. Storing the last several branch addresses and choosing between them might be a feature of the dynamic branch prediction algorithm. Branch prediction is a current topic of research.
Static Branch Prediction is accomplished by the compiler (instead of hardware). It does however require some architectural cooperation since normally the compiler can affect only the placement of instructions in memory. If the ISA level includes dual versions of the conditional branch instructions, one for branches that commonly sequence, the other for those that commonly branch, then the compiler could choose one over the other based on an analysis of the code, allowing a simple predictive scheme to be implemented in hardware. Another form of static branch prediction requires running the program to gather statistical information of actual branch patterns (profiling). This branch data is fed back to the compiler, and is used to make better predictions as to which version of the conditional branch should be inserted.
Optimization 8 - Out-of-Order Execution and Register Renaming (Implementation)
Keeping instructions flowing through the pipeline is a challenge even when the instructions are sequential. It is common for an instruction to depend on the result of the previous instruction. This Read After Write (RAW) dependence introduces delays in the pipeline. To utilize the pipeline to full potential, other instructions need to be inserted into these time slots. Reordering the program must, however, not affect the end result (the end justifies the means?).
An example will be sued to illustrate the performance gains that can be attained using these techniques. We begin with an 8 register superscalar pipelined machine. We will focus on the decode, execute, and writeback stages which occur in consecutive clock cycles with the exception of a multiplication which requires and extra cycle to calculate the result.We will also assume that at most two instructions can be issued per clock cycle. To see the problem associated with in-order execution, we will require that instructions are issued in order and are completed in order. Later we will relax these requirements.
The processor includes a scoreboard that tracks register usage to detect dependencies. The scoreboard contains information about executing instructions and the resources they are using. The rows of the following table show the contents of the scoreboard during each clock cycle.

The # column is the instruction number and is next to the decoded instruction. The Iss column indicates when an instruction is issued. You can see that instruction 4 cannot be immediately issued. The Ret column indicates when an instruction is retired (completed). Our first model requires that these columns be kept in order. Issuance increases the resource counts and retirement decreases them.
Cycle 1: Two instructions are issued. The first increments the scoreboard entries for registers R0, R1, and R3 under the appropriate section. The second instruction can immediately be issued as it does not depend on R3 (the only one that will be modified by a currently executing instruction). The counters R0, R2, and R4 are incremented.
Cycle 2: Two more instructions are decoded. The second cannot be issued because the sum depends on R4 which is pending a write by instruction 2. This is a RAW dependence. Instructions 1 and 2 are executed during this stage.
Cycle 3: Instruction 2 completes its execution here, but the in-order-completion requirement delays its retirement until instruction 1 finishes. This requires another clock cycle because of the multiplication. No change occurs on the scoreboard.
Cycle 4: Instructions 1 and 3 finish here. All 3 instructions are retired, their results being written to their respective registers. The scoreboard reflects the release of all of the register resources.
Cycle 5: Instruction 4 can now be issued, along with the next (5) since it has no conflicting resources.
Cycle 6: Instruction 6 is decoded, but a conflict occurs since it is writing into a register used by instruction 5 (R1). This is called a WAR (write after read) dependence. Without delving deeper into the pipeline, we cannot tell if I5 has used the value in R1 yet. Writing a new value there could corrupt the result of the multiplication (I5).
Cycle 7: Instruction 4 retires, instruction 5 is still executing.
Cycle 8: Instruction 5 retires, releasing the register needed for instruction 6.

Cycle 9: Instruction 6 can now be issued. Instruction 7 is decoded, but cannot be issued. It needs register 1 (being written by I6) - RAW
Cycle 10: Waiting for I6 to execute
Cycle 11: I6 retires, releasing the register needed by I7
Cycle 12: I7 is issued, but I8 must wait on R1 (WAR)
The rest is straightforward. The entire process required 18 cycles to execute 8 instructions. Not very good for a machine that could theoretically issue 2 per cycle (all instructions in the pipeline in 4 cycles) and execute each in 2 or 3 cycles (possibly finishing in a total of 6-7 cycles).
The benefit of the in-order-completion restriction is not apparent from the processing model described above. However, if an interrupt occurs and out-of-order completion has occurred, the state of the registers/memory is not necessarily consistent with a logical point in the program. Interrupt routines may incorrectly use or update information assuming in-order-completion occurred. A precise interrupt is one that occurs between consecutive instructions, one completed, the next being or about to be executed. An imprecise interrupt would occur between consecutive instructions, but the first is still being executed and the next has already finished. This imprecision may cause sporadic failures in software.
By relaxing the in-order requirements to allow out-of-order execution and out-of-order completion, some performance gains are seen. The entire sequence of instructions can be completed in just 9 clock cycles.
The first difference occurs in Cycle 3 where instructions 5 and 6 are decoded and issued while I4 is waiting. There are two problems that might occur here. First, the stalled instruction is going to modify R6. We cannot start an instruction that reads R6 (RAW). The scoreboard must be extended to include write dependencies for skipped instructions.
The second complication is more serious, because the stalled instruction (I4) needs R1, and Instruction 6 (R1=R0-R2) is going to store a new value here. Register renaming is a technique that can be used to overcome these WAR and WAW (Write After Write) dependencies. By renaming R1 with a secret register (S1, not available at the ISA level) in instruction 6 and subsequently in instruction 7 where it is used, the stalling can be eliminated. When R1 becomes free, the value of S1 can be copied into it in time for its next use.
Register renaming is used again in cycle 4 to allow instruction 8 (R1=R4+R4) to be issued, even though R1 is still tied up by stalled instruction 4. For this particular sequence of instructions, allowing out-of-order execution and completion doubled throughput.
Register renaming is managed in hardware. A table mapping ISA level registers to n+s real registers is commonly used, so the physical location of the register may change, but it is always accessed using the ISA Level register description.
Optimization 9 - Speculative Execution (Architectural)
By dividing programs into basic blocks, another method for speeding execution can be employed. A basic block is a section of code that contains no branches internally, and is always entered from the top (and therefore exited from the bottom). A program can then be represented as a directed graph, with edges showing how one block can follow another. Within each block, the reordering and register renaming techniques can be easily accomplished. Unfortunately, most blocks are relatively short. Allowing reordering to span blocks is the next logical step. However, reordering when branch instructions are part of the sequence adds a great deal of complexity.
Hoisting is when an instruction that follows a branch is issued before the branch instruction is retired. This is the essence of speculative execution. Code is executed before it is known if the branch will lead in that direction. Accomplishing this requires cooperation between the compiler and the hardware. It is even possible to begin execution on both branch alternatives (in parallel) and discard the one not needed. For example, if one branch uses variable x, and the other y, both values might be loaded before the branch decision is made.
Speculative execution cannot commit changes to registers or memory until it is certain that the code is to be executed. Scratch registers can be used for this purpose, increasing the use of the scoreboard. Another, more significant, problem occurs with speculative execution. If an exception occurs during the execution of of a speculative instruction (cache miss, page fault, or divide by zero) or if the instruction causes a time intense operation such as a disk access. If it turns out the instruction is not needed, these actions could introduce significant drags on the system.