THE MICROARCHITECTURE LEVEL


 Microinstructins
 Stacks
 Integer Java Virtual Machine
 Compiling Java to IJVM
Microarchitecture Level Design
 Improving Performance

The next level is the microarchitecture level which implements the Instruction Set Architecture (ISA), the level above it.  This contains a microprogram which fetches, decodes and executes instructions (in this case Integer Virtual Java Machine instructions).  Each instruction has one or two fields to serve some specific purpose.  The first is the opcode (short for operation code), which identifies the instruction such as ADD or BRANCH.  The second field specifies the operand (which variable).

For more information of the Java Virtual Machine click below:

     Java Virtual Machine

The data path contains the ALU, its inputs, and its outputs.  It also contains a number of 32-bit registers.  There are three busses, "A", "B", and the "C" bus.  The ALU has six inputs, F0, F1, ENA (Enable A), ENB (Enable B), INVA (Inverse A), and INC (Increment).  We can get up to 26 or 64 possible outputs, such as A, B, NOT A, NOT B, A + B, A + B + 1, A + 1, B + 1, B - A, B - 1, -A, A AND B, A OR B, 0, 1, AND -1.

A short pulse, derived from the main clock, is produced at the start of each clock cycle.  On the falling edge of the pulse, the bits that drive all the gates are set up in finite and known time.  Then the register needed on the B bus is selected and driven onto the B bus.  After this is stable, the ALU and shifter begin operation on valid data.  After that is stable, the results are propagated along the C bus to the registers, where they can be loaded on the rising edge of the next pulse.

The machine has two different ways to communicate with memory:  a 32-bit, word-addressable memory port, controlled by two registers, MAR (Memory Address Register) and MDR (Memory Data Register), and an 8-bit, byte-addressable memory port controlled by one register, PC (Program Counter), which reads 1 byte into the low-order 8 bits of MBR (Memory Buffer Register).  This port can only read data from memory; it cannot write data to memory.

All registers are driven by one or two control signals.  MAR does not have a connection to B bus, only to main memory.  H is connected through A bus to ALU.  Fore more details of registers click below:

 Registers

Microinstructions

We need 29 signals to control the data path.  They are:

9 Signals to control writing data from the C bus into registers.  They are:
    H (Holding register), OPC and TOS (Extra registers), CPP (Constant Pool Pointer), LV(Local Variable), SP(Stack           Pointer), PC, MDR, and MAR.
9 Signals to control enabling registers onto the B bus for ALU input.  They are:
    MDR, PC, MBR, MBRU(An additional signal from MBR), SP, LV, CPP, TOS, and OPC.
8 Signals to control the ALU and shifter functions. They are:
    SLL8(Shift Left Logical 8 bits), SRA1(Shift Right Arithmetic 1 bit), F0 and F1 (Control Signals), ENA, ENB, INVA, INC
2 Signals to indicate memory read/write via MAR/MDR.
1 Signal to indicate memory fetch via PC/MBR.

With a small increase in circuitry, we can reduce the number of bits needed to select among the possible sources for driving the B bus.  We can encode B bus information in 4 bits and use a decoder to generate 24 or 16 control signals, 7 of which are not needed.  So the total number of signals required is now 24 which is the microinstruction format Mic-1..

The table below shows the microinstruction format for Mic-1.
 
 
 
 
 

Bits
9
3
8
9
3
4
 
NEXT_ADDRESS
JAM
ALU
C
Mem
B

NEXT_ADDRESS contains the address of a potential next microinstruction.  It has 9 bits.  So, 29 or 512 addresses can be located.  JAM determines how the next microinstruction is selected.  The signals are: JMPC (Jump to Micro Program Counter), JAMN(Jump if negative) and JAMZ(Jump if zero).  ALU are ALU and shifter functions.  Mem is Memory functions.  They are WRITE, READ, and FETCH.

The largest and most important item in the control portion of the machine is a memory called the control store, which holds the complete microprogram.  It contains 512 words, each of 36-bit microinstruction.

Stacks

All programming languages support the concept of procedures (also called methods, functions, or subroutines), which have local variables, which are accessed from inside the procedures, but cease to exist once the procedures have returned.  Where should these variables be kept in memory?  Giving each an absolute memory address does not work in case of recursive methods which can call themselves.  Even if it is called twice, the second invocation will interfere with the first.

So, an area of memory called the stack is reserved for variables, but individual variables do not get absolute addresses in it.  Instead, a register LV is set to point to the base of the local variables for the current procedure.  The figures below explain the process.

Stack Pointer
Stack location
Stack Address
SP ->
a3
108
 
a2
104
LV ->
a1
100

While A is active.


Stack Pointer
Stack Location
SP ->
b4
 
b3
 
b2
LV ->
b1
 
a3
 
a2
 
a1

After A calls B

Stack Pointer
Stack Location
SP ->
c2
LV ->
c1
 
b4
 
b3
 
b2
 
b1
 
a3
 
a2
 
a1

After B calls C

Stack Pointer
Stack Location
SP ->
d5
 
d4
 
d3
 
d2
LV ->
d1
 
a3
 
a2
 
a1

After C and B return and A calls D

In addition to holding local variables, stacks can also be used to hold operands during the computation of an arithmetic operation, in which case it is referred to as the operand stack.  The figures below shows how a1 = a2 + a3 is accomplished.
 
 
 
Stack Pointer
Stack location
Stack Pointer
Stack location
Stack Pointer
Stack location
Stack Pointer
Stack location
   
SP ->
a3
       
SP ->
a2
 
a2
SP ->
a2 + a3
   
 
a3
 
a3
 
a3
SP ->
a3
 
a2
 
a2
 
a2
 
a2
LV ->
a1
LV ->
a1
LV ->
a1
LV ->
a2 + a3

The IJVM (Integer Java Virtual Machine) Memory Model

It consists of a memory that can be viewed as 4,294,967,296 bytes (4 GB) or 1,073,741,824 words each of 4 bytes.  The areas of memory are:

1.    The Constant Pool, which cannot be written by an IJVM program and consists of constants, strings, and pointers to other areas of memory that can be referenced.  It is loaded when the program is brought into memory and cannot be changed later.

2.    The Local Variable Frame stores variables during the lifetime of a method.  The parameters (or arguments) are at the beginning of the frame.  The operand stack is separate.

3.    The operand Stack guaranteed not to exceed a certain size, computed in advance by the Java compiler, and is allocated directly above the local variable frame.  The stack pointer, SP, changes during the execution of the method, unlike CPP and LV.

4.    The Method Area is a region of memory containing the program, referred to as the "text" area in a UNIX process.  This is a byte area and the pointer is the Program Counter, PC.
 
 
 
   
Current Operand Stack 3
<- SP    
   
Current Local variable Frame 3
<- LV    
   
Local Variable Frame 2
     
Constant Pool
<- CPP
Local Variable Frame 1
  Method Area <- PC

The IJVM Instruction Set
 
 
 
Hex
Mnemonic
Meaning
0x10
 BIPUSH byte  Push byte onto stack
0x59
 DUP  Copy top word on stack and push onto stack
0xA7
 GOTO offset  Unconditional branch
0x60
 IADD  Pop two words from stack; push their sum
0x7E
 IAND  Pop two words from stack; push Boolean AND
0x99
 IFEQ offset  Pop word from stack and branch if it is zero
0x9B
 IFLT offset  Pop word from stack and branch if it is less than zero
0x9F
 IF_ICMPEQ offset  Pop two words from stack; branch if equal
0x84
 IINC varnum const  add a constant to a local variable
0x15
 ILOAD varnum  Push local variable onto stack
0xB6
 INVOKEVIRTUAL disp  Invoke a method
0x80
 IOR  Pop two words from stack; push Boolean OR
0xAC
 IRETURN  Return from method with integer value
0x36
 ISTORE varnum  Pop word from stack and store in local variable
0x64
 ISUB  Pop two words from stack; push their difference
0x13
 LDC_W index  Push constant from constant pool onto stack
0x00
 NOP  Do nothing
0x57
 POP  Delete word on top of stack
0x5F
 SWAP  Swap the top two words on the stack
0xC4
 WIDE  Prefix instruction; next instruction has a 16-bit index

Compiling Java to IJVM

Consider the java program:

i = j + k;
if (i == 3)
    k = 0;
else
    j = j - 1

The IJVM program is

1.                ILOAD j                    // i = j + k                   0x15 0x02
2.                ILOAD k                                                     0x15 0x03
3.                IADD                                                           0x60
4.                ISTORE i                                                     0x36 0x01
5.                ILOAD i                    // if (i == 3)                 0x15 0x01
6.                BIPUSH 3                                                    0x10 0x03
7.                IF_ICMPEQ L1                                           0x9F 0x00 0x0D
8.                ILOAD j                    // j = j - 1                    0x15 0x02
9.                BIPUSH 1                                                    0x10 0x01
10.              ISUB                                                            0x64
11.              ISTORE j                                                     0x36 0x02
12.              GOTO L2                                                    0xA7 0x00 0x0F
13.    L1:     BIPUSH 0                // k = 0                        0x10 0x00
14.              ISTORE k                                                    0x36 0x03
15.    L2:

First j and k are pushed on onto the stack, added, and the result stored in i.  Then i and the constant 3 are pushed onto the stack and compared.  If they are equal, a branch is taken to L1, where k is set to 0.  if they are unequal, the code following IF_ICMPEQ is executed.  When it is done, it branches to L2, where the then and else parts merge.

DESIGN OF THE MICROARCHITECTURE LEVEL

Speed versus cost

There are three basic approaches for increasing the sped of execution:

1.    Reduce the number of clock cycles needed to execute an instruction, known as the path length.  It can be shortened by adding specialized hardware, such as an incrementer (an adder with one side permanently wired to add 1).  This means more hardware.  Reducing the number of cycles to fetch instructions requires an additional circuit to increment the PC.

2.    Simplify the organization so that the clock cycle can be shorter.

3.    Overlap the execution of instructions offers the most opportunity for dramatic increases in speed such a s parallel processing.

Cost was a count of the number of components when processors were built of discrete components that were purchased and assembled.  Today, the entire processor exists on a single chip, and bigger, more complex chips are more expensive than smaller, simpler ones.  Manufacturing cost of the chip grows faster than its area (measured in pico-acres).  The system designer must decide whether greater speed is worth the additional cost.

An instruction fetch unit

For every instruction the following operations occur:

1.    The PC is passed thorough the ALU and incremented.
2.    The PC is used to fetch the next byte in the instruction stream.
3.    Operands are read from memory.
4.    Operands are written to memory.
5.    The ALU does a computation and the results are stored back.

Fetching and assembling a field ties up the ALU for at least one cycle per byte to increment the PC, and then again to assemble the resulting index or offset.  In order to overlap the main loop, the ALU must be freed from some of these tasks.   This can be done by introducing a second, but not full ALU.  Or cycles can be eliminated by introducing additional data paths not going through the ALU.  The main path can be eliminated by branching directly to the next instruction at the end of the previous instruction.

IMPROVING PERFORMANCE

Cache memory

A cache holds the most recently used memory word in a small, fast memory, speeding up access to them.   If a large enough percentage of the memory words needed are in the cache, the effective memory latency can be reduced.  Multiple caches can be used as a separate cache for instructions and data known as split cache.

Direct-mapped cache

Each entry (row) in the cache can hold exactly one cache line from main memory.  With a 32-bit cache line size, the cache can hold 64KB.  Each cache entry consists of three parts:

1.    The Valid bit indicates whether there is any valid data in this entry or not.  When the system is booted (started), all entries are marked as invalid.
2.    The Tag field consists of a unique, 16-bit value identifying the corresponding line of memory from which the data came.
3.    The Data field contains a copy of the data in memory.  This field holds one cache line of 32 bytes.

In a direct-mapped cache, a given memory word can be stored in exactly one place within the cache.  Given a memory address, there is only one place to look for it in the cache.  If it is not there, it is not in the cache.  For storing and retrieving data from the cache, the address is broken into four components:

1.    The TAG field corresponds to the Tag bits stored in cache entry.
2.    The LINE field indicates which cache entry holds the corresponding data, if they are present.
3.    The WORD field tells which word within a line is referenced.
4.    The BYTE field is used only if a single byte is requested.  It tells which byte within the word is needed.  For a cache supplying only 32-bit words, this field will always be 0.

Set-associative cache

This allows two or more lines in each entry.  A cache with n possible entries for each address is called an n-way set-associative cache. It is more complicated than a direct-mapped cache as a set of n cache entries must be checked to see if the needed line is present.  When a new entry is to be brought into the cache, which of the present items should be discarded.  Not being able to see the future, LRU (Least Recently Used) is the one discarded.

Branch Prediction

Programs are not linear code sequences, but full of branch instructions.  As an example, if i is compared to 0, and the value of k depends on the result.  The program is:

   if (i == 0)
            k = 1;
    else
            k = 2;

In assembly language, it is:

                                CMP i, 0             ; compare i to 0
                                BNE Else            ; branch to Else if not equal
        Then:                MOV k, 1           ; move 1 to k
                                BR Next              ; unconditional branch to Next
        Else:                 MOV k, 2            ; move 2 to k
        Next:

The first way is to allow instructions fetched after a predicted conditional branch to execute until they try to change the machine's state (store into a register).  The second way is to record the value of any register about to be overwritten (in a secret scratch register), so the machine can be rolled back to the state it had at the time of the mispredicted branch.

Dynamic Branch Prediction

Accurate predictions allow the CPU to proceed at full speed.  The CPU can maintain a history table (in special hardware), in which it logs conditional branches as they occur, so they can be looked up when they occur again.  The history table contains one entry for each conditional branch instruction.  The entry contains the address of the branch instruction along with a bit telling whether it was taken the last time it was executed.  The prediction is that the branch will go the same way it went last time.  If the prediction is wrong, the bit in the history table is changed.

Static Branch Prediction

Dynamic branch predictions require specialized and expensive hardware and complex chips.  The compiler can help.  For instance, with a statement like  for (i = 0; i < 1000000; i++), the compiler knows the branch at the of the loop will always be taken.
 
 
 Top
 On to Instruction Set Architecture
 Back to Schedule and Syllabus