Computer Architecture

© Tim Margush 2006

Digital Logic

Gates are the fundamental components of digital logic circuits. Gates are constructed from elementary electronic components: transistors and resistors. Logic circuits require a power source and represent binary values by two distinct voltage levels. Logic diagrams often omit the supply voltages and focus only on the logical lines. The first diagrams will show both.

There are three fundamental gates that illustrate the use of these components. The resistor is the simpler component. It simply turns the flow of electrons into heat. In doing so, it "drops" some voltage across the resistor. Current is measured in amperes, voltage in volts, and resistance in ohms. For a resistor, these quantities are related by V = IR (Ohm's law). When the current is 0 (or close to it), the voltage dropped across the resistor is very small. When the current is high, the voltage drop is greater.

The Vin and Vout represent the logical data. Vcc is typically 5 volts (depending on the components used in the circuit). A voltage level on the logic (data) lines under 1 volt represents a logic 0 (or 1) and a level of 2-5 volts represents the other.

The transistor acts as a simple switch in these circuits. Only a few nanoseconds are required to change states. The base connection turns the switch on or off. When on, the collector is "connected" to the emitter. When Vin, connected to the base of the transistor, is low (under 1 volt), the transistor is off. This means very little current flows through the resistor so the voltage on each side is about the same. Vout (in the above diagrams) will appear to be at the 5 volt level (the same as Vcc). When the base of the transistor is provided a voltage closer to 5 volts, the transistor switches on causing current to flow at the maximum rate allowed by the resistor. The 5 volt supply is completely dropped by the resistor causing Vout to read a voltage closer to 0 volts.
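As a rough numerical illustration of the switching behaviour described above, the following Python sketch treats the transistor as an ideal switch and applies Ohm's law to the resistor. The 1000 ohm resistor value, the 2 volt switching threshold, and the function name are assumptions made for illustration; they are not taken from the diagrams.

```python
VCC = 5.0  # supply voltage in volts, the typical value mentioned in the text

def inverter_vout(vin, r_ohms=1000.0, threshold=2.0):
    """Idealized NOT gate: the transistor is modeled as a perfect switch (assumption)."""
    transistor_on = vin >= threshold
    # Off: almost no current flows. On: the resistor limits the current to Vcc / R.
    current = VCC / r_ohms if transistor_on else 0.0
    # Ohm's law: the voltage dropped across the resistor is I * R.
    return VCC - current * r_ohms

print(inverter_vout(0.2))  # low input: output stays near Vcc
print(inverter_vout(4.8))  # high input: output is pulled toward 0 volts
```

A real transistor does not switch this abruptly, so the sketch only captures the two stable logic levels.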

The three gates shown above implement the Boolean NOT, NAND, and NOR functions. NOT gates are commonly called inverters. Additional inputs are often added to the NAND and NOR gates.

There are 7 basic gates implementing the non-trivial fundamental unary and binary Boolean functions. All logic circuits can be constructed from the above subset of just 3 gates (other subsets are sufficient as well).

Missing from the diagram are the XOR and NXOR (also written XNOR) gates.

Boolean (Switching) Algebra

The set of Boolean values is B = {false, true} or {0, 1}. A Boolean function is a function from B^n into B. There are two non-trivial unary Boolean functions, the identity and NOT. That is, I(x) = x and NOT(x) = 1-x.

Truth tables are often used to describe Boolean functions. This table defines a ternary function f : B^3 into B.

a b c f
0 0 0 0
0 0 1 1
0 1 0 0
0 1 1 0
1 0 0 1
1 0 1 0
1 1 0 0
1 1 1 1

A logic circuit can be constructed to implement this function. As the number of inputs increases, the number of rows in the truth table grows exponentially. There is also a functional (and compact) expression using logical operations from Boolean Algebra (NOT, AND, and OR). There are standard techniques for going from one to the other. Here is one function equivalent to the truth table. The bars over the variables indicate negation. A dot or juxtaposition indicates AND, and plus indicates OR.

$f(a,b,c) = \bar{a}\,\bar{b}\,c + a\,\bar{b}\,\bar{c} + a\,b\,c$

This particular formula is called the canonical sum of products formula. It can be obtained by inspecting the rows of the truth table where a 1 output is desired. There is one term (product) for each of these rows. The input variable is negated if the input value is 0. The formula is the sum (OR) of these terms.

a b c f term
0 0 1 1 $\bar{a}\,\bar{b}\,c$
1 0 0 1 $a\,\bar{b}\,\bar{c}$
1 1 1 1 $a\,b\,c$
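The row-by-row procedure can be sketched in Python. This is a minimal illustration, not a standard tool: the function names are hypothetical, and negation is written with a trailing apostrophe (a') because plain text cannot show an overbar.

```python
from itertools import product

# Output column of the truth table above, rows in order abc = 000 .. 111.
F_OUTPUTS = [0, 1, 0, 0, 1, 0, 0, 1]

def minterms(outputs, names="abc"):
    """Return one product term for each row whose output is 1."""
    terms = []
    for bits, out in zip(product((0, 1), repeat=len(names)), outputs):
        if out:
            # A variable is negated (shown as x') when its input value is 0.
            terms.append("".join(n if b else n + "'" for n, b in zip(names, bits)))
    return terms

def evaluate_sop(outputs, bits):
    """Evaluate the canonical sum of products for one input combination."""
    rows = product((0, 1), repeat=len(bits))
    # Each product term is 1 only when every literal matches; the sum is their OR.
    return int(any(out and all(b == r for b, r in zip(bits, row))
                   for row, out in zip(rows, outputs)))

print(" + ".join(minterms(F_OUTPUTS)))
```

Evaluating the formula for all eight input combinations reproduces the original output column, which is a quick way to check a hand-derived expression.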

This process shows that the set of gates, {NOT, AND, OR}, is complete; that is, any circuit can be constructed from just gates in that set. It is possible to build any circuit with only NANDs or only NOR gates. Formulas are often simplified or placed in an equivalent form to allow implementation in an actual circuit with fewer gates or with only certain types of gates. In some cases, the optimization is to increase the speed of the circuit (each gate introduces a slight processing delay).
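The claim that NAND alone is sufficient can be checked with a short Python sketch that builds NOT, AND, and OR from a single two-input NAND function. The function names are illustrative only.

```python
def nand(a, b):
    """Two-input NAND on bits 0/1."""
    return 1 - (a & b)

def not_(a):
    # Tying both NAND inputs together gives an inverter.
    return nand(a, a)

def and_(a, b):
    # AND is NAND followed by an inverter.
    return not_(nand(a, b))

def or_(a, b):
    # De Morgan: a OR b = NOT(NOT a AND NOT b).
    return nand(not_(a), not_(b))
```

Since {NOT, AND, OR} is complete and each of its members can be built from NAND, the set {NAND} is complete as well. The same argument works for NOR by duality.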

Basic Rules of Boolean Algebra and Principle of Duality

Integrated Circuits

An integrated circuit is a collection of gates integrated into a single chip. The number of gates on a chip is used for classification:

Typical SSI chips provide 4-6 simple gates in a single dual-inline package (DIP). Pins protruding from the DIP provide connections to the input and output lines of the internal gates as well as the necessary power connections.

These SSI chips can be combined to build combinational circuits. Combinational circuits are implementations of truth tables or Boolean functions. Their output is uniquely determined by their current input values. There are standard combinational circuits that are pre-built and provided in a single MSI chip.

Multiplexer - Decoder

A multiplexer is a combinational circuit with 2^n + n inputs and one output. The n address lines select one of the 2^n data lines to be logically copied to the output. A decoder is essentially the reverse. A single input can be selectively passed through to one of 2^n outputs selected by n address input lines. Here is a 4-1 multiplexer and a 1-4 decoder.

You can use a multiplexer to implement any combinational circuit. Connect the input values to the address lines and then set each data input to a 1 or 0 (connect it to Vcc or GND) to create the correct output for each input combination. Multiplexers can also be used to convert parallel data to serial (multiplexing the bits - each bit gets a small time slice on the serial line). The address lines are connected to a binary counter so they step through each of the possible addresses. This sequences the input data on the output line. If you connect the input of a decoder (demultiplexer) to the serial line, and count through its addresses in a synchronous fashion, you recover the parallel data.
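Both uses of the multiplexer can be sketched in Python. The helper names below are hypothetical, address bits are given most significant bit first, and the demultiplexer outputs are simply ORed into a result word to stand in for the latching a real receiver would need.

```python
def to_index(address):
    """Convert a tuple of address bits (MSB first) to an integer."""
    index = 0
    for bit in address:
        index = (index << 1) | bit
    return index

def mux(data, address):
    """2^n-to-1 multiplexer: copy the addressed data line to the output."""
    return data[to_index(address)]

def demux(value, address):
    """1-to-2^n demultiplexer: route the input to the addressed output."""
    outputs = [0] * (1 << len(address))
    outputs[to_index(address)] = value
    return outputs

def counter(n):
    """Step through all n-bit addresses in counting order."""
    for i in range(1 << n):
        yield tuple((i >> k) & 1 for k in reversed(range(n)))

# Implementing a truth table: data lines are tied to the desired outputs
# (the ternary function f from the Boolean algebra section, rows 000 .. 111).
F_DATA = [0, 1, 0, 0, 1, 0, 0, 1]

def parallel_to_serial(word):
    return [mux(word, addr) for addr in counter(2)]

def serial_to_parallel(stream):
    word = [0, 0, 0, 0]
    for bit, addr in zip(stream, counter(2)):
        word = [w | d for w, d in zip(word, demux(bit, addr))]
    return word
```

For example, `mux(F_DATA, (1, 0, 0))` looks up row 100 of the truth table, and passing a 4-bit word through `parallel_to_serial` and then `serial_to_parallel` returns the original word.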

Comparators

This circuit compares the bits of two n-bit input values and outputs a 0 if they are equal, 1 if not equal. Internally, the comparison is essentially an exclusive-or of the paired bits; the intermediate results are ORed to produce the output.
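A minimal Python sketch of this comparator follows. The function name is illustrative, and the inputs are given as equal-length lists of bits.

```python
def comparator(a_bits, b_bits):
    """Return 0 if the two n-bit values are equal, 1 otherwise."""
    if len(a_bits) != len(b_bits):
        raise ValueError("inputs must have the same width")
    result = 0
    for a, b in zip(a_bits, b_bits):
        # XOR is 1 exactly when the paired bits differ; OR collects any difference.
        result |= a ^ b
    return result
```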

Programmable Logic Array (PLA)

A PLA allows any truth table to be implemented by custom "programming" one of these chips. A PLA has n inputs and m outputs. Internally there are x AND gates and m OR gates. Unprogrammed, all n inputs (and their complements) are input to all of the AND gates, and all of the AND gates are input to all of the OR gates. By "burning" internal connections (called fuses), the programmer selects certain input combinations for each AND gate (often the terms required in the sum of products form of a function desired to be implemented) and then selects which of these results are inputs to the OR gates (the sum portion of the formula) to produce the desired output. One PLA can produce m independent functions on the n inputs (provided there are enough gates internally for the required combinations).
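The AND plane and OR plane can be modeled in Python as two small tables. This is only a sketch of the idea: the data layout (a dictionary per AND gate, a list of term indices per OR gate) is an assumption chosen for readability, not how a real PLA is programmed.

```python
def pla(inputs, and_plane, or_plane):
    """Evaluate a programmed PLA.

    and_plane: one dict per AND gate mapping input index -> required bit
               (an input left out of the dict has had both fuses burned).
    or_plane:  one list per output naming the AND gates it keeps.
    """
    products = [int(all(inputs[i] == want for i, want in term.items()))
                for term in and_plane]
    return [int(any(products[t] for t in terms)) for terms in or_plane]

# Example programming: two inputs, two outputs (sum and carry of two bits).
AND_PLANE = [{0: 0, 1: 1}, {0: 1, 1: 0}, {0: 1, 1: 1}]
OR_PLANE = [[0, 1], [2]]  # output 0 = term0 + term1, output 1 = term2
```

With this programming, `pla([1, 1], AND_PLANE, OR_PLANE)` yields the two outputs for inputs 1 and 1, showing how one chip produces several independent functions of the same inputs.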

Combinational circuits can be implemented using discrete SSI logic components, a collection of multiplexers, or one or more PLAs. Designers make the choice based on economics (of time, price, or space).

Arithmetic Circuits

The basic functions of the ALU are shifting, adding, and logical operations. These are accomplished using combinational circuits. A 4-bit parallel shifter can be created by building a combinational circuit with 6 inputs and 4 outputs. There are 4 data bits and 2 select bits allowing us to shift left or right and to perform logical or arithmetic shifts.

C0 (0 = left shift, 1=right shift) C1 (0 = logical shift, 1=arithmetic shift)
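The behaviour of such a shifter can be sketched in Python. The sketch assumes a shift of one position, lists the data bits most significant bit first, and treats an arithmetic left shift the same as a logical left shift; none of these details are fixed by the text.

```python
def shift4(bits, c0, c1):
    """Shift a 4-bit value [d3, d2, d1, d0] by one position.

    c0: 0 = left shift, 1 = right shift
    c1: 0 = logical shift, 1 = arithmetic shift
    """
    d3, d2, d1, d0 = bits
    if c0 == 0:
        # Left shift: a 0 enters at the least significant end.
        return [d2, d1, d0, 0]
    # Right shift: logical fills with 0, arithmetic copies the sign bit.
    fill = d3 if c1 else 0
    return [fill, d3, d2, d1]
```

Each output bit depends only on the six inputs, which is why the shifter is a purely combinational circuit.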

A half adder computes the sum and carry of two bits. The truth table for the half adder is as follows:

a b Sum Carry
0 0 0 0
0 1 1 0
1 0 1 0
1 1 0 1

A full adder has three inputs to accommodate a carry in to the current bit position. It can be constructed by connecting two half-adders. The carry out is 1 when there is a carry from either of the two stages (there cannot be 2 carries).

Half Adder Full Adder

A ripple carry adder can be constructed from a series of full adders, with the carry from each stage forming the carry input for the next. This diagram shows a 4-bit ripple carry adder. The inputs are on the left. The least significant bits of the operands, with a carry in of 0, are at the top. Each successive adder gets the next pair of bits from the two input values and the carry from the previous stage. Due to the accumulated delays, the lower full adders may need to change their output as the correct carry "ripples" through the system. Because of this long path through the circuit, ripple carry adders are slow. In the context of a processor, the result bits would be copied into a result register and the final carry out might go into the status register.

By combining the circuits for shifters and adders, adding a few gates for the logical operations, and using a decoder to select the desired ALU function, an entire ALU can be constructed. Often, an ALU is built from smaller modules that implement the operations for a single bit (column) of the operands. By connecting these bit-slices together, a full ALU is created. Each bit-slice unit accepts 2 data bits and produces a result and carry (to facilitate ripple carry). Additional control lines select the ALU function and control the input lines.
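The half adder, full adder, and ripple carry adder can be sketched in Python as follows. The function names are illustrative, and operand bits are listed least significant bit first to match the order in which the carry ripples.

```python
def half_adder(a, b):
    """Return (sum, carry) for two bits, matching the truth table above."""
    return a ^ b, a & b

def full_adder(a, b, carry_in):
    """Two half adders; the carry out is the OR of the two stage carries."""
    s1, c1 = half_adder(a, b)
    s2, c2 = half_adder(s1, carry_in)
    return s2, c1 | c2

def ripple_carry_add(a_bits, b_bits, carry_in=0):
    """Add two equal-width bit lists (LSB first); return (sum_bits, carry_out)."""
    result = []
    carry = carry_in
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)  # carry feeds the next stage
        result.append(s)
    return result, carry
```

The loop makes the source of the delay visible: stage k cannot produce its final output until stage k-1 has produced its carry.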

Consider the following function selection for a 1-bit ALU.

F0 F1 Function
0 0 AND
0 1 OR
1 0 NOT B
1 1 A+B

The circuit might look like this. ENA and ENB are input enable lines. INVA allows the A input to be inverted if desired.

Eight of these interconnected will implement a small ALU. INC is the carry in for the least significant bit and could be used to increment a register (A+0+1) or perform an add-with-carry (A+B+1). The carry out of position 6 is sometimes needed to set overflow flags for signed addition, so it is indicated as a separate output.
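A behavioural Python sketch of the bit-slice and of eight slices chained together is shown below. It follows the function table above, but the exact gating is an assumption: ENA and ENB are modeled as AND masks on the inputs, INVA as an XOR on the enabled A input, and a carry is produced only in the A+B mode.

```python
def alu_slice(a, b, f0, f1, carry_in=0, ena=1, enb=1, inva=0):
    """One bit position of the ALU; returns (result, carry_out)."""
    a = (a & ena) ^ inva   # enable, then optionally invert, the A input
    b = b & enb            # enable the B input
    if (f0, f1) == (0, 0):
        return a & b, 0
    if (f0, f1) == (0, 1):
        return a | b, 0
    if (f0, f1) == (1, 0):
        return 1 - b, 0    # NOT B
    total = a + b + carry_in
    return total & 1, total >> 1

def alu8(a, b, f0, f1, inc=0, ena=1, enb=1, inva=0):
    """Eight slices connected by their carries; INC is the carry into bit 0."""
    result, carry = 0, inc
    for i in range(8):
        bit, carry = alu_slice((a >> i) & 1, (b >> i) & 1,
                               f0, f1, carry, ena, enb, inva)
        result |= bit << i
    return result, carry
```

Incrementing a register corresponds to `alu8(value, 0, 1, 1, inc=1, enb=0)`, which computes A+0+1 by disabling the B input.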

Clocks

A clock is a circuit that produces an electrical pulse of a given duration at regular intervals. The time between pulses is the clock cycle time. Clocks are usually driven by a crystal oscillator which produces a highly accurate reference point. Clocks are used to sequence the activities in a computer. Usually the rising or falling edge of the clock signal is used to precisely activate an event. Clock signals are also used to enable events during the high (or low) portion of the signal. Clock cycles are often subdivided to create additional initiation points.

An asymmetric clock signal can be produced using logic functions and two synchronized clock signals. Here, C is the AND of A and B.
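The effect of ANDing two synchronized clocks can be seen in a small discrete-time Python sketch. The period of 8 time steps and the delay of 2 steps for signal B are arbitrary values chosen for illustration.

```python
def square_wave(t, period=8, delay=0):
    """Symmetric clock: high for the first half of each period."""
    return 1 if ((t - delay) % period) < period // 2 else 0

STEPS = range(16)
A = [square_wave(t) for t in STEPS]
B = [square_wave(t, delay=2) for t in STEPS]  # same clock, shifted in time
C = [a & b for a, b in zip(A, B)]             # high only where A and B overlap

print("A", A)
print("B", B)
print("C", C)
```

A and B are each high for half of every period, while C is high only during their overlap, giving a clock with the same period but a shorter high phase.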

Memory

The basic unit of memory (a bit) is constructed from a pair of gates. The following circuit is called an SR-Latch. The S and R lines are to set (to 1) and reset (to 0) the latched value. The latch stores 0 or 1, making it available as Q (and its negation), until changed.

The "feedback" differentiates this circuit from a combinational circuit. The value of Q depends not only on S and R, but on the recent history of the circuit. Thus, an input of 00 sometimes produces a 0, sometimes a 1. Often an additional input (CLK, also called enable or strobe) is provided to enable the S and R inputs. This is called a clocked SR latch. When the CLK signal is 0, the latch cannot be changed by manipulating the S and R lines.
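The dependence on history can be demonstrated with a Python sketch that settles a pair of cross-coupled gates. The sketch assumes the latch is built from two NOR gates (the diagram is not reproduced here, so this is an assumption) and simply re-evaluates the gates until the outputs stop changing.

```python
def nor(a, b):
    return 1 - (a | b)

def sr_latch(s, r, q, q_bar):
    """Settle the cross-coupled NOR gates from the current state (q, q_bar)."""
    for _ in range(4):                 # a few passes are enough to stabilize
        new_q = nor(r, q_bar)
        new_q_bar = nor(s, new_q)
        if (new_q, new_q_bar) == (q, q_bar):
            break
        q, q_bar = new_q, new_q_bar
    return q, q_bar

def clocked_sr_latch(s, r, clk, q, q_bar):
    """S and R only reach the latch while CLK is 1."""
    return sr_latch(s & clk, r & clk, q, q_bar)

state = sr_latch(1, 0, 0, 1)      # set
held_one = sr_latch(0, 0, *state)  # inputs 00: the latch remembers 1
state = sr_latch(0, 1, *held_one)  # reset
held_zero = sr_latch(0, 0, *state) # inputs 00 again: now it remembers 0
```

The two calls with inputs 00 return different values of Q, which is exactly the behaviour a combinational circuit cannot have.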

The SR-latch's inputs should never be changed directly from 11 to 00. Doing so creates a race condition and the latch may stabilize in either the 0 or 1 state. A slightly modified circuit eliminates this possibility. It is called a (clocked) D latch.

The CLK input is used to unlock the latch, allowing the input D to be stored. When CLK returns to 0, the data is locked.
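A behavioural Python sketch shows why the D latch avoids the troublesome input combination. It assumes the usual construction in which S is derived from D and R from the complement of D, both gated by CLK; the function name is illustrative.

```python
def d_latch(d, clk, q):
    """Return the new stored value given input D, CLK, and the old value q."""
    s = d & clk          # set when D is 1 and the latch is unlocked
    r = (1 - d) & clk    # reset when D is 0 and the latch is unlocked
    # S and R are derived from D and its complement, so they are never both 1.
    if s:
        return 1
    if r:
        return 0
    return q             # CLK is 0: the data is locked
```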

The flip-flop is another common storage circuit. Flip-flops are edge-triggered rather than level-enabled (like the latch). One variation of flip-flop is the JK flip-flop which can toggle its data value. If a JK flip-flop were level-enabled, and you gave it the signal to toggle, it would toggle repeatedly until the clock signal went low.

Either the rising edge or the falling edge of the clock signal is used to trigger the flip-flop. Here is a simple D flip-flop.

The trigger is constructed from an inverter and an AND gate. The timing diagram shows how a simple clock signal is changed to a very narrow pulse used to trigger the flip-flop to accept the input data.

In addition to the D flip-flop, there are SR and JK flip-flops. Latches and flip-flops are usually drawn as a rectangle with inputs D, SR, or JK. The clock input has a triangle next to it to indicate it is edge-triggered. Negated clock inputs indicate low-enabled or falling edge-triggered circuits. Many of these circuits make both Q and NOT Q available as output.
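The pulse-forming trick can be sketched in Python using discrete time steps. The sketch assumes the inverter delays the clock by exactly one step, so the AND gate sees the clock together with an inverted copy of its previous level; the names and the initial clock level of 0 are illustrative.

```python
def edge_pulses(clk):
    """AND the clock with a delayed, inverted copy of itself."""
    pulses = []
    previous = 0  # assumed clock level before the first sample
    for level in clk:
        delayed_inverted = 1 - previous
        pulses.append(level & delayed_inverted)  # 1 only just after a rising edge
        previous = level
    return pulses

def d_flip_flop(clk, d_values, q=0):
    """Store D only during the narrow pulse; return Q at each time step."""
    outputs = []
    for pulse, d in zip(edge_pulses(clk), d_values):
        if pulse:
            q = d
        outputs.append(q)
    return outputs

CLK = [0, 1, 1, 0, 0, 1, 1, 0]
D = [1, 1, 0, 0, 0, 0, 1, 1]
Q = d_flip_flop(CLK, D)
```

Changes on D while the clock is simply high (for example at the third and seventh time steps) do not reach Q, which is the difference between an edge-triggered flip-flop and a level-enabled latch.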

Latches and flip-flops are also the building blocks for registers and memory.

Memory Organization

The main feature of a memory is the ability to select a small group of bits (such as a byte) using an address and either store new data into the byte or read the current data from the byte. From the outside, a 2^n byte memory has n address lines to select the byte, and 8 data lines for the data. A read/write (RD) control line allows the memory to switch between read and write modes. A chip select (CS) line controls whether the chip can accept data to be stored, and an output enable (OE) line controls whether the output lines are active.

The bits in memory can be stored in D flip-flops. Consider a single bit-slice, where data comes from one of the data lines. All of the D inputs are connected directly to the data line. When data is to be stored, all of the flip-flops will see the data, but only one flip-flop will receive it. The address lines are run through a decoder that has one output for every bit of this slice (and simultaneously the related bits in the other slices). The selection is made by ANDing the CS and RD signals with the decoder outputs and feeding these into the clock input of the flip-flops. Only one flip-flop will see the signal due to the decoder selecting only one output to be high. When the CS line is pulsed (the flip-flops are edge-triggered) exactly one bit (in each slice) will store the value on the data line.

The decoder signal is also ANDed with all of the flip-flop outputs, and the results of these AND gates are ORed to produce the memory output. Again, only one flip-flop's data can appear as output from the memory unit. The final output is passed through a non-inverting, tri-state buffer. The OE, RD, and CS lines control this buffer, allowing the memory output to appear only when all three are high. When low, the output lines appear to be disconnected from the memory. This is different from a zero output, and allows other devices to use the same connections. In fact, the output and the input lines are generally the same pins on a memory chip.
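The organization described above can be modeled in Python with a one-hot decoder, a write path gated by the control lines, and an AND/OR read path. The sizes (4 words of 3 bits) and the class name are arbitrary, and treating RD = 0 as write mode is an assumption of this sketch.

```python
class SmallMemory:
    """Toy memory: each word is a row of D flip-flops, one per bit-slice."""

    def __init__(self, address_bits=2, width=3):
        self.width = width
        self.cells = [[0] * width for _ in range(1 << address_bits)]

    def _decode(self, address):
        # Decoder: exactly one select line is high.
        return [int(i == address) for i in range(len(self.cells))]

    def write(self, address, data_bits, cs=1, rd=0):
        for select, word in zip(self._decode(address), self.cells):
            clock = select & cs & (1 - rd)   # assumed write condition
            if clock:
                word[:] = data_bits          # only the selected word latches D

    def read(self, address, cs=1, rd=1, oe=1):
        if not (cs and rd and oe):
            return None                      # tri-state: outputs disconnected
        select = self._decode(address)
        # AND each stored bit with its select line, then OR within each slice.
        return [int(any(s & word[i] for s, word in zip(select, self.cells)))
                for i in range(self.width)]
```

Returning `None` rather than zeros mirrors the distinction the text draws between a disconnected output and a zero output.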

Memory is such a common component, it is no surprise that it is available on a chip. Full memories are constructed from groups of standard memory chips. Storage capacity is usually stated in bits rather than bytes. The following diagram shows two ways a 4 MBit chip might be packaged. The first might be a complete 512 KB memory for an embedded application. Note that the decoder in this chip will have a half million outputs to independently select each memory location. The second chip could be used as one bit-slice of a 4 MB memory formed with 7 additional chips.

Note that chip (a) has 19 address lines, which corresponds to 2^19 addresses or 512 KB. Chip (b) has only 11 address lines. The CAS (Column Address Strobe) and RAS (Row Address Strobe) lines are used to select a single bit in two steps. Internally, the bits are arranged in a 2048 (2^11) x 2048 array. The row address is selected first, followed by the column address. This organization means slightly slower memory access, but reduces the number of address pins. 4 Mbit (2^22) would require 22 address lines if conventionally addressed. The bars over the control lines indicate that the logic signal must be low to assert that action. For chip (a), the sequence to write data is

  1. Negate the OE line (logic level 1)
  2. Place the address and data levels on their respective lines
  3. Assert the CS line (logic level 0)
  4. Assert the WE line (logic 1 to logic 0)

There are timing details that are dependent on the memory specs. OE must be negated before the address data changes or data is applied to the data line(s). The chip select action may latch the address lines to ensure they remain constant during the write cycle, thus those signals may need to be present when CS is asserted. The final step is to trigger the D flip-flop(s) to receive data. This will likely occur on the transition from 1 to 0 on the WE line. After the write has occurred, the chip select and write enable signals might be negated allowing the preparation for the next memory operation.
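The write sequence for chip (a) can be sketched in Python as a small model with active-low control lines. The class and attribute names are hypothetical, real timing constraints are ignored, and the data is assumed to be stored on the 1 to 0 transition of WE, which the text describes only as likely.

```python
class ChipA:
    """Toy model of chip (a); control lines are active low (0 = asserted)."""

    def __init__(self):
        self.cells = {}
        self.cs_n, self.we_n, self.oe_n = 1, 1, 1   # all negated
        self.address, self.data = 0, 0

    def set_we_n(self, level):
        falling_edge = self.we_n == 1 and level == 0
        self.we_n = level
        # Store only when the chip is selected and its outputs are disabled.
        if falling_edge and self.cs_n == 0 and self.oe_n == 1:
            self.cells[self.address] = self.data

def write_byte(chip, address, value):
    chip.oe_n = 1                              # 1. negate OE
    chip.address, chip.data = address, value   # 2. place address and data
    chip.cs_n = 0                              # 3. assert CS
    chip.set_we_n(0)                           # 4. assert WE (1 -> 0)
    chip.set_we_n(1)                           # afterwards, negate WE ...
    chip.cs_n = 1                              # ... and CS for the next operation
```

A chip whose CS line is left negated ignores the WE transition, which is how several chips can share the same address, data, and WE lines.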

Memories of any size and shape can be constructed from collections of chips such as those above. Suppose we want a 16 MB memory and want to be able to access a 16 bit word in one read/write operation. We could use 32 of chip (a) or 32 of chip (b). In the first case, chips would be grouped in pairs (to provide one word in parallel). Each pair is called a bank. Bank 0 would provide the first megabyte of address space. Each successive bank would provide the next megabyte. The address data would be 24 bits, but we would always retrieve a word starting on a word boundary (an even address). Thus, the least significant bit of address data would be ignored. Of the remaining 23 bits, the lower 19 would connect (in parallel) to the address lines (A0-A18) of all of the chips. The extra 4 lines would go to a decoder to select the chips in one of the 16 banks. The memory unit would have 16 data lines (M0 - M15). In each bank, the first chip would connect D0-D7 to M0-M7, and the second chip would connect D0-D7 to M8-M15. The tri-state buffers, controlled by the decoder attached to the high 4 bits of address data, would logically connect only one bank at a time to the memory data lines.

To create the same memory organization with 32 1-bit wide chips, we would create 2 banks of 16 chips each. The 16 outputs of each bank would connect to M0-M15. Only one bank would be selected at a time. Again, we would only use even addresses. One bit (say the high bit) of the remaining 23-bit address data would be used as a bank select (bank 0 for the first 8 MB, bank 1 for the second 8 MB). The 22 remaining address bits would be separated into a row address and a column address, each presented with its own strobe. The upper 11 bits would select the row, and the lower 11 bits would select the column.
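The two address splits can be written out in Python. The function names are hypothetical, and using the upper 11 of the 22 bits as the row address in the second organization is an assumption; the hardware could equally assign them the other way around.

```python
def split_address_a(address):
    """16 banks of two chip (a) devices: return (bank, chip_address)."""
    word_index = address >> 1                    # ignore the least significant bit
    chip_address = word_index & ((1 << 19) - 1)  # lower 19 bits go to A0-A18
    bank = word_index >> 19                      # upper 4 bits go to the decoder
    return bank, chip_address

def split_address_b(address):
    """2 banks of sixteen chip (b) devices: return (bank, row, column)."""
    word_index = address >> 1                    # ignore the least significant bit
    bank = word_index >> 22                      # high bit selects the bank
    row = (word_index >> 11) & 0x7FF             # 11 bits presented with RAS
    column = word_index & 0x7FF                  # 11 bits presented with CAS
    return bank, row, column
```

For instance, byte address 0x100000 (the start of the second megabyte) falls in bank 1 of the first organization, while byte address 0x800000 (the start of the second 8 MB) falls in bank 1 of the second.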

RAM and ROM

Random Access Memory is so named because it takes roughly the same amount of time to access any memory location; there is little penalty for accessing addresses in a random order. Disk drives, by contrast, are more efficient when data is accessed sequentially (after the initial seek and rotational delays). SRAM (Static RAM) is what was discussed above. It takes about 6 transistors per bit to implement this type of memory. DRAM (Dynamic RAM) requires only one transistor per bit; however, access is slower, and the memory must be continually refreshed. SRAM has a typical access time of a few nanoseconds whereas DRAM may be 10 times slower. The lower component count makes high capacity DRAM less expensive. In practice, SRAM might be used for L2 Cache, and DRAM for primary storage.

There are also several types of DRAM. FPM (Fast Page Mode) DRAM uses the matrix access, but allows access to all of the bytes in a row with a single row strobe. EDO (Extended Data Out) DRAM added a pipelined access (memory requests could be overlapped) for increased speed. SDRAM (synchronous DRAM) and DDR (Double Data Rate DRAM) replace these earlier technologies in faster systems. These systems combine SRAM and DRAM, and increase data transfer rates by internalizing some of the control signals. These memories essentially include a simple controller that responds to higher-level requests by the CPU. They run in sync with the system clock.

Non-Volatile Memory

RAM is volatile; it loses its state when power is removed. There are several types of non-volatile memory that are used to hold instructions and data between uses, even with no power supply present. ROM (Read Only Memory) is designed for permanent storage and its data is stored during the manufacturing process. The PROM (Programmable ROM) can be programmed once, much like the PLA. EPROM (Erasable PROM) chips can be reprogrammed many times. Ultraviolet light is used to erase an EPROM through a quartz window on the top of the chip. Once the chip is erased, the window is usually covered, after which the chip can be reprogrammed. An EEPROM (Electrically Erasable PROM) can be erased using an electrical signal, allowing it to be reprogrammed without removing it from its circuit board. EEPROMs, however, are more expensive and have a slower access time than EPROMs. Flash memory is another type of reprogrammable storage. It must be erased in large blocks (EEPROM can erase individual bytes) and can be reprogrammed in place. Flash storage is the common memory used in digital camera cards, MP3 players, and USB thumb drives. Flash chips can store over 1 GB of data. Access time is approximately 50 nanoseconds. The main drawback is that they can only be erased about 100,000 times.

Microprocessor Systems - Digital Level

There are many CPUs available on single chips. These processors interconnect with other system components through their external connections, pins. Usually, processors have pins dedicated to the following activities:

Address lines are used to select memory locations or devices and data lines are for transferring data to and from the processor via the data bus. The bus control lines can select read or write operations and direct other devices on the bus. The bus arbitration lines allow contending devices to share the bus as needed. Interrupts are signals coming to the processor from external devices that notify the processor of the need for immediate attention. The processor usually interrupts normal tasks to handle interrupt requests. Some microprocessors work in conjunction with coprocessors for graphics or floating point operations. Special connections facilitate communication between these devices. Additional pins provide status information about the state of the processor, serial communications, or other special purpose connections. In addition, every processor needs a power supply (and ground) and most require an external clock signal.

Bus: A data pathway connecting components of a computer. Typically this consists of parallel traces on a circuit board, or a bundle of wires. Components connected to the bus must communicate according to a bus protocol and meet physical, electrical, and timing restrictions. Busses are commonly drawn on logic diagrams as a fat line (indicating a group of data and control lines) or a single line with a numbered slash across it indicating how many wires are in the bus.

Devices on a bus can be active (able to initiate communications) or passive (listen for requests from other devices). These are sometimes called master and slave devices. Devices can switch roles as required by their protocols. Bus receiver, bus driver, and bus transceiver are the terms for the electrical interfaces of a slave, a master, and a master/slave device, respectively. These allow the device to disconnect (electrically) from the bus, and to amplify signals as needed.

Busses usually require three subcomponents: data, address, and control. A multiplexed bus alternates address and data on the same physical bus to cut down on the size and cost of the actual bus. This reduction of cost usually implies a reduction of speed as well. Bus speeds are limited by a variety of factors; the most important are the length of the bus and the electromagnetic interactions of high frequency signals in close proximity to each other. Even the fact that some bus connections are shorter than others (because of corners) can cause difficulty in transmitting data that is supposed to be occurring in parallel.

Synchronous busses are driven by a regular clock signal. A bus cycle is a sequence of clock pulses and bus activities occur strictly under the control of these cycles. An asynchronous bus does not use a regular clock signal, but still utilizes a clock line to control activities. The clock signal in this case may vary in frequency and may be controlled by different devices.

The main data bus in a typical PC is a synchronous bus operating at 100 MHz (10 nanosecond clock cycles). The following timing diagram illustrates a typical read from memory.

The memory on this bus requires slightly longer than one clock cycle to complete a read operation. The read cycle will require three complete clock cycles. The process starts with the rising edge of the clock. The CPU asserts the address value on the address lines. They become stable a few nanoseconds after the start of cycle T1. This causes the appropriate memory chips to be selected. Halfway through the T1 cycle, the falling edge causes the CPU to lower the MREQ line (indicating that the memory should take control of the data bus) and the RD enable line (telling the memory unit to place the read data on the data lines). You can see the delay of the memory chip as the data takes about one and one-half cycles to become stable. The memory unit asserts the WAIT line to tell the CPU it is busy fulfilling the request. The memory unit negates WAIT when the data will be ready in the next clock cycle. The data is read by the CPU on the falling edge of T3. Once the CPU latches the data, it negates the MREQ and RD lines, causing the data to become invalid after a short delay. Each of the delays indicated on the timing diagram is available in the memory chip specifications.

The main advantage of asynchronous busses is that they can work at the speed of each device, while synchronous busses are limited to accommodate the slower devices on the bus. The following timing diagram illustrates how a read cycle would be accomplished with an asynchronous bus. The actual duration will depend on the speed of the memory unit.

The CPU (acting as the bus master) places the address data on the address lines, and when stable, asserts MREQ and RD. In addition, it asserts MSYN (Master Sync). The memory unit, on seeing that it is the recipient of the request (the slave), and that the Master unit has asserted MSYN, begins to fulfil the request. As soon as the memory unit is sure that the data is available, it asserts SSYN (Slave Sync). When the CPU is able, it latches the data and negates MREQ, RD, and MSYN. The memory unit, seeing MSYN negated, negates its SSYN signal. This releases the bus for the next operation. This type of interlocked access is called full handshaking.
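The four steps of the handshake can be traced with a small Python sketch. The bus is modeled as a dictionary of signal levels (1 = asserted), the memory as a plain dictionary, and the function name is hypothetical; real signals would of course change concurrently rather than in a fixed program order.

```python
def async_read(memory, address):
    """Walk through one fully interlocked read; return (value, bus, log)."""
    bus = {"MREQ": 0, "RD": 0, "MSYN": 0, "SSYN": 0, "address": None, "data": None}
    log = []

    # 1. Master places the address, then asserts MREQ, RD, and MSYN.
    bus["address"] = address
    bus["MREQ"] = bus["RD"] = bus["MSYN"] = 1
    log.append("master asserts MREQ, RD, MSYN")

    # 2. Slave sees MSYN, supplies the data, then asserts SSYN.
    if bus["MSYN"]:
        bus["data"] = memory[bus["address"]]
        bus["SSYN"] = 1
        log.append("slave asserts SSYN")

    # 3. Master sees SSYN, latches the data, and negates its signals.
    value = None
    if bus["SSYN"]:
        value = bus["data"]
        bus["MREQ"] = bus["RD"] = bus["MSYN"] = 0
        log.append("master latches data, negates MREQ, RD, MSYN")

    # 4. Slave sees MSYN negated and negates SSYN, releasing the bus.
    if not bus["MSYN"]:
        bus["SSYN"] = 0
        log.append("slave negates SSYN")

    return value, bus, log
```

Each step waits on a signal produced by the previous one, so no step depends on a clock period; that is what makes the handshake fully interlocked.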

If more than one bus master wants access to the bus, bus arbitration must take place. Centralized arbitration implies a special arbiter circuit. Each potential bus master can assert a request on one of the bus request lines (several devices may use the same line). These lines have a priority associated with them. The arbiter asserts a GRANT signal which passes through each connected device, until one accepts the signal. If 2 devices use the same priority request line, then the closer device will accept the grant signal. This type of connection is often called daisy chaining.
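The outcome of centralized, daisy-chained arbitration can be sketched in Python. The sketch assumes that a lower request line number means higher priority and that the devices are listed from closest to the arbiter to farthest; both conventions, and the function name, are illustrative.

```python
def grant_bus(requests):
    """Pick the next bus master.

    requests: one (request_line, wants_bus) pair per device, ordered from
              the device closest to the arbiter to the farthest.
    Returns the position of the device that accepts the grant, or None.
    """
    active_lines = [line for line, wants in requests if wants]
    if not active_lines:
        return None                    # nobody asked, so no grant is issued
    granted_line = min(active_lines)   # highest-priority line with a request
    # The grant passes device to device until a requester on that line takes it.
    for position, (line, wants) in enumerate(requests):
        if wants and line == granted_line:
            return position
    return None
```

With two requesters on the same line, the one listed first (the closer one) always wins, which is the fixed positional priority inherent in daisy chaining.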

Decentralized arbitration can be prioritized by level, or by position. The simpler scheme is shown here.

The current bus master asserts the Busy signal. The arbitration line is normally asserted sequentially through the devices. If a device wants the bus, it must wait for two things: an asserted arbitration line IN and a negated Busy line. A device waiting for the bus will negate the arbitration line out, so the leftmost device will have priority in getting control of the bus; no devices to the right will see an asserted Arbitration IN signal. As soon as the bus is idle, the new Master asserts the Busy line and allows the Arbitration signal to pass through to the next device. Arbitration occurs while the current bus master is fulfilling its task.

Bus Operations

Bus operations require different times and different sequences of operations depending on the type of transfer. In addition to the single memory read discussed earlier, there may be a block transfer cycle. Once an address has been determined (or a block of data cached by the memory unit) the bus might sequentially transfer each word in the block without having to transfer and wait for address data. Multiprocessor systems sometimes need to synchronize on a memory location to ensure they do not corrupt shared resources. A special read-modify-write cycle allows modification of a word in memory without the possibility of another device reading it before the modify takes place. Special bus cycles to facilitate interrupts are also common.

An interrupt controller allows many interrupts to trigger the processor's interrupt request line (INT). The CPU (when ready to service the interrupt) responds to an interrupt request with an interrupt acknowledge (INTA). The interrupt controller then places the interrupt number on the data bus, using a special bus cycle. The CPU looks up the interrupt number in an Interrupt Vector Table and initiates the procedure to service the interrupt. Special bus cycles allow the CPU to send data to the interrupt controller.
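The INT / INTA exchange and the vector table lookup can be sketched in Python. The class and function names are hypothetical, pending requests are served in arrival order (real controllers usually apply priorities), and the handlers are ordinary Python functions standing in for service routines.

```python
class InterruptController:
    """Collects interrupt requests from many devices onto one INT line."""

    def __init__(self):
        self.pending = []

    def raise_irq(self, number):
        self.pending.append(number)

    @property
    def int_line(self):
        return bool(self.pending)      # INT is asserted while anything is pending

    def acknowledge(self):
        # On INTA, the controller supplies the interrupt number (the data bus).
        return self.pending.pop(0)

def service_next(controller, vector_table):
    """CPU side: acknowledge, look up the handler, and run it."""
    if not controller.int_line:
        return None
    number = controller.acknowledge()
    handler = vector_table[number]     # Interrupt Vector Table lookup
    return handler()

VECTOR_TABLE = {1: lambda: "keyboard serviced", 3: lambda: "timer serviced"}
```

After `raise_irq(3)`, a call to `service_next(controller, VECTOR_TABLE)` runs the handler registered under number 3 and leaves the INT line negated.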

Example Processors

Pentium 4

Although the Pentium 4 is a 32-bit processor, it does have an 800 MHz internal system bus, and can access memory in 64-bit words. This processor is also a departure from previous Pentiums in that it uses the NetBurst microarchitecture, which has a 20 stage pipeline, includes a "rapid execution engine" (2 or 4 double speed ALUs acting like multiple processors but in the same CPU), and supports hyperthreading (it has two sets of registers). The L1 SRAM Cache holds 8 KB of data. In addition, there is an execution trace cache that holds the microcode for instructions (decoded as they are fetched). This microcode is already optimized and is executed directly in the RISC core. There is a 1 MB L2 cache, and some models include a 2 MB L3 cache.

Because the ALUs operate in parallel, it is possible that a data value needed by one ALU will have been modified by the other ALU. The modified value could be in any of the three levels of cache, or in primary memory. The CPUs snoop (monitor) the memory bus for addresses of cached words. Memory requests for stale data are actually supplied from the other processor's cache instead of coming from memory.

The Pentium 4 generates a lot of heat (the penalty for packing more components into a small space and running at very high frequencies) and requires special packaging. The power requirements are similar to those of a 75 Watt light bulb. One package format is a 478 pin, 3.5 cm square. A hefty heat exchanger with a fan is attached to the top to help dissipate the heat.

Of the 478 pins, only 198 are used for signals. The rest are for power and ground connections. Some signals are duplicated on 2 pins, so there is a total of 56 distinct signals.

Pentium 4 signals

Signal group | Signal name | Description
Bus arbitration | BPRI# (in), LOCK# (in/out), BR0# (out) | BR0# is the bus request line. The BPRI# signal is used by other devices as a high priority request for the bus. The LOCK# signal means the bus is in use.
Request | A# (33 in/out), ADS# (in/out), REQ# (5 in/out) | The bus master asserts a 36-bit address on A# (the least significant 3 bits are always 0 as the bus is a 64-bit data path). The ADS# signal is asserted to inform the slave that the address data is valid. The REQ# bits specify the type of bus cycle for the transfer.
Error | (5 in/out) | Status lines used to assert error conditions.
Response | RS# (2 in/out), TRDY# (in), BNR# (in/out) | RS# is the status code of the bus slave unit. TRDY# is asserted when the slave is ready to accept data. The BNR# signal means the bus is not ready and causes wait states to be inserted.
Data | D# (64 in/out), DRDY# (in/out), DBSY# (in/out) | The data transferred on the bus is placed on the 64-bit D# connection. DRDY# is asserted when the data is stable. DBSY# is asserted while the bus is busy so other devices will wait for access. Additional lines check parity and control some of the fine details of the data transfer.
Other | RESET# (in) | This signal resets the processor, causing it to begin executing a program at a specific address. This signal occurs during power-up and when the reset button on the PC is pressed.
Other | Interrupts (3 in) | These pins provide external devices the capability to interrupt the processor.
Other | Power and heat management (18 in/out) | Communications regarding current power levels and the state of the processor such as sleep or low-power modes. Temperature data is available from the chip on these lines.
Other | Clock Frequency (5 in/out) | Signals used to determine the frequency of the system bus (which operates at different speeds for different transactions).
Other | Misc | Other signals provide debugging connections, startup information, snooping, etc.

Besides the processor's 20-stage instruction pipeline, the memory bus is also pipelined. There are 6 phases.

  1. Bus arbitration - This is when the bus master is selected.
  2. Request - At this stage the address is placed on the bus by the new bus master and the slave is identified.
  3. Error reporting - The slave can report a parity (or other) error during this cycle.
  4. Snooping - At this point, a memory request for a value that is already in cache is modified to provide the current data.
  5. Response - The slave indicates that it is ready to fulfil the request.
  6. Data - The data transfer is actually completed.

The diagram illustrates the advantage of pipelining. It also shows how a delay in one transaction can introduce a delay that affects subsequent transactions for a short period of time. Each phase uses distinct control lines, so a later transaction must always wait for the previous transaction to complete a phase before it can begin that phase.
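The scheduling rule in the paragraph above can be sketched in a few lines of Python. This is a simplified model and not the actual Pentium 4 bus protocol: it assumes each phase normally takes one bus cycle, and the names `schedule` and `extra` are made up for illustration.

```python
PHASES = ["arbitration", "request", "error", "snoop", "response", "data"]

def schedule(n_transactions, extra=None):
    """Return finish[t][p]: the bus cycle in which transaction t completes phase p.

    extra maps (transaction, phase_index) to additional wait cycles
    (hypothetical stalls, such as a slow slave). Every phase is otherwise
    assumed to take exactly one cycle.
    """
    extra = extra or {}
    finish = []
    for t in range(n_transactions):
        row = []
        for p in range(len(PHASES)):
            # A phase can start only after this transaction has finished its
            # previous phase AND the previous transaction has finished this
            # same phase (each phase has its own control lines).
            prev_phase_done = row[p - 1] if p > 0 else 0
            prev_txn_done = finish[t - 1][p] if t > 0 else 0
            start = max(prev_phase_done, prev_txn_done)
            row.append(start + 1 + extra.get((t, p), 0))
        finish.append(row)
    return finish

no_stall = schedule(3)
stalled = schedule(3, {(0, 4): 2})   # 2 wait cycles in transaction 0's response phase
print([row[-1] for row in no_stall])
print([row[-1] for row in stalled])
```

In this model, three overlapped transactions complete in cycles 6, 7, and 8 rather than 6, 12, and 18. A two-cycle stall in the first transaction's response phase pushes the transactions behind it back as well.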

UltraSPARC III

The UltraSPARC III is a 64-bit RISC processor packaged in a 1368-pin Land Grid Array. Internally it has two L1 caches: 32 KB for instructions and 64 KB for data. The interface to L2 cache is buffered to improve speed. Although support for L2 cache is included with the processor, the cache itself must be added as a separate component outside the processor. The data path between the processor and L2 cache is 256 bits wide. The UltraSPARC system bus runs at 150 MHz and is 128 bits wide; addresses are 43 bits wide, allowing 8 TB of memory.
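The 8 TB figure follows directly from the address width. A quick check, assuming byte addressing (the function names are illustrative only), with the Pentium 4's 36-bit address included for comparison:

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits

def human(n_bytes):
    """Format a power-of-two byte count using binary units (1 KB = 1024 B)."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    i = 0
    while n_bytes >= 1024 and i < len(units) - 1:
        n_bytes //= 1024
        i += 1
    return f"{n_bytes} {units[i]}"

for name, bits in [("Pentium 4", 36), ("UltraSPARC III", 43)]:
    print(f"{name}: {bits} address bits -> {human(addressable_bytes(bits))}")
```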

The processor design includes UPA (Ultra Port Architecture), which is used to create multiprocessor systems. UPA specifies a communication protocol between processors. The UDB (UltraSPARC Data Buffer) II chip is a memory buffer and controller that provides the ability to effect memory transfers in larger blocks, maximizing speed. This arrangement frees the processor from some of the memory management tasks. The diagram below shows the major components making up the memory subsystem of an UltraSPARC II system.

L1 cache contains the most accessed cache lines (data and instructions are separated). The L2 cache is a random collection of cache lines (each 64 bytes), some data, some instructions. The cache tags are an index to the cache lines currently in L2 cache. When the processor accesses a memory location and it is not in L1 cache (a cache miss), the L2 tags are checked. If found, the line is transferred to L1 cache. Otherwise a memory request is made. This starts with bus arbitration. Once access is granted, the address and request type are asserted. The UPA can handle two separate transaction streams, each with multiple requests. The interface must access cache in different processors to synchronize memory contents. Memory access heavily uses the UDB II chip to queue memory requests for efficiency.
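The lookup order described above (L1, then the L2 tags, then main memory) can be sketched as follows. This is a toy model under stated assumptions: the caches are unbounded dictionaries with no eviction, a miss fills both L2 and L1, and the class name `TwoLevelCache` is hypothetical. It does not model bus arbitration, the UDB II, or multiprocessor synchronization.

```python
LINE_SIZE = 64  # bytes per cache line, as stated in the text

class TwoLevelCache:
    def __init__(self, memory):
        self.memory = memory  # dict: line number -> contents (stand-in for main memory)
        self.l1 = {}          # line number -> contents
        self.l2 = {}          # the dict keys play the role of the L2 tags
        self.log = []         # records where each read was satisfied

    def read(self, address):
        line = address // LINE_SIZE
        if line in self.l1:
            self.log.append("L1 hit")
        elif line in self.l2:
            # L1 miss, found via the L2 tags: copy the line into L1.
            self.log.append("L2 hit")
            self.l1[line] = self.l2[line]
        else:
            # Miss in both levels: a memory request is made.
            self.log.append("memory request")
            data = self.memory[line]
            self.l2[line] = data  # assumption: the fill goes to both levels
            self.l1[line] = data
        return self.l1[line]

cache = TwoLevelCache({0: "line 0", 1: "line 1"})
cache.read(70)    # first touch of line 1
cache.read(100)   # same line again
print(cache.log)
```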

8051

The 8051 processor is quite different from the previous two systems. There are only 40 external connections, many of them dual purpose. Below is a block diagram of the 8051 main components.

The 4 I/O ports can be used for a variety of purposes. In a particularly simple configuration they can directly drive LEDs, LCD displays, or a small speaker, or read the state of a button or switch. These ports are also used to connect to external memory (if needed). Ports 0 and 2 can be used to efficiently access external memory. They serve as address and data lines. The address is asserted first and is often latched by the memory unit. In a write cycle, the data is then placed on P0 and a write enable signal given. On reads, the data is read on Port 0. Special instructions are included in the instruction set for external data access, and special bus cycles are utilized when external memory is accessed during the fetch cycle. The 8051 has pins for RD (read enable), WR (write enable), ALE (address latch enable), PSEN (program store enable), and EA (external access).
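The address-then-data sequence can be illustrated with a small simulation. It assumes the usual 8051 arrangement, which the text above does not spell out: Port 0 carries the low address byte and then the data, while Port 2 holds the high address byte. The `ExternalMemory` class and its method names are hypothetical stand-ins for an external RAM with an address latch.

```python
class ExternalMemory:
    """Stand-in for an external RAM plus an address latch (hypothetical model)."""

    def __init__(self):
        self.cells = {}
        self.latched_low = 0

    def ale(self, p0):
        # ALE pulse: capture the low address byte currently on Port 0.
        self.latched_low = p0 & 0xFF

    def wr(self, p2, p0):
        # WR pulse: Port 0 now carries data; Port 2 still holds the high byte.
        self.cells[((p2 & 0xFF) << 8) | self.latched_low] = p0 & 0xFF

    def rd(self, p2):
        # RD pulse: the memory drives the addressed byte back onto Port 0.
        # Returning 0xFF for unwritten cells is an arbitrary choice.
        return self.cells.get(((p2 & 0xFF) << 8) | self.latched_low, 0xFF)

def write_byte(mem, address, value):
    p2, p0 = address >> 8, address & 0xFF  # the address is asserted first
    mem.ale(p0)                            # the low byte is latched
    mem.wr(p2, value)                      # then the data is placed on P0

def read_byte(mem, address):
    p2, p0 = address >> 8, address & 0xFF
    mem.ale(p0)
    return mem.rd(p2)

mem = ExternalMemory()
write_byte(mem, 0x1234, 0xAB)
print(hex(read_byte(mem, 0x1234)))
```

Multiplexing Port 0 this way is what lets a 40-pin package reach 16-bit addresses with 8-bit data using only two ports.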

Port 3 has dual purposes for each of its pins. Two are used as input to internal counters (timers). Two serve as transmit and receive lines for an internal UART. Two are used to allow external devices to supply an interrupt signal to the processor. The final two are the RD and WR signals used with external data memory.

Of course, there are power supply pins, a reset signal, and connections to control the system clock via an internal or external oscillator.

Bus Technologies

Three basic bus technologies are found on most personal computers: ISA, PCI, and USB. IBM developed and used the IBM PC Bus in 1981. This de facto standard was copied, making it easy for third party vendors to develop and supply peripheral devices that would work on nearly any clone of the IBM PC. This bus used 62 lines (including 20 address and 8 data lines).

When the 80286 processor was developed and used in the next generation of PCs (PC/AT, 1984), the limitation of the 8-bit data bus became a problem. Rather than replacing the original bus technology, IBM simply extended it by adding 36 more lines, allowing 24-bit addresses and 16-bit data transfers. This was accomplished by adding a second tab to the new boards, and a corresponding connector on the circuit boards. This bus could handle the older 8-bit cards as well as the newer 16-bit cards. With the introduction of the PS/2, IBM decided to employ a new proprietary bus (to kill off the cloners) called Microchannel. The rest of the computer companies standardized the PC/AT Bus concept, naming it ISA (Industry Standard Architecture).

ISA was later expanded to a 32-bit data bus called EISA (Extended ISA).

In 1990, Intel introduced the PCI (Peripheral Component Interconnect) Bus and made it available for everyone to use. This action led to almost universal adoption in the PC clone world. The PCI bus was originally a 32-bit bus. Newer versions use 64-bit data transfers. To allow compatibility with earlier I/O cards, many PCs include bridges that interconnect specialized memory busses with both PCI and ISA busses.

Video controllers gradually began to require greater bandwidth than could be attained using the PCI bus, so a new bus called AGP (Accelerated Graphics Port) was introduced. Most modern systems include all of these busses, interconnected by one or more bridge chips. The main bridge separates the memory (main and graphics) functionality from I/O (PCI and secondary storage drivers ATAPI) with fast internal interconnections. ATAPI is Advanced Technology Attachment Packet Interface and is now more commonly known as ATA (Advanced Technology Attachment).

The PCI bus supports 64-bit communication between a master (initiator) and a slave (target). The address and data are multiplexed, reusing the same 64 lines. Overall, PCI cards have 120 (or 184) connectors. Data transfers occur in several cycles. First the initiator asserts an address. The next cycle includes control information. For example, if a read is requested, the control of the bus is exchanged with the target during this cycle. During subsequent cycles, the sender can supply one or more data values in sequence until the end of a data frame is signaled. Wait states are used if devices need to delay responses.
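As a rough illustration of how the shared address/data lines are reused from cycle to cycle, the sketch below lists what a simplified read looks like. It ignores the control signals entirely, and the function name `pci_read_trace` and the trace format are invented for this example.

```python
def pci_read_trace(address, data_words, wait_states=0):
    """List (driver, contents) of the shared AD lines for each clock of a read."""
    trace = [("initiator", f"address {address:#010x}")]  # cycle 1: address
    trace.append(("nobody", "turnaround"))               # bus control passes to the target
    trace += [("target", "wait")] * wait_states          # target not ready yet
    trace += [("target", f"data {w:#010x}") for w in data_words]
    return trace

for cycle, (driver, contents) in enumerate(pci_read_trace(0x1000, [0xDEADBEEF], 1), 1):
    print(cycle, driver, contents)
```

The turnaround cycle exists because the same lines change direction: the initiator drives them for the address, and the target drives them for the data.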

The PCI bus requires arbitration to allow different devices to be initiators. A request is sent to the arbiter and is granted by the arbiter. This occurs in parallel with bus usage by the current initiator so no clock cycles are wasted on arbitration. The arbiter can also request that a device release control of the bus in order to fulfill high-priority requests.

The major control signals for the PCI bus can be seen in the following timing diagram. A typical read (of one word) followed by a write (of one word) is shown. An idle frame often occurs between transactions, although it is not always required. In the first transaction, the initiator supplies address and command information and asserts the FRAME# signal. The C/BE# signal is used to select certain bytes for reading (only the requested bytes need to be supplied by the slave). The presence of this signal is flagged on the IRDY# line. In the last cycle, the target device asserts the data (on the AD lines) and informs the initiator through the DEVSEL# and TRDY# signals that it is available. If more time is needed, the TRDY# signal would occur in a later clock cycle (introducing wait states). The write transaction is similar except that the data is supplied in the second cycle of the frame.

The PCI-Express architecture utilizes a fast switching technology in place of bus arbitration. In addition, the communication with the peripheral devices is handled serially, greatly reducing the physical bus connections. Communications in the Express model are packet based. This eliminates the need for the large number of control lines. Packets include a header (control information) and a payload (the data). One additional feature of this architecture is the ability to add or remove devices without powering down the system. The PCI Express protocol is based on networking models and can be viewed at different layers.

At the physical layer, the meaning of a sequence of bits is defined. The basic communication between the switch and attached devices is asynchronous (meaning they use separate clocks). Two extra bits are added to each byte so that the receiver can stay synchronized with the data. At this layer, frames are transmitted which include the start and stop frame signals and the actual packet.
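The cost of the two extra bits per byte is easy to quantify: every 8 data bits travel as 10 bits on the wire, so only 80% of the raw bit rate carries data. The short calculation below assumes a raw rate of 2.5 Gbps per lane, which is the first-generation PCI Express figure and is not stated in the text above; the function name is illustrative.

```python
def payload_rate(raw_bits_per_second, data_bits=8, line_bits=10):
    """Usable data rate after line-coding overhead (8 data bits sent as 10)."""
    return raw_bits_per_second * data_bits / line_bits

raw = 2.5e9  # assumed raw rate of one first-generation lane, in bits per second
usable = payload_rate(raw)
print(usable / 1e9, "Gbps of data")
print(usable / 8 / 1e6, "million bytes per second")
```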

The link layer breaks the packet into a sequence number, the header and payload section, and error-detection data (CRC or Cyclic Redundancy Check). An acknowledgement packet is returned so the sender is aware the packet was received. This layer handles retransmission requests and matches bandwidth of devices through a flow control mechanism.
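A minimal sketch of this framing, in Python, is shown below. The field sizes (a 16-bit sequence number and a 32-bit CRC) and the use of `zlib.crc32` are assumptions chosen for convenience; they stand in for, and do not reproduce, the actual PCI Express link-layer format. The function names are hypothetical.

```python
import struct
import zlib

def make_link_packet(seq, tlp):
    """Prefix a sequence number and append a CRC to a transaction-layer packet."""
    body = struct.pack(">H", seq) + tlp
    return body + struct.pack(">I", zlib.crc32(body))

def receive_link_packet(packet, expected_seq):
    """Return (reply, tlp). reply is 'ACK' on success, 'NAK' to request a resend."""
    body = packet[:-4]
    (crc,) = struct.unpack(">I", packet[-4:])
    (seq,) = struct.unpack(">H", body[:2])
    if zlib.crc32(body) != crc or seq != expected_seq:
        return "NAK", None   # damaged or out of sequence: ask for retransmission
    return "ACK", body[2:]   # strip the sequence number, pass the packet up

packet = make_link_packet(7, b"header+payload")
print(receive_link_packet(packet, 7))

# Flip one bit in transit: the CRC check fails and the receiver replies NAK.
damaged = packet[:5] + bytes([packet[5] ^ 0x01]) + packet[6:]
print(receive_link_packet(damaged, 7))
```

The CRC only detects the damage. Recovery comes from the sender retransmitting the packet after the NAK, which is why the sender must keep unacknowledged packets until they are acknowledged.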

The transaction layer breaks high-level transactions into smaller packets, or reassembles packets into transactions. This layer handles out-of-order packets and adds a lot of flexibility to utilize multiple communications channels and attach priorities to requests.

The software layer can provide virtually any functionality to the system. One example is a simple emulation of a PCI Bus, allowing PCI based operating systems to utilize the PCI Express architecture without modification.

The USB (Universal Serial Bus) is designed to attach slower speed devices. It was introduced in 1998 and has created a large market for all types of interesting devices. The greatest advantages of USB over the PCI or ISA alternatives are stated in the original design goals motivating the USB architecture.

Most of these goals are met in the USB market, although the cables have a variety of connections at the device end. The heart of the USB system is the root hub that provides the connection to the main system bus. This hub can provide connections to devices or other hubs. The bus itself consists of 4 wires (+5V, GND, D+, and D-). Devices are assigned addresses (1-127) when they are physically attached to the root. This is done through communications with the operating system. If the operating system will support the device, configuration data is transmitted to it and stored internally.

The serial communication between the USB device and the root hub is logically divided into up to 16 sub-pipes per device. Each of these can be used for different types of data. The root hub keeps the communications synchronized by transmitting frames to all devices at more or less regular intervals. Within each frame is a set of packets. The first is always from the root. Subsequent packets may be from the device. There are 4 basic frame types.

Each frame consists of one or more packets. The first is always a SOF (Start of Frame) packet. There are 4 packet types:

USB 2.0 adds a third, higher speed of 480 Mbps to the two available in the USB 1.1 standard, and defines a new interface called EHCI (Enhanced Host Controller Interface).