|
|
|
|
|
|
|
We would like computers to run faster. Nothing can go faster than light. The speed of light is 186,000 miles per second which approximates 1 foot per nano second. Electrons and photons cannot move faster. Another problem is heat dissipation. And as transistor sizes become smaller, each transistor will have fewer atoms. This can cause quantum mechanical effects (Heisenberg uncertainty principle). It would be easier to construct 1000 CPUs with a cycle time of 1 nsec than to build 1 CPU with a cycle time of 0.001 nsecs. Both will have the same capacity. This is called parallel computing.
Pipelining and superscalar designs can gain a factor of about 10 over sequential designs. But for larger factors (even a million) entire CPUs (or a portion of them) must be replicated.
DESIGN ISSUES
They are the nature, size, and number of the processing elements and of the memory modules, and how the two are interconnected. The processing elements can range from minimal ALUs through complete CPUs, with sizes ranging from a small portion of a chip to a cubic meter of electronics per element. Memory systems are split up into small or large modules that operate independently of one another in parallel to allow access by many CPUs at the same time.
Multiprocessors
All CPUs sharing a common physical memory is called a multiprocessor or shared memory system. They share a single virtual address space mapped onto the common memory. Any process can read or write a word of memory by executing a LOAD or STORE instruction. Two process can communicate by one writing data to memory, and the other reads them back.
Multicomputers
If each CPU has its own private memory not accessible to other CPUs, then it is called multicomputer or distributed memory system. Each CPU can access its memory with LOAD and STORE instructions not accessible to other CPUs.
Interconnection Networks
The five components are CPUs, Memory Modules, Interfaces, Links, and Switches. Links are physical channels over which the bits move. They are electrical or optical fiber and serial (1-bit wide) or parallel (more than 1-bit wide). The maximum bandwidth is the number of bits per second it is capable of transferring. They can be simplex (unidirectional), half duplex (one way at a time), or full duplex (both ways at once). Switches are devices with multiple input and output ports. When a packet (explicit message from CPU to memory requesting data, or reply by memory) arrives at an input port, some bits in the packet are used to select the output port to which the packet is sent. a packet might be as short as 2 bytes, or as long as 8 Kilobytes.
The topology of an interconnection network describes how the links and
switches are arranged. Some common topologies:
1. Star where all connected to a host.
2. A complete interconnected where all are connected
to
each other.
3. A tree
4. A ring
5. A grid
6. A double torus
7. A cube, and
8. A 4D hypercube.



Topological designs can be modeled as graphs, with the links as arcs and the switches as nodes. The number of links is the degree or fanout of the node. Greater the fanout, the more routing choices are there and greater the fault tolerance. If every node has k arcs, the network will be fully connected even if k-1 links fail. If the distance between two nodes is measured by the number of arcs that have to be traversed to get from one to the other, then the diameter of a graph is the distance between the nodes that are farthest apart. The diameter of an interconnection network is related to the worst-case delay when sending packets from CPU to CPU or from CPU to memory. The smaller the diameter, the better the worst-case performance is. The amount of data moved per second is its transmission capacity. measured as bisection bandwidth. The network is partitioned into two equal (in terms of number of nodes) but unconnected parts by removing a set of arcs from its graph. The total bandwidth of the removed arcs is then computed. The bisection bandwidth is the minimum of all possible partitions. If it is say 800 bits/sec. then, if there is a lot of communication between the two halves, the total throughput may be limited to only 800 bits/sec. in the worst case. Goal is to maximize the bisection bandwidth. Dimensionality is the number of choices to get from the source to the destination. If there is only one path (no choice) it is zero. With one dimension (example east or west) it is one dimensional. With two axes (east west or north south) the network is two dimensional.
A star network is zero dimensional since all arcs are routed through the central node. This could cause a major bottleneck. A full interconnect is also zero dimensional. But the bisection bandwidth is maximum, the diameter is minimum, and is exceedingly fault tolerant. With eight nodes, it can lose any six links and still be fully connected. In a tree (also zero dimensional) the bisection bandwidth is equal to the link capacity. The top nodes will experience bottlenecks due to the heavy traffic there. This can be counteracted by giving the upper links more bandwidth. The lowest level can have a capacity b, the middle level 2b and the top level 4b, known as fat tree. The ring is one dimensional (packet can go left or right) The grid mesh is two-dimensional, and the diameter increases as the square root of the number of nodes. A double torus is a grid with the edges connected. It is more fault tolerant and the diameter is less as the opposite corners can communicate in only two hops. A k x k x k cube has three-dimensional topology. A n-dimensional cube is called a hypercube.
For more details: Topology and switching

The job of a switch is to accept packets arriving on any input port and send each one out on the correct output port. Each output port is connected to an input port belonging to another switch by a serial or parallel link. Serial links transfer one bit at a time, while parallel links transfer multiple bits at once and have signals for controlling the link. Parallel links have a higher performance than serial links but have a problem of skew (making sure that all bits arrive at the same time) and are more expensive. In circuit switching, before a packet is sent, the entire path from the source to the destination is reserved in advance. All ports and buffers are claimed, so bits can move at full speed. Like a parade with side streets blocked off, it requires advance planning. But competing is prohibited even if there are no packets on the reserved route. Store-and-forward packet switching does not require advance reservation. The source sends a complete packet to the first switch where it is stored. Then the entire packet is moved to another switch and so on to its destination. The packets must be buffered because when a source (CPU, memory, or another switch) presents a packet, the output port may be busy transmitting another packet. Without buffering, the packet would be dropped causing an unreliable interconnection network. But if the packet is waiting for a port, the packets behind are stuck, a situation called head-of-line blocking. (Like a first car in a two lane road turning left blocking subsequent cars even those wishing to go straight or turn right). Head-of-line blocking can be eliminated by output buffering by associating buffers with output rather than input ports. So packets destined for port m cannot block packets destined for port n. Input and output buffering have a fixed number of buffers. If more packets must be stored than there is room, packets will be dropped. This problem is solved by common buffering, where a single pool of buffers is dynamically allocated to ports as needed. This scheme requires a more complex administration to keep track of buffers, and allows one busy connection to hog all buffers starving other connections. In addition, each switch needs to be able to increase memory requirements and drive down the packet size. It is flexible and efficient but has the problem of increased latency (delay). A hybrid network has some of the properties of circuit switching and some of the properties of packet switching. Each packet can be logically divided into smaller units. As soon as the first unit arrives at a switch, it can be moved to the next switch even before the tail of the packet has arrived. In virtual cut through routing, when the first unit of a packet cannot move, the rest of the packet continues to pour in. In wormhole routing, when the first unit cannot go forward, to source is told to stop transmitting, so the packet may end up being strung over two or more switches like a worm. When the necessary resources become available, the packet can go forward.
This is the rule that determines which sequence of nodes a packet must follow from the source to the destination. A deadlock occurs when multiple packets in transit at the same time have claimed resources in such a way that none of them can make any forward progress and all will remain blocked forever. This can happen if four CPUs are in a grid and each is trying to send a packet to the one diagonally opposite. Each one has managed to reserve the input and output pos on its local switch and the one input port on the second switch, but is unable to get the output port on the second switch, so it waits until that is available. If all four CPUs start simultaneously, all of them will block, and the network will hang forever. In source routing, the source determines the full route through the interconnection network in advance. In distributed routing, each switch makes a decision about which port to send each arriving packet to. The routing is static if the same decision is made for each packet and adaptive if the current traffic is taken into account. A deadlock free algorithm is dimensional routing where the packet is moved along the x-axis to its correct coordinate, then along the y-axis and so on to its destination.
Hardware Metrics
From a hardware perspective, the performance metrics of interest are the CPU and I/O speeds and the performance of the interconnection network. They involve latency and bandwidth. The roundtrip latency is the time it takes the CPU to send a packet and get a reply. If sent to memory, then it is the time to read or write a word or a block of words. If sent to another CPU, it is the interprocessor communication time. For circuit switching, it is the sum of the setup time and the transmission time. If the total setup time is Ts, the packet size is p bits, and the bandwidth b bits/sec. the latency is Ts + p/b for one-way reply and Ts + 2p/b for full duplex (sending a p bit packet and getting a p bit reply). For packet switching with an internal setup time of Ta, a delay of Td per switch and n switches, the one-way delay is Ta + n(p/b + Td) +p/b where the final term is due to the copy from the last switch to the destination. For virtual cut through and wormholes, the one-way latencies are Ta + p/b as there is no delay. In addition to the bisection bandwidth, there is aggregate bandwidth, which is the sum of the capacities of all the links, and is the maximum number of bits that can be in transit at the same time. The average bandwidth must be considered.
Software Metrics
If a program runs for T sec on a uniprocessor, with a fraction f of this time being a sequential code, and a fraction (1 - f) being potentially parallelizable. if the latter code can be run on n CPUs with no overhead, its execution time can be reduced from (1 - f)T to (1 - f)T/n. This gives a total execution time of fT + (1 - f)T/n. The speedup is the execution time of the original program, T, divided by the new execution time.
Speedup =
n
1 + (n - 1)f
For f = 0, there is linear speedup, but for f > 0, perfectspeedup
is not possible. This is known as Amdahl's
law
Achieving High Performance
The simplest way is to add more CPUs, but without creating bottlenecks. A system in which more CPUs are added with more computing power is called scalable. If there are 4 CPUs connected by a bus, adding 12 more to the same bus will yield 16 CPUs. If the bandwidth of the original bus was b MB/sec, quadrupling the number of CPUs will reduce the available bandwidth per CPU from b/4 MB/sec to b/16 MB/sec, and so is not scalable. But if the four CPUs were in a grid system, adding new CPUs also adds more links. The ratio of links to CPUs increases from 1.0 with 4 CPUs (4 CPUs, 4 links) to 1.5 with 16 CPUs (16 CPUs, 24 links). So adding CPUs improves the aggregate bandwidth per CPU. But while the diameter or latency does not increase in a bus system, they do increase in a grid system. For a nx n grid, the diameter is 2(n - 1), so latency increases as the square root of the number of CPUs. For 400 CPUs the diameter is 38, while for 1600 it is 78. So quadrupling the number of CPUs doubles the diameter and the latency. In case of a hypercube it grows logarithmically. If a program needs data that are not in local memory, there is a substantial delay, with the bigger the system, the longer the delay. If copies of a block of data can be kept at multiple locations, access from these locations can be speeded up. One such replication technique is caching, in which one or more copies of data blocks are kept close to where they are being used, as well as where they belong. Another strategy is to maintain multiple peer copies (equal status). They can be dynamically placed on demand by the hardware, or at load time following compiler directives.
Another technique for hiding latency is prefetching. If a data item is fetched before it is needed, the fetching process can be overlapped with normal execution, so that when an item is needed, it will be there (like double buffering in Java). Prefetching may be automatic or under program control. When a cache loads not only the word being referenced, but an entire cache line containing the word, it is gambling that the succeeding words are also likely to be needed soon. Such speculative LOAD instructions work best when it is known for sure that the data will be needed. Getting a page fault on a LOAD for a path that is ultimately not taken is very costly. Another technique is multithreading. If switching between processes is fast, by giving each one its own memory map and hardware registers, then one process blocks waiting for remote data to arrive, the hardware can switch to another one that is able to continue. Another technique is using nonblocking writes. When a STORE instruction is executed, the CPU waits until the STORE has completed before continuing. With nonblocking writes, the memory operation is started, but the program continues. Continuing past a LOAD is harder, but with out-of-order execution it is possible.
Taxonomy of Parallel Computers
| Instruction
Streams |
Data
Streams |
Name | Examples |
| 1 | 1 | SISD | Classical Von Numann machine |
| 1 | Multiple | SIMD | Vector supercomputer, array processor |
| Multiple | 1 | MISD | Arguably none |
| Multiple | Multiple | MIMD | Multiple Processor, multicomputer |
Flynn's Taxonomy of Parallel Computers
It is safe to say that as of this writing there is no completely satisfactory
characterization of the different types of parallel systems. The most popular
taxonomy was defined by Flynn in 1966. The classification is based on the
notion of a stream of information. Two types of information flow into a
processor: instructions and data. Conceptually these can be separated into
two
independent streams, whether or not the information actually arrives
on a different set of wires. Flynn's taxonomy classifies machines according
to whether they have one stream or more than one stream of each type.
The four combinations are SISD (single instruction stream, single data
stream), SIMD (single instruction stream, multiple data streams), MISD
(multiple instruction streams, single data stream), and MIMD (multiple
instruction streams, multiple data streams).
SISD Computers
Conventional single processor computers are classified as SISD systems.
Each arithmetic instruction initiates an operation on a data item taken
from a single stream of data elements. Historical supercomputers such as
the Control Data
Corporation 6600 and 7600 fit this category as do most contemporary
microprocessors.
Vector processors such as the Cray-1 and its descendants are often classified
as SIMD machines, although they are more properly regarded as SISD machines.
Vector processors achieve their high performance by passing successive
elements of vectors through separate pieces of hardware dedicated to independent
phases of a complex operation. For example, in order to add two numbers
such as 3.4 x 23 and 1.6 x 22 , the numbers must have the sameexponent.
The processor must shift the mantissa (and decrement the exponent) of one
number until its exponent matches the exponent of the other number. In
this example 3.4 x 23 is adjusted to 6.8 x 22 so it can be
added to , and the sum is 8.4 x 22 . A vector processor is specially constructed
to feed a data stream into the processor at a high rate, so that as one
part of the processor is
adding the mantissas in the pair (ai, bi) another part of the
processor is adjusting the exponents in (ai + 1, bi + 1)
SIMD Computers
SIMD machines have one instruction processing unit, sometimes called
a controller and indicated by a K in the PMS notation, and several data
processing units, generally called D-units or processing elements (PEs).
The first operational machine of this class was the ILLIAC-IV, a joint
project by DARPA, Burroughs Corporation, and the University of Illinois
Institute for Advanced Computation [5]. Later machines included the Distributed
Array Processor (DAP) from the British corporation ICL, and the Goodyear
MPP. Two recent machines, the Thinking Machines CM-1 and the MasPar MP-1,
are discussed in
detail.
The control unit is responsible for fetching and interpreting instructions.
When it encounters an arithmetic or other data processing instruction,
it broadcasts the instruction to all PEs, which then all perform the same
operation. For example,
the instruction might be `` add R3,R0.'' Each PE would add the contents
of its own internal register R3 to its own R0. To allow for needed flexibility
in implementing algorithms, a PE can be deactivated. Thus on each instruction,
a PE is either idle, in which case it does nothing, or it is active, in
which case it performs the same operation as all other active PEs. Each
PE has its own memory for storing data. A memory reference instruction,
for example ``load R0,100'' directs each PE to load its internal register
with the contents of memory location 100, meaning the 100th cell in its
own local memory.
One of the advantages of this style of parallel machine organization
is a savings in the amount of logic. Anywhere from 20% to 50% of the logic
on a typical processor chip is devoted to control, namely to fetching,
decoding, and scheduling instructions. The remainder is used for on-chip
storage (registers and cache) and the logic required to implement the data
processing (adders, multipliers, etc.). In an SIMD machine, only one control
unit fetches and processes instructions, so more logic can be dedicated
to arithmetic circuits and registers. For example, 32 PEs fit on one chip
in the MasPar MP-1, and a 1024-
processor system is built from 32 chips, all of which fit on a single
board (the control unit occupies a separate board).
MISD Computers
There are few machines in this category, none that have been commercially successful or had any impact on computational science. One type of system that fits the description of an MISD computer is a systolic array, which is a network of small computing elements connected in a regular grid. All the elements are controlled by a global clock. On each cycle, an element will read a piece of data from one of its neighbors, perform a simple operation (e.g. add the incoming element to a stored value), and prepare a value to be written to a neighbor on the next step.
One could make a case for pipelined vector processors fitting in this
category, as well, since each step of the pipeline corresponds to a different
operation being performed to the data as it flows past that stage in the
pipe. There have been
pipelined processors with programmable stages, i.e. the function that
is applied at each location in the pipeline could vary, although the pipeline
stage did not fetch its operation from a local control memory so it would
be difficult to
classify it as a ``processor.''
MIMD Computers
The category of MIMD machines is the most diverse of the four classifications
in Flynn's taxonomy. It includes machines with processors and memory units
specifically designed to be components of a parallel architecture, large
scale parallel machines built from ``off the shelf'' microprocessors, small
scale multiprocessors made by connecting four vector processors together,
and a wide variety of other designs. With the continued improvement in
network communication and the development of software packages that allow
programs running on one machine to communicate with programs on other machines,
users are even starting to use local networks of workstations as MIMD
systems.
Computer systems with two or more independent processors have been available
commercially for a long time. For example, the Burroughs Corporation sold
dual processor versions of its B6700 systems in the 1970s. These were
rarely, if ever, used to work on the same job, however. Multiprocessors
of this era were intended to be used for job level parallelism, i.e. each
would run a separate program. Parallel processing, in the sense of using
more than one processor in the execution of a single program, has been
an active area in corporate and academic research labs since the early
1970s. The c.mmp and cm* projects at Carnegie Mellon University used DEC
PDP-11 microcomputers as processing elements and pioneered several important
developments in parallel hardware and software. Commercial parallel processors
started to become widely used in the mid 1980s. By the early 1990s these
systems began to approach top of the line vector processors in computing
power, and the trend for future high performance computing is clearly with
parallel processing.
For more information on Flynn's Taxonomy, SISD, MISD, SIMD, and MIMD Taxonomies
Other Taxonomies
In addition to the vacuous MISD category and the difficulty in classifying
vector processors, there are other weaknesses of the Flynn taxonomy. In
the MIMD category, all arrays of processors are lumped together regardless
of how they are
connected and how they view memory. Since these characteristics can
have a dramatic effect on performance, it would be desirable if the taxonomy
reflected those differences.
Shore (@Shore) offered a very similar taxonomy, but expanded the SIMD
category to four subcategories. He still did not distinguish the pipelined
vector computer and he also did not provide for the completely independent
array in
the MIMD category.
There have been other attempts to modify the Flynn taxonomy. For example in Hwang the MIMD category is subdivided into shared memory systems, distributed memory systems and reconfigurable systems. Unfortunately, this mixes memory organization with communication organization, and although it is a useful distinction it is not very satisfactory as a basis for a taxonomy. Bell divided the MIMD category into systems with shared memory and those without shared memory.
One addition to the Flynn taxonomy that has become very popular is SPMD,
which stands for Single Program / Multiple Data stream (see for example
Karp. In some sense it represents a style of computing rather than
an
architecture. Physically the system is an MIMD multiprocessor because
there are several independent processors, each with its own data set and
program memory. However, the same program is executed by each processor,
and the processors are synchronized periodically. This is a much simpler
way to approach an MIMD system than to have to manage many individual instruction
streams. It also provides more flexibility than the SIMD system because
different processors may be at different parts of the program at any time.
For more information SPMD
.
By far the most ambitious attempt at a taxonomy is given by Hockney and Jesshope where the motivation was to treat pipelined vector processors as a distinct architecture and to differentiate among the many multiprocessor possibilities. The notation resembles chemical notation for organic compounds and its complexity is beyond the scope of this discussion, but it does lead to a classification that provides a unique identifier for all of the systems that have been proposed or manufactured. However, that same complexity is the probable explanation for the lack of acceptance of the taxonomy.
For more information: Non
Blocking Networks
For more information: Parallel
Computer Architecture
For more references: References
Multistage Switchinng Networks
One of the uses of a switching network is for the CPU to access memory.
The two common switching networks are Omega and Benes. The Omega
network is a three stage network. The wiring pattern of the Omega
network is called the perfect shuttle since the mixing of the signals
at each stage resembles a deck of cards being cut in half and then mixed
card-for-card. The request is routed through various switches.
If two requests are made at the same time through the same switch, one
of them has to wait. So the Omega network is a blocking network.
Not every request can be processed simultaneously. Conflicts can
arise over the use of a wire or a switch, as well as requests to and from
memory. This problem of blocking can be solved by adding more hardware
and stages. The Benes network is a five stage network that is non
blocking. Unlike the Omega network, in which there is exactly one
path from any CPU to any memory, in the Benes network there are many alternatives.
If the complete list of CPU-Memory pairs is known, it is always possible
to finds a set of routes to satisfy all requests. But routing in
a Benes network is more complicated than in an Omega network as genuine
choices have to be made at each step along the way. So, Benes networks
are not used as much as Omega networks for connecting CPUs to memories,
although they are widely used in telephone switching and other more static
applications.
|
|
|
|