Chapter 9
Multiprocessors
9.1 Introduction
Until now we have studied how to improve performance by exploiting parallelism inside the processor (ILP) with techniques such as pipelining, superscalar execution, VLIW, and dynamic scheduling.
What if we try to connect many processors to build a machine with as much performance as we want?
Multiprocessors must therefore be scalable: the hardware and software are designed to be sold with a variable number of processors. We want performance to improve proportionally to the increase in the number of processors.
Multiprocessors
Idea: create powerful computers by connecting many smaller ones.
Good news: works well for timesharing (better than a supercomputer). Vector processing may be coming back.
Bad news: it's really hard to write good concurrent programs; there have been many commercial failures.
Why Use Multiprocessors?
The system is fault tolerant: if one processor fails, the system continues to provide service with the others.
Multiprocessors may offer the highest absolute performance: faster than the fastest uniprocessor.
It is more cost-effective to use many single-chip microprocessors than to build a high-performance uniprocessor from a more exotic technology.
Current file servers can be ordered with multiple processors.
The database industry has standardized on multiprocessors.
Performance in Multiprocessors
Commercial vendors usually define high performance as high throughput for independent tasks.
An alternative is running a single task on multiple processors.
Definition: a parallel processing program is a single program that runs on multiple processors simultaneously.
Key Questions
How do parallel processors share data?
		— single address space
		— message passing
How do parallel processors coordinate?
		— synchronization (locks, semaphores)
		— built into send / receive primitives
How many processors should the system have?
		— connected by a single bus (2 - 32 procs)
		— connected by a network (8 - 256 procs)
Shared Memory Processors (SMPs)
Processors with a single address space: all processors share a single memory address space.
Processors communicate through shared variables in memory. All processors are capable of accessing any memory location with loads and stores.
Coordinating Shared Memory Processors
Synchronization: prevent one processor from starting to work on data before another has finished with it.
When sharing is supported with a single address space, there must be a separate mechanism for synchronization.
Example synchronization mechanism: lock. When a processor j wants to access a shared variable, it must acquire its lock. Only one processor at a time can acquire the lock. Other interested processors must wait until j unlocks the variable.
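As an illustration only (not from the text), a minimal sketch of this lock idea using POSIX threads; the names shared_counter, counter_lock and worker, and the use of pthread_mutex_t, are assumptions added to make the sketch self-contained:

#include <pthread.h>
#include <stdio.h>

static long shared_counter = 0;                                   /* shared variable */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;  /* its lock */

static void *worker(void *arg)
{
    pthread_mutex_lock(&counter_lock);    /* acquire the lock; other processors wait here */
    shared_counter = shared_counter + 1;  /* only one processor updates the variable at a time */
    pthread_mutex_unlock(&counter_lock);  /* release the lock so a waiting processor may proceed */
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (long p = 0; p < 4; p++) pthread_create(&t[p], NULL, worker, NULL);
    for (long p = 0; p < 4; p++) pthread_join(t[p], NULL);
    printf("%ld\n", shared_counter);      /* prints 4: every update was protected by the lock */
    return 0;
}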
Classifying SMPs as to Memory Access
UMA (Uniform Memory Access) multiprocessors, also called Symmetric Multiprocessors (SMPs): all memory accesses take the same time, independently of which processor makes the access and which word is accessed (2 - 64 procs).
NUMA (Nonuniform Memory Access) multiprocessors. Memory access times depend on which processors access which words. (8 - 256 procs).
It is harder to program NUMA machines efficiently, but they are more scalable.
Distributed Memory Processors (8 - 256 procs)
Each processor has a local (private) memory.
Coordination (communication) is physically achieved by message passing. The system provides send / receive primitives (routines) for this. Much harder to program; requires new programming paradigms.
Events are triggered in the processors by the sending or the reception of a message.
Clusters
One can achieve multiprocessing by making a set of computers connected by a Local Area Network (LAN) act as a single large multiprocessor.
The switch-based LAN must provide high bandwidth between the computers in the cluster.
Keeping Up to Date
Industry is rapidly changing in this field.
There is no clear answer to all the key questions presented.
Many answers depend strongly on application behavior.
Find some of the latest information in this field at http://www.mkp.com/cod2e.htm.
9.2 Programming Multiprocessors
The microprocessors used to build multiprocessors have been cheap since the early 1980s.
Communication interconnection network technology is well established.
We have good programming languages for parallel program execution.
What's Difficult in Parallel and Distributed Processing?
Identifying real (and important) applications that have inherent parallelism and can benefit from parallel processing.
Writing parallel programs for parallel execution for these applications is harder than writing sequential programs.
The challenge is even greater if we want the application to scale (maintain efficiency when the problem size increases and larger multiprocessors are used); a brief numeric illustration follows below.
As a result, parallel processing has had success when parallel programming specialists have developed a parallel subsystem that presents a sequential interface.
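To make the efficiency notion concrete (the numbers here are made up for illustration): efficiency can be measured as speedup divided by the number of processors, so a program that runs 40 times faster on 64 processors has an efficiency of 40 / 64 ≈ 62%; scaling well means keeping this ratio high as both the problem and the machine grow.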
Why is Parallel Program Development so Hard?
An analogy: For 10 people to work on a task (e.g. painting a house, writing an article or developing SW), they must communicate and coordinate their work. There is no such overhead if one person does the task alone.
The overhead is even greater if the number of people increases to 100, 1000 or even 1000000. Because of this overhead, 1000 people do not finish the task 1000 times faster than a single person; n-fold speedup becomes especially unlikely as n increases.
The programmer must know a good deal about the HW (number and types of processors, interconnection topology).
Problems of Parallel Program Development
On a uniprocessor, the high-level language programmer writes the program largely ignoring the underlying machine organization.
Uniprocessor design techniques for ILP (instruction-level parallelism) such as pipelines, superscalar execution (SS) and out-of-order execution are transparent to the programmer; the compiler and the hardware handle these issues.
You must get good performance and efficiency from a parallel program on a multiprocessor; otherwise you would simply use a uniprocessor, since programming one is easier.
Even small sequential parts of a program must be parallelized for it to reach its full potential (see Amdahl's law in Chapter 2); sequential parts become bottlenecks.
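A quick worked instance of Amdahl's law (the 10% figure is only an illustrative assumption): if 10% of the execution time remains sequential, then on n processors the speedup is at most 1 / (0.1 + 0.9/n), so even with an unlimited number of processors the speedup can never exceed 1 / 0.1 = 10.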
9.3 Multiprocessors Connected by a Single Bus
Single-bus multiprocessor (2 - 32 processors)
Why use Single-bus Architectures?
Low cost of microprocessors.
Each microprocessor is much smaller than a multichip processor, so more processors can be placed on a bus.
Local caches can lower bus traffic.
Mechanisms for keeping caches and memory consistent (similar to the mechanisms that keep caches and memory consistent for I/O) simplify programming.
Traffic per processor and the bus bandwidth determine the useful number of processors in such a multiprocessor.
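A back-of-the-envelope illustration (the numbers are assumptions, not from the text): if the bus sustains roughly 800 MB/s and each processor still generates about 40 MB/s of bus traffic after filtering by its cache, then only about 800 / 40 = 20 processors can be kept busy on that bus.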
Caches
Caches replicate data in their faster memories:
to reduce the latency to the data and
to reduce memory traffic on the bus
Example Parallel Program for Single Bus
Sum 100000 numbers using 10 processors; each processor sums 100000 / 10 = 10000 numbers.
Variables:
Pn: number of the local processor
A[100000]: array with the numbers to be summed
Each processor Pn executes (i is a private variable):
sum[Pn] = 0;
for (i = 10000 * Pn; i < 10000 * (Pn + 1); i++)
    sum[Pn] = sum[Pn] + A[i];
Load instructions bring the correct subset of numbers from main memory to the local caches.
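A minimal self-contained sketch of this partial-sum phase as a shared-memory program with POSIX threads (the thread-creation code, the NPROCS and N constants, and the all-ones test data are assumptions added for illustration; the final sequential check is not part of the parallel algorithm):

#include <pthread.h>
#include <stdio.h>

#define NPROCS 10
#define N      100000

static int  A[N];          /* numbers to be summed */
static long sum[NPROCS];   /* one partial sum per processor */

static void *partial_sum(void *arg)
{
    long Pn = (long)arg;   /* number of the local processor */
    sum[Pn] = 0;
    for (long i = (N / NPROCS) * Pn; i < (N / NPROCS) * (Pn + 1); i++)
        sum[Pn] = sum[Pn] + A[i];
    return NULL;
}

int main(void)
{
    pthread_t t[NPROCS];
    for (int i = 0; i < N; i++) A[i] = 1;      /* test data: expected total is N */
    for (long p = 0; p < NPROCS; p++) pthread_create(&t[p], NULL, partial_sum, (void *)p);
    long total = 0;
    for (long p = 0; p < NPROCS; p++) { pthread_join(t[p], NULL); total += sum[p]; }
    printf("total = %ld\n", total);            /* sequential check of the 10 partial sums */
    return 0;
}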
Divide and Conquer to Add Partial Sums
Program that Adds Partial Sums 
half = 10; /* 10 processors */
repeat
	synch( );
	if ( half % 2 != 0 && Pn == 0 )
 		sum [0] = sum [0] + sum [half - 1];
 half = half / 2;
 if ( Pn < half ) sum [Pn] = sum [Pn] + sum [Pn + half];
until (half ==1); /* exit with final sum in sum[0] */
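A trace with the 10 partial sums may help (this follows directly from the code above): in the first iteration half is even, so it simply drops to 5 and processors 0-4 add sums 5-9; in the second iteration half is 5 (odd), so processor 0 first folds in sum[4], half drops to 2, and processors 0-1 add sums 2-3; in the last iteration half drops to 1 and processor 0 adds sum[1], leaving the grand total in sum[0].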
Synchronization Primitives
synch() is a barrier synchronization primitive: processors wait at the barrier until every processor has reached it; when this occurs, all may proceed.
This function can be implemented either in SW with the lock synchronization primitive, or with special HW that combines each processor's ready signal into a single global signal that all processors can test.
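As an illustration (the NPROCS constant and the use of a POSIX mutex and condition variable are assumptions; real systems may instead use pthread_barrier_t or the HW scheme above), a minimal SW sketch of such a barrier built from a lock, a counter, and a generation number:

#include <pthread.h>

#define NPROCS 10

static int count = 0;        /* processors that have reached the barrier */
static int generation = 0;   /* distinguishes successive uses of the barrier */
static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

void synch(void)             /* every processor calls this; all leave together */
{
    pthread_mutex_lock(&m);
    int my_gen = generation;
    count = count + 1;
    if (count == NPROCS) {             /* last processor to arrive releases everyone */
        count = 0;
        generation = generation + 1;
        pthread_cond_broadcast(&cv);
    } else {
        while (my_gen == generation)   /* wait until the last processor arrives */
            pthread_cond_wait(&cv, &m);
    }
    pthread_mutex_unlock(&m);
}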
Cache Consistency
In the example program, the part that adds the partial sums involves communication.
It is convenient to have copies of the same data in multiple caches (sum[ ] in the example).
There may be inconsistencies between data in memory and data in caches and also among same data in different caches. 
Protocols that maintain coherency for multiprocessors are called cache coherency protocols.
Some Interesting Problems
Cache Coherency
The most popular protocol for maintaining cache coherency is called snooping: each cache controller monitors (snoops on) the shared bus and invalidates or updates its own copy of a block when another processor writes to that block.
Synchronization Using Coherency
Read page 724 in Section 9.3.
9.4 Multiprocessors Connected by a Network
3 Desirable bus characteristics:
high bandwidth
low latency
long length
These three characteristics are INCOMPATIBLE.
There is also a limit to the bandwidth of a single memory module attached to a bus.
The number of processors that can be connected to a single bus is therefore limited (maximum of about 36 processors).
Solution for More Demands on Communication
Use more communication channels: a network connects the processor-memory modules.
Example Network Topologies
Distributed Memory vs. Shared Memory
Shared Memory: a single address space (implicit communication with loads and stores).
Multiple Private Memories (Address Spaces): explicit communication with sends and receives.
Distributed Memory: refers to the physical location of the memory. Physical memory is divided into modules, with some placed near each processor.
Centralized Memory: access time to a physical memory location is the same for all processors because all accesses go over the interconnect.
Many Combinations are Possible
Multiprocessors can have a single address space and a distributed physical memory.
Decisions:
single address space?
explicit communication?
distributed physical memory?
In machines without a single global address space, communication is explicit: the programmer or compiler must send / receive messages to export / import data to / from other nodes.
Parallel Message Passing Program
Sum 100000 numbers using 100 processors; each processor sums 100000 / 100 = 1000 numbers.
Variables:
Pn: number of the local processor
A[100000]: array with the numbers to be summed
One processor holding all 100000 numbers sends a 1000-number subset to each of the 100 processors.
Each processor Pn executes the following to sum its subset:
sum = 0;
for (i = 0; i < 1000; i++)
    sum = sum + A[i];
Divide and Conquer to Add Partial Sums
Parallel Message Passing Program
Communicate partial sums among neighbors through the network to accumulate the final sum.
Divide and conquer to obtain final sum in parallel.
send (x, y): routine that sends the value y over the interconnection network to processor x.
receive(): function that accepts a value from the network for this processor.
half = limit = 100; /* 100 processors */
repeat
    half = (half + 1) / 2;
    if (Pn >= half && Pn < limit) send(Pn - half, sum);
    if (Pn < (limit / 2)) sum = sum + receive();
    limit = half;
until (half == 1);
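Tracing the first steps for 100 processors (this follows directly from the code above): half becomes (100 + 1) / 2 = 50, so processors 50-99 send their sums to processors 0-49, which receive and accumulate them, and limit becomes 50; next, half becomes 25, processors 25-49 send to 0-24, and limit becomes 25; then half becomes 13, processors 13-24 send to 0-11 while processor 12 simply keeps its own sum; the halving continues until processor 1 sends to processor 0, which then holds the final sum.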
Parallel Message Passing Program
The program divides processors into senders and receivers.
Each receiving processor receives at most one message per iteration, and a receiving processor stalls until its message arrives.
Send and receive can be used as primitives for synchronization as well as for communication.
Addressing in Large-Scale Processors
To be scalable a multiprocessor must have distributed memory.
For the HW designer it is easier to offer only send and receive.
Message passing also encourages minimizing communication, because the programmer, who knows the application's parallelism, must optimize it explicitly.
It is possible to add a SW layer that provides a single address space on top of sends and receives, so that communication can use loads and stores (implicit communication): Distributed Shared Memory (DSM).
DSM is comparable to the virtual memory system. It relies on spatial locality of data and is only efficient if shared-memory communication is rare; otherwise most of the time is spent transferring data.
Concluding Remarks
Evolution vs. Revolution
"More often the expense of innovation comes from being too disruptive to computer users."
"Acceptance of hardware ideas requires acceptance by software people; therefore hardware people should learn about software. And if software people want good machines, they must learn more about hardware to be able to communicate with and thereby influence hardware engineers."
Problems for Chapter 9
9.1, 9.2, 9.3, 9.4, 9.7