3. Asynchronous Parallelism

Bräunl 2004 1

[Figure: MIMD system with several PEs, each with local memory (Mem), connected via a network; the shared memory can be addressed by all PEs]

3. Asynchronous Parallelism

Setup of an MIMD-System

General Model

The program is segmented into autonomous processes.
Ideal allocation: 1 process : 1 processor
But usually: n processes : 1 processor ⇒ scheduler required

Bräunl 2004 2

[Figure: Sequent Symmetry architecture: 10 × 80386 CPUs (32-bit) and 40 MB memory on the system bus; Multibus adapter, drive controller, and SCED board on the Multibus; SCSI bus, console, Ethernet]

Sample MIMD Systems

Sequent Symmetry

Bräunl 2004 3

Intel iPSC Hypercube
• iPSC = Intel Personal Supercomputer
• Predecessor: "Cosmic Cube" (CalTech, Pasadena)
• Generations: iPSC/1 (80286), iPSC/2 (80386), iPSC/860 (80860)
• max. 128 processors
• Hypercube connection network

Intel Paragon
• Predecessor: "Touchstone Delta"
• max. 512 nodes with 2 × i860 XP processors each
• In each node: 1 processor for arithmetic + 1 processor for data exchange
• Different node types (dynamically configurable): compute node, service node, I/O node
• Grid network (4 neighbors)

Sample MIMD Systems

Bräunl 2004 4

Cray "Red Storm" – in cooperation with Sandia National Labs
• Standard AMD Opteron processors (originally 2 per node, in future only 1)
• max. 10,000 nodes
• max. 40 TFLOPS
• at US$ 90 million
• Different node types: compute node, I/O node
• 3D grid network (6 neighbors)

Sample MIMD Systems

Image: Cray

Bräunl 2004 5

3.1 Process Model

[Figure: process state diagram for a single processor: Ready → Running at the start of a time slice; Running → Ready at the end of a time slice (OS reschedule); Running → Blocked via P(sema); Blocked → Ready via V(sema); Running ends via OS kill or terminate]

Bräunl 2004 6

Process-States for an individual Processor

The states 'ready' and 'blocked' are managed as queues. Execution of processes is done in time-sharing mode (time slices).

[Figure: generic state diagram: New → ready → executing → Done, with blocked between executing and ready]

Process States

Bräunl 2004 7

[Figure: states New → (add) → ready → (assign) → executing on PE1, PE2, or PE3 → (retire) → Done; executing → (block) → blocked → (ready) → ready; executing → (resign) → ready]

Process States
Process-States for MIMD with shared memory

A process can be executed by different processors in sequential time slices.

Bräunl 2004 8

Process-States for MIMD without shared memory

[Figure: four independent state diagrams (ready / executing / blocked), one per processor: Processor 1 ... Processor 4]

Process States

The actual allocation, i.e. which process executes on which processor, is often transparent to the programmer.

Bräunl 2004 9

"Light" Processes

Idea: The process concept remains, but with less overhead.
Previous costs: process context switching, especially due to loading/saving of data
Saving: no loading/saving of program data on context switches
Prerequisite: one program with multiple processes on a system with shared memory (sequential computer or MIMD)
Implementation: a user program with multiple processes always holds all of its data
• no loading/saving required
• fast execution times

User view: like processes, only faster

Threads

Bräunl 2004 10

Parallel processing creates the following problems:

1. Data exchange between two processes
2. Simultaneous access to a data area by multiple processes must be prevented

[Figure: a message passed from process P1 to process P2]

3.2 Synchronization and Communication

Bräunl 2004 11

Railway example of this problem:

Synchronization and Communication

[Figure: two trains (1 and 2) approaching the same single-track section]

Bräunl 2004 12

(by Peterson, Silberschatz) Multiple processes are executed in parallel; these need to be synchronized. One possibility is to use synchronization variables:

…
start(P1);
start(P2);
…

var turn: 1..2;
Initialization: turn:=1;

P1:
loop
  while turn≠1 do (*nothing*) end;
  <critical section>
  turn:=2;
  <other instructions>
end

P2:
loop
  while turn≠2 do (*nothing*) end;
  <critical section>
  turn:=1;
  <other instructions>
end

Software Solution

Bräunl 2004 13

Software Solution

• This solution guarantees that only one process at a time can enter the critical section.

• But there is one major disadvantage: alternating access is enforced.

⇒ RESTRICTION !!

Bräunl 2004 14

var flag: array [1..2] of BOOLEAN;
Initialization: flag[1]:=false; flag[2]:=false;

P1:
loop
  while flag[2] do (*nothing*) end;
  flag[1]:=true;
  <critical section>
  flag[1]:=false;
  <other instructions>
end

P2:
loop
  while flag[1] do (*nothing*) end;
  flag[2]:=true;
  <critical section>
  flag[2]:=false;
  <other instructions>
end

Software Solution – 1st Improvement

Bräunl 2004 15

It is not that easy though:

Should both processes exit their while-loops simultaneously (despite the safety check), then both will enter the critical section.

⇒ INCORRECT !!

Software Solution – 1st Improvement

Bräunl 2004 16

var flag: array [1..2] of BOOLEAN;
Initialization: flag[1]:=false; flag[2]:=false;

P1:
loop
  flag[1]:=true;
  while flag[2] do (*nothing*) end;
  <critical section>
  flag[1]:=false;
  <other instructions>
end

P2:
loop
  flag[2]:=true;
  while flag[1] do (*nothing*) end;
  <critical section>
  flag[2]:=false;
  <other instructions>
end

Software Solution – 2nd Improvement

Bräunl 2004 17

If the two lines "flag[i]:=true" are moved before the while-loop, the error of improvement 1 will not occur, but now we can have a deadlock instead.

⇒ INCORRECT !!

Software Solution – 2nd Improvement

Bräunl 2004 18

var turn: 1..2;
    flag: array [1..2] of BOOLEAN;
Initialization: turn:=1; (* arbitrary *)
                flag[1]:=false; flag[2]:=false;

P1:
loop
  flag[1]:=true;
  turn:=2;
  while flag[2] and (turn=2) do (*nothing*) end;
  <critical section>
  flag[1]:=false;
  <other instructions>
end

P2:
loop
  flag[2]:=true;
  turn:=1;
  while flag[1] and (turn=1) do (*nothing*) end;
  <critical section>
  flag[2]:=false;
  <other instructions>
end

Software Solution – 3rd Improvement

Bräunl 2004 19

• Expandable to n processes
• Also see Dekker's algorithm
• Disadvantage of the software solution: busy-wait, i.e. significant efficiency loss if each process does not have its own processor!

⇒ CORRECT !!

Software Solution – 3rd Improvement

Bräunl 2004 20
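For reference, a minimal C sketch of this two-process algorithm using C11 atomics; the flag/turn names mirror the pseudocode above (with indices 0/1 instead of 1/2), and sequentially consistent atomics stand in for the memory model the slides leave implicit, since plain variables would not be safe on modern hardware:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <pthread.h>

atomic_bool flag[2];
atomic_int turn;
long counter = 0;                 /* shared data protected by the critical section */

void peterson_lock(int i)         /* i is 0 or 1 */
{ int other = 1 - i;
  atomic_store(&flag[i], true);
  atomic_store(&turn, other);
  while (atomic_load(&flag[other]) && atomic_load(&turn) == other)
    ;                             /* busy-wait */
}

void peterson_unlock(int i)
{ atomic_store(&flag[i], false);
}

void *proc(void *arg)
{ int i = (int)(long) arg;
  for (int k = 0; k < 100000; k++)
  { peterson_lock(i);
    counter++;                    /* critical section */
    peterson_unlock(i);
  }
  return 0;
}

int main(void)
{ pthread_t t1, t2;
  pthread_create(&t1, NULL, proc, (void *) 0);
  pthread_create(&t2, NULL, proc, (void *) 1);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("counter = %ld\n", counter);   /* expect 200000 */
  return 0;
}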

Test-and-Set Operation (standard in most processors)

function test_and_set (var target: BOOLEAN): BOOLEAN;
begin_atomic
  test_and_set := target;
  target := true;
end_atomic;

Implementation as an atomic operation (indivisible, uninterruptible: "1 instruction cycle of the CPU").

Hardware Solution

Bräunl 2004 21

New solution using the Test-and-Set operation:

var lock: BOOLEAN;
Initialization: lock:=false;

Pi:
loop
  while test_and_set(lock) do (*nothing*) end;
  <critical section>
  lock:=false;
  <other instructions>
end;

• Removal of the busy-wait via queues (see semaphores).

Hardware Solution

Bräunl 2004 22
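C11 exposes exactly this primitive as atomic_flag, so the busy-wait lock above can be written directly; a minimal sketch (the names spin_lock/spin_unlock are illustrative, not from the slides):

#include <stdatomic.h>

atomic_flag lock_flag = ATOMIC_FLAG_INIT;      /* lock := false */

void spin_lock(void)
{ /* atomic_flag_test_and_set returns the old value and sets the flag, atomically */
  while (atomic_flag_test_and_set(&lock_flag))
    ;                                          /* busy-wait while the lock is held */
}

void spin_unlock(void)
{ atomic_flag_clear(&lock_flag);               /* lock := false */
}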

Dijkstra, 1965 (similar to a railway signal post)

Application:

Pi:
…
P(sema);
<critical section>
V(sema);
…

3.3 Semaphore

Bräunl 2004 23

Implementation: P and V are atomic operations (indivisible).

type Semaphore = record
  value: INTEGER;
  L: list of Proc_ID;
end;

Generic semaphore → value: INTEGER
Boolean semaphore → value: BOOLEAN

Semaphore

Bräunl 2004 24

Initialization: L = empty list; value = number of allowed P-ops without a V-op

procedure P (var S: Semaphore);
begin
  S.value := S.value - 1;
  if S.value < 0 then
    append(S.L, actproc);  (* append this process to S.L *)
    block(actproc)         (* and move to state "blocked" *)
  end
end P;

procedure V (var S: Semaphore);
var Pnew: Proc_ID;
begin
  S.value := S.value + 1;
  if S.value ≤ 0 then
    getfirst(S.L, Pnew);   (* remove the first process from S.L *)
    ready(Pnew)            (* and change its state to "ready" *)
  end
end V;

Semaphore

Bräunl 2004 25
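As an illustration, a sketch of how these P/V semantics can be built from pthreads primitives; a mutex plus condition variable stand in for the atomicity and for the queue S.L (unlike the slide's version, value never goes negative here, because the number of waiters is kept implicitly in the condition queue):

#include <pthread.h>

typedef struct {
  int value;                       /* number of allowed P-ops without a V-op */
  pthread_mutex_t m;
  pthread_cond_t  q;               /* waiting processes, plays the role of S.L */
} Semaphore;
/* initialize e.g. as: Semaphore s = {1, PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER}; */

void sem_P(Semaphore *s)
{ pthread_mutex_lock(&s->m);
  while (s->value <= 0)            /* would go negative: block */
    pthread_cond_wait(&s->q, &s->m);
  s->value--;
  pthread_mutex_unlock(&s->m);
}

void sem_V(Semaphore *s)
{ pthread_mutex_lock(&s->m);
  s->value++;
  pthread_cond_signal(&s->q);      /* wake one waiter, like ready(Pnew) */
  pthread_mutex_unlock(&s->m);
}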

How do we ensure that P and V are atomic operations?

• Single-processor system: disable all interrupts
• Multi-processor system:
  • Software solution: busy-wait with the short P- or V-operation as the critical section.
  • Hardware solution: short busy-wait on a Test-and-Set instruction before the start of the P- or V-operation.

Attention: convoy phenomenon

Semaphore

Bräunl 2004 26

Using Boolean semaphores

Declaration and initialization:
var empty: semaphore [1];
    full:  semaphore [0];

process Producer;
begin
  loop
    <create data>
    P(empty);
    <fill buffer>
    V(full);
  end;
end process Producer;

process Consumer;
begin
  loop
    P(full);
    <clear buffer>
    V(empty);
    <process data>
  end;
end process Consumer;

Producer-Consumer Problem

Bräunl 2004 27

[Figure: Petri net for the producer-consumer problem, with places empty and full linking the Producer cycle (data creation, data write) to the Consumer cycle (data read, data consumption)]

Producer-Consumer Problem: Corresponding Petri-Net

Bräunl 2004 28

var critical: semaphore[1];
    free:     semaphore[n];  (* there are n buffer spaces *)
    used:     semaphore[0];

process Producer;
begin
  loop
    <data creation>
    P(free);
    P(critical);
    <write to buffer>
    V(critical);
    V(used);
  end;
end process Producer;

process Consumer;
begin
  loop
    P(used);
    P(critical);
    <read buffer>
    V(critical);
    V(free);
    <process data>
  end;
end process Consumer;

Bounded Buffer Problem

Bräunl 2004 29

Corresponding Petri net:

[Figure: Petri net for the bounded buffer: places critical (1), free (n), and used (0) between the Producer cycle (data creation, data write) and the Consumer cycle (data read, data consumption)]

Bounded Buffer Problem

Bräunl 2004 30

var count: semaphore[1];
    r_w:   semaphore[1];   (* one writer or many readers *)
    readcount: INTEGER;
Initialization: readcount:=0;

process Reader;
begin
  loop
    P(count);
    if readcount=0 then P(r_w) end;
    readcount := readcount + 1;
    V(count);

    <read buffer>

    P(count);
    readcount := readcount - 1;
    if readcount=0 then V(r_w) end;
    V(count);
    <process data>
  end; (* loop *)
end process Reader;

process Writer;
begin
  loop
    <data creation>
    P(r_w);
    <write buffer>
    V(r_w);
  end; (* loop *)
end process Writer;

Readers-Writers Problem

Bräunl 2004 31
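pthreads ships this pattern ready-made as a read-write lock; a minimal usage sketch (the shared buffer and its contents are placeholders, not part of the slides):

#include <pthread.h>

pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;
int buffer;                        /* shared data, placeholder */

void *reader(void *arg)
{ pthread_rwlock_rdlock(&rw);      /* many readers may hold this at once */
  int x = buffer;                  /* <read buffer> */
  pthread_rwlock_unlock(&rw);
  (void) x;                        /* <process data> */
  return 0;
}

void *writer(void *arg)
{ pthread_rwlock_wrlock(&rw);      /* exclusive: one writer, no readers */
  buffer = 42;                     /* <write buffer> */
  pthread_rwlock_unlock(&rw);
  return 0;
}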

[Figure: Petri net for the readers-writers problem: the Reader cycle brackets data read with P(count)/V(count) pairs and the readcount bookkeeping around r_w; the Writer cycle takes r_w around data write]

Readers-Writers Problem

Bräunl 2004 32

Thread Example 1

#include <pthread.h>
#include <sched.h>    /* for sched_yield() */
#include <stdio.h>

#define repeats   3
#define threadnum 5

void *slave(void *arg)
{ int id = (int) arg;
  for (int i = 0; i < repeats; i++)
  { printf("thread %d\n", id);
    sched_yield();
  }
  return 0;
}

int main()
{ pthread_t thread;
  for (int i = 0; i < threadnum; i++)
  { if (pthread_create(&thread, NULL, slave, (void *) i))
      printf("Error: thread creation failed\n");
  }
  pthread_exit(0);
}

Sample output (one entry per line in the actual run):
thread 0, thread 1, thread 2, thread 4, thread 3, thread 0, thread 2, thread 4, thread 1, thread 3, thread 0, thread 2, thread 4, thread 1, thread 3

Bräunl 2004 33

Thread Example 2

#include <pthread.h>
#include <stdio.h>

#define repeats   3
#define threadnum 5

void *slave(void *arg)
{ int id = (int) arg;
  for (int i = 0; i < repeats; i++)
  { printf("thread %d\n", id);
  }
  return 0;
}

int main()
{ pthread_t thread;
  for (int i = 0; i < threadnum; i++)
  { if (pthread_create(&thread, NULL, slave, (void *) i))
      printf("Error: thread creation failed\n");
  }
  pthread_exit(0);
}

Sample output (one entry per line in the actual run):
thread 0, thread 0, thread 0, thread 2, thread 2, thread 2, thread 1, thread 1, thread 1, thread 3, thread 3, thread 4, thread 3, thread 4, thread 4

Bräunl 2004 34

Thread Example 3

#include <pthread.h>
#include <stdio.h>

#define repeats   3
#define threadnum 5

void *slave(void *arg)
{ int id = (int) arg;
  for (int i = 0; i < repeats; i++)
  { printf("%d-A\n", id);
    printf("%d-B\n", id);
    printf("%d-C\n", id);
  }
  return 0;
}

int main()
{ pthread_t thread;
  for (int i = 0; i < threadnum; i++)
  { if (pthread_create(&thread, NULL, slave, (void *) i))
      printf("Error: thread creation failed\n");
  }
  pthread_exit(0);
}

Sample output (one entry per line in the actual run; the A-B-C triples of different threads interleave):
0-A 0-B 0-C 0-A 0-B 2-A 1-A 0-C 3-A 2-B 4-A 1-B 0-A 3-B 2-C 4-B 1-C 0-B 3-C 2-A 4-C 1-A 0-C 3-A 2-B 4-A 1-B 3-B 2-C 4-B 1-C 3-C 2-A 4-C 1-A 3-A 2-B 4-A 1-B 3-B 2-C 4-B 1-C 3-C 4-C

Bräunl 2004 35

Thread Example 4

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>   /* for exit() */

#define repeats   3
#define threadnum 5

pthread_mutex_t mutex;

void *slave(void *arg)
{ int id = (int) arg;
  for (int i = 0; i < repeats; i++)
  { pthread_mutex_lock(&mutex);
    printf("%d-A\n", id);
    printf("%d-B\n", id);
    printf("%d-C\n", id);
    pthread_mutex_unlock(&mutex);
  }
  return 0;
}

int main()
{ pthread_t thread;
  if (pthread_mutex_init(&mutex, NULL))
  { printf("Error: mutex failed\n"); exit(1); }
  for (int i = 0; i < threadnum; i++)
  { if (pthread_create(&thread, NULL, slave, (void *) i))
    { printf("Error: thread creation failed\n"); exit(1); }
  }
  pthread_exit(0);
}

Sample output (one entry per line in the actual run; the mutex keeps each A-B-C triple together):
0-A 0-B 0-C 0-A 0-B 0-C 0-A 0-B 0-C 2-A 2-B 2-C 4-A 4-B 4-C 1-A 1-B 1-C 3-A 3-B 3-C 2-A 2-B 2-C 4-A 4-B 4-C 1-A 1-B 1-C 3-A 3-B 3-C 2-A 2-B 2-C 4-A 4-B 4-C 1-A 1-B 1-C 3-A 3-B 3-C

Bräunl 2004 36

By Hoare 1974, Brinch Hansen 1975

• Abstract data type
• A monitor encapsulates the data area to be protected and the corresponding data access mechanisms (entries, conditions).

Usage:
P1: ... Buffer:DataWrite(x) ...
P2: ... Buffer:DataRead(x) ...

Monitor calls are mutually exclusive; they are automatically synchronized.

3.4 Monitor

Bräunl 2004 37

[Figure: stack buffer with slots 1..n and a pointer marking the fill level (0 = empty)]

Monitor

Application Example: Buffer management

Bräunl 2004 38

Monitor for Buffer Management

monitor Buffer;
var Stack: array [1..n] of Dataset;
    Pointer: 0..n;
    free, used: CONDITION;

entry WriteData (a: Dataset);
begin
  while Pointer=n do (* stack full *)
    wait(free)
  end;
  inc(Pointer);
  Stack[Pointer]:=a;
  if Pointer=1 then signal(used) end;
end WriteData;

entry ReadData (var a: Dataset);
begin
  while Pointer=0 do (* stack empty *)
    wait(used)
  end;
  a:=Stack[Pointer];
  dec(Pointer);
  if Pointer=n-1 then signal(free) end;
end ReadData;

begin (* monitor initialization *)
  Pointer:=0
end monitor Buffer;

Bräunl 2004 39

Monitor Conditions

• wait(Cond): The executing process blocks itself and waits until another process executes a signal-operation on the condition Cond.

• signal(Cond): All processes waiting on the condition Cond are reactivated and will again apply for access to the monitor. (Another variant releases only one process: the next one in the queue.)

• status(Cond): Returns the number of processes waiting on the condition Cond.

Bräunl 2004 40
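In pthreads the same three operations map onto condition variables; a sketch of the correspondence, assuming the monitor mutex MSema is held around each call (status has no direct pthreads equivalent, so a counter is kept by hand purely for illustration):

#include <pthread.h>

pthread_mutex_t MSema = PTHREAD_MUTEX_INITIALIZER;   /* monitor lock */
pthread_cond_t  Cond  = PTHREAD_COND_INITIALIZER;
int waiting = 0;                   /* hand-kept count for status(Cond) */

void monitor_wait(void)            /* call with MSema held */
{ waiting++;
  pthread_cond_wait(&Cond, &MSema);  /* releases MSema, blocks, re-acquires */
  waiting--;
}

void monitor_signal_all(void)      /* the "release all" variant */
{ pthread_cond_broadcast(&Cond);
}

void monitor_signal_one(void)      /* the "release one" variant */
{ pthread_cond_signal(&Cond);
}

int monitor_status(void)           /* number of waiting processes */
{ return waiting;
}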

Conditions are queues, similar to those of semaphores.

Monitor Implementation Steps:

1) var MSema: semaphore[1]; (* is required for every monitor *)

2) Rewriting of entries to procedures:

procedure xyz(...);
begin
  P(MSema);
  <instructions>
  V(MSema);
end xyz;

3) Rewrite the monitor initialization into a procedure, with a corresponding call from the main program.

Monitor Implementation

Bräunl 2004 41

4) procedure wait (Cond: condition; MSema: semaphore);
begin
  append(Cond, actproc);  (* insert ProcID in the condition queue *)
  block(actproc);         (* dispatcher: insert ProcID in the blocked list *)
  V(MSema);               (* release the monitor semaphore *)
  assign;                 (* dispatcher: load next ready process *)
end wait;

5) procedure status (Cond: condition): CARDINAL;
begin
  return length(Cond);
end status;

Monitor Implementation

Bräunl 2004 42

6a) In this implementation only one process is released (wait in an if-clause):

procedure signal (Cond, MSema);
var NewProc: Proc_ID;
begin
  if not empty(Cond) then
    GetFirst(Cond, NewProc);
    P-op for NewProc
  end
end signal;

[Figure: queue diagram, before and after: the first process is moved from the Cond queue to the MSema queue (afterwards Cond = NIL if it was the only one)]

Monitor Implementation

Bräunl 2004 43

[Figure: queue diagram, before and after: the whole Cond queue is appended to the MSema queue]

6b) In this implementation all waiting processes are released (wait in a while-loop):

procedure signal (var Cond: condition; var MSema: semaphore);
begin                     (* status check not needed here *)
  append(MSema.L, Cond);  (* append list *)
  MSema.value := MSema.value - status(Cond);
  Cond := nil;
end signal;

Monitor Implementation

Bräunl 2004 44

3.5 Message Passing

• For distributed systems (no shared memory) this is the only method of communication
• Also usable for systems with shared memory
• Easy to use, but expensive in computation time (overhead)

⇒ Implementation with implicit communication: Remote Procedure Call

Bräunl 2004 45

Operations:
Send_A (Receiver, Message)
Receive_A (var Sender; var Message)
Send_R (Receiver, Message)
Receive_R (var Sender; var Message)

Send/receive of jobs, and send/receive of replies.
Each process can receive as many jobs without processing them as fit into its buffer.

[Figure: processes 1..n, each with a buffer area for messages split into an A queue and an R queue; A = tasks, R = replies]

Message Passing Example

Bräunl 2004 46

Client PC:
...
Send_A(to_Server, Task);
...
Receive_R(from_Server, Result);

Server PS:
loop
  Receive_A(Client, Task);
  <process task>
  Send_R(Client, Result);
end;

Implementation:
• In parallel systems with shared memory: monitor
• In parallel systems without shared memory: decentralized network management with message protocols

Message Passing Example

Bräunl 2004 47

Schematic for systems with shared memory (use of a global message pool):

[Figure: Send_A, Receive_A, Send_R, and Receive_R all operate on a shared message pool]

Message Passing Example

Bräunl 2004 48

type PoolElem = record
  free: BOOLEAN;
  from: 1..NumberOfProcs;
  info: Message;
end;

Queue = record
  contents: 1..max;
  next: pointer to Queue;
end;

monitor Messages;
var Pool: array [1..max] of PoolElem;  (* global message pool *)
    pfree: CONDITION;                  (* queue, in case the pool is completely full *)
    Afull, Rfull: array [1..NumberOfProcs] of CONDITION;
                                       (* one queue per process for incoming messages *)
    queueA, queueR: array [1..NumberOfProcs] of Queue;
                                       (* local message queues for each process *)

Message Passing Example

Bräunl 2004 49

entry Send_A (to: 1..NumberOfProcs; a: Message);
var id: 1..max;
begin
  while not GetFreeElem(id) do wait(pfree) end;
  with Pool[id] do
    free := false;
    from := actproc;
    info := a;
  end;
  append(queueA[to], id);   (* insert pool index in the task queue *)
  signal(Afull[to]);
end Send_A;

entry Receive_A (var from: 1..NumberOfProcs; var a: Message);
var id: 1..max;
begin
  while empty(queueA[actproc]) do wait(Afull[actproc]) end;
  id := head(queueA[actproc]);
  from := Pool[id].from;
  a := Pool[id].info;       (* Pool[id] is not yet freed *)
end Receive_A;

Message Passing Example

Bräunl 2004 50

entry Send_R (to: 1..NumberOfProcs; result: Message);
var id: 1..max;
begin
  id := head(queueA[actproc]);
  tail(queueA[actproc]);   (* remove first element (head) of the queue *)
  Pool[id].from := actproc;
  Pool[id].info := result;
  append(queueR[to], id);  (* insert pool index in the reply queue *)
  signal(Rfull[to])
end Send_R;

entry Receive_R (var from: 1..NumberOfProcs; var res: Message);
var id: 1..max;
begin
  while empty(queueR[actproc]) do wait(Rfull[actproc]) end;
  id := head(queueR[actproc]);
  tail(queueR[actproc]);   (* remove first element (head) of the queue *)
  from := Pool[id].from;
  res  := Pool[id].info;
  Pool[id].free := true;   (* release the pool element *)
  signal(pfree);           (* a free pool element exists *)
end Receive_R;

Message Passing Example

Bräunl 2004 51

3.6 Problems with Asynchronous Parallelism

Time-dependent errors:
• Not reproducible!
• Cannot be found by systematic testing!

A. Inconsistent Data
A set of data (or a relationship between data) does not have the value it would have received if the operations had been processed sequentially.

B. Blockings
Deadlock, livelock

C. Inefficiencies
Load balancing, process migration

Bräunl 2004 52

Before: Income[Miller] = 1000

P1:
…
x := Income[Miller];
x := x + 50;
Income[Miller] := x;
…

P2:
…
y := Income[Miller];
y := 1.1 * y;
Income[Miller] := y;
…

After: Income[Miller] = ?  Possible results: 1155, 1150, 1100, 1050

Inconsistent Data

Problem A.1: Lost Update

The result depends on the sequence of memory accesses.

Bräunl 2004 53
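A pthreads sketch that reproduces this race (income and the two updates are taken from the slide; run it repeatedly and the final value varies between the four outcomes unless the commented mutex calls are enabled in both threads):

#include <pthread.h>
#include <stdio.h>

int income = 1000;                 /* Income[Miller] */
pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

void *p1(void *arg)                /* x := income; x := x+50; income := x */
{ /* pthread_mutex_lock(&m);   -- enable in both threads to fix the race */
  int x = income;
  x = x + 50;
  income = x;
  /* pthread_mutex_unlock(&m); */
  return 0;
}

void *p2(void *arg)                /* y := income; y := 1.1*y; income := y */
{ int y = income;
  y = (int)(1.1 * y);
  income = y;
  return 0;
}

int main(void)
{ pthread_t t1, t2;
  pthread_create(&t1, NULL, p1, NULL);
  pthread_create(&t2, NULL, p2, NULL);
  pthread_join(t1, NULL);
  pthread_join(t2, NULL);
  printf("income = %d\n", income); /* 1155, 1150, 1100 or 1050 */
  return 0;
}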

Before: account[A] = 2000, account[B] = 1000, Sum(A,B) = 3000

P1 (transfer 400 from A to B):
…
x := account[A];
x := x - 400;
account[A] := x;
x := account[B];
x := x + 400;
account[B] := x;
…

P2 (compute the sum):
…
a := account[A];
b := account[B];
sum := a + b;
…

Inconsistent Data

Problem A.2: Inconsistent Analysis

Bräunl 2004 54

After: account[A] = 1600, account[B] = 1400, Sum(A,B) = 3000 (same as before!)

But the sum computed by P2 = ?  Possible results: 3400, 3000, 2600, depending on the sequence of memory accesses.

Inconsistent Data

Bräunl 2004 55

Inconsistent Data

Problem A.3: Uncommitted Dependencies

In databases and transaction systems: transactions are atomic operations ⇒ commit / rollback

Bräunl 2004 56

Blockings

Sample Occurrences:

• All processes are blocked in semaphore or condition queues (deadlock)
• All processes are caught in mutual busy-wait or polling statements (livelock)
• A group of processes is waiting for the occurrence of a condition which can only be generated by that group itself (alternating dependency)

Bräunl 2004 57

P1: P(TE); P(PR); <processing> V(PR); V(TE);

P2: P(PR); P(TE); <processing> V(TE); V(PR);

Blockings

Two processes require the terminal (TE) and printer (PR) resources for their computation.

Problem B.1: Deadlock

Bräunl 2004 58
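A pthreads sketch of this situation, with mutexes standing in for the TE and PR semaphores; because the two threads take the locks in opposite orders, the program can hang, and giving both threads the same lock order removes the circular wait:

#include <pthread.h>
#include <stdio.h>

pthread_mutex_t terminal = PTHREAD_MUTEX_INITIALIZER;   /* TE */
pthread_mutex_t printer  = PTHREAD_MUTEX_INITIALIZER;   /* PR */

void *p1(void *arg)
{ pthread_mutex_lock(&terminal);    /* P(TE) */
  pthread_mutex_lock(&printer);     /* P(PR) -- may wait forever */
  /* <processing> */
  pthread_mutex_unlock(&printer);   /* V(PR) */
  pthread_mutex_unlock(&terminal);  /* V(TE) */
  return 0;
}

void *p2(void *arg)
{ pthread_mutex_lock(&printer);     /* P(PR) */
  pthread_mutex_lock(&terminal);    /* P(TE) -- opposite order: deadlock risk */
  /* <processing> */
  pthread_mutex_unlock(&terminal);  /* V(TE) */
  pthread_mutex_unlock(&printer);   /* V(PR) */
  return 0;
}

int main(void)
{ pthread_t t1, t2;
  pthread_create(&t1, NULL, p1, NULL);
  pthread_create(&t2, NULL, p2, NULL);
  pthread_join(t1, NULL);           /* may never return */
  pthread_join(t2, NULL);
  printf("no deadlock this time\n");
  return 0;
}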

Blockings

The following conditions are required for a deadlock to occur [Coffman, Elphick, Shoshani 71]:

1. Resources can only be used exclusively.

2. Processes hold resources while requesting new ones.
(may be broken by demanding that all required resources be requested at the same time)

3. Resources cannot be forcibly removed from processes.
(may be broken by forcible removal of resources, e.g. to resolve existing deadlocks)

4. There is a circular chain of processes such that each process holds resources requested by the next one in the chain.

Bräunl 2004 59

[Figure: nine processes (1..9) statically distributed over PE1, PE2, PE3; initial state vs. a later state in which only processes 1, 7, 4 remain, one per PE, while some processes are blocked ⇒ loss of parallelism]

Inefficiencies

Simple scheduling model: static distribution of processes to processors (no redistribution during run time)

⇒ can potentially lead to large inefficiencies.

Problem C.1: Load Balancing

Bräunl 2004 60

(by K. Hwang)

Dynamic allocation of processes to processors during run time (dynamic load balancing). Allocated processes may be redistributed (process migration) depending on the load level (threshold).

Methods:
Receiver-initiative: processors with low load request more processes. (useful at high system load)
Sender-initiative: processors with too much load want to offload processes. (useful at low system load)
Hybrid method: the system switches between receiver- and sender-initiative depending on the global system load.

Extended Scheduling Model

Bräunl 2004 61

Advantages and Disadvantages
+ Better processor utilization, no loss of possible parallelism
– General management overhead
– Methods kick in too late, namely when the load distribution is already seriously out of balance
– Process migration is an expensive operation and only makes sense for longer-running processes
o Circular process migration must be prevented via appropriate parallel algorithms and thresholds
– Under full parallel load, load balancing is useless!

Extended Scheduling Model

Bräunl 2004 62

3.7 MIMD Programming Languages

• Pascal derivatives
  – Concurrent Pascal (Brinch Hansen, 1977)
  – Ada (US Dept. of Defense, 1975)
  – Modula-P (Bräunl, 1986)

• C/C++ plus parallel libraries
  – Sequent C (Sequent, 1988)
  – pthreads
  – PVM "parallel virtual machine" (Sunderam et al., 1990)
  – MPI "message passing interface" (based on CVS, MPI Forum, 1995)

• Special languages
  – CSP "Communicating Sequential Processes" (Hoare, 1978)
  – Occam (based on CSP, inmos Ltd., 1984)
  – Linda (Carriero, Gelernter, 1986)

Bräunl 2004 63

Pthreads

• Extension to the standard Unix fork() and join()
• Previously many different implementations
• Standardized as IEEE POSIX 1003.1c (1995)
• Performance:
  – Much faster than fork() (about 10x)
  – Much faster than MPI, PVM
• See: http://www.llnl.gov/computing/tutorials/pthreads/
  http://www.cs.nmsu.edu/~jcook/Tools/pthreads/library.html

Bräunl 2004 64

Pthreads source: http://www.llnl.gov/computing/tutorials/pthreads/

• In the UNIX environment a thread:
  – Exists within a process and uses the process resources
  – Has its own independent flow of control as long as its parent process exists and the OS supports it
  – May share the process resources with other threads that act equally independently (and dependently)
  – Dies if the parent process dies, or something similar

• To the software developer, the concept of a "procedure" that runs independently from its main program may best describe a thread.

• Because threads within the same process share resources:
  – Changes made by one thread to shared system resources (such as closing a file) will be seen by all other threads.
  – Two pointers having the same value point to the same data.
  – Reading and writing to the same memory locations is possible, and therefore requires explicit synchronization by the programmer.

Bräunl 2004 65

Pthreads source: http://www.llnl.gov/computing/tutorials/pthreads/

[Figure: sequential address space]

Bräunl 2004 66

Pthreads source: http://www.llnl.gov/computing/tutorials/pthreads/

[Figure: parallel address space]

Bräunl 2004 67

Pthreads: Thread Functions

Create a new thread:
int pthread_create (pthread_t *thread_id, const pthread_attr_t *attributes, void *(*thread_function)(void *), void *arguments);

A thread terminates when its function returns, or explicitly by calling:
void pthread_exit (void *status);

A thread can wait for the termination of another:
int pthread_join (pthread_t thread, void **status_ptr);

A thread can read its own id:
pthread_t pthread_self ();

It can be checked whether two threads are identical:
int pthread_equal (pthread_t t1, pthread_t t2);

Bräunl 2004 68

Pthreads: Mutex Functions

Mutex data type: pthread_mutex_t

Initialize a mutex (mutual exclusion = simple binary semaphore); use a NULL pointer for default attributes:
int pthread_mutex_init (pthread_mutex_t *mut, const pthread_mutexattr_t *attr);

Lock mutex (= P(sema)):
int pthread_mutex_lock (pthread_mutex_t *mut);

Unlock mutex (= V(sema)):
int pthread_mutex_unlock (pthread_mutex_t *mut);

Non-blocking version of lock (either succeeds or returns EBUSY):
int pthread_mutex_trylock (pthread_mutex_t *mut);

Deallocate mutex:
int pthread_mutex_destroy (pthread_mutex_t *mut);

Bräunl 2004 69

Pthreads: Semaphore Functions

Semaphore data type: sem_t

Initialize and deallocate a semaphore (use PTHREAD_PROCESS_PRIVATE as the pshared value):
int sem_init (sem_t *sem, int pshared, unsigned int value);
int sem_destroy (sem_t *sem);

P(sema):
int sem_wait (sem_t *sem);
int sem_timedwait (sem_t *sem, const struct timespec *abstime);

V(sema):
int sem_post (sem_t *sem);
int sem_post_multiple (sem_t *sem, int count);

Other:
int sem_getvalue (sem_t *sem, int *sval);

Bräunl 2004 70

Pthreads: Monitor Conditions

Initialize a condition:
int pthread_cond_init (pthread_cond_t *cond, pthread_condattr_t *attr);

Wait version 1: standard (note: always blocks):
int pthread_cond_wait (pthread_cond_t *cond, pthread_mutex_t *mut);

Wait version 2: with timeout:
int pthread_cond_timedwait (pthread_cond_t *cond, pthread_mutex_t *mut, const struct timespec *abstime);

Signal version 1 (releases one waiting thread):
int pthread_cond_signal (pthread_cond_t *cond);

Signal version 2 (releases all waiting threads):
int pthread_cond_broadcast (pthread_cond_t *cond);

Deallocate a condition:
int pthread_cond_destroy (pthread_cond_t *cond);

Bräunl 2004 71
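The canonical usage pattern wraps pthread_cond_wait in a loop that re-checks the waited-for predicate, both because another thread may grab the resource first and because wakeups can be spurious; a small sketch (the items counter is an illustrative predicate, not from the slides):

#include <pthread.h>

pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
int items = 0;                     /* illustrative predicate state */

void consumer_take(void)
{ pthread_mutex_lock(&m);
  while (items == 0)               /* re-check the predicate after every wakeup */
    pthread_cond_wait(&c, &m);     /* atomically unlocks m and blocks */
  items--;
  pthread_mutex_unlock(&m);
}

void producer_put(void)
{ pthread_mutex_lock(&m);
  items++;
  pthread_cond_signal(&c);         /* wake one waiting consumer */
  pthread_mutex_unlock(&m);
}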

Pthreads: Hello World Example

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>   /* for exit() */

#define NUM_THREADS 5

void *PrintHello(void *threadid)
{
  printf("\n%d: Hello World!\n", (int) threadid);
  pthread_exit(NULL);
  return 0;
}

int main (int argc, char *argv[])
{ pthread_t threads[NUM_THREADS];
  int rc, t;
  for (t = 0; t < NUM_THREADS; t++)
  {
    printf("Creating thread %d\n", t);
    rc = pthread_create(&threads[t], NULL, PrintHello, (void *) t);
    if (rc)
    {
      printf("ERROR; return code from create() is %d\n", rc);
      exit(-1);
    }
  }
  pthread_exit(NULL);
}

Creating thread 0
Creating thread 1
0: Hello World!
Creating thread 2
1: Hello World!
2: Hello World!
Creating thread 3
Creating thread 4
3: Hello World!
4: Hello World!

Bräunl 2004 72

MPI (Message Passing Interface)

• based on CVS, MPI Forum (incl. hardware vendors), 1994/95
• MPI is a standard for a library of functions and macros that implements data sharing and synchronization between processes.
• Designed to be practical, portable, efficient and flexible.
• Public-domain implementations: MPICH (www.mcs.anl.gov/mpi), LAM-MPI (www.lam-mpi.org)
• Available for almost any platform; bindings for C, C++, FORTRAN
• MPE routines provide profiling and graphics output.
• Debugging tools are implementation-specific. MPI was created to address the lack of compatibility between software utilizing vendor-specific message-passing libraries.

Bräunl 2004 73

MPI (Message Passing Interface)

• Programming in a well-known language, with insertion of synchronization, communication, and process-grouping functions (no process creation, however, unlike in PVM)
• Parallel processing for MIMD distributed memory (can be used on shared or hybrid systems). Unlike in PVM, the code for all processes is usually contained in one executable.
• Different implementations provide different runtime environments. Some have daemons that watch over process execution (like PVM), e.g. LAM-MPI, and some do not, e.g. MPICH. Not all implementations are thread-safe!

Bräunl 2004 74

MPI Programming

• One source file for all processes. Processing responsibility is distinguished via rank (an integer uniquely assigned to each process), obtained by calling MPI_Comm_rank(). The number of processes requested by the user is obtained through MPI_Comm_size().
• MPI programs must call MPI_Init() before using any MPI calls. All programs must end with MPI_Finalize().
• MPI data type definitions are provided for basic data types (e.g. MPI_INT, MPI_DOUBLE). User data types can be defined. Packing is optional.
• Basic message passing:
  – Blocking: MPI_Send(), MPI_Recv()
  – Non-blocking: MPI_Isend(), MPI_Irecv(), followed by e.g. MPI_Wait(), MPI_Waitall() or MPI_Test()
• Collective communications: MPI_Bcast(), MPI_Barrier(), MPI_Reduce(), MPI_Scatter(), MPI_Gather(), MPI_Alltoall()
• MPI was designed to please many interests → many functions exist. Only a subset is really required for most application programs.

Bräunl 2004 75

MPI_Send(vector, 10, MPI_INT, dest, tag, MPI_COMM_WORLD)
MPI_Recv(vector, 10, MPI_INT, src, tag, MPI_COMM_WORLD, status)

Arguments: data buffer start; size (number of elements); data type; destination or source rank; tag; communicator (user-defined process group); status (info about the received message).

MPI_Bcast(vector, 10, MPI_INT, root_rank, MPI_COMM_WORLD)
(root_rank = rank of the broadcasting process)

MPI_Barrier(communicator)

MPI Programming

Bräunl 2004 76

Initialisation/Cleanup:
int MPI_Init(int *argc, char ***argv)
int MPI_Finalize()

Communicator info:
int MPI_Comm_rank(MPI_Comm comm, int *rank)
int MPI_Comm_size(MPI_Comm comm, int *size)

Blocking message passing:
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status *status)

Non-blocking message passing:
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
int MPI_Wait(MPI_Request *request, MPI_Status *status)
int MPI_Waitall(int count, MPI_Request array_of_requests[], MPI_Status array_of_statuses[])

MPI Functions

Bräunl 2004 77

int MPI_Barrier (MPI_Comm comm)

int MPI_Bcast ( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )

int MPI_Reduce ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

int MPI_Allreduce ( void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op,MPI_Comm comm)

int MPI_Gather ( void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

int MPI_Scatter ( void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf,int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm)

int MPI_Alltoall( void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf,int recvcnt, MPI_Datatype recvtype, MPI_Comm comm)

Collective Communication

MPI Functions

Bräunl 2004 78
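As an illustration of these collectives, a minimal sketch that computes π with MPI_Reduce, decomposing the intervals by rank; it mirrors the pthreads Pi example in Section 3.8 and is a sketch under those assumptions, not code from the slides:

#include <stdio.h>
#include "mpi.h"

#define INTERVALS 1000

int main(int argc, char **argv)
{ int rank, size, i;
  double width = 1.0 / INTERVALS, sum = 0.0, pi = 0.0, x;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* each process sums every size-th interval, starting at its own rank */
  for (i = rank; i < INTERVALS; i += size)
  { x = (i + 0.5) * width;
    sum += 4.0 / (1.0 + x * x) * width;
  }

  /* combine all partial sums at rank 0 */
  MPI_Reduce(&sum, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0) printf("pi is approximately %.8f\n", pi);
  MPI_Finalize();
  return 0;
}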

• The runtime environment is implementation-specific (here: MPICH).
• A machine file containing a list of machines to be used should be created.
• Processes are started with the command mpirun. A machine file can be given as a parameter (if not, a global default is used).
• The number of processes to be started may also be specified (option -np):

mpirun -machinefile mfile -np 2 myprog

This starts 2 instances of myprog on the machines specified in mfile.

MPI Execution

Bräunl 2004 79

#include <stdio.h>
#include <string.h>
#include "mpi.h"

int main(int argc, char **argv)
{ int myrank = -1;
  int l;
  char buf[100];
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
  if (myrank == 0)
  {
    MPI_Recv(buf, 100, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
    printf("message from process %d : %s \n", status.MPI_SOURCE, buf);
  }
  else if (myrank == 1)
  {
    strcpy(buf, "Hello world from: ");
    MPI_Get_processor_name(buf + strlen(buf), &l);
    MPI_Send(buf, 100, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
  }
  MPI_Finalize();
}

MPI "Hello World" (adapted from Manchek's PVM example)

Bräunl 2004 80

3.8 Coarse-Grained Parallel Algorithms

• Synchronization with Semaphores
• Synchronization with Monitor
• Pi calculation
• Distributed simulation

Bräunl 2004 81

Synchronization with Semaphores (pthreads)

#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>

#define BUF_SIZE 5

sem_t critical, freex, used;
int buf[BUF_SIZE];
int pos;

void produce(int i)
{ sem_wait(&freex);
  sem_wait(&critical);
  if (pos >= BUF_SIZE) { printf("Err\n"); }
  buf[pos++] = i;
  printf("write Pos: %d %d\n", pos-1, i);
  sem_post(&critical);
  sem_post(&used);
}

void consume(int *i)
{ sem_wait(&used);
  sem_wait(&critical);
  if (pos < 0) { printf("Err\n"); }
  *i = buf[--pos];
  printf("read Pos: %d %d \n", pos, *i);
  sem_post(&critical);
  sem_post(&freex);
}

void *prod_thread(void *pparam)
{ int i = 0;
  while (1)
  { i = (i+1) % 10;
    produce(i);
  }
}

void *cons_thread(void *pparam)
{ int i, quer = 0;
  while (1)
  { consume(&i);
    quer = (quer+i) % 10;
  }
}

int main(void)
{ pthread_t p, c;
  int i;
  sem_init(&freex, PTHREAD_PROCESS_PRIVATE, BUF_SIZE);
  sem_init(&used, PTHREAD_PROCESS_PRIVATE, 0);
  sem_init(&critical, PTHREAD_PROCESS_PRIVATE, 1);
  for (i = 0; i < BUF_SIZE; i++) buf[i] = 0;
  pos = 0;
  pthread_create(&p, NULL, prod_thread, 0);
  pthread_create(&c, NULL, cons_thread, 0);
  pthread_join(p, NULL);
  pthread_join(c, NULL);
}

Bräunl 2004 82

write Pos: 1 1
write Pos: 2 2
read Pos: 2 2
write Pos: 2 3
read Pos: 2 3
write Pos: 2 4
read Pos: 2 4
write Pos: 2 5
read Pos: 2 5
write Pos: 2 6
read Pos: 2 6
.....

Synchronization with Semaphores: Sample Run

Bräunl 2004 83

#include <stdio.h>
#include <pthread.h>

#define BUF_SIZE 5

pthread_cond_t xused, xfree;
pthread_mutex_t mon;
int stack[BUF_SIZE];
int pointer = 0;

void buffer_write(int a)
{ pthread_mutex_lock(&mon);
  if (pointer == BUF_SIZE)
    pthread_cond_wait(&xfree, &mon);
  stack[pointer++] = a;
  printf("write %d %d \n", pointer-1, a);
  if (pointer == 1)
    pthread_cond_signal(&xused);
  pthread_mutex_unlock(&mon);
}

void buffer_read(int *a)
{ pthread_mutex_lock(&mon);
  if (pointer == 0)
    pthread_cond_wait(&xused, &mon);
  *a = stack[--pointer];
  printf("read %d %d\n", pointer, *a);
  if (pointer == BUF_SIZE-1)
    pthread_cond_signal(&xfree);
  pthread_mutex_unlock(&mon);
}

Synchronization with Monitor (pthreads)

Bräunl 2004 84

Synchronization with Monitor (pthreads)

void *prod_thread(void *pparam)
{ int i = 0;
  printf("Init Producer \n");
  while (1)
  { i = (i+1) % 10;
    buffer_write(i);
  }
}

void *cons_thread(void *pparam)
{ int i, quer = 0;
  printf("Init Consumer \n");
  while (1)
  { buffer_read(&i);
    quer = (quer+i) % 10;
  }
}

int main(void)
{ pthread_t p, c;
  int i;
  printf("Init... \n");
  pthread_cond_init(&xfree, NULL);
  pthread_cond_init(&xused, NULL);
  pthread_mutex_init(&mon, NULL);
  for (i = 0; i < BUF_SIZE; i++) stack[i] = 0;
  pointer = 0;
  pthread_create(&p, NULL, prod_thread, 0);
  pthread_create(&c, NULL, cons_thread, 0);
  pthread_join(p, NULL);
  pthread_join(c, NULL);
}

Bräunl 2004 85

write 1 1
read 1 1
write 1 2
read 1 2
write 1 3
write 2 4
read 2 4
write 2 5
read 2 5
write 2 6
read 2 6
write 2 7
read 2 7
write 2 8
read 2 8
.....

Synchronization with Monitor: Sample Run

Bräunl 2004 86

Pi Calculation

[Figure: plot of f(x) = 4/(1+x²) for x in [0,1], y-axis from 0 to 4]

π = ∫₀¹ 4/(1+x²) dx ≈ Σ (i=1..intervals) 4/(1 + ((i-0.5)·width)²) · width,  where width = 1/intervals

Bräunl 2004 87

Pi Calculation (pthreads)

#include <stdio.h>
#include <pthread.h>

#define MAX_THREADS 10
#define INTERVALS 1000
#define WIDTH (1.0/(double)(INTERVALS))

pthread_mutex_t result_mtx;
pthread_mutex_t interval_mtx;

double f(double x)
{ return 4.0/(1.0 + x*x); }

void assignment_put_result(double res)
{ static double sum = 0;
  static int answers = 0;

  pthread_mutex_lock(&result_mtx);
  sum += res;
  answers++;
  pthread_mutex_unlock(&result_mtx);
  if (answers == INTERVALS) printf("Result is %f\n", sum);
}

void assignment_get_interval(int *iv)
{ static int pos = 0;

  pthread_mutex_lock(&interval_mtx);
  if (++pos <= INTERVALS) *iv = pos;
  else *iv = -1;
  pthread_mutex_unlock(&interval_mtx);
}

void *worker(void *pparam)
{ int iv;
  double res;

  assignment_get_interval(&iv);
  while (iv > 0)
  { res = WIDTH * f(((double)(iv) - 0.5) * WIDTH);
    assignment_put_result(res);
    assignment_get_interval(&iv);
  }
  return 0;
}

Bräunl 2004 88

int main(void)
{ pthread_t thread[MAX_THREADS];
  int i;

  pthread_mutex_init(&interval_mtx, NULL);
  pthread_mutex_init(&result_mtx, NULL);
  for (i = 0; i < MAX_THREADS; i++)
    pthread_create(&thread[i], NULL, worker, (void*) i);

  printf("Main: threads created\n");

  for (i = 0; i < MAX_THREADS; i++)
    pthread_join(thread[i], NULL);

  pthread_mutex_destroy(&interval_mtx);
  pthread_mutex_destroy(&result_mtx);
}

Pi Calculation (pthreads)

Bräunl 2004 89

Start a number of worker threads.

Each thread gets its line number from a monitor, then works locally on its area and stores the results back.

Distributed Simulation

Model:
• 2-dimensional field of elements (persons)
• During each time step, each person assumes the opinion of a random neighbor

Problem decomposition and assignment to processes:
[Figure: the 2-dim. field is decomposed into lines, which are assigned to the worker processes]

Bräunl 2004 90

Distributed Simulation (pthreads)

#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>

#define MAX_THREADS 5
#define LINES 40
#define COLS 60
#define GENERATIONS 10000

pthread_mutex_t crmtx;
int arr[LINES][COLS];

void monitor_get_linenumber(int *j)   /* returns current line index */
{ static int current_row = 0;

  pthread_mutex_lock(&crmtx);
  *j = current_row;
  current_row = (current_row+1) % LINES;
  pthread_mutex_unlock(&crmtx);
}

void monitor_read_line(int j, int *line)   /* returns specified line */
{ int i;

  pthread_mutex_lock(&crmtx);
  for (i = 0; i < COLS; i++) line[i] = arr[j][i];
  pthread_mutex_unlock(&crmtx);
}

void monitor_put_line(int j, int *line)    /* write back one line */
{ int i;

  pthread_mutex_lock(&crmtx);
  for (i = 0; i < COLS; i++) arr[j][i] = line[i];
  pthread_mutex_unlock(&crmtx);
}

void print_array(int cnt)
{ … }

Bräunl 2004 91

void *worker(void *pparam)
{ int k, j, pos, cnt;
  int line[COLS], above[COLS], below[COLS], newl[COLS];

  for (cnt = 0; cnt <= GENERATIONS; cnt++)
  { if (pparam == 0 && cnt % 1000 == 0) print_array(cnt);
    monitor_get_linenumber(&j);
    monitor_read_line(j, line);
    if (j > 0) monitor_read_line(j-1, above);
    else monitor_read_line(LINES-1, above);
    if (j < LINES-1) monitor_read_line(j+1, below);
    else monitor_read_line(0, below);
    for (k = 0; k < COLS; k++)
    { pos = rand() % 8;
      switch (pos)
      { case 0: newl[k] = above[(k-1+COLS)%COLS]; break;
        case 1: newl[k] = above[k]; break;
        case 2: newl[k] = above[(k+1)%COLS]; break;
        case 3: newl[k] = line [(k-1+COLS)%COLS]; break;
        case 4: newl[k] = line [(k+1)%COLS]; break;
        case 5: newl[k] = below[(k-1+COLS)%COLS]; break;
        case 6: newl[k] = below[k]; break;
        case 7: newl[k] = below[(k+1)%COLS]; break;
      }
    }
    monitor_put_line(j, newl);
  }
  return 0;
}

Distributed Simulation (pthreads)

int main(void)
{ pthread_t thread[MAX_THREADS];
  int i, j;

  srand(time(NULL));
  pthread_mutex_init(&crmtx, NULL);

  /* Initialise array */
  for (j = 0; j < LINES; j++)
    for (i = 0; i < COLS; i++)
      arr[j][i] = rand() % 2;

  /* Start threads */
  for (i = 0; i < MAX_THREADS; i++)
    pthread_create(&thread[i], NULL, worker, (void *) i);

  /* Make sure all threads are finished */
  for (i = 0; i < MAX_THREADS; i++)
    pthread_join(thread[i], NULL);
  return 0;
}

Bräunl 2004 92

Result of simulation: Local clustering appears.

[Figure: random start values; after 10 steps; after 100 steps]

Distributed Simulation