3. Asynchronous Parallelism Sample MIMD ... - Robotics UWA
Transcript of 3. Asynchronous Parallelism Sample MIMD ... - Robotics UWA
Bräunl 2004 1
[Diagram: shared-memory MIMD setup - several PEs (processing elements) connected via a network to memory modules; the shared memory can be addressed by all PEs]
3. Asynchronous Parallelism
Setup of an MIMD-System
General Model
Program is segmented into autonomous processes.
Ideal allocation: 1 process : 1 processor
But usually: n processes : 1 processor ⇒ scheduler required
Bräunl 2004 2
[Diagram: 10 × 80386 CPUs (32-bit) on a system bus with 40 MB memory, Multibus adapter, drive controller, SCED board, Multibus/SCSI bus, console and Ethernet]
Sample MIMD Systems
Sequent Symmetry
Bräunl 2004 3
Intel iPSC Hypercube
• iPSC = Intel Personal Supercomputer
• Predecessor: "Cosmic Cube" (CalTech, Pasadena)
• Generations: iPSC/1 (80286), iPSC/2 (80386), iPSC/860 (80860)
• max. 128 processors
• Hypercube connection network
Intel Paragon
• Predecessor: "Touchstone Delta"
• max. 512 nodes with 2 × i860 XP processors each
• In each node: 1 processor for arithmetic + 1 processor for data exchange
• Different node types (dynamically configurable): compute node, service node, I/O node
• Grid network (4 neighbors)
Sample MIMD Systems
Bräunl 2004 4
Cray "Red Storm" - in cooperation with Sandia National Labs
• Standard AMD Opteron processors (originally 2 per node, in future only 1)
• max. 10,000 nodes
• max. 40 TFLOPS
• at US$ 90 million
• Different node types: compute node, I/O node
• 3D grid network (6 neighbors)
Sample MIMD Systems
Image: Cray
Bräunl 2004 5
3.1 Process Model
[State diagram: Ready → Running at the start of a time slice (OS: Reschedule); Running → Ready at the end of a time slice (OS: Ready); Running → Blocked via P(sema); Blocked → Ready via V(sema); Running → terminated (OS: Kill or terminate)]
Bräunl 2004 6
Process-States for an individual Processor
The states 'ready' and 'blocked' are managed as queues.
Execution of processes is done in time-sharing mode (time slices).
[State diagram: New → ready → executing → Done; executing → blocked → ready]
Process States
Bräunl 2004 7
[State diagram: New → (add) → ready → (assign) → executing on PE1/PE2/PE3 → (resign) → ready; executing → (block) → blocked → (ready) → ready; executing → (retire) → Done]
Process States
Process-States for MIMD with shared memory
A process can be executed by different processors in sequential time slices.
Bräunl 2004 8
Process-States for MIMD without shared memory
[Diagram: Processors 1-4, each with its own independent state machine (ready, executing, locally blocked) - no shared queues between processors]
Process States
The actual allocation, i.e. which process executes on which processor, is often transparent to the programmer.
Bräunl 2004 9
“light" processes
Idea: Process concept remains, but less overhead.Previous costs: Process context switching, especially due to loading/saving of dataSaving: No loading/saving of program data on context switchingPrerequisites: One program with multiple processes on system with shared
memory (computer sequential or MIMD)Implementation: A user program with multiple processes always holds all data
• no loading/saving required • fast execution times
User view: Like processes, only faster
Threads
Bräunl 2004 10
Parallel processing creates the following problems:
1. Data exchange between two processes
2. Simultaneous access to a data area by multiple processes must be prevented
[Diagram: a message passed between processes P1 and P2]
3.2 Synchronization and Communication
Bräunl 2004 11
Railway example of this problem:
Synchronization and Communication
[Figure: two trains (1, 2) competing for a single shared track section]
Bräunl 2004 12
(by Peterson, Silberschatz)
Multiple processes are executed in parallel. These need to be synchronized.
One possibility for this is using synchronization variables:

  ...
  start(P1);
  start(P2);
  ...

  var turn: 1..2;
  Initialization: turn := 1;

  P1:
    loop
      while turn ≠ 1 do (*nothing*) end;
      <critical section>
      turn := 2;
      <other instructions>
    end

  P2:
    loop
      while turn ≠ 2 do (*nothing*) end;
      <critical section>
      turn := 1;
      <other instructions>
    end
Software Solution
Bräunl 2004 13
Software Solution
• This solution guarantees that only 1 process can enter the critical section.
• But there is one major disadvantage: alternating access is enforced.
⇒ RESTRICTION !!
Bräunl 2004 14
  var flag: array [1..2] of BOOLEAN;
  Initialization: flag[1] := false; flag[2] := false;

  P1:
    loop
      while flag[2] do (*nothing*) end;
      flag[1] := true;
      <critical section>
      flag[1] := false;
      <other instructions>
    end

  P2:
    loop
      while flag[1] do (*nothing*) end;
      flag[2] := true;
      <critical section>
      flag[2] := false;
      <other instructions>
    end
Software Solution – 1st Improvement
Bräunl 2004 15
It is not that easy though:
Should both processes exit their while-loops simultaneously (despite the safety check), then both will enter the critical section.
⇒ INCORRECT !!
Software Solution – 1st Improvement
Bräunl 2004 16
  var flag: array [1..2] of BOOLEAN;
  Initialization: flag[1] := false; flag[2] := false;

  P1:
    loop
      flag[1] := true;
      while flag[2] do (*nothing*) end;
      <critical section>
      flag[1] := false;
      <other instructions>
    end

  P2:
    loop
      flag[2] := true;
      while flag[1] do (*nothing*) end;
      <critical section>
      flag[2] := false;
      <other instructions>
    end
Software Solution – 2nd Improvement
Bräunl 2004 17
If the two lines "flag[i]:=true" are moved before the while-loop, the error of improvement 1 will not occur, but now we can have a deadlock instead.
⇒ INCORRECT !!
Software Solution – 2nd Improvement
Bräunl 2004 18
  var turn: 1..2;
      flag: array [1..2] of BOOLEAN;
  Initialization: turn := 1;  (* arbitrary *)
                  flag[1] := false; flag[2] := false;

  P1:
    loop
      flag[1] := true;
      turn := 2;
      while flag[2] and (turn = 2) do (*nothing*) end;
      <critical section>
      flag[1] := false;
      <other instructions>
    end

  P2:
    loop
      flag[2] := true;
      turn := 1;
      while flag[1] and (turn = 1) do (*nothing*) end;
      <critical section>
      flag[2] := false;
      <other instructions>
    end
Software Solution – 3rd Improvement
Bräunl 2004 19
• Expandable to n processes
• Also see Dekker's algorithm
• Disadvantage of software solutions: "busy-wait", i.e. significant efficiency loss if each process does not have its own processor!
⇒ CORRECT !!
Software Solution – 3rd Improvement
Bräunl 2004 20
Test-and-Set Operation (standard in most processors)

  function test_and_set (var target: BOOLEAN): BOOLEAN;
  begin_atomic
    test_and_set := target;
    target := true;
  end_atomic;

Implemented as an atomic operation (indivisible, uninterruptible: "1 instruction cycle of the CPU").
Hardware Solution
Bräunl 2004 21
New solution using the Test-and-Set operation:

  var lock: BOOLEAN;
  Initialization: lock := false;

  Pi:
    loop
      while test_and_set(lock) do (*nothing*) end;
      <critical section>
      lock := false;
      <other instructions>
    end;

• Removal of the busy-wait via queues (see semaphores).
Hardware Solution
Bräunl 2004 22
Dijkstra, 1965 (similar to a railway signal post)

Application:
  Pi:
    ...
    P(sema);
    <critical section>
    V(sema);
    ...
3.3 Semaphore
Bräunl 2004 23
Implementation: P and V are atomic operations (indivisible).

  type Semaphore = record
    value: INTEGER;
    L: list of Proc_ID;
  end;

Generic semaphore → value: INTEGER
Boolean semaphore → value: BOOLEAN
Semaphore
Bräunl 2004 24
Initialization: L = empty list;
                value = number of allowed P-operations without a V-operation

  procedure P (var S: Semaphore);
  begin
    S.value := S.value - 1;
    if S.value < 0 then
      append(S.L, actproc);  (* append this process to S.L *)
      block(actproc)         (* and move it to state "blocked" *)
    end
  end P;

  procedure V (var S: Semaphore);
  var Pnew: Proc_ID;
  begin
    S.value := S.value + 1;
    if S.value ≤ 0 then
      getfirst(S.L, Pnew);   (* remove first process from S.L *)
      ready(Pnew)            (* and change its state to "ready" *)
    end
  end V;
Semaphore
Bräunl 2004 25
How do we achieve that P and V are atomic operations?

• Single-processor system: disable all interrupts
• Multi-processor system:
  - Software solution: busy-wait with the short P- or V-operation as the critical section
  - Hardware solution: short busy-wait using the Test-and-Set instruction before the start of a P- or V-operation

Attention: convoy phenomenon
Semaphore
Bräunl 2004 26
Using Boolean semaphores

Declaration and initialization:
  var empty: semaphore [1];
      full : semaphore [0];

  process Producer;
  begin
    loop
      <create data>
      P(empty);
      <fill buffer>
      V(full);
    end;
  end process Producer.

  process Consumer;
  begin
    loop
      P(full);
      <clear buffer>
      V(empty);
      <process data>
    end;
  end process Consumer.
Producer-Consumer Problem
Bräunl 2004 27
[Petri net: place "empty" enables the Producer's data write, which marks "full"; "full" enables the Consumer's data read, which marks "empty" again; data creation precedes the write, data consumption follows the read]
Producer-Consumer Problem - Corresponding Petri-Net
Bräunl 2004 28
  var critical: semaphore[1];
      free    : semaphore[n];  (* there are n buffer spaces *)
      used    : semaphore[0];

  process Producer;
  begin
    loop
      <data creation>
      P(free);
      P(critical);
      <write to buffer>
      V(critical);
      V(used);
    end;
  end process Producer;

  process Consumer;
  begin
    loop
      P(used);
      P(critical);
      <read buffer>
      V(critical);
      V(free);
      <process data>
    end;
  end process Consumer;
Bounded Buffer Problem
Bräunl 2004 29
Corresponding Petri-Net:
[Petri net: Producer transitions (data creation, data write) and Consumer transitions (data read, data consumption) share the places "free" (initially n tokens), "used" (initially 0 tokens) and "critical" (1 token)]
Bounded Buffer Problem
Bräunl 2004 30
  var count: semaphore[1];
      r_w  : semaphore[1];  (* one writer or many readers *)
      readcount: INTEGER;
  Initialization: readcount := 0;

  process Reader;
  begin
    loop
      P(count);
      if readcount = 0 then P(r_w) end;
      readcount := readcount + 1;
      V(count);
      <read buffer>
      P(count);
      readcount := readcount - 1;
      if readcount = 0 then V(r_w) end;
      V(count);
      <process data>
    end; (* loop *)
  end process Reader;

  process Writer;
  begin
    loop
      <data creation>
      P(r_w);
      <write buffer>
      V(r_w);
    end; (* loop *)
  end process Writer;
Readers-Writers Problem
Bräunl 2004 31
[Petri net: places "count" (1 token), "readcount" (0 tokens) and "r_w" (1 token); a reader's first P(count) takes r_w when readcount = 0, the writer's data write also requires r_w; data read/consumption on the reader side, data creation/write on the writer side]
Readers-Writers Problem
Bräunl 2004 32
Thread Example 1

  #include <pthread.h>
  #include <sched.h>
  #include <stdio.h>

  #define repeats   3
  #define threadnum 5

  void *slave(void *arg)
  { int id = (int)(long) arg;
    for (int i = 0; i < repeats; i++) {
      printf("thread %d\n", id);
      sched_yield();
    }
    return 0;
  }

  int main()
  { pthread_t thread;
    for (int i = 0; i < threadnum; i++) {
      if (pthread_create(&thread, NULL, slave, (void *)(long) i))
        printf("Error: thread creation failed\n");
    }
    pthread_exit(0);
  }
Sample output:
  thread 0  thread 1  thread 2  thread 4  thread 3
  thread 0  thread 2  thread 4  thread 1  thread 3
  thread 0  thread 2  thread 4  thread 1  thread 3
Bräunl 2004 33
Thread Example 2

  #include <pthread.h>
  #include <stdio.h>

  #define repeats   3
  #define threadnum 5

  void *slave(void *arg)
  { int id = (int)(long) arg;
    for (int i = 0; i < repeats; i++) {
      printf("thread %d\n", id);
    }
    return 0;
  }

  int main()
  { pthread_t thread;
    for (int i = 0; i < threadnum; i++) {
      if (pthread_create(&thread, NULL, slave, (void *)(long) i))
        printf("Error: thread creation failed\n");
    }
    pthread_exit(0);
  }
Sample output:
  thread 0  thread 0  thread 0  thread 2  thread 2
  thread 2  thread 1  thread 1  thread 1  thread 3
  thread 3  thread 4  thread 3  thread 4  thread 4
Bräunl 2004 34
Thread Example 3

  #include <pthread.h>
  #include <stdio.h>

  #define repeats   3
  #define threadnum 5

  void *slave(void *arg)
  { int id = (int)(long) arg;
    for (int i = 0; i < repeats; i++) {
      printf("%d-A\n", id);
      printf("%d-B\n", id);
      printf("%d-C\n", id);
    }
    return 0;
  }

  int main()
  { pthread_t thread;
    for (int i = 0; i < threadnum; i++) {
      if (pthread_create(&thread, NULL, slave, (void *)(long) i))
        printf("Error: thread creation failed\n");
    }
    pthread_exit(0);
  }
Sample output (interleaved):
  0-A 0-B 0-C 0-A 0-B 2-A 1-A 0-C 3-A 2-B 4-A 1-B 0-A 3-B 2-C
  4-B 1-C 0-B 3-C 2-A 4-C 1-A 0-C 3-A 2-B 4-A 1-B 3-B 2-C 4-B
  1-C 3-C 2-A 4-C 1-A 3-A 2-B 4-A 1-B 3-B 2-C 4-B 1-C 3-C 4-C
Bräunl 2004 35
Thread Example 4

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define repeats   3
  #define threadnum 5

  pthread_mutex_t mutex;

  void *slave(void *arg)
  { int id = (int)(long) arg;
    for (int i = 0; i < repeats; i++) {
      pthread_mutex_lock(&mutex);
      printf("%d-A\n", id);
      printf("%d-B\n", id);
      printf("%d-C\n", id);
      pthread_mutex_unlock(&mutex);
    }
    return 0;
  }

  int main()
  { pthread_t thread;
    if (pthread_mutex_init(&mutex, NULL))
    { printf("Error: mutex failed\n"); exit(1); }
    for (int i = 0; i < threadnum; i++) {
      if (pthread_create(&thread, NULL, slave, (void *)(long) i))
      { printf("Error: thread creation failed\n"); exit(1); }
    }
    pthread_exit(0);
  }
Sample output (each A-B-C group now stays together):
  0-A 0-B 0-C  0-A 0-B 0-C  0-A 0-B 0-C
  2-A 2-B 2-C  4-A 4-B 4-C  1-A 1-B 1-C  3-A 3-B 3-C
  2-A 2-B 2-C  4-A 4-B 4-C  1-A 1-B 1-C  3-A 3-B 3-C
  2-A 2-B 2-C  4-A 4-B 4-C  1-A 1-B 1-C  3-A 3-B 3-C
Bräunl 2004 36
By Hoare 1974, Brinch Hansen 1975
• Abstract data type
• A monitor encapsulates the data area to be protected and the corresponding data access mechanisms (entries, conditions).

Usage:
  P1                    P2
  ...                   ...
  Buffer:DataWrite(x)   Buffer:DataRead(x)
  ...                   ...

Monitor calls are mutually exclusive; they are automatically synchronized.
3.4 Monitor
Bräunl 2004 37
[Diagram: buffer implemented as a stack with slots 1..n and a stack pointer (initially 0)]
Monitor
Application Example: Buffer management
Bräunl 2004 38
Monitor for Buffer Management

  monitor Buffer;
  var Stack: array [1..n] of Dataset;
      Pointer: 0..n;
      free, used: CONDITION;

  entry WriteData (a: Dataset);
  begin
    while Pointer = n do  (* stack full *)
      wait(free)
    end;
    inc(Pointer);
    Stack[Pointer] := a;
    if Pointer = 1 then signal(used) end;
  end WriteData;

  entry ReadData (var a: Dataset);
  begin
    while Pointer = 0 do  (* stack empty *)
      wait(used)
    end;
    a := Stack[Pointer];
    dec(Pointer);
    if Pointer = n-1 then signal(free) end;
  end ReadData;

  begin  (* monitor initialization *)
    Pointer := 0
  end monitor Buffer;
Bräunl 2004 39
Monitor Conditions
• wait(Cond): The executing process blocks itself and waits until another process executes a signal-operation on the condition Cond.
• signal(Cond): All processes waiting on the condition Cond are reactivated and will again apply for access to the monitor. (Another variant releases only one process: the next one in the queue.)
• status(Cond): Returns the number of processes waiting on this condition.
Bräunl 2004 40
Conditions are queues, similar to those of semaphores.
Monitor Implementation Steps:
1) var MSema: semaphore[1];  (* required for every monitor *)

2) Rewriting of entries to procedures:

  procedure xyz(...)
  begin
    P(MSema);
    <instructions>
    V(MSema);
  end xyz;
3) Rewrite of monitor initialization into a procedure and corresponding call from main program.
Monitor Implementation
Bräunl 2004 41
4) procedure wait (Cond: condition; MSema: semaphore);
   begin
     append(Cond, actproc);  (* insert ProcID into the condition queue *)
     block(actproc);         (* dispatcher: insert ProcID into the blocked list *)
     V(MSema);               (* release the monitor semaphore *)
     assign;                 (* dispatcher: load next ready process *)
   end wait;

5) procedure status (Cond: condition): CARDINAL;
   begin
     return length(Cond);
   end status;
Monitor Implementation
Bräunl 2004 42
6a) In this implementation only one process is released (wait in if-clause):

  procedure signal (Cond, MSema);
  var NewProc: Proc_ID;
  begin
    if not empty(Cond) then
      GetFirst(Cond, NewProc);
      <P-operation for NewProc>
    end
  end signal;

[Diagram: the first process is moved from the Cond queue to the MSema queue; Cond may become empty (NIL)]
Monitor Implementation
Bräunl 2004 43
[Diagram: the whole Cond queue is appended to the MSema queue; Cond becomes empty]

6b) In this implementation all waiting processes are released (wait in while-loop):

  procedure signal (var Cond: condition; var MSema: semaphore);
  begin  (* status check not needed here *)
    append(MSema.L, Cond);  (* append the whole list *)
    MSema.value := MSema.value - status(Cond);
    Cond := nil;
  end signal;
Monitor Implementation
Bräunl 2004 44
3.5 Message Passing
• For distributed systems (no shared memory) this is the only method of communication.
• Also usable for systems with shared memory.
• Easy to use, but expensive in computation time (overhead).
⇒ Implementation with implicit communication: Remote Procedure Call
Bräunl 2004 45
Operations:
  Send_A    (Receiver, Message)        (* send a task *)
  Receive_A (var Sender; var Message)  (* receive a task *)
  Send_R    (Receiver, Message)        (* send a reply *)
  Receive_R (var Sender; var Message)  (* receive a reply *)

Send/receive of jobs, and send/receive of replies.
Each process can receive as many jobs without processing them as fit into its buffer.

[Diagram: processes 1..n, each with a buffer area for messages, split into A (tasks) and R (replies)]
Message Passing Example
Bräunl 2004 46
  PC (client):                        PS (server):
    ...                                 loop
    Send_A(to_Server, Task);              Receive_A(Client, Task);
    ...                                   <process task>
    Receive_R(from_Server, Result)        Send_R(Client, Result);
                                        end;
Implementation
• In parallel systems with shared memory: monitor
• In parallel systems without shared memory: decentralized network management with message protocols
Message Passing Example
Bräunl 2004 47
Schematic for systems with shared memory (use of a global message pool):
[Diagram: Send_A / Receive_A and Send_R / Receive_R all operate on one shared message pool]
Message Passing Example
Bräunl 2004 48
  type PoolElem = record
         free: BOOLEAN;
         from: 1..number_of_procs;
         info: Message;
       end;
       Queue = record
         contents: 1..max;
         next: pointer to Queue
       end;

  monitor Messages;
  var Pool: array [1..max] of PoolElem;  (* global message pool *)
      pfree: CONDITION;                  (* queue, in case the pool is completely full *)
      Afull, Rfull: array [1..number_of_procs] of CONDITION;
        (* queue for each process for incoming messages *)
      queueA, queueR: array [1..number_of_procs] of Queue;
        (* local message queues for each process *)
Message Passing Example
Bräunl 2004 49
  entry Send_A (to: 1..number_of_procs; a: Message);
  var id: 1..max;
  begin
    while not GetFreeElem(id) do wait(pfree) end;
    with Pool[id] do
      free := false;
      from := actproc;
      info := a;
    end;
    append(queueA[to], id);  (* insert slot number into task queue *)
    signal(Afull[to]);
  end Send_A;

  entry Receive_A (var von: 1..number_of_procs; var a: Message);
  var id: 1..max;
  begin
    while empty(queueA[actproc]) do wait(Afull[actproc]) end;
    id := head(queueA[actproc]);
    von := Pool[id].from;
    a := Pool[id].info;  (* Pool[id] not yet freed *)
  end Receive_A;
Message Passing Example
Bräunl 2004 50
  entry Send_R (to: 1..number_of_procs; ergebnis: Message);
  var id: 1..max;
  begin
    id := head(queueA[actproc]);
    tail(queueA[actproc]);  (* remove first element (head) of queue *)
    Pool[id].from := actproc;
    Pool[id].info := ergebnis;
    append(queueR[to], id);  (* insert slot number into reply queue *)
    signal(Rfull[to])
  end Send_R;

  entry Receive_R (var von: 1..number_of_procs; var erg: Message);
  var id: 1..max;
  begin
    while empty(queueR[actproc]) do wait(Rfull[actproc]) end;
    id := head(queueR[actproc]);
    tail(queueR[actproc]);  (* remove first element (head) of queue *)
    with Pool[id] do
      von := from;
      erg := info;
      free := true;  (* release the pool element *)
    end;
    signal(pfree);  (* a free pool element exists *)
  end Receive_R;
Message Passing Example
Bräunl 2004 51
3.6 Problems with Asynchronous Parallelism
Time-dependent errors
• Not reproducible!
• Cannot be found by systematic testing!

A. Inconsistent Data
   A set of data (or a relationship between data items) does not have the value it would have received if the operations had been processed sequentially.
B. Blockings
   Deadlock, livelock
C. Inefficiencies
   Load balancing, process migration
Bräunl 2004 52
Before: Income[Miller] = 1000

  P1:                      P2:
  ...                      ...
  x := Income[Miller];     y := Income[Miller];
  x := x + 50;             y := 1.1 * y;
  Income[Miller] := x;     Income[Miller] := y;
  ...                      ...

After: Income[Miller] = ?  (1050, 1100, 1150 or 1155)

Inconsistent Data
Problem A.1: Lost Update
Depending on the sequence of memory accesses
Bräunl 2004 53
Before: account[A] = 2000, account[B] = 1000, Sum(A,B) = 3000

  P1:                      P2:
  ...                      ...
  x := account[A];         a := account[A];
  x := x - 400;            b := account[B];
  account[A] := x;         sum := a + b;
  x := account[B];         ...
  x := x + 400;
  account[B] := x;
  ...

Inconsistent Data
Problem A.2: Inconsistent Analysis
Bräunl 2004 54
After: account[A] = 1600, account[B] = 1400, Sum(A,B) = 3000 (same as before!)

sum = ?  (2600, 3000 or 3400)

Inconsistent Data
Depending on the sequence of memory accesses
Bräunl 2004 55
Inconsistent Data
Problem A.3: Uncommitted Dependencies
In databases and transaction systems: transactions are atomic operations (commit / rollback).
Bräunl 2004 56
Blockings
Sample occurrences:
• All processes are blocked in semaphore or queue conditions (deadlock)
• All processes are caught in mutual busy-wait or polling statements (livelock)
• A group of processes is waiting for the occurrence of a condition which can only be generated by that group itself (alternating dependency)
Bräunl 2004 57
  P1:  P(TE); P(PR); <processing>; V(PR); V(TE);
  P2:  P(PR); P(TE); <processing>; V(TE); V(PR);

Blockings
Two processes require the terminal (TE) and printer (PR) resources for their computation.
Problem B.1: Deadlock
Bräunl 2004 58
Blockings
The following conditions are required for a deadlock to occur [Coffman, Elphick, Shoshani 71]:

1. Resources can only be used exclusively.
2. Processes hold resources while requesting new ones. (May be broken by demanding that all required resources be requested at the same time.)
3. Resources cannot be forcibly removed from processes. (May be broken by forced removal of resources, e.g. to resolve existing deadlocks.)
4. There is a circular chain of processes such that each process holds the resources requested by the next one in the chain.
Bräunl 2004 59
[Diagram: nine processes (1-9) statically distributed over PE1, PE2, PE3; later only processes 1, 4, 7 remain runnable while some processes are blocked - loss of parallelism]
Inefficiencies
Simple scheduling model: static distribution of processes to processors (no redistribution during run time)
⇒ can potentially lead to large inefficiencies.
Problem C.1: Load Balancing
Initial states vs. loss of parallelism
Bräunl 2004 60
(by K. Hwang)
Dynamic allocation of processes to processors during run time (dynamic load balancing). Allocated processes may be re-distributed (process migration) depending on the load situation (threshold).

Methods:
• Receiver-initiated: processors with low load request more processes. (Useful at high system load.)
• Sender-initiated: processors with too much load try to offload processes. (Useful at low system load.)
• Hybrid: the system switches between receiver- and sender-initiated methods depending on global system load.
Extended Scheduling Model
Bräunl 2004 61
Advantages and Disadvantages
+ Better processor utilization, no loss of possible parallelism
– General management overhead
– Methods kick in too late, namely when the load distribution is already seriously out of balance
– "Process migration" is an expensive operation and only makes sense for longer-running processes
o Circular "process migration" must be prevented via appropriate parallel algorithms and thresholds
– Under full parallel load, load balancing is useless!
Extended Scheduling Model
Bräunl 2004 62
3.7 MIMD Programming Languages
• Pascal derivatives
  – Concurrent Pascal (Brinch Hansen, 1977)
  – Ada (US Dept. of Defense, 1975)
  – Modula-P (Bräunl, 1986)
• C/C++ plus parallel libraries
  – Sequent C (Sequent, 1988)
  – Pthreads
  – PVM "parallel virtual machine" (Sunderam et al., 1990)
  – MPI "message passing interface" (based on CVS, MPI Forum, 1995)
• Special languages
  – CSP "Communicating Sequential Processes" (Hoare, 1978)
  – Occam (based on CSP, Inmos Ltd., 1984)
  – Linda (Carriero, Gelernter, 1986)
Bräunl 2004 63
Pthreads
• Extension to standard Unix fork() and join()
• Previously many different implementations
• Standardized as IEEE POSIX 1003.1c (1995)
• Performance:
  – Much faster than fork() (about 10x)
  – Much faster than MPI, PVM
• See: http://www.llnl.gov/computing/tutorials/pthreads/
       http://www.cs.nmsu.edu/~jcook/Tools/pthreads/library.html
Bräunl 2004 64
Pthreads source: http://www.llnl.gov/computing/tutorials/pthreads/
• In the UNIX environment a thread:
  – Exists within a process and uses the process resources
  – Has its own independent flow of control as long as its parent process exists and the OS supports it
  – May share the process resources with other threads that act equally independently (and dependently)
  – Dies if the parent process dies (or something similar)
• To the software developer, the concept of a "procedure" that runs independently from its main program may best describe a thread.
• Because threads within the same process share resources:
  – Changes made by one thread to shared system resources (such as closing a file) will be seen by all other threads.
  – Two pointers having the same value point to the same data.
  – Reading and writing to the same memory locations is possible, and therefore requires explicit synchronization by the programmer.
Bräunl 2004 65
Pthreads source: http://www.llnl.gov/computing/tutorials/pthreads/
[Figure: sequential address space]
Bräunl 2004 66
Pthreads source: http://www.llnl.gov/computing/tutorials/pthreads/
[Figure: parallel address space]
Bräunl 2004 67
Pthreads: Thread Functions
Create a new thread:
  int pthread_create (pthread_t *thread_id, const pthread_attr_t *attributes,
                      void *(*thread_function)(void *), void *arguments);

A thread terminates when its function returns, or explicitly by calling:
  void pthread_exit (void *status);

A thread can wait for the termination of another:
  int pthread_join (pthread_t thread, void **status_ptr);

A thread can read its own id:
  pthread_t pthread_self ();

It can be checked whether two threads are identical:
  int pthread_equal (pthread_t t1, pthread_t t2);
Bräunl 2004 68
Pthreads: Mutex Functions
Mutex data type: pthread_mutex_t

Init of mutex (mutual exclusion = simple binary semaphore; use NULL pointer for default attributes):
  int pthread_mutex_init (pthread_mutex_t *mut, const pthread_mutexattr_t *attr);

Lock mutex (= P(sema)):
  int pthread_mutex_lock (pthread_mutex_t *mut);

Unlock mutex (= V(sema)):
  int pthread_mutex_unlock (pthread_mutex_t *mut);

Non-blocking version of lock (either succeeds or returns EBUSY):
  int pthread_mutex_trylock (pthread_mutex_t *mut);

Deallocate mutex:
  int pthread_mutex_destroy (pthread_mutex_t *mut);
Bräunl 2004 69
Pthreads: Semaphore Functions
Semaphore data type: sem_t

Init and deallocation of a semaphore (use PTHREAD_PROCESS_PRIVATE as shared value):
  int sem_init (sem_t *sem, int pshared, unsigned int value);
  int sem_destroy (sem_t *sem);

P(sema):
  int sem_wait (sem_t *sem);
  int sem_timedwait (sem_t *sem, const struct timespec *abstime);

V(sema):
  int sem_post (sem_t *sem);
  int sem_post_multiple (sem_t *sem, int count);

Other:
  int sem_getvalue (sem_t *sem, int *sval);
Bräunl 2004 70
Pthreads: Monitor Conditions

Init condition:
  int pthread_cond_init (pthread_cond_t *cond, pthread_condattr_t *attr);

Wait version 1: standard (note: always blocks):
  int pthread_cond_wait (pthread_cond_t *cond, pthread_mutex_t *mut);

Wait version 2: with timeout:
  int pthread_cond_timedwait (pthread_cond_t *cond, pthread_mutex_t *mut,
                              const struct timespec *abstime);

Signal version 1 (note: releases 1 waiting thread):
  int pthread_cond_signal (pthread_cond_t *cond);

Signal version 2 (note: releases all waiting threads):
  int pthread_cond_broadcast (pthread_cond_t *cond);

Deallocate condition:
  int pthread_cond_destroy (pthread_cond_t *cond);
Bräunl 2004 71
Pthreads: Hello World Example

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>
  #define NUM_THREADS 5

  void *PrintHello(void *threadid)
  {
    printf("\n%d: Hello World!\n", (int)(long) threadid);
    pthread_exit(NULL);
    return 0;
  }

  int main (int argc, char *argv[])
  { pthread_t threads[NUM_THREADS];
    int rc, t;
    for (t = 0; t < NUM_THREADS; t++) {
      printf("Creating thread %d\n", t);
      rc = pthread_create(&threads[t], NULL, PrintHello, (void *)(long) t);
      if (rc) {
        printf("ERROR; return code from pthread_create() is %d\n", rc);
        exit(-1);
      }
    }
    pthread_exit(NULL);
  }
Creating thread 0Creating thread 1
0: Hello World!Creating thread 2
1: Hello World!
2: Hello World!Creating thread 3Creating thread 4
3: Hello World!
4: Hello World!
Bräunl 2004 72
MPI (Message Passing Interface)
• Based on CVS, MPI Forum (incl. hardware vendors), 1994/95
• MPI is a standard for a library of functions and macros that implements data sharing and synchronization between processes.
• Designed to be practical, portable, efficient and flexible.
• Public-domain implementations: MPICH (www.mcs.anl.gov/mpi), LAM-MPI (www.lam-mpi.org)
• Available for almost any platform; links to C, C++, FORTRAN.
• MPE routines provide profiling and graphics output.
• Debugging tools are implementation specific.
• Created to address the lack of compatibility between software utilizing vendor-specific message passing libraries.
Bräunl 2004 73
MPI (Message Passing Interface)
• Programming in well known language, with insertion of synchronisation, communication and process grouping functions (no process creation, however, like in PVM)
• Parallel processing for MIMD distributed memory (can be used on shared or hybrid systems). Unlike in PVM, code for all processes is usually contained in one executable.
• Different implementations provide different runtime environments. Some have daemons that watch over process execution (like PVM) – e.g. LAM-MPI and some do not, e.g. MPICH. Not all implementations are thread safe!
Bräunl 2004 74
MPI Programming
• One source file for all processes. Processing responsibility is distinguished via the rank (an integer uniquely assigned to each process), obtained by calling MPI_Comm_rank(). The number of processes requested by the user is obtained through MPI_Comm_size().
• MPI programs must call MPI_Init() before using any MPI calls. All programs must end with MPI_Finalize().
• MPI data type definitions are provided for basic data types (e.g. MPI_INT, MPI_DOUBLE). User data types can be defined. Packing is optional.
• Basic message passing:
  – Blocking: MPI_Send(), MPI_Recv()
  – Non-blocking: MPI_Isend(), MPI_Irecv(), followed by e.g. MPI_Wait(), MPI_Waitall() or MPI_Test()
• Collective communications: MPI_Bcast(), MPI_Barrier(), MPI_Reduce(), MPI_Scatter(), MPI_Gather(), MPI_Alltoall()
• MPI was designed to please many interests → many functions exist. Only a subset is really required for most application programs.
Bräunl 2004 75
  MPI_Send(vector, 10, MPI_INT, dest, tag, MPI_COMM_WORLD)
    vector: start of the data buffer; 10: size (number of elements); MPI_INT: data type;
    dest: destination rank; tag: message tag; MPI_COMM_WORLD: communicator (user-definable process group)

  MPI_Recv(vector, 10, MPI_INT, src, tag, MPI_COMM_WORLD, status)
    src: source rank; status: info about the received message

  MPI_Bcast(vector, 10, MPI_INT, root_rank, MPI_COMM_WORLD)
    root_rank: rank of the broadcasting process

  MPI_Barrier(communicator)
MPI Programming
Bräunl 2004 76
Initialisation/Cleanup:
  int MPI_Init(int *argc, char ***argv)
  int MPI_Finalize()

Communicator Info:
  int MPI_Comm_rank(MPI_Comm comm, int *rank)
  int MPI_Comm_size(MPI_Comm comm, int *size)

Blocking Message Passing:
  int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
  int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
  int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag,
                   void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag,
                   MPI_Comm comm, MPI_Status *status)

Non-blocking Message Passing:
  int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
  int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
  int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
  int MPI_Wait(MPI_Request *request, MPI_Status *status)
  int MPI_Waitall(int count, MPI_Request array_of_requests[], MPI_Status array_of_statuses[])
MPI Functions
Bräunl 2004 77
Collective Communication:
  int MPI_Barrier(MPI_Comm comm)
  int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
  int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
  int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
  int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
  int MPI_Scatter(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm)
  int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, MPI_Comm comm)
MPI Functions
Bräunl 2004 78
• The runtime environment is implementation specific (here: MPICH).
• A machine file containing a list of machines to be used should be created.
• Processes are started with the command mpirun. A machine file can be given as a parameter (if not, a global default is used).
• The number of processes to be started may also be specified (option -np):

  mpirun -machinefile mfile -np 2 myprog

This starts 2 instances of myprog on the machines specified in mfile.
MPI Execution
Bräunl 2004 79
  #include <stdio.h>
  #include <string.h>
  #include "mpi.h"

  int main(int argc, char **argv)
  { int myrank = -1;
    int l;
    char buf[100];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
      MPI_Recv(buf, 100, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
      printf("message from process %d : %s \n", status.MPI_SOURCE, buf);
    }
    else if (myrank == 1) {
      strcpy(buf, "Hello world from: ");
      MPI_Get_processor_name(buf + strlen(buf), &l);
      MPI_Send(buf, 100, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
  }
MPI “Hello World” (adapted from Manchek’s PVM example)
Bräunl 2004 80
3.8 Coarse-Grained Parallel Algorithms
• Synchronization with Semaphores
• Synchronization with Monitor
• Pi calculation
• Distributed simulation
Bräunl 2004 81
Synchronization with Semaphores (pthreads)

#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>
#define BUF_SIZE 5

sem_t critical, freex, used;
int buf[BUF_SIZE];
int pos;

void produce(int i)
{ sem_wait(&freex);
  sem_wait(&critical);
  if (pos >= BUF_SIZE) printf("Err\n");
  buf[pos++] = i;
  printf("write Pos: %d %d\n", pos-1, i);
  sem_post(&critical);
  sem_post(&used);
}

void consume(int *i)
{ sem_wait(&used);
  sem_wait(&critical);
  if (pos <= 0) printf("Err\n");
  *i = buf[--pos];
  printf("read Pos: %d %d \n", pos, *i);
  sem_post(&critical);
  sem_post(&freex);
}

void *prod_thread(void *pparam)
{ int i = 0;
  while (1)
  { i = (i+1) % 10;
    produce(i);
  }
}

void *cons_thread(void *pparam)
{ int i, quer = 0;
  while (1)
  { consume(&i);
    quer = (quer+i) % 10;
  }
}

int main(void)
{ pthread_t p, c;
  int i;
  sem_init(&freex, 0, BUF_SIZE);   /* pshared = 0: shared between threads of one process */
  sem_init(&used, 0, 0);
  sem_init(&critical, 0, 1);
  for (i = 0; i < BUF_SIZE; i++) buf[i] = 0;
  pos = 0;
  pthread_create(&p, NULL, prod_thread, NULL);
  pthread_create(&c, NULL, cons_thread, NULL);
  pthread_join(p, NULL);
  pthread_join(c, NULL);
}

Bräunl 2004 82
write Pos: 1 1
write Pos: 2 2
read Pos: 2 2
write Pos: 2 3
read Pos: 2 3
write Pos: 2 4
read Pos: 2 4
write Pos: 2 5
read Pos: 2 5
write Pos: 2 6
read Pos: 2 6
.....

Synchronization with Semaphores
Sample Run
Bräunl 2004 83
#include <stdio.h>
#include <pthread.h>
#define BUF_SIZE 5

pthread_cond_t xused, xfree;
pthread_mutex_t mon;
int stack[BUF_SIZE];
int pointer = 0;

void buffer_write(int a)
{ pthread_mutex_lock(&mon);
  while (pointer == BUF_SIZE)        /* 'while', not 'if': re-check after wakeup */
    pthread_cond_wait(&xfree, &mon);
  stack[pointer++] = a;
  printf("write %d %d \n", pointer-1, a);
  if (pointer == 1) pthread_cond_signal(&xused);
  pthread_mutex_unlock(&mon);
}

void buffer_read(int *a)
{ pthread_mutex_lock(&mon);
  while (pointer == 0)
    pthread_cond_wait(&xused, &mon);
  *a = stack[--pointer];
  printf("read %d %d\n", pointer, *a);
  if (pointer == BUF_SIZE-1) pthread_cond_signal(&xfree);
  pthread_mutex_unlock(&mon);
}
Synchronization with Monitor (pthreads)
Bräunl 2004 84
Synchronization with Monitor (pthreads)

void *prod_thread(void *pparam)
{ int i = 0;
  printf("Init Producer \n");
  while (1)
  { i = (i+1) % 10;
    buffer_write(i);
  }
}

void *cons_thread(void *pparam)
{ int i, quer = 0;
  printf("Init Consumer \n");
  while (1)
  { buffer_read(&i);
    quer = (quer+i) % 10;
  }
}

int main(void)
{ pthread_t p, c;
  int i;
  printf("Init... \n");
  pthread_cond_init(&xfree, NULL);
  pthread_cond_init(&xused, NULL);
  pthread_mutex_init(&mon, NULL);
  for (i = 0; i < BUF_SIZE; i++) stack[i] = 0;
  pointer = 0;
  pthread_create(&p, NULL, prod_thread, NULL);
  pthread_create(&c, NULL, cons_thread, NULL);
  pthread_join(p, NULL);
  pthread_join(c, NULL);
}
Bräunl 2004 85
write 1 1
read 1 1
write 1 2
read 1 2
write 1 3
write 2 4
read 2 4
write 2 5
read 2 5
write 2 6
read 2 6
write 2 7
read 2 7
write 2 8
read 2 8
.....

Synchronization with Monitor
Sample Run
Bräunl 2004 86
Pi Calculation

[Plot of f(x) = 4/(1+x²) over the interval [0, 1]]

π = ∫₀¹ 4/(1+x²) dx ≈ Σ_{i=1}^{intervals} 4/(1 + ((i-0.5)·width)²) · width,   with width = 1/intervals
Bräunl 2004 87
#include <stdio.h>
#include <pthread.h>
#define MAX_THREADS 10
#define INTERVALS 1000
#define WIDTH 1.0/(double)(INTERVALS)

pthread_mutex_t result_mtx;
pthread_mutex_t interval_mtx;

double f(double x) { return 4.0/(1+x*x); }

void assignment_put_result(double res)
{ static double sum = 0;
  static int answers = 0;

  pthread_mutex_lock(&result_mtx);
  sum += res;
  answers++;
  if (answers == INTERVALS) printf("Result is %f\n", sum);  /* check under lock */
  pthread_mutex_unlock(&result_mtx);
}
Pi Calculation (pthreads)
void assignment_get_interval(int *iv)
{ static int pos = 0;

  pthread_mutex_lock(&interval_mtx);
  if (++pos <= INTERVALS) *iv = pos;
  else *iv = -1;
  pthread_mutex_unlock(&interval_mtx);
}

void *worker(void *pparam)
{ int iv;
  double res;

  assignment_get_interval(&iv);
  while (iv > 0)
  { res = WIDTH * f(((double)(iv) - 0.5) * WIDTH);
    assignment_put_result(res);
    assignment_get_interval(&iv);
  }
  return 0;
}

Bräunl 2004 88
int main(void)
{ pthread_t thread[MAX_THREADS];
  int i;

  pthread_mutex_init(&interval_mtx, NULL);
  pthread_mutex_init(&result_mtx, NULL);
  for (i = 0; i < MAX_THREADS; i++)
    pthread_create(&thread[i], NULL, worker, (void*) i);
  printf("Main: threads created\n");
  for (i = 0; i < MAX_THREADS; i++)
    pthread_join(thread[i], NULL);
  pthread_mutex_destroy(&interval_mtx);
  pthread_mutex_destroy(&result_mtx);
}
Pi Calculation (pthreads)
Bräunl 2004 89
Start a number of worker threads.
Each thread gets its line number from a monitor,
then works locally in its area and stores back the results.
Distributed Simulation
Model:
• 2-dimensional field of elements (persons)
• During each time step, each person assumes the opinion of a random neighbor
Problem decomposition and assignment to processes:
Bräunl 2004 90
Distributed Simulation (pthreads)

#include <stdio.h>
#include <pthread.h>
#include <stdlib.h>
#include <time.h>
#define MAX_THREADS 5
#define LINES 40
#define COLS 60
#define GENERATIONS 10000

pthread_mutex_t crmtx;
int arr[LINES][COLS];

void monitor_get_linenumber(int *j)   /* returns current line index */
{ static int current_row = 0;
  pthread_mutex_lock(&crmtx);
  *j = current_row;
  current_row = (current_row+1) % LINES;
  pthread_mutex_unlock(&crmtx);
}

void monitor_read_line(int j, int *line)   /* returns specified line */
{ int i;
  pthread_mutex_lock(&crmtx);
  for (i = 0; i < COLS; i++) line[i] = arr[j][i];
  pthread_mutex_unlock(&crmtx);
}

void monitor_put_line(int j, int *line)   /* write back one line */
{ int i;
  pthread_mutex_lock(&crmtx);
  for (i = 0; i < COLS; i++) arr[j][i] = line[i];
  pthread_mutex_unlock(&crmtx);
}

void print_array(int cnt) { … }
Bräunl 2004 91
void *worker(void *pparam)
{ int k, j, pos, cnt;
  int line[COLS], above[COLS], below[COLS], newl[COLS];

  for (cnt = 0; cnt <= GENERATIONS; cnt++)
  { if (pparam == 0 && cnt%1000 == 0) print_array(cnt);
    monitor_get_linenumber(&j);
    monitor_read_line(j, line);
    if (j > 0) monitor_read_line(j-1, above);
    else       monitor_read_line(LINES-1, above);   /* wrap around (torus) */
    if (j < LINES-1) monitor_read_line(j+1, below);
    else             monitor_read_line(0, below);
    for (k = 0; k < COLS; k++)
    { pos = rand() % 8;          /* pick one of the 8 neighbors */
      switch (pos)
      { case 0: newl[k] = above[(k-1+COLS)%COLS]; break;
        case 1: newl[k] = above[k];               break;
        case 2: newl[k] = above[(k+1)%COLS];      break;
        case 3: newl[k] = line [(k-1+COLS)%COLS]; break;
        case 4: newl[k] = line [(k+1)%COLS];      break;
        case 5: newl[k] = below[(k-1+COLS)%COLS]; break;
        case 6: newl[k] = below[k];               break;
        case 7: newl[k] = below[(k+1)%COLS];      break;
      }
    }
    monitor_put_line(j, newl);
  }
  return 0;
}
Distributed Simulation (pthreads)

int main(void)
{ pthread_t thread[MAX_THREADS];
  int i, j;

  srand(time(NULL));
  pthread_mutex_init(&crmtx, NULL);

  /* Initialise array */
  for (j = 0; j < LINES; j++)
    for (i = 0; i < COLS; i++)
      arr[j][i] = rand() % 2;

  /* Start threads */
  for (i = 0; i < MAX_THREADS; i++)
    pthread_create(&thread[i], NULL, worker, (void *) i);

  /* Make sure all threads are finished */
  for (i = 0; i < MAX_THREADS; i++)
    pthread_join(thread[i], NULL);
  return 0;
}

Bräunl 2004 92
Result of simulation: Local clustering appears.
Random start values, after 10 steps, after 100 steps
Distributed Simulation