High Performance Computing
Course #: CSI 440/540
High Perf Sci Comp I
Fall '09
Mark R. Gilder
Email: [email protected]
CSI 440/540
This course investigates the latest trends in high-performance computing (HPC) evolution and examines key issues in developing algorithms capable of exploiting these architectures.
Grading: Your grade in the course will be based on completion of assignments (40%), course project (35%), class presentation (15%), and class participation (10%).
Course Goals
Understanding of the latest trends in HPC architecture evolution,
Appreciation for the complexities in efficiently mapping algorithms onto HPC architectures,
Familiarity with various program transformations in order to improve performance,
Hands-on experience in design and implementation of algorithms for both shared & distributed memory parallel architectures using Pthreads, OpenMP, and MPI.
Experience in evaluating performance of parallel programs.
CSI 440/540 – SUNY Albany Fall '09, Mark R. Gilder
Final Projects
Monte Carlo Simulations
Image Ray Tracing
Image Processing Pipeline
Image Processing Filters
Genetic Algorithms
Pattern Matching (Bioinformatics)
N-Body Particle Simulator
CFD – Navier-Stokes Eqs.
BLAS
Lattice Gas Cellular Automata
3D FFT
Shortest Path Algorithms
NP-Hard Problems – 3-SAT, Traveling Salesman, etc.
Flocking / Emergent Behavior
Lecture 9
Outline:
◦ MPI
The following notes are based on: Introduction To Parallel Computing 2nd Edition
by Grama, Gupta, Karypis, Kumar, 2003, Pearson Education Limited.
Message Passing Interface (MPI) Tutorial: http://www.llnl.gov/computing/tutorials/mpi/
Send and Receive Example
#include "mpi.h"
#include <stdio.h>

/* Run with at least two tasks. */
int main(int argc, char *argv[]) {
    int numtasks, rank, dest, source, rc, count, tag = 1;
    char inmsg, outmsg = 'x';
    MPI_Status Stat;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numtasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Task 0 sends first, then waits for the reply. */
        dest = 1; source = 1;
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
    }
    else if (rank == 1) {
        /* Task 1 receives first, then replies. */
        dest = 0; source = 0;
        rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);
        rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);
    }

    /* Only tasks 0 and 1 received a message, so only they hold a valid Stat. */
    if (rank < 2) {
        rc = MPI_Get_count(&Stat, MPI_CHAR, &count);
        printf("Task %d: Received %d char(s) from task %d with tag %d \n",
               rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}
Task 0 sends a single character to task 1 and then waits for task 1 to send a character back.
Avoiding Deadlocks
Consider:
int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...
If MPI_Send is blocking (non-buffered), there is a deadlock: process 0 waits in its first send (tag 1) for a matching receive, while process 1 waits in its first receive for the message with tag 2, which process 0 never gets to send.
int MPI_Send(void *buf, int count, MPI_Datatype datatype,
int dest, int tag, MPI_Comm comm)
int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
int source, int tag, MPI_Comm comm,
MPI_Status *status)
Avoiding Deadlocks
Consider the following piece of code, in which process i sends a message to process i + 1 (modulo the number of processes) and receives a message from process i - 1 (modulo the number of processes).
int a[10], b[10], nprocs, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Send(a, 10, MPI_INT, (myrank+1)%nprocs, 1,
MPI_COMM_WORLD);
MPI_Recv(b, 10, MPI_INT, (myrank-1+nprocs)%nprocs, 1,
MPI_COMM_WORLD, &status);
...
Once again, we have a deadlock if MPI_Send is blocking.
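The neighbor arithmetic in the code above can be isolated in a short sketch. The helper names ring_next and ring_prev are hypothetical (they are not MPI functions), and the code makes no MPI calls; it only illustrates why nprocs is added before taking the remainder.

```c
#include <assert.h>

/* Hypothetical helpers (not part of MPI): ring neighbors of `rank`
   in a communicator of `nprocs` processes. */
int ring_next(int rank, int nprocs)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    return (rank + 1) % nprocs;
}

int ring_prev(int rank, int nprocs)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    /* Adding nprocs keeps the left operand non-negative. In C99 and
       later, (0 - 1) % nprocs is -1 when nprocs > 1, which is not a
       valid rank. */
    return (rank - 1 + nprocs) % nprocs;
}
```

With four processes, ring_next(3, 4) gives 0 and ring_prev(0, 4) gives 3, so every process has exactly one predecessor and one successor. That closed cycle is what allows the circular wait.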
Avoiding Deadlocks
We can break the circular wait to avoid deadlocks as follows:
int a[10], b[10], nprocs, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank%2 == 1) {
    /* Odd ranks send first, then receive. */
    MPI_Send(a, 10, MPI_INT, (myrank+1)%nprocs, 1,
             MPI_COMM_WORLD);
    MPI_Recv(b, 10, MPI_INT, (myrank-1+nprocs)%nprocs, 1,
             MPI_COMM_WORLD, &status);
}
else {
    /* Even ranks receive first, then send. */
    MPI_Recv(b, 10, MPI_INT, (myrank-1+nprocs)%nprocs, 1,
             MPI_COMM_WORLD, &status);
    MPI_Send(a, 10, MPI_INT, (myrank+1)%nprocs, 1,
             MPI_COMM_WORLD);
}
...
Sending and Receiving Messages Simultaneously
To exchange messages, MPI provides the following function:
int MPI_Sendrecv(void *sendbuf, int sendcount,
MPI_Datatype senddatatype, int dest, int
sendtag, void *recvbuf, int recvcount,
MPI_Datatype recvdatatype, int source, int recvtag,
MPI_Comm comm, MPI_Status *status)
The arguments include arguments to the send and receive functions. If we wish to use the same buffer for both send and receive, we can use:
int MPI_Sendrecv_replace(void *buf, int count,
MPI_Datatype datatype, int dest, int sendtag,
int source, int recvtag, MPI_Comm comm,
MPI_Status *status)
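As a rough illustration of the effect of MPI_Sendrecv_replace on a ring, the sketch below models it serially in plain C. It makes no MPI calls; the function name ring_shift_model is hypothetical, and the array vals stands in for the one-int buffers held by the individual ranks.

```c
#include <assert.h>

/* Serial model (no MPI calls): vals[r] stands for the one-int buffer x
   held by rank r. After the call, vals[r] holds what rank r would
   receive from its predecessor if every rank executed
     MPI_Sendrecv_replace(&x, 1, MPI_INT, next, 1, prev, 1,
                          MPI_COMM_WORLD, &status);
   with next = (r + 1) % nprocs and prev = (r - 1 + nprocs) % nprocs. */
void ring_shift_model(int *vals, int nprocs)
{
    assert(nprocs > 0);
    int last = vals[nprocs - 1];          /* wraps around to rank 0 */
    for (int r = nprocs - 1; r > 0; r--)
        vals[r] = vals[r - 1];            /* rank r receives from rank r-1 */
    vals[0] = last;
}
```

Starting from {10, 20, 30, 40} on four ranks, the model leaves {40, 10, 20, 30}. Because send and receive are paired in one call, no rank has to choose between sending first and receiving first.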
Collective Communication and Computation Operations
MPI provides an extensive set of functions for performing common collective communication operations.
Each of these operations is defined over a group corresponding to the communicator.
All processes in a communicator must call these operations.
Collective Communication Operations
Barrier
One-to-All Broadcast
Reduce
All Reduce
Scan
Scatter
Gather
All Gather
All-to-All
Reduce Scatter
Barrier Synchronization
Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call.
The group is defined by the tasks in the communicator (MPI_Comm).
The barrier synchronization operation is performed in MPI using:
int MPI_Barrier(MPI_Comm comm)
One-to-All Broadcast
The “One-to-All” Broadcast sends a message from the process with the specified rank to all other processes in the communicator (group).
The one-to-all broadcast operation is:
int MPI_Bcast(void *buf, int count, MPI_Datatype datatype,
int source, MPI_Comm comm)
All-to-One Reduction
Applies a reduction operation on all tasks in the group and places the result in one task.
The all-to-one reduction operation is:
int MPI_Reduce(void *sendbuf, void *recvbuf, int count,
MPI_Datatype datatype, MPI_Op op,
int target, MPI_Comm comm)
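One point that is easy to miss is that, when count is greater than 1, the reduction is applied element by element across the processes, not across the elements of one buffer. The following serial sketch models MPI_Reduce with MPI_SUM under that reading. It makes no MPI calls, and the name reduce_sum_model and the row-by-row layout of send are assumptions made for illustration.

```c
#include <assert.h>

/* Serial model (no MPI calls) of MPI_Reduce with MPI_SUM on ints.
   send is an nprocs x count array stored row by row: row r is the
   send buffer of rank r. recv (count ints) is what the target rank
   would hold afterwards. */
void reduce_sum_model(const int *send, int nprocs, int count, int *recv)
{
    assert(nprocs > 0 && count >= 0);
    for (int j = 0; j < count; j++) {
        recv[j] = 0;
        for (int r = 0; r < nprocs; r++)
            recv[j] += send[r * count + j];   /* element j across all ranks */
    }
}
```

With two ranks holding {1, 2, 3} and {10, 20, 30}, the target ends up with {11, 22, 33}, a vector of the same length as each send buffer.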
Predefined Reduction Operations
Operation    Meaning                          Datatypes
MPI_MAX      Maximum                          C integers and floating point
MPI_MIN      Minimum                          C integers and floating point
MPI_SUM      Sum                              C integers and floating point
MPI_PROD     Product                          C integers and floating point
MPI_LAND     Logical AND                      C integers
MPI_BAND     Bit-wise AND                     C integers and byte
MPI_LOR      Logical OR                       C integers
MPI_BOR      Bit-wise OR                      C integers and byte
MPI_LXOR     Logical XOR                      C integers
MPI_BXOR     Bit-wise XOR                     C integers and byte
MPI_MAXLOC   Maximum value and its location   Data-pairs
MPI_MINLOC   Minimum value and its location   Data-pairs
All-to-One Reduction
MPI_MAXLOC & MPI_MINLOC Example
The operation MPI_MAXLOC combines pairs of values (v_i, p_i) and returns the pair (v, p) such that v is the maximum among all the v_i and p is the corresponding p_i (if more than one pair attains the maximum, p is the smallest of their p_i).
MPI_MINLOC does the same, except that v is the minimum among all the v_i.
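The tie-breaking rule can be written out as a small combining function. This is a serial sketch in plain C, not the library's implementation. The struct dbl_int and the function names are hypothetical, chosen to mirror the double-and-int pairing of MPI_DOUBLE_INT.

```c
#include <assert.h>

/* Value-location pair, in the spirit of MPI_DOUBLE_INT. */
struct dbl_int { double value; int index; };

/* Hypothetical helper modelling how MPI_MAXLOC combines two pairs. */
struct dbl_int maxloc_combine(struct dbl_int a, struct dbl_int b)
{
    if (a.value > b.value) return a;
    if (b.value > a.value) return b;
    /* Equal values: keep the smaller index. */
    return (a.index <= b.index) ? a : b;
}

/* Fold nprocs pairs (one per rank) into the single result pair. */
struct dbl_int maxloc_model(const struct dbl_int *pairs, int nprocs)
{
    assert(nprocs > 0);
    struct dbl_int acc = pairs[0];
    for (int r = 1; r < nprocs; r++)
        acc = maxloc_combine(acc, pairs[r]);
    return acc;
}
```

For the pairs (3.0, 0), (7.5, 1), (7.5, 2), (1.0, 3), the model returns (7.5, 1): the maximum value occurs twice and the smaller index wins.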
All-to-One Reduction
MPI_MAXLOC & MPI_MINLOC
MPI datatypes for data-pairs used with the MPI_MAXLOC and MPI_MINLOC reduction operations.
MPI Datatype          C Datatype
MPI_2INT              pair of ints
MPI_SHORT_INT         short and int
MPI_LONG_INT          long and int
MPI_LONG_DOUBLE_INT   long double and int
MPI_FLOAT_INT         float and int
MPI_DOUBLE_INT        double and int
Another Example – MPI_MINLOC
#define LEN 1000
float val[LEN];   /* local array of values */
int count;        /* local number of values */
int i, myrank, minrank, minindex;
float minval;
struct {
    float value;
    int index;
} in, out;
/* root and comm are assumed to be defined elsewhere,
   with comm being MPI_COMM_WORLD */
/* local minloc */
in.value = val[0];
in.index = 0;
for (i = 1; i < count; i++)
    if (in.value > val[i]) {
        in.value = val[i];
        in.index = i;
    }
/* global minloc: encode the owning rank into the index */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
in.index = myrank*LEN + in.index;
MPI_Reduce(&in, &out, 1, MPI_FLOAT_INT, MPI_MINLOC, root, comm);
/* At this point, the answer resides on process root */
if (myrank == root) {
    /* read answer out */
    minval = out.value;
    minrank = out.index / LEN;
    minindex = out.index % LEN;
}
Each process has a non-empty array of values. The code finds the global minimum value, the rank of the process that holds it, and its index on that process.
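The example packs the owning rank and the local index into a single int so that one MPI_FLOAT_INT pair can carry both. A minimal sketch of that packing, with hypothetical helper names and the same LEN as the example, might look like this:

```c
#include <assert.h>

#define LEN 1000   /* same per-process capacity as in the example */

/* Hypothetical helpers: pack (rank, local index) into one int and back. */
int encode_index(int rank, int local_index)
{
    assert(rank >= 0 && local_index >= 0 && local_index < LEN);
    return rank * LEN + local_index;
}

int decode_rank(int global_index)  { return global_index / LEN; }
int decode_local(int global_index) { return global_index % LEN; }
```

For instance, local index 42 on rank 3 is encoded as 3042 and decodes back to rank 3, index 42. The scheme relies on every local index being smaller than LEN, and on rank*LEN fitting in an int.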
All-to-All Reduction
Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.
The all-to-all reduction operation is:
int MPI_Allreduce(void *sendbuf, void *recvbuf,
int count, MPI_Datatype datatype,
MPI_Op op, MPI_Comm comm)
Scan
Performs a scan operation with respect to a reduction operation across a task group.
Process i receives the reduction of the values contributed by processes 0 through i (inclusive), that is, a partial reduction up to its own rank.
int MPI_Scan(void *sendbuf, void *recvbuf, int count,
MPI_Datatype datatype, MPI_Op op,
MPI_Comm comm)
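To make the result of a scan concrete, the sketch below models MPI_Scan with MPI_SUM and count = 1 as an inclusive prefix sum. It is plain serial C with no MPI calls, and the name scan_sum_model is hypothetical.

```c
#include <assert.h>

/* Serial model (no MPI calls) of MPI_Scan with MPI_SUM and count = 1.
   send[r] is the value contributed by rank r; recv[r] is what rank r
   would receive: send[0] + ... + send[r]. */
void scan_sum_model(const int *send, int nprocs, int *recv)
{
    assert(nprocs > 0);
    int running = 0;
    for (int r = 0; r < nprocs; r++) {
        running += send[r];   /* include rank r's own contribution */
        recv[r] = running;
    }
}
```

If four ranks contribute 3, 1, 4, 1, they receive 3, 4, 8, 9 respectively. The last rank therefore holds the same value a full reduction would produce.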
Scatter
The Scatter operation distributes distinct messages from a single source task to each task in the group.
The corresponding scatter operation is:
int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, int source, MPI_Comm comm)
Equivalent to having the root send one message with MPI_Send(sendbuf, sendcount*n, senddatatype, ...), where n is the number of processes in the group. This message is split into n equal segments, and the ith segment is sent to the ith process in the group.
The send buffer is ignored for all non-root processes.
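The segment arithmetic can be sketched in plain C. The function below models what a single rank would find in its receive buffer after MPI_Scatter on ints. It makes no MPI calls, and the name scatter_recv_model is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Serial model (no MPI calls) of what one rank receives from
   MPI_Scatter when the data are ints. sendbuf is the root's buffer of
   n*sendcount ints; the segment for `rank` starts at element
   rank*sendcount and is sendcount elements long. */
void scatter_recv_model(const int *sendbuf, int sendcount, int rank,
                        int *recvbuf)
{
    assert(sendcount >= 0 && rank >= 0);
    memcpy(recvbuf, sendbuf + (size_t)rank * sendcount,
           (size_t)sendcount * sizeof(int));
}
```

With a root buffer {0, 1, 2, 3, 4, 5} and sendcount 2, rank 1 receives {2, 3} and rank 2 receives {4, 5}. Note that sendcount is the size of one segment, not of the whole buffer.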
Gather
Gathers distinct messages from each task in the group to a single destination task. This routine is the reverse operation of MPI_Scatter.
The corresponding gather operation is:
int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, int target, MPI_Comm comm)
All-Gather
Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcasting operation within the group.
The corresponding All-Gather operation is:
int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, MPI_Comm comm)
All-to-All
Each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group in order by index.
The corresponding All-to-All operation is:
int MPI_Alltoall(void *sendbuf, int sendcount,
MPI_Datatype senddatatype, void *recvbuf,
int recvcount, MPI_Datatype recvdatatype,
MPI_Comm comm)
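One way to picture the all-to-all exchange is as a transpose of the data across processes. The serial sketch below models MPI_Alltoall with sendcount = recvcount = 1 on ints. It makes no MPI calls, and the name alltoall_model and the row-by-row layout are assumptions made for illustration.

```c
#include <assert.h>

/* Serial model (no MPI calls) of MPI_Alltoall with one int per pair
   of ranks. send and recv are n x n arrays stored row by row; row r
   is the buffer of rank r. Element i of rank j's send buffer ends up
   as element j of rank i's receive buffer. */
void alltoall_model(const int *send, int n, int *recv)
{
    assert(n > 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            recv[i * n + j] = send[j * n + i];
}
```

With three ranks sending {1, 2, 3}, {4, 5, 6} and {7, 8, 9}, the receive buffers are {1, 4, 7}, {2, 5, 8} and {3, 6, 9}.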
Reduce-Scatter
First perform an element-wise reduction on a vector across all tasks in the group. Next, the result vector is split into disjoint segments and distributed across the tasks. This is equivalent to an MPI_Reduce followed by an MPI_Scatter operation.
The corresponding Reduce-Scatter operation is:
int MPI_Reduce_scatter(void *sendbuf,
void *recvbuf, int *recvcounts,
MPI_Datatype datatype, MPI_Op op,
MPI_Comm comm)
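In the MPI standard the third argument of MPI_Reduce_scatter is an array of per-process counts (recvcounts), so the segments need not be equal in size. The sketch below models the operation serially for MPI_SUM on ints. It makes no MPI calls, and the function name and the row-by-row layout of send are hypothetical.

```c
#include <assert.h>

/* Serial model (no MPI calls) of MPI_Reduce_scatter with MPI_SUM.
   send is nprocs x total, stored row by row, where row r is rank r's
   send buffer and total is the sum of all recvcounts. The function
   writes into recv the recvcounts[rank] ints that `rank` would get. */
void reduce_scatter_sum_model(const int *send, int nprocs,
                              const int *recvcounts, int rank, int *recv)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    int total = 0, offset = 0;
    for (int r = 0; r < nprocs; r++) {
        if (r < rank)
            offset += recvcounts[r];   /* start of this rank's segment */
        total += recvcounts[r];
    }
    for (int k = 0; k < recvcounts[rank]; k++) {
        int sum = 0;
        for (int r = 0; r < nprocs; r++)
            sum += send[r * total + offset + k];   /* element-wise sum */
        recv[k] = sum;
    }
}
```

With two ranks sending {1, 2, 3} and {10, 20, 30} and recvcounts = {1, 2}, the reduced vector is {11, 22, 33}; rank 0 receives {11} and rank 1 receives {22, 33}.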