High Performance Computing


Course #: CSI 440/540

High Perf Sci Comp I

Fall ‘09

Mark R. Gilder
Email: [email protected]

[email protected]

CSI 440/540: This course investigates the latest trends in high-performance computing (HPC) evolution and examines key issues in developing algorithms capable of exploiting these architectures.

Grading: Your grade in the course will be based on completion of assignments (40%), course project (35%), class presentation (15%), and class participation (10%).

Course Goals

Understanding of the latest trends in HPC architecture evolution,

Appreciation for the complexities in efficiently mapping algorithms onto HPC architectures,

Familiarity with various program transformations in order to improve performance,

Hands-on experience in the design and implementation of algorithms for both shared- and distributed-memory parallel architectures using Pthreads, OpenMP, and MPI.

Experience in evaluating performance of parallel programs.


Final Projects

Monte Carlo Simulations

Image Ray Tracing

Image Processing Pipeline

Image Processing Filters

Genetic Algorithms

Pattern Matching (Bioinformatics)

N-Body Particle Simulator

CFD – Navier-Stokes Eqs.


BLAS

Lattice Gas Cellular Automata

3D FFT

Shortest Path Algorithms

NP-Hard Problems – 3-SAT, Traveling Salesman, etc.

Flocking / Emergent Behavior

Lecture 9


Outline:
◦ MPI

The following notes are based on: Introduction to Parallel Computing, 2nd Edition, by Grama, Gupta, Karypis, and Kumar, 2003, Pearson Education Limited.

Message Passing Interface (MPI) Tutorial: http://www.llnl.gov/computing/tutorials/mpi/



Send and Receive Example


#include "mpi.h"

#include <stdio.h>

int main(argc,argv)

int argc;

char *argv[]; {

int numtasks, rank, dest, source, rc, count, tag=1;

char inmsg, outmsg='x';

MPI_Status Stat;

MPI_Init(&argc,&argv);

MPI_Comm_size(MPI_COMM_WORLD, &numtasks);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

if (rank == 0) {

dest = 1; source = 1;

rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);

rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);

}

else if (rank == 1) {

dest = 0; source = 0;

rc = MPI_Recv(&inmsg, 1, MPI_CHAR, source, tag, MPI_COMM_WORLD, &Stat);

rc = MPI_Send(&outmsg, 1, MPI_CHAR, dest, tag, MPI_COMM_WORLD);

}

rc = MPI_Get_count(&Stat, MPI_CHAR, &count);

printf("Task %d: Received %d char(s) from task %d with tag %d \n",

rank, count, Stat.MPI_SOURCE, Stat.MPI_TAG);

MPI_Finalize();

}

Task-0 sends a single character to Task-1 and waits for it to be sent back.

Avoiding Deadlocks

Consider:

int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...

If MPI_Send is blocking (non-buffered), then there is a deadlock: process 0 blocks in its first send (tag 1) while process 1 blocks in its first receive (tag 2), so neither matching call is ever reached.

Mark R. Gilder 7CSI 440/540 – SUNY Albany Fall '09

int MPI_Send(void *buf, int count, MPI_Datatype datatype,
             int dest, int tag, MPI_Comm comm)

int MPI_Recv(void *buf, int count, MPI_Datatype datatype,
             int source, int tag, MPI_Comm comm,
             MPI_Status *status)

Avoiding Deadlocks

Consider the following piece of code, in which process i sends a message to process i + 1 (modulo the number of processes) and receives a message from process i - 1 (modulo the number of processes):

int a[10], b[10], nprocs, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Send(a, 10, MPI_INT, (myrank+1)%nprocs, 1, MPI_COMM_WORLD);
MPI_Recv(b, 10, MPI_INT, (myrank-1+nprocs)%nprocs, 1, MPI_COMM_WORLD, &status);
...

Once again, we have a deadlock if MPI_Send is blocking: every process blocks in its send, so no process ever reaches its receive.


Avoiding Deadlocks

We can break the circular wait by having odd-ranked processes send first and even-ranked processes receive first:

int a[10], b[10], nprocs, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank%2 == 1) {
    /* odd ranks send first, then receive */
    MPI_Send(a, 10, MPI_INT, (myrank+1)%nprocs, 1, MPI_COMM_WORLD);
    MPI_Recv(b, 10, MPI_INT, (myrank-1+nprocs)%nprocs, 1, MPI_COMM_WORLD, &status);
}
else {
    /* even ranks receive first, then send */
    MPI_Recv(b, 10, MPI_INT, (myrank-1+nprocs)%nprocs, 1, MPI_COMM_WORLD, &status);
    MPI_Send(a, 10, MPI_INT, (myrank+1)%nprocs, 1, MPI_COMM_WORLD);
}

...


Sending and Receiving Messages Simultaneously

To exchange messages, MPI provides the following function:

int MPI_Sendrecv(void *sendbuf, int sendcount,
                 MPI_Datatype senddatatype, int dest, int sendtag,
                 void *recvbuf, int recvcount,
                 MPI_Datatype recvdatatype, int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status)

The argument list combines the arguments of the send and receive functions. If we wish to use the same buffer for both send and receive, we can use:

int MPI_Sendrecv_replace(void *buf, int count,
                         MPI_Datatype datatype, int dest, int sendtag,
                         int source, int recvtag, MPI_Comm comm,
                         MPI_Status *status)
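As a minimal sketch (not part of the original slides; buffer sizes and tags follow the earlier ring example), the circular exchange can be written deadlock-free with a single MPI_Sendrecv call on every process:

int a[10], b[10], nprocs, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
/* send a to the right neighbor while receiving b from the left neighbor */
MPI_Sendrecv(a, 10, MPI_INT, (myrank+1)%nprocs, 1,
             b, 10, MPI_INT, (myrank-1+nprocs)%nprocs, 1,
             MPI_COMM_WORLD, &status);
...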


Collective Communication and Computation Operations

MPI provides an extensive set of functions for performing common collective communication operations.

Each of these operations is defined over a group corresponding to the communicator.

All processes in a communicator must call these operations.


Collective Communication Operations

Barrier
One-to-All Broadcast
Reduce
All Reduce
Scan
Scatter
Gather
All Gather
All-to-All
Reduce Scatter


Barrier Synchronization

Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call.

The group is defined by the tasks in the communicator (MPI_Comm).

The barrier synchronization operation is performed in MPI using:


int MPI_Barrier(MPI_Comm comm)
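A minimal sketch (illustrative; the timing variables are assumptions): a barrier is often used so that all processes enter a timed region together.

MPI_Barrier(MPI_COMM_WORLD);      /* no process proceeds until all have arrived */
double t_start = MPI_Wtime();     /* every process starts timing at roughly the same point */
/* ... work being measured ... */
double t_end = MPI_Wtime();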

One-to-All Broadcast

The “One-to-All” Broadcast sends a message from the process with the specified rank to all other processes in the communicator (group).

The one-to-all broadcast operation is:


int MPI_Bcast(void *buf, int count, MPI_Datatype datatype,
              int source, MPI_Comm comm)

One-to-All Broadcast Example
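An illustrative sketch (the variable n and the value 100 are assumptions, not from the original figure): rank 0 broadcasts one integer to every process in the communicator.

int myrank, n = 0;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)
    n = 100;                      /* only the source rank has the value initially */
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* after the call, every process has n == 100 */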


All-to-One Reduction

Applies a reduction operation on all tasks in the group and places the result in one task.

The all-to-one reduction operation is:


int MPI_Reduce(void *sendbuf, void *recvbuf, int count,
               MPI_Datatype datatype, MPI_Op op,
               int target, MPI_Comm comm)

Predefined Reduction Operations

Operation     Meaning                           Datatypes
MPI_MAX       Maximum                           C integers and floating point
MPI_MIN       Minimum                           C integers and floating point
MPI_SUM       Sum                               C integers and floating point
MPI_PROD      Product                           C integers and floating point
MPI_LAND      Logical AND                       C integers
MPI_BAND      Bit-wise AND                      C integers and byte
MPI_LOR       Logical OR                        C integers
MPI_BOR       Bit-wise OR                       C integers and byte
MPI_LXOR      Logical XOR                       C integers
MPI_BXOR      Bit-wise XOR                      C integers and byte
MPI_MAXLOC    Maximum value and its location    Data-pairs
MPI_MINLOC    Minimum value and its location    Data-pairs


All-to-One Reduction Example
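An illustrative sketch (the contributed values are assumptions): each process supplies one integer and the sum is delivered to rank 0.

int myrank, local, global;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
local = myrank + 1;               /* each process contributes one value */
MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
/* rank 0 now holds the sum of local over all processes; global is undefined elsewhere */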


All-to-One Reduction: MPI_MAXLOC & MPI_MINLOC Example

The operation MPI_MAXLOC combines pairs of values (v_i, p_i) and returns the pair (v, p) such that v is the maximum among all the v_i and p is the corresponding p_i (if the maximum is attained by more than one pair, p is the smallest such p_i).

MPI_MINLOC does the same, except that it returns the minimum value and its location.


All-to-One Reduction: MPI_MAXLOC & MPI_MINLOC

MPI datatypes for data-pairs used with the MPI_MAXLOC and MPI_MINLOC reduction operations.

MPI Datatype            C Datatype
MPI_2INT                pair of ints
MPI_SHORT_INT           short and int
MPI_LONG_INT            long and int
MPI_LONG_DOUBLE_INT     long double and int
MPI_FLOAT_INT           float and int
MPI_DOUBLE_INT          double and int


Another Example – MPI_MINLOC


#define LEN 1000

float val[LEN];                   /* local array of values */
int count;                        /* local number of values (assumed set elsewhere) */
int i, myrank, minrank, minindex;
int root = 0;                     /* rank that receives the result */
float minval;

struct {
    float value;
    int   index;
} in, out;

/* local minloc */
in.value = val[0];
in.index = 0;
for (i = 1; i < count; i++)
    if (in.value > val[i]) {
        in.value = val[i];
        in.index = i;
    }

/* global minloc: encode (rank, local index) as a single global index */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
in.index = myrank * LEN + in.index;
MPI_Reduce(&in, &out, 1, MPI_FLOAT_INT, MPI_MINLOC, root, MPI_COMM_WORLD);

/* at this point, the answer resides on process root */
if (myrank == root) {
    /* read answer out */
    minval   = out.value;
    minrank  = out.index / LEN;
    minindex = out.index % LEN;
}

Each process has a non-empty array of values. Find the minimum global value, the rank of the process that holds it, and its index on this process.

All-to-All Reduction

Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.

The all-to-all reduction operation is:


int MPI_Allreduce(void *sendbuf, void *recvbuf,
                  int count, MPI_Datatype datatype,
                  MPI_Op op, MPI_Comm comm)

All-to-All Reduction Example
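An illustrative sketch (the contributed values are assumptions): the same sum as before, but every process receives the result, so no separate broadcast is needed.

int myrank, local, global;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
local = myrank + 1;               /* each process contributes one value */
MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* every process now holds the same global sum */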


Scan

Performs a scan operation with respect to a reduction operation across a task group.

Basically, process i receives the reduction of the values contributed by processes 0 through i (an inclusive prefix reduction).


int MPI_Scan(void *sendbuf, void *recvbuf, int count,
             MPI_Datatype datatype, MPI_Op op,
             MPI_Comm comm)

Scan Example
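An illustrative sketch (the contributed values are assumptions): with MPI_SUM, each process receives the running (inclusive) prefix sum of the contributions from ranks 0 up to its own rank.

int myrank, local, prefix;
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
local = myrank + 1;               /* rank i contributes i + 1 */
MPI_Scan(&local, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* on rank i, prefix == 1 + 2 + ... + (i + 1) */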


Scatter

The Scatter operation distributes distinct messages from a single source task to each task in the group.

The corresponding scatter operation is:


int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, int source, MPI_Comm comm)

Equivalent to the root sending a message with MPI_Send(sendbuf, sendcount * n, sendtype, ...), where n is the number of processes in the group; the message is split into n equal segments, and the i-th segment is sent to the i-th process in the group.

The send buffer is ignored for all non-root processes.

Scatter Example
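An illustrative sketch (buffer contents and the 64-process bound are assumptions): the root fills one integer per process and each process receives its own segment.

int nprocs, myrank, i, recvval;
int sendbuf[64];                  /* assumes at most 64 processes (illustrative only) */
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)
    for (i = 0; i < nprocs; i++)
        sendbuf[i] = 10 * i;      /* the i-th segment is destined for process i */
MPI_Scatter(sendbuf, 1, MPI_INT, &recvval, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* process i now has recvval == 10 * i; sendbuf is ignored on non-root processes */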


Gather

Gathers distinct messages from each task in the group to a single destination task. This routine is the reverse operation of MPI_Scatter.

The corresponding gather operation is:


int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, int target, MPI_Comm comm)

Gather Example
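An illustrative sketch (the contributed values and the 64-process bound are assumptions): each process sends one integer and rank 0 collects them in rank order.

int nprocs, myrank, sendval;
int recvbuf[64];                  /* assumes at most 64 processes (illustrative only) */
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
sendval = myrank * myrank;        /* each process contributes one value */
MPI_Gather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* on rank 0, recvbuf[i] == i * i for i = 0 .. nprocs-1; recvbuf is ignored elsewhere */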


All-Gather

Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcasting operation within the group.

The corresponding All-Gather operation is:


int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, MPI_Comm comm)

All-Gather Example
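An illustrative sketch (the contributed values and the 64-process bound are assumptions): like the gather above, but every process ends up with the full array.

int nprocs, myrank, sendval;
int recvbuf[64];                  /* assumes at most 64 processes (illustrative only) */
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
sendval = myrank;                 /* each process contributes its own rank */
MPI_Allgather(&sendval, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
/* every process now has recvbuf[i] == i for i = 0 .. nprocs-1 */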


All-to-All

Each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group in order by index.

The corresponding All-to-All operation is:


int MPI_Alltoall(void *sendbuf, int sendcount,
                 MPI_Datatype senddatatype, void *recvbuf,
                 int recvcount, MPI_Datatype recvdatatype,
                 MPI_Comm comm)

All-to-All Example
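An illustrative sketch (the values and the 64-process bound are assumptions): every process sends one distinct integer to every process; element j of recvbuf on process i is the value that process j addressed to process i.

int nprocs, myrank, i;
int sendbuf[64], recvbuf[64];     /* assumes at most 64 processes (illustrative only) */
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
for (i = 0; i < nprocs; i++)
    sendbuf[i] = 100 * myrank + i;   /* element i is destined for process i */
MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
/* on process i, recvbuf[j] == 100 * j + i */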


Reduce-Scatter

First perform an element-wise reduction on a vector across all tasks in the group. Next, the result vector is split into disjoint segments and distributed across the tasks. This is equivalent to an MPI_Reduce followed by an MPI_Scatter operation.

The corresponding Reduce-Scatter operation is:


int MPI_Reduce_scatter(void *sendbuf, void *recvbuf,
                       int *recvcounts, MPI_Datatype datatype,
                       MPI_Op op, MPI_Comm comm)

Note that recvcounts is an array with one entry per process; entry i gives the number of result elements delivered to process i.

Reduce-Scatter Example
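An illustrative sketch (the values and the 64-process bound are assumptions): each process contributes a vector of nprocs integers; the element-wise sums are formed and process i receives element i of the reduced vector.

int nprocs, myrank, i, result;
int sendbuf[64], recvcounts[64];  /* assumes at most 64 processes (illustrative only) */
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
for (i = 0; i < nprocs; i++) {
    sendbuf[i] = myrank + i;      /* this process's contribution to element i */
    recvcounts[i] = 1;            /* each process gets one element of the result */
}
MPI_Reduce_scatter(sendbuf, &result, recvcounts, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* process i now holds the sum over all ranks r of (r + i) */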


Collective Communication Operations

Barrier
One-to-All Broadcast
Reduce
All Reduce
Scan
Scatter
Gather
All Gather
All-to-All
Reduce Scatter

Using this core set of collective operations, a number of programs can be greatly simplified.