GPU Computing with PyCUDA: Hands-On

WHPC'13, Cordoba | May 8th, 2013

Nicolas Pinto [email protected]

Still doing HPC the old way?



Scripting: Python

One example of a scripting language: Python

Mature

Large and active community

Emphasizes readability

Written in widely-portable C

A ‘multi-paradigm’ language

Rich ecosystem of sci-comp related software

slide from Andreas Klöckner, "PyCUDA: Even Simpler GPU Programming with Python"

GPU Scripting!

The future is now: always use the right tool...

Goals

• Get your feet wet

• Understand PyCUDA’s different levels of abstraction

• Taste the power of Python and its batteries

• DIY Flavor

Interactive Session

$ salloc -n1 --gres=gpu:1 srun --pty --preserve-env $SHELL

$ module load cuda/5.0
$ module load interpreters/python-2.7.4
$ module load compilers/gcc/4.6.4

$ ipython

Python 2.7.4 (default, Apr 24 2013, 19:56:24)
IPython 0.13.2 -- An enhanced Interactive Python.
In [1]:

Interactive Session (*)

$ srun -p interactive -n1 -E --pty $SHELL
$ export CUDA_VISIBLE_DEVICES=`expr $SLURM_JOBID % 2`

$ module load cuda/5.0
$ module load interpreters/python-2.7.4
$ module load compilers/gcc/4.6.4

$ ipython

Python 2.7.4 (default, Apr 24 2013, 19:56:24)
IPython 0.13.2 -- An enhanced Interactive Python.
In [1]:

IPython

Enhanced interactive shell for Python

• additional syntax (e.g. %timeit, %logstart)

• introspection

• tab completion

• rich history

• better debugging

• parallel computing (!) http://goo.gl/M7ftK
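For instance, a quick taste of %timeit and introspection (a minimal sketch; any module works):

In [1]: import numpy as np
In [2]: %timeit np.arange(10**6).sum()   # micro-benchmark a statement
In [3]: np.arange?                       # introspection: docstring
In [4]: np.arange??                      # source too, when available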

Interactive Session

$ srun -p interactive -n1 -E --pty $SHELL
$ export CUDA_VISIBLE_DEVICES=$(expr $SLURM_JOBID % 2)

$ module load cuda/5.0
$ module load interpreters/python-2.7.4
$ module load compilers/gcc/4.6.4

$ ipython

Python 2.7.4 (default, Apr 24 2013, 19:56:24)
IPython 0.13.2 -- An enhanced Interactive Python.
In [1]: import pycuda.autoinit


gpuarray: Simple Linear Algebra

pycuda.gpuarray: Meant to look and feel just like numpy.

gpuarray.to_gpu(numpy_array)

numpy_array = gpuarray.get()

No: nd indexing, slicing, etc. (yet!)

Yes: +, -, *, /, fill, sin, exp, rand, take, ...

Random numbers using pycuda.curandom

Mixed types (int32 + float32 = float64)

print gpuarray for debugging.

Memory behind gpuarray available as .gpudata attribute.

Use as kernel arguments, textures, etc.

Nicolas Pinto (MIT) and Andreas Klöckner (Brown), PyCuda Tutorial; slide by Andreas Klöckner (NYU)
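Concretely, the round-trip described above looks like this (a minimal sketch, assuming a working PyCUDA install and one free GPU):

import numpy as np
import pycuda.autoinit
from pycuda import gpuarray

a = np.random.randn(4, 4).astype(np.float32)
a_gpu = gpuarray.to_gpu(a)   # host -> device
b_gpu = 2 * a_gpu + 1        # elementwise arithmetic runs on the GPU
print b_gpu                  # print gpuarray for debugging
b = b_gpu.get()              # device -> host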

Exercise: timing gpuarray

$ ipython

import pycuda.autoinit
from pycuda import gpuarray
from pycuda.curandom import rand as curand
import numpy as np

n = 10
a_cpu = np.random.randn(n).astype('float32')
b_cpu = np.random.randn(n).astype('float32')
a_gpu = curand(n)
b_gpu = curand(n)

a_gpu * b_gpu
%timeit a_gpu * b_gpu
# ...

Exercise: timing gpuarray

• "a_cpu * b_cpu" faster than "a_gpu * b_gpu"?

• Why?

• How?

DIY: gpuarray

• Exercise:

Accelerate tanh(x) using pycuda.cumath

from pycuda import cumath

cumath??

http://documen.tician.de/pycuda/array.html

http://documen.tician.de/pycuda/array.html#module-pycuda.cumath
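One possible starting point for the tanh exercise (a sketch, not the only solution; the array size is an arbitrary choice):

import numpy as np
import pycuda.autoinit
from pycuda import cumath
from pycuda.curandom import rand as curand

n = 10**7
x_gpu = curand(n)             # random float32 values, already on the GPU
y_gpu = cumath.tanh(x_gpu)    # elementwise tanh, no host round-trip

x_cpu = x_gpu.get()
%timeit np.tanh(x_cpu)        # CPU baseline (IPython)
%timeit cumath.tanh(x_gpu)    # GPU version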

DIY: ElementwiseKernel

http://documen.tician.de/pycuda/array.html

http://documen.tician.de/pycuda/array.html#module-pycuda.elementwise

• gpuarray vs. ElementwiseKernel ?

• Exercise:

Accelerate (a*x+b*y)/(w*z)
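A hedged sketch for this exercise, treating a, x, b, y, w, z all as arrays (the kernel name is an arbitrary choice); one fused kernel replaces the several launches and temporaries that the plain gpuarray expression (a*x + b*y) / (w*z) would need:

import pycuda.autoinit
from pycuda import gpuarray
from pycuda.curandom import rand as curand
from pycuda.elementwise import ElementwiseKernel

combine = ElementwiseKernel(
    "float *a, float *x, float *b, float *y, float *w, float *z, float *out",
    "out[i] = (a[i]*x[i] + b[i]*y[i]) / (w[i]*z[i])",
    "combine_kernel")

n = 10**6
a, x, b, y, w, z = [curand(n) for _ in range(6)]
out = gpuarray.empty_like(x)
combine(a, x, b, y, w, z, out)   # one kernel launch, no temporaries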

DIY: ReductionKernel

http://documen.tician.de/pycuda/array.html

http://documen.tician.de/pycuda/array.html#module-pycuda.reduction

• gpuarray.sum() vs ReductionKernel?
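As a point of comparison, a dot product as one fused map+reduce (close to the example in the PyCUDA docs; gpuarray.sum() would instead need a separate elementwise multiply first):

import numpy as np
import pycuda.autoinit
from pycuda.curandom import rand as curand
from pycuda.reduction import ReductionKernel

dot = ReductionKernel(np.float32, neutral="0",
                      reduce_expr="a+b", map_expr="x[i]*y[i]",
                      arguments="float *x, float *y")

n = 10**6
x, y = curand(n), curand(n)
print dot(x, y).get()   # single-element array, copied back to the host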

Hack: show-me-the-code

1h00 / 1h30

From CUDA/C to PyCUDA

$ cp -a /home/npinto/public/simple.cu .
$ nvcc -run ./simple.cu

low-level (driver API)

CUDA C-to-Py: using SourceModule

1. Interactive session using IPython

2. Two errors?

3. Solution:

$ cp /home/npinto/public/simple-pycuda1.py .
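A hedged sketch of the SourceModule path (not necessarily the contents of simple-pycuda1.py; the kernel here just doubles an array):

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void double_it(float *a)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    a[idx] *= 2.0f;
}
""")
double_it = mod.get_function("double_it")

a = np.random.randn(256).astype(np.float32)
double_it(drv.InOut(a), block=(256, 1, 1), grid=(1, 1))  # copy in, run, copy out
print a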

CUDA C-to-Py: using ElementwiseKernel

1. Interactive session using IPython

2. Solution:

$ cp /home/npinto/public/simple-pycuda2.py .

CUDA C-to-Py: using gpuarray

1. Interactive session using IPython

2. Solution:

$ cp /home/npinto/public/simple-pycuda3.py .

CUDA C-to-Py: comparison

$ cat ./simple-pycuda1.py

$ cat ./simple-pycuda2.py

$ cat ./simple-pycuda3.py

simple-pycuda1.py

simple-pycuda2.py

simple-pycuda3.py

DIY: PyCUDA Examples

$ cp -a /home/npinto/public/pycuda-examples .

2h00 / 2h30

Exercise: matmul (simple)

void MatrixMultiplication(float* M, float* N, float* P, int Width)
{
    for (int i = 0; i < Width; ++i) {
        for (int j = 0; j < Width; ++j) {
            float sum = 0;
            for (int k = 0; k < Width; ++k) {
                float a = M[i * Width + k];
                float b = N[k * Width + j];
                sum += a * b;
            }
            P[i * Width + j] = sum;
        }
    }
}

FIGURE 3.4: A simple matrix multiplication function with only host code.

FIGURE 3.5: Placement of two-dimensional array elements into the linear address system memory.

$ cp -a /home/npinto/public/matrixmul_simple.py .

Exercise: matmul (simple)

$ cp -a /home/npinto/public/matrixmul_simple-sol.py .
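If you get stuck first, the naive GPU version can look roughly like this (a sketch with one thread per P element and a single block, so it assumes Width <= 32; not necessarily what the solution file does):

import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

mod = SourceModule("""
__global__ void MatrixMulKernel(float *M, float *N, float *P, int Width)
{
    int i = threadIdx.y;   // row of P
    int j = threadIdx.x;   // column of P
    float sum = 0;
    for (int k = 0; k < Width; ++k)
        sum += M[i * Width + k] * N[k * Width + j];
    P[i * Width + j] = sum;
}
""")
matrixmul = mod.get_function("MatrixMulKernel")

Width = 16
M = np.random.randn(Width, Width).astype(np.float32)
N = np.random.randn(Width, Width).astype(np.float32)
P = np.empty_like(M)
matrixmul(drv.In(M), drv.In(N), drv.Out(P), np.int32(Width),
          block=(Width, Width, 1), grid=(1, 1))
print np.allclose(P, np.dot(M, N), atol=1e-4)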

Exercise: matmul (tiled)

$ cp -a /home/npinto/public/matrixmul_tiled.py .

x index of its Pd element as (bx*TILE_WIDTH + tx) and the y index as (by*TILE_WIDTH + ty). That is, thread (tx, ty) in block (bx, by) is to use row (by*TILE_WIDTH + ty) of Md and column (bx*TILE_WIDTH + tx) of Nd to calculate the Pd element at column (bx*TILE_WIDTH + tx) and row (by*TILE_WIDTH + ty).

Figure 4.4 shows a small example of using multiple blocks to calculate Pd. For simplicity, we use a very small TILE_WIDTH value (2) so we can fit the entire example in one picture. The Pd matrix is now divided into 4 tiles. Each dimension of Pd is now divided into sections of 2 elements. Each block needs to calculate 4 Pd elements. We can do so by creating blocks that are organized into 2×2 arrays of threads, with each thread calculating one Pd element.

FIGURE 4.3: Matrix multiplication using multiple blocks by tiling Pd.

In the example, thread (0, 0) of block (0, 0) calculates Pd0,0, whereas thread (0, 0) of block (1, 0) calculates Pd2,0. It is easy to verify that one can identify the Pd element calculated by thread (0, 0) of block (1, 0) with the formula given above: Pd[bx*TILE_WIDTH + tx][by*TILE_WIDTH + ty] = Pd[1*2 + 0][0*2 + 0] = Pd[2][0]. The reader should work through the index derivation for as many threads as it takes to become comfortable with the concept.

Once we have identified the indices for the Pd element to be calculated by a thread, we also have identified the row (y) index of Md and the column (x) index of Nd for input values. As shown in Figure 4.3, the row index of Md used by thread (tx, ty) of block (bx, by) is (by*TILE_WIDTH + ty). The column index of Nd used by the same thread is (bx*TILE_WIDTH + tx). We are now ready to revise the kernel of Figure 3.11 into a version that uses multiple blocks to calculate Pd.

Figure 4.5 illustrates the multiplication actions in each thread block. For the small matrix multiplication, threads in block (0, 0) produce four dot products: Thread (0, 0) generates Pd0,0 by calculating the dot product of row 0 of Md and column 0 of Nd. Thread (1, 0) generates Pd1,0 by calculating the dot product of row 0 of Md and column 1 of Nd. The arrows of Pd0,0, Pd1,0, Pd0,1, and Pd1,1 show the row and column used for generating their result value.

Figure 4.6 shows a revised matrix multiplication kernel function that uses multiple blocks. In Figure 4.6, each thread uses its blockIdx and threadIdx values to identify the row index (Row) and the column index (Col) of the Pd element it is responsible for.

FIGURE 4.4: A simplified example of using multiple blocks to calculate Pd (TILE_WIDTH = 2).

Exercise: matmul (tiled)

$ cp -a /home/npinto/public/matrixmul_tiled.py .
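The heart of the tiled version is the index logic derived above; a hedged sketch of the kernel source (TILE_WIDTH = 2 to match Figure 4.4, Width assumed to be a multiple of TILE_WIDTH; matrixmul_tiled.py may differ in detail):

kernel_source = """
#define TILE_WIDTH 2

__global__ void MatrixMulTiled(float *Md, float *Nd, float *Pd, int Width)
{
    __shared__ float Ms[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Ns[TILE_WIDTH][TILE_WIDTH];

    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;   // by*TILE_WIDTH + ty
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;   // bx*TILE_WIDTH + tx

    float sum = 0;
    for (int m = 0; m < Width / TILE_WIDTH; ++m) {
        // cooperatively load one tile of Md and one tile of Nd into shared memory
        Ms[threadIdx.y][threadIdx.x] = Md[Row * Width + m * TILE_WIDTH + threadIdx.x];
        Ns[threadIdx.y][threadIdx.x] = Nd[(m * TILE_WIDTH + threadIdx.y) * Width + Col];
        __syncthreads();
        for (int k = 0; k < TILE_WIDTH; ++k)
            sum += Ms[threadIdx.y][k] * Ns[k][threadIdx.x];
        __syncthreads();
    }
    Pd[Row * Width + Col] = sum;
}
"""
# compile with SourceModule(kernel_source) and launch with
# block=(TILE_WIDTH, TILE_WIDTH, 1), grid=(Width/TILE_WIDTH, Width/TILE_WIDTH)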

Basic GPU Meta-programming System

GPU Meta-Programming: A Case Study

in Biologically-Inspired Machine Vision

[GPU Computing Gems]

Pinto N, Cox DD

Exercise: matmul (meta)

http://wiki.tiker.net/PyCuda/Examples/DemoMetaMatrixmulCheetah

$ cp -a /home/npinto/public/pycuda-examples/wiki-examples/DemoMetaMatrixmulCheetah.py .

$ cp -a /home/npinto/public/pycuda-examples/wiki-examples/demo_meta_matrixmul_cheetah.template.cu .
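The idea in miniature: generate the CUDA source from a template at run time, then compile with SourceModule. The wiki example uses a Cheetah template (the .template.cu file above); plain Python string substitution stands in for it here, and all names below are illustrative:

import pycuda.autoinit
from pycuda.compiler import SourceModule

template = """
__global__ void scale(float *a)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    a[idx] *= %(factor)ff;   // constant baked into the source at build time
}
"""

def build_scale_kernel(factor):
    # run-time code generation (RTCG): one specialized kernel per parameter value
    mod = SourceModule(template % {"factor": factor})
    return mod.get_function("scale")

scale_by_3 = build_scale_kernel(3.0)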

Smart GPU Auto-Tuning?

GPU Meta-Programming: A Case Study

in Biologically-Inspired Machine Vision

[GPU Computing Gems]

Pinto N, Cox DD

Intelligent and fast Auto-Tuning with Machine Learning

Stay tuned...

3h00


Back pocket slides