Hands-On - CCAD-UNC
Scripting: Python
One example of a scripting language: Python
Mature
Large and active community
Emphasizes readability
Written in widely-portable C
A ‘multi-paradigm’ language
Rich ecosystem of sci-comp related software
Andreas Klöckner, PyCUDA: Even Simpler GPU Programming with Python
GPU Scripting !
The future is now: always use the right tool...
Goals
• Get your feet wet
• Understand PyCUDA’s different levels of abstraction
• Taste the power of Python and its batteries
• DIY Flavor
Interactive Session
$ salloc -n1 --gres=gpu:1 srun --pty --preserve-env $SHELL
$ module load cuda/5.0
$ module load interpreters/python-2.7.4
$ module load compilers/gcc/4.6.4
$ ipython
Python 2.7.4 (default, Apr 24 2013, 19:56:24)
IPython 0.13.2 -- An enhanced Interactive Python.
In [1]:
Interactive Session (*)
$ srun -p interactive -n1 -E --pty $SHELL
$ export CUDA_VISIBLE_DEVICES=`expr $SLURM_JOBID % 2`
$ module load cuda/5.0
$ module load interpreters/python-2.7.4
$ module load compilers/gcc/4.6.4
$ ipython
Python 2.7.4 (default, Apr 24 2013, 19:56:24)
IPython 0.13.2 -- An enhanced Interactive Python.
In [1]:
IPython
Enhanced interactive shell for Python
• additional syntax (e.g. %timeit %logstart)
• introspection
• tab completion
• rich history
• better debugging
• parallel computing (!) http://goo.gl/M7ftK
Interactive Session
$ srun -p interactive -n1 -E --pty $SHELL
$ export CUDA_VISIBLE_DEVICES=$(expr $SLURM_JOBID % 2)
$ module load cuda/5.0
$ module load interpreters/python-2.7.4
$ module load compilers/gcc/4.6.4
$ ipython
Python 2.7.4 (default, Apr 24 2013, 19:56:24)
IPython 0.13.2 -- An enhanced Interactive Python.
In [1]: import pycuda.autoinit
gpuarray: Simple Linear Algebra
pycuda.gpuarray: Meant to look and feel just like numpy.
gpuarray.to_gpu(numpy_array)
numpy_array = gpuarray.get()
No: nd indexing, slicing, etc. (yet!)
Yes: +, -, *, /, fill, sin, exp, rand, take, . . .
Random numbers using pycuda.curandom
Mixed types (int32 + float32 = float64)
print gpuarray for debugging.
Memory behind gpuarray available as .gpudata attribute.
Use as kernel arguments, textures, etc.
Nicolas Pinto (MIT) and Andreas Klöckner (Brown), PyCuda Tutorial
slide by Andreas Klöckner (NYU)
Exercise: timing gpuarray
$ ipython
import pycuda.autoinit
from pycuda import gpuarray
from pycuda.curandom import rand as curand
import numpy as np

n = 10
a_cpu = np.random.randn(n).astype('float32')
b_cpu = np.random.randn(n).astype('float32')
a_gpu = curand(n)
b_gpu = curand(n)

a_gpu * b_gpu
%timeit a_gpu * b_gpu
# ...
DIY: gpuarray
• Exercise:
Accelerate tanh(x) using pycuda.cumath
from pycuda import cumath
cumath??
http://documen.tician.de/pycuda/array.html
http://documen.tician.de/pycuda/array.html#module-pycuda.cumath
DIY: ElementwiseKernel
http://documen.tician.de/pycuda/array.html
http://documen.tician.de/pycuda/array.html#module-pycuda.elementwise
• gpuarray vs. ElementwiseKernel ?
• Exercise:
Accelerate (a*x+b*y)/(w*z)
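When trying this exercise, a CPU reference makes it easy to check whatever the GPU version returns. A minimal sketch in plain Python, assuming a and b are scalars and x, y, w, z are equal-length sequences (the slide does not say which operands are arrays); the function name axpby_over_wz is hypothetical:

```python
def axpby_over_wz(a, x, b, y, w, z):
    """CPU reference for (a*x + b*y) / (w*z), applied element by element."""
    if not (len(x) == len(y) == len(w) == len(z)):
        raise ValueError("x, y, w and z must have the same length")
    return [(a * xi + b * yi) / (wi * zi)
            for xi, yi, wi, zi in zip(x, y, w, z)]


# Small hand-checkable input: (2*1 + 3*4) / (1*2) and (2*2 + 3*5) / (2*4).
reference = axpby_over_wz(2.0, [1.0, 2.0], 3.0, [4.0, 5.0],
                          [1.0, 2.0], [2.0, 4.0])
print(reference)
```

Comparing the array fetched back from the GPU against such a reference (within a float32 tolerance) is a quick way to catch indexing mistakes in a hand-written kernel.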
DIY: ReductionKernel
http://documen.tician.de/pycuda/array.html
http://documen.tician.de/pycuda/array.html#module-pycuda.reduction
• gpuarray.sum() vs ReductionKernel?
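One way to frame the comparison: a reduction kernel combines an elementwise map step with the reduction itself, so an operation such as a dot product needs no intermediate array, whereas multiplying two arrays and then summing stores the product first. The CPU-only sketch below illustrates that map-then-reduce structure; map_reduce is a hypothetical helper written for this example, not a PyCUDA function.

```python
from functools import reduce


def map_reduce(neutral, reduce_fn, map_fn, *arrays):
    """Apply map_fn at every index, then fold the results with reduce_fn."""
    mapped = (map_fn(*items) for items in zip(*arrays))
    return reduce(reduce_fn, mapped, neutral)


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

# Plain sum: the map step is the identity.
total = map_reduce(0.0, lambda a, b: a + b, lambda xi: xi, x)

# Dot product: multiply in the map step, add in the reduce step.
dot = map_reduce(0.0, lambda a, b: a + b, lambda xi, yi: xi * yi, x, y)
print(total, dot)
```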
CUDA C-to-Py: using SourceModule
1. Interactive session using IPython
2. Two errors?
3. Solution: $ cp /home/npinto/public/simple-pycuda1.py .
CUDA C-to-Py: using ElementwiseKernel
1. Interactive session using IPython
2. Solution: $ cp /home/npinto/public/simple-pycuda2.py .
CUDA C-to-Py: using gpuarray
1. Interactive session using IPython
2. Solution: $ cp /home/npinto/public/simple-pycuda3.py .
CUDA C-to-Py: comparison
$ cat ./simple-pycuda1.py
$ cat ./simple-pycuda2.py
$ cat ./simple-pycuda3.py
Exercise: matmul (simple)
void MatrixMultiplication(float* M, float* N, float* P, int Width)
{
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      float sum = 0;
      for (int k = 0; k < Width; ++k) {
        float a = M[i * Width + k];
        float b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}
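[Figure: matrices M, N and P, each WIDTH x WIDTH; row i of M and column j of N are traversed with index k to produce the P element at (i, j).]
FIGURE 3.4
A simple matrix multiplication function with only host code.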
M0,0 M1,0 M2,0 M3,0
M0,1 M1,1 M2,1 M3,1
M0,2 M1,2 M2,2 M3,2
M0,3 M1,3 M2,3 M3,3
M (linearized): M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3
FIGURE 3.5
Placement of two-dimensional array elements into the linear address system memory.
$ cp -a /home/npinto/public/matrixmul_simple.py .
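Before moving the loop nest to the GPU, it can help to run the same algorithm on the CPU as a reference. The sketch below mirrors the host-only function of Figure 3.4 in plain Python, with each matrix stored as a flat row-major list so that element (i, j) lives at index i * width + j; the name matmul_flat and the test matrices are illustrative only.

```python
def matmul_flat(M, N, width):
    """Multiply two width x width matrices stored as flat row-major lists."""
    P = [0.0] * (width * width)
    for i in range(width):          # row of M and P
        for j in range(width):      # column of N and P
            total = 0.0
            for k in range(width):
                total += M[i * width + k] * N[k * width + j]
            P[i * width + j] = total
    return P


M = [1.0, 2.0,
     3.0, 4.0]
identity = [1.0, 0.0,
            0.0, 1.0]
print(matmul_flat(M, identity, 2))
print(matmul_flat(M, M, 2))
```

In a straightforward GPU port, the two outer loops are what become the thread indices, and the inner loop over k stays inside each thread.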
Exercise: matmul (tiled)
$ cp -a /home/npinto/public/matrixmul_tiled.py .
x index of its Pd element as (bx*TILE_WIDTH + tx) and the y index as (by*TILE_WIDTH + ty). That is, thread (tx, ty) in block (bx, by) is to use row (by*TILE_WIDTH + ty) of Md and column (bx*TILE_WIDTH + tx) of Nd to calculate the Pd element at column (bx*TILE_WIDTH + tx) and row (by*TILE_WIDTH + ty).
Figure 4.4 shows a small example of using multiple blocks to calculate Pd. For simplicity, we use a very small TILE_WIDTH value (2) so we can fit the entire example in one picture. The Pd matrix is now divided into 4 tiles. Each dimension of Pd is now divided into sections of 2 elements. Each block needs to calculate 4 Pd elements. We can do so by creating blocks that are organized into 2×2 arrays of threads, with each thread calculating
[Figure: matrices Md, Nd and Pd, each WIDTH x WIDTH; Pd is split into TILE_WIDTH x TILE_WIDTH tiles (Pdsub), indexed by block (bx, by) and thread (tx, ty).]
FIGURE 4.3
Matrix multiplication using multiple blocks by tiling Pd.
one Pd element. In the example, thread (0, 0) of block (0, 0) calculates Pd0,0, whereas thread (0, 0) of block (1, 0) calculates Pd2,0. It is easy to verify that one can identify the Pd element calculated by thread (0, 0) of block (1, 0) with the formula given above: Pd[bx*TILE_WIDTH + tx][by*TILE_WIDTH + ty] = Pd[1*2 + 0][0*2 + 0] = Pd[2][0]. The reader should work through the index derivation for as many threads as it takes to become comfortable with the concept.
Once we have identified the indices for the Pd element to be calculated by a thread, we also have identified the row (y) index of Md and the column (x) index of Nd for input values. As shown in Figure 4.3, the row index of Md used by thread (tx, ty) of block (bx, by) is (by*TILE_WIDTH + ty). The column index of Nd used by the same thread is (bx*TILE_WIDTH + tx). We are now ready to revise the kernel of Figure 3.11 into a version that uses multiple blocks to calculate Pd.
Figure 4.5 illustrates the multiplication actions in each thread block. For the small matrix multiplication, threads in block (0, 0) produce four dot products: Thread (0, 0) generates Pd0,0 by calculating the dot product of row 0 of Md and column 0 of Nd. Thread (1, 0) generates Pd1,0 by calculating the dot product of row 0 of Md and column 1 of Nd. The arrows of Pd0,0, Pd1,0, Pd0,1, and Pd1,1 show the row and column used for generating their result value.
Figure 4.6 shows a revised matrix multiplication kernel function that uses multiple blocks. In Figure 4.6, each thread uses its blockIdx and threadIdx values to identify the row index (Row) and the column index
Block(0,0)        Block(1,0)
Pd0,0  Pd1,0   |  Pd2,0  Pd3,0
Pd0,1  Pd1,1   |  Pd2,1  Pd3,1
Block(0,1)        Block(1,1)
Pd0,2  Pd1,2   |  Pd2,2  Pd3,2
Pd0,3  Pd1,3   |  Pd2,3  Pd3,3
TILE_WIDTH = 2
FIGURE 4.4
A simplified example of using multiple blocks to calculate Pd.
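The index formula from the excerpt can be checked by enumerating every thread of the small example on the CPU. This is a sketch under the excerpt's own numbers (WIDTH = 4, TILE_WIDTH = 2); pd_index and owners are hypothetical names introduced here, not part of the exercise files.

```python
TILE_WIDTH = 2
WIDTH = 4


def pd_index(bx, by, tx, ty, tile_width=TILE_WIDTH):
    """Return the (x, y) index of the Pd element owned by one thread."""
    return (bx * tile_width + tx, by * tile_width + ty)


blocks_per_side = WIDTH // TILE_WIDTH
owners = {}
for by in range(blocks_per_side):
    for bx in range(blocks_per_side):
        for ty in range(TILE_WIDTH):
            for tx in range(TILE_WIDTH):
                # Record which (block, thread) pair computes this Pd element.
                owners[pd_index(bx, by, tx, ty)] = ((bx, by), (tx, ty))

print(pd_index(1, 0, 0, 0))  # thread (0, 0) of block (1, 0)
print(len(owners))           # number of distinct Pd elements covered
```

Every one of the 16 Pd elements appears exactly once as a key, which is the property the tiling has to guarantee: no element is skipped and none is computed by two threads.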
Basic GPU Meta-programming System
GPU Meta-Programming: A Case Study
in Biologically-Inspired Machine Vision
[GPU Computing Gems]
Pinto N, Cox DD
Exercise: matmul (meta)
http://wiki.tiker.net/PyCuda/Examples/DemoMetaMatrixmulCheetah
$ cp -a /home/npinto/public/pycuda-examples/wiki-examples/DemoMetaMatrixmulCheetah.py .
$ cp -a /home/npinto/public/pycuda-examples/wiki-examples/demo_meta_matrixmul_cheetah.template.cu .
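The idea behind the meta exercise is that kernel source is just a string, so Python can generate a specialized variant per configuration before compiling it. The wiki example uses the Cheetah template engine; the sketch below swaps in the standard library's string.Template so that it runs without extra packages. The kernel, the placeholder names and render_kernel are all hypothetical and only illustrate the pattern.

```python
from string import Template

# Hypothetical kernel template; $block_size and $dtype are filled in at run time.
KERNEL_TEMPLATE = Template("""
#define BLOCK_SIZE $block_size

__global__ void scale(const $dtype *x, $dtype *out, $dtype factor, int n)
{
    int i = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    if (i < n)
        out[i] = factor * x[i];
}
""")


def render_kernel(block_size, dtype="float"):
    """Return CUDA C source specialized for one block size and element type."""
    return KERNEL_TEMPLATE.substitute(block_size=block_size, dtype=dtype)


# One source string per candidate configuration, e.g. as input to a tuning loop.
variants = {bs: render_kernel(bs) for bs in (64, 128, 256)}
print(variants[128])
```

Each rendered string could then be handed to pycuda.compiler.SourceModule and timed, with the launch block size matching the value baked into the source.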
Smart GPU Auto Tuning ?
GPU Meta-Programming: A Case Study
in Biologically-Inspired Machine Vision
[GPU Computing Gems]
Pinto N, Cox DD