Glasgow Pascal Compiler for the Xeon Phi
Glasgow Pascal
● Implements Turner’s Vector Pascal.
● SIMD auto-parallelisation
● MIMD auto-parallelisation
● We have implemented it on Intel
machines, including the Xeon Phi, and
earlier on the Sony Cell
● We are using it in the Clopema cloth-manipulating
robot project
Stereo Matcher
● We want an implementation that is hardware-independent, so
that we can upgrade the platform to a 60-core Intel Xeon Phi
card
● That requires us to harness both SIMD and multi-core
parallelism
● We have used Vector Pascal, a parallel array dialect of Pascal,
for this, since it auto-parallelises across SIMD lanes and multiple cores
● It can be linked to ROS using a harness in Python (that is
how our current matcher works with ROS)
listing from file matchlib.pas
+---A 'P' at the start of a line indicates the line has been SIMD parallelised
|+--An 'M' at the start of a line indicates the line has been multi-core parallelised
||
Vv
219 procedure computepolynomialinterpolation(var f_1,a,f1:plane; var x:image;layer:integer);
220 (*! this performs polynomial interpolation and updates all positions
221 in x[layer] where there is a maximum of the polynomial *)
223 var b,c:^plane;
225 BEGIN
...
246 PM b^:= (f1-f_1)/2;
247 PM c^:= f1 - (a+b^);
264 M x[layer]:= if c^<0 then (-b^ *0.5)/c^ else 0.0;
265 PM x[layer]:= (1.0 min(-1.0 max x[layer])) ;
267 x[confidenceplane]:= if c^<epsilon then
268 x[confidenceplane]*(a+b^*x[layer]+c^*x[layer]*x[layer])
269 M else 0.4;
...
272 END;
The parallel interpolation function
• We show this just as an example of the sort of parallelism we use.
• Variables are images represented as 2-D arrays of reals.
• We operate on whole arrays at a time, as in MATLAB.
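To make the whole-array style concrete, here is a small sketch in plain Python (not Vector Pascal) of what an assignment such as `b^ := (f1 - f_1)/2` from the listing means: one source statement applies elementwise to every pixel of the image planes, with no explicit loops in the source. The `plane` and `elementwise` helpers are illustrative names, not part of the compiler.

```python
# Sketch of Vector Pascal's whole-array style, emulated in Python.
# A "plane" is a 2-D array of reals, here a list of rows of floats.

def plane(rows, cols, value=0.0):
    """Create a rows x cols plane filled with value."""
    return [[value] * cols for _ in range(rows)]

def elementwise(op, a, b):
    """Apply a binary op to two planes, pixel by pixel."""
    return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

f1 = plane(2, 3, 5.0)
f_1 = plane(2, 3, 1.0)

# b := (f1 - f_1) / 2  -- one statement, applied to the whole plane
b = elementwise(lambda x, y: (x - y) / 2.0, f1, f_1)
print(b)  # [[2.0, 2.0, 2.0], [2.0, 2.0, 2.0]]
```

In Vector Pascal the compiler, not the programmer, decides how to split such a statement across SIMD lanes and cores.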
Xeon Phi features
• 60 cores
• Each core is a basic Pentium
• Additional SIMD features
• Whole chip runs Linux
• Talks to host using IP over the PCI bus
• 6 GB memory on the board
• Coherent level-2 cache
Vector processor on each core
• Vector registers hold 16 × 32-bit floats
• Vector-to-vector arithmetic
• K registers allow you to mask the write to a particular destination
• K registers are set by comparisons
Multi-core parallelism
• If a statement is a two-dimensional array assignment and it
uses only basic arithmetic operations or pure (side-effect-free)
functions on the right-hand side,
• then the work on rows of the array is interleaved between the
n processors, so that processor j handles all rows i
such that i mod n = j.
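The interleaving rule above can be sketched in Python (a hedged emulation, not the compiler's actual runtime): each of n workers handles exactly the rows i with i mod n = j, here doing some pure elementwise work on its rows. The helper names are invented for illustration.

```python
# Sketch of the row-interleaving rule: with n processors,
# processor j handles all rows i such that i mod n == j.
import threading

def rows_for(j, n, height):
    """Rows assigned to processor j out of n."""
    return [i for i in range(height) if i % n == j]

def process(image, n):
    """Run n workers, each doubling the pixels of its own rows."""
    out = [None] * len(image)
    def worker(j):
        for i in rows_for(j, n, len(image)):
            out[i] = [v * 2.0 for v in image[i]]  # pure elementwise work
    threads = [threading.Thread(target=worker, args=(j,)) for j in range(n)]
    for t in threads: t.start()
    for t in threads: t.join()
    return out

img = [[float(i)] * 4 for i in range(8)]
print(rows_for(1, 3, 8))   # processor 1 of 3 gets rows [1, 4, 7]
print(process(img, 3)[5])  # [10.0, 10.0, 10.0, 10.0]
```

Because the right-hand side is side-effect free, no worker's result depends on another's, so the rows can be processed in any order.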
SIMD
• If a statement is an array assignment and the right hand side
contains no function calls and operates on adjacent array
elements,
• then the compiler generates SIMD code.
Parallel IF-expressions
• If the expression on the right is a conditional one with no
function calls
• Then it is evaluated using boolean masks to allow SIMD
execution.
Example of parallel IF-expression
; #substituting in k7 with 2 occurences and score of 17.0
vloadunpacklps ZMM0, [A]
vloadunpackhps ZMM0, [A+64]
vloadunpacklps ZMM1, [B]
vloadunpackhps ZMM1, [B+64]
vcmpps k7, ZMM0, ZMM1, 1
; #substituting in ZMM7 with 2 occurences and score of 6.0
vloadunpacklps ZMM7, [A]
vloadunpackhps ZMM7, [A+64]
vloadunpacklps ZMM0, [B]
vloadunpackhps ZMM0, [B+64]
vmulps ZMM0, ZMM0, ZMM7
vblendmps ZMM0 {k7}, ZMM0, ZMM7
vpackstorelps [C], ZMM0
vpackstorehps [C+64], ZMM0
program compvec;
var A,B,C: array[1..16] of real;
begin
   A := iota[0];
   B := 7;
   C := if A < B then A else A*B;
   writeln(a:5, b:5, c:5);
end.
The k-register bits select between ZMM0 and ZMM7
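The semantics of the masked evaluation can be sketched in Python (an emulation of the technique, not the generated code): both arms of the conditional are computed for every lane, a boolean mask plays the role of the k register set by the compare, and a per-lane blend selects the result, so there is no branch. This mirrors the statement `C := if A < B then A else A*B;` from the compvec program.

```python
# Sketch of mask-based evaluation of a parallel IF-expression
# over 16 lanes, mirroring:  C := if A < B then A else A*B;

A = [float(i) for i in range(16)]   # A := iota[0]
B = [7.0] * 16                      # B := 7

k = [a < b for a, b in zip(A, B)]            # compare -> mask (k7)
then_arm = A[:]                              # arm 1: A (computed for all lanes)
else_arm = [a * b for a, b in zip(A, B)]     # arm 2: A*B (also all lanes)
C = [t if m else e                           # per-lane blend (vblendmps)
     for m, t, e in zip(k, then_arm, else_arm)]

print(C[:8])  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 49.0]
```

Evaluating both arms is wasted work per lane, but it keeps all 16 lanes in lockstep, which is what makes SIMD execution of the conditional possible.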
Sandy Bridge performance of matcher
• Time against floating-point channels.
• Assume 8 channels per core using AVX, no hyperthreading; for x87, 1 channel per core.
• Power laws: AVX time ∝ fpus^-0.68; x87 time ∝ fpus^-0.66.
Xeon Phi compilation of matcher
[Figure: log-log plot of time in seconds against number of threads (1 to 100) for the colour matcher, showing scalar time, SIMD time and SIMD opt3 busy-wait, with fitted power laws y = 5055.7x^-0.557, y = 872.35x^-0.654 and y = 850.7x^-0.726 respectively.]
• This shows colour matcher performance as a function of threads and of other options.
• Note this is 16 megapixels, which is bigger than the last example.
Gather instructions
• The new instruction set allows a vector register to be loaded with
data spread around memory, i.e. at non-contiguous locations.
• vgather zmm0, [r12+zmm1]
• r12 is a base register pointing at an array.
• zmm1 holds a vector of offsets from which ZMM0 is to be loaded.
• This cannot be handled by the current design of the code-generation
strategy.
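A minimal Python sketch of what the gather does (an emulation of the semantics, not the instruction itself): lane i of the destination is loaded from base + offsets[i], so the source data need not be contiguous. For clarity the offsets here index floats directly rather than bytes.

```python
# Sketch of gather semantics: vgather zmm0, [r12+zmm1]
# loads each lane of the destination from a different,
# possibly non-contiguous, location.

def gather(memory, base, offsets):
    """Lane i of the result comes from memory[base + offsets[i]]."""
    return [memory[base + o] for o in offsets]

mem = [float(i * 10) for i in range(32)]   # the array r12 points at
offsets = [0, 5, 3, 12]                    # contents of the offset register
print(gather(mem, 4, offsets))  # [40.0, 90.0, 70.0, 160.0]
```

A contiguous vector load is the special case offsets = [0, 1, 2, …]; it is the arbitrary offset vector that the existing code-generation strategy cannot express.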
More sophisticated pattern-directed rules needed
[Diagram: vectorisation rewrite rules. The scalar index ι vectorises to the replicated index Rep(ι,16) = [0,1,2,3,…,15], and a scalar array access (ieee32)Mem[b+ι*4] vectorises to the vector array access (ieee32 vector(16))Mem[b+ι*4].]
Gather version
[Diagram: in the gather version, an access Mem[b:<base> + j:<genindexp>] vectorises to (ieee32 vec(16))Mem[Rep(b,16) + Vectorize(j)], and any scalar X:<scalar> vectorises to Rep(X,16).]
Approach from now on
• Extend the grammar of machine-description patterns to allow
machine-specific transformation rules on intermediate-code
trees.
• Extend the code-generator generator to automatically build a tree-transformer
pass that is applied to the code tree before the instruction-selection
phase.
• Allow named transformation rules to be called by the Java front
end of the compiler.
• A lot of design of the specification language is needed for this.
Jacobi solver performance: Phi versus 6-core Xeon
[Figure: log-log plot of floating-point ops/sec against number of threads (1 to 100) for the Jacobi solver, comparing the host (n=2k) with the MIC under four settings: n=2k, n=2k busy-wait no-affinity, n=2k no-affinity, and n=2k opt3 no-affinity busy-wait.]
• This is a test of an equation
solver solving the equation
v = Av + m
for a matrix A which is 2000
square, using 300 iterations of
relaxation, with the same source code
on both machines.
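The relaxation being benchmarked can be sketched in plain Python (a toy 3×3 system standing in for the 2000-square matrix; the `matvec` and `jacobi` names are illustrative): repeatedly apply v ← Av + m, so that the iterate approaches a fixed point satisfying v = Av + m. The matrix must be contracting for the iteration to converge.

```python
# Sketch of the relaxation v <- A v + m, iterated towards the
# fixed point v = A v + m.

def matvec(A, v):
    """Dense matrix-vector product."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def relax(A, m, iterations):
    """Iterate v <- A v + m starting from v = 0."""
    v = [0.0] * len(m)
    for _ in range(iterations):
        v = [av + mi for av, mi in zip(matvec(A, v), m)]
    return v

A = [[0.0, 0.1, 0.1],
     [0.1, 0.0, 0.1],
     [0.1, 0.1, 0.0]]
m = [1.0, 1.0, 1.0]
v = relax(A, m, 300)
# Fixed point: v = 0.2*v + 1 per component, so v converges to 1.25.
print([round(x, 6) for x in v])  # [1.25, 1.25, 1.25]
```

Each iteration is a matrix-vector product plus a vector add, i.e. whole-array statements of exactly the kind the compiler parallelises across SIMD lanes and cores, which is why the same source code runs unchanged on both machines.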