
BSMBench: a flexible and scalable supercomputer benchmark from computational particle physics

Ed Bennett 1a, Luigi Del Debbio 2b, Kirk Jordan 3c, Biagio Lucini 4a, Agostino Patella 5d,e, Claudio Pica 6f, and Antonio Rago 7e

a Department of Physics, Swansea University, Singleton Park, Swansea, SA2 8PP UK
b The Higgs Centre for Theoretical Physics, University of Edinburgh, Edinburgh, EH9 3JZ UK

c IBM Research, Thomas J Watson Research Center, 1 Rogers Street, Cambridge, MA 02142 USA
d PH-TH, CERN, CH-1211 Geneva 23, Switzerland

e School of Computing and Mathematics & Centre for Mathematical Science, Plymouth University, Drake Circus, Plymouth, PL4 8AA UK

f CP3-Origins & the Danish IAS, University of Southern Denmark, Odense, Denmark

January 16, 2014

CP3-Origins-2014-001 DNRF90 DIAS-2014-1

Abstract

Benchmarking plays a central role in the evaluation of High Performance Computing architectures. Several benchmarks have been designed that allow users to stress various components of supercomputers. In order for the figures they provide to be useful, benchmarks need to be representative of the most common real-world scenarios. In this work, we introduce BSMBench, a benchmarking suite derived from Monte Carlo code used in computational particle physics. The advantage of this suite (which can be freely downloaded from http://www.bsmbench.org) over others is the capacity to vary the relative importance of computation and communication. This enables the tests to simulate various practical situations. To showcase BSMBench, we perform a wide range of tests on various architectures, from desktop computers to state-of-the-art supercomputers, and discuss the corresponding results. Possible future directions of development of the benchmark are also outlined.

1 Introduction

Evaluating the performance of different architectures is an important aspect of High-Performance Computing (HPC). This evaluation is most often performed using benchmark applications, which conduct simulated workloads of known complexity, so their execution times may be compared between machines. Historically, the High-Performance LINPACK (HPL) [1] benchmark has been the “gold standard” of HPC benchmarks, with the Top500 Supercomputing Sites [2] (listing the machines with the top HPL scores) being widely used as a proxy for the “fastest” computers in the world.

1 [email protected]
2 [email protected]
3 [email protected]
4 [email protected]
5 [email protected]
6 [email protected]
7 [email protected]

Despite remaining the universal standard for HPC benchmarking, HPL has some shortcomings that are starting to cast serious doubts about its current and future usefulness in assessing the real capability of a given machine for realistic workloads. From an algorithmic point of view, HPL is a rather simple benchmark. This has led to machines’ architectures (and compilers) being tweaked to optimise their performance in the HPL benchmark. Until recently this has not been a problem, with workloads’ requirements correlating roughly with those of HPL. However, as highlighted in a recent paper by Dongarra and Heroux [3], as we look towards machines capable of tackling exascale problems, the demands of workloads have begun to diverge from those of HPL, to the point that design decisions made to improve HPL performance are detrimental to the machines’ performance in real-world tasks.

Additionally, the time taken to complete a run of HPL has increased with machine size, with current top-end machines requiring over 24 hours; this is expensive, in terms of both highly-constrained machine time and energy usage, leading to single runs being performed, rather than multiple runs to confirm consistency as might be made with a shorter-running tool. Dongarra and Heroux propose a new HPCG benchmark based on Conjugate Gradient (CG) matrix inversion with preconditioning to supplement HPL as a reference benchmark for more realistic workloads.1

In this paper, we propose a new benchmark based on code from Beyond the Standard Model (BSM) Lattice Gauge Theory (LGT), which fulfils many of the criteria laid out by Dongarra and Heroux for the HPCG benchmark. Algorithmically, our benchmark can be seen as a generalisation of performance tests derived from a branch of theoretical particle physics known as Quantum ChromoDynamics (QCD) on the Lattice (or Lattice QCD). For this family of scalable benchmarks, characterised by simultaneously high demands in terms of both computation and communication, there is a wide literature. The advantage of our benchmark over standard Lattice QCD-based ones is the ability to tune the ratio of communication to computational demands of the algorithm, by altering the physical theory being considered. This is an important feature that enables us to change the benchmark requirements in order to simulate various realistic workloads; no existing benchmarks based on Lattice QCD have this ability.

The rest of the paper is organised as follows. To place our results in context, in Section 2, a brief account of the approach to benchmarking underlying our work is given. In Section 3, an overview of the physical context from which the code has originated is provided. Section 4 describes the structure of the benchmark code and the benchmarking strategy it employs. Section 5 shows the results we have found when running the benchmark on a selection of systems, including IBM Blue Gene machines, x86-based clusters, and a desktop workstation. In Section 6 we discuss some of the interesting features of the data found, and present suggestions for future work. Finally, Appendix A gives some technical details of the theoretical and computational calculations performed in BSM Lattice physics that are relevant to the benchmark.

2 Benchmark requirements

Before discussing our benchmark and the physical problem from which it was spawned, it is worth setting the scene by specifying our requirements for the benchmarking tool and—as a consequence—the scope of the results discussed in this work. Various components of a supercomputer (or, in fact, of any computer) can be benchmarked, either (at least ideally) in isolation, or in reciprocal interaction. In our case, we are interested in the speed of execution of a complex numerical problem (i.e. a problem requiring a high number of floating point operations) in a scenario in which graphics rendering does not play any role and disk access times are negligible. Hence, the main parameter we will focus on is the number of operations that the CPU can execute, which is influenced both by the speed of the CPU and the speed of access to memory. In addition, since the problem can be distributed across interlinked CPUs that do not necessarily share the same physical memory, the speed of communication of data from one computational unit to another and the latencies of the interconnections are also important. Applications with those requirements are commonly found in physics, engineering and finance, to mention only a few disciplines in which High Performance Computing plays a key role, and the results we shall discuss are relevant in assessing the performance of a supercomputing architecture for this class of problems. On the other hand, by design our benchmark will not capture the performance of a system for applications that are—for instance—disk-intensive, nor can it help in assessing the energy efficiency of a machine.

1 Other supplementary benchmarks have been introduced in recent years to assess performance in graph-traversal applications [4] and energy efficiency [5]; however, neither of these is targeted at realistic compute-intensive workloads, and they will not be discussed in this work.

A benchmark that is able to assess the performance (as defined above) of an HPC system should also fulfil additional criteria of reliability, practicality, and generality, among which are the following:

• It should be highly portable.

• It should give the correct mathematical answer on the tested architecture.

• It should use code that is representative of the main body of the computation being benchmarked.

• It should give output that can be read by people who are not knowledgeable in the area of the benchmark, so that system builders can run the benchmark internally and understand its output.

• It should be sufficiently large that the machine in question can demonstrate its power (i.e. not so short that initialisation times dominate, or so small that only a minor portion of a machine can be used).


• It should run in reasonable time (30 minutes – 1 hour is ideal, although due to its age and design LINPACK can take over 24 hours to run on the largest current machines; there is active research into how to reduce this time whilst still maintaining a comparable output).

• It should be reliable and repeatable (if one machine is rated as having better performance than another, it should always do so for that metric, and repeated runs of a test should give the same score).

Following these main guiding principles, the benchmark discussed in this paper, BSMBench, has been designed and developed to be a robust, reliable, scalable and flexible HPC performance assessment tool that can be used to gather information for a wide variety of systems in a wide range of appropriate workloads.

3 Lattice Gauge Theory and HPC benchmarking

In order to understand the properties of the algorithm we are proposing and how they can be used in benchmarking, in this section we give a very gentle and short introduction to Lattice Gauge Theory, and in particular to the physical problems it intends to address and the numerical methods it implements to solve them. We refer to the appendix and to the quoted literature for the more technical details.

3.1 QCD and novel strong forces

Lattice Gauge Theory is a framework originally devised to solve QCD. QCD is the theory describing physics below the level of atomic nuclei. All nuclear matter (i.e. protons and neutrons) is made of quarks and gluons, with the gluons transmitting the (strong nuclear) force that binds the quarks together. The strong force is one of the four fundamental forces of nature, along with the weak force, electromagnetism (which taken together comprise the Standard Model of particle physics) and gravity, and is also responsible for the interactions of nuclear matter.

While the Standard Model is a remarkably successful theory, there remain a few questions it leaves unanswered. It is known that the electromagnetic and weak interactions are unified at high energies into a single “electroweak” interaction. The simplest explanation of how this breaks down to the two forces we experience at low energies is the existence of a Higgs field, which also gives rise to a scalar boson. Such a state was observed at the Large Hadron Collider at CERN in 2012 [6, 7]. However, the Higgs theory also introduces a problem, in that the mass of the Higgs boson is “unnaturally” light. Theorists are therefore looking “beyond” the Standard Model, to see whether any alternative explanations may be found. One such avenue of research suggests that rather than a Higgs-like theory, a QCD-like theory is responsible for the electroweak symmetry breaking.

3.2 A first look at the lattice

Lattice gauge theory is the set of techniques that allow QCD and other BSM theories to be simulated computationally. A recent pedagogical introduction to Lattice gauge theory (mainly focused on QCD) is provided in [8]. As is common in scientific problems, the continuous space-time of QCD is replaced by a discrete four-dimensional lattice of points, with continuous derivatives replaced by finite differences. Since QCD is defined in terms of a finite four-dimensional space-time (three spatial dimensions, plus one time), the lattice is also four-dimensional.

Fields are functions of the position, and may live on the sites or the links between them. Their values may be real or complex numbers, or more complicated objects—in the case of QCD and the BSM theories of interest, the objects are complex N × N matrices with determinant equal to one. These matrices form a group, which is called the special unitary N × N group, referred to as SU(N). The parameter N is an integer number specifying the physical theory. In this context, QCD is the theory described by SU(3). Quark fields, living on the lattice sites, require a different form of SU(N) matrix to the gluon fields, which live on the links, and can have various dimensions, corresponding to the representation of SU(N) according to which they transform (the fundamental representation of SU(3) in the case of QCD; other representations we shall consider in the more general case are the adjoint, the symmetric and the antisymmetric representations). They come in Nf copies (flavours) with the same mass or different masses. A field’s value at a position is related (in a non-trivial manner) to the probability of observing a particle at that point.

The set of values of all fields across the lattice is known as a (field/gauge) configuration, and can be seen as the value of the variables of the theory at a given time (i.e. as a snapshot of the system for some fixed time). The evolution of the matter and field variables is determined by the dynamics derived from the physical theory according to well known laws; the numerical simulation realises this evolution2. Monte Carlo methods (including but not limited to Metropolis and Molecular Dynamics algorithms) are used to generate new gauge configurations from old, such that the set of configurations has an appropriate statistical distribution of values. With a suitably large number of configurations, physical quantities of interest (observables) may be calculated using appropriate programs. Since not all observables that can be measured in reality are easily calculable on the lattice, much research focuses on devising new methods to calculate observables efficiently.
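To give a concrete flavour of how a chain of configurations with the desired statistical distribution can be generated, the following minimal C sketch shows a generic Metropolis accept/reject step for a single real-valued site variable weighted by exp(−S). It only illustrates the general principle; the update algorithms actually used in HiRep and similar codes (e.g. Molecular Dynamics based ones) are considerably more sophisticated, and the function names below are hypothetical.

    #include <math.h>
    #include <stdlib.h>

    /* Uniform random number in [0,1); a production code would use a better RNG. */
    static double uniform(void) { return rand() / (RAND_MAX + 1.0); }

    /* Generic Metropolis accept/reject step for one site variable.
     * local_action(v) returns this site's contribution to the action S when it
     * takes the value v.  A trial change is accepted with probability
     * min(1, exp(-dS)), which drives the Markov chain towards the distribution
     * exp(-S), as required for the ensemble of configurations. */
    void metropolis_update(double *site, double (*local_action)(double))
    {
        double old_value = *site;
        double new_value = old_value + 0.5 * (uniform() - 0.5);  /* trial move */
        double dS = local_action(new_value) - local_action(old_value);

        if (dS <= 0.0 || uniform() < exp(-dS))
            *site = new_value;   /* accept the trial value */
        /* otherwise reject: the old value is kept */
    }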

3.3 Computations

The gauge and fermionic degrees of freedom (which in QCD are the gluon and quark fields respectively) are stored in structured arrays. In particular, the fermion variables can be integrated out. After this has been performed, the computational kernel of the problem consists of the inversion, on a particular vector, of a sparse matrix D (referred to as the Dirac operator) whose entries are SU(N) matrices. In order to find the wanted inverse, various algorithms are available (e.g. Conjugate Gradient, Minimal Residual, Lanczos, etc.), which mostly consist of a recursive procedure [9]. This recurrence can be accelerated using techniques such as domain decomposition and preconditioning [9].
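To make the structure of this kernel concrete, the sketch below shows a minimal Conjugate Gradient iteration for a Hermitian positive-definite operator such as H = D†D (see the Appendix), using real arithmetic and plain arrays for brevity. It illustrates the recursive structure of such solvers only; it is not the solver implemented in HiRep, and apply_H is a hypothetical stand-in for the Dirac-operator application.

    #include <math.h>
    #include <stddef.h>

    /* Operator application out = H in, e.g. H = D^dagger D; in a lattice code
     * this is the dominant cost and involves halo exchanges between processes. */
    typedef void (*op_t)(double *out, const double *in, size_t n);

    static double dot(const double *a, const double *b, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) s += a[i] * b[i];
        return s;
    }

    /* Solve H x = b by Conjugate Gradient, stopping when the residual norm
     * drops below tol or after max_iter iterations.  r, p and Hp are caller-
     * provided work vectors of length n. */
    void cg_solve(op_t apply_H, double *x, const double *b,
                  double *r, double *p, double *Hp,
                  size_t n, double tol, int max_iter)
    {
        for (size_t i = 0; i < n; i++) { x[i] = 0.0; r[i] = p[i] = b[i]; }
        double rr = dot(r, r, n);

        for (int k = 0; k < max_iter && sqrt(rr) > tol; k++) {
            apply_H(Hp, p, n);                        /* Hp = H p             */
            double alpha = rr / dot(p, Hp, n);
            for (size_t i = 0; i < n; i++) {
                x[i] += alpha * p[i];                 /* update solution      */
                r[i] -= alpha * Hp[i];                /* update residual      */
            }
            double rr_new = dot(r, r, n);
            double beta = rr_new / rr;
            for (size_t i = 0; i < n; i++)
                p[i] = r[i] + beta * p[i];            /* new search direction */
            rr = rr_new;
        }
    }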

Being local, the problem is well suited for parallelisation [8]. For portability reasons, most often this parallelisation uses the Message Passing Interface, or MPI in short (see the next section), dividing the lattice into a coarse grid of space-time sub-lattices. Each sub-lattice includes a boundary layer at least one lattice site thick in each direction, to allow computation of derivatives without needing to fetch data from other processes. These boundaries must however be updated regularly, as their values are changed by adjoining processes. This is part of the communications demand of the code; the other is that the update processes use quantities which depend on fields at all lattice sites, rather than just the local sub-lattice. This aspect is minimised, however, by computing these quantities piecewise locally, and collecting the pieces from each process. Nevertheless, the computation demands are such that not only is the bandwidth important, but network latencies also play a major role.

2 It is worth remarking that—in contrast to the more familiar behaviour in Classical Mechanics—in Quantum Mechanics and in Quantum Field Theory the evolution is not deterministic, but probabilistic.
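As an illustration of the communication pattern just described, the following sketch exchanges the one-site-thick boundary (halo) layer of a one-dimensional domain decomposition with the two neighbouring MPI processes. It is a deliberately simplified model, assuming periodic boundary conditions and a single field component per site; it is not code taken from HiRep, where the analogous exchange is performed in every parallelised direction of the four-dimensional lattice.

    #include <mpi.h>

    /* field holds n_local interior sites plus one halo site at each end:
     * field[0] (left halo) and field[n_local + 1] (right halo). */
    void exchange_halo(double *field, int n_local, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        int left  = (rank - 1 + size) % size;   /* periodic neighbours */
        int right = (rank + 1) % size;

        /* Send the last interior site to the right neighbour while receiving
         * the left halo from the left neighbour, and vice versa. */
        MPI_Sendrecv(&field[n_local], 1, MPI_DOUBLE, right, 0,
                     &field[0],       1, MPI_DOUBLE, left,  0,
                     comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&field[1],           1, MPI_DOUBLE, left,  1,
                     &field[n_local + 1], 1, MPI_DOUBLE, right, 1,
                     comm, MPI_STATUS_IGNORE);
    }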

We begin to see now that adjusting the N of SU(N) and the fermion representation will each alter the compute and communications demands of the task.

4 BSMBench

4.1 The underlying physics code

BSMBench is based on a Lattice Gauge Theory research code called HiRep, which has been developed over the years by some of the authors of this paper. Compared to standard Lattice QCD codes3, in which N = 3 is often hard-wired to simulate a theory of SU(3) matrices, HiRep has the ability to use N as a parameter, which makes it a useful exploration tool for the physics of possible BSM theories. The original serial version of HiRep was first used in [12], where its operations are described in detail. With the growing computational demands of the research problem it intended to solve, a parallel version was developed that was first used to produce the results reported in [13–15]. Since then, much work has been published incorporating calculations produced with HiRep and code built upon it (for a non-exhaustive set, see for instance [16–21]).

The HiRep code comprises:

• a set of C libraries/modules containing shared functions;

• a set of C and Fortran programs for specific computations, making use of the above libraries;

• C++ code to generate header files for a particular fermion representation and gauge group at compile time;

• makefiles to link the above together;

• additional statistical analysis tools (C++ code with bash wrappers) to extract physical quantities from the program output.

Compile-time parameters (e.g. gauge group, fermion representation, boundary conditions, use of MPI) are specified in a flags file. Run-time parameters (e.g. lattice volume and parallelisation, coupling and bare mass, directory for writing configurations, algorithmic parameters) are set in an input file, which takes the form of key = value pairs, one per line. A few parameters (e.g. run log file location) are set as command-line arguments.

3 See for instance [10, 11] for examples of Lattice QCD codes.

As is common for lattice codes, parallelisation occurs using MPI, with the division splitting the lattice volume into chunks, one per process. In the absence of compiler-specific automatic multithreading optimisations, each MPI process is single-threaded. HiRep supports operating in single and double precision, but current research relies on double-precision arithmetic; BSMBench therefore considers only double-precision floating-point variables. Other improvements currently being implemented in order to take advantage of more recent architectures include hybrid MPI-OpenMP programming and porting to GPUs and to Intel’s Xeon Phi architecture.

4.2 The benchmark suite

The usefulness of codes derived from Lattice QCD as generic benchmarking tools has been known for some time [22], and various benchmarks are available that use Lattice QCD techniques (see for instance [23]). In Lattice QCD, communication and computational demands are roughly balanced, which means that in a benchmark based on Lattice QCD a deficiency in one of the two areas could be masked by a good performance in the other. Looking at different BSM theories, instead, changes the ratio of computation to communication requirements. Because of this increased flexibility, a benchmark derived from BSM Lattice theories lends itself better to applications in areas outside of Lattice physics.

In order to obtain valuable information on both speed of computation and data transfer capabilities, we choose to make three different regimes available to benchmark:

• A communications-intensive regime, SU(2) with adjoint flavours, referred to as the “comms” test,

• A balanced regime, SU(3) with fundamental flavours, referred to as the “balance” test, and

• A compute-intensive regime, SU(6) with fundamental flavours, referred to as the “compute” test,

where in each case the value of N and the fermion representation are fixed. More tests can be added starting from the HiRep code.

The strategy adopted for the benchmark is adapted from that of Lüscher [23]. The program’s preparation and execution occur as follows:

• Preparation:

– The user prepares or selects a machine configuration file specifying their C compiler and appropriate flags; some such files are included with the distribution for machines with which BSMBench has been tested.

– (Optional) The user may set a flag to indicate that they do not require a parallel version of the code.

– The user calls the compile script with the parallel (and optionally the non-parallel) machine configuration file as argument(s).

– The compilation script copies the machine configuration file into the appropriate place and sets appropriate flags in the HiRep configuration files.

– The compilation script then calls the make command to generate the three components of the benchmark listed above, in parallel (unless told otherwise) and (optionally) non-parallel versions.

• Execution:

– The user (or job control system) calls (via mpirun if necessary) the appropriate benchmark executable, pointing it at one of the supplied parameter sets.

– The code reads in the parameter set, defining the geometry (i.e. the allocation of the lattice across the various processes) appropriately.

– The code allocates a random gauge field and as many random fermion fields as are specified in the parameter set.

– As a consistency check, the code inverts the Dirac operator4 on the first random spinor5 field and ensures that the residual drops below a given threshold. The input and output spinors are then re-randomised. This test is only carried out if requested in the parameter set, since it would dominate the execution time on smaller machines (e.g. workstations), where it will generally be unnecessary. The total number of iterations and overall time taken is recorded.

4 The Dirac operator (briefly exposed in the Appendix) is the main computational kernel in Lattice Field Theory calculations involving fermions.

5 A spinor is a mathematical object describing fermions. For the purposes of this paper, a spinor may be thought of as a vector.

– The code calculates the square norm of each random spinor field in turn a given number of times, repeating until a time threshold is passed. The number of iterations starts at one, and doubles each time the process repeats, to avoid spending excessive time checking the progress (this timing strategy is sketched in code after this list).

– The code performs the operation ψ2 = ψ2 + cψ1, where c is a complex scalar constant, on successive pairs of random spinor fields, repeating in the same manner as above.

– The code performs a Dirac operator application ψ2 = Dψ1 on successive pairs of random spinor fields, repeating in the same manner as above.

– The code uses its results to calculate and output performance information. Performance is output in FLOP/s (Floating-Point OPerations per second) for the numerical applications, and for convenience the comparison with a reference machine (chosen to be one rack of Blue Gene/P) is also provided.
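As referenced above, the following sketch illustrates the timing strategy used by the kernel tests: the iteration count starts at one and doubles until a time threshold is passed, after which a FLOP/s figure is derived. It is shown here for the spinor-field multiply-add; the helper spinor_mulc_add and the use of clock() are illustrative assumptions rather than HiRep’s actual implementation.

    #include <complex.h>
    #include <stddef.h>
    #include <time.h>

    /* Hypothetical stand-in for the spinor-field operation psi2 = psi2 + c*psi1
     * on fields with n complex components. */
    static void spinor_mulc_add(double complex *psi2, double complex c,
                                const double complex *psi1, size_t n)
    {
        for (size_t i = 0; i < n; i++) psi2[i] += c * psi1[i];
    }

    /* Time the kernel with an iteration count that starts at one and doubles
     * until the elapsed time exceeds threshold_seconds, then return FLOP/s.
     * A complex multiply-add costs 8 floating-point operations per component. */
    double benchmark_mulc_add(double complex *psi2, const double complex *psi1,
                              size_t n, double threshold_seconds)
    {
        const double flop_per_call = 8.0 * (double)n;
        long iterations = 1;
        double elapsed = 0.0;

        for (;;) {
            clock_t start = clock();
            for (long k = 0; k < iterations; k++)
                spinor_mulc_add(psi2, 0.5 + 0.25 * I, psi1, n);
            elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
            if (elapsed >= threshold_seconds) break;
            iterations *= 2;               /* double the work and try again */
        }
        return iterations * flop_per_call / elapsed;   /* FLOP/s estimate */
    }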

For the purposes of comparing machines’ performance, the Dirac operator application test is the most revealing.

The structure of the code is as follows. The source distribution contains:

• A directory containing a stripped-down HiRep distribution, with the addition of the benchmark code and comments on the licensing terms,

• A directory containing sample machine configuration files,

• A directory containing parameter sets for the default set of tests,

• An empty directory to hold output files,

• The README and LICENSE text files, and

• The compilation script.

The source code for BSMBench can be downloaded from http://www.bsmbench.org, where its development (open to contributions from any interested party) is hosted. The compilation script relies on a working MPI development environment able to compile C/C++ source code through the popular make build system. The compilation scripts have been tested on various Unix-like operating systems (among them IBM AIX, GNU/Linux, and OS X), and should work with minor variations on other systems. For simplicity, generic and architecture-specific template makefiles are also provided.

The supplied tests work with a four-dimensional lattice of size 64 × 32³. This choice allows for comparison from workstation-class machines up to small supercomputers. Comparing results of tests at different lattice sizes is not recommended, since the associated overheads change. Additionally, the maximum parallelisation recommended is that giving a local lattice of 8 × 4³; further parallelisation unreasonably increases the overheads. This limits the stock tests to 4096 MPI processes, equivalent to a single Blue Gene/P rack in Virtual Node mode. This limit could be removed by choosing a larger lattice size, as is done in the next section for the extended tests on Blue Gene/Q; however, this would make running the tests impractical on workstations. 4096 MPI processes is sufficient as a proof of concept and a test of moderately-sized machines.
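The quoted limit of 4096 MPI processes follows directly from dividing the global lattice by the smallest recommended local lattice:

    N_{\mathrm{proc}} = \frac{64}{8} \times \left(\frac{32}{4}\right)^{3} = 8 \times 8^{3} = 4096 .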

5 Results

In order to prove its capabilities, the code has been used to systematically test a number of machines:

• Two clusters provided by HPC Wales, one Intel Westmere-based and the other powered by the Intel Sandy Bridge architecture;

• The BlueIce2 cluster at Swansea University (Intel Westmere-based);

• An IBM Blue Gene/P machine (which was used as the reference configuration); runs were split between the UKQCD machine in Swansea (part of the STFC DiRAC HPC facility) and a similar machine operated by IBM Research in Yorktown Heights, New York;

• An IBM Blue Gene/Q machine, also at IBM Research in Yorktown Heights;

Machine | Processor | Number of cores | Threads per core | RAM | L2 cache | L3 cache | Interconnect | Compiler | Upper bound
HPC Wales (Westmere) | Intel Westmere Xeon X5650 | 12 | 2 | 36GB | 6MB | 24MB | InfiniBand | GCC 4.6.2 | Queue size
HPC Wales (Sandy Bridge) | Intel Sandy Bridge Xeon E5-2690 | 16 | 2 | 64GB | 4MB | 20MB | InfiniBand | Intel | Queue size
BlueIce2 | Intel Westmere Xeon X5645 | 12 | 2 | 24GB | 6MB | 24MB | InfiniBand | GCC 4.4.6 | Machine size
IBM Blue Gene/P | 32-bit PowerPC at 850MHz | 4 | 1 | 4GB | 8kB | 4MB | High-speed 3D torus | IBM XL C 9.0 for Blue Gene | Largest test size
IBM Blue Gene/Q | 64-bit PowerPC at 1600MHz | 16 | 4 | 16GB | 32MB | — | High-speed 5D torus | IBM XL C 12.0 for Blue Gene | Largest test size and machine size
ULGQCD | AMD Opteron 6128 | 8 | 1 | 96GB | 8MB | 24MB | Gigabit Ethernet | GCC 4.1.2 | Slow interconnects
Mac Pro | Intel Westmere Xeon X5650 | 12 | 2 | 24GB | 6MB | 24MB | — | LLVM-GCC 4.2.1 | Machine size

Table 1: The seven machines tested by BSMBench and details of their node configuration, divided into the groups in which they were plotted.

• The ULGQCD cluster at Liverpool University, part of the STFC DiRAC HPC facility;

• A Mac Pro, to include a high-level workstation in the comparison.

Relevant details of the machines are listed in table 1.

The primary test performed was running each of the benchmark tests on each machine at increasing partition sizes, from the minimum on which the tests would run (with node memory being the primary constraint) up to an upper limit determined by the factors listed in table 1. All tests were run with one MPI process per processor core—sometimes referred to as Virtual Node (VN) mode—since HiRep and BSMBench are not currently multithreaded. In standard implementations, most queue managers ignore any concurrent thread support of the CPUs, since MPI rarely gains an advantage from running multiple concurrent threads; multithreaded code must be written, or compiled in, to use this ability. Blue Gene/Q, however, does provide for this; thus up to 64 MPI processes may be run on a single processor.

Two other sets of tests were run. Firstly, a customised set of tests with a 128 × 64³ lattice was run on Blue Gene/Q to allow a rack-for-rack comparison with a Blue Gene/P; this was again run in VN mode. Secondly, a set of tests was run on a single Blue Gene/Q partition, varying the number of MPI processes per node from 1 to 64 in powers of 2, to test the performance of the multiple hardware threads of the Blue Gene/Q processor.

The plots are divided into groups by machine class to avoid overcrowding them. The primary battery of tests, shown in figures 1–8, shows a roughly linear relationship between FLOP/s performance and number of MPI processes (i.e. processor cores in use) in all cases. All machines show a slow drop in performance per node as the number of nodes is increased, associated with the increased overheads. The ULGQCD cluster shows a rapid tail-off in performance once the parallelisation exceeds the size of one node; for the largest size tested not only does the per-core performance drop, but the overall performance of the partition drops.

The rack-for-rack comparison, shown in figure 9, shows that for larger lattices Blue Gene/Q continues to scale as we expect. The comparison of different processor subdivisions, shown in figure 10, shows linear scaling up to 1 process per core. When further subdividing the cores, the performance plateaus, showing little to no gain from using more than one thread per core.


Key to symbols: [plot legend not reproduced; the same symbols are used throughout figures 1–10.]

Figure 1: Results of the spinor field square norm test, plotted for the whole ensemble. [Plots not reproduced; panels (a) HPC Wales and BlueIce2, (b) Blue Gene, and (c) Others show FLOP/s against number of processes.]

Figure 2: Results of the spinor field square norm test, plotting the average performance per MPI process. [Plots not reproduced; panels (a) HPC Wales and BlueIce2, (b) Blue Gene, and (c) Others show FLOP/s against number of processes.]


Figure 3: Results of the spinor field multiply–add, plotted for the whole ensemble. Symbols as in the key preceding figure 1. [Plots not reproduced; panels (a) HPC Wales and BlueIce2, (b) Blue Gene, and (c) Others show FLOP/s against number of processes.]

Figure 4: Results of the spinor field multiply–add, plotting the average performance per MPI process. Symbols as in the key preceding figure 1. [Plots not reproduced; panels (a) HPC Wales and BlueIce2, (b) Blue Gene, and (c) Others show FLOP/s against number of processes.]


Figure 5: Results of the Dirac operator application test, plotted for the whole ensemble. Symbols as in the key preceding figure 1. [Plots not reproduced; panels (a) HPC Wales and BlueIce2, (b) Blue Gene, and (c) Others show FLOP/s against number of processes.]

Figure 6: Results of the Dirac operator application test, plotting the average performance per MPI process. Symbols as in the key preceding figure 1. [Plots not reproduced; panels (a) HPC Wales and BlueIce2, (b) Blue Gene, and (c) Others show FLOP/s against number of processes.]


Figure 7: Total combined results of all three tests, plotted for the whole ensemble. Symbols as in the key preceding figure 1. [Plots not reproduced; panels (a) HPC Wales and BlueIce2, (b) Blue Gene, and (c) Others show FLOP/s against number of processes.]

Figure 8: Total combined results of all three tests, plotting the average performance per MPI process. Symbols as in the key preceding figure 1. [Plots not reproduced; panels (a) HPC Wales and BlueIce2, (b) Blue Gene, and (c) Others show FLOP/s against number of processes.]


Figure 9: Rack-for-rack comparison between Blue Gene/P and Blue Gene/Q performance, using a larger lattice size in the case of the Blue Gene/Q to allow use of a full rack. Symbols as in the key preceding figure 1. [Plot not reproduced; FLOP/s against number of racks.]

Figure 10: Performance comparison of a single Blue Gene/Q partition at various subdivisions, from one process per CPU through to 64 processes per CPU (four per core). Symbols as in the key preceding figure 1. [Plot not reproduced; FLOP/s against number of MPI processes per CPU.]


It is worth drawing attention to some features of the data that are particularly interesting. For this discussion we will concentrate on the Dphi (Dirac operator application) test, since it is the most representative of “real-world” performance.

For the Blue Gene/P, unexpected upticks in performance were seen at 512 and 4096 cores (128 and 1024 nodes, or 1/4 and 1 rack respectively) that were not seen on the Blue Gene/Q. This is most likely due to the lattice geometry aligning fortuitously with the 3D torus network—it would not be seen on Blue Gene/Q due to the different network topology and core density. No tuning of the process mapping was performed for the tests; the default alignments were used. More careful choice of the alignments may be able to give a similar (or better) speedup to other partition sizes.

The threading results on Blue Gene/Q are not unexpected. Since HiRep and BSMBench are single-threaded, subdividing the processors by core is equivalent to adding more cores to the problem. A significant speedup from multithreading is not expected: since multithreading only allows efficient context switching between threads rather than parallel execution (which is what multiple cores achieve), a large gain would only be possible if the code spent much of its time idle, which is not generally the case. A small gain can be found in a compute-intensive theory by having two processes rather than one per core; since this gain is effectively “free” there is no reason not to use it, unless the resulting local lattice is too small—in which case the extra overheads will outweigh the gains—or there are no more directions available to parallelise.

The Mac Pro system at 16 and 24 processes is also multi- (or Hyper-) threading; notice that the performance at 16 processes drops to approximately that of 8 processes. While the multiple threads per core can in principle execute concurrently (unlike on the Blue Gene/Q), in practice the shared FPU means that very little is parallelised, giving behaviour very close to what would be seen in a non-parallel multithreading core. The drop in performance, rather than a plateau, occurs for two reasons: firstly, the threads are localised to a single core; they cannot float between the cores and use whatever slack compute capacity there is. Secondly, the code has relatively frequent barriers to allow communication between the processes. Since the code makes heavy use of the floating-point units (which are shared between the “Hyper”-threads), these two factors have the effect of forcing the code to operate in lock-step on all cores, including those with only one task allocated, making the program run as slowly as if all cores had two threads running. Since the code has little idle time (even less than on the Blue Gene, since here there are no interconnect delays at all), no performance gains over 12 processes are seen at 24 processes. To test the lock-step theory, comms and balance tests6 were performed with two 8- and 12-process runs in parallel with each other (for a total of 16 and 24 processes), and their FLOP counts added. This showed a slight enhancement over the single-run results, suggesting that the requirement to synchronise between processes does indeed reduce the potential gains from multithreading.

The HPC Wales machines, in particular the Sandy Bridge system, show an increasing separation between the three tests as the number of processes is increased. This nicely illustrates the assertion made earlier that the tests differ in their communications versus compute demands: the splitting begins once the processes no longer fit on a node, so must begin to use the interconnects, and increases as the interconnects are relied upon more. The slowdown is least severe for the compute test, and most severe for the comms test, as we expect. This is not observed on the Blue Gene machines, whose architecture is designed from the ground up to be massively parallel, hence their advanced network topology and very high-speed interconnects. In the case of ULGQCD, the interconnects are too slow to make an informed judgement; however, the expected drop-off in performance is seen once the node size is exceeded.

6 The compute test required more memory than was present in the Mac Pro, so the results were not of interest.

The BlueIce2 system is very similar to the HPC Wales Westmere system in its makeup; their performance results are unsurprisingly also close to each other. The performance of BlueIce2 in the intermediate regime (from 16 cores, where the job no longer fits on a single node, up to 256 cores) sits above the performance of the HPC Wales machine—this appears somewhat surprising, since its processors are slightly slower. In fact this enhancement is due to the job’s parallelisation: on BlueIce2 it was possible to parallelise with 8 cores per node, and Nproc/8 = 2^n nodes, whereas allocating a given number of cores on HPC Wales generally returns the cores on the minimum number of nodes. Since the lattice is parallelised in divisions of 2, keeping powers of 2 in the parallelisation improves the communications performance significantly; this is notable in that the comms test is the one gaining the largest performance boost on BlueIce2. Confirming this explanation is the downtick in performance (to slightly below HPC Wales) at 512 cores: since BlueIce2 only has 50 nodes, the neat power-of-two parallelisation is no longer possible at the highest number of cores, meaning the communications speed enhancement is lost.

6 Conclusions

In this work we have presented BSMBench, a novel HPC benchmark suite derived from the latest research code used for Monte Carlo calculations in theoretical particle physics. BSMBench has the ability to vary the ratio between computation and communication demands. This property allows the tool to provide a robust assessment of the capabilities of different architectures under scenarios that can mimic a wide range of workloads. BSMBench can be downloaded from http://www.bsmbench.org and comes with an easy set of instructions and a limited set of dependencies, in order to allow users not so familiar with parallel programming to deploy the suite on various multicore architectures. Although we have presented three particular realisations, other possibilities for obtaining a different ratio of communication over computation can be implemented by changing a relatively small portion of the source code.

In addition to exploring other parallelisation strategies and porting to different architectures, which will follow further developments of HiRep, the Lattice Gauge Theory code from which BSMBench is derived, other future directions for upcoming releases of the benchmark suite are being planned. A potential extension of this project would be to implement a size-independent test, where the lattice size is chosen based on the machine size. Since the total number of FLOPs is proportional to the lattice volume, a reference figure could be given for a base 4⁴ lattice, and multiplied up to the machine size. Results compared to the current version would be approximately comparable. It would be beneficial to find a small set of quantities that would allow full characterisation of a machine’s performance, to enable ranking of machines as in the TOP500 ranking (since comparison of graphs, while enlightening, is time-consuming). Finding out which classes of HPC use are characterised by which test would also expand the benchmark’s utility. Finally, many benchmarks now come with the ability to analytically estimate the performance on a given machine from the machine’s design characteristics; such a model for BSMBench would be a powerful tool for designing new systems, in particular once other codes’ performance can be modelled in terms of BSMBench’s tests.

Acknowledgements

We thank Jim Sexton for invaluable insights in the early stage of this project. The simulations discussed in this work were performed on the STFC-funded DiRAC facilities, on High Performance Computing Wales machines and on systems supported by the Royal Society. EB acknowledges the support of an IBM student internship and of STFC through a Doctoral Training Grant. The work of BL is supported by STFC (grant ST/G000506/1) and by the Royal Society (grant UF09003). The research of CP is supported by a Lundbeck Foundation Fellowship grant.

A Lattice Gauge Theory

A complete review of the Lattice formalism, and of the framework of Quantum Field Theory (QFT) which underlies it, can each span multiple textbooks, and thus both are well outside the scope of this paper. Instead, this appendix will review the important results from these fields which could assist in understanding the remainder of this paper. Those wanting to learn more about QFT and Lattice techniques could do worse than looking in, for example, [24, 25] and [26, 27].

A mathematical topic that plays a central role in QFT is group theory. Once again, a full view of group theory is well beyond the scope of this paper, but for the interested reader there are many good textbooks introducing group theory in the context of physics [28, 29].

SU(N) groups (which are the primary groups of interest in particle physics, and the only ones used in this paper) are defined as the groups of complex, unitary N × N matrices with determinant one. This defining representation (also called the fundamental representation) is not the only way to describe the abstract SU(N) group, which may also be represented in terms of larger square matrices. Physically interesting higher representations include the symmetric, antisymmetric, and adjoint representations, which are called two-index representations, since each matrix in any of those representations can be expressed as a tensor product of two matrices in the fundamental representation. SU(N) matrices (of any representation) can act on vectors (or, conversely, the vectors can be said to transform under a given representation of the group).

Each representation consists of D_R × D_R matrices, which act on D_R-dimensional vectors, where D_R(N) is called the dimension of the representation R. We have already seen that for the fundamental representation, D_F(N) = N. For the adjoint representation, D_A(N) = N² − 1. For the antisymmetric and symmetric representations, D_AS(N) = N(N − 1)/2 and D_S(N) = N(N + 1)/2 respectively.
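Evaluating these formulas for the three stock tests of Section 4.2 gives the fermion-representation dimensions that underlie their different compute and communication profiles:

    D_A(2) = 2^2 - 1 = 3 \quad \text{(comms test: SU(2), adjoint flavours)}
    D_F(3) = 3 \quad \text{(balance test: SU(3), fundamental flavours)}
    D_F(6) = 6 \quad \text{(compute test: SU(6), fundamental flavours)}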

Most observables calculable in QFT, and by extension in modern particle physics, may be expressed in the so-called path integral formulation, due to Feynman, as correlation functions in the form

    \langle X \rangle = \frac{1}{Z} \prod_i \left( \int \mathcal{D}\xi_i \right) X \, e^{iS[\xi_i]} , \qquad (1)

Figure 11: An example of a three-dimensional lattice, showing lattice sites, links between them, and elementary plaquettes formed from squares of links.

where Z = ∏_i (∫ Dξ_i) e^{iS} (the path integral or partition function) enforces ⟨1⟩ = 1, ∏_i (∫ Dξ_i) represents a functional integration over the space of fields {ξ_i}, i is the imaginary unit defined by i² = −1, and S[ξ_i] is the action defining the theory.

Most theories currently of interest contain gauge fields Aµ in the adjoint representation of the gauge group (most frequently chosen to be SU(N)), and Nf fermion flavours ψi, spinor fields lying in some representation of the gauge group. Various choices for N and Nf define different theories. At fixed N and Nf, the representation in which a field lies determines the physical properties of the corresponding system. For example, QCD has gauge group SU(3), and six fermion flavours in the fundamental representation. The same theory with the same number of adjoint quark fields would have vastly and noticeably different properties.

The functional integral present in Equation (1) represents an integration over all possible configurations of the fields, and is in general not analytically defined. We may, however, find a robust definition by discretising the problem—that is, placing it onto a finite grid, or lattice, with the functional integral of a field replaced by a product of integrals over the field values at each point, i.e.

    \int \mathcal{D}\xi(x) \;\mapsto\; \prod_i \left( \int d\xi(x_i) \right) . \qquad (2)

Figure 11 shows a three-dimensional example of such a lattice; however, the theories of interest in physics are most frequently four-dimensional, with three spatial directions and one temporal direction, carrying an additional − sign in calculations of distance. Ultimately, it is because of this minus sign that the exponential in (1) is complex. Complex integrals are oscillating and give rise to cancellations that would be difficult to treat with numerical methods. We may simplify things somewhat by performing a Wick rotation into “imaginary time”, where space and time are on the same footing, each making a positive contribution to distance calculations, just as the three spatial dimensions do in 3D calculations, leading to the setup being referred to as Euclidean spacetime. This has the effect of introducing a factor of i into the exponent in Equation (1). Now the exponential becomes damped, and for all practical purposes the integrals have support in a compact region in the integration variables. The original theory can be recovered by inverting the Wick rotation after the calculation has been performed. Although better behaved numerically after the Wick rotation, the set of integrals that needs to be computed rapidly becomes intractable using grid methods, since it scales linearly with the lattice volume (i.e. quartically with the lattice’s extent), and so Monte Carlo techniques must be used.

For the lattice theories we are interested in, the Euclidean physical value of an observable X can be expressed as

    \langle X \rangle_{\mathrm{Eucl.}} = \frac{1}{Z} \int \mathcal{D}U_\mu \, \mathcal{D}\phi_i \, \mathcal{D}\phi_i^{\ast} \; X \, e^{-S_{\mathrm{eff}}} , \qquad (3)

with

    S_{\mathrm{eff}} = S_g[U_\mu] + \phi^{\ast} \left( (D[U_\mu])^{\dagger} D[U_\mu] \right)^{-1} \phi . \qquad (4)

Here Uµ (which represents the gauge variables and can be seen as the path-ordered exponentials of the Aµ along the link considered) is a field of SU(N) matrices with support on the links of the lattice. φ is a vectorial quantity describing the fermions (φ* being its complex conjugate) defined over all lattice sites. It transforms under the fundamental or a two-index representation of the gauge group (a flavour index running from one to Nf, identifying the fermion species, is also understood). D† is the Hermitian conjugate of the Dirac operator D, and an even number of flavours Nf has been assumed throughout. Equations (3) and (4) expose the main computational kernel of the problem, which is the inversion of H = (D[Uµ])†D[Uµ]. In current calculations, this inversion dominates the cost for the values of N currently accessible in numerical simulations, while the part involving Sg[Uµ] adds only a modest computational overhead.

Together with the dimension of the group N, the representation in which the fermions transform determines the dimension of the Dirac operator, and hence the computational complexity of the problem. Given the sub-grid parallelisation strategy, these parameters also affect the size of the messages being transferred between neighbouring processors, and hence the communication requirements.
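A rough counting argument, ignoring constant factors and the gauge part of the action, makes this explicit: applying a D_R × D_R link matrix to a D_R-component spinor costs of order D_R² floating-point operations per site, while only the order-D_R spinor components of each boundary site need to be exchanged between neighbouring processes, so

    \frac{\text{floating-point work per site}}{\text{data exchanged per boundary site}} \;\sim\; \frac{D_R^{2}}{D_R} \;=\; D_R ,

which is consistent with larger groups and representations, as in the compute test, shifting the balance towards computation.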

References

[1] “HPL - a portable implementation of the High-Performance Linpack benchmark for distributed-memory computers.” http://www.netlib.org/benchmark/hpl.

[2] “TOP500 supercomputer sites.” http://www.top500.org/.

[3] J. Dongarra and M. A. Heroux, “Towards a new metric for ranking High Performance Computing systems,” Tech. Rep. SAND2013-4744, Sandia National Laboratories, Albuquerque, New Mexico 87185 and Livermore, California 94550, June, 2013. http://www.sandia.gov/~maherou/docs/HPCG-Benchmark.pdf.

[4] “The Graph 500 list.” http://www.graph500.org/.

[5] “The Green500 list.” http://green500.org/.

[6] ATLAS Collaboration, “Observation of a new particle in the search for the Standard Model Higgs boson with the ATLAS detector at the LHC,” Phys. Lett. B716 no. 1, (2012) 1–29, arXiv:1207.7214.

[7] CMS Collaboration, “Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC,” Phys. Lett. B716 no. 1, (2012) 30–61, arXiv:1207.7235.

[8] S. Schaefer, “Algorithms for lattice QCD: progress and challenges,” AIP Conf. Proc. 1343 (2011) 93–98, arXiv:1011.5641 [hep-ph].

[9] C. Pica, “Overview of iterative solvers for LQCD.” http://www.physik.uni-bielefeld.de/igs/schools/Tools2008/pica.pdf.

[10] “DD-HMC algorithm for two-flavour lattice QCD.” http://luscher.web.cern.ch/luscher/DD-HMC/index.html.

[11] “The Chroma library for lattice field theory.” http://usqcd.jlab.org/usqcd-docs/chroma/.

[12] L. Del Debbio, A. Patella, and C. Pica, “Higher representations on the lattice: Numerical simulations. SU(2) with adjoint fermions,” Phys. Rev. D81 (2010) 094503, arXiv:0805.2058 [hep-lat].

[13] L. Del Debbio, B. Lucini, A. Patella, C. Pica, and A. Rago, “Conformal versus confining scenario in SU(2) with adjoint fermions,” Phys. Rev. D80 (2009) 074507, arXiv:0907.3896 [hep-lat].

[14] L. Del Debbio, B. Lucini, A. Patella, C. Pica, and A. Rago, “Infrared dynamics of Minimal Walking Technicolor,” Phys. Rev. D82 (Jul, 2010) 014510, arXiv:1004.3206.

[15] L. Del Debbio, B. Lucini, A. Patella, C. Pica, and A. Rago, “Mesonic spectroscopy of Minimal Walking Technicolor,” Phys. Rev. D82 (2010) 014509, arXiv:1004.3197 [hep-lat].

[16] S. Catterall, L. Del Debbio, J. Giedt, and L. Keegan, “MCRG Minimal Walking Technicolor,” PoS LATTICE2010 (2010) 057, arXiv:1010.5909 [hep-ph].

[17] B. Lucini, G. Moraitis, A. Patella, and A. Rago, “Orientifold Planar Equivalence: The Quenched Meson Spectrum,” PoS LATTICE2010 (2010) 063, arXiv:1010.6053 [hep-lat].

[18] F. Bursa, L. Del Debbio, D. Henty, E. Kerrane, B. Lucini, A. Patella, C. Pica, T. Pickup, and A. Rago, “Improved lattice spectroscopy of Minimal Walking Technicolor,” Phys. Rev. D84 (Aug, 2011) 034506, arXiv:1104.4301.

[19] R. Lewis, C. Pica, and F. Sannino, “Light asymmetric dark matter on the lattice: SU(2) technicolor with two fundamental flavors,” Phys. Rev. D85 (Jan, 2012) 014504, arXiv:1109.3513.

[20] E. Bennett and B. Lucini, “Topology of Minimal Walking Technicolor,” Eur. Phys. J. C73 no. 5, (2013) 1–7, arXiv:1209.5579. http://dx.doi.org/10.1140/epjc/s10052-013-2426-6.

[21] A. Hietanen, C. Pica, F. Sannino, and U. I. Sondergaard, “Orthogonal Technicolor with isotriplet dark matter on the lattice,” Phys. Rev. D87 no. 3, (2013) 034508, arXiv:1211.5021 [hep-lat].

[22] P. Vranas, G. Bhanot, M. Blumrich, D. Chen, A. Gara, P. Heidelberger, V. Salapura, and J. C. Sexton, “The Blue Gene/L supercomputer and Quantum ChromoDynamics, Gordon Bell finalist,” in Proceedings of SC06. 2006. http://sc06.supercomputing.org/schedule/pdf/gb110.pdf.

[23] M. Lüscher, “Lattice QCD on PCs?,” Nucl. Phys. B Proc. Suppl. 106–107 no. 0, (2002) 21–28, arXiv:hep-lat/0110007. LATTICE 2001, Proceedings of the XIXth International Symposium on Lattice Field Theory.

[24] M. E. Peskin and D. V. Schroeder, An Introduction To Quantum Field Theory (Frontiers in Physics). Westview Press, 1995.

[25] A. Zee, Quantum Field Theory in a Nutshell. Princeton University Press, 2010.

[26] H. J. Rothe, Lattice Gauge Theories: An Introduction. World Scientific, 2005.

[27] J. Smit, “Introduction to quantum fields on a lattice: A robust mate,” Cambridge Lect. Notes Phys. 15 (2002) 1–271. http://www.slac.stanford.edu/spires/find/hep/www?j=00385,15,1.

[28] H. Jones, Groups, Representations, and Physics. Taylor & Francis, 1998.

[29] H. Georgi, Lie Algebras in Particle Physics: From Isospin to Unified Theories. Westview Press, 1999.