
Efficient Parallel Implementation of Evolutionary Algorithms on GPGPU Cards

Ogier Maitre1, Nicolas Lachiche1, Philippe Clauss1, Laurent Baumes2, Avelino Corma2, and Pierre Collet1

1 LSIIT, University of Strasbourg, France {maitre,lachiche,clauss,collet}@lsiit.u-strasbg.fr

2 Instituto de Tecnologia Quimica, UPV-CSIC, Valencia, Spain {baumesl,acorma}@itq.upv.es

Abstract. A parallel solution to the implementation of evolutionary algorithms is proposed, where the most costly part of the whole evolutionary algorithm computation (the population evaluation) is deported to a GPGPU card. Experiments are presented for two benchmark examples on two models of GPGPU cards: first, a "toy" problem is used to illustrate some noticeable behaviour characteristics before a real problem is tested out. Results show a speed-up of up to 100 times compared to an execution on a standard micro-processor. To our knowledge, this solution is the first showing such an efficiency with GPGPU cards. Finally, the EASEA language and its compiler are also extended to allow users to easily specify and generate efficient parallel implementations of evolutionary algorithms using GPGPU cards.

1 Introduction

Among the manycore architectures available nowadays, GPGPU (General Purpose Graphic Processing Units) cards offer one of the most attractive cost/performance ratios. However, programming such machines is a difficult task. This paper focuses on a specific kind of resource-consuming application: evolutionary algorithms. It is well known that such algorithms offer efficient solutions to many optimization problems, but they usually require a great number of evaluations, making processing power a limiting factor on standard micro-processors. However, their algorithmic structure clearly exhibits resource-costly computation parts that can be naturally parallelized. Still, GPGPU programming constraints call for dedicated operations for efficient parallel execution, one of the main performance-relevant constraints being the time needed to transfer data from the host memory to the GPGPU memory.

This paper starts by presenting evolutionary algorithms and studying them to determine where parallelization could take place. Then, GPGPU cards are presented in section 3, and a proposition on how evolutionary algorithms could be parallelized on such cards is described in section 4. Experiments are made on two benchmarks and two NVidia cards in section 5, and some related works are described in section 7. Finally, results and future developments are discussed in the conclusion.

H. Sips, D. Epema, and H.-X. Lin (Eds.): Euro-Par 2009, LNCS 5704, pp. 974–985, 2009. © Springer-Verlag Berlin Heidelberg 2009


2 Presentation of Evolutionary Algorithms

In [5], Darwin suggests that species evolve through two main principles: variation in the creation of new children (that are not exactly like their parents) and survival of the fittest, as many more individuals of each species are born than can possibly survive.

Evolutionary Algorithms (EAs) [9] get their inspiration from this paradigm to suggest a way to solve the following interesting question. Given:

1. a difficult problem for which no computable way of finding a good solution is known and where a solution is represented as a set of parameters,

2. a limited record of previous trials that have all been evaluated.

How can one use the accumulated knowledge to choose a new set of parameters to try out (and therefore do better than a random search)? EAs rely on artificial Darwinism to do just that: create new potential solutions from variations on good individuals, and keep a constant population size through selection of the best solutions. The Darwinian inspiration for this paradigm leads to borrowing some specific vocabulary from biology: given an initial set of evaluated potential solutions (called a population of individuals), parents are selected among the best to create children thanks to genetic operators (that Darwin called "variation" operators), such as crossover and mutation. Children (new potential solutions) are then evaluated, and from the pool of parents and children, a replacement operator selects those that will make it to the new generation before the loop is started again.
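To make this vocabulary concrete, the following minimal, self-contained C++ sketch implements such a loop with a toy fitness function (sum of squares) and operators chosen by us purely for illustration; it is not the implementation discussed in this paper.

    #include <algorithm>
    #include <cstdlib>
    #include <vector>

    struct Individual { std::vector<float> x; float fitness; };

    static float frand(float lo, float hi) {
        return lo + (hi - lo) * (float)rand() / RAND_MAX;
    }
    static float evaluate(const std::vector<float>& x) { // toy fitness: minimise sum of x_i^2
        float f = 0.f; for (float v : x) f += v * v; return f;
    }
    static const Individual& tournament(const std::vector<Individual>& pop) {
        const Individual& a = pop[rand() % pop.size()];
        const Individual& b = pop[rand() % pop.size()];
        return (a.fitness < b.fitness) ? a : b;          // lower fitness is better here
    }

    int main() {
        const int POP = 100, DIM = 10, GENS = 50;
        std::vector<Individual> pop(POP);
        for (auto& ind : pop) {                          // initialisation (parallelizable)
            ind.x.resize(DIM);
            for (auto& v : ind.x) v = frand(-1.f, 1.f);
            ind.fitness = evaluate(ind.x);               // evaluation (parallelizable)
        }
        for (int g = 0; g < GENS; g++) {
            std::vector<Individual> children(POP);
            for (auto& c : children) {                   // selection + variation
                const Individual& p1 = tournament(pop);
                const Individual& p2 = tournament(pop);
                c.x.resize(DIM);
                float a = frand(0.f, 1.f);
                for (int i = 0; i < DIM; i++)            // arithmetic crossover
                    c.x[i] = a * p1.x[i] + (1 - a) * p2.x[i];
                c.x[rand() % DIM] += frand(-0.1f, 0.1f); // one-gene mutation
                c.fitness = evaluate(c.x);               // the costly step
            }
            pop.insert(pop.end(), children.begin(), children.end());
            std::sort(pop.begin(), pop.end(),            // replacement: keep the POP best
                      [](const Individual& u, const Individual& v)
                      { return u.fitness < v.fitness; });
            pop.resize(POP);
        }
        return 0;
    }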

2.1 Parallelization of a Generic Evolutionary Algorithm

The algorithm presented in figure 1 contains several steps that may or may not be independent. To start with, population initialisation is inherently parallel, because all individuals are created independently (usually with random values).

Then, all newly created individuals need to be evaluated. But since they are all evaluated independently using a fitness function, evaluation of the population can be done in parallel. It is interesting to note that in evolutionary algorithms, evaluation of individuals is usually the most CPU-consuming step of the algorithm, due to the high complexity of the fitness function.

Once a parent population has been obtained (by evaluating all the individuals of the initial population), one needs to create a new population of children. In order to create a child, it is necessary to select some parents on which variation operators (crossover, mutation) will be applied. In evolutionary algorithms, selection of parents is also parallelizable because one parent can be selected several times, meaning that independent selectors can select whoever they wish without any restrictions.

Creation of a child out of the selected parents is also a totally independent step: a crossover operator needs to be called on the parents, followed by a mutation operator on the created child.


Fig. 1. Generic evolutionary loop

So up to now, all steps of the evolutionary loop are inherently parallel but for the last one: replacement. In order to preserve diversity in the successive generations, the (N + 1)-th generation is created by selecting some of the best individuals of the parents+children populations of generation N. However, if an individual is allowed to appear several times in the new generation, it could rapidly become preeminent in the population, therefore inducing a loss of diversity that would reduce the exploratory power of the algorithm.

Therefore, evolutionary algorithms impose that all individuals of the new generation be different. This is a real restriction on parallelism, since it means that the selection of N survivors cannot be made independently, otherwise the same individual could be selected several times by several independent selectors.
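For instance, a simple way to satisfy this constraint is a sequential truncation of the merged parents+children pool, where each survivor is taken exactly once; this is a sketch of ours, not necessarily the replacement operator used in the paper.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Return the indices of the nSurvivors best individuals, each exactly once.
    std::vector<int> replacement(const std::vector<float>& fitness, int nSurvivors) {
        std::vector<int> idx(fitness.size());
        std::iota(idx.begin(), idx.end(), 0);            // 0, 1, ..., popSize-1
        std::sort(idx.begin(), idx.end(),
                  [&](int a, int b){ return fitness[a] < fitness[b]; });
        idx.resize(nSurvivors);                          // no index can appear twice
        return idx;
    }

The sort makes this step inherently sequential (or at best a parallel reduction), in contrast with the fully independent selections used for parents.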

Finally, one could wonder whether several generations could evolve in parallel. The fact that generation (N + 1) is based on generation N invalidates this idea.

3 GPGPU Architecture

GPGPU and classic CPU designs are very different. GPGPUs come from the gaming industry and are designed for 3D rendering, and they inherit specific features from this usage. For example, they feature several hundred execution units grouped into SIMD bundles that have access to a small amount of shared memory (16KB on the NVidia 8800GTX that was used for this paper), a large memory space (several hundred megabytes), a special access mode for texture memory and a hardware scheduling mechanism.

The 8800GTX GPGPU card features 128 stream processors (compared to 4 general purpose processors on the Intel Quad Core) even though both chips have a similar number of transistors (681 million for the 8800GTX vs 582 million for the Intel Quad Core). This is achieved thanks to a simplified architecture that has some serious drawbacks. For instance, the stream processors are not all independent: they are grouped into SIMD bundles (16 SPMD bundles of 8 SIMD units on the 8800GTX, which saves 7 fetch and dispatch units per bundle). Then, space-consuming cache memory is simply not available on GPGPUs, meaning that all memory accesses (which can be done in only a few cycles on a CPU if the data is already in the cache) cost several hundred cycles.

Fortunately, some workarounds are provided. For instance, the hardware scheduling mechanism runs a bundle of threads called a warp at the same time, swapping between warps as soon as a thread of the current warp is stalled on a memory access, so memory latency can be overcome with warp scheduling. But there is a limit to what can be done: it is important to have enough parallel tasks to be scheduled while waiting for the memory. A thread's state is not saved into memory; it stays on the execution unit (as in the hyperthreading mechanism), so the number of registers used by a task directly impacts the number of tasks that can be scheduled on a bundle of stream processors. There is also a limit on the number of schedulable warps (24 warps, i.e. 768 threads, on the 8800GTX).
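As an illustration, the hardware limits discussed above can be queried through the CUDA runtime API; this small sketch of ours prints the figures that matter for scheduling (on the 8800GTX, it reports 16 multiprocessors, 8192 registers and 16KB of shared memory per block).

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);               // properties of card 0
        printf("multiprocessors (bundles): %d\n", prop.multiProcessorCount);
        printf("warp size:                 %d\n", prop.warpSize);
        printf("registers per block:       %d\n", prop.regsPerBlock);
        printf("shared memory per block:   %zu bytes\n", prop.sharedMemPerBlock);
        printf("max threads per block:     %d\n", prop.maxThreadsPerBlock);
        return 0;
    }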

All these quirks make it very difficult for standard programs to exploit the real power of these graphics cards.

4 Parallel Implementation on a GPGPU Card

As has been shown in section 2.1, it is possible to parallelize most of the evolutionary loop. However, whether it is worthwhile to run everything in parallel on the GPGPU card is another matter: in [8,7,11], the authors implemented complete algorithms on GPGPU cards, but clearly show that doing so is very difficult, for quite small performance gains.

Rather than going this way, the choice made for this paper was to keep everything simple, and to start by experimenting with the obvious idea of only parallelizing children evaluation, based on the three following considerations.

1. Implementing the complete evolutionary engine on the GPGPU card is very complex, so it seems preferable to start with parallelizing only one part of the algorithm.

2. Usually, in evolutionary algorithms, execution of the evolutionary engine (selection of parents, creation of children, replacement step) is extremely fast compared to the evaluation of the population.

3. Then, if the evolutionary engine is kept on the host CPU, one needs to transfer the genomes of the individuals only once to the GPGPU for each generation. If the selection and variation operators (crossover, mutation) had been implemented on the GPGPU, it would have been necessary to get the population back on the host CPU at every generation for the replacement step.

Evaluation of the population on the GPGPU is a massively parallel process that suits an SPMD/SIMD computing model well, because standard evolutionary algorithms use the same evaluation function to evaluate all individuals.1

1 This is not the case in Genetic Programming where, on the contrary, all individuals are different functions that are tested on a common learning data set.


Individual evaluations are grouped into structures called Blocks, that implement a group of threads which can be executed on a same bundle. Dispatching individual evaluations across this structure is very important in order to maximize the load on the whole GPGPU. Indeed, as seen in section 3, a bundle has a limited scheduling capacity, depending on the hardware scheduling device or register limitations. The GPGPU computing unit must have enough registers to execute all the tasks of a block at the same time, which determines the scheduling limit. The 8800GTX card has a scheduling capacity of 768 threads and 8192 registers. So one must make sure that the number of individuals in a block is not greater than the scheduling capacity, and that there are enough individuals on a bundle in order to maximize this capacity. In this paper, the implemented algorithm spreads the population into n * k blocks, where n is the number of bundles on the GPGPU and k is the integer ceiling of popSize / (n * schedLimit). This simple algorithm yields good results in the tested cases. However, a strategy to automatically adapt block definitions to computation complexity, either by a static or a dynamic approach, needs to be investigated in future work.
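The dispatching rule can be written as the following small sketch (function and parameter names are ours):

    // Spread popSize individuals over n*k blocks, with
    // k = ceil(popSize / (n * schedLimit)).
    int nbBlocks(int popSize, int n /* bundles */, int schedLimit /* threads */) {
        int k = (popSize + n * schedLimit - 1) / (n * schedLimit);   // integer ceiling
        return n * k;
    }

On the 8800GTX (n = 16 bundles, schedLimit = 768 threads), a population of 4,096 children gives k = 1, hence 16 blocks of 256 threads each.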

When a population of children is ready to be evaluated, it is copied onto the GPGPU memory. All the individuals are evaluated with the same evaluation function, and the results (fitnesses) are sent back to the host CPU, which holds the original individuals and manages the populations.

In a standard host CPU EA implementation, an individual is made of a genome plus other information, such as its fitness and bookkeeping data (whether it has recently been evaluated or not, ...). So transferring n genomes one by one to the GPGPU would result in n transfers and individuals scattered all over the GPGPU memory. Such a number of memory transfers would have been unacceptable. It was therefore chosen to ensure spatial locality by making all genomes contiguous in host CPU memory before the transfer, copying every newly created individual into a buffer right after its creation. This buffer is then sent to the GPU memory in one single transfer. Experiments showed that in a particular case, with a large number of children, the transfer time went from 80 seconds with scattered data down to 180 µs with a buffer of contiguous data.
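A minimal sketch of this packing and single transfer, with hypothetical names and genomes of size floats each, could look as follows:

    #include <cuda_runtime.h>

    void sendPopulation(float* const* genomes, int popSize, int size, float* d_pop) {
        // pack every genome into one contiguous host buffer
        float* buffer = new float[(size_t)popSize * size];
        for (int i = 0; i < popSize; i++)
            for (int j = 0; j < size; j++)
                buffer[(size_t)i * size + j] = genomes[i][j];
        // one single transfer instead of popSize scattered ones
        cudaMemcpy(d_pop, buffer, (size_t)popSize * size * sizeof(float),
                   cudaMemcpyHostToDevice);
        delete[] buffer;
    }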

Our implementation uses only global memory since, in the general case, the evaluation function does not generate significant data reuse that would justify the use of the small 16KB shared memory or the texture cache. Indeed, with shared memory, the time saved in data accesses is generally wasted by data transfers between global memory and shared memory. Notice that shared memory is not accessible from the host CPU part of the algorithm, hence one has to first copy data into global memory, and in a second step into shared memory.

The chosen implementation strategy exhibits the main overhead risk as being the time spent transferring the population onto the GPGPU memory. Hence the (computation time)/(transfer time) ratio needs to be large enough to effectively take advantage of the GPGPU card. Experiments on data transfer rate show that on the 8800GTX, a 500 MB/s bandwidth was reached, which is much lower than the advertised 4GB/s maximum bandwidth (at this rate, for instance, transferring 4,096 genomes of 4 kilobytes each takes about 32 ms). However, the experiments presented in the following show that this rate is quite acceptable even for very simple evaluation functions.


5 Experiments

Two implementations have been tested: a toy problem that contains interesting tuneable parameters allowing to observe the behaviour of the GPGPU card, and a much more complex real-world problem, to make sure that the GPGPU processors are also able to run more complex fitness functions. In fact, the 400 code lines of the real-world evaluation function were programmed by a chemist who had not the least idea of how to use a GPGPU card.

5.1 The Weierstrass Benchmark Program

Weierstrass-Mandelbrot test functions, defined as

W_{b,h}(x) = \sum_{i=1}^{\infty} b^{-ih} \sin(b^i x), with b > 1 and 0 < h < 1,

are very interesting to use as a test case of CPU usage in evolutionary computation, since they provide two parameters that can be adjusted independently.

Theory defines Weierstrass functions as an infinite sum of sines. Programmers perform a finite number of iterations to compute an approximation of the function. The number of iterations is closely related to the host CPU time spent in the evaluation function.

Another parameter that can also be adjusted is the dimension of the problem: a 1,000-dimension Weierstrass problem takes 1,000 continuous parameters, meaning that its genome is an array of 1,000 float values, while a 10-dimension problem only takes 10 floats. The 10-dimension problem will evidently take much less time to evaluate than the 1,000-dimension problem. But since evaluation time also depends on the number of iterations, tuning both parameters provides many configurations combining genome size and evaluation time.
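As an illustration of what the GPGPU side looks like, here is a sketch of ours of a CUDA kernel evaluating the truncated sum, one thread per individual, with b = 2 and h = 0.25 as in the EASEA listing of section 6:

    __global__ void evalWeierstrass(const float* pop, float* fitness,
                                    int popSize, int dim, int iter) {
        int ind = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per individual
        if (ind >= popSize) return;
        const float b = 2.f, h = 0.25f;
        float res = 0.f;
        for (int i = 0; i < dim; i++) {
            float x = pop[ind * dim + i], v = 0.f;
            for (int k = 1; k <= iter; k++)               // truncated infinite sum
                v += powf(b, -(float)k * h) * sinf(powf(b, (float)k) * x);
            res += fabsf(v);
        }
        fitness[ind] = res;                               // sent back to the host
    }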

Figure 2 left shows the time taken by the evolutionary algorithm to compute 10 generations on both a 3.6GHz Pentium computer and an 8800GTX GPGPU card, for 1,000 dimensions, 120 iterations and a number of evaluations per generation growing from 16 to 4,096 individuals (number of children = 100% of the population). This represents the total time (including what is serially done on the host CPU: population management, crossovers, mutations, selections, ...) on both architectures (host CPU only and host CPU+GPU).

Fig. 2. Left: Host CPU (top) and CPU+8800GTX (bottom) time for 10 generations of the Weierstrass problem on an increasing population size. Right: CPU+8800GTX curve only, for increasing numbers of iterations and increasing population sizes.

For 4,096 evaluations (×10 generations), the host CPU spends 2,100 seconds while the host CPU + 8800GTX only spends 63 seconds, resulting in a speedup of 33.3.

Figure 2 right shows the same GPGPU curve for different iteration counts. On this second figure, one can see that the 8800GTX card steadily takes in more individuals to evaluate in parallel without much difference in evaluation time, until the threshold of 2,048 individuals is reached, after which it gets saturated. Beyond this value, evaluation time increases linearly with the number of individuals, which is normal since the parallel card is already working at full load. It is interesting to see that with 10 iterations, the curve has nearly the same slope before and after 2,048, meaning that for 10 iterations the time spent in the evaluation function is negligible, so the curve mainly shows the overhead time.

Since using a GPGPU card induces a necessary overhead, it is interesting to determine when it is advantageous to use an 8800GTX card. Figure 3 left shows that on a small problem (a 10-dimension Weierstrass function with 10 iterations, running in virtually no time), this threshold is met between 400 and 600 individuals, depending on whether the genome size is 40 bytes or 4 kilobytes, which is quite a big genome.

The steady line (representing the host CPU) shows an evaluation time slightly shorter than 0.035 milliseconds, which is very short, even on a 3.6GHz computer.

The 3 GPGPU curves show that the size of the genomes indeed has an impact when individuals are passed to the GPGPU card for evaluation. On this figure, evaluation is done on a 10-dimension Weierstrass function, corresponding to a 40-byte genome (the 8800GTX card only accepts floats). The additional genome data is not used on the 2-kilobyte and 4-kilobyte genomes, in order to isolate the time taken to transfer large genomes to the GPGPU for the whole population.

Fig. 3. Left: determination of genome size overhead on a very short evaluation. Right: same curves as figure 2 right, but for the GTX260 card.


On figure 3 right, the same measures are shown with a recently acquired GTX260 NVidia card. One can see that with this card, the total time is only 20 seconds for a population of 5,000 individuals, while the 8800GTX card takes 60 seconds and the 3.6GHz Pentium takes 2,100 seconds. So where the 8800GTX vs host CPU speedup was 33.3, the GTX260 vs host CPU speedup is about 105, which is quite impressive for a card that only costs around $250.00.

5.2 Application to a Real-World Problem

In materials science, knowledge of a material's structure at the atomistic/molecular level is required for any advanced understanding of its performance, due to the intrinsic link between the structure of the material and its useful properties. It is therefore essential that methods to study structures are developed.

Rietveld refinement techniques [10] can be used to extract structural details from an X-Ray powder Diffraction (XRD) pattern [2,1,4], provided an approximate structure is known. However, if a structural model is not available, its determination from powder diffraction data is a non-trivial problem. The structural information contained in the diffracted intensities is obscured by systematic or accidental overlap of reflections in the powder pattern.

As a consequence, the application of structure determination techniques which are very successful for single-crystal data (primarily direct methods) is, in general, limited to simple structures. Here, we focus on inherently complex structures of a special type of crystalline materials whose periodic structure is a 4-connected 3-dimensional net, such as alumino-silicates, silico-alumino-phosphates (SAPO), alumino-phosphates (AlPO), etc.

The genetic algorithm is employed in order to find "correct" locations of T atoms, e.g. from a connectivity point of view. As the distance T−T for bonded atoms lies in a fixed range [dmin, dmax], the connectivity of each new configuration of T atoms can be evaluated. The fitness function corresponds to the number of defects in the structure, and Fitness = f1 + f2 is defined as follows:

1. all T atoms should be linked to 4 and only 4 neighbouring Ts, so: f1 = Abs(4 − Number of Neighbours);

2. no T should be too close, i.e. T−T < dmin, so: f2 = Number of Too Close T Atoms.
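A direct reading of this fitness can be sketched as follows (atom layout, dmin and dmax are problem parameters; this is our illustration, not the chemist's 400-line code):

    #include <cmath>

    struct Atom { float x, y, z; };

    float fitness(const Atom* T, int n, float dmin, float dmax) {
        float defects = 0.f;
        for (int i = 0; i < n; i++) {
            int neighbours = 0, tooClose = 0;
            for (int j = 0; j < n; j++) {
                if (j == i) continue;
                float dx = T[i].x - T[j].x, dy = T[i].y - T[j].y,
                      dz = T[i].z - T[j].z;
                float d = std::sqrt(dx * dx + dy * dy + dz * dz);
                if (d < dmin)       tooClose++;           // f2: T-T too close
                else if (d <= dmax) neighbours++;         // bonded range [dmin, dmax]
            }
            defects += std::fabs(4.f - (float)neighbours) // f1: exactly 4 neighbours
                     + (float)tooClose;                   // f2
        }
        return defects;                                   // 0 for a defect-free net
    }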

Speedup on this chemical problem: As mentioned earlier, the source code came from our chemist co-author, who is not a programming expert (but is nevertheless capable of creating some very complex code) and knows nothing about GPGPU architecture and its use.

First, while the 3.60GHz CPU evaluated 20,000 individuals in only 23 seconds, which seemed really fast considering the very complex evaluation function, the GPGPU version took around 80 seconds, which was disappointing.

When looking at the genome of the individuals, it appeared that it was coded in a strange structure, i.e. an array of 4 pointers towards 4 other arrays of 3 floats.


Fig. 4. Left: evaluation times for increasing population sizes on host CPU (top) and host CPU + GTX260 (bottom). Right: CPU + GTX260 total time.

This structure seemed much too complex to access, so it was suggested to flatten it into a unique array of 12 floats, which was easy to do; unfortunately, the whole evaluation function was made of pointers to parts of the previous genome structure. After some hard work, everything got back to a pointer-less flat code, and the evaluation time for the 20,000 individuals instantly dropped from 80 seconds down to 0.13 seconds. One conclusion to draw from this experience is that, as expected, GPGPUs are not very talented at allocating, copying and de-allocating memory.
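The flattening can be pictured as follows (struct names are ours):

    // before: an array of 4 pointers towards 4 arrays of 3 floats, i.e. five
    // scattered allocations per individual, painful to transfer and to access
    struct GenomeBefore { float* sites[4]; };            // each sites[i] -> float[3]

    // after: a unique flat array of 12 floats, transferable in one block
    struct GenomeAfter  { float coords[12]; };           // coords[3*i + j] == sites[i][j]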

Back on the host CPU, the new function now took 7.66 seconds to evaluate 20,000 individuals, meaning that, all in all, the speedup offered by the GPGPU card is nearly 60 on the new GTX260 (figure 4 left). Figure 4 right shows only the GTX260 curve.

6 EASEA: Evolutionary Algorithm Specification Language

EASEA2 [3] is a software platform that was originally designed to help non-expert programmers try out evolutionary algorithms to optimise their applied problems, without the need to implement a complete algorithm. It was revived to integrate the parallelization of children evaluation on NVidia GPGPU cards.

EASEA is a specification language that comes with its dedicated compiler. Out of the weierstrass.ez specification program shown in fig. 5, the EASEA compiler will output a C++ source file that implements a complete evolutionary algorithm to minimize the Weierstrass function.

The -cuda option has been added to the compiler in order to automatically output source code that will parallelize the evaluation of the children population on a CUDA-compatible NVidia card, out of the same .ez specification code, therefore allowing anyone who wishes to use GPGPU cards to do so without any knowledge of GPGPU programming.

2 EASEA (pron. [i:zi:]) stands for EAsy Specification for Evolutionary Algorithms.


    \User classes :
    GenomeClass {
      float x[SIZE];
    }

    \GenomeClass::mutator :
    for (int i=0; i<N; i++)
      if (tossCoin(pMutProb)) {
        Genome.x[i] += SIGMA*random(0.,1.);
        Genome.x[i] = MAX(X_MIN, MIN(X_MAX, Genome.x[i]));
      }

    \GenomeClass::crossover :
    for (int i=0; i<N; i++) {
      float a = (float)randomLoc(0.,1.);
      if (&child1)
        child1.x[i] = a*parent1.x[i] + (1-a)*parent2.x[i];
    }

    \GenomeClass::initialiser :
    for (int i=0; i<N; i++) {
      Genome.x[i] = random(-1.0,1.0);
    }

    \GenomeClass::evaluator :
    float res = 0., b = 2.;
    float h = .25, val[SIZE];
    int i, k;
    for (i=0; i<N; i++) {
      val[i] = 0.;
      for (k=0; k<ITER; k++)
        val[i] += pow(b,-(float)k*h) * sin(pow(b,(float)k)*Genome.x[i]);
      res += Abs(val[i]);
    }
    return (res);

Fig. 5. Evolutionary algorithm minimizing the Weierstrass function in EASEA

However, a couple of guidelines included in the latest version of the EASEA user manual should be followed, such as using flat genomes rather than pointers towards matrices, fitness functions that do not use more registers than are available on a GPGPU unit, floats rather than doubles3, etc.

3 Although available on all recent GPGPU cards, using double-precision variables apparently slows down the calculations considerably on current GPGPU cards; this has not been tested yet, as all this work was done on an 8800GTX card that can only manipulate floats.

Adding the -cuda option to this compiler is very important, since it not only allows replication of the presented work, but also gives non-GPGPU-expert programmers the possibility to run their own code on these powerful parallel cards.

7 Related Work

Even though many papers have been written on the implementation of Genetic Programming algorithms on GPGPU cards, only three papers were found on the implementation of standard evolutionary algorithms on these cards.

In [11], Yu et al. implement a refined fine-grained algorithm with a 2D toroidal population structure stored as a set of 2D textures, which imposes restrictions on mating individuals (that must be neighbours). Other constraints arise, such as the need to store a matrix of random numbers in GPGPU memory for future reference, since there is no random number generator on the card. Anyway, a 10 times speedup is obtained, but on a huge population of 512 × 512 individuals.

In [7], Fok et al. find that standard genetic algorithms are ill-suited to run on GPGPUs because of operators such as crossover "that would slow down execution when executed on the GPGPU", and therefore choose to implement a crossover-less Evolutionary Programming algorithm [6], here again entirely on the GPGPU card. The obtained speedup of their parallel EP "ranges from 1.25 to 5.02 when the population size is large enough."

In [8], Li et al. implement a Fine-Grained Parallel Genetic Algorithm, once again on the GPGPU, to "avoid massive data transfer." For a strange reason, they implement a binary genetic algorithm even though GPGPUs have no bitwise operators, so they go to a lot of trouble to implement simple genetic operators.

To our knowledge, no paper has proposed the simple approach of only parallelizing the evaluation of the population on the GPGPU card.

8 Conclusion and Future Developments

Results show that deporting the children population onto the GPGPU for a parallel evaluation yields quite significant speedups of up to 100 on a $250 GTX260 card, in spite of the overhead induced by the population transfer.

Being faster by around 2 orders of magnitude is a real breakthrough in evolutionary computation, as it will allow applied scientists to find new results in their domains. Researchers in artificial evolution will then need to modify their algorithms to adapt them to such speeds, which will probably lead to premature convergence, for instance. Moreover, unlike many other works that are difficult (if not impossible) to replicate, the know-how on the parallelization of evolutionary algorithms has been integrated into the EASEA language. Researchers who would like to try out these cards can simply specify their algorithm using EASEA, and the compiler will parallelize the evaluation.

Anyway, many improvements can still be expected. Load balancing could probably be improved, in order to maximize bundle throughput. Using texture cache memory may be interesting for evaluation functions that repeatedly access genome data. Automatic use of shared memory could also yield some good results, particularly on local variables in the evaluation function.

Finally, an attempt to implement evolutionary algorithms on Sony/Toshiba/IBM Cell multicore chips is currently being made. Its integration into the EASEA language would allow comparing the performance of GPGPU and Cell architectures on identical programs.

References

1. Baumes, L.A., Moliner, M., Corma, A.: Design of a full-profile matching solution for high-throughput analysis of multi-phase samples through powder x-ray diffraction. Chemistry - A European Journal (in press)

2. Baumes, L.A., Moliner, M., Nicoloyannis, N., Corma, A.: A reliable methodology for high throughput identification of a mixture of crystallographic phases from powder x-ray diffraction data. CrystEngComm 10, 1321–1324 (2008)

3. Collet, P., Lutton, E., Schoenauer, M., Louchet, J.: Take it EASEA. In: Informatics: 10 Years Back, 10 Years Ahead. LNCS, pp. 891–901. Springer, Heidelberg (2000)

4. Corma, A., Moliner, M., Serra, J.M., Serna, P., Diaz-Cabanas, M.J., Baumes, L.A.: A new mapping/exploration approach for HT synthesis of zeolites. Chemistry of Materials, 3287–3296 (2006)

5. Darwin, C.: On the Origin of Species by Means of Natural Selection or the Preservation of Favoured Races in the Struggle for Life. John Murray, London (1859)

6. Fogel, D.B.: Evolving artificial intelligence. Technical report (1992)

7. Fok, K.-L., Wong, T.-T., Wong, M.-L.: Evolutionary computing on consumer graphics hardware. IEEE Intelligent Systems 22(2), 69–78 (2007)

8. Li, J.-M., Wang, X.-J., He, R.-S., Chi, Z.-X.: An efficient fine-grained parallel genetic algorithm based on GPU-accelerated. In: NPC Workshops, IFIP International Conference on Network and Parallel Computing Workshops 2007, pp. 855–862 (2007)

9. De Jong, K.: Evolutionary Computation: a Unified Approach. MIT Press, Cambridge (2005)

10. Young, R.A.: The Rietveld Method. OUP and International Union of Crystallography (1993)

11. Yu, Q., Chen, C., Pan, Z.: Parallel genetic algorithms on programmable graphics hardware. In: Wang, L., Chen, K., Ong, Y.S. (eds.) ICNC 2005. LNCS, vol. 3612, pp. 1051–1059. Springer, Heidelberg (2005)