
Transcript of Parallel Genetic Simulated Annealing: A Massively Parallel SIMD Algorithm


Parallel Genetic Simulated Annealing: A Massively Parallel SIMD Algorithm

Hao Chen, Student Member, IEEE, Nicholas S. Flann, Member, IEEE Computer Society, and Daniel W. Watson, Member, IEEE Computer Society

Abstract—Many significant engineering and scientific problems involve optimization of some criteria over a combinatorial configuration space. The two methods most often used to solve these problems effectively—simulated annealing (SA) and genetic algorithms (GA)—do not easily lend themselves to massively parallel implementations. Simulated annealing is a naturally serial algorithm, while GA involves a selection process that requires global coordination. This paper introduces a new hybrid algorithm that inherits those aspects of GA that lend themselves to parallelization, and avoids serial bottle-necks of GA approaches by incorporating elements of SA to provide a completely parallel, easily scalable hybrid GA/SA method. This new method, called Genetic Simulated Annealing, does not require parallelization of any problem-specific portions of a serial implementation—existing serial implementations can be incorporated as is. Results of a study on two difficult combinatorial optimization problems, a 100 city traveling salesperson problem and a 24 word, 12 bit error correcting code design problem, performed on a 16K PE MasPar MP-1, indicate advantages over previous parallel GA and SA approaches. One of the key results is that the performance of the algorithm scales up linearly with the increase of processing elements, a feature not demonstrated by any previous parallel GA or SA approaches, which enables the new algorithm to utilize massively parallel architectures with maximum effectiveness. Additionally, the algorithm does not require careful choice of control parameters, a significant advantage over SA and GA.

Index Terms—Genetic algorithms, simulated annealing, parallel algorithms, SIMD, combinatorial optimization, hybrid methods, traveling salesperson, massive parallelism, error correcting codes.


1 INTRODUCTION

Simulated annealing (SA) and genetic algorithms (GA) represent powerful combinatorial optimization methods with complementary strengths and weaknesses. Owing to the large size of problems for which SA and GA are typically used, such as VLSI cell placement [15], scheduling, and protein folding, these algorithms are good candidates for parallel implementation. However, because of the serial nature of SA and some portions of GA, good parallel implementations have been difficult to develop [9]. In this paper, a new method that combines the recombinative power of GA and the local selection of SA (via the annealing schedule) is presented. The new algorithm takes advantage of those aspects of GA that lend themselves to parallelization, and avoids serial bottle-necks of GA approaches by incorporating elements of SA to provide a completely parallel, easily scalable hybrid GA/SA method. Results of a study using the new approach for two difficult combinatorial optimization problems (the 100-city Traveling Salesperson Problem (TSP), and a 12 bit, 24 word error correcting code design problem) are presented, and indicate some significant advantages over previous parallel GA and SA approaches.

Both SA and GA approaches are naturally motivated, general purpose combinatorial optimization methods that share many similarities. Each method requires little knowledge of the problem to be optimized other than a fitness or cost function to be applied to candidate solutions. The two techniques initially begin a search through the space of candidate solutions with randomly generated candidates and, then, they incrementally generate new candidates by applying operators. Each decision determining which candidates are pursued is controlled by a probabilistic decision procedure that guides the method into near-optimal regions of the solution space.

1.1 Simulated Annealing

SA algorithms approach optimization problems by randomly generating a candidate solution and, then, making successive random modifications. A temperature parameter is used to control the acceptance of modifications. Initially, the temperature is at a high value and it is decreased over time. If the modified solution is found to have a better fitness than its predecessor, then it is retained and the previous solution is discarded. If the modified solution is found to be less fit than its predecessor, it is still retained with a probability directly related to the current temperature. As execution of the algorithm continues and the temperature becomes cooler, it becomes less likely that unfavorable solutions are accepted. By using this approach, it is possible for an SA algorithm to move out of local minima early in execution, and more likely that good solutions will not be discarded late in the algorithm's execution.
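For reference, the following is a minimal serial SA loop in C that follows the description above. The cost() and perturb() callbacks stand in for hypothetical problem-specific routines, and the standard Metropolis rule exp(-delta/T) is used as one common realization of "retained with a probability directly related to the current temperature"; none of these names are taken from the paper.

    #include <math.h>
    #include <stdlib.h>
    #include <string.h>

    /* Minimal sketch of a serial SA loop with geometric cooling.  The
       cost() and perturb() callbacks are hypothetical problem-specific
       routines, not the paper's implementation. */
    static double rand01(void) { return (double)rand() / (double)RAND_MAX; }

    double anneal(int *s, size_t n, long max_iter, double t0, double alpha,
                  double (*cost)(const int *, size_t),
                  void   (*perturb)(int *, size_t))
    {
        int *trial = malloc(n * sizeof *trial);
        double t = t0, c = cost(s, n);
        for (long i = 0; i < max_iter; i++) {
            memcpy(trial, s, n * sizeof *trial);
            perturb(trial, n);                      /* random modification      */
            double ct = cost(trial, n);
            double delta = ct - c;                  /* > 0 means an uphill move */
            /* always accept improvements; accept uphill moves with probability
               exp(-delta/t), which shrinks as the temperature is lowered */
            if (delta <= 0.0 || rand01() < exp(-delta / t)) {
                memcpy(s, trial, n * sizeof *s);
                c = ct;
            }
            t *= alpha;                             /* geometric cooling        */
        }
        free(trial);
        return c;
    }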



H. Chen is with SingleTrac Entertainment Inc. E-mail: [email protected].

N.S. Flann and D.W. Watson are with the Department of Computer Science, Utah State University, Logan, UT 84322-4205. E-mail: [email protected], [email protected].

Manuscript received 7 Dec. 1995; revised 25 Nov. 1996.



Although SA algorithms are conceptually simple, finding optimal parameters for SA, i.e., the initial temperature, α (a constant value between 0.0 and 1.0 by which the current temperature is multiplied at fixed intervals to obtain new temperature values), etc., is by no means simple or straightforward. First of all, setting parameters for SA is problem dependent, and it is best accomplished through trial and error. Furthermore, previous studies with SA have demonstrated that SA algorithms are very sensitive to parameters, and their performance is largely dependent on fine tuning of the parameters [12]. The problem dependent nature of setting parameters for SA and SA algorithms' sensitivity to parameters limit the effectiveness and robustness of SA algorithms.

Because there is a single solution that is modified over time, the SA algorithm is naturally serial and difficult to implement on parallel systems with appreciable speedup. Many approaches have been tried without fundamentally changing the serial nature of the SA algorithm, with some limited success [2], [27]. There are three major approaches to parallelizing SA: serial-like, altered generation, and asynchronous [9]. Serial-like algorithms seek to partition the problem of evaluating the candidates, applying the operators, or accepting the modifications into independent subtasks that can be computed in parallel. Such fine-grained decomposition is often ineffective, because communication costs can outweigh any improvement. Results demonstrate speedups of up to three, regardless of the number of processors. Altered generation approaches change the algorithm so that multiple candidates and multiple modifications are considered in parallel. Often, these methods rely on domain specific decompositions which limit their applicability. However, speedups of up to eight on a 30 processor system have been reported [6]. Asynchronous algorithms perform modifications in parallel, where each processor ignores modifications performed by the other processors. This method induces errors in cost calculations, but with the advantage of reduced communication. Speedups have approached 50 for some specialized problems, making the asynchronous approach one of the most effective for parallelizing SA [5]. It is important to note that, with the exception of [27], all parallel implementations of SA require extensive recoding of the serial domain dependent implementations.

1.2 Genetic Algorithms

By contrast to SA, in a GA approach to solving combinatorial optimization problems, a population of candidate solutions is maintained. To generate a new population, candidate solutions are randomly paired. For each pair of solutions, a crossover operator is first applied with a moderate probability (crossover rate) and two new solutions are generated. Each new solution is then modified using a mutation operator with a small probability (mutation rate). The resulting two new solutions replace their parents in the old population, forming a temporary new population. Each solution in the temporary population is ranked against other solutions based on the fitness. A lottery process is then used to determine a new population identical in size to the previous population, such that higher ranked candidates are allowed to assume more places in the new population. GA algorithms iterate over a large number of generations (new populations) and, in general, as the algorithm executes, solutions in the population become more fit, resulting in better candidate solutions.
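As a point of reference, here is a compact serial sketch in C of one such generation. The crossover(), mutate(), and fitness() callbacks are hypothetical placeholders, and the lottery is realized here as fitness-proportional sampling, one common implementation of the rank-based lottery described above; none of this is taken from the paper.

    #include <stdlib.h>

    /* One GA generation: pair candidates, apply crossover with probability
       crossover_rate and mutation with probability mutation_rate, then fill
       the next population by a fitness-proportional lottery.  Fitness values
       are assumed non-negative; rand_r is the POSIX re-entrant generator. */
    typedef struct { int *genes; double fit; } indiv_t;

    static double rand01(unsigned *seed) { return (double)rand_r(seed) / RAND_MAX; }

    void ga_generation(indiv_t *pop, int pop_size, double crossover_rate,
                       double mutation_rate, unsigned *seed,
                       void (*crossover)(indiv_t *, indiv_t *),
                       void (*mutate)(indiv_t *, double),
                       double (*fitness)(const indiv_t *),
                       indiv_t *next)               /* next has pop_size slots */
    {
        /* children replace their parents in place, forming the temporary population */
        for (int i = 0; i + 1 < pop_size; i += 2) {
            if (rand01(seed) < crossover_rate)
                crossover(&pop[i], &pop[i + 1]);
            mutate(&pop[i],     mutation_rate);
            mutate(&pop[i + 1], mutation_rate);
        }
        double total = 0.0;
        for (int i = 0; i < pop_size; i++) {
            pop[i].fit = fitness(&pop[i]);
            total += pop[i].fit;
        }
        /* lottery: fitter candidates win more slots in the new population */
        for (int k = 0; k < pop_size; k++) {
            double r = rand01(seed) * total;
            int pick = pop_size - 1;
            for (int i = 0; i < pop_size; i++) {
                r -= pop[i].fit;
                if (r <= 0.0) { pick = i; break; }
            }
            next[k] = pop[pick];   /* shallow copy; a full implementation would copy genes */
        }
    }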

Like SA, GA is also sensitive to parameters [12]. Adjustments in population size, crossover probability, mutation rate, and other parameters can influence GA's behavior and lead to significantly superior (or inferior) results. Such adjustments are problem dependent and are best accomplished through trial and error. The difficulty and uncertainty of finding optimal parameters for GA is a significant problem for GA.

GA is better suited towards implementation on parallel architectures because the population of candidate solutions can be easily distributed across processors in the system. Additionally, the operations performed for each member of the population are nearly identical across all candidates. One of the challenges of parallelizing GA is in the selection of the new population. Because the new candidates, generated from crossover-mutation operations, must be ranked and globally shared before a new population can be chosen, the selection process represents a serial component of the algorithm. Memory and communication limitations for processors in massively parallel systems can compound the problem; for large populations distributed among processors in a parallel system, there is often insufficient memory to completely replicate the population for each processor, making implementation of the lottery process difficult. This serial bottle-neck can be minimized, but not eliminated, by partitioning the population into subpopulations, where crossover and selection occur independent of other subpopulations. To prevent stagnation of each subpopulation, individuals are exchanged between the subpopulations periodically [23], [7], [22].

1.3 Comparing GA and SA

Although GA and SA are similar, there are some important differences between the two methods. One important distinction is that SA possesses a formal proof of convergence to the global optimum, which GA does not have [25]. This convergence proof relies on a very slow cooling schedule of setting the temperature Ti = T0/log i, where i is bounded by the number of iterations and T0 is sufficiently large [12]. While this cooling schedule is impractical, it identifies a useful trade-off where longer cooling schedules tend to lead to better quality solutions. There is no such control parameter in GA, and premature convergence to a local optimum is a significant problem.

Another difference between GA and SA is that SA accepts newly generated candidate solutions probabilistically based on their fitness and only accepts inferior candidates some of the time. This is in contrast to GA, where new candidates are always accepted, even if they are significantly inferior to older candidates. This characteristic can lead to disruption, where good candidates are lost or damaged, preventing optimal performance.

Despite these shortcomings, there are some distinct advantages of GA over SA. GA maintains a population of candidate solutions while SA maintains only one solution.


This has many significant impacts on how the solution space is searched. GA can retain useful redundant information about what it has learned from previous searches by its representation in individual solutions in the population [14]. Critical components of past good solutions can be captured, which can be combined together via crossover to form high quality solutions. SA, on the other hand, retains only one solution in the space and exploration is limited to the immediate neighborhood.

Finally, and most significantly in the context of this study, many portions of GA are more suitable for implementation on parallel architectures, and can be adapted to run on massively parallel computers [7], while SA is an inherently serial algorithm and attempts to parallelize it have led to relatively small speedups [9]. Moreover, some GA implementations have been shown to exhibit the appearance of super-linear speedup over serial versions of the algorithm [11], [18].

1.4 Hybrid GA/SA Methods

GA and SA have complementary strengths and weaknesses. While GA exhibits parallelism and is better suited to implementation on massively parallel architectures, it suffers from poor convergence properties and a serial bottle-neck due to global selection. SA, by contrast, has good convergence properties, but it cannot easily exploit parallelism. However, SA does employ a completely local selection strategy where the current candidate and the new modification are evaluated and compared.

In this paper, a hybrid method that combines the recombinative power of GA (via the crossover operator) and the annealing schedule of SA is presented. The new method, referred to as Genetic Simulated Annealing or GSA, has been shown to overcome the poor convergence properties of GA and perform better than either GA or SA alone in a recent study with ten difficult optimization problems [3].

The method is related to other recent hybrid methods described in [17], [18], [19], and [20]. Of particular interest is the so-called Cellular Genetic Algorithm [28]. Cellular GA is similar to GSA in several ways. They both maintain one solution per Processing Element (PE). In both algorithms, each PE accepts a visiting solution from other PEs for crossover and mutation. However, there are some key differences between Cellular GA and GSA. In Cellular GA, a PE selects the best solution of its immediate neighbors for crossover and mutation. To avoid serialization, all of the PEs' neighbors must be transferred and stored, followed by local comparisons to determine the best individual for mating. In GSA, all PEs receive a visiting solution from the same direction, and mating is not limited to the immediate neighborhood. In Cellular GA, a greedy selection is used where the offspring replaces the original individual if it has a higher fitness. In GSA, an SA-type probabilistic selection procedure is used. These differences affect the convergence properties of both algorithms. While GSA can be shown to retain the proof of convergence of SA (see [18]), Cellular GA has no such proof and can suffer from the premature convergence problems of standard GA algorithms.

The principal goal of this paper is to demonstrate that GSA enables a powerful, massively parallel implementation that overcomes the inherent weaknesses of parallel implementations of both GA and SA. By exploiting the population-based model and crossover-mutation operator of GA, the algorithm is naturally parallel. By exploiting the local selection strategy of SA, the processing bottle-neck of global selection is eliminated. In GSA, selection decisions are made locally, and information sharing is performed regionally. An important aspect of the approach developed in this study is that, although the formation of the solution is determined in parallel, the individual crossover/mutation and fitness functions are replicated, independent copies of serial algorithms. This is a significant contribution of the study, because none of the problem-specific portions of the algorithm require parallelization; existing serial implementations can be incorporated as is. Only non-problem-specific portions of the algorithm require careful parallelization. The algorithm implemented for one application can be easily modified to function for other applications.

The rest of the paper is organized as follows. First, we provide details on the target architecture, the MasPar MP-1 machine. Second, the GSA algorithm is introduced and its implementation on a MasPar described in detail. Third, the two problem domains, the traveling salesperson (TSP) and error correcting code design, are given. Then, an empirical study is presented where GSA is applied under varying conditions to demonstrate how the performance of the method is affected by parallelism and by the choice of parameters.

1.4.1 Target Architecture: 16K Processor MasPar MP-1

The target architecture for implementation of the algorithm is the 16K processor MasPar MP-1 at the Parallel Processing Laboratory at the Purdue University School of Electrical Engineering. The MasPar MP-1 [1] is an SIMD (Single Instruction stream—Multiple Data stream) machine with processing elements (PEs—processor/memory pairs) arranged in a 128 by 128 toroidal grid. Each PE contains 16K of local memory and has connections to each of its eight nearest neighbors.

Instructions on the MasPar are broadcast to the PEs from the Array Control Unit (ACU). PEs can be enabled and disabled for individual instructions, but PEs can only execute instructions broadcast by the ACU; i.e., there is a single thread of control. One result of having a single instruction stream is the introduction of serialization for conditional statements [26].

This effect is especially pronounced for if-then-else statements where the branch condition is based on local PE data. Only those active PEs for which the condition is true can be enabled while the instructions for the "then" clause are broadcast from the ACU. Conversely, only those active PEs for which the condition is false can be enabled for the "else" clause. Because there is only a single thread of control, the "then" and "else" clauses cannot be executed concurrently. To reduce the impact of this effect, the use of conditional clauses in the GSA algorithm is minimized.

Interconnections among the processors are accomplished with the use of an Xnet interconnection scheme, illustrated in Fig. 1. PEs in the figure, represented by boxes, can send and receive data with each of eight nearest neighbor PEs, accomplished transparently to the user via the switching elements (dots in the figure) in the network.


Inter-PE transfers are performed implicitly in code by referencing another PE in one of the eight cardinal directions via an XNet call. For example, the call XNetNW(3).shmoo would reference the parallel expression shmoo on the third PE in the northwest direction for all enabled PEs.

One benefit of implicit PE synchronization becomes apparent when inter-PE data transfers are needed. When one PE sends data to another PE, all enabled PEs send data. Therefore, the send and receive commands are implicitly synchronized. Because all enabled PEs follow the same single instruction stream, each PE knows from which PE the message has been received and for what use the message is intended. As a result, no explicit synchronization and identification protocols are needed for each inter-PE transfer.

An alternative to the SIMD mode of parallel processing is MIMD (Multiple Instruction stream, Multiple Data stream) mode. MIMD machines have no ACU—each processor contains an independent set of instructions (i.e., a program) in its own local memory. Typically, the set of instructions is the same for each processor, although there is in general no restriction against having different programs on different processors. MIMD machines have been used with success on genetic programming problems (e.g., [25], [22], [4]). The genetic simulated annealing method introduced in this paper could be adapted to run on an MIMD machine with little modification. The approach would maintain small populations of solutions on each machine, then perform periodic exchanges between the populations. This is similar to the parallel GA implementation described in [23].

2 GENETIC SIMULATED ANNEALING

Genetic Simulated Annealing is a naturally parallel algorithm that maps well onto SIMD machines such as the MasPar MP-1. The main algorithm is given in Table 1, and is run on each PE on the MasPar. There are three input parameters: max_iteration, which is the number of attempted modifications to the solution; random_seed, which is a global constant that ensures communication-free coordination among the PEs; and max_distance, which determines the maximum distance in the two dimensional grid of PEs for which information can be exchanged. The max_distance is typically chosen to limit sharing of information to regions local to each PE. The cooling schedule is initialized using a random technique described below by setting the parameters temperature and α. Then each PE creates a unique random solution, referred to as the resident solution, which is iteratively improved for max_iteration iterations. To terminate, the solution with the minimum value, discovered over all PEs, is returned.

During each iteration, each PE first generates a direction and distance in the PE grid to obtain a pairing from another PE for its resident solution. Note that, because each PE uses the same random seed, each PE generates exactly the same direction and distance as all other PEs for that iteration. This enables coordination among the PEs without explicit communication and ensures that each PE chooses a unique pairing candidate. Additionally, when the pairing candidate is then accessed via the XNet call, no PE will have more than one "send" and "receive" operation, preventing delay due to idle PEs. The crossover_mutation operator uses the resident candidate r and the visiting candidate v to generate two new solutions n0 and n1 via a crossover operator followed by mutation. These operators can be domain independent, as in the case for error correcting code design, or domain specific, as in the case of the TSP problem and chip layout problems. Details of the TSP and ECC code problems are provided in Section 3.1.1 and Section 3.1.3.
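The communication-free pairing can be illustrated with a small serial C sketch (not the paper's MPL code): because every PE advances an identically seeded generator, all PEs compute the same (direction, distance) offset in a given iteration, and since that offset is a fixed translation on the 128 by 128 torus, the pairing is a bijection: each PE reads exactly one visitor and is read by exactly one other PE. The helper names and offset table below are illustrative assumptions.

    #include <stdlib.h>

    #define GRID 128   /* the MP-1's 128 x 128 toroidal PE grid */

    /* offsets for the eight XNet directions: N, NE, E, SE, S, SW, W, NW */
    static const int DROW[8] = { -1, -1,  0,  1, 1,  1,  0, -1 };
    static const int DCOL[8] = {  0,  1,  1,  1, 0, -1, -1, -1 };

    /* Every PE calls this with its own copy of the seed, but all copies are
       initialized to the same shared value, so all PEs draw the same
       direction and distance each iteration (POSIX rand_r keeps the random
       state explicit). */
    void partner_of(int row, int col, unsigned *seed, int max_distance,
                    int *prow, int *pcol)
    {
        int dir  = rand_r(seed) % 8;                  /* same on every PE */
        int dist = 1 + rand_r(seed) % max_distance;   /* same on every PE */
        *prow = ((row + dist * DROW[dir]) % GRID + GRID) % GRID;
        *pcol = ((col + dist * DCOL[dir]) % GRID + GRID) % GRID;
    }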

The selection process chooses, from the current resident candidate and the new candidates, a single solution that replaces the resident candidate. Note that this selection process uses only locally available information to make its decision—no global coordination or ranking is required. The process is adapted from the one used in standard SA.

The standard stochastic selection criterion is modified for use in the GSA algorithm in two ways. First, a deterministic selection criterion is used, where a new candidate is accepted if the cost increase is less than or equal to the current temperature. This is implemented in function f(sn, so, temperature), given in Table 2.

Fig. 1. The connection grid of PEs in the MasPar MP-1 used in this study. The boxes represent PEs, the grids represent the XNet local connection grid.

TABLE 1
GSA ALGORITHM RUNNING ON EACH PE OF A MASPAR MACHINE

    begin
        temperature := initial_temperature();
        r := random_solution();
        for i := 1 to max_iteration do
        begin
            direction := random(0, 7, random_seed);
            distance := random(1, max_distance, random_seed);
            v := XNet_direction(distance).r;
            {n0, n1} := crossover_mutation(r, v);
            r := select(r, n0, n1, temperature);
            temperature := temperature × α;
        end
        return r
    end


This deterministic criterion [21] is much more efficient to implement and has been found empirically to perform equivalently to the more expensive stochastic criterion. Second, because two parent and two children candidates are used in the selection, the selection function f(sn, so, temperature) must be applied multiple times to identify the single surviving candidate. This process is represented in Table 3. In the selection function, the child candidates n0 and n1 are compared with each other, then with the resident candidate r. Transitivity of acceptance is then applied to determine a single winning candidate. For example, consider line 4 in Table 3, where r is accepted over n0, but n1 is accepted over both r and n0. Hence, n1 is accepted overall and returned. Where transitivity does not uniquely determine a candidate, one is chosen arbitrarily according to the table. This situation only occurs at high temperature, where the effect of temperature dominates, rather than individual solution cost. Note that the need for conditional execution and its resulting inefficiency is minimized. Only f() contains conditional code.
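The acceptance test of Table 2 and the transitivity rules of Table 3 can be expressed compactly; the following illustrative C sketch (not the paper's MPL code) encodes exactly the eight rows of Table 3, with cost() standing in for the problem-specific cost function.

    /* Deterministic acceptance (Table 2): accept the new solution sn over
       the old solution so if its cost increase does not exceed the current
       temperature. */
    typedef const void *sol_t;

    static sol_t f(sol_t sn, sol_t so, double temperature,
                   double (*cost)(sol_t))
    {
        return (cost(sn) - cost(so) <= temperature) ? sn : so;
    }

    /* Selection (Table 3): apply f() to the three pairings used in the
       table and decode the surviving candidate by transitivity. */
    sol_t select_survivor(sol_t r, sol_t n0, sol_t n1, double temperature,
                          double (*cost)(sol_t))
    {
        sol_t w1 = f(r,  n0, temperature, cost);
        sol_t w2 = f(r,  n1, temperature, cost);
        sol_t w3 = f(n0, n1, temperature, cost);

        if (w1 == r  && w2 == r)  return r;                    /* rows 1-2 */
        if (w1 == r  && w2 == n1) return (w3 == n0) ? r : n1;  /* rows 3-4 */
        if (w1 == n0 && w2 == r)  return (w3 == n0) ? n0 : r;  /* rows 5-6 */
        return w3;                                             /* rows 7-8 */
    }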

The final step of the iteration loop is to reduce the temperature. A geometric reduction in temperature is chosen here, where α is some value between 0.0 and 1.0. While not guaranteed to produce convergence to the optimal solution, it has been shown to produce near-optimal solutions quickly [12]. In traditional SA, the initial temperature and α are set through a trial and error process. Previous studies, described in [4], demonstrate that cooling schedules can be set using a random process when population sizes are large and result in superior performance. The initial temperature is computed in initial_temperature(), which samples the effect of operators in the domain and returns a temperature such that the initial probability of an "uphill" change being accepted is between 0.5 and 1.0. The α parameter is calculated from a randomly generated final temperature, the initial temperature, and the max_iteration parameter.
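To make this derivation concrete, the sketch below (illustrative C, not the paper's MPL code) derives the initial temperature, final temperature, and α from target acceptance probabilities, under the common assumption that an average uphill move of size avg_uphill is accepted with probability exp(-avg_uphill/T); the paper estimates the average uphill cost by sampling operators but does not spell out the exact conversion, so the formulas and names here are assumptions.

    #include <math.h>

    typedef struct { double t0, tf, alpha; } schedule_t;

    /* Derive a geometric cooling schedule from target acceptance
       probabilities for an average uphill move of size avg_uphill. */
    schedule_t make_schedule(double avg_uphill, double p_initial,
                             double p_final, long max_iteration)
    {
        schedule_t s;
        s.t0 = -avg_uphill / log(p_initial);   /* e.g., p_initial = 0.97  */
        s.tf = -avg_uphill / log(p_final);     /* e.g., p_final  = 1e-10  */
        /* geometric cooling: tf = t0 * alpha^max_iteration               */
        s.alpha = pow(s.tf / s.t0, 1.0 / (double)max_iteration);
        return s;
    }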

3 EMPIRICAL STUDY

The previous sections defined the GSA algorithm and described its parallel implementation. In the following sections, we present experiments designed to explore several interesting aspects of GSA under a realistic massively parallel environment. The algorithm is implemented in MPL, a language based on ANSI C with extensions to allow data parallel programming, and is executed on a MasPar MP-1.

3.1 Experimental Design

The effectiveness of GSA depends on:

1) Whether or not GSA can take advantage of GA and SA and overcome their weaknesses;

2) Whether or not GSA can utilize massively parallel architecture with high efficiency;

3) Whether GSA can be effective across a diversity of different tasks.

A serial implementation of GSA has been found to be superior to both GA and SA in an empirical study [3]. Therefore, this study focuses on aspects of GSA that distinguish it from GA, SA, and GSA's serial counterpart.

Previous studies with SA and GA have demonstrated that the performance of these algorithms is largely dependent on fine tuning of the parameters [12]. Although techniques or "rules of thumb" exist regarding the initialization of parameters, in most cases, finding the optimal set of parameters remains problem dependent and is accomplished through trial and error. Usually, the effect of parameter adjustments is not known until the algorithm terminates, making the process extremely time-consuming and making the algorithms less robust. Furthermore, the need for parameter tuning for each problem domain makes comprehensive studies among optimization methods difficult. A significant step, therefore, is to design algorithms that are equally effective, yet require no parameter optimization. Experiment Set A explores the potential of GSA to eliminate all parameters that are required to be initialized before execution (e.g., initial temperature and cooling factors).

In Experiment Set A, two annealing schedules are compared. The first schedule uses a variation of techniques introduced in [4]. To precompute the initial temperature, one parallel operator is applied to a randomly generated population and the average uphill cost is computed. The initial temperature is then set such that the initial acceptance probability for an average uphill move is 0.97. The final temperature is set similarly, with a final acceptance probability of 10^-10.

TABLE 2
ALGORITHM FOR SELECTING BETWEEN A PARENT SOLUTION so AND CHILD SOLUTION sn, GIVEN THE TEMPERATURE

    f(sn, so, temperature)
        if Cost(sn) - Cost(so) ≤ temperature
            then return sn
            else return so

TABLE 3
ALGORITHM SELECT, FOR CHOOSING THE SURVIVING SOLUTION GIVEN THE TWO PARENT SOLUTIONS, THE TWO CHILDREN SOLUTIONS, AND THE CURRENT TEMPERATURE

    select(r, n0, n1, temperature)

    row   f(r, n0, temperature)   f(r, n1, temperature)   f(n0, n1, temperature)   return
     1             r                       r                       n0                r
     2             r                       r                       n1                r
     3             r                       n1                      n0                r
     4             r                       n1                      n1                n1
     5             n0                      r                       n0                n0
     6             n0                      r                       n1                r
     7             n0                      n1                      n0                n0
     8             n0                      n1                      n1                n1


Temperature is then lowered after each parallel operator is applied, with the parameter α set such that the final temperature is reached at algorithm termination. This approach is termed the Uniform Temperature schedule, because the initial temperature, final temperature, and α are replicated across all PEs. This established technique is compared with a different approach, where the initial temperature and final temperature are set with plural random values (where each PE has a different value). The parameter α is calculated from the initial and final temperature for each PE as in the Uniform Temperature schedule. For this approach, the initial temperatures are determined by random initial acceptance probabilities between 10^-10 and 1.0, and the final temperatures are set with random final acceptance probabilities between 0.0 and 10^-10. This approach is termed the Random Temperature schedule. To ensure accuracy of the findings, the two annealing schedules are compared with different population sizes of 16, 64, 256, 1K, 4K, and 16K.
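For concreteness, the per-PE draw for the Random Temperature schedule might look like the illustrative C sketch below (again not the paper's MPL code), reusing the same exp(-delta/T) mapping assumed in the earlier schedule sketch; the probability ranges follow the description above, and the small clamps that avoid log(1.0) and log(0.0) are added only for numerical safety.

    #include <math.h>
    #include <stdlib.h>

    typedef struct { double t0, tf, alpha; } pe_schedule_t;

    /* Each PE draws its own initial and final acceptance probabilities and
       derives its own t0, tf, and alpha; seed points at this PE's private
       random state (POSIX rand_r). */
    pe_schedule_t random_schedule(double avg_uphill, long max_iteration,
                                  unsigned *seed)
    {
        double u0 = (double)rand_r(seed) / RAND_MAX;
        double uf = (double)rand_r(seed) / RAND_MAX;
        double p0 = 1.0e-10 + u0 * (0.999999 - 1.0e-10);  /* initial accept prob. in (1e-10, 1) */
        double pf = 1.0e-300 + uf * 1.0e-10;              /* final accept prob. in (0, 1e-10)   */

        pe_schedule_t s;
        s.t0 = -avg_uphill / log(p0);
        s.tf = -avg_uphill / log(pf);
        s.alpha = pow(s.tf / s.t0, 1.0 / (double)max_iteration);
        return s;
    }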

The second aspect of GSA addressed by this empirical study is to determine how the performance of GSA scales up with an increase in the number of PEs (identical to the population size in GSA). The primary advantage of a massively parallel architecture is the large number of PEs; therefore, it is significant for a parallel algorithm to demonstrate performance that improves with the number of processors. GSA makes progress by not only allowing each individual solution to evolve independently, but also by combining partial solutions (building blocks) to form better solutions via the visiting solution and crossover operator. It is clear that the number of possible combinations of partial solutions increases exponentially with the increase of population size, thus allowing GSA to search a significantly larger portion of the solution space. A second advantage of increasing the population size is the preservation of diversity among the candidate solutions in the population, which is crucial to the effectiveness of recombination. Experiment Set B seeks to answer a critical question about GSA: whether better performance and/or shorter execution time can be achieved by simply adding processors.

In Experiment Set B, the number of PEs used for computation was varied from 256 to 16K. Each population size is tested until a given percentage of the global optimal is reached. The number of parallel operators required to reach that percentage is then compared across population sizes. The objective of these experiments is to determine whether or not GSAs with larger population sizes can reach the same percentage of the global optimal with fewer parallel operations.

The experiment is repeated for two quite different domains. The first domain is the well known standard, the traveling salesperson problem. The other domain is the design of optimal error correcting codes. Both problems are known to be NP-complete and therefore have no reasonable algorithm that can guarantee success. The rest of this section describes the two test problems and their implementation in GSA.

3.1.1 Task One: Traveling Salesperson Problem

The Traveling Salesperson Problem (TSP) is a combinatorial optimization problem often used in evaluating the performance of stochastic search algorithms such as SA, GA, and, in this case, GSA. Although the definition of TSP is very simple, it is an NP-complete problem and has numerous locally optimal solutions. The TSP is defined as follows:

Given: A set of cities at different locations on a 2D plane,

Find: A tour through all of the cities, visiting each exactly once, and returning to the originating city, such that the total distance traveled is minimized.

Fig. 2 illustrates Krolak's 100 city Traveling Salesperson Problem [16] with the known global optimal solution. It is used as the test problem in this study.

3.1.2 Implementation Components

Crossover Operator: The technique employed in the study is the Partial Matched Crossover (PMX) [10]. Notice that a standard two point crossover (simply exchanging sections of parent solutions) would be almost guaranteed to generate illegal solutions (duplicate cities and missing cities in a tour). PMX ensures that all child solutions generated are legal and still permits the exchange of partial ordering between pairs of parents. A code sketch of these components appears at the end of this subsection. Consider the following example from [10], for a 10 city problem:

o0 = 9 8 4 | 5 6 7 | 1 3 2 0,

o1 = 8 7 1 | 2 3 0 | 9 5 4 6,

where o0 and o1 are two alternative tours through the cities, numbered 0 through 9. The section bound by the two |s is the crossover section, which is randomly chosen. The PMX operator performs position-wise exchanges, starting from the first city in the crossover section (i.e., exchange 5 and 2). Tour o0 is then searched for the position of city 2 and it is replaced with 5. Similarly, tour o1 is searched for the position of 5 and it is replaced with 2. This "exchange" and "replace" continues until the end of the crossover section is reached. Two new solutions, n0 and n1, are generated as the result:

n0 = 9 8 4 | 2 3 0 | 1 6 5 7

n1 = 8 0 1 | 5 6 7 | 9 2 4 3.

Mutation Operator: is applied to the two new solutions generated after a crossover. In a mutation, a subsequence of the tour is chosen and the ordering reversed.

Fig. 2. The 100 city TSP problem with the optimal solution used to evaluate the GSA algorithm.


Optimization Criteria: simply computes the complete Euclidean distance of the whole tour, returning to the starting city.

It is acknowledged that the operators implemented here are not the only operators possible for this problem, and individual improvements may be possible by implementing different variations of the operators.
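To make the three components above concrete, the following is an illustrative serial C sketch (not the paper's MPL implementation) of PMX, the inversion mutation, and the tour-length cost; the function names are assumptions made for this sketch.

    #include <math.h>
    #include <stddef.h>

    /* --- Crossover: Partial Matched Crossover (PMX) on city tours ------- */
    /* The donor supplies the crossover section, recv supplies the rest; the
       child starts as a copy of recv and is repaired by position swaps so
       that it always remains a legal tour.  Positions cut1..cut2-1 form the
       crossover section. */
    static void pmx_child(const int *donor, const int *recv, int *child,
                          size_t n, size_t cut1, size_t cut2)
    {
        for (size_t i = 0; i < n; i++) child[i] = recv[i];
        for (size_t i = cut1; i < cut2; i++) {
            int city = donor[i];
            size_t j = 0;
            while (child[j] != city) j++;    /* locate the city in the child */
            int tmp = child[i];              /* swap it into position i      */
            child[i] = child[j];
            child[j] = tmp;
        }
    }

    void pmx(const int *p0, const int *p1, int *c0, int *c1,
             size_t n, size_t cut1, size_t cut2)
    {
        pmx_child(p1, p0, c0, n, cut1, cut2); /* c0: p0 repaired with p1's section */
        pmx_child(p0, p1, c1, n, cut1, cut2); /* c1: p1 repaired with p0's section */
    }

    /* --- Mutation: reverse the ordering of a subsequence of the tour ---- */
    void mutate_reverse(int *tour, size_t i, size_t j)   /* requires i <= j */
    {
        while (i < j) {
            int tmp = tour[i]; tour[i] = tour[j]; tour[j] = tmp;
            i++; j--;
        }
    }

    /* --- Cost: total Euclidean length of the closed tour ---------------- */
    double tour_length(const int *tour, const double *x, const double *y,
                       size_t n)
    {
        double len = 0.0;
        for (size_t i = 0; i < n; i++) {
            int a = tour[i], b = tour[(i + 1) % n];    /* wrap to the start */
            len += hypot(x[a] - x[b], y[a] - y[b]);
        }
        return len;
    }

Applied to the 10 city example above with the crossover section at (0-indexed) positions 3 through 5, pmx(o0, o1, n0, n1, 10, 3, 6) reproduces exactly the two child tours n0 and n1 shown.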

3.1.3 Task Two: Error Correcting Code Design

The Error Correcting Code design (ECC) is a practical combinatorial optimization problem that has been attempted by stochastic search algorithms such as SA [8] and GA [13]. Discovering good codes is of fundamental importance for transmitting messages through a noisy channel as reliably and quickly as possible [13]. An ECC is an assignment of codewords to an alphabet that both minimizes the length of transmitted messages which use the codewords and provides maximal correction of single uncorrelated bit errors when such messages are transmitted through a noisy channel. The assignment of codewords to an alphabet is known to both the transmitter and receiver. To transmit a message, each character in the alphabet is replaced with its code. To receive and reconstruct a message, each received word is mapped to a character by matching against all words in the codebook and choosing the character with the closest match.

There are many kinds of error correcting codes; in this study we solve for binary linear block codes. The design problem is difficult because there are two conflicting requirements. First, the code words should be as short as possible to ensure rapid communication. Second, the Hamming distance between each codeword must be maximized with all other codewords to ensure good error correction. A block error correction code can be represented as a three-tuple, (n, M, d), where n is the number of bits in each word in a code; M is the number of words in the code; and d is the minimum Hamming distance between any pair of words in the code. The maximum number of single bit errors that a code word with distance d can correct is (d - 1)/2, rounded down. An optimal code is one that maximizes d, given n and M. There is no known algorithm that can find an optimal code. If exhaustive search is used, then the search space is (2^n)^M. For a code with n = 12 and M = 24, the search space will be 4096^24, approximately 10^87. This is the problem we solve in this study.

The ECC problem is defined as follows:

Given: A value for n and M.

Find: A set of M binary words, each of length n, such that the minimum Hamming distance between each word and all other words is maximized.

The optimal solution found by GSA for this ECC design problem is illustrated in Fig. 3.

3.1.4 Implementation Components

Crossover Operator: In this domain, each solution is represented as a large binary word of length n × M. Hence, the standard domain independent two-point crossover can be used.

Mutation Operator: is the standard domain independent operator that randomly flips each bit in the solution based on the mutation-rate probability.

Optimization Criteria: could be simply the minimum Hamming distance d taken over all pairs of distinct codewords. However, this value will only depend upon a few words and provides very little information as to the progress towards the optimal. A better approach is to measure how well the M words are placed in the corners of n-dimensional space [13] by considering the minimal energy configuration of M particles (where d_ij is the Hamming distance between words i and j):

    Cost(C) = \sum_{i=1}^{M} \sum_{j=1;\, j \neq i}^{M} \frac{1}{d_{ij}^{2}}
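An illustrative C sketch of this cost (not the paper's MPL code) is shown below; each codeword is held in the low n bits of an unsigned integer, and hamming() and ecc_cost() are names chosen for this sketch.

    #include <stddef.h>

    /* Hamming distance between two codewords stored as bit patterns. */
    static int hamming(unsigned a, unsigned b)
    {
        unsigned x = a ^ b;
        int d = 0;
        while (x) { d += (int)(x & 1u); x >>= 1; }
        return d;
    }

    /* Sum of 1/d_ij^2 over all ordered pairs of distinct words; lower cost
       means better-separated codewords. */
    double ecc_cost(const unsigned *words, size_t m)
    {
        double cost = 0.0;
        for (size_t i = 0; i < m; i++)
            for (size_t j = 0; j < m; j++) {
                if (i == j) continue;
                int d = hamming(words[i], words[j]);
                if (d == 0) return 1.0e30;   /* duplicate words: treat as worst case */
                cost += 1.0 / ((double)d * (double)d);
            }
        return cost;
    }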

3.2 Results

3.2.1 Experiment Set A

The results for Experiment Set A are summarized in Fig. 4, which shows the best of population solution as a function of the number of parallel operators applied. Several interesting observations can be made. First, it is clear that the Random Temperature approach produces results similar to or better than the Uniform Temperature scheme, regardless of the population size. Second, the Random Temperature GSA makes steady progress immediately after the algorithm starts, whereas the Uniform Temperature GSA makes little progress during the initial period due to the high temperature at the beginning of the run.

[Fig. 3 content: the 24 codewords of length 12 found by GSA, together with the matrix of their pairwise Hamming distances; every pair of distinct words is at distance 6 or 12.]

Fig. 3. The n = 12, M = 24 problem with the optimal solution used to evaluate the GSA algorithm.


Fig. 4. Experiment A: The Uniform Temperature Annealing schedule is compared to the Random Annealing Schedule. Six runs of 5,000 parallel operators each, with population sizes varying from 16 to 16K, are presented. The best of population is plotted. The Random Annealing Schedule produced similar or better results than the Uniform Temperature Schedule, regardless of the population size.


Fig. 4 demonstrates a significant aspect of GSA: that randomized parameters work at least as well as ones set with established heuristic techniques. Several features of GSA contribute to this advantage. The most significant ones are the GA originated population, recombination of the candidate solutions, and the SA originated probabilistic decision procedure. The population of GSA not only stores a diversity of useful genetic material but, in the case of the Random Temperature approach, also stores a diversity of annealing schedules. At PEs where the initial temperature, final temperature, and α are high, inferior solutions are accepted with greater probability, thus helping to maintain a steady diversity among the population, which is critical to the effectiveness of recombination. On the other hand, PEs with lower temperatures and α only accept inferior solutions with small probability; thus, they help to keep disruption low by protecting good solutions from being destroyed. The population, combined with data exchanges among PEs and the probabilistic decision procedure, helps to maintain a balance between diversity and disruption, which contributes to the effectiveness of the Random Temperature approach.

This balance of diversity and disruption also enables the Random Temperature GSA to make progress immediately after the algorithm starts. Consider the Uniform Temperature approach, where all PEs have the same annealing schedule. The balance of diversity and disruption is difficult to maintain. When the temperature is high, PEs accept inferior solutions at a higher rate, thus maintaining a high diversity, but good solutions are also disrupted at a higher rate. On the other hand, when the temperature is low, PEs accept inferior solutions with a small probability. Good solutions are better protected, but diversity is quickly lost. Therefore, the initial period of slow progress is necessary for the Uniform Temperature approach, when all PEs evolve slowly but maintain sufficient diversity. The Random Temperature GSA, however, maintains the balance by distributing the annealing schedules among the PEs. At PEs where the temperature is low, good progress can be made quickly, and they act as safe havens for good solutions, while those PEs with high temperature provide diversity.

3.2.2 Experiment Set B

Fig. 5 shows the results of runs of the GSA algorithm on the MasPar MP-1 for population sizes ranging from 256 to 16K, where each PE holds one member of the population. Each line on the graphs represents the number of iterations required by the method to reach a certain solution quality. For example, the line marked as 108 percent of optimal shows how long it takes the method to find a solution that exceeds the optimal solution by only 8 percent.

One of the most striking characteristics is the inversely proportional relationship between the number of PEs and the time needed to reach a near-optimal solution. In the TSP domain, the constant of proportionality approaches one for finding a solution within 2 percent of the optimal. This means that we halve the execution time by doubling the number of processors. This rule applied to the whole range of processors and was only limited by the number of PEs available in the MP-1. This effect was observed in the ECC problem to a lesser degree. The characteristic of improved performance with increased population size appears unique to GSA.


Fig. 5. The effect of population sizes (number of PEs) is compared solving (a) the 100 city TSP problem and (b) the n = 12, M = 24 ECC problem. The population sizes increase from 256 to 16K, by a factor of four. The number of parallel operators applied to reach a certain percentage of the global optimal is plotted against the population size. For ease of comparison, a log/log plot is presented.


It has not been observed with GA systems, or with other hybrid methods, such as [18], where studies showed that, when population sizes exceeded a few hundred, performance begins to decrease.

Recombination is one of the key elements that contributes to GSA's scale-up performance. GSA maintains a population of solutions that not only allows each solution to evolve independently through mutation, but, more importantly, allows partial solutions from two different PEs to be combined to form better solutions. The partial solutions are the building blocks of GSA's search and, because the possible combinations of such building blocks increase exponentially with population size, GSAs with larger population sizes can search a significantly larger portion of the solution space.

However, recombination alone does not guarantee good scale-up behavior. In fact, the recombination method used by GSA is derived from traditional GA, yet GA does not exhibit monotonic improvement with increased population size. The other key reason lies in the SA originated annealing schedule that GSA utilizes. It is difficult for GA to maintain a steady diversity over time. For example, proportional selection selects solutions into a new population in a roulette wheel process. Each solution occupies a certain percentage of the wheel space, based on its fitness. This percentage is independent of population size. As a result, above average solutions replicate themselves in the new generation, and low-fitness solutions gradually drop out. Diversity can be quickly lost in the selection process. Increasing the population size does not help keep diversity. Loss of diversity undermines the effectiveness of recombination and, consequently, GA does not exhibit monotonic improvement. GSA, by contrast, maintains a healthy diversity by using an SA-type annealing schedule in place of selection. This is particularly true in the case of the Random Annealing Schedule approach, where a diversity of annealing schedules exists. PEs with higher temperatures accept new solutions at a greater rate, and help maintain high diversity. PEs with lower temperatures act as "safe havens" for good solutions. Furthermore, GSA is designed such that no visiting solution (a solution transferred from another PE) is allowed to replace the resident solution, thus minimizing the ability of an individual good solution to propagate and dominate the population.

4 CONCLUSIONS

This paper has introduced a parallel implementation of a new algorithm, GSA, that is a hybrid of GA and SA. The algorithm combines features of SA and GA to overcome the weaknesses of either method alone. Companion studies have shown this method, when implemented serially, to perform equal to or better than SA and GA for 10 representative optimization problems [3].

This study has considered a massively parallel implementation of the GSA algorithm. An analysis of the algorithm shows that it overcomes limitations with similar GA parallel implementations, where regional selection produces a serial bottleneck. GSA overcomes limitations of parallel SA implementations because of its ease of implementation and natural parallelization. The parallel portion of the algorithm is implementation independent, so that problem-specific crossover/mutation and fitness functions can be incorporated in their serial form. GSA has characteristics that make it ideal for SIMD architectures. First, GSA requires no global coordination—the only communication is a single synchronized "send" and "receive" operation per iteration between nearby PEs. Second, GSA employs a selection process over candidate solutions that are local to each PE. Third, GSA is almost entirely a single control thread algorithm, minimizing nonconcurrent computation.

A study of the algorithm has demonstrated significant promise. GSA uses randomized control parameters, demonstrating that GSA is a significant step towards a robust yet effective method with no need for time-consuming parameter optimization. Furthermore, the effectiveness of GSA improves monotonically with the number of PEs, a remarkable feature that enables GSA to utilize massively parallel architectures with maximum efficiency.

In summary, the contributions of this study to the field of parallel algorithms for combinatorial optimization are four-fold. First, a completely parallel algorithm is presented that eliminates all serial bottle-necks found in comparable approaches. Second, the algorithm is simple and natural to implement on SIMD architectures, with all task-dependent code requiring no changes. Third, the method presented is robust, with no parameter tuning required. Finally, GSA's performance scales robustly with the number of processors applied to the task.

ACKNOWLEDGMENTS

The authors wish to thank the Parallel Processing Laboratory at the Purdue University School of Electrical Engineering and the U.S. National Science Foundation, NSF Parallel Infrastructure Grant number CDA-9015696, for providing access to the MasPar MP-1 used in this study.

REFERENCES

[1] T. Blank, "The MasPar MP-1 Architecture," Proc. IEEE Compcon, pp. 20-24, Feb. 1990.

[2] A. Casotto, F. Romeo, and A. Sangiovanni-Vincentelli, "A Parallel Simulated Annealing Algorithm for Placement of Macro-Cells," IEEE Trans. Computer-Aided Design, vol. 6, no. 5, pp. 838-847, 1987.

[3] H. Chen and N.S. Flann, "Parallel Simulated Annealing and Genetic Algorithms: A Space of Hybrid Methods," Proc. Int'l Conf. Evolutionary Computation–PPSN III, pp. 428-438, 1994.

[4] J.P. Cohoon, S.U. Hegde, W.N. Martin, and D.S. Richards, "Distributed Genetic Algorithms for the Floorplan Design Problem," IEEE Trans. Computer-Aided Design, vol. 10, no. 4, pp. 483-491, Apr. 1991.

[5] F. Darema, S. Kirkpatrick, and V. Norton, "A Parallel Technique for Chip Placement by Simulated Annealing on Shared Memory Systems," Proc. Int'l Conf. Computer Design, pp. 87-90, 1987.

[6] E. Felten, S. Karin, and S.W. Otto, "The Traveling Salesperson Problem on a Hypercubic MIMD Computer," Proc. 1985 Int'l Conf. Parallel Processing, pp. 6-10, 1985.

[7] T.C. Fogarty and R. Huang, "Implementing the Genetic Algorithm on Transputer Based Parallel Processing Systems," Lecture Notes in Computer Science: Parallel Problem Solving from Nature, pp. 145-149, 1991.

[8] A.A. El Gamal, L.A. Hemachandra, I. Shperling, and V.K. Wei, "Using Simulated Annealing to Design Good Codes," IEEE Trans. Information Theory, vol. 33, no. 1, Jan. 1987.


[9] D.R. Green, "Parallel Simulated Annealing Techniques," Physica D, vol. 42, pp. 293-306, 1990.

[10] D.E. Goldberg, Genetic Algorithms in Search, Optimization & Machine Learning. Addison-Wesley, 1989.

[11] J.H. Holland, Adaptation in Natural and Artificial Systems. Cambridge, Mass.: MIT Press, 1975.

[12] L. Ingber and B. Rosen, "Genetic Algorithms and Very Fast Simulated Reannealing: A Comparison," Mathematical Computer Modeling, vol. 16, no. 11, pp. 87-100, 1992.

[13] K. Dontas and K. De Jong, "Discovery of Maximal Distance Codes Using Genetic Algorithm," IEEE Trans. Information Theory, 1990.

[14] K. De Jong, "An Analysis of the Behavior of a Class of Genetic Adaptive Systems," PhD thesis, Univ. of Michigan, 1975, Diss. Abstr. Int. vol. 36, no. 10, 5140B, Univ. Microfilms no. 76-9381.

[15] S.A. Kravitz and R.A. Rutenbar, "Placement by Simulated Annealing on a Multiprocessor," IEEE Trans. Computer-Aided Design, vol. 6, no. 4, pp. 534-549, July 1987.

[16] P. Krolak, W. Felts, and G. Marble, "A Man-Machine Approach to Solving the Traveling Salesman Problem," Comm. ACM, vol. 14, no. 4, pp. 327-334, 1971.

[17] F.T. Lin, C.Y. Kao, and C.C. Hsu, "Incorporating Genetic Algorithms into Simulated Annealing," Proc. Fourth Int'l Symp. Artificial Intelligence, pp. 290-297, 1991.

[18] S.W. Mahfoud and D.E. Goldberg, "Parallel Recombinative Simulated Annealing: A Genetic Algorithm," IlliGAL Report No. 92002, 1993.

[19] O. Martin and S. Otto, "Partitioning of Unstructured Meshes for Load Balancing," Concurrency: Practice and Experience, vol. 7, no. 4, 1995.

[20] O. Martin and S. Otto, "Combining Simulated Annealing with Local Search Heuristics," Metaheuristics in Combinatorial Optimization, Annals of Operations Research, G. Laporte and I. Osman, eds., vol. 60, 1996.

[21] P. Moscato and J.F. Fontanari, "Stochastic versus Deterministic Update in Simulated Annealing," Physics Letters A, vol. 146, no. 4, pp. 204-208, 1990.

[22] L. Merkle and G. Lamont, "Comparison of Parallel Messy Genetic Algorithm Data Distribution Strategies," Proc. Fifth Int'l Conf. Genetic Algorithms, pp. 191-198, 1993.

[23] H. Muhlenbein, M. Gorges-Schleuter, and O. Kramer, "Evolution Algorithms in Combinatorial Optimization," Parallel Computing, vol. 7, 1988.

[24] S.E. Raik, "Parallel Genetic Life," Proc. Int'l Conf. Parallel and Distributed Processing Techniques and Applications, pp. 1,175-1,185, Aug. 1996.

[25] G. Rudolph, "Convergence Properties of Canonical Genetic Algorithms," IEEE Trans. Neural Networks, vol. 5, no. 1, pp. 96-101, 1994.

[26] H.J. Siegel, J.B. Armstrong, and D.W. Watson, "Mapping Computer-Vision-Related Tasks onto Reconfigurable Parallel Processing Systems," Computer, vol. 25, no. 2, pp. 54-63, Feb. 1992.

[27] G.S. Stiles, "On the Speedup of Simultaneously Executed Randomized Algorithms," IEEE Trans. Parallel and Distributed Systems, 1993.

[28] D. Whitley, "Cellular Genetic Algorithm," Genetic Algorithms: Proc. Fifth Int'l Conf. (GA 93), 1993.

Hao Chen received a BS and an MS in computer science from Utah State University. His research interests include genetic algorithms, simulated annealing, hybrid GA/SA methods, and parallel implementation of such algorithms. Currently, he is actively engaged in the research and development of innovative AI techniques in commercial games. He is a student member of the IEEE.

Nicholas S. Flann received the BS degree in electrical engineering from Coventry Polytechnic (United Kingdom) and received the MS and PhD degrees from Oregon State University in 1986 and 1991, respectively. He is an associate professor in the Department of Computer Science at Utah State University. He has coauthored a number of conference papers, journal articles, and book chapters on machine learning, combinatorial optimization, and autonomous robotics. He is the PI of grants from the Department of Energy and a major equipment manufacturer studying the development of intelligent autonomous vehicles. He is a member of the IEEE Computer Society.

Daniel W. Watson received the BS degree in electrical engineering from Tennessee Tech University in 1985 and received the MSEE and PhD degrees from the School of Electrical Engineering at Purdue University in 1989 and 1993, respectively. He is an assistant professor in the Department of Computer Science at Utah State University. He has coauthored a number of conference papers, journal articles, and book chapters on the topics of mixed-mode parallelism and heterogeneous computing. Dr. Watson has been the recipient of faculty fellowships at Phillips and NRaD research laboratories for his work in heterogeneous computing. He is a member of the IEEE Computer Society, a member of the Gamma Beta Phi, Tau Beta Pi, and Eta Kappa Nu honorary societies, and is a recipient of the distinguished Ronald G. Harber Service Award from HKN's Beta Chapter.