A parallel depth first search branch and bound algorithm for the quadratic assignment problem


European Journal of Operational Research 81 (1995) 617-628


Theory and Methodology


Bernard Mans a,c, Thierry Mautor a,b,*, and Catherine Roucairol a,b

a INRIA, Domaine de Voluceau, Rocquencourt BP 105, F-78153 Le Chesnay Cedex, France
b MASI-Université de Versailles, 45, Avenue des Etats-Unis, 78000 Versailles, France

c Carleton University, School of Computer Science, Ottawa K1S 5B6, Canada

Received January 1993; revised March 1993

Abstract

We propose a new parallel Branch and Bound algorithm for the Quadratic Assignment Problem, a combinatorial optimization problem known to be very hard to solve exactly. An original method to distribute work to processors, based on the notion of a Feeding Tree, is presented. When adequately used, it makes it possible to reduce memory contention and load imbalance. A linear speed-up in the number of processors is therefore reached on a shared-memory multiprocessor, the Cray 2, and the optimality of solutions to famous problems of size up to 20 (Nugent 16, Elshafei 19, Scriabin-Vergin 20, ...) is proved by this program. The implementation analysis shows that these results are more than an improvement due to hardware evolution and confirms the usefulness of our parallel Branch and Bound algorithm for larger Quadratic Assignment Problems.

Keywords: Branch and bound; Combinatorial optimization; Parallel algorithm; Quadratic assignment problem

1. Introduction

The Quadratic Assignment Problem (QAP) is a combinatorial optimization problem introduced by Koopmans and Beckman in 1957 [12]. It has numerous and various applications, such as location problems, VLSI design [18], architecture design [9], etc.

The objective is to assign n units to n sites in order to minimize the quadratic cost of this assignment, which depends both on the distances between the sites and on the flows between the units. It can be formulated as follows. Given two (n × n) matrices,

F = (f_{ij}), where f_{ij} is the flow between units i and j,

D = (d_{kl}), where d_{kl} is the distance between sites k and l,

* Corresponding author.



find a permutation p of the set N = {1, 2, ..., n} which minimizes the global cost function:

Cost(p) = \sum_{i=1}^{n} \sum_{j=1}^{n} f_{ij} d_{p(i)p(j)}.
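As an illustration only, the cost function above can be evaluated directly in a few lines. The following Python sketch (the helper name qap_cost is hypothetical, and the matrices are given as plain lists of lists) is meant to make the formula concrete, not to reproduce the authors' Fortran code.

def qap_cost(F, D, p):
    # Cost of the permutation p: unit i is placed on site p[i].
    n = len(p)
    return sum(F[i][j] * D[p[i]][p[j]] for i in range(n) for j in range(n))

# Tiny 3x3 example: unit 0 on site 2, unit 1 on site 0, unit 2 on site 1.
F = [[0, 5, 2], [5, 0, 3], [2, 3, 0]]
D = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
print(qap_cost(F, D, (2, 0, 1)))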

2. Parallel solution methods

The QAP has shown itself to be a computationally very difficult problem. This problem, of which the Travelling Salesman Problem is a special case, is NP-hard [13]. Moreover, finding an ε-approximate solution is also NP-hard [22].

But its theoretical complexity does not fully convey the extreme difficulty of this problem, for which even applications of moderate size (n ≈ 20) cannot, up to now, be solved exactly.

Therefore, many heuristics have been developed in the past thirty years. More recently, due to the development of parallel architectures, many parallel algorithms have been fruitfully implemented to speed up the search and, thus, overcome the difficulty of the problem.

2.1. Parallel exact methods

The best way to solve a quadratic assignment problem exactly is, to date, to use a Branch and Bound algorithm, as other methods, such as cutting-plane methods, have not been successful, failing to solve exactly problems of size greater than eight.

As a consequence, Branch and Bound algorithms have been the only exact parallel solution methods suggested for this problem. They are due to Roucairol in 1987 [21] and to Crouse and Pardalos in 1989 [8]. We will describe the main characteristics of these algorithms later (see Section 4.2). Nevertheless, these parallel Branch and Bound algorithms are also limited to the solution of small problems (size < 15).

This strong limitation explains why most recent approaches to this problem have been heuristics.

2.2. Parallel heuristics

Among sequential solution methods, the adaptations of the recent meta-heuristics (Simulated Annealing, Tabu Search and Genetic Algorithms) to the QAP provide the best approximate results, outstripping the results of the first heuristic solution methods (construction methods, exchange methods).

Thus, logically, implementations of parallel heuristics are derived from these meta-heuristics. In 1989, Brown, Huntley and Spillane [3], on a 32-node Intel iPSC/2 hypercube, and Mühlenbein [17], on a 64-processor system with distributed memory, developed parallel genetic algorithms. In 1991, Taillard [24], on a network of 10 T800C transputers, proposed a parallel Tabu search, while Chakrapani and Skorin-Kapov [7] developed a massively parallel Tabu search on a Connection Machine using 16K processing units.

These algorithms always find the optimal solution for the small problems from the literature. Since optimality cannot be proved for larger instances, it is hard to estimate the quality of the results. Nevertheless, for special instances built with knowledge of the optimum (Palubetskes [19], Burkard et al. [5]), the results of these heuristics are very close to optimal.

Let us also mention a connectionist approach, developed by Wang in 1990 [25] on a MicroVAX computer, which simulates n^2 processing units performing nonlinear transformations.


3. Sequential Branch and Bound

Let us briefly recall the main principles of the sequential algorithm introduced by Mautor and Roucairol in 1992 [15,16], which currently provides the best sequential results for the quadratic assignment problem.

In doing so, we discuss important features, such as lower bounds and branching strategies, that concern any Branch and Bound algorithm.

3.1. Bounding

Undoubtedly, the computation of the lower bound is one of the major difficulties in the exact solution of the quadratic assignment problem.

Indeed, up to the present time, this bound is either too loose, so that the number of nodes of the search tree quickly becomes huge as the size of the problem increases, or, when the bound is slightly tighter, the time needed to compute it at each node is prohibitive.

The oldest lower bound, developed independently in 1962 by Gilmore [11] and Lawler [13], is obtained by solving a linear assignment problem on an (n × n) matrix C = (c_{ik}), where c_{ik} is a lower bound on the cost of assigning facility i to site k, given by the ranked (ordered) scalar product of the row f_{i·} of F and the column d_{·k} of D.

This bound is, therefore, quickly computed in O(n^3) but its results are not very tight. For instance, the Gilmore-Lawler bound is more than 20% away from the best solution for Nugent's problems of size greater than 20.

The most interesting other lower bounds are based on an eigenvalue approach (Finke, Burkard and Rendl [10], Rendl and Wolkowicz [20]) or on equivalent dual formulations of the problem (Assad and Xu [2], Carraresi and Malucelli [6]). If they provide a slightly better evaluation, it is at the cost of a very significant increase in computational time. The best quality/time ratio is, therefore, still achieved by the Gilmore-Lawler bound.

For this reason, the most successful Branch and Bound algorithms (Burkard and Derigs [4], Roucairol [21], Crouse and Pardalos [8]) use the Gilmore-Lawler bound. Mautor and Roucairol use this bound and concentrate their effort on more efficient approaches to reduce the enumeration.
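To make the bounding step more concrete, here is a minimal Python sketch of a root-node Gilmore-Lawler bound. It is an illustration under stated assumptions, not the authors' code: it covers the root node only (no partial assignment), builds each c_{ik} as f_{ii} d_{kk} plus the minimal scalar product of the sorted off-diagonal row f_{i·} and the reverse-sorted off-diagonal column d_{·k}, and delegates the final linear assignment step to SciPy's Hungarian solver.

import numpy as np
from scipy.optimize import linear_sum_assignment

def gilmore_lawler_bound(F, D):
    # Root-node Gilmore-Lawler bound: linear assignment on the matrix C = (c_ik).
    F, D = np.asarray(F, dtype=float), np.asarray(D, dtype=float)
    n = len(F)
    C = np.empty((n, n))
    for i in range(n):
        fi = np.sort(np.delete(F[i], i))            # off-diagonal row of F, ascending
        for k in range(n):
            dk = -np.sort(-np.delete(D[:, k], k))   # off-diagonal column of D, descending
            # ranked (minimal) scalar product, plus the cost of the pair (i, k) itself
            C[i, k] = F[i, i] * D[k, k] + fi @ dk
    rows, cols = linear_sum_assignment(C)           # O(n^3) linear assignment step
    return C[rows, cols].sum()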

3.2. Reduction tests

Symmetric equivalences
In most classical applications of the QAP, the sites are located on a regular figure (grid, circle or line) on which several symmetric or isometric equivalences can be detected. We want to avoid creating, visiting and bounding these different but equivalent nodes in different branches of the search tree.
For this purpose, a simple test, which is quickly computed, has been introduced by Mautor and Roucairol [16]. For any partial solution, this test identifies the different isometric classes. Thus, a unit can be assigned to only one site of each isometric class.
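The following Python sketch illustrates the idea for the root node only, under the assumption that the sites lie on an a × b grid: sites related by a symmetry of the grid belong to the same isometric class, so the first unit needs to be tried on only one representative per class. The test used in the paper also handles partial solutions, which this sketch does not attempt to reproduce.

def isometric_classes(a, b):
    # Group the a*b grid positions into orbits under the symmetries of the rectangle
    # (all eight symmetries of the square when a == b).
    def images(x, y):
        base = [(x, y), (a - 1 - x, y), (x, b - 1 - y), (a - 1 - x, b - 1 - y)]
        if a == b:  # square grid: add the diagonal reflections and quarter-turn rotations
            base += [(y, x), (b - 1 - y, x), (y, a - 1 - x), (b - 1 - y, a - 1 - x)]
        return base
    classes = {}
    for x in range(a):
        for y in range(b):
            canon = min(images(x, y))          # canonical representative of the orbit
            classes.setdefault(canon, []).append((x, y))
    return list(classes.values())

# On the 4 x 4 grid of the Nugent problems, the 16 sites fall into only
# 3 isometric classes (corner, edge and interior cells).
print([len(c) for c in isometric_classes(4, 4)])   # [4, 8, 4]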

Reduction test using the search gap
This classical test forbids some assignments and, therefore, reduces the size of the B & B tree. Moreover, it makes it possible to choose an efficient branching for the next level (see Section 3.3).


Let us denote by lb the value of the lower bound obtained, and by bks the value of the best known solution (upper bound).

Test. If the alternative cost of the assignment of a unit to a site is greater than or equal to the search gap (the difference bks - lb between the value of the best known solution and the lower bound), then this assignment can be forbidden (the proof can be found in [16]).
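As a small illustrative sketch (the name alt_cost and the data layout are hypothetical, alt_cost[i][k] standing for the alternative cost of assigning unit i to site k at the current node), the test can be applied to all candidate assignments at once:

def forbidden_assignments(alt_cost, lb, bks):
    # Forbid an assignment (unit i, site k) whenever its alternative cost
    # reaches the search gap bks - lb.
    gap = bks - lb
    return {(i, k)
            for i, row in enumerate(alt_cost)
            for k, cost in enumerate(row)
            if cost >= gap}

Any pair returned here is discarded from the sons of the current node, and the counts of remaining admissible sites per unit feed the branching rule of Section 3.3.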

3.3. Branching strategies

Depth first search strategy
As mentioned above, exact methods can only deal with problems of small size, for which efficient recent heuristics, such as Tabu search, almost always find the optimal solution or, at least, a solution extremely close to optimal.
As a consequence, since only the nodes of the critical tree (evaluation lower than the best solution) have to be examined, a depth first search strategy is more efficient than a best first search strategy. This approach has two advantages. First, the data structures are lighter. Second, we avoid incessant sortings of partial solutions and reconstructions of independent subproblems (our experiments have shown that building the matrices at each exploration of a B & B node takes at least 40% of the global computation time).

Polytomic branching
The branching scheme is polytomic: a selected unit is assigned to all the free (not already assigned), not forbidden (the alternative cost is within the search gap) and isometrically different sites. The selected unit is the one with the lowest number of sons to be examined (highest number of forbidden elements).
This scheme has several advantages. First, with respect to a dichotomic branching, fewer nodes are created. Second, the memory requirement is reduced by avoiding the unnecessary information on revoked assignments generated by classical dichotomic branching schemes, in which a unit is assigned or not to a site.

Moreover, this scheme discriminates between the lower bounds of a node and its sons and allows early pruning of some branches. Our experiments have shown that this strategy produces a significant decrease in the size of the search tree.
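A minimal sketch of this selection rule, assuming the sets of still-admissible sites per unassigned unit (here a hypothetical dictionary allowed) have already been filtered by the reduction tests above:

def select_branching_unit(allowed):
    # Polytomic branching: branch on the unit with the fewest sons to examine,
    # i.e. the fewest free, non-forbidden, isometrically different sites.
    unit = min(allowed, key=lambda u: len(allowed[u]))
    return unit, sorted(allowed[unit])   # one son per remaining candidate site

# Example: unit 2 has only two candidate sites left, so it is branched on first.
print(select_branching_unit({0: {1, 3, 4}, 2: {0, 5}, 7: {1, 2, 3, 4}}))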

3.4. Size of the search tree

We can see in Table 1 that the reduction tests and the branching strategy, described above, lead to a drastic decrease in the total number of created nodes compared to the previous most efficient exact algorithms (Burkard and Derigs, Roucairol, Crouse and Pardalos).

Table 1
Total number of created nodes

Algorithm           Nugent 8    Nugent 12    Nugent 15     Elshafei 19
Burkard-Derigs           403       36 966    2 064 415     not proved
Roucairol                428       83 379    not proved    not proved
Crouse-Pardalos          798       42 706    1 596 353     not proved
Mautor-Roucairol          32        3 474       97 287            491


4. Parallel Branch and Bound

4.1. Main principles

We have parallelized the Mautor-Roucairol sequential algorithm on an asynchronous shared-memory multiprocessor, the Cray 2. Our main concern has been to minimize the waiting time (inactivity) of the processors and, thus, to obtain the highest speed-up. Our approach therefore addresses the major difficulties appearing in parallelizing B & B methods [14]: task allocation, choice of granularity, overhead detection, etc.

As previously argued, we use a depth first search strategy. In a parallel implementation, an additional reason appears: the granularity (relative number of operations done between synchronizations) defined by a parallel best first search strategy would correspond only to the exploration of a single B & B node, which is quickly done. The partial sort of the newly generated nodes into a global shared list, at each branching step, would therefore create heavy contention on memory accesses, and hence a bottleneck for the processors.

For these reasons, we propose that each processor should execute the same depth first search algorithm on different B & B subtrees. Each processor is, therefore, assigned to the root of a subtree and it develops a local depth first search on the corresponding subtree.

When a processor has completed its own exploration of a subtree, it accesses a shared data structure, the feeding tree, from which it takes a new node, the root of a new subtree to develop. Since this task allocation cannot be done statically, processors schedule themselves (self-scheduling technique) and trigger their own work demands. The consistency and fairness of this shared list of tasks are ensured by synchronization with classical primitives such as locks.

The procedure stops when the feeding tree is empty and all processors are idle.
The main features of our algorithm are illustrated on the Nugent et al. problem of size 15. Indeed, for this problem, the size of the tree and the computational time are significant enough for the implementation choices to be carefully analyzed. It should be noted that we have obtained similar results with other problems, in that the speed-up has been linear in all cases.

4.2. Task allocation

Different approaches can be adopted for the task allocation. In Roucairol's parallel algorithm [21], a global heap is used to memorize the generated B & B nodes. Since a best first search strategy is used, each exploration of a B & B node requires that the shared data structure be accessed once to obtain the required node and be accessed several times to insert the newly generated nodes. In order to limit this global access, Crouse and Pardalos [8] initially create several heaps in the global memory, so that each processor can select one and explore it completely locally. In this case, the parallel B & B execution terminates when all heaps have been explored and all processors are idle.

Since our approach is quite different, we describe the way we distribute the work amongst the processors, through the notion of the feeding tree.


The feeding tree
Definition 4.1. The Feeding Tree, a shared data structure, is the upper part of the B & B tree, developed down to depth (or level) i. The leaves of the Feeding Tree (nodes of level i of the B & B tree) are the roots of the subtrees allocated to the processors (see Fig. 1).


Fig. 1. Feeding Tree and allocated subtrees.

The first free processor initializes the left part of the Feeding Tree until it generates, by successive branchings, the leftmost node at the chosen depth i. Then, this processor unlocks the shared structure and begins its own exploration of the subtree of which this node is the root. While the exploration of the allocated subtree is not completed, the processor does not need to access the global shared structure.

Gradually, the other processors access the feeding tree, in a mutually exclusive way, and develop it until they generate a new node of depth i (the next depth-i node to the right of the last allocated node), possibly backtracking in the feeding tree. Likewise, as soon as a processor becomes idle, the exploration of its last subtree being completed, it accesses the feeding tree again to build a new depth-i node.

When the development of the feeding tree is completed (at the first level, it is not possible to branch on a new partitioned subproblem), each processor that becomes idle terminates (the main program terminates when the last processor becomes idle).

Since an assignment is fixed at each branching step, the maximal depth of the B & B tree is n - 1, where n is the size of the QAP problem. The maximal depth of allocated subtrees is, therefore, equal to (n - i - 1).

Due to the depth first search strategy, only the nodes of the feeding tree on the path from the root to the last allocated node (the facilities assigned to reach this depth and the set of the remaining available locations) have to be memorized. Similarly, the context memorized by a processor is the path between the current node and its ancestor of depth i.

Thus, the memory requirement of this parallel implementation is only slightly increased with respect to the sequential implementation. Indeed, this increase is only in O(p), the stored node contexts amounting to i + p(n - i - 1), against n - 1 for the sequential search, where p is the number of processors.
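For instance, with n = 15, i = 3 and p = 4 processors, at most i + p(n - i - 1) = 3 + 4 × 11 = 47 node contexts are kept in memory, against n - 1 = 14 for the sequential depth first search.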

Algorithm

procedure Parallel B & B
begin
  /* initialize Feeding Tree */
  NprocsIdle := 0
  Create fid, the first leftmost descendant of root with depth i
  for each of the Nprocs processors do   /* self-schedule */
    lock (Feeding Tree)
    Create rfid, right sibling of fid
    if rfid ≠ null then
      Keep context of rfid
      fid := rfid
      unlock (Feeding Tree)
    else   /* no more tasks in this branch */
      Backtrack until creating a right sibling ancfid
      if ancfid ≠ null then
        Create rfid, the first leftmost descendant of ancfid with depth i
        Keep context of rfid
        fid := rfid
        unlock (Feeding Tree)
      else   /* no more tasks to assign */
        NprocsIdle := NprocsIdle + 1
        unlock (Feeding Tree)
        Terminate
      endif
    endif
    /* Depth First Search exploration */
    Expand the whole B & B subtree
    if a new best solution nbs is found then
      lock (bks)
      if nbs < bks then bks := nbs
      unlock (bks)
    endif
  endo   /* NprocsIdle = Nprocs */
end Parallel B & B
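To show how the feeding-tree self-scheduling scheme fits together, here is a compact Python sketch. It is only an illustration under simplifying assumptions, not the authors' Fortran program: the lower bound is just the cost of the partial assignment (valid when F and D are non-negative, but far weaker than the Gilmore-Lawler bound), the reduction tests are omitted, the feeding tree is flattened into a lazy left-to-right enumeration of the depth-i nodes, and Python threads merely illustrate the locking pattern (the interpreter's global lock prevents real CPU parallelism). All names are ours.

import itertools
import threading

def full_cost(F, D, perm):
    n = len(perm)
    return sum(F[i][j] * D[perm[i]][perm[j]] for i in range(n) for j in range(n))

def partial_cost(F, D, partial):
    k = len(partial)
    return sum(F[i][j] * D[partial[i]][partial[j]] for i in range(k) for j in range(k))

class Incumbent:
    # Best known solution (bks), protected by its own lock, as in the pseudo-code.
    def __init__(self, value):
        self.value = value
        self.lock = threading.Lock()

def local_dfs(F, D, partial, free, best):
    lb = partial_cost(F, D, partial)       # weak lower bound: remaining terms are >= 0
    if lb >= best.value:                   # prune against the best known solution
        return
    if not free:                           # complete assignment: try to update bks
        with best.lock:
            if lb < best.value:
                best.value = lb
        return
    for site in free:                      # branch on the next unit
        local_dfs(F, D, partial + (site,), free - {site}, best)

def parallel_bb(F, D, depth=3, nprocs=4):
    n = len(F)
    best = Incumbent(full_cost(F, D, tuple(range(n))))   # any feasible start solution
    roots = itertools.permutations(range(n), depth)      # depth-i nodes, left to right
    tree_lock = threading.Lock()                         # lock (Feeding Tree)

    def worker():
        while True:
            with tree_lock:
                root = next(roots, None)                 # self-scheduling: next subtree root
            if root is None:
                return                                   # feeding tree exhausted: terminate
            local_dfs(F, D, root, set(range(n)) - set(root), best)

    threads = [threading.Thread(target=worker) for _ in range(nprocs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return best.value

As in the algorithm above, the shared structure that hands out subtree roots is accessed under mutual exclusion, while the local depth first search runs without any synchronization except the rare update of the incumbent.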

4.3. Granularity - memory contentions

As previously mentioned, we want to minimize the waiting time of the processors. A processor has to wait in two cases:
• when the processor needs to access a global shared structure locked by another processor (memory contention),
• during the termination phase, when the processor is idle and has to wait for other processors to complete their last local exploration (termination wait).
Let us discuss these two cases in more detail.

Memory contention
The overhead problem, which depends entirely on the parallel implementation, is decisive on the Cray 2. A lock operation requires 200 CPU cycles when the lock is free, while conflict management requires 4000 CPU cycles.

Since the possible updates of the best known solution are extremely rare and quickly done, only the dynamic development of the feeding tree can take long enough to induce a significant waiting time for the other processors.

But the total waiting time of the processors is a function of the level i of the feeding tree. The deeper this level is, the smaller the average granularity of the tasks allocated to the processors will be. The processors will then access the feeding tree more frequently. Moreover, the average time for a processor to develop the feeding tree and generate a new depth-i node will increase, due to numerous backtrackings.

Fig. 2. Conflict ratio with respect to the shared level (Nugent 15).

This phenomenon is shown in Fig. 2, where the access conflict ratio (percentage of access requests arriving when the feeding tree is locked) is given with respect to the shared level. For example, when the depth of the feeding tree is equal to 5, nearly 30% of the access requests result in waiting time.
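Combined with the lock timings quoted above, this means that at this depth a feeding-tree access costs on average about 0.7 × 200 + 0.3 × 4000 ≈ 1340 CPU cycles, nearly seven times the cost of an uncontended lock.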

Termination wait

This delay occurs when a processor without work (no more task) waits for the completion of the subtree(s) allocated to other processor(s). The maximal idle time is the time needed to explore the largest subtree which can be assigned to a processor.

When the level i of the feeding tree decreases, the average granularity of the tasks (subtrees) allocated to the processors will increase. Therefore, the probability of a long termination wait becomes more and more significant.

This is illustrated in Fig. 3, where the difference in the numbers of processed nodes (generated and explored) by the most and the least active processors is given.

Fig. 3. Difference in the numbers of processed nodes (generated and explored) with respect to the shared level (Nugent 15).

The effect of the chosen depth i on the load balancing is confirmed: with a small feeding tree (level 1 or 2), this difference can reach 30% (one processor has explored 5000 more subproblems than another), while a deeper feeding tree produces a very good balance of the load.

5. Experiments

Our algorithm, written in Fortran, was run on the Cray 2, which is a 4-processor asynchronous machine with shared memory.

The test data used for the computational results are as follows:
• Nugent et al. [18]: classical QAP problems of sizes n = 12, n = 15, and n = 16 (the flow matrix is extracted from the problem of size 20 and the sites are located on a 4 × 4 square);
• Elshafei: hospital layout problem of size 19, from [9];
• Scriabin and Vergin: economic layout problem of size 20 [23], from Armour and Buffa [1].

5.1. Computational results

We compared the results obtained by our method with those obtained by one of the fastest sequential algorithms (Burkard and Derigs [4]) and by the two previously most efficient parallel algorithms available (Crouse and Pardalos [8], and Roucairol [21]). Since Burkard and Derigs' algorithm had been tested in 1980 on a Cyber 76 and because of the evolution in computational performance, we ran their sequential algorithm on the Cray 2. We also report two different sets of results from Crouse and Pardalos: the first is obtained from a sequential run of their algorithm and the second from running the algorithm in parallel on a four-processor machine.

In Table 2, we present a comparison of the running times of these algorithms. Some tests done on a Cray YMP are also presented.

Remark. Since our dedicated use of the Cray 2 (i.e. a single execution on the machine) was limited to 400 seconds, cumulated over all used processors, parallel executions of Nugent of size 16 and of Scriabin and Vergin of size 20 could not be done with a reasonably busy machine. The amount of time needed for parallel executions of these problems depends on the number of users working on the machine. Thus, in these two cases, we could not reach the predictable linear speed-up shown for Nugent et al. of size 15.

Table 2
Comparison of running times (in seconds)

Algorithm         Machine    No. procs   Nugent 12   Nugent 15      Nugent 16     Elshafei 19    Scr.-Ver. 20

Optimal solution                               578        1150           1550      17 212 548         110 030

Burk. Der.        Cray 2         1              24        1290     not proved      not proved      not proved
Roucairol         Cray XMP       4             312   out of time   not proved      not proved      not proved
Cr. Pard.         IBM 3090       1              34        2005     not proved      not proved      not proved
Cr. Pard.         IBM 3090       4              10   out of space  not proved      not proved      not proved
Mans et al.       Cray 2         1            2.68         109            969            1.04            1189
Mans et al.       Cray 2         4            0.99          28            436            0.68             560
Mans et al.       Cray YMP       1      not tested          62     not tested      not tested      not tested
Mans et al.       Cray YMP       6      not tested          11     not tested      not tested      not tested


Fig. 4. Running time obtained on Nugent 15 with respect to the level i.

Among the classical quadratic assignment problems from the literature, the problem of Nugent of size 15 was considered to be the hardest to solve by an exact method. Only the most efficient Branch and Bound algorithms managed to solve this problem and they needed a long computational time (more than 20 minutes) to compute the solution.

Our parallel algorithm manages to solve this problem in a few seconds (11 seconds on the Cray YMP, 28 seconds on the Cray 2).
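From Table 2, this corresponds to a speed-up of about 109/28 ≈ 3.9 with 4 processors on the Cray 2, and about 62/11 ≈ 5.6 with 6 processors on the Cray YMP, with respect to the corresponding sequential runs.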

This result is a good illustration of the efficiency of this parallel algorithm, which, on all the classical problems, obtains the fastest running time.

Moreover, this algorithm obtains the best solution, and proves its optimality, for problems of size 16 to 20 that had never been solved exactly before: Nugent of size 16, Elshafei of size 19, and Scriabin and Vergin of size 20.

We have also run our algorithm on randomly generated problems with previously known optimal solutions (Palubetskes [19], Burkard et al. [5]). Problems of size 18 were solved sequentially in less than 6 minutes, while they were solved in parallel with a linear speed-up.

5.2. Parallel analysis

In order to analyze the effects of the granularity and of the sizes of the allocated subtrees, we considered the level i (the depth of the feeding tree) as a parameter and ran our program with different values of i.

Fig. 5. Running time (Nugent 15) with respect to the number of processors, for feeding-tree levels 3 and 4.


Fig. 6. Speed-up on Nugent 15, for feeding-tree levels 3 and 4.

The tests have been done on the Nugent 15 problem, while our program was the only one executed by the machine.

As previously seen, the level i must not be set to extreme values: neither too deep, because of memory contention (see Fig. 2), nor too shallow, because of load imbalance (see Fig. 3).

This influence of the level i is also shown in Fig. 4, where the running times obtained with different depths of the feeding tree are presented.

On the Nugent 15 problem, the best compromise is achieved when the depth is equal to 3 (or 4).
Of course, the value of the level i leading to the best behaviour of the program depends on the problem and on the structure of the corresponding search tree. Nevertheless, all the experiments we have done show that a middle-valued depth of the feeding tree, equal to [n/5] or [n/4] (where n is the size of the problem), leads to very good performance.

Therefore, as can be seen in Fig. 5, with an appropriate choice of i, we obtain a significant decrease in the running time as the number of processors increases.

A linear speed-up in the number of processors is, thus, achieved (see Fig. 6). We emphasize the fact that the comparisons of the parallel running times are done with the running time of the best sequential implementation of the algorithm.
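As a purely illustrative sketch (the helper name is ours), the empirical rule for the shared level quoted above can be written as:

def shared_level(n):
    # Middle-valued feeding-tree depth, around n/5 to n/4, and at least 1.
    return max(1, round(n / 5))

print([(n, shared_level(n)) for n in (12, 15, 16, 19, 20)])   # e.g. 15 -> 3, as in Fig. 4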

6. Conclusions

We have presented an original and very efficient parallel algorithm for solving the Quadratic Assignment problem. We introduced the notion of the feeding tree which allows a good distribution of work to processors.

This algorithm, which is the parallel version of a sequential B & B algorithm, leads to a linear speed-up (nearly equal to the number of processors) with very little overhead. Since the heuristics for the QAP give very good solutions, no oversearch is pursued (only the critical tree is explored). The parallel implementation of the depth first search strategy is scalable in CPU time and memory requirement.

Nevertheless, whereas the parallelism is fully exploited, a global improvement of the algorithm is still required to solve problems of size greater than 20. Without counting on hardware evolution, such solutions cannot be reached even with the tightest existing bounds.

Acknowledgements

We thank Franz Rendl, Professor at the Technische Universität Graz, Austria, for his helpful comments and suggestions on this work during the last few years.


The authors are indebted to Cray France (esp. G. Simeoni) and to CCVR (Centre de Calcul Vectoriel pour la Recherche, Palaiseau, France), for supporting this research with computer time and for helpful discussions.

References

[1] Armour, G., and Buffa, E., "A heuristic algorithm and simulation approach to the relative allocation of facilities", Management Science 9 (1963) 294-309.
[2] Assad, A., and Xu, W., "On lower bounds for a class of quadratic 0-1 programs", Operations Research Letters 4 (1985) 175-180.
[3] Brown, D., Huntley, C., and Spillane, A., "A parallel genetic heuristic for the quadratic assignment problem", in: Proceedings of the 3rd Conference on Genetic Algorithms, Arlington, 1989, 406-415.
[4] Burkard, R., and Derigs, U., Assignment and Matching Problems: Solution Methods with Fortran Programs, Springer-Verlag, Berlin, 1980.
[5] Burkard, R., Karisch, S., and Rendl, F., "QAPLIB - A quadratic assignment problem library", European Journal of Operational Research 55 (1991) 115-119.
[6] Carraresi, P., and Malucelli, F., "A new lower bound for the quadratic assignment problem", Operations Research 40 (1992) S22-S27.
[7] Chakrapani, J., and Skorin-Kapov, J., "Massively parallel tabu search for the quadratic assignment problem", Technical Report HAR-91-06, Harriman School for Management and Policy, NY 11794, 1991.
[8] Crouse, J., and Pardalos, P., "A parallel algorithm for the quadratic assignment problem", in: Proceedings of Supercomputing 89, ACM, 1989, 351-360.
[9] Elshafei, A., "Hospital layout as a quadratic assignment problem", Operational Research Quarterly 28 (1977) 167-179.
[10] Finke, G., Burkard, R., and Rendl, F., "Quadratic assignment problems", Annals of Discrete Mathematics 28 (1987) 61-82.
[11] Gilmore, P.C., "Optimal and suboptimal algorithms for the quadratic assignment problem", SIAM Journal on Applied Mathematics 10 (1962) 305-313.
[12] Koopmans, T.C., and Beckman, M.J., "Assignment problems and the location of economic activities", Econometrica 25 (1957) 53-76.
[13] Lawler, E., "The quadratic assignment problem", Management Science 9 (1963) 586-599.
[14] Mans, B., "Contribution à l'algorithmique non numérique parallèle: Parallélisations de méthodes de recherche arborescentes", Thèse d'université, Université Paris VI, 4, place Jussieu, 75252 Paris Cedex 05, June 1992.
[15] Mautor, T., "Contribution à la résolution des problèmes d'implantation: Algorithmes séquentiels et parallèles pour l'affectation quadratique", Thèse d'université, Université Paris VI, 4, place Jussieu, 75252 Paris Cedex 05, February 1993.
[16] Mautor, T., and Roucairol, C., "A new exact algorithm for the solution of quadratic assignment problems", Discrete Applied Mathematics, to appear, 1993. Also: Report MASI-RR-92-09, Université de Paris 6, 4, place Jussieu, 75252 Paris Cedex 05.
[17] Mühlenbein, H., "Parallel genetic algorithms, population genetics and combinatorial optimization", in: J.D. Becker, I. Eisele, and F.W. Mündemann (eds.), Parallelism, Learning, Evolution: Workshop on Evolutionary Models and Strategies and Workshop on Parallel Processing: Logic, Organization and Technology (WOPPLOT 89), Springer-Verlag, Berlin, July 1989.
[18] Nugent, C., Vollmann, T., and Ruml, J., "An experimental comparison of techniques for the assignment of facilities to locations", Operations Research 16 (1968) 150-173.
[19] Palubetskes, G.S., "Generation of quadratic assignment test problems with known optimal solution", Zhurnal Vychislitel'noi Matematiki i Matematicheskoi Fiziki 28/11 (1988) 1740-1743 (in Russian).
[20] Rendl, F., and Wolkowicz, H., "Applications of parametric programming and eigenvalue maximization to the quadratic assignment problem", Mathematical Programming (1992) 63-78.
[21] Roucairol, C., "A parallel branch and bound algorithm for the quadratic assignment problem", Discrete Applied Mathematics 18 (1987) 211-225.
[22] Sahni, S., and Gonzalez, T., "P-complete approximation problems", Journal of the ACM 23 (1976) 555-565.
[23] Scriabin, M., and Vergin, R.C., "Comparison of computer algorithms and visual based methods for plant layout", Management Science 22 (1975) 172-187.
[24] Taillard, E., "Robust taboo search for the quadratic assignment problem", Parallel Computing 17 (1991) 443-455.
[25] Wang, J., "A parallel distributed processor for the quadratic assignment problem", in: Proceedings of the INNC 90 International Neural Network Conference, Vol. 1, Paris, France, July 1990, 278-281.