A parallel asynchronous Newton algorithm for unconstrained optimization


JOURNAL OF OPTIMIZATION THEORY AND APPLICATIONS: Vol. 77, No. 2, MAY 1993

A Parallel Asynchronous Newton Algorithm for Unconstrained Optimization¹

D. CONFORTI² AND R. MUSMANNO³

Communicated by H. Y. Huang

Abstract. A new approach to the solution of unconstrained optimization problems is introduced. It is based on the exploitation of parallel computation techniques and, in particular, on an asynchronous communication model for the data exchange among concurrent processes. The proposed approach arises by interpreting the Newton method as being composed of a set of iterative and independent tasks that can be mapped onto a parallel computing system for execution.

Numerical experiments on the resulting algorithm have been carried out to compare parallel versions using synchronous and asynchronous communication mechanisms, in order to assess the benefits of the proposed approach on a variety of parallel computing architectures. It is pointed out that the proposed asynchronous Newton algorithm is preferable for medium and large-scale problems, in the context of both distributed and shared memory architectures.

Key Words. Newton method, unconstrained optimization, asynchronous parallel algorithms, parallel computation.

1. Introduction

This paper presents a parallel asynchronous Newton algorithm for solving the following unconstrained nonlinear programming problem:

min f(x),   x ∈ ℝ^n,  f: ℝ^n → ℝ,  f ∈ C^2.    (1)

¹This research work was partially supported by the National Research Council of Italy, within the special project "Sistemi Informatici e Calcolo Parallelo," under CNR Contract No. 90.00675.PF69.

²Assistant Professor, Dipartimento di Elettronica, Informatica e Sistemistica, Università della Calabria, Rende, Cosenza, Italy.

³Research Fellow, Dipartimento di Elettronica, Informatica e Sistemistica, Università della Calabria, Rende, Cosenza, Italy.



Although several methods have been designed for solving the above problem, the Newton method remains one of the most attractive, because of its excellent theoretical termination properties. Given a solution estimate x^k ∈ ℝ^n, the Newton method generates the next estimate x^{k+1} ∈ ℝ^n by

x^{k+1} = x^k + α_k s^k.    (2)

The method starts from an arbitrary guess x^0 ∈ ℝ^n; the kth iteration proceeds as follows:

Step 1. Computation of the gradient vector ∇f(x^k) at the current iteration point x^k.

Step 2. Computation of the Hessian matrix ∇²f(x^k) at the current iteration point x^k.

Step 3. Generation of the search direction s^k ∈ ℝ^n by solving the linear system ∇²f(x^k) s = -∇f(x^k), where the matrix ∇²f(x^k) is assumed to be nonsingular.

Step 4. Exploration of the search direction, seeking a steplength α_k ∈ ℝ_+ which represents a suitable approximation of a local minimum of the function φ(α) = f(x^k + α s^k).

The scheme is iteratively repeated until a user-defined termination criterion is satisfied at x^{k+1}. A further condition on the matrix ∇²f(x^k) plays an important role in guaranteeing the correct termination of the algorithm: if ∇²f(x^k) is positive definite for each k, then the search directions s^k are downhill.
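As a point of reference for the task decomposition introduced later, the four steps above can be sketched in a few lines of code. The sketch below (Python/NumPy) is not part of the original paper: f, grad, hess, and line_search are assumed user-supplied callables, and the Choleski-based solve presumes a positive-definite Hessian.

```python
import numpy as np

def newton_iteration(f, grad, hess, x, line_search):
    """One iteration of the basic Newton scheme (Steps 1-4)."""
    g = grad(x)                       # Step 1: gradient at x^k
    G = hess(x)                       # Step 2: Hessian at x^k
    L = np.linalg.cholesky(G)         # Step 3: factorize (G assumed positive definite)
    w = np.linalg.solve(L, -g)        #         forward substitution, L w = -g
    s = np.linalg.solve(L.T, w)       #         back substitution,    L^T s = w
    alpha = line_search(f, x, g, s)   # Step 4: steplength along s
    return x + alpha * s
```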

If some appropriate improvement, such as the trust-region technique (Ref. 1), is incorporated in the basic Newton method, it possesses sound theoretical convergence properties, at least when the starting point is sufficiently close to the solution. Nevertheless, if the objective function to be minimized does not have a particular structure, or if the number of decision variables is very large (typically of the order of hundreds or thousands), the method is characterized by a heavy computational burden. This is mainly due to the need to compute and store, at each iteration, a matrix of second partial derivatives. In fact, the computation of the Hessian matrix (or even of a suitable modification of it), which is a very time-consuming operation for large-size problems, must be completed before the next iteration is started. In this situation, parallel computing may represent a useful and powerful tool to overcome this difficulty, especially for large problems. In recent years, in fact, different parallel implementations of the Newton method have been proposed (Refs. 2-5).

In Ref. 6, a parallel asynchronous model of the Newton method has been described for the unconstrained minimization of a twice continuously differentiable uniformly convex function. On the basis of the same approach,


we derive a parallel Newton algorithm in which four basic computational steps of the sequential version (computation of the gradient and steplength, evaluation of the Hessian matrix, Choleski factorization of the Hessian matrix, and computation of the search direction) may operate simultaneously and cooperatively to solve problem (1). In particular, each step corresponds to a local and iterative algorithm (task) in which the execution of the computational phase of each iteration is preceded by the input phase of data and followed by the output phase. The main characteristic of the proposed parallel algorithm is that the tasks, after an appropriate initialization procedure, never wait for input at any time, but continue execution or terminate according to whatever information is currently available.

Our major goal is the development of numerical implementations for investigating the parallel asynchronous behavior of the Newton method for the solution of large-scale unconstrained optimization problems on distributed and shared memory parallel computing systems and to assess, with the support of the experimental evidence, the possible benefits of the asynchronous approach.

The increasing availability of new computational environments is the driving force for analyzing and developing new methods and algorithms. In the case of parallel asynchronous algorithms, we refer, in particular, to distributed memory architectures (multicomputers), hierarchies of parallel architectures, and large geographically distributed networks of computers (in general, different types of computers with different capabilities). These computational environments are the most suitable for the execution of parallel asynchronous algorithms. Due to the communication penalty, it might be necessary to overlap computations and communications and to guarantee that each task can usefully exploit outdated information. The totally asynchronous structure of the communication is attractive, since it may have strong implications for the efficiency of parallel algorithms; in fact, the overhead of synchronization mechanisms among the tasks can be remarkably reduced. Furthermore, the asynchronous approach allows efficient use of the architectural resources and makes it easier to handle some typical problems of synchronous implementations (e.g., load balancing and data locality). However, the use of outdated information at each iteration may be counterproductive, because of the expense of more frequent exchanges of information among the tasks. Moreover, since the iterates generated by the algorithm do not satisfy any deterministic recurrence relation (unlike the sequential or synchronous counterparts), a general theory concerning convergence properties can be obtained only under quite severe assumptions. For more details about asynchronous iterative algorithms, see Ref. 7.


The paper is organized as follows. In Section 2, we present the model of the parallel asynchronous Newton method, paying particular attention to the theoretical aspects and the numerical implementation strategies, with a detailed description of the proposed algorithm. The computational experiments, carried out on distributed and shared memory parallel machines, are reported and discussed in Sections 3, 4, and 5. Concluding remarks in Section 6 complete the paper.

2. Parallel Asynchronous Newton Method

In Ref. 8, a computational model for the implementation of asynchronous parallel algorithms on multicomputer systems has been proposed. This model can be used to apply the asynchronous idea to the Newton method. In fact, the parallel asynchronous Newton method can be viewed as composed of the following four independent tasks:

Task T1. Computation of the Hessian matrix at the current iteration point.

Task T2. Choleski factorization of the Hessian matrix.

Task T3. Generation of the search direction by solving triangular linear systems.

Task T4. Computation of the gradient vector at the current iteration point and exploration of the search direction.

Each task T_i corresponds to a set of sequential steps which execute an iterative scheme; each iteration j is defined by the input operations, the computational phase, and the output operations that communicate the results to the other tasks. A basic assumption is that the computation time of each task and the communication delays among the tasks are totally unpredictable.
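In outline, every task executes the same skeleton loop; the following sketch (Python, not from the paper; read_latest, compute, post, and terminated are hypothetical placeholders for the three phases and the stopping test) only illustrates that a task never blocks waiting for fresh input.

```python
def task_loop(inbox, outbox, compute, terminated):
    """Generic skeleton of one asynchronous task."""
    while not terminated():
        data = inbox.read_latest()    # input phase: whatever data is currently available
        result = compute(data)        # computational phase: one local iteration
        outbox.post(result)           # output phase: communicate the result to the other tasks
```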

In the following, we describe the structure of each task, the optimality test, and the convergence properties of the method.

Task T1. Task T1 computes the elements of the Hessian matrix G at the current iterate x^k to form a symmetric n × n matrix. Owing to the asynchronous interaction, it is possible that the current iterate for Task T1 may change before the computation of all the components of the matrix is completed; therefore, we obtain only a partial updating of the Hessian at any point generated by the algorithm. The components of G are computed by using the following central finite-difference approximation formulas:

∇²f(x)_{ij} = [f(x + h_i e^i + h_j e^j) - f(x + h_i e^i) - f(x + h_j e^j) + f(x)] / (h_i h_j),    (3a)

∇²f(x)_{ii} = [f(x + h_i e^i) - 2 f(x) + f(x - h_i e^i)] / h_i^2,    (3b)


where e^i (e^j) is the ith (jth) unit vector and h_i (h_j) is set to C max(1, |x_i|) (C max(1, |x_j|)), with C = 10^{-4}.
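A direct transcription of (3a) and (3b) into code might look as follows; this Python sketch is only illustrative (f is an assumed objective callable), with the constant C = 10^{-4} taken from the text.

```python
import numpy as np

C = 1.0e-4  # relative perturbation constant, as in the text

def hessian_entry(f, x, i, j):
    """Central finite-difference approximation of the (i, j) entry of the Hessian."""
    n = x.size
    ei, ej = np.eye(n)[i], np.eye(n)[j]
    hi = C * max(1.0, abs(x[i]))
    hj = C * max(1.0, abs(x[j]))
    if i == j:
        # formula (3b): second difference along coordinate i
        return (f(x + hi * ei) - 2.0 * f(x) + f(x - hi * ei)) / hi**2
    # formula (3a): mixed difference along coordinates i and j
    return (f(x + hi * ei + hj * ej) - f(x + hi * ei)
            - f(x + hj * ej) + f(x)) / (hi * hj)
```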

Task T2. Task T2 computes the Choleski factorization of the matrix G (received from Task T1), that is, G = L L^T. A unique nonsingular lower-triangular factor L will be found if and only if the matrix G is positive definite. Due to the asynchronous execution of the tasks, in general it is not possible to guarantee the positive definiteness of G at each iteration; in fact, the components of G are generally computed at different points made available over an unpredictable time period. In this case, two different strategies have been used:

(a) restarting the factorization procedure until the input matrix G is positive definite;

(b) modifying the Choleski factorization (Ref. 9) to produce a positive-definite matrix G̃ differing from the original only if it is necessary. In particular, the n × n positive-definite matrix G̃ is defined as

    G̃_{ij} = Σ_{k=1}^{i} l̃_{ik} l̃_{jk},   i ≤ j,    (4)

where

    l̃_{ij} = l_{ij}, if i ≠ j;    l̃_{ii} = max(l_{ii}, ε), if i = j;

l_{ij} is the component of L, the Choleski factor of G, and ε is the machine precision.
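The following sketch conveys the flavor of strategy (b): a hand-rolled Choleski factorization whose pivots are clamped from below, so that a usable factor is returned even when the input matrix is not positive definite. It is an illustration of the idea under our own simplifications, not the exact procedure of Ref. 9; eps plays the role of the machine-precision threshold.

```python
import numpy as np

def safeguarded_cholesky(G, eps=np.finfo(float).eps):
    """Choleski factorization with pivots clamped from below (illustrative form of strategy (b))."""
    n = G.shape[0]
    L = np.zeros_like(G, dtype=float)
    for j in range(n):
        d = G[j, j] - L[j, :j] @ L[j, :j]
        L[j, j] = np.sqrt(max(d, eps))        # clamp the pivot instead of failing
        for i in range(j + 1, n):
            L[i, j] = (G[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
    return L  # L L^T is positive definite, and equals G whenever G itself is
```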

Task T3. Task T3 takes as input the vector g from Task T4 and the triangular matrix L from Task T2. It computes the vector s by solving the system L L^T s = -g; here, s is used as the current search direction.

Task T4. On the basis of s and of the current iterate x^k, Task T4 computes the gradient vector ∇f(x^k) by the following central finite-difference approximation formula:

∇f(x)_j = [f(x + h_j e^j) - f(x - h_j e^j)] / (2 h_j)    (5)

(e^j and h_j have the same meaning as in T1), and finds the new iterate x^{k+1} by exploring the current search direction s; the output is sent to Tasks T1 and T3. The line search is carried out in such a way that the steplength produces a sufficient decrease of the objective function, which can be


satisfied by setting conditions on the value of α_k at Task T4. In particular, we can refer to the Armijo-Goldstein condition (Ref. 9),

f(x^k + α_k s^k) - f(x^k) ≤ ρ α_k ∇f(x^k)^T s^k.    (6)

A typical class of methods for computing α_k is based upon an iterative scheme, defined by the choice of the initial step α_0 and of a scalar τ used to establish the next trial estimate. We seek the value α_k as the first element of the sequence {τ_i α_0}, i = 0, 1, ..., for which the sufficient decrease condition (6) is satisfied.

In our context, we use an iterative algorithm (Ref. 6), by letting τ_i = 1/2^i, for all i = 0, 1, ..., and considering as optimality test the satisfaction of condition (6). The usual convention is to choose α_0 equal to unity, to achieve the same performance as Newton-type methods.
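In code, this steplength rule amounts to a simple halving loop; the Python sketch below is illustrative (rho and alpha0 follow the text, while max_halvings is a safeguard of our own), and it could serve as the line_search callable of the earlier Newton sketch.

```python
def armijo_steplength(f, x, g, s, rho=0.1, alpha0=1.0, max_halvings=50):
    """Return the first steplength alpha0 / 2**i that satisfies condition (6)."""
    alpha = alpha0
    for _ in range(max_halvings):
        if f(x + alpha * s) - f(x) <= rho * alpha * (g @ s):
            return alpha
        alpha *= 0.5                  # tau_i = 1/2**i
    return alpha                      # safeguard: give up after max_halvings trials
```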

We emphasize that this steplength algorithm may not be extremely efficient in terms of the reduction achieved at each iteration. It turns out to be theoretically sound if the objective functions are restricted to the simple uniformly convex model. However, we must consider the balance between the effort expended for computing a good steplength estimate at each iteration and the resulting advantage for the method. This balance essentially depends on the quality of the search direction s available to Task T4, which is not always acceptable; hence, s might not be adequate to ensure a sufficient decrease. For this reason, we prefer to adopt a quite simple line search procedure instead of more accurate techniques. The output x is instantaneously available as input for the next iteration at the same Task T4; therefore, no communication delay is involved.

2.1. Optimality Test. To ensure that the parallel algorithm terminates after a finite number of iterates generated at Task T4, we adopted the satisfaction of one of the following termination criteria:

(a) ||g(x^k)|| < ε_A;    (7a)

(b) |f(x^{k-m}) - f(x^k)| < ε_B.    (7b)

The first is classically related to establishing whether the current estimate x^k of the solution is acceptable according to the user-supplied tolerance ε_A, based on the necessary optimality condition ||g(x*)|| = 0. The second is introduced to guarantee that the computation stops when the progress of the algorithm appears to be intolerably slow.

Particular attention must be paid to the choice of the value m. If m is too small, it could stop the algorithm while it is still able to produce a suitable improvement of the solution; vice versa, if m is too large, there is the risk of producing an excessive number of trial points with no substantial progress for the descent method and, therefore, a loss of efficiency. We


have run experiments with different values of m and ε_B. We have chosen m = 30, which seems sufficient to balance the above requirements.
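As an illustration, both criteria can be packaged into a single test that keeps a short history of objective values; this Python sketch is ours (the default tolerances follow the values used later for GAPNM1, and m = 30 follows the text).

```python
from collections import deque

def make_termination_test(eps_a=1e-4, eps_b=1e-8, m=30):
    """Return a test implementing criteria (7a) and (7b)."""
    history = deque(maxlen=m + 1)            # stores f(x^{k-m}), ..., f(x^k)
    def should_stop(grad_norm, f_value):
        history.append(f_value)
        if grad_norm < eps_a:                # criterion (7a): small gradient norm
            return True
        if len(history) == m + 1 and abs(history[0] - history[-1]) < eps_b:
            return True                      # criterion (7b): negligible progress over m iterations
        return False
    return should_stop
```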

2.2. Convergence. It is difficult to obtain a general theory concerning conditions for convergence and speed of convergence, because of the asynchronous structure of the algorithm. In Ref. 6, a superlinear convergence rate has been proved under the following basic assumptions:

(A1) f(x) is uniformly convex, i.e., there exist μ and η, with 0 < μ < η, such that

μ ||x||^2 ≤ x^T ∇²f(y) x ≤ η ||x||^2,   for all x, y ∈ ℝ^n, x ≠ 0;    (8)

if ∇²f(y) is positive definite, this assumption is implied by the Lipschitz condition ||∇f(z) - ∇f(y)|| ≤ η ||z - y|| on the function ∇f(y).

(A2) For all x^0 ∈ ℝ^n, the set

S(x^0) = {x ∈ ℝ^n : f(x) ≤ f(x^0)}    (9)

is compact, with f(x), ∇f(x), and ∇²f(x) uniformly continuous on S(x^0).

(A3) An Armijo-Goldstein condition is imposed to ensure a sufficient decrease of the objective function; specifically, we seek a steplength α_k ∈ ℝ_+ such that

f(x^k + α_k s^k) - f(x^k) ≤ ρ α_k ∇f(x^k)^T s^k,    (10)

where 0 < ρ < 1/2 is a constant.

(A4) There are no bounds on the asynchronous communication delays.

The assumption on the uniform convexity of the objective function implies the existence of a unique global minimizer x*. The objective function is strictly unimodal on the set in which it is defined, with positive curvature restricted to a finite range of values. Hence, the structure of the Hessian matrix does not substantially change on the domain of the objective function.

The uniform convexity of the objective function is quite important in relation to the parallel asynchronous Newton method, because it concerns the following aspects, strongly affecting the behavior of the algorithm:

(a) it ensures the uniqueness of the solution, that is, the absence of multiple points of attraction for the asynchronous Newton iterates generated by the algorithm;

(b) global convergence is always guaranteed, even though the matrix G may not be positive definite at any given time.

With reference to (b), it is trivial to observe that the following fact is always true.


Fact 2.1. A finite time instant T > 0 exists such that every component of the Hessian approximation matrix is updated at least once.

This implies that, after a reasonable time interval, the matrix G becomes the exact Hessian at the current point and, if the function is uniformly convex, then G will be positive definite.

Regardless of the assumptions on the structure of the objective function, the proposed algorithm is characterized by the descent property. In fact, the following theorem can be stated.

Theorem 2.1. Let x^k be the current estimate of the solution at Task T4. If ∇f(x^k) ≠ 0, then the algorithm finds a new iterate x^{k+1} such that

f(x^{k+1}) < f(x^k).

Proof. We show that, for Task T4, it is always possible to get a search direction s such that ∇f(x^k)^T s < 0. In this case, the Armijo-Goldstein condition,

f(x^k + α_k s^k) - f(x^k) ≤ ρ α_k ∇f(x^k)^T s^k,

is well defined for some value of α_k > 0. If the current point does not change, by Fact 2.1, there exists a time instant T > 0 such that the vector g (input for Task T3) and the matrix G (input for Task T2) correspond respectively to

g = ∇f(x^k),    G = ∇²f(x^k).

If G is not positive definite, it is possible to replace it with a suitable positive-definite matrix G̃, so that the method is a descent algorithm with the search direction s defined by

G̃ s = -∇f(x^k);

hence, we have

∇f(x^k)^T s = -∇f(x^k)^T G̃^{-1} ∇f(x^k) < 0,   for all ∇f(x^k) ≠ 0.   □

2.3. Outline of the Algorithm. We are now ready to give a detailed description of the parallel asynchronous Newton algorithm.

Task 1. Computation of the Hessian Matrix Approximation.
Initialization: x = x_0, G = I, i = 1, j = 1.
Iteration k:

(A) Input: x (iterate).
(B) Computational Phase:
    (B1) Compute γ, the component (i, j) of the Hessian matrix approximation at x, by (3a) and (3b).


    (B2) If i > j, then j = j + 1; else, if i = j, then i = (i mod n) + 1, j = 1.

(C) Output: γ to Task T2, in place of G(i, j) and G(j, i).

Task 2. Choleski Factorization.
Initialization: L = I.
Iteration k:

(A) Input: G (Hessian matrix approximation).
(B) Computational Phase:
    (B1) First Alternative. If G > 0, compute the lower triangular matrix L such that G = L L^T; else, go to (A).
    (B2) Second Alternative. Compute the lower triangular matrix L̃ in such a way as to obtain G̃ = L̃ L̃^T, where G̃ is defined in (4).
(C) Output: L (L̃) to Task T3.

Task 3. Computation of the Search Direction.
Initialization: s = 0.
Iteration k:

(A) Input: L (Choleski factor), g (gradient approximation).
(B) Computational Phase:
    (B1) Compute w such that L w = -g.
    (B2) Compute s such that L^T s = w.
(C) Output: s to Task T4.

Task 4. Computation of the Gradient Approximation and the Next Iterate.

Initialization: y = x_0 (starting point), g = 0.
Iteration k:

(A) Input: x (iterate), s (search direction).
(B) Computational Phase:
    (B1) If x is a sufficient approximation of the solution of (1), stop.
    (B2) If (x = y) and (y ≠ x_0), go to (A).
    (B3) Compute the gradient approximation g at x by (5).
    (B4) If g^T s ≥ 0, go to (A).
    (B5) Find the smallest β ≥ 0 such that
         f(x + α s) ≤ f(x) + ρ α g^T s,
         where α = (1/2)^β and ρ, 0 < ρ < 1/2, is a fixed constant.
    (B6) x = x + α s, y = x.
(C) Output: x, g to Tasks T1, T3, T4.
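To make the wiring of the four tasks concrete, the sketch below realizes the outline with Python threads and a one-slot "latest value" mailbox standing in for the Occam buffer processes and shared-memory buffers used in the actual implementations. It is only an illustration under our own simplifications: all names are ours, Task 1 recomputes the whole Hessian per pass instead of one component at a time, and grad/hess are assumed user-supplied routines (e.g., built from the finite-difference formulas (3a), (3b), and (5)).

```python
import threading
import numpy as np

class Mailbox:
    """One-slot 'latest value' mailbox: writers overwrite, readers never block."""
    def __init__(self, value):
        self._value, self._lock = value, threading.Lock()
    def post(self, value):
        with self._lock:
            self._value = value
    def read(self):
        with self._lock:
            return self._value

def async_newton(f, grad, hess, x0, rho=0.1, eps_a=1e-4, max_iters=500):
    """Illustrative four-task asynchronous Newton method (threads plus mailboxes)."""
    n = x0.size
    x_box = Mailbox(x0.copy())        # written by Task 4, read by Task 1
    G_box = Mailbox(np.eye(n))        # written by Task 1, read by Task 2
    L_box = Mailbox(np.eye(n))        # written by Task 2, read by Task 3
    g_box = Mailbox(grad(x0))         # written by Task 4, read by Task 3
    s_box = Mailbox(np.zeros(n))      # written by Task 3, read by Task 4
    done, result = threading.Event(), {}

    def task1():                      # Hessian approximation at the latest iterate
        while not done.is_set():
            G_box.post(hess(x_box.read()))

    def task2():                      # Choleski factorization of the latest G
        while not done.is_set():
            try:
                L_box.post(np.linalg.cholesky(G_box.read()))
            except np.linalg.LinAlgError:
                pass                  # G not positive definite yet: keep the previous factor

    def task3():                      # search direction from the latest L and g
        while not done.is_set():
            L, g = L_box.read(), g_box.read()
            s_box.post(np.linalg.solve(L.T, np.linalg.solve(L, -g)))

    def task4():                      # gradient, line search, and next iterate
        x = x0.copy()
        for _ in range(max_iters):
            g = grad(x)
            g_box.post(g)
            if np.linalg.norm(g) < eps_a:
                break
            s = s_box.read()
            if g @ s >= 0:            # not a descent direction: retry with newer data
                continue
            alpha = 1.0
            for _ in range(50):       # bounded backtracking, alpha = (1/2)**beta
                if f(x + alpha * s) - f(x) <= rho * alpha * (g @ s):
                    break
                alpha *= 0.5
            x = x + alpha * s
            x_box.post(x)
        result["x"] = x
        done.set()

    workers = [threading.Thread(target=t) for t in (task1, task2, task3)]
    for w in workers:
        w.start()
    task4()
    for w in workers:
        w.join()
    return result["x"]
```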


Fig. 1. Directed graph for the parallel asynchronous Newton method.

An obvious way to represent the parallel asynchronous algorithm is a directed graph, where the nodes represent the tasks and the arcs show the data dependences and the structure of asynchronous interaction among the tasks (Fig. 1).

3. Design of Computational Experiments

The design of the computational experiments aims to assess the behavior of the parallel asynchronous algorithm in terms of computational complexity and comparison with the synchronous counterpart, obtained by assuming no communication delay for the data exchange among the tasks (synchronous interaction). In this respect, a key decision has been the choice of the parallel computing architectures. We explored the performance of the algorithm on distributed and shared memory parallel machines, concentrating in this way on the class of MIMD (multiple instruction, multiple data) parallel computing systems (Ref. 10).

Two coded versions of the parallel asynchronous algorithm have been tested.

GAPNM1. First and second derivatives are approximated by the central finite-difference formulas (3a), (3b), and (5); the Choleski factorization is used with a restart condition [strategy (a) in Task 2] in case the Hessian matrix is not positive definite; ρ = 0.1 is set in the line search to find the smallest integer β_k ≥ 0 such that

    f(x^k + α_k s^k) - f(x^k) ≤ ρ α_k ∇f(x^k)^T s^k,   for all k,

where α_k = (1/2)^{β_k}; the tolerance values in the termination conditions (7a) and (7b) are set to ε_A = 10^{-4} and ε_B = 10^{-8}, respectively.


Table 1. List of test functions.

Problem   Name                             Reference
   1      Quadratic function                   11
   2      Repeated Rosenbrock function         12
   3      Chained Rosenbrock function          12
   4      Extended Powell function             12
   5      Extended Wood function               12
   6      Variably-dimensioned function        12
   7      Trigonometric function               12
   8      Oren function                        13
   9      Brown function                       12
  10      Dixon function                       13

GAPNM2. This is the same as GAPNM1, except that the Choleski factorization is implemented as the modified Choleski factorization (4).

The numerical results have been collected with the aim of pointing out the comparison, in terms of performance, between the synchronous and asynchronous versions of the parallel algorithm: LSPNM1 versus GAPNM1, and LSPNM2 versus GAPNM2. LSPNM1 and LSPNM2 are, respectively, the synchronous versions of the GAPNM1 and GAPNM2 codes.

To test the algorithm, we have considered ten different problems with increasing dimensions. It is worth stressing that they are very ill-conditioned, so severe difficulties are faced by the algorithm. The test functions are listed in Table 1 and their explanation is given in Refs. 11-13.

4. Distributed Memory Parallel Computing System

4.1. Computational Environment. The parallel asynchronous algorithm can be efficiently implemented on distributed memory parallel computers (multicomputers). These systems offer good architectural support for the execution of parallel algorithms, both in terms of computation and communication.

In order to define a suitable computational environment, we have chosen the Occam-Transputer integrated system, which represents a highly efficient solution for implementing parallel applications, due to its flexibility and suitability for several kinds of problems.

Occam is a high-level programming language based on the concept of concurrent processes: an application in Occam can be viewed as a collection of concurrent processes, which interact by a message-passing communication mechanism.


The Transputer is a complete microcomputer integrated in a single VLSI chip. It has eight unidirectional communication links (four for input and four for output), which allow interconnection to other processors in such a way as to form distributed memory parallel systems. The Transputer architecture is designed to support many Occam concurrent processes and to provide communication channels between the processes via hardware communication links.

Unfortunately, interprocess communication in Occam is synchronous; therefore, if the communication between two parallel tasks is realized by a direct connection, it is impossible to guarantee an asynchronous exchange of messages.

In Ref. 8, an efficient solution for the design of asynchronous interaction among processes has been proposed. It is based on the definition of buffer processes (one for each pair of communicating tasks), which are executed in parallel to control the output operations of each task of the asynchronous algorithm.

To carry out the computational experiments, a mesh network of four T-800/25 MHz Transputers, each with 1 Mbyte of core memory, has been used. This choice makes it possible to map the structure of the parallel algorithm onto the architecture in the simplest way, avoiding the solution of complex routing problems.

4.2. Numerical Results. To assess the asynchronous idea for the parallel Newton algorithm, numerical experiments have been carried out on the basis of two objective functions (Problems 2 and 4). In particular, the repeated Rosenbrock and extended Powell functions are suitable for estimating the robustness of the asynchronous parallel algorithm. In addition, by increasing the number of variables, it is possible to examine how the asynchronism behaves as the dimension of the problems varies.

Due to the memory size of the Transputer systems, we chose the dimension of the problems in the range [20, 100]. Furthermore, we compare the performance of the asynchronous version GAPNM1 with the synchronous counterpart LSPNM1.

The results obtained are represented in Figs. 2 and 3, where the speed-up values S = T_syn / T_asyn are reported as functions of the dimension n of the problems.

4.3. Discussion. Even though we have not obtained remarkable gains from the asynchronous execution of the code with respect to the synchronous counterpart, we observe that the computational cost of GAPNM1 tends to become lower than that of the synchronous version. This is especially noticeable when the dimension of the problems becomes larger. We can conclude that, when the dimension n is sufficiently large, there are indications that


Fig. 2. Speed-up values for Test Problem 2 as a function of the problem dimension.

realizing an asynchronous version of the parallel Newton method is beneficial with respect to the synchronous counterpart, at least for some challenging classes of problems.

Using the shared memory parallel computing system, we are able to extend the set of test problems and the relevant dimensions.

5. Shared Memory Parallel Computing System

5.1. Computational Environment. On shared memory parallel computers (multiprocessors), it is worth noting that the main problem is related to the definition of an efficient and computer-independent communication mechanism. In multiprocessors, the interaction among tasks is realized through the global memory, which is visible to all the processors. The

Fig. 3. Speed-up values for Test Problem 4 as a function of the problem dimension.


exchange of information is essentially based on a read procedure of data from the memory locations (executed by the consumer task during the input phase of each iteration) and a write procedure, performed during the output operation, through which the producer task modifies a subset of the global variables.

To ensure the correctness of the algorithm, the writing operations on global variables must be coded so as to guarantee private use of shared variables, in order to prevent accidental concurrent access to inconsistent information by the reading processes. However, due to the asynchronous nature of the communication, the tasks must not be blocked from accessing shared variables.

We have solved this problem by providing a message buffer with a very limited number of memory locations, in such a way that the same memory location may not be simultaneously accessed by the producer and the consumer tasks. In particular, it has been shown (Ref. 14) that a message buffer consisting of three memory locations is sufficient to manage the asynchronous communication between each pair of parallel tasks.
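A sketch of one way to realize such a non-blocking three-slot exchange is given below (Python, single producer and single consumer). It is our own illustration, not the exact mechanism of Ref. 14: the writer always fills a slot that is neither the newest complete message nor the one being read, so neither side ever waits for the other.

```python
import threading

class TripleBuffer:
    """Three-slot message buffer for one asynchronous producer/consumer pair."""
    def __init__(self, initial):
        self._slots = [initial, initial, initial]
        self._latest = 0                 # slot holding the newest complete message
        self._reading = 1                # slot currently pinned by the reader
        self._lock = threading.Lock()    # protects the two indices only

    def write(self, message):            # producer side: never blocks on the reader
        with self._lock:
            if self._latest != self._reading:
                free = 3 - self._latest - self._reading   # the remaining third slot
            else:
                free = (self._latest + 1) % 3
        self._slots[free] = message      # the chosen slot is touched by nobody else
        with self._lock:
            self._latest = free          # publish the new message

    def read(self):                      # consumer side: always gets the newest message
        with self._lock:
            self._reading = self._latest
        return self._slots[self._reading]
```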

The computational environment used is based on an Alliant FX/80, a vector-parallel architecture with four vector processors (each with 23 Mflops of peak performance) sharing a common memory of 32 Mbytes. The operating system is Concentrix 5.6, an extension of Unix BSD 4.3. The compiler is FX/Fortran 4.2.40, which is also able to automatically produce object code capable of fully exploiting the vector-parallel potential of the supercomputer.

5.2. Numerical Results. Computational experiments have been carried out on the test problems with the number of variables varying according to the following values: 100, 250, 400, 550, 700. In the synchronous case, the parallelization is automatically obtained by exploiting the optimization facilities provided by the FX/Fortran compiler.

To compare the different results of the asynchronous and the synchronous codes, the following index of performance is used (see Ref. 15):

    R = Σ_{i=1}^{q} R_i / q,

where q is the number of test problems and

    R_i = T_syn / T_asyn,       if T_syn ≤ T_asyn,
    R_i = 2 - T_asyn / T_syn,   if T_syn > T_asyn,
    R_i = 1,                    if the two codes find different local minima,
    R_i = 0,                    if the asynchronous code fails,
    R_i = 2,                    if the synchronous code fails,


Table 2. Performance indexes for LSPNM1 versus GAPNM1.

 R_i     n = 100   n = 250   n = 400   n = 550   n = 700
 R_1      0.957     0.928     0.834     0.932     0.684
 R_2      1.923     1.863     1.831     1.835     1.805
 R_3      1         1         1         1         1
 R_4      1.789     1.715     1.712     1.654     1.643
 R_5      1         1         1         1         1
 R_6      1.955     1.993     1.992     1.996     1.996
 R_7      1.780     1.986     1.991     1.991     1.992
 R_8      1.904     1.883     1.906     1.941     1.930
 R_9      1.991     1.994     1.989     1.992     1.995
 R_10     1         1         1         1         1
 R        1.534     1.536     1.526     1.534     1.504

where T_syn and T_asyn are, respectively, the CPU times in seconds of the synchronous and asynchronous codes.

On the basis of the above definition, R and R_i are real numbers in the range [0, 2], and their values show how much one code is better than the other with respect to the CPU time. For values in [0, 1), the synchronous codes are superior to the asynchronous codes, and values near 0 indicate a much better performance; on the contrary, for values in (1, 2], the asynchronous codes perform better, and values approaching 2 indicate a much better performance. R or R_i equal to 1 means the same performance, or conventionally the case in which the codes find different local minima. In case of

Table 3. Performance indexes for LSPNM2 versus GAPNM2.

 R_i     n = 100   n = 250   n = 400   n = 550   n = 700
 R_1      0.975     1.164     0.827     0.796     0.717
 R_2      1.924     1.877     1.830     1.827     1.808
 R_3      1         1         1         1         1
 R_4      1.782     1.729     1.689     1.632     1.637
 R_5      1         1         1         1         1
 R_6      1.988     1.993     1.993     1.996     1.996
 R_7      1.899     1.983     1.992     1.993     1.993
 R_8      1.920     1.887     1.893     1.930     1.930
 R_9      1.511     1.741     1.871     1.913     1.930
 R_10     1         1         1         1         1
 R        1.500     1.537     1.509     1.508     1.501


failure of the asynchronous code, R_i is equal to 0, whereas R_i = 2 means that the synchronous code fails.

The obtained values of R and R_i are shown in Tables 2 and 3.
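For reference, the scoring rule above translates directly into a small function; the Python sketch below is ours (the None convention for marking a failed run is an assumption, not part of the paper).

```python
def performance_index(t_syn, t_asyn, different_minima=False):
    """Performance index R_i for one test problem (None marks a failed run)."""
    if different_minima:
        return 1.0
    if t_asyn is None:                  # the asynchronous code fails
        return 0.0
    if t_syn is None:                   # the synchronous code fails
        return 2.0
    if t_syn <= t_asyn:
        return t_syn / t_asyn           # in (0, 1]: the synchronous code is faster
    return 2.0 - t_asyn / t_syn         # in (1, 2): the asynchronous code is faster

# The overall index is R = sum of the R_i over the q test problems, divided by q.
```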

5.3. Discussion. On the basis of the numerical results and their comparison, the following conclusions can be drawn.

(a) The numerical implementations of the parallel asynchronous Newton method on the shared memory parallel computing environment are satisfactory from the convergence point of view. The asynchronous algorithms have been able to find a local solution for all test problems. We observe that the test problems, except for the quadratic convex function, do not fit the class of uniformly convex functions. This means that the proposed asynchronous implementation strategy is effective and useful for more general classes of problems.

(b) The computational performances of the parallel asynchronous algorithm are strongly dependent upon the analytical structure of the problems, more so than those of their synchronous counterparts. In fact, significant differences in performance exist among the several test problems. The efficiency of the asynchronous algorithms is directly related to some features of the problems, which are quite difficult to identify because of the random behavior caused by the asynchronism.

(c) The benefits of the parallel asynchronous implementations are insensitive to the dimension of the problems. In fact, nearly the same performance has been obtained for the different dimensions considered in the computational experiments. This means that the asynchronism can be usefully exploited also in the case of medium-scale problems.

(d) From the performance point of view, the two implementation strategies, proposed to guarantee that the Hessian matrix is positive definite, have shown the same behavior. Hence, it does not seem worthwhile to force the positive definiteness of the approximation of the Hessian matrix at each iteration.

(e) In the case of the quadratic convex function, the better performance of the synchronous version of the algorithm confirms the superlinear convergence rate of the asynchronous version in the case of uniformly convex functions (Ref. 6). The synchronous version is characterized by a quadratic convergence rate on these functions.

6. Concluding Remarks

Asynchronous parallel implementations of the Newton algorithm for solving nonlinear optimization problems have been presented. The


algorithm has been developed and tested on distributed and shared memory parallel computing systems, with the aim of investigating the numerical performance in different computational environments.

On the basis of the numerical results, we conclude that the proposed asynchronous algorithm turns out to be preferable to the synchronous counterpart for medium and large-scale problems, in the context of both distributed and shared memory parallel computing systems. In spite of the assumptions on the objective function given in Ref. 6 to guarantee the convergence of the method, we conclude that the proposed asynchronous algorithm converges to local minima for a larger class of objective functions.

References

1. DENNIS, J. E., and SCHNABEL, R. B., Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice-Hall, Englewood Cliffs, New Jersey, 1983.

2. LOOTSMA, F. A., and RAGSDELL, K. M., State of the Art in Parallel Nonlinear Optimization, Parallel Computing, Vol. 6, pp. 133-155, 1988.

3. LOOTSMA, F. A., Parallel Newton-Raphson Methods for Unconstrained Minimization with Asynchronous Updates of the Hessian Matrix or Its Inverse, Delft University of Technology, Faculty of Technical Mathematics and Informatics, Report No. 91-02, 1991.

4. NASH, S. G., and SOFER, A., A General-Purpose Parallel Algorithm for Unconstrained Optimization, SIAM Journal on Optimization, Vol. 1, pp. 530-547, 1991.

5. ZENIOS, S. A., and PINAR, M. C., Parallel Block-Partitioning of Truncated Newton for Nonlinear Network Optimization, University of Pennsylvania, The Wharton School, Decision Sciences Department, Report No. 89-09-08, 1989.

6. FISCHER, H., and RITTER, K., An Asynchronous Parallel Newton Method, Mathematical Programming, Vol. 42, pp. 363-374, 1988.

7. BERTSEKAS, D. P., and TSITSIKLIS, J. N., Parallel and Distributed Computation, Prentice-Hall, Englewood Cliffs, New Jersey, 1989.

8. CONFORTI, D., GRANDINETTI, L., MUSMANNO, R., CANNATARO, M., SPEZZANO, G., and TALIA, D., A Model of Efficient Asynchronous Parallel Algorithms on Multicomputer Systems, Parallel Computing, Vol. 18, pp. 31-45, 1992.

9. GILL, P. E., MURRAY, W., and WRIGHT, M. H., Practical Optimization, Academic Press, New York, New York, 1981.

10. HOCKNEY, R. W., and JESSHOPE, C. R., Parallel Computers 2, Adam Hilger, Bristol, England, 1988.

11. MORÉ, J. J., GARBOW, B. S., and HILLSTROM, K. E., Testing Unconstrained Optimization Software, ACM Transactions on Mathematical Software, Vol. 7, pp. 17-41, 1981.


12. CONFORTI, D., GRANDINETTI, L., and MUSMANNO, R., Parallel Asynchronous Quasi-Newton Algorithm for Nonlinear Optimization Problems, Proceedings of the IMACS-IFAC International Symposium on Parallel and Distributed Computing in Engineering Systems, Edited by S. Tzafestas, P. Borne, and L. Grandinetti, Elsevier Science Publishers, Amsterdam, The Netherlands, pp. 161-167, 1992.

13. GRIPPO, L., LAMPARIELLO, F., and LUCIDI, S., A Truncated Newton Method with Nonmonotone Line Search for Unconstrained Optimization, Journal of Optimization Theory and Applications, Vol. 60, pp. 401-419, 1989.

14. CONFORTI, D., GRANDINETTI, L., and MUSMANNO, R., Optimal Decision Making for an Asynchronous Communication Mechanism on Shared Memory Computer Architectures, Università della Calabria, Dipartimento di Elettronica, Informatica e Sistemistica, Report No. 103, 1991.

15. AL-BAALI, M., A Rule for Comparing Two Methods in Practical Optimization, Università della Calabria, Dipartimento di Elettronica, Informatica e Sistemistica, Report No. 119, 1991.