INTERNATIONAL JOURNAL FOR NUMERICAL METHODS IN ENGINEERING, VOL. 26, 857-873 (1988)

A PARALLEL HOUSEHOLDER TRIDIAGONALIZATION STRATAGEM USING SCATTERED ROW DECOMPOSITION*

H. Y. CHANG†
Department of Civil and Architectural Engineering, University of Miami, Coral Gables, Florida 33124, U.S.A.

S. UTKU‡
Department of Civil Engineering and Department of Computer Science, Duke University, Durham, North Carolina 27706, U.S.A.

M. SALAMA§ AND D. RAPP¶
Jet Propulsion Laboratory, California Institute of Technology, Pasadena, California 91109, U.S.A.

*This paper presents one phase of research carried out at the Jet Propulsion Laboratory, California Institute of Technology, under contract NAS7-100 sponsored by NASA.
†Assistant Professor  ‡Professor  §Member of Research Technical Staff  ¶Senior Research Scientist

0029-5981/88/040857-17$08.50 © 1988 by John Wiley & Sons, Ltd.
Received 16 March 1987
Revised 29 June 1987

SUMMARY

Householder's method for tridiagonalizing a real symmetric matrix, a major step in evaluating the eigenvalues of the matrix, is modified into a parallel algorithm for a concurrent machine of the message-passing type. Each processor of the concurrent machine has its own CPU, communications control and local memory. Messages are passed through connections between processors. Although the basic algorithm is inherently serial, the computations can be spread over all processors by scattering different rows of the matrix into the processors, hence the term 'scattered row decomposition'. The steps in the serial and the parallel algorithms are identified. Expressions for efficiency and speedup are given in terms of problem and machine parameters. For a concurrent machine with a ring-type interconnection, a selected representative problem of large order exhibits efficiency approaching 66 per cent.

1. INTRODUCTION

Complex structures have many degrees of freedom, and as a consequence the dynamic analysis of such structures is very computer intensive. In the future, the use of controlled flexible structures for antennae will require even greater computational resources. One of the major steps in a dynamic analysis is the evaluation of the eigenvalues and eigenvectors of a large real symmetric matrix.

In the solution of algebraic eigenvalue problems, it is always beneficial to reduce the nth order matrix by similarity transformations into upper Hessenberg form first, and then apply the preferred iteration process. (When the matrix is full



initially this takes an effort of less than n³; however, in the ensuing iteration process the cost of each iteration step may drop from n³ to n², or from n³ to n if the matrix is symmetric.) In the case of real symmetric matrices the Hessenberg form is the tridiagonal matrix. One of the most versatile and effective methods for obtaining the upper Hessenberg form is the Householder method. After tridiagonalization, the eigenvalues and eigenvectors may be evaluated by several techniques.

The tridiagonalization step requires a significant fraction of the computational resources for the whole process of eigenvalue extraction, and the cost of doing large calculations on sequential machines has become prohibitive. It is believed that the advent of inexpensive concurrent processing arrays will facilitate the dynamic analysis of complex structures. To make use of such computer hardware, however, methods must be developed for efficient computation on concurrent machines. The objective of this paper is to present an efficient parallel form of the Householder algorithm for the tridiagonalization of a large real symmetric matrix A of order n on a concurrent machine.

In the Householder method, (n - 2) successive similarity transformations are performed, using Hermitian unitary elementary matrices, to reach the final tridiagonal form [1, 2]. Each such transformation annihilates all elements in a particular row and the corresponding column, except for the tridiagonal elements. These zeroed elements also remain zero in the succeeding transformations. Unfortunately, the sequence of transformations is inherently serial: each transformation acts on elements which are the result of previous transformations. Therefore our strategy must be to exploit the potential concurrency within each transformation by distributing the computational load over as many processors as possible without incurring excessive data transmission time.

There are three fundamental potential sources of inefficiency in converting a sequential program to a concurrent algorithm:

(1) extra computations in the concurrent algorithm;
(2) idle processors (lack of load balance);
(3) communication between processors.

In the method we propose, extra computations are negligible, so this is not a problem. However,

there are difficulties in load balancing, because the natural way to distribute the load is to store several rows per processor. As the calculation proceeds, the upper rows require no further change, and processors containing such rows tend to remain idle after a certain point. Idleness can be alleviated to a great extent by storing some upper and some lower rows in the same processor.

The starting point for the method is the usual sequential Householder technique for a uniprocessor. This method is reviewed in Section 2. The outline of a concurrent stratagem is presented in Section 3. The performance of this stratagem is discussed in Section 4 in terms of the estimated speedup and efficiency of the method. Section 5 describes an alternative stratagem. The last section lists the conclusions resulting from this study.

2. SERIAL TRIDIAGONALIZATION

Householder's method uses (n - 2) successive transformations to reduce an nth order real symmetric matrix A to tridiagonal form. Hermitian unitary elementary matrices are used in the similarity transformations. Let P_k denote the kth such transformation matrix, and let A_{k+1} denote the resultant matrix after applying the kth transformation to A_k:

A_{k+1} = P_k A_k P_k   (1)

HOUSEHOLDER TRIDIAGONALIZATION STRATAGEM 859

where A_1 is the original matrix. The particular form chosen for P_k is [1]

P_k = | I^(k)    0          |
      | 0        I - 2ww^H  |   (2)

where

w^H w = 1   (3)

It may easily be verified that P_k = P_k^H = P_k^{-1}. The effect of the transformation of equation (1) on A_k is to produce a matrix A_{k+1} for which all rows and columns 1, 2, ..., k are in tridiagonal form. Further applications of the transformation do not affect previously tridiagonalized rows and columns. (I - 2ww^H) is a Hermitian unitary elementary matrix of order (n - k), and I^(k) is an identity matrix of order k. The elements of w can be obtained by manipulation of the elements of the kth row (or column) of A_k. Because similarity transformations are used, the final tridiagonalized matrix A_{n-2} has the same eigenvalues as the original matrix A_1 = A. An effective method for deriving A_{k+1} from A_k is the following procedure [3]:

A_{k+1} = A_k - v z^T - z v^T   (4)

where

v = 2r w,   R = 2r²   (5)

and

z = u - (w^T u)w,   u = (1/r) A_k w   (6)

The procedure for the choice of v and R is detailed in the algorithm shown in Table I. In the algorithm, only the upper triangle of A need be considered, because all the A_k are symmetric. The algorithm transforms the input matrix row by row, starting from the first row and moving downward.

The vectors v, u and z require separate memory locations in the kth transformation. The first k entries of these vectors are not needed at all in the kth transformation. If memory is limited, one can store z in the same locations that held u. Furthermore, v can be stored in the kth row of A beyond the diagonal element. In the kth transformation, -T is the value of the superdiagonal (or subdiagonal) element in the kth row. (If we delete the last instruction in Step D and store -T in another array, the transformation matrices may easily be recovered, because vector v is saved in the kth row, excluding the diagonal element, of the upper triangle of A.)

The parallel stratagem given in the next section is derived from the above algorithm, with the aim of keeping idle processors and communications to a reasonably low level. For clarity, the notation will be the same.

3. CONCURRENT STRATAGEM

Concurrency can be introduced into Steps C, E, F, G and H of the algorithm presented in Table I. If the size of the input matrix is very large, Step E will be the most time consuming step, followed by Step H.

The first step in designing a parallel algorithm is to decide on a scheme to distribute the elements of the matrix over the processors. Since Step E, the most time consuming step, involves multiplication of an (n - k) x (n - k) matrix and an (n - k) vector, the natural way to apply Householder's algorithm on a concurrent machine is to distribute rows of the matrix among the processors. In the kth transformation, the elements of rows k to n undergo change. To a considerable extent, one can calculate the changes in various rows, in parallel, in various


Table I. Serial Householder's algorithm in pseudocode format

Step   Operation                                                              Comment

(A)    For k = 1, ..., n - 2 do Steps B to H
(B)    Set k1 ← k + 1
(C)    Set T ← (Σ_{i=k1}^{n} a_{k,i}²)^{1/2}                                  T = Euclidean norm of the last (n - k) elements of row k
(D)    If a_{k,k1} < 0 then T ← -T
       Set a_{k,k1} ← a_{k,k1} + T
       Set R ← a_{k,k1} × T                                                   R = 2r², w = (1/2r)v
       Set v_i ← a_{k,i} for i = k1, ..., n
       Set a_{k,k1} ← -T
(E)    For i = k1, ..., n set u_i ← Σ_{j=i}^{n} (a_{i,j} × v_j)               u = (1/R)A_k v = (1/r)A_k w
       For i = k + 2, ..., n set u_i ← u_i + Σ_{j=k1}^{i-1} (a_{j,i} × v_j)
       For i = k1, ..., n set u_i ← (1/R) u_i
(F)    Set C ← Σ_{i=k1}^{n} (v_i × u_i)                                       C = v^T u
(G)    For i = k1, ..., n set z_i ← u_i - (C/2R) × v_i                        z = u - (v^T u/2R)v
(H)    For i = k1, ..., n, j = i, ..., n set a_{i,j} ← a_{i,j} - v_i z_j - z_i v_j    A_{k+1} = A_k - v z^T - z v^T
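The pseudocode of Table I is language-agnostic; the following is a minimal Python sketch of the serial algorithm operating on a full symmetric matrix stored as a list of lists. The function name, the explicit zeroing of the annihilated entries of row k (which the in-place storage scheme above leaves holding v), and the skip of an already-reduced row are our own choices, not part of the paper.

```python
import math

def householder_tridiagonalize(a):
    """Serial Householder tridiagonalization following Table I.
    `a` is a full symmetric matrix (list of lists of floats); it is
    reduced in place to tridiagonal form by n - 2 transformations."""
    n = len(a)
    for k in range(n - 2):            # Step A (0-based k)
        k1 = k + 1                    # Step B
        # Step C: T = Euclidean norm of the last n-k-1 elements of row k
        t = math.sqrt(sum(a[k][i] ** 2 for i in range(k1, n)))
        if t == 0.0:
            continue                  # row already in tridiagonal form
        # Step D: fix the sign of T, form v and R, set the superdiagonal
        if a[k][k1] < 0.0:
            t = -t
        v = [0.0] * n
        for i in range(k1, n):
            v[i] = a[k][i]
        v[k1] = a[k][k1] + t
        r_big = v[k1] * t             # R = 2r^2
        a[k][k1] = a[k1][k] = -t
        for j in range(k1 + 1, n):
            a[k][j] = a[j][k] = 0.0   # annihilated entries of row/column k
        # Step E: u = (1/R) A_k v (trailing block only)
        u = [0.0] * n
        for i in range(k1, n):
            u[i] = sum(a[i][j] * v[j] for j in range(k1, n)) / r_big
        # Step F: C = v^T u
        c = sum(v[i] * u[i] for i in range(k1, n))
        # Step G: z = u - (C/2R) v
        z = [u[i] - (c / (2.0 * r_big)) * v[i] for i in range(n)]
        # Step H: A_{k+1} = A_k - v z^T - z v^T on the trailing block
        for i in range(k1, n):
            for j in range(k1, n):
                a[i][j] -= v[i] * z[j] + z[i] * v[j]
    return a
```

Because only orthogonal similarity transformations are applied, the trace and the Frobenius norm of the matrix are preserved, which gives a convenient correctness check.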

processors. However, if one naively stored groups of rows in processors sequentially, starting at the top of the matrix, then as the tridiagonalization progresses the processors containing upper rows would be idle while the processors containing lower rows were still active. Therefore, one should distribute upper, middle and lower rows among the processors to balance the computing loads.

In general, let the order of matrix A be n, and let N be the number of processors. Consider the special case where n = 2N, so that two rows must be stored per processor. In this case, a reasonable plan is to store the following pairs of rows in separate processors: (1, n), (2, n - 1), ..., (n/2, n/2 + 1). We will call this storing scheme scattered row decomposition of type 1. Note that, since A is symmetric, only the upper half (including diagonals) of the matrix is needed. The number of storage locations required in each processor for its share of A is then n + 1. During the course of a transformation, the processor loads are not balanced. For k = 1, the processor containing rows (1, n) has a light load; however, all the other processors are in balance. As k increases, the imbalance


between processors increases. However, the higher values of k involve less calculation. As a simple example, consider the case where n = 8 and N = 4. The computation complexities, in terms of the number of elements to be recalculated during each transformation in each processor, are shown in Table II. The total number of serial elements to be recalculated is (9 + 9 + 9 + 4 + 3 + 2) = 36. In a uniprocessor, the six transformations would require (7 + ... + 1) + (6 + ... + 1) + ... + (2 + 1) serial elements to be recalculated, which adds up to 83 elements. The speedup is therefore about 83/36 = 2.3 and the efficiency is about 57 per cent.

An alternative method for storing two rows per processor is to store rows (1, n/2 + 1), (2, n/2 + 2), ..., (n/2, n) in separate processors. We will call this storing scheme scattered row decomposition of type 2. The storage requirements differ from processor to processor, but there is an increase in efficiency. Again for the case of n = 8 and N = 4, the computation complexities in terms of the number of elements to be recalculated during each transformation in each processor are shown in Table III. Here, the number of serial elements to be changed is (10 + 8 + 6 + 4 + 3 + 2) = 33, the speedup is about 83/33 = 2.5 and the efficiency is about 63 per cent. It can easily be shown that the efficiency increases substantially as the number of rows per processor increases. For example, if n = 6 and N = 2, so that there are three rows per processor, storage of rows (1, 3, 5) in one processor and (2, 4, 6) in the other gives an efficiency of over 80 per cent.

Table II. Example of operation complexities for the case of n = 8, N = 4 and row scattering type 1

Processor           PE#1   PE#2   PE#3   PE#4
Row assignment      1,8    2,7    3,6    4,5

Number of matrix elements to be recalculated

Transformation 1     1      9      9      9
               2     1      2      9      9
               3     1      2      3      9
               4     1      2      3      4
               5     1      2      3      0
               6     1      2      0      0

Table III. Example of computation complexities for the case of n = 8, N = 4 and row scattering type 2

Processor           PE#1   PE#2   PE#3   PE#4
Row assignment      1,5    2,6    3,7    4,8

Number of matrix elements to be recalculated

Transformation 1     4     10      8      6
               2     4      3      8      6
               3     4      3      2      6
               4     4      3      2      1
               5     0      3      2      1
               6     0      0      2      1
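The operation counts of Tables II and III can be checked with a short script. This is an illustrative sketch (the function name and the max-per-processor timing model are ours); it transcribes the counting rule used in the text: during transformation k, a stored row i > k contributes its n - i + 1 upper-triangle elements, and the parallel time for a transformation is set by the slowest processor.

```python
def recalculated_elements(n, assignment):
    """For each transformation k = 1..n-2, return the largest number of
    upper-triangle elements any one processor must recalculate.
    `assignment` lists, per processor, the global rows it holds."""
    totals = []
    for k in range(1, n - 1):
        per_pe = [sum(n - i + 1 for i in rows if i > k) for rows in assignment]
        totals.append(max(per_pe))  # the slowest processor sets the pace
    return totals

n, N = 8, 4
type1 = [(1, 8), (2, 7), (3, 6), (4, 5)]   # scattered row decomposition, type 1
type2 = [(1, 5), (2, 6), (3, 7), (4, 8)]   # scattered row decomposition, type 2
# serial count: every changed upper-triangle element, one processor
serial = sum(sum(n - i + 1 for i in range(k + 1, n + 1)) for k in range(1, n - 1))
```

For n = 8 this reproduces the per-transformation maxima of Tables II and III, the totals 36 (type 1) and 33 (type 2), the serial count 83, and hence speedups of roughly 2.3 and 2.5.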


Other schemes of decomposing the matrix by scattering rows into processors were examined. The decomposition of type 2 was found to be the most efficient one. It also results in simple computer code. Because of this advantage, in the sequel, the term ‘scattered row decomposition’ means the decomposition of type 2.

In the general case of arbitrary n and N, where n > 2N, the following procedure seems appropriate. It is required that n be an integer multiple of N. If it is not, one may increase n by adding zero rows with unit diagonal entries to matrix A. Thus, one may write

s = n/N   (7)

where s is an integer.

Each of the N processors is assumed to bear an identification number p. The processors are represented by PE#i, i = 1, ..., N. We assign

rows 1, 1 + N, ..., 1 + (s - 1)N to PE#1
rows 2, 2 + N, ..., 2 + (s - 1)N to PE#2
. . .
rows N, 2N, ..., sN to PE#N

In more compact form, let rows (l + iN) be assigned to PE#l, where l varies from 1 to N and i varies from 0 to s - 1.

In the examples given previously, we stored only the upper triangle of A in the processors. That is sufficient to carry through the computation but, as will be shown later, it does not allow us to maximize concurrency. If entire rows are stored in the processors (not just the upper triangular parts), it is possible to improve the concurrency of the calculation. We shall assume that sufficient memory exists to store entire rows. Under this data assignment, each individual processor holds s rows of A; each processor therefore requires memory for an s x n matrix. Some extra storage is also needed for the vectors v and z and for part of u.

We will assume there is a local variable B in each processor; B is an s x n matrix that hosts the s rows of A assigned to that processor. Taking Figure 1 as an example, the first and fifth rows of A become the first and second rows of B in PE#1, the second and sixth rows of A become the first and second rows of B in PE#2, and so on.

In many stages of the stratagem, one needs to identify the processor holding an arbitrary row of A. This is done by the function

M(i) = MOD(i - 1, N) + 1   (8)

where MOD gives the remainder after dividing the first argument by the second. By using equation (8), we can easily identify the processor holding matrix element a_{i,j} as PE#(M(i)).

The relation between the input matrix A and the local matrix B in each processor is established next. Let

I(i) = INT((i - 1)/N) + 1   (9)

where INT represents integer division. An arbitrary matrix element a_{i,j} may thus be found as b_{I(i),j} in PE#(M(i)).

HOUSEHOLDER TRIDIAGONALIZATION STRATAGEM 863

[Figure omitted: an 8 x 8 array of matrix elements, with each row labelled by the processor to which it is assigned.]

Figure 1. Scattered row decomposition for the case of n = 8 and N = 4

On the other hand, elements of B may be mapped back into A through

D(i) = (i - 1)N + p   (10)

where p is the identification number. Thus, b_{i,j} actually represents a_{D(i),j}.
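Equations (8)-(10) translate directly into code. A small sketch follows (the function names are ours; indices are 1-based, as in the paper, and Python's `%` and `//` agree with MOD and INT for positive arguments):

```python
def M(i, N):
    """PE number holding global row i (equation (8))."""
    return (i - 1) % N + 1

def I(i, N):
    """Local row index of global row i within its PE (equation (9))."""
    return (i - 1) // N + 1

def D(i, N, p):
    """Global row represented by local row i of PE p (equation (10))."""
    return (i - 1) * N + p
```

For n = 8 and N = 4, row 5 is local row I(5) = 2 of PE#(M(5)) = PE#1, matching Figure 1; D is the inverse map, so D(I(i), N, M(i)) = i for every row i.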

There are (n - 2) successive transformations in the concurrent algorithm, with the index k = 1 to (n - 2) defining which row is being tridiagonalized. The kth transformation will now be presented. The kth transformation starts by calculating the Euclidean norm of the last (n - k) elements of the kth column of A, as in Step C of Table I. This is done concurrently in all processors. Each processor will calculate

T_p = Σ_{i=1}^{s} (b_{i,k})²,   D(i) ≥ k1   (11)

One may observe that each b_{i,k} in equation (11) may be mapped back to a corresponding element in the kth column of A. This process is not 100 per cent efficient, because the numbers of elements which need operation in the various processors will not be balanced. However, with the method used to distribute rows over the processors, it should not be far out of balance, and the efficiency is expected to be closer to 100 than to 50 per cent.

The sums of squares of elements in each processor must now be combined into one sum. In general this can be achieved in log₂ N steps. The simplest way to do this is to send each sum to a single processor for addition. T is then obtained by taking the square root of the final sum in that processor:

T = (Σ_{p=1}^{N} T_p)^{1/2}   (12)
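The log₂ N combination of partial sums mentioned above can be sketched as a recursive-halving reduction. This is an illustrative model of the arithmetic only (the function name and halving pattern are ours), not of the actual message passing on any particular interconnection:

```python
def tree_sum(partials):
    """Combine per-processor partial sums in ceil(log2 N) exchange
    rounds: in each round, each processor in the upper half of the
    surviving set sends its value to a partner in the lower half."""
    vals = list(partials)
    n = len(vals)
    while n > 1:
        half = (n + 1) // 2
        for i in range(half, n):
            vals[i - half] += vals[i]   # partner i sends to i - half
        n = half
    return vals[0]
```

With N processors this takes ceil(log₂ N) rounds, versus N - 1 sequential receptions if every partial sum is sent to one processor.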

Note that if only the upper triangular parts of the rows are stored in the processors, T may be calculated in the one processor that holds row k. All the other processors will be idle.


The next step is Step D of the serial algorithm: the formation of the scalar R and the vector v. Since T is calculated in PE#(M(k)), R is easily obtained from T and a_{k,k1}. Vector v can be formed from the last (n - k) elements of row k of A. The vector v is formed such that its k1th entry has the largest magnitude possible. Note that the first k entries are no longer needed. The last (n - k) components of v and the scalar R will be transmitted to all the other processors in order to carry out Step E of the serial algorithm. In Step E, the lower right (n - k) x (n - k) submatrix of A is multiplied by the last (n - k) elements of v. This is done concurrently: each processor sequentially multiplies each of its rows of A (last n - k elements only) by the last (n - k) elements of v, at the same time as all the other processors. Therefore,

u_{D(i)} = (1/R) Σ_{j=k1}^{n} b_{i,j} v_j,   i = 1, ..., s,   D(i) ≥ k1   (13)

Again, each b_{i,j} in the above equation is actually an element of the lower right (n - k) x (n - k) square of A.

Although the loads are not balanced, because of the non-uniform numbers of rows which undergo change in each processor, the loads approach balance as the number of rows per processor increases. The product of a row of A with v produces one element of u after division by R. If only the upper triangular parts of the rows were stored in the processors, the calculation of any element of u would involve a number of processors; as a result, it would increase the load imbalance and require complicated data communication. For this reason, entire rows are stored in the processors.

With the elements of the vector u distributed over the processors with the same indices as the rows of A, it is now desired to calculate the constant C (as in Step F of the serial program given in Table I). To do this, we multiply corresponding elements of v and u and sum the products, in the various processors, concurrently. The task of each processor is

C_p = Σ_{i=1}^{s} v_{D(i)} u_{D(i)},   D(i) > k   (14)

Then, the sums are passed to a single processor for addition, as before when calculating T. (A more efficient process involving alternating transmissions and additions is not implemented here.) Thus,

C = Σ_{p=1}^{N} C_p   (15)

In addition, the same processor will compute the constant Y:

Y = C/(2R)   (16)

Then Y must be broadcast to all the other processors so that concurrent calculation of the last (n - k) elements of z can be effected:

z_{D(i)} = u_{D(i)} - Y v_{D(i)},   i = 1, ..., s,   D(i) ≥ k1   (17)

The final step in the kth transformation is the calculation of the last (n - k) rows of A_{k+1} from A_k, as in Step H of Table I. To do this, the last (n - k) elements of z must be available to all processors; thus, each processor must broadcast its part of z to all the other processors. The components of the upper triangle, including diagonals, of the transformed (n - k) x (n - k) matrix


will first be computed in all processors. The task assigned to each processor is the calculation

b_{i,j} ← b_{i,j} - v_{D(i)} z_j - z_{D(i)} v_j,   j = D(i), ..., n,   D(i) ≥ k1   (18)

The components of the lower triangle, excluding diagonals, of the transformed (n - k) x (n - k) matrix are also updated for the next transformation from the results of equation (18), with the proper data transmission between processors. After modifying the (n - k) x (n - k) matrix, the kth transformation is completed.

The pseudocode of the parallel stratagem described above is given in Table IV. In the pseudocode the loading and unloading are not included; it is assumed that the data assignment is done as in the example shown in Figure 1 before any operation starts. Note that the communication in Step 16 of Table IV is done as follows: each processor transmits any element b_{i,j} whose indices have the property j > D(i) > k to PE#(M(j)), where it is received as b_{I(j),D(i)}.

At the end of the tridiagonalization, each processor except PE#N will contain s diagonal and s superdiagonal elements, whereas PE#N will contain s diagonal elements and (s - 1) superdiagonal elements of the tridiagonalized matrix. In each processor, b_{i,D(i)}, i = 1, ..., s, are the corresponding diagonal elements a_{D(i),D(i)} of A_{n-2}; at the same time, b_{i,D(i)+1}, i = 1, ..., s, are the corresponding superdiagonal elements a_{D(i),D(i)+1} of A_{n-2}.

Table IV. Pseudocode for parallel Householder's algorithm using scattered row decomposition

Step   Operation                                                            Processor   (†)

1      Set k ← 0                                                            All
2      Set k ← k + 1 and k1 ← k + 1                                         All         (B)
3      Compute T_p by equation (11)                                         All         (C)
4      Transmit T_p to PE#(M(k))                                            All
5      Compute T by equation (12); if b_{I(k),k1} < 0 then T ← -T            PE#(M(k))   (C)
6      Set b_{I(k),k1} ← b_{I(k),k1} + T; set R ← b_{I(k),k1} × T;           PE#(M(k))   (D)
       set v_i ← b_{I(k),i} for i = k1, ..., n; set b_{I(k),k1} ← -T
7      Transmit R and the last (n - k) components of v to all
       other processors                                                     PE#(M(k))
8      Compute parts of u by equation (13)                                  All         (E)
9      Compute C_p by equation (14)                                         All         (F)
10     Transmit C_p to PE#(M(k))                                            All
11     Compute C and Y by equations (15) and (16)                           PE#(M(k))   (F)
12     Transmit Y to all other processors                                   PE#(M(k))
13     Compute parts of z by equation (17)                                  All         (G)
14     Transmit parts of z to all other processors                          All
15     Compute updated parts of B by equation (18)                          All         (H)
16     Update the components of the lower triangle of the transformed
       (n - k) x (n - k) submatrix                                          All         (H)
17     If k < n - 2 then GO TO STEP 2 else STOP                             All

†Letters in parentheses give the corresponding item label in Table I.
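The data layout of Table IV can be exercised in a purely sequential simulation: the N processors are represented by N local matrices B, and every "transmit" is modelled as a direct read of the sending processor's data. The sketch below is our own construction (names and structure are ours); for simplicity it uses the full-row update of equation (38) from Section 5 rather than the Step 15/16 upper-triangle scheme, so no lower-triangle communication step is needed. It is a model of the scattered row decomposition, not of real message passing.

```python
import math

def parallel_householder_sim(A, N):
    """Sequentially simulate the scattered-row-decomposition stratagem.
    Each of the N logical processors holds s = n/N full rows of A."""
    n = len(A)
    assert n % N == 0
    s = n // N
    M = lambda i: (i - 1) % N + 1        # eq (8): PE holding global row i
    I = lambda i: (i - 1) // N + 1       # eq (9): local index of row i
    D = lambda i, p: (i - 1) * N + p     # eq (10): global row of local row i
    # local matrices B[p] (s full rows each), processors numbered 1..N
    B = [[A[D(i, p) - 1][:] for i in range(1, s + 1)] for p in range(1, N + 1)]
    for k in range(1, n - 1):            # the (n - 2) transformations
        k1 = k + 1
        # Steps 3-5: per-processor sums of squares of column k, combined into T
        T = math.sqrt(sum(B[p][i][k - 1] ** 2
                          for p in range(N) for i in range(s)
                          if D(i + 1, p + 1) >= k1))
        if T == 0.0:
            continue
        host = B[M(k) - 1][I(k) - 1]     # the row being tridiagonalized
        if host[k1 - 1] < 0.0:
            T = -T
        # Steps 6-7: form R and v in PE#(M(k)), set the superdiagonal to -T
        v = [0.0] * n
        for j in range(k1, n + 1):
            v[j - 1] = host[j - 1]
        v[k1 - 1] = host[k1 - 1] + T
        R = v[k1 - 1] * T
        host[k1 - 1] = -T
        for j in range(k1 + 1, n + 1):
            host[j - 1] = 0.0            # annihilated entries of row k
        # Steps 8-12: u = (1/R) A_k v, then C = v^T u and Y = C/2R
        u = [0.0] * n
        for p in range(N):
            for i in range(s):
                g = D(i + 1, p + 1)
                if g >= k1:
                    u[g - 1] = sum(B[p][i][j] * v[j]
                                   for j in range(k1 - 1, n)) / R
        C = sum(v[j] * u[j] for j in range(k1 - 1, n))
        Y = C / (2.0 * R)
        # Steps 13-15: z = u - Yv, then the full-row update of eq (38)
        z = [u[j] - Y * v[j] for j in range(n)]
        for p in range(N):
            for i in range(s):
                g = D(i + 1, p + 1)
                if g >= k1:
                    for j in range(k1 - 1, n):
                        B[p][i][j] -= v[g - 1] * z[j] + z[g - 1] * v[j]
                    # column k of every row below row k is now known
                    B[p][i][k - 1] = -T if g == k1 else 0.0
    # gather the scattered rows back into one matrix
    out = [[0.0] * n for _ in range(n)]
    for p in range(N):
        for i in range(s):
            out[D(i + 1, p + 1) - 1] = B[p][i]
    return out
```

Since only similarity transformations are applied, the gathered result should be tridiagonal with the trace and Frobenius norm of the input preserved.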


4. SPEEDUP AND EFFICIENCY

The speedup g of the parallel stratagem on a concurrent machine is defined as

g = T_s/T_c = (execution time of Table I with 1 processor)/(execution time of Table IV with N processors)   (19)

and the efficiency e as

e = g/N   (20)

In this section, T_s and T_c will be calculated in terms of a set of times required for individual operations, as shown in Table V. Note that q_m will also be used for division, and q_a also for subtraction. The storing cost q_s will be considered only when producing duplicates of arrays.

In Table VI the total execution times of the steps of the algorithm on a single processor are given in terms of the quantities defined in Table V and the quantities H1, H2 and H3. The quantity H1 is the sum of the numbers of elements in n - 2 vectors of orders n - 1 to 2:

H1 = Σ_{i=2}^{n-1} i = (n² - n - 2)/2   (21)

Table V. Times required for floating point operations

Operation                               Time required

Floating point multiplication           q_m
Floating point addition                 q_a
Square root                             q_r
Logic operation                         q_l
Storing                                 q_s
One multiplication and one addition     Q = q_m + q_a

Table VI. Execution times of serial Householder's algorithm

Step no. in Table I     Times for (n - 2) transformations

[the entries of this table could not be recovered from the scan]


The quantity H2 is the sum of the numbers of elements in n - 2 square arrays of orders n - 1 to 2:

H2 = Σ_{i=2}^{n-1} i² = (2n³ - 3n² + n - 6)/6   (22)

The quantity H3 is the sum of the numbers of elements in n - 2 triangular arrays of orders n - 1 to 2:

H3 = Σ_{i=2}^{n-1} i(i + 1)/2 = (n³ - n - 6)/6   (23)

With these, the sum of the execution times for the (n - 2) transformations of Steps B to H listed in Table VI may be expressed, to leading order, as

T_s = H2 Q + 2H3 Q + ... = (2/3)n³Q + ...   (24)

where the omitted terms collect the lower-order vector operations of Steps B to D, F and G.

The total time required for executing the parallel stratagem on a concurrent machine may be stated as

T_c = T1 + T2   (26)

where T1 is the computation time and T2 the data transmission time.

In Table VII the total execution times of the computation steps of the algorithm on a concurrent machine are given in terms of the quantities defined in Table V and the quantities h1, h2 and h3. The quantities h1, h2 and h3 in the concurrent algorithm correspond to H1, H2 and H3 (equations (21)-(23)) in the serial algorithm. If there are s rows stored per processor and one operation is performed on one element in each row, the operation count for each of the first (N - 1) transformations is s, for each of the next N transformations it is (s - 1), etc., and for each of the last (N - 1) transformations it is 1. The total count is then

h1 = N(1 + 2 + ... + s) - (s + 1)   (27)

The quantity h2 is the total operation count in multiplying an (n - k) x (n - k) matrix by an (n - k) vector through the range of k (i.e. k = 1, 2, ..., n - 2):

h2 = s[(n - 1) + (n - 2) + ... + (n - N + 1)]
   + (s - 1)[(n - N) + (n - N - 1) + ... + (n - 2N + 1)]
   + ...
   + 1[N + (N - 1) + ... + 2]   (28)

In a similar manner, h3 may be calculated as

h3 = N²s³/6 + N²s²/4 + N²s/12 - Ns²/4 - Ns/4   (29)


Using equations (27), (28) and (29), the sum of the execution times for the (n - 2) transformations of the computation steps listed in Table VII may be expressed, to leading order, as

T1 = h2 Q + 2h3 Q + ... = (2/3)N²s³Q + ...   (30)

where the omitted terms again collect the lower-order vector operations.

Next, T2 must be estimated. We know that implementing the same stratagem on different concurrent machines will result in different data transmission times. The data transmission time depends not only on how fast a data item can be transmitted from one processor to another, but also on the architecture of the concurrent machine. Here we assume a constant value, q_t, for the time required to transmit a data item from any processor to any other processor. We will assume that, if more than one data item is transmitted between processors, the required time is the product of q_t and the number of data items being transmitted. For example, if N processors are used, transmitting an array of data of length l from one processor to all other processors takes time l(N - 1)q_t. If several processors (say m processors) are required to transmit arrays of data of the same length l from their own memories to all other N - 1 processors, the required time is assumed to be ml(N - 1)q_t. (These assumptions do not include pipelining or concurrency of data transmission; in order to account for such possibilities one may use a smaller q_t.) The communication steps in Table VII are computed on this basis. The sum of the communication times, T2, may be obtained from Table VII by summing the communication entries; to leading order,

T2 = H3 q_t + 2H1(N - 1)q_t + ...   (33)

Note that the entry for Step 16 in Table VII indicates that all elements in the upper triangle of A_k are transferred from their hosts to other processors at the end of the kth transformation. This is a conservative estimate, because some of these elements need not be transmitted, since they are already in the proper processors. (Also note that, if the machine architecture is specified, the total data transmission time may be calculated more exactly.)

From equations (19) and (26), the speedup of the parallel stratagem is

g = T_s/(T1 + T2)   (34)

Then, using equation (20), the efficiency expression becomes

e = T_s/[N(T1 + T2)]   (35)


Table VII. Total execution times of parallel Householder's algorithm using scattered row decomposition

Step no.       Sum of maximum times in        Computation     Communication
in Table IV    (n - 2) transformations        steps           steps

2              2(n - 2)q_a                    √
3              h1 Q                           √
9              h1 Q                           √
13             h1 Q                           √
14             H1(N - 1)q_t                                   √
17             (n - 2)(q_a + q_l)             √

[the remaining rows of the table could not be recovered from the scan]

By substitution from equations (24), (30) and (33), one may obtain the efficiency expression for the parallel stratagem. However, if the number of rows contained in each processor is large enough, then, ignoring the terms of small magnitude, the asymptotic efficiency expression may be written as

e = 1/[1 + p(N/4 + 3N/2s)]   (36)

where

p = q_t/Q   (37)

If p = 0, equation (36) indicates excellent efficiency when s is large. However, communication between processors always takes time, and therefore p is never zero. When p ≠ 0, efficiency is poor, because N and N/s appear in the factor of p in equation (36). The N in the factor of p is caused entirely by the communication time consumed in Step 16 of Table IV. In the present stratagem, at each transformation only the elements of the upper triangle are computed (Step 15), and the elements of the lower triangle are updated by communication between processors (Step 16). In order to eliminate N from the factor of p in equation (36), complete rows should be updated by computation in each processor, at the expense of some additional computation time. The algorithm corresponding to this modification is described below.

5. ALTERNATIVE STRATAGEM

As mentioned previously, we could eliminate the data transmission time of Step 16 in Table IV at the expense of some additional computation time. This is done by modifying equation (18), used in Step 15 of Table IV, so that each processor forms the complete rows it hosts, upper- and lower-triangle elements alike (equation (38)),

which achieves the update of B without data communication, but with additional computation. Step 16 of Table IV is then eliminated.
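In outline, the modified Step 15 amounts to a full-row symmetric rank-two update applied by each processor to the rows it hosts. The sketch below is ours, not the paper's code: it uses the standard Householder formulation (v the reflector, w the update vector, so that A' = A - v wᵀ - w vᵀ), with hypothetical function names, to show how each processor can update its scattered rows completely so that no Step 16 exchange is needed.

```python
import numpy as np

def householder_vectors(A, k):
    """Form the Householder vector v and update vector w for the kth
    transformation (standard formulation: P = I - v v^T / H and
    A' = P A P = A - v w^T - w v^T).  In the parallel stratagem these
    vectors are assembled cooperatively and broadcast to all processors
    (the role of the earlier steps of Table IV)."""
    n = A.shape[0]
    x = A[k+1:, k]
    alpha = -np.copysign(np.linalg.norm(x), x[0] if x[0] != 0.0 else 1.0)
    v = np.zeros(n)
    v[k+1:] = x
    v[k+1] -= alpha
    H = (v @ v) / 2.0
    if H == 0.0:                      # column already in tridiagonal form
        return v, np.zeros(n)
    p = A @ v / H
    K = (v @ p) / (2.0 * H)
    return v, p - K * v

def update_my_rows(A, v, w, my_rows):
    """Modified Step 15: update the hosted rows in full (upper and lower
    triangle alike), so the Step 16 data exchange is unnecessary."""
    for r in my_rows:
        A[r, :] -= v[r] * w + w[r] * v
```

With scattered row decomposition, processor i would call `update_my_rows` with rows {i, i+N, i+2N, ...}; the union of all processors' updates reproduces A - v wᵀ - w vᵀ.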

The computation time of the new Step 15 of Table IV may be found to be 2h₂Q. All the other computation times in Table IV remain unchanged. The new total computation time is

T'_1 = N²s³Q ( 1 + 3/(4s) - 1/(4s²) + 9/(4Ns) - 3/(4Ns²) + 3/(N²s³) )
       + N²s³ ( 1/(2s) + 1/(2Ns) + 1/(2N²s²) + 1/(2N²s³) ) q_s    (39)

Meanwhile, the total transmission time is decreased by the amount of time spent in Step 16 of Table IV, leading to T'_2 (equation (40)).

The efficiency is

e' = T_s / (N(T'_1 + T'_2))    (41)


With these and large s, the asymptotic efficiency expression of the alternative stratagem becomes

e' = (2/3) / ( 1 + 3/(4s) + q_s/(2sQ) + p(N - 1)/s )    (42)

Note that when p = 0, in this case for large s the maximum efficiency is 67 per cent (instead of the 100 per cent one would obtain in the previous stratagem). However, the following examples demonstrate the superiority of the alternative stratagem. First, suppose the times for floating point arithmetic operations are

q_a = q_m = ⋯ = q_s = 1.0    (43)

For a matrix of order 10,000 to be tridiagonalized, if the concurrent machine has 100 processors (N = 100), then each processor will contain 100 rows (s = 100) of the input matrix. The efficiency will be e' = 63 per cent if q_t = 0.1 and e' = 44 per cent if q_t = 1.0. (The corresponding efficiencies for the previous stratagem are e = 43 per cent if q_t = 0.1 and e = 7 per cent if q_t = 1.0.) If the concurrent machine has 10 processors and each processor contains 1000 rows, then the computation time totally dominates: the efficiency is almost constant at e' = 66 per cent regardless of whether q_t is 0.1 or 1.0. (The corresponding efficiencies for the previous stratagem are e = 89 per cent if q_t = 0.1 and e = 44 per cent if q_t = 1.0.)
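These example figures can be checked numerically. The sketch below is ours: it evaluates the asymptotic efficiency expressions as reconstructed in this section, with the additional assumption Q = q_a + q_m = 2.0 for the unit operation times of equation (43) (Q is defined earlier in the paper; the function names are hypothetical).

```python
def eff_original(N, s, qt, qs=1.0, Q=2.0):
    """Asymptotic efficiency e of the first stratagem, equation (36) as
    reconstructed here; p = qt/Q, and the factor of p carries the N and
    N/s terms caused by the Step 16 communication."""
    p = qt / Q
    return 1.0 / (1.0 + 3.0/(4*s) + qs/(2*s*Q) + (p/4.0)*(N + N/s))

def eff_alternative(N, s, qt, qs=1.0, Q=2.0):
    """Asymptotic efficiency e' of the alternative stratagem, equation
    (42) as reconstructed here: the maximum is 2/3, but the factor of p
    is only (N - 1)/s."""
    p = qt / Q
    return (2.0/3.0) / (1.0 + 3.0/(4*s) + qs/(2*s*Q) + p*(N - 1)/s)
```

Under these assumptions the functions reproduce the percentages quoted above to within about one point of rounding (e.g. roughly 63 and 44 per cent for the alternative stratagem at N = s = 100, and roughly 89 and 44 per cent for the original at N = 10, s = 1000).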

An implementation of the alternative stratagem on a ring architecture4 is given in the Appendix to demonstrate that, when the machine architecture is specified, the transmission time may be smaller than the one used in equation (42).

6. CONCLUSION

The Householder method of tridiagonalization may be implemented on a concurrent machine efficiently by scattered row decomposition provided that the number of rows contained in each processor, s, is large enough, and the ratio of the time to transmit one data item from one processor to any other processor to the time to perform a floating point arithmetic operation, p, is small enough. It is also demonstrated that if the architecture of the concurrent machine is specified, the efficiencies (say, e' in equation (42)) may be further improved.

APPENDIX

Suppose the concurrent machine is a hypothetical ring processing array⁴ whose interconnection network is shown in Figure 2. The alternative parallel Householder tridiagonalization stratagem using scattered row decomposition described in Section 5 is specialized to this architecture. First, let q'_t be the time required to transmit one number from a processor to its nearest neighbour. Note that data may be transmitted from one processor to the processors on both sides, and the maximum distance between processors is N/2.

It can be shown that transmitting one number from one processor to all other processors (as in Step 12 of Table IV) takes only time (N/2)q'_t. Transmitting one number from each of the (N - 1) processors to the remaining processor (as in Steps 4 and 10 of Table IV) takes time (N - 1)q'_t. Transmitting an array of data of length l from one processor to all other (N - 1) processors (as in Step 7 of Table IV) takes time (N/2 + 2l - 2)q'_t. The transmission time in Step 14 of Table IV will be approximated as the time used in Step 7 of Table IV. With these, one may rewrite the times required


Figure 2. Interconnection network of a hypothetical ring processing array. Notes: (1) data are transmitted through the solid lines connecting processors; (2) during any clock cycle a processor can either transmit or receive data, but not both; (3) the number of processors is even.

for communication steps in Table VII as

Step 4: (n - 2)(N - 1)q'_t
Step 7: [(n - 2)(N/2 - 2) + (n - 2)(n + 1)]q'_t
Step 10: (n - 2)(N - 1)q'_t
Step 12: (n - 2)(N/2)q'_t
Step 14: [(n - 2)(N/2 - 2) + (n - 2)(n + 1)]q'_t
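These per-step ring totals can be tabulated numerically. The helper below is our sketch (names hypothetical): the Step 7/14 count takes the pipelined-broadcast time (N/2 + 2l - 2)q'_t summed over transformations k = 1, ..., n - 2 with array length l = n - k, which sums to (n - 2)(N/2 + n - 1)q'_t.

```python
def ring_step_times(n, N, qt_ring):
    """Transmission times, summed over the n - 2 transformations, for
    the communication steps of Table IV on an even-N ring; qt_ring is
    the nearest-neighbour transfer time q_t'."""
    gather = (n - 2) * (N - 1) * qt_ring                # Steps 4 and 10
    bcast_one = (n - 2) * (N // 2) * qt_ring            # Step 12
    bcast_array = (n - 2) * (N // 2 + n - 1) * qt_ring  # Steps 7 and 14
    return {'step4': gather, 'step7': bcast_array,
            'step10': gather, 'step12': bcast_one, 'step14': bcast_array}

def total_ring_time(n, N, qt_ring):
    """Total ring transmission time; equals (n - 2)(2n + 7N/2 - 4)q_t'."""
    return sum(ring_step_times(n, N, qt_ring).values())
```

For example, with n = 1000 and N = 10 the array broadcasts of Steps 7 and 14 dominate the total, which is consistent with the 2/s factor found below.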

Thus, the total required transmission time may be expressed as

T''_2 = (n - 2)(2n + 7N/2 - 4)q'_t    (A1)

Since the total computation time is already given by (39), the asymptotic efficiency expression may be found as

e'' = (2/3) / ( 1 + 3/(4s) + q_s/(2sQ) + 2p'/s )    (A2)

where

p' = q'_t / Q

Note that the factor of p' in equation (A2) is 2/s, which is much smaller than the factor of p in (42) for large N; this increases the efficiency. For example, suppose s = 100, N = 100 (which corresponds to n = 10,000) and p' = 0.5; one obtains e'' = 65 per cent from equation (A2), whereas e' = 44 per cent from (42).
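That comparison can be checked directly from the two asymptotic expressions as reconstructed here (q_s = 1.0 and Q = 2.0 assumed, as in the earlier example; the function names are ours):

```python
def eff_ring(s, rho_ring, qs=1.0, Q=2.0):
    """e'' of equation (A2): the factor multiplying p' is 2/s,
    independent of N."""
    return (2.0/3.0) / (1.0 + 3.0/(4*s) + qs/(2*s*Q) + 2.0*rho_ring/s)

def eff_general_ring(N, s, rho, qs=1.0, Q=2.0):
    """e' of equation (42), whose factor of p grows like N/s."""
    return (2.0/3.0) / (1.0 + 3.0/(4*s) + qs/(2*s*Q) + rho*(N - 1)/s)
```

With s = N = 100 and p = p' = 0.5, `eff_ring` gives about 0.654 and `eff_general_ring` about 0.443, matching the 65 versus 44 per cent comparison.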


REFERENCES

1. J. H. Wilkinson, The Algebraic Eigenvalue Problem, Clarendon Press, Oxford, 1965.
2. J. Ortega, 'The Givens-Householder method for symmetric matrices', in A. Ralston and H. S. Wilf (eds.), Mathematical Methods for Digital Computers, Wiley, New York, 1967.
3. R. L. Burden, J. D. Faires and A. C. Reynolds, Numerical Analysis, 2nd edn, Prindle, Weber and Schmidt, Boston, Massachusetts, 1981.
4. K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing, McGraw-Hill, New York, 1984.