Multilayer Neural Networks for Reduced-Rank Approximation



684 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 5, NO. 5, SEPTEMBER 1994

Multilayer Neural Networks for Reduced-Rank Approximation

Konstantinos I. Diamantaras, Member, IEEE, and Sun-Yuan Kung, Fellow, IEEE

Abstract: This paper is developed in two parts. First, we formulate the solution to the general reduced-rank linear approximation problem, relaxing the invertibility assumption on the input autocorrelation matrix used by previous authors. Our treatment unifies linear regression, Wiener filtering, full-rank approximation, auto-association networks, SVD, and Principal Component Analysis (PCA) as special cases. Our analysis also shows that two-layer linear neural networks with a reduced number of hidden units, trained with the least-squares error criterion, produce weights that correspond to the Generalized Singular Value Decomposition of the input-teacher cross-correlation matrix and the input data matrix. As a corollary, the linear two-layer back propagation model with reduced hidden layer extracts an arbitrary linear combination of the generalized singular vector components. Second, we investigate artificial neural network models for the solution of the related generalized eigenvalue problem. By introducing and utilizing the extended concept of deflation (originally proposed for the standard eigenvalue problem) we find that a sequential version of linear BP can extract the exact generalized eigenvector components. The advantage of this approach is that it is easier to update the model structure by adding one more unit or pruning one or more units when the application requires it. An alternative approach for extracting the exact components is to use a set of lateral connections among the hidden units, trained in such a way as to enforce orthogonality among the upper- and lower-layer weights. We call this the Lateral Orthogonalization Network (LON), and we show via theoretical analysis, and verify via simulation, that the network extracts the desired components. The advantage of the LON-based model is that it can be applied in a parallel fashion so that the components are extracted concurrently. Finally, we show the application of our results to the identification problem for systems whose excitation has a non-invertible autocorrelation matrix. Previous identification methods usually rely on the invertibility of the input autocorrelation matrix, and therefore cannot be applied to this case.

I. INTRODUCTION

Recently, a lot of attention has been paid to the properties of linear two-layer networks with reduced hidden layer size. Although a two-layer linear network is always equivalent to some single-layer linear network in terms of the overall mapping, the presence of the hidden units limits the rank of the mapping. Thus, if the network has $n$ inputs, $m$ outputs, and $p$ hidden units, and if $\underline{W} \in \mathbb{R}^{n \times p}$, $\overline{W} \in \mathbb{R}^{m \times p}$ denote the lower- and upper-layer weight matrices respectively, then the overall mapping $W = \overline{W}\,\underline{W}^T$ ¹ is such that $\mathrm{rank}(W) \le p$ [21]. A reduced hidden layer network is defined as one with $p \le \min\{m, n\}$. It was first observed that in the auto-associative case the optimal least-squares weights are related to the Singular Value Decomposition (SVD) of the input data matrix [3]. Any learning rule that can find the optimal solution for the sum-of-squares error criterion, such as the linear version of the back propagation algorithm [17], will produce weights that span the same space as the $p$ principal components of the input sequence. Later it was found that in the hetero-associative case the result is related to the eigenvalue decomposition of the classical least-squares regression matrix [1] or, equivalently, the SVD of the cross-correlation between the output data and the prewhitened input data [9], [20]. In all these publications, however, it was assumed that the input data matrix has full rank and, equivalently, that the input autocorrelation matrix is non-singular.

Manuscript received December 17, 1991; revised September 29, 1992. This work was supported in part by the Air Force Office of Scientific Research under Grant AFOSR-89-0501A.

K. I. Diamantaras is with Siemens Corporate Research, Princeton, NJ 08540 USA.

S.-Y. Kung is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA.

IEEE Log Number 9205984.

In this paper we develop a general solution to the reduced-rank approximation problem without using the invertibility assumption for the input autocorrelation matrix, and we study the training of reduced-hidden-layer networks that relate to this problem. We both extend standard neural models, such as back propagation, and propose novel structures, such as networks incorporating lateral connections among the hidden units, with the aim of extracting the exact eigencomponents related to our problem. In particular, in Section II we show that the solution obtained by the network when minimizing the least-squares error criterion is related to the Generalized Singular Value Decomposition (GSVD) of the input-teacher cross-correlation matrix and the input data matrix. All previous results mentioned in the preceding paragraph can then be treated as special cases. In Section III we focus our attention on the training of neural networks to extract these generalized eigenvalue components. We show that under mild assumptions the error surface of a two-layer network contains no local minima, so any gradient descent procedure such as back propagation is guaranteed to reach a global minimum. This is an extension of the results reported earlier by Baldi and Hornik [1]. In Section III-B we investigate the extension of the back propagation model using the concept of deflation, so that the exact eigenvalues are extracted sequentially rather than as some linear combination of them (as is the case with the straightforward BP model). This is useful for avoiding retraining of the network when one or more components need to be added or removed. In Section III-C we introduce what we call the Lateral Orthogonalization Network as a novel network structure for extracting the exact components. The LON approach is aesthetically more appealing because it "emulates" deflation via weight training instead of using the artificial deflation process, and in addition it is easily parallelizable. Furthermore, we provide simulation results to verify the convergence of the proposed model. Finally, in Section IV we demonstrate the application of our reduced-rank approximation analysis to a linear system identification problem where the input autocorrelation matrix is not invertible. Older methods typically require an invertibility condition on the autocorrelation, so they cannot be applied to this problem. Simulation results are provided to display the results of our technique. We conclude in Section V.

¹ The superscript T denotes matrix transposition.

1045-9227/94$04.00 © 1994 IEEE

II. REDUCED-RANK APPROXIMATION

In this section we show the solution to the general least-squares approximation problem using a reduced-rank overall matrix $W \in \mathbb{R}^{m \times n}$ such that

$$\mathrm{rank}(W) = p \le q = \min\{m, n\} \qquad (1)$$

for a given $p$. We are provided with $N$ input/teacher pairs $(x_k, y_k)$, $k = 1, \dots, N$, and we want to minimize the cost

$$E = \sum_{k=1}^{N} \|y_k - W x_k\|^2 \qquad (2)$$

which can also be written as

$$E = \|Y - WX\|_F^2 \qquad (3)$$

where $\|\cdot\|_F$ stands for the Frobenius norm,² and $Y \in \mathbb{R}^{m \times N}$, $X \in \mathbb{R}^{n \times N}$ are the data matrices $[y_1, y_2, \dots, y_N]$, $[x_1, x_2, \dots, x_N]$ respectively. Assume that $N \ge n$,³ and let⁴

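$$U^T X^T Q = D_1 = \mathrm{diag}\{\alpha_1, \dots, \alpha_n\} \in \mathbb{R}^{N \times n} \qquad (4)$$

$$V^T Y X^T Q = D_2 = \mathrm{diag}\{\beta_1, \dots, \beta_q\} \in \mathbb{R}^{m \times n} \qquad (5)$$

be the GSVD of the matrix pair $(YX^T, X^T)$ [13]. Thus $U \in \mathbb{R}^{N \times N}$ and $V \in \mathbb{R}^{m \times m}$ are orthonormal matrices, $Q \in \mathbb{R}^{n \times n}$ is invertible, $\alpha_1, \dots, \alpha_r > 0$, $\alpha_{r+1} = \dots = \alpha_n = 0$ for some $r$, $1 \le r \le n$, and $\beta_1, \dots, \beta_s > 0$, $\beta_{s+1} = \dots = \beta_q = 0$ for some $s$, $1 \le s \le q$. Obviously, $\mathrm{rank}(X^T) = r$, $\mathrm{rank}(YX^T) = s$ and $s \le r$. Let us define

$$\lambda_i \equiv \frac{\beta_i}{\alpha_i}, \quad \text{for } i = 1, \dots, \hat{s} \equiv \min\{q, r\} = \min\{m, r\}. \qquad (6)$$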

² For any matrix $P = [p_{ij}]$, $\|P\|_F^2 \equiv \mathrm{trace}\{PP^T\} = \mathrm{trace}\{P^TP\} = \sum_{i,j} p_{ij}^2$.

³ This condition is necessary for the existence of the GSVD to be used later. It is not restrictive, however, since we can always pad $X$ and $Y$ with as many zero columns as needed without changing the error function.

⁴ Throughout this paper the notation $\mathrm{diag}\{\dots\}$ will denote a pseudo-diagonal matrix. A matrix $P = [p_{ij}]$ (regardless of whether it is square or not) is called pseudo-diagonal if $p_{ij} = 0$ when $i \ne j$.

The numbers $\lambda_i^2$ are referred to as the finite generalized eigenvalues of the matrix pair $(XY^TYX^T, XX^T)$. We will assume, without loss of generality, that the diagonal entries of $D_1$ and $D_2$ are ordered so that $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_s > \lambda_{s+1} = \dots = \lambda_{\hat{s}} = 0$. Given the above observations we can write

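$$D_1 = \begin{bmatrix} D_{1r} & 0 \\ 0 & 0 \end{bmatrix} \qquad (7)$$

(column blocks of widths $r$ and $n - r$, row blocks of heights $r$ and $N - r$),

$$D_2 = [\,D_{2r} \;\; \tilde{D}_2\,] \qquad (8)$$

(column blocks of widths $r$ and $n - r$, with $m$ rows), and

$$U = [\,U_r \;\; \tilde{U}\,] \qquad (9)$$

(column blocks of widths $r$ and $N - r$, with $N$ rows), where $D_{1r}$ is a diagonal non-singular $r \times r$ matrix, the matrices $D_{2r}$, $U_r$ are comprised of the first $r$ columns of $D_2$ and $U$, while $\tilde{D}_2$, $\tilde{U}$ are comprised of the remaining columns. Then from (4)

$$X^T Q = U_r\,[\,D_{1r} \;\; 0\,] \qquad (10)$$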

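Using the fact that pre- or postmultiplication with an orthonormal matrix does not affect the Frobenius norm, we have

$$E = \|V^T Y U - V^T W X U\|_F^2 = \|V^T Y U_r - V^T W Q^{-T} [\,D_{1r} \;\; 0\,]^T\|_F^2 + \|V^T Y \tilde{U}\|_F^2 \qquad (11)$$

where $Q^{-T}$ is shorthand notation for $(Q^{-1})^T$ and we used the fact that $D_{1r}^T = D_{1r}$. The residue

$$R = \|V^T Y \tilde{U}\|_F^2 = \|Y \tilde{U}\|_F^2 \qquad (12)$$

does not depend on $W$, thus it can be ignored. This also implies that $E_{\min} \ge R$, where $E_{\min}$ is the minimum error. Thus $W$ minimizes

$$\|\Lambda_r - V^T W Q^{-T} [\,D_{1r} \;\; 0\,]^T\|_F^2 \qquad (13)$$

where $\Lambda_r \equiv V^T Y U_r = D_{2r} D_{1r}^{-1} = \mathrm{diag}\{\lambda_1, \dots, \lambda_{\hat{s}}\} = \mathrm{diag}\{\lambda_1, \dots, \lambda_s, 0, \dots, 0\} \in \mathbb{R}^{m \times r}$. Assuming that $p \le s$, the optimal overall matrix $W^*$ should satisfy the equation

$$V^T W^* Q^{-T} [\,D_{1r} \;\; 0\,]^T = \begin{bmatrix} \Lambda_p & 0 \\ 0 & 0 \end{bmatrix} \qquad (14)$$

where $\Lambda_p = \mathrm{diag}\{\lambda_1, \dots, \lambda_p\} \in \mathbb{R}^{p \times p}$. Now let $V = [v_1 \dots v_m]$ and $Q = [q_1 \dots q_n]$ and define $V_p \equiv [v_1 \dots v_p]$, $Q_p \equiv [q_1 \dots q_p]$ and $D_{1p} \equiv \mathrm{diag}\{\alpha_1, \dots, \alpha_p\} \in \mathbb{R}^{p \times p}$.

Then (14) is solved for

$$W^* = V_p \Lambda_p D_{1p}^{-1} \left(Q_p^T + \Phi^T \tilde{Q}^T\right) \qquad (15)$$

where $\tilde{Q}$ is a matrix formed by the last $n - r$ columns of $Q$, so $\tilde{Q}^T Q^{-T} [\,D_{1r} \;\; 0\,]^T = 0$, and $\Phi$ is an arbitrary $(n - r) \times p$ matrix. The minimum error is then

$$E_{\min} = R + \sum_{i=p+1}^{s} \lambda_i^2 \qquad (16)$$

The value $R$ is the mismatch of the linear regression between $X$ and $Y$, and it is the absolute minimum error achievable when there is no rank restriction on $W$. By introducing rank restrictions on $W$ we create the excess error term $\sum_{i=p+1}^{s} \lambda_i^2$. Scharf finds a similar result in [20] under the special case when $XX^T$ is non-singular.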
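A minimal numerical sketch of the result above, assuming NumPy and a hypothetical helper `reduced_rank_fit` (neither is part of the paper). Instead of calling a GSVD routine it uses the pseudo-inverse of $X$ and an ordinary SVD, which under the derivation above should yield the same values $\lambda_i$ and one of the non-unique optimal maps. The input is deliberately rank-deficient, so $XX^T$ is singular.

```python
import numpy as np

def reduced_rank_fit(X, Y, p, rcond=1e-10):
    """Return (W, lam, R): one rank-p minimizer of ||Y - W X||_F^2,
    the singular values lam (playing the role of lambda_i), and the residue R."""
    X_pinv = np.linalg.pinv(X, rcond=rcond)   # N x n, defined even if X X^T is singular
    Y_proj = Y @ X_pinv @ X                   # rows of Y projected onto the row space of X
    V, lam, _ = np.linalg.svd(Y_proj, full_matrices=False)
    Vp = V[:, :p]                             # first p left singular vectors
    W = Vp @ Vp.T @ Y @ X_pinv                # minimum-norm choice among the optimal maps
    R = np.linalg.norm(Y - Y_proj) ** 2       # error of the unrestricted regression
    return W, lam, R

rng = np.random.default_rng(0)
n, m, N, r, p = 5, 4, 60, 3, 2
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, N))  # rank r < n
Y = rng.standard_normal((m, N))

W, lam, R = reduced_rank_fit(X, Y, p)
E = np.linalg.norm(Y - W @ X) ** 2
print(E, R + np.sum(lam[p:] ** 2))            # the two sides of (16)
```

The two printed numbers correspond to the left and right sides of (16). Using the pseudo-inverse is a design choice made here for brevity; it picks one particular member of the solution family (15) rather than exposing the arbitrary matrix $\Phi$.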

A. Comments

The optimal solution $W^*$ in (15) is obviously not unique because of the involvement of the arbitrary matrix $\Phi$, but an additional source of non-uniqueness is created when there is repetition of $\lambda_p$ or any eigenvalue less than $\lambda_p$, since in this case the matrices $V$ and $Q$ are not unique either. Although reduced-rank approximation introduces excess error compared to full-rank approximation, it might in fact perform better than the full-rank case with respect to some other, more meaningful error criterion $E'$. For example, if $y_k = \mathrm{signal}_k + \mathrm{noise}_k$ it would make sense, from a filtering application point of view, to minimize the cost

$$E' = \sum_k \|\mathrm{signal}_k - W x_k\|^2$$

rather than the standard $E = \sum_k \|\mathrm{signal}_k + \mathrm{noise}_k - W x_k\|^2$. This would be the case if the signal has low rank $p$, but $y = \mathrm{signal} + \mathrm{noise}$ has higher rank $p'' > p$. Then the full-rank filter will be modeling the noise in at least $p'' - p$ dimensions. On the other hand, if we have a priori knowledge of $p$, then using a rank-$p$ approximation we may get a good estimate of the signal subspace and get rid of the noise in all other dimensions. For low signal-to-noise ratios the power of the noise in these dimensions may be significant, so we can get superior performance (with respect to $E'$) over full-rank filtering. For high SNR the performances of these two filters should approach each other asymptotically.

B. Special Cases

X has full rank. In this case $r = n$ and the matrix $XX^T$ is invertible. From (4) and (5) we have

$$Q^T X X^T Q = D_1^T D_1 \qquad (17)$$

$$Q^T X Y^T Y X^T Q = D_2^T D_2 \qquad (18)$$

and since $\alpha_i > 0$, $i = 1, \dots, n$, and $N \ge n$, $D_1^T D_1$ is non-singular; therefore we can write

$$X Y^T Y X^T Q = X X^T Q \Lambda_n^2 \qquad (19)$$

where $\Lambda_n^2 = (D_1^T D_1)^{-1} D_2^T D_2 = \mathrm{diag}\{\lambda_1^2, \dots, \lambda_{\hat{s}}^2, 0, \dots, 0\} \in \mathbb{R}^{n \times n}$. Therefore, the columns of $Q$ are the generalized eigenvectors of the pair $(XY^TYX^T, XX^T)$. One equivalent way of writing (19) when $XX^T$ is invertible is

$$(XX^T)^{-1/2} X Y^T Y X^T (XX^T)^{-1/2} \hat{Q} = \hat{Q} \Lambda_n^2 \qquad (20)$$

where $\hat{Q} = (XX^T)^{1/2} Q$. Since the matrix $(XX^T)^{-1/2}$ is the prewhitening operator for the input sequence, $\hat{Q}$ is the eigenvector matrix of the cross-correlation between the prewhitened input data and the output data, in accordance with the results published in [9], [20].

Yet another way to see the same solution is by using (5) to obtain

$$Y X^T (XX^T)^{-1} X Y^T V = Y X^T (XX^T)^{-1} Q^{-T} D_2^T$$

From (17) we have $(XX^T)^{-1} Q^{-T} = Q (D_1^T D_1)^{-1}$, so

$$Y X^T (XX^T)^{-1} X Y^T V = Y X^T Q (D_1^T D_1)^{-1} D_2^T = V D_2 (D_1^T D_1)^{-1} D_2^T = V \Lambda_m^2 \qquad (21)$$

where $\Lambda_m = \mathrm{diag}\{\lambda_1, \dots, \lambda_s, 0, \dots, 0\} \in \mathbb{R}^{m \times m}$. Therefore, $V$ is the eigenvector matrix of $Y X^T (XX^T)^{-1} X Y^T$, in accordance with the results published in [1].

Linear Regression and Wiener Filtering: X has full rank, W has no rank restriction. If $r = n$ and in addition there is no rank constraint on $W$ (that is, $p = q$), then from (15) we see that the optimal solution is

$$W^* = V_q \Lambda_q D_{1q}^{-1} Q_q^T = V D_2 (D_1^T D_1)^{-1} Q^T = V D_2 Q^{-1} (XX^T)^{-1} = V V^T Y X^T (XX^T)^{-1} = Y X^T (XX^T)^{-1} \qquad (22)$$

which is the solution of the linear regression between $X$ and $Y$. Assuming that $\frac{1}{N} Y X^T \approx R_{yx} = E\{y x^T\}$ and $\frac{1}{N} X X^T \approx R_x = E\{x x^T\}$, one also recognizes (22) as the Wiener filter solution $R_{yx} R_x^{-1}$.

Auto-association: Y = X. In this case $m = q = n$, $s = r$ and (5) yields

$$V^T X U D_1 = D_2 \;\Rightarrow\; V^T X [\,U_r \;\; \tilde{U}\,] \begin{bmatrix} D_{1r} & 0 \\ 0 & 0 \end{bmatrix} = [\,D_{2r} \;\; 0\,]$$

Since $X \tilde{U} = 0$ we can write

$$V^T X U = [\,D_{2r} D_{1r}^{-1} \;\; 0\,]$$

Therefore, $V$ and $U$ are the SVD matrices for $X$. Furthermore, we can choose $Q = V$ so that (4) becomes the SVD of $X^T$, and (5) becomes the eigenvalue decomposition of $XX^T$. This is in agreement with the results published in [3]. Under the probabilistic formulation the input sequence $x_k$ is viewed as a stochastic process and we may approximate $\frac{1}{N} XX^T$ by $R_x$. In this context the solution is called the Principal Component Analysis (PCA) of $x_k$ and is equivalent to the eigenvalue decomposition of $R_x$.
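The full-rank special case can be checked numerically. The sketch below assumes NumPy; the variable names (`B_inv_sqrt`, `Q_hat`, `lam2`, `mu`) are illustrative and not taken from the paper. It solves the prewhitened symmetric eigenproblem (20), maps the eigenvectors back, and verifies the generalized eigenvector relation (19) and the fact from (21) that $YX^T(XX^T)^{-1}XY^T$ has the same non-zero eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, N = 4, 3, 200
X = rng.standard_normal((n, N))                 # full-rank input, so X X^T is invertible
Y = rng.standard_normal((m, n)) @ X + 0.1 * rng.standard_normal((m, N))

A = X @ Y.T @ Y @ X.T                           # X Y^T Y X^T
B = X @ X.T                                     # X X^T

# Prewhitening operator (X X^T)^(-1/2) from the symmetric eigendecomposition of B.
d, E_B = np.linalg.eigh(B)
B_inv_sqrt = E_B @ np.diag(d ** -0.5) @ E_B.T

# Eq. (20): ordinary symmetric eigenproblem for the prewhitened matrix.
lam2, Q_hat = np.linalg.eigh(B_inv_sqrt @ A @ B_inv_sqrt)
order = np.argsort(lam2)[::-1]                  # sort eigenvalues in decreasing order
lam2, Q_hat = lam2[order], Q_hat[:, order]
Q = B_inv_sqrt @ Q_hat                          # undo Q_hat = (X X^T)^(1/2) Q

# Eq. (19): the columns of Q are generalized eigenvectors of the pair (A, B).
print(np.allclose(A @ Q, B @ Q @ np.diag(lam2), atol=1e-6))

# Eq. (21): the m x m matrix Y X^T (X X^T)^(-1) X Y^T has the same non-zero eigenvalues.
mu = np.sort(np.linalg.eigvalsh(Y @ X.T @ np.linalg.inv(B) @ X @ Y.T))[::-1]
print(np.allclose(mu, lam2[:m]))
```

The normalization of `Q` here ($Q^T XX^T Q = I$) is one admissible scaling; the GSVD in (4) and (5) allows other column scalings, which change $\alpha_i$ and $\beta_i$ but not the ratios $\lambda_i$.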

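III. NEURAL NETWORK TRAINING

The error $E$ in (2), applied to a linear network with $p$ hidden units, can be written in the following form

$$E = \|Y - \overline{W}\,\underline{W}^T X\|_F^2 \qquad (23)$$

where $\overline{W} \in \mathbb{R}^{m \times p}$, $\underline{W} \in \mathbb{R}^{n \times p}$ are the upper- and lower-layer weights respectively. In the appendix we show that under the assumption

$$\lambda_1 > \lambda_2 > \dots > \lambda_{\hat{s}} > 0 \quad (\text{so } s = \hat{s}) \qquad (A)$$

the error function contains no local minima (all critical points other than the global minima are saddle points). From the proof in the appendix it follows that there exist two $p \times p$ matrices $\overline{M}$, $\underline{M}$ such that $\overline{M}\,\underline{M}^T = \Lambda_p$ and $V^T \overline{W} = I_p^m \overline{M}$, $[\,D_{1r} \;\; 0\,]\, Q^{-1} \underline{W} = I_p^r \underline{M}$, where $I_p^m$ ($I_p^r$) is a matrix comprised of the first $p$ columns of the $m \times m$ ($r \times r$) identity matrix. The matrices $\overline{M}$, $\underline{M}$ are invertible (since their product has full rank), so $\underline{M} = \Lambda_p \overline{M}^{-T}$, and we can write the general solution for the upper- and lower-layer weights as follows

$$\overline{W}^* = V_p \overline{M} \qquad (24)$$

$$\underline{W}^* = \left(Q_p D_{1p}^{-1} + \tilde{Q} \Phi\right) \Lambda_p \overline{M}^{-T} \qquad (25)$$

where $\Phi$ is any $(n - r) \times p$ matrix. If the input has full rank then $r = n$ and the optimal solution is simplified:

$$\overline{W}^* = V_p \overline{M} \qquad (26)$$

$$\underline{W}^* = Q_p D_{1p}^{-1} \Lambda_p \overline{M}^{-T} = Q_p\, \mathrm{diag}\left\{\frac{\beta_1}{\alpha_1^2}, \dots, \frac{\beta_p}{\alpha_p^2}\right\} \overline{M}^{-T} \qquad (27)$$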
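As a quick consistency check (a sketch in the notation used above, with $\overline{M}$ the invertible $p \times p$ factor, $\tilde{Q}$ the last $n - r$ columns of $Q$, and $\Phi$ arbitrary), multiplying the upper- and lower-layer solutions (24) and (25) recovers an overall map of the form (15):

```latex
\overline{W}^{*}\,\underline{W}^{*T}
  = V_p \overline{M}\,\overline{M}^{-1}\Lambda_p
    \left(D_{1p}^{-1} Q_p^{T} + \Phi^{T}\tilde{Q}^{T}\right)
  = V_p \Lambda_p D_{1p}^{-1}
    \left(Q_p^{T} + D_{1p}\Phi^{T}\tilde{Q}^{T}\right)
```

Since $\Phi$ is arbitrary, $\Phi D_{1p}$ is arbitrary as well, so this is (15) with $\Phi$ replaced by $\Phi D_{1p}$. The factor $\overline{M}$ cancels in the product, which is why minimizing the output error alone cannot determine it.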

A. Linear Back Propagation for Extracting the Component Subspace

The fact that there are no local minima under assumption (A) implies that any gradient descent algorithm for minimizing (23) in a network with $p$ hidden units will produce the global minimum solution (24) and (25). An example of such an algorithm is the back propagation learning rule, the linear version of which is as follows:

$$\overline{W}_{k+1} = \overline{W}_k + \beta\,(y_k - \overline{W}_k a_k)\,a_k^T \qquad (28)$$

$$\underline{W}_{k+1} = \underline{W}_k + \beta\,x_k\,(b_k^T - a_k^T \overline{W}_k^T \overline{W}_k) \qquad (29)$$

where $a_k = \underline{W}_k^T x_k$ and $b_k = \overline{W}_k^T y_k$. Therefore, $\overline{W}_k \to \overline{W}^*$ and $\underline{W}_k \to \underline{W}^*$. The canonical solution for the weights would be the case where $\overline{M} = I$, but the structure of a standard feed-forward network does not provide any way of distinguishing it from the other solutions, thus there is no reason why the system would prefer it. Nevertheless, some extensions of the network structure, using the concept of deflation or introducing lateral connections in the hidden layer, can enforce preference for the canonical solution, as discussed below. The advantage of such a structure is the flexibility of gradually increasing or decreasing the rank of the network without retraining the old neurons.

Fig. 1. (a) The linear BP network with 1 hidden unit. (b) The proposed model for many hidden units with LON.
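To illustrate the claim that gradient descent on a two-layer linear network heads toward the reduced-rank optimum, here is a hedged sketch assuming NumPy. It uses batch (averaged) gradient steps rather than the per-sample rule, and the names `W_up`, `W_lo`, `step` and the problem sizes are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, p, N = 4, 3, 2, 200
X = rng.standard_normal((n, N))
Y = rng.standard_normal((m, n)) @ X + 0.1 * rng.standard_normal((m, N))

def error(W_up, W_lo):
    """Mean squared error of the network y ~ W_up W_lo^T x."""
    return np.linalg.norm(Y - W_up @ W_lo.T @ X) ** 2 / N

# Reference value: best rank-p error from the SVD of Y projected on the rows of X.
Y_proj = Y @ np.linalg.pinv(X) @ X
lam = np.linalg.svd(Y_proj, compute_uv=False)
E_min = (np.linalg.norm(Y - Y_proj) ** 2 + np.sum(lam[p:] ** 2)) / N

W_up = 0.1 * rng.standard_normal((m, p))   # upper-layer weights
W_lo = 0.1 * rng.standard_normal((n, p))   # lower-layer weights
E_start = error(W_up, W_lo)
step = 0.01                                # assumed small constant step size
for _ in range(20000):
    A = W_lo.T @ X                         # hidden activations a_k, one column per sample
    resid = Y - W_up @ A                   # output errors y_k - W_up a_k
    grad_up = resid @ A.T / N              # average of (y_k - W_up a_k) a_k^T
    grad_lo = X @ resid.T @ W_up / N       # average of x_k (b_k^T - a_k^T W_up^T W_up)
    W_up += step * grad_up
    W_lo += step * grad_lo
E_end = error(W_up, W_lo)
print(E_start, E_end, E_min)               # E_end should approach E_min from above
```

The final error can never fall below `E_min`, because the network map has rank at most `p`. How closely it approaches `E_min` depends on the step size and the number of iterations chosen here.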

B. Linear BP for Sequential Extraction of the Exact Components

1) Obtaining the First Component: The linear BP rule for a two-layer network with one hidden unit (Fig. 1(a)) is capable of extracting the first component. In this case the standard BP method becomes

$$\overline{w}_{k+1} = \overline{w}_k + \beta\,(y_k - a_k \overline{w}_k)\,a_k \qquad (30)$$

$$\underline{w}_{k+1} = \underline{w}_k + \beta\,(b_k - a_k \|\overline{w}_k\|^2)\,x_k \qquad (31)$$

where $a_k = \underline{w}_k^T x_k$ and $b_k = \overline{w}_k^T y_k$.

In the following we shall assume, for simplicity, that the input has full rank. A similar discussion can be carried out for the reduced-rank case.

Then according to (26) and (27), the global minimum is achieved at the points $(\overline{w}^*, \underline{w}^*) = \left(\rho_1 v_1,\; \rho_1^{-1} \frac{\beta_1}{\alpha_1^2} q_1\right)$, where $\rho_1$ is some non-zero scalar.

Special (symmetric) case: Self-supervised BP. A special case of the above result is the auto-associative BP network where, as discussed in the previous section, we have $X = Y$, $Q = V$ (since we are now assuming that $X$ has full rank) and $\beta_i = \alpha_i^2$. Thus $v_1$ is the principal component of $x_k$ and the net converges to the values $\overline{w}^* = \rho_1 v_1$, $\underline{w}^* = \rho_1^{-1} v_1$.

It is interesting to notice that the BP learning rule for the upper layer:

$$\overline{w}_{k+1} = \overline{w}_k + \beta\,[x_k - a_k \overline{w}_k]\,a_k \qquad (32)$$

is very similar to the linearized version of the normalized Hebbian rule introduced by Oja in [14], except that $a_k = \underline{w}_k^T x_k$ rather than $a_k = \overline{w}_k^T x_k$. A similar observation has also been mentioned in [19]. However, since $\underline{w}$ and $\overline{w}$ are going to converge to the same final vector (up to a scaling factor), we are motivated to change the BP rule by forcing $\overline{w}_k = \underline{w}_k = w_k$ for all times $k$, and by updating both upper- and lower-layer weights by the same rule

$$w_{k+1} = w_k + \beta\,[x_k - a_k w_k]\,a_k \qquad (33)$$

where now $a_k = w_k^T x_k$. This rule is now exactly the same as Oja's rule, which was shown to extract the (normalized) principal component of the input sequence, i.e., $w_k \to v_1$.
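The shared-weight rule can be tried on synthetic data. The following is a sketch assuming NumPy, with an assumed constant step size `beta` and an input distribution invented for the example (variances 9, 1 and 0.25 along a random orthonormal basis, so the principal direction is known in advance).

```python
import numpy as np

rng = np.random.default_rng(3)
n, N = 3, 5000
# Zero-mean inputs with standard deviations 3, 1, 0.5 along a random orthonormal basis.
basis, _ = np.linalg.qr(rng.standard_normal((n, n)))
X = basis @ (np.array([3.0, 1.0, 0.5])[:, None] * rng.standard_normal((n, N)))

w = 0.1 * rng.standard_normal(n)     # shared weight vector w_k
beta = 0.005                         # step size (assumed small constant)
for k in range(N):
    x = X[:, k]
    a = w @ x                        # a_k = w_k^T x_k
    w = w + beta * (x - a * w) * a   # w_{k+1} = w_k + beta [x_k - a_k w_k] a_k

v1 = basis[:, 0]                     # true principal direction of the inputs
norm_w = np.linalg.norm(w)
cosine = abs(w @ v1) / norm_w
print(norm_w, cosine)                # both are expected to be close to 1
```

The absolute value in `cosine` reflects that the rule can settle on either $v_1$ or $-v_1$, depending on the initial weights.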

2) Obtaining Multiple Components: For the extraction of multiple components it is instrumental to use an extended notion of deflation originally introduced for the symmetric eigenvalue problem [15]. In the basic symmetric eigenvalue problem, deflation is a transformation that "removes" the strongest eigenvalue, thus making the second strongest one dominant. For our problem, the extended concept of deflation is defined below:

Definition 1: Let $U D_1 Q^{-1}$, $V D_2 Q^{-1}$ be the GSVD of a matrix pair $(X^T, YX^T)$ and let the generalized eigenvalues be $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_{\hat{s}} \ge 0$. Then the mapping

$$YX^T \to \hat{Y} X^T \equiv \left(Y - \frac{\beta_1}{\alpha_1^2} v_1 q_1^T X\right) X^T \qquad (34)$$

or

$$YX^T \to \tilde{Y} X^T \equiv \left(Y - v_1 v_1^T Y\right) X^T \qquad (35)$$

or

$$YX^T \to Y \breve{X}^T \equiv Y \left(X^T - \frac{1}{\alpha_1^2} X^T q_1 q_1^T X X^T\right) \qquad (36)$$

where $v_1$, $q_1$ are the first columns of $V$ and $Q$ respectively, is called deflation transformation of $(X^T, YX^T)$.

Let us first examine the transformation $YX^T \to \hat{Y} X^T$. We have

$$V^T \hat{Y} X^T Q = V^T Y X^T Q - \frac{\beta_1}{\alpha_1^2} V^T v_1 q_1^T X X^T Q = \mathrm{diag}\{0, \beta_2, \dots, \beta_q\} \equiv \hat{D}_2 \qquad (37)$$

Then clearly $(U D_1 Q^{-1}, V \hat{D}_2 Q^{-1})$ is a GSVD of the pair $(X^T, \hat{Y} X^T)$, from which it follows that the generalized eigenvalues now are:

$$\hat{\lambda}_1 = 0, \quad \hat{\lambda}_2 = \lambda_2, \quad \dots, \quad \hat{\lambda}_{\hat{s}} = \lambda_{\hat{s}}.$$

So after the deflation, the second component becomes dominant and it can be extracted using exactly the same learning rule used for the first component (applied now to the deflated data). Note that the recursive nature of the above argument lends itself easily to the extension to the $p$th component case ($p > 1$). Similarly, for the transformations $YX^T \to \tilde{Y} X^T$ and $YX^T \to Y \breve{X}^T$, we can also show that

$$U^T X^T Q = D_1 \quad \text{and} \quad V^T \tilde{Y} X^T Q = \hat{D}_2 \qquad (38)$$

$$U^T X^T Q = D_1 \quad \text{and} \quad V^T Y \breve{X}^T Q = \hat{D}_2 \qquad (39)$$

which manifests deflation for the pairs $(X^T, \tilde{Y} X^T)$ and $(X^T, Y \breve{X}^T)$. Since all three transformations have the same GSVD it follows that

$$\left(Y - \frac{\beta_1}{\alpha_1^2} v_1 q_1^T X\right) X^T = \left(Y - v_1 v_1^T Y\right) X^T = Y \left(X^T - \frac{1}{\alpha_1^2} X^T q_1 q_1^T X X^T\right) \qquad (40)$$
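The effect of deflation can be seen numerically. The sketch below assumes NumPy and uses the second type of deflation, (35). It obtains $v_1$ and the values $\lambda_i$ from an ordinary SVD of $Y$ projected onto the row space of $X$, which is an assumption of this example standing in for a GSVD routine; the helper `components` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, N = 4, 3, 100
X = rng.standard_normal((n, N))
Y = rng.standard_normal((m, N))

P = np.linalg.pinv(X) @ X                       # projector onto the row space of X

def components(Y_data):
    """Left singular vectors and singular values of Y_data projected on the rows of X."""
    V, lam, _ = np.linalg.svd(Y_data @ P, full_matrices=False)
    return V, lam

V, lam = components(Y)
v1 = V[:, [0]]                                  # first component as a column vector
Y_defl = Y - v1 @ (v1.T @ Y)                    # deflation of the second type, eq. (35)
_, lam_defl = components(Y_defl)
print(lam)        # lambda_1 >= lambda_2 >= lambda_3
print(lam_defl)   # lambda_2, lambda_3, then a numerically zero value
```

After deflation the former second value leads the list, which is the sense in which the second component "becomes dominant" for the same single-unit learning rule.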

We can now apply the above results directly to the back propagation rule. For the $p$th neuron we assume that the previous $p - 1$ principal components have already been extracted, so that

$$\overline{w}_i = \rho_i v_i, \qquad \underline{w}_i = \rho_i^{-1} \frac{\beta_i}{\alpha_i^2} q_i, \qquad i = 1, \dots, p - 1.$$

Then the modified BP rule for this neuron

(41)

(42)

where $a_{pk} = \underline{w}_{pk}^T x_k$, is equivalent to a single-hidden-unit BP rule with a modified teacher:

$$\hat{y}_k = y_k - \sum_{i=1}^{p-1} \overline{w}_i a_{ik}, \qquad a_{ik} = \underline{w}_i^T x_k.$$

It is a simple exercise to show that the data matrix $\hat{Y} = [\hat{y}_1, \dots, \hat{y}_N]$ for the modified teacher is the $(p - 1)$-times deflation of $Y$, hence the $p$th component is now dominant and will be extracted by the network. So $\overline{w}_p$ and $\underline{w}_p$ will converge to $\rho_p v_p$ and $\rho_p^{-1} \frac{\beta_p}{\alpha_p^2} q_p$. Another version (version 2) of the modified BP rule, corresponding to the deflation transformation of the second type, is

(43)

(44)

Obviously the modified teacher is equal to the $(p - 1)$-times deflated original teacher. It follows that the weights will again converge to the $p$th asymmetric principal components. Finally, version 3 of the modified BP

(46)

(47)

corresponds to the deflation of the third type.

Version 1 ((41) and (42)) is more aesthetically pleasing since the modified teacher $\hat{y}_k$ is nothing but the original teacher after subtracting the reconstruction that has been achieved so far by the previous $p - 1$ neurons. This agrees intuitively with the concept of deflation, which is perceived as a process of subtracting the previous components.

3) Explicit Deflation and Stochastic Approximation: Ljung [12] and later Kushner and Clark [10] showed that under mild assumptions, and if $\beta_k \to 0$ as $k \to \infty$, stochastic equations of the form of (30) and (31) correspond to deterministic differential equations where the right-hand side is replaced by its conditional expectation assuming $\overline{w}$ and $\underline{w}$ are non-random. Later [11], [2] it was further shown that even if the step-size parameter $\beta$ does not go to zero but remains equal to a (small) constant, the mean values of the updated parameters $\overline{w}$, $\underline{w}$ still approximate the same differential equation. In our case the back propagation learning rules (30) and (31) correspond to the following differential equations

$$\frac{d\overline{w}(t)}{dt} = R_{yx}\,\underline{w}(t) - \overline{w}(t)\,\underline{w}(t)^T R_x\,\underline{w}(t) \qquad (48)$$

$$\frac{d\underline{w}(t)}{dt} = R_{yx}^T\,\overline{w}(t) - \|\overline{w}(t)\|^2 R_x\,\underline{w}(t) \qquad (49)$$

where $R_{yx} = E\{y x^T\}$ and $R_x = E\{x x^T\}$. If there are finitely many training samples ($N < \infty$) then we can construct an infinite training sequence by periodically repeating the data, so that $R_{yx} = \frac{1}{N} Y X^T$ and $R_x = \frac{1}{N} X X^T$. On the other hand, in the asymptotic case $N \to \infty$ the same expressions hold in the limit. In either case, from (17) and (5), we have

$$Q^T R_x Q = \Sigma_1 = \frac{1}{N} D_1^T D_1 \qquad (50)$$

$$V^T R_{yx} Q = \Sigma_2 = \frac{1}{N} D_2 \qquad (51)$$

where $\Sigma_1 \in \mathbb{R}^{n \times n}$ and $\Sigma_2 \in \mathbb{R}^{m \times n}$ are diagonal matrices with entries $\frac{1}{N} \alpha_i^2$, $i = 1, \dots, n$, and $\frac{1}{N} \beta_i$, $i = 1, \dots, q$, respectively (apply the limit $N \to \infty$ when appropriate). By construction of the back propagation rule, (48) and (49) form a gradient descent procedure on the surface of the error function

$$J = E\{\|y - \overline{w}\,\underline{w}^T x\|^2\}$$

Applying the results of the previous section asymptotically, we can say that $J$ has no local minima (under assumption (A)) and the minimum point is $(\overline{w}^*, \underline{w}^*) = \left(\rho_1 v_1,\; \rho_1^{-1} \frac{\beta_1}{\alpha_1^2} q_1\right)$. Furthermore, gradient descent is guaranteed to converge to the global minimum, so $(\overline{w}(t), \underline{w}(t)) \to (\overline{w}^*, \underline{w}^*)$ as $t \to \infty$, and according to [12] the discrete vectors $(\overline{w}_k, \underline{w}_k)$ will also converge to the same point.

For the $p$th neuron we assume as usual that $\overline{w}_i = \rho_i v_i$ and $\underline{w}_i = \rho_i^{-1} \frac{\beta_i}{\alpha_i^2} q_i$ for $i = 1, \dots, p - 1$. Then versions 1, 2, and 3 of the modified BP correspond to the following differential equations

Version 1:

(52)

(53)

Version 2:

(54)

(55)

Version 3:

(56)

(57)

Extending the equivalence shown in (40) to $p - 1$ successive deflations, the three deflated cross-correlation matrices coincide, so

$$(52) \Leftrightarrow (54) \Leftrightarrow (56)$$

$$(53) \Leftrightarrow (55) \Leftrightarrow (57)$$

thus the three sets of differential equations are completely equivalent. It follows by reverse argument that any one of (41), (43), or (46), combined with any one of (42), (44), or (47), corresponds to the same set of differential equations, thus any of the nine combinations should extract the $p$th component.

C. The Lateral Orthogonalization Net

We saw in Section II that reduced-rank approximation is a generalized Principal Component Analysis (PCA) problem. Orthogonality is a central property of the PCA solution: for example, in standard PCA the eigenvectors (principal components) of the symmetric autocorrelation matrix $R_x$ are orthogonal. In our case the columns of $V$ are orthogonal while the columns of $Q$ are $XX^T$-orthogonal, namely

$$v_i^T v_j = 0, \quad i \ne j$$

$$q_i^T X X^T q_j = 0, \quad i \ne j \quad (\text{ref. } (17))$$

It has also been demonstrated in Section III that this orthogonality can be achieved via a deflation transformation, assuming that all previous components have been extracted. Sanger [18] followed a similar approach in the standard PCA problem using the standard definition of deflation for that case. In the reduced-rank approximation problem we showed that a deflated version of the back propagation rule can be used for solving the related generalized PCA problem. However, this is an inherently sequential approach, since all previous components need to be present before a new component can be extracted. Kung and Diamantaras [8] proposed the APEX model for the standard PCA case, where deflation was in effect implemented using lateral connections among the units. These connections are trained by some lateral connection rule, thus replacing the artificial explicit deflation of Sanger's method with a more natural in-network deflation. In addition, the proposed approach achieved a computational advantage over Sanger's method. Also, APEX turned out to be easily parallelizable [4], [5], although it was originally proposed as a sequential model. The natural question then arises: can we extend the concept of lateral connection networks to the generalized PCA problem of reduced-rank approximation? We will see that the answer to this question is yes, and in the following we develop a general framework for this approach that will lead to a parallel solution to the generalized principal component extraction problem. Naturally, we will call it the Lateral Orthogonalization Network (LON). Two kinds of learning rules can be used to train the LON:

Local Orthogonalization Rule. Let $a_{1k}$ and $a_{2k}$ be two stochastic processes (e.g., the activation sequences of two different neurons), the inner product of which is defined as

$$\langle a_1, a_2 \rangle = E\{a_{1k} a_{2k}\}$$

The role of the lateral network is to find a weight strength $c$ such that the modified process

$$a'_{2k} = a_{2k} - c\,a_{1k} \qquad (58)$$

is orthogonal to $a_{1k}$, i.e.,

$$\langle a'_2, a_1 \rangle = 0 \qquad (59)$$

Assuming stationarity, the value $c$ is not a function of time. By Gram-Schmidt orthogonalization [6] the solution is

$$c^* = \frac{\langle a_1, a_2 \rangle}{\langle a_1, a_1 \rangle} = \frac{E\{a_1 a_2\}}{E\{a_1^2\}} \qquad (60)$$

which means that the component that needs to be removed from $a_2$ in order to achieve orthogonality is exactly the projection of $a_2$ onto $a_1$ (see Fig. 2). The following orthogonalization learning rule is then proposed for tracking the optimal value $c^*$:

$$c_{k+1} = c_k + \beta\,[a_{1k} a_{2k} - c_k a_{1k}^2] \qquad (61)$$

Assuming that the system with which this rule is combined is stable, then from stochastic approximation theory⁵ it follows that (61) approximates the differential equation

$$dc'(t)/dt = E\{a_1 a_2\} - c'(t)\,E\{a_1^2\} \qquad (62)$$

which in turn tracks the value of $c^* = E\{a_1 a_2\} / E\{a_1^2\}$. We call this orthogonalization rule local, because the values $a_1$ and $a_2$ utilized for updating the weight strength $c$ are local to the lateral synaptic connection: they are just the activations of the neurons at both ends of the connection.

⁵ Again refer to [12], [10].

Immediate Orthogonalization Rule. In some cases the lateral orthogonalization weights can be trained by a specialized rule which depends on the dynamics of the specific system. No general equation can be prescribed for such a rule since it is usually different for each system it is applied to. However, such a rule is usually easy to produce once the quantities to become orthogonal

DIAMANTARAS AND KUNG: MULTILAYER NEURAL NETWORKS FOR REDUCED-RANK APPROXIMATION

~

69 1

/

Indeed, expanding we obtain the following associated differ- ential equations

{all

Fig. 2. The lateral orthogonalization network is based on the principle of removing any duplication of old components from a new component. The part to be removed in order to achieve orthogonality may be depicted as the projection from the new component to the old component. As shown, after the removal, the residue is orthogonal to the old component.

have been identified and the system dynamics have been set up. One example is the APEX lateral orthogonalization network described in [8]. The rule proposed there for the lateral weight c, was of the form

between neurons 1 and 2, where a l k = W T X k , a 2 k = w$&. Since in APEX, the adaptive rule for wz is

w Z , k + l = w Z k + p [ a 2 k x k - w 2 k a : k ] (64)

while Ck needs to be equal to W T W Z k : clearly a simple multiplication of (64) from the left by wT yields (63). Had the updating rule for wz been different, we would have obtained a different adaptive rule for c. Hence this rule is case-dependent. We will also call this rule immediate because Ck equals the inner product W&wl at all times (assuming CO = wgowl) rather than tracks it through some sort of dynamic equation.

1. Dej7ation using LON: Contrary to the standard PCA problem and APEX model, for the linear approximation problem discussed here we will need two lateral connections for each pair of hidden neurons: one for controlling the orthogonality of the upper-layer weights and one for the lower-layer weights (see Fig. l(b)). We will first discuss the sequential version of the model. Assuming that the first p - 1 components have already been extracted the proposed model for the pth neuron is defined below

where

6This value is necessary to ensure deflation in the standard PCA case so that the second neuron will extract the second component which will be orthogonal to the first one. We will not embark into further details here. For more elaborate discussion on the subject refer to [5 ] .

-p - dw d t (') - RFxSp(t) - R : x ~ i ~ p i ( t )

Ideally, $\overline{c}_{pi}$ and $\underline{c}_{pi}$ should be equal to

so that the above differential equations will become the same as (54) and (57). As we already saw, this implies that the $p$th component is extracted from the network. However, the optimal values $\overline{c}^*_{pi}$, $\underline{c}^*_{pi}$ are not available immediately and therefore must be tracked.⁷ That is done by the local type orthogonalization rules

(73)

with associated differential equations

which make evident that the local orthogonalization rules for $\overline{c}_{pi}$ and $\underline{c}_{pi}$ indeed track the values $\overline{c}^*_{pi}$ and $\underline{c}^*_{pi}$ as required.

Fig. 3 shows a simulation experiment of a sequential LON network using (65), (66), (72), and (73), extracting multiple components. The data used are 100 artificially produced $(x, y)$ pairs from two colored random sequences, repeated cyclically in sweeps, where the input dimension is $n = 4$ and the output dimension is $m = 3$. The y-axis corresponds to the component estimation error while the x-axis corresponds to the number of iterations (each sweep contains 100 iterations, so the plot contains results from 100 sweeps). We experimentally found the value $\beta = 0.01$ to be a good step-size constant for our data and this is the value we used in our simulations. The curves in Fig. 3 are superimposed and it is understood that the second unit is trained after the first one, the third unit after the second, etc. We found that for larger input and output dimensions the convergence of the network can be considerably slower than depicted in Fig. 3. However, even if slowly, the network eventually always converged to the appropriate solution.

⁷More precisely, the values of $\overline{w}_i$ and $\overline{w}_{pk}$ are immediately available, so $\overline{c}^*_{pi} = \overline{w}_i^T \overline{w}_{pk} / \|\overline{w}_i\|^2$ can be adapted by an immediate orthogonalization rule. Indeed, multiplying (65) from the left by $\overline{w}_i^T / \|\overline{w}_i\|^2$ one obtains:

4 z , k + l cpi,k + b [ b ~ k ~ ~ k / 1 1 ~ z ; ( ) ~ - Epi,ka&]

However, the value of $\underline{c}^*_{pi}$ is not readily available because of the involvement of $R_x$, which is never explicitly computed by the network. A time average approach is needed to track $\underline{c}^*_{pi}$, thus making the local orthogonalization rule necessary. For the sake of uniformity we adopt the local rule for both $\overline{c}_{pi}$ and $\underline{c}_{pi}$.
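The two kinds of lateral rules discussed in this section can be illustrated with a short simulation. The sketch below is an illustration under stated assumptions, not the authors' experiment: the two Gaussian processes, the step sizes, the weight vectors, and the function names (`local_rule`, `immediate_rule_gap`, `make_pairs`) are all hypothetical. The first part iterates the local rule (61) and compares the tracked weight with the Gram-Schmidt value (60); the second part iterates (63) and (64) side by side and measures how far $c_k$ drifts from $w_1^T w_{2k}$.

```python
import random

def local_rule(pairs, beta=0.005, c0=0.0):
    """Track c* with the local orthogonalization rule, eq. (61)."""
    c, history = c0, []
    for a1, a2 in pairs:
        c += beta * (a1 * a2 - c * a1 * a1)
        history.append(c)
    return history

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def immediate_rule_gap(xs, w1, w2, beta=0.01):
    """Iterate (63) and (64) together; return max |c_k - w1^T w2k|."""
    c = dot(w1, w2)                      # c_0 = w_20^T w_1
    w2 = list(w2)
    gap = 0.0
    for x in xs:
        a1, a2 = dot(w1, x), dot(w2, x)
        c += beta * (a1 * a2 - c * a2 * a2)                      # eq. (63)
        w2 = [w + beta * (a2 * xi - w * a2 * a2)                 # eq. (64)
              for w, xi in zip(w2, x)]
        gap = max(gap, abs(c - dot(w1, w2)))
    return gap

def make_pairs(count, seed=0):
    """Assumed test processes: a2 = 0.8*a1 + 0.6*noise, so c* is about 0.8."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(count):
        a1 = rng.gauss(0.0, 1.0)
        pairs.append((a1, 0.8 * a1 + 0.6 * rng.gauss(0.0, 1.0)))
    return pairs

pairs = make_pairs(60000)
history = local_rule(pairs)
c_tracked = sum(history[-20000:]) / 20000          # time average of c_k
c_star = (sum(a1 * a2 for a1, a2 in pairs)
          / sum(a1 * a1 for a1, _ in pairs))       # sample version of (60)

rng = random.Random(1)
xs = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(500)]
gap = immediate_rule_gap(xs, w1=[0.6, 0.0, 0.8], w2=[0.1, 0.2, -0.1])

print(c_tracked, c_star, gap)
```

The local rule only fluctuates around $c^*$ (hence the time average), whereas the immediate rule reproduces the inner product exactly up to floating-point rounding, which is the distinction the text draws between tracking and equality at all times.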

Fig. 3. Convergence of the linear approximation model with three hidden units using LON (panel title: BP+LON; curves a, b, c: 1st, 2nd, and 3rd component; horizontal axis: iterations, 0 to 10000). (a) Plot of the component estimation error of the upper layer. (b) Plot of the component estimation error of the lower layer.

2. Parallel LON: Since the hidden units in the network are hierarchically structured, each node is not affected by any nodes following it. If the neurons prior to a node have converged reasonably close to the appropriate components, then the node will converge to its corresponding component. So we heuristically contend that if we let all the nodes work concurrently, the network will extract the principal components in parallel rather than one after the other. Indeed, the first hidden node is unaffected since it has no prior nodes. In turn, the second hidden neuron will start converging to the second component no later than the first one converges to the first component, assuming that the final variance of the steady state of the first neuron does not affect the convergence of the second neuron. Similarly, the $p$th neuron will start converging to the $p$th component no later than the $(p-1)$th neuron has converged. This heuristic argument is verified by the simulation experiments depicted in Fig. 4. We used the same training data as in the sequential experiment of Fig. 3. Notice that now the x-axis indications (iterations) should be interpreted literally, i.e., the three curves are not superimposed to save space (as was the case with Fig. 3) but rather they actually show the neurons operating in parallel. Therefore the third unit converges in a little more than 1/3 of the time it took to converge in the serial version, if we count the time it takes for the previous components to converge in the serial version. Similarly, the second unit converges in approximately half the time. However, we may not expect the $p$th unit to converge before the first $p - 1$ units have done so. This is best demonstrated by the behavior of the third unit in Fig. 4. Notice that for the first 2000 iterations (or 20 sweeps) this unit is actually moving away from the correct component. This is because it receives the wrong "clue" from one or both of the previous units that have not yet converged. After the 2000-iteration mark, however, the unit starts converging because at that point the second unit (and the first one before it) has already practically converged to its corresponding component.

Fig. 4. The parallel model with three hidden units using LON (panel title: BP+LON; curves a, b, c: 1st, 2nd, and 3rd component; horizontal axis: iterations). The training data used are the same as in Fig. 3. (a) Plot of the component estimation error of the upper layer. (b) Plot of the component estimation error of the lower layer.
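The hierarchical argument can be illustrated with a much simpler toy than the network itself. The sketch below is an assumption-laden illustration, not the model of (65), (66), (72), and (73): it keeps three fixed stationary processes (no weight adaptation) and only trains the lateral weights `c21`, `c31`, `c32` concurrently with the local rule (61). The processes, step size, and function names are hypothetical. The weight `c32` is driven by the residual of unit 2, which is itself not yet orthogonal to unit 1 at the start, so it initially chases a moving target, loosely mirroring the wrong "clue" described above.

```python
import random

def make_samples(count, seed=1):
    """Assumed test processes built from three independent Gaussian sources."""
    rng = random.Random(seed)
    out = []
    for _ in range(count):
        n1, n2, n3 = (rng.gauss(0.0, 1.0) for _ in range(3))
        out.append((n1, 0.8 * n1 + 0.6 * n2, 0.5 * n1 + 0.5 * n2 + 0.7 * n3))
    return out

def train_parallel(samples, beta=0.005):
    """Update all lateral weights in the same iteration (parallel operation)."""
    c21 = c31 = c32 = 0.0
    trace = []
    for a1, a2, a3 in samples:
        r2 = a2 - c21 * a1                        # current residual of unit 2
        c21 += beta * (a1 * a2 - c21 * a1 * a1)   # unit 2 against unit 1
        c31 += beta * (a1 * a3 - c31 * a1 * a1)   # unit 3 against unit 1
        c32 += beta * (r2 * a3 - c32 * r2 * r2)   # unit 3 against residual of unit 2
        trace.append((c21, c31, c32))
    return trace

def tail_mean(trace, count):
    """Time-average the last `count` weight triples to smooth out fluctuations."""
    tail = trace[-count:]
    return tuple(sum(t[j] for t in tail) / len(tail) for j in range(3))

trace = train_parallel(make_samples(80000))
c21, c31, c32 = tail_mean(trace, 40000)
print(c21, c31, c32)
```

For these assumed processes the Gram-Schmidt values are $0.8$, $0.5$, and $0.3/0.36 \approx 0.833$, and the time-averaged weights settle near them even though no unit waits for its predecessors.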

IV. APPLICATION EXAMPLE

One application of reduced-rank approximation arises in system identification. Consider the Single-Input Single-Output (SISO) linear system⁸

$$s_{k+1} = A s_k + b x_k \qquad (74)$$

$$z_k = c^T s_k \qquad (75)$$

where $A \in R^{p \times p}$ is a non-singular matrix, $s_k \in R^p$ is the state vector, and $x_k$, $z_k$ are the input and output of the system respectively. We assume that the actual observation $y_k$ equals the output plus observation noise. If the input sequence has finite length then we can write:

$$\begin{bmatrix} y_{\ell} & y_{\ell+1} & \cdots & y_{\ell+N-1} \\ y_{\ell+1} & y_{\ell+2} & \cdots & y_{\ell+N} \\ \vdots & \vdots & & \vdots \\ y_{\ell+m-1} & y_{\ell+m} & \cdots & y_{\ell+N+m-2} \end{bmatrix} = \begin{bmatrix} z_{\ell} & \cdots & z_{\ell+N-1} \\ \vdots & & \vdots \\ z_{\ell+m-1} & \cdots & z_{\ell+N+m-2} \end{bmatrix} + \text{noise} \qquad (76)$$

or in matrix form

$$Y = Z + \mathcal{E} = HX + \mathcal{E} \qquad (77)$$

where $X \in R^{n \times N}$ and $Z \in R^{m \times N}$ are the input and output data matrices, $H \in R^{m \times n}$ is the system Hankel matrix, and $Y \in R^{m \times N}$ is the observation data matrix corrupted by noise $\mathcal{E}$. Notice that $n = \ell + N > N$ for $\ell > 0$, and since we assume $x_\ell \neq 0$, $X$ has full rank $r = N$. The problem then is to find an approximation $H^*$ of $H$, so that the square error $\|Y - H^* X\|_F^2$ is minimum. From basic linear system theory we know that $H$ can be factored into the product

$$H = \mathcal{O}\,\mathcal{C}$$

where $\mathcal{O} \in R^{m \times p}$ and $\mathcal{C} \in R^{p \times n}$ are the system observability and controllability matrices respectively. From this factorization it follows that $H$ has rank $p$, therefore a rank-$p$ approximation is sufficient to achieve a minimum error $E = 0$ in the noise-free case. In some applications (e.g., speech processing) we might have a priori information about the order of the underlying system, so a reduced-rank approximation approach is justified in this case.

⁸A similar formulation as the following one arises for a general MIMO system. We present a SISO example here for simplicity.

Furthermore, since $n > N$ we cannot assume invertibility of the matrix $XX^T$ and therefore we cannot use the results of previously published works on the subject. However, we can pad $\ell$ zero-columns to both $X$ and $Y$ without changing the error function:

$$\|Y - H^* X\|_F^2 = \|[Y \;\, 0] - H^* [X \;\, 0]\|_F^2$$

so the new matrix $X' = [X \;\, 0]$ has the same number of rows as columns and we can apply the results of this paper to the problem. In this case, following the notation of Section II, we have $r = N$, $\tilde{r} = \min\{m, N\}$ and $s = \mathrm{rank}\{YX^T\}$.⁹ The condition $\tilde{r} = s$ is sufficient to guarantee absence of local minima in the error function $E$ (cf. condition (A)). So in the noise-free case, since $s = p$ and $m > p$, we can guarantee freedom from local minima if $N = p$. In the noisy case, $s = \min\{m, N, n\} = \min\{m, N\}$ (since $n \geq N$), so we always have $s = \tilde{r}$ and freedom from local minima is guaranteed for any value of $m$ and $N$. In the noise-free case, if the error surface of the cost function contains no local minima then any learning rule minimizing

$$E = \|Z - \overline{W}\,\underline{W}^T X\|_F^2$$

will produce weights such that

$$Z = \mathcal{O}\,\mathcal{C} X = \overline{W}\,\underline{W}^T X. \qquad (79)$$
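The structural facts used above (the Hankel matrix factors through the observability and controllability matrices, it has rank $p$, and zero-padding leaves the error unchanged) can be checked on a small example. The following NumPy sketch is illustrative only: the system `(A, b, c)`, the dimensions, the input values, and the Toeplitz layout assumed for `X` (each column holds the reversed input sequence shifted down by one row) are assumptions made for this sketch, chosen to be consistent with $n = \ell + N$ and with $X$ having full rank $N$ when $x_\ell \neq 0$.

```python
import numpy as np

mp = np.linalg.matrix_power

# Hypothetical stable SISO system of order p = 3.
A = np.diag([0.9, -0.5, 0.3])
b = np.ones(3)
c = np.array([1.0, 2.0, -1.0])
p, m, ell, N = 3, 6, 2, 5
n = ell + N

# Hankel matrix of Markov parameters h_k = c^T A^k b, and its two factors.
H = np.array([[c @ mp(A, i + j) @ b for j in range(n)] for i in range(m)])
Obs = np.vstack([c @ mp(A, i) for i in range(m)])           # m x p
Ctr = np.column_stack([mp(A, j) @ b for j in range(n)])     # p x n

# Assumed layout of the input data matrix X (n x N).
x = np.array([1.0, 0.5, 0.4])            # x_0, ..., x_ell with x_ell != 0
X = np.zeros((n, N))
for j in range(N):
    X[j:j + ell + 1, j] = x[::-1]

rng = np.random.default_rng(0)
Z = H @ X                                          # noise-free output matrix
Y = Z + 0.1 * rng.standard_normal((m, N))          # noisy observations
H_star = rng.standard_normal((m, n))               # any candidate approximation

# Pad ell zero-columns to both X and Y; X_pad becomes square (n x n).
X_pad = np.hstack([X, np.zeros((n, ell))])
Y_pad = np.hstack([Y, np.zeros((m, ell))])
err = np.linalg.norm(Y - H_star @ X, "fro") ** 2
err_pad = np.linalg.norm(Y_pad - H_star @ X_pad, "fro") ** 2

print(np.linalg.matrix_rank(H, tol=1e-10), np.linalg.matrix_rank(X), err, err_pad)
```

The padded columns contribute only zeros to the residual, which is why `err` and `err_pad` coincide for any `H_star`.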

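The matrices $\mathcal{C}X$ and $\underline{W}^T X$ are both full rank so there exists an invertible matrix $\Psi$ such that

$$\mathcal{O} = \overline{W}\,\Psi \qquad (80)$$

Thus, $\overline{W}$ is the observability matrix of the system after the similarity transformation $(\hat{A}, \hat{b}, \hat{c}^T) = (\Psi A \Psi^{-1}, \Psi b, c^T \Psi^{-1})$ and the first row of $\overline{W}$ is equal to $\hat{c}^T = c^T \Psi^{-1}$. Let $\overline{W}^{\uparrow}$ and $\overline{W}^{\downarrow}$ denote the matrices resulting from $\overline{W}$ by removing the first and the last row respectively. Then

$$\overline{W}^{\uparrow} = \overline{W}^{\downarrow} \hat{A} \qquad (81)$$

and $\hat{A}$ can be estimated by [7]

$$\hat{A} = \{\overline{W}^{\downarrow}\}^{L}\, \overline{W}^{\uparrow} \qquad (82)$$

where $\{\overline{W}^{\downarrow}\}^{L}$ is the left pseudoinverse of $\overline{W}^{\downarrow}$. Furthermore, from (79) and the fact that $\mathcal{O}$ has no fewer rows than columns, we obtain

$$\mathcal{C} X = \Psi^{-1} \underline{W}^T X \qquad (83)$$

Equating the first columns of the two sides of the above equation, with $x_1$ denoting the first column of $X$, we get

$$\sum_{i=0}^{\ell} x_i A^{\ell - i} b = \Psi^{-1} \underline{W}^T x_1 \qquad (84)$$

$$\sum_{i=0}^{\ell} x_i \Psi A^{\ell - i} \Psi^{-1} \Psi b = \underline{W}^T x_1 \qquad (85)$$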

⁹It is also implicitly assumed that $m, n, N \geq p$ so $p \leq s$ as required in the theoretical analysis of Section II.

so $\hat{b}$ is computed as

$$\hat{b} = \Psi b = \Big[\sum_{i=0}^{\ell} x_i \hat{A}^{\ell - i}\Big]^{-1} \underline{W}^T x_1 \qquad (86)$$

Equations (82) and (86), along with the fact that the first row of $\overline{W}$ is equal to $\hat{c}^T$, complete the solution procedure for the identification problem in the noise-free case. The system matrix $H$ can then be easily reconstructed by computing the Markov parameters $h_i = \hat{c}^T \hat{A}^i \hat{b}$. In the presence of noise, however, this approach becomes an approximation, the accuracy of which deteriorates as the SNR decreases.
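The noise-free solution procedure can be sketched end to end as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: instead of training a network, it manufactures weights that satisfy (80) and (83) from a hypothetical system `(A, b, c)` and an arbitrary invertible `Psi`, and then applies (82) and (86). The function name `identify` and all numerical values are illustrative, and the first column of $X$ is assumed to hold the reversed input sequence followed by zeros.

```python
import numpy as np

mp = np.linalg.matrix_power

# Hypothetical true system (order p = 3), used only to generate the weights.
A = np.diag([0.9, -0.5, 0.3])
b = np.ones(3)
c = np.array([1.0, 2.0, -1.0])
m, ell, N = 6, 2, 5
n = ell + N
x = np.array([1.0, 0.5, 0.4])                 # finite input x_0, ..., x_ell

Obs = np.vstack([c @ mp(A, i) for i in range(m)])          # observability, m x p
Ctr = np.column_stack([mp(A, j) @ b for j in range(n)])    # controllability, p x n
x1 = np.zeros(n)
x1[:ell + 1] = x[::-1]                        # assumed first column of X

# Noise-free weights consistent with (80) and (83) for an arbitrary invertible Psi.
Psi = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 1.0], [1.0, 0.0, 1.0]])
W_up = Obs @ np.linalg.inv(Psi)               # upper-layer weights, m x p
WlowT_x1 = Psi @ Ctr @ x1                     # lower-layer response to x1

def identify(W_up, WlowT_x1, x, ell):
    """Recover (A_hat, b_hat, c_hat) from the weights via (82) and (86)."""
    # Left pseudoinverse of W_up without its last row, times W_up without its first row.
    A_hat = np.linalg.pinv(W_up[:-1]) @ W_up[1:]            # eq. (82)
    c_hat = W_up[0]                                         # first row of W_up
    S = sum(x[i] * mp(A_hat, ell - i) for i in range(ell + 1))
    b_hat = np.linalg.solve(S, WlowT_x1)                    # eq. (86)
    return A_hat, b_hat, c_hat

A_hat, b_hat, c_hat = identify(W_up, WlowT_x1, x, ell)
h_true = np.array([c @ mp(A, i) @ b for i in range(8)])
h_hat = np.array([c_hat @ mp(A_hat, i) @ b_hat for i in range(8)])

print(np.sort(np.linalg.eigvals(A_hat).real))
print(np.max(np.abs(h_hat - h_true)))
```

The recovered triple differs from the true one by the similarity transformation $\Psi$, but the eigenvalues of $\hat{A}$ and the Markov parameters are invariant under it, which is why they can be compared directly with those of the true system.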

A. Simulation Experiments

We simulated a two-layer linear back propagation network with $n = 19$ inputs, $m = 20$ outputs and $p = 10$ hidden units. The input to the linear system is a finite random sequence of length $\ell + 1 = 10$ and the output is corrupted with white Gaussian noise of adjustable power. The network is trained by presenting corresponding columns from the matrices $X$ (input) and $Y$ (teacher). There are in total $N = 10$ $(= n - \ell)$ such input-teacher column pairs. In order to produce $X$ and $Y$, a system of order 10 was simulated by creating random matrices $(A, b, c^T)$, making sure that stability is preserved. In Fig. 5 we show the estimate of the impulse response of the system produced by the method described above, after 2000 sweeps over the training data. The simulation was done on a SUN SPARC-station and consumed approximately 7 min of execution time. In Fig. 6 we show the estimates of the eigenvalues of the system, obtained through the eigenvalues of $\hat{A}$. For each plot, 100 separate experiments were run, and each plot corresponds to a different (per sample) noise power. No preprocessing is performed on the data to filter out the noise. The output signal $z_k$ is a weighted sum of damped sinusoids, so its average (per sample) power decreases if we sufficiently increase the size of the observation time interval. For that reason we avoid using the SNR terminology. For reference purposes, however, we mention that the average power of $z$ in this particular time window of the experiment was 57.07. As expected, without prior noise removal, the result is fairly accurate for small noise powers and its accuracy decreases as the noise power gets large.

V. CONCLUSIONS

In this paper we developed the general analytical solution for the reduced-rank linear approximation problem using no assumption about the invertibility of the input autocorrelation matrix. The special cases of Wiener filtering and Principal Component Analysis as well as the reduced-rank approximation problem with full rank input were unified under our formulation. The solution we found is related to the generalized singular value decomposition of the pair $(X^T, YX^T)$, which collapses to the SVD of the input data in the auto-association paradigm. Following the results of Baldi and Hornik we showed that there are no local minima in the error surface of the linear network under reasonable conditions, thus we can guarantee, for example, the convergence of the back propagation algorithm to a global minimum.

Fig. 5. The actual impulse response of the system (solid line) versus the neural network estimated response (dashed line) after 2000 sweeps using the linear back propagation algorithm. The horizontal axis is time (0 to 40).

The connection of our analysis with two-layer neural network models is made by showing that a linear back propagation model with a reduced number of hidden neurons is essentially solving the generalized eigenvalue problem related to reduced-rank approximation. After training, the weights of the network extract the generalized singular vectors related to the input and teacher data matrices. We showed that the exact components can also be extracted either by using the concept of deflation in conjunction with back propagation or by employing an LON. Our analysis is more general, extending the LON approach that was initially introduced for the standard PCA problem.

In many applications, such as spectral estimation, sensor array processing or system identification, it is known that reduced-rank approximation is a key concept for attacking the problem since the signal resides in a low dimensional subspace usually called the signal subspace. Identification of this subspace is the objective of many powerful techniques such as the Toeplitz Approximation Method [7] or ESPRIT [16]. Similarly, in this paper we find that the reduced input rank problem is related to a general type of identification problem where the input has finite support but is not an impulse. In this case we can use our approach to estimate the system structure, assuming that we have some a priori knowledge of the order. We can use either the adaptive neural network solution or the analytical solution provided by GSVD. The adaptive solution has the ability to track a slowly time-varying system while the analytical approach is faster in general. We observed through simulation examples that the performance of optimal reduced-rank filtering is very good at relatively low noise powers but deteriorates rather quickly as the noise power increases. Nevertheless, we should also mention that the problem of detecting damped sinusoids is harder than the standard harmonic retrieval problem where there is no damping (i.e., the eigenvalues lie on the unit circle). In practice some noise filtering pre-processing stage would be needed for improving the performance of the proposed technique.

Fig. 6. Estimates of the complex poles of a simulated linear system for various noise powers: (a) 0.0025, (b) 0.01, (c) 0.04, (d) 0.16. The noise is white Gaussian, and each subfigure contains 100 experiments with different realizations of the noise. The actual poles of the system are denoted by "x" and the estimated ones by ".".

APPENDIX PROOF OF THE NON-EXISTENCE OF LOCAL MINIMA

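Any critical point $(\overline{W}, \underline{W})$ of (23) is defined as one where $\partial E/\partial \overline{W} = 0$ and $\partial E/\partial \underline{W}^T = 0$, with the derivative of a scalar function $f(Q)$ over a matrix $Q$ defined as the matrix with elements $\partial f/\partial q_{ij}$. It is easy to show that

$$\partial \|G - KQL\|_F^2 / \partial Q = -2\,(K^T G L^T - K^T K Q L L^T)$$

for any conformal set of matrices $G$, $K$ and $L$, therefore any critical point of $E$ satisfies the equations

$$\frac{\partial E}{\partial \overline{W}} = 0 \;\Rightarrow\; V \Lambda_{\tilde{r}} [D_1, 0] Q^{-1} \underline{W} = \overline{W}\,\underline{W}^T Q^{-T} \begin{bmatrix} D_1 \\ 0 \end{bmatrix} [D_1, 0] Q^{-1} \underline{W}$$

$$\frac{\partial E}{\partial \underline{W}^T} = 0 \;\Rightarrow\; \overline{W}^T V \Lambda_{\tilde{r}} [D_1, 0] Q^{-1} = \overline{W}^T \overline{W}\,\underline{W}^T Q^{-T} \begin{bmatrix} D_1 \\ 0 \end{bmatrix} [D_1, 0] Q^{-1}$$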
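The matrix derivative used above is easy to sanity-check numerically. The sketch below is a hedged illustration with arbitrary random matrices (the shapes and the helper names `loss`, `analytic_grad`, `numeric_grad` are assumptions of this sketch): it compares the closed-form gradient $-2(K^T G L^T - K^T K Q L L^T)$ of $\|G - KQL\|_F^2$ with central finite differences, and confirms that the gradient vanishes at the least-squares point $Q^* = K^{+} G L^{+}$, where $K^T G L^T = K^T K Q^* L L^T$.

```python
import numpy as np

def loss(G, K, Q, L):
    """Squared Frobenius norm of the residual G - K Q L."""
    return np.linalg.norm(G - K @ Q @ L, "fro") ** 2

def analytic_grad(G, K, Q, L):
    """Closed-form derivative of the loss with respect to Q."""
    return -2.0 * (K.T @ G @ L.T - K.T @ K @ Q @ L @ L.T)

def numeric_grad(G, K, Q, L, h=1e-6):
    """Central finite differences, one entry of Q at a time."""
    grad = np.zeros_like(Q)
    for i in range(Q.shape[0]):
        for j in range(Q.shape[1]):
            dQ = np.zeros_like(Q)
            dQ[i, j] = h
            grad[i, j] = (loss(G, K, Q + dQ, L) - loss(G, K, Q - dQ, L)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 6))
K = rng.standard_normal((5, 3))
Q = rng.standard_normal((3, 4))
L = rng.standard_normal((4, 6))

g_analytic = analytic_grad(G, K, Q, L)
g_numeric = numeric_grad(G, K, Q, L)

# A critical point: the gradient vanishes at Q* = pinv(K) G pinv(L).
Q_star = np.linalg.pinv(K) @ G @ np.linalg.pinv(L)
g_star = analytic_grad(G, K, Q_star, L)

print(np.max(np.abs(g_analytic - g_numeric)), np.max(np.abs(g_star)))
```

The constant factor $-2$ does not affect the location of the critical points, which is all the subsequent argument relies on.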

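Equivalently,

$$\Lambda_{\tilde{r}} [D_1, 0] Q^{-1} \underline{W} = V^T \overline{W}\,\underline{W}^T Q^{-T} \begin{bmatrix} D_1 \\ 0 \end{bmatrix} [D_1, 0] Q^{-1} \underline{W}$$

$$\overline{W}^T V \Lambda_{\tilde{r}} = \overline{W}^T \overline{W}\,\underline{W}^T Q^{-T} \begin{bmatrix} D_1 \\ 0 \end{bmatrix}$$

or

$$\Lambda_{\tilde{r}}\, \underline{F} = \overline{F}\,\underline{F}^T \underline{F} \qquad (87)$$

$$\overline{F}^T \Lambda_{\tilde{r}} = \overline{F}^T \overline{F}\,\underline{F}^T \qquad (88)$$

where $\overline{F} = [\overline{f}_1 \ldots \overline{f}_m]^T = V^T \overline{W} \in R^{m \times p}$ and $\underline{F} = [\underline{f}_1 \ldots \underline{f}_r]^T = [D_1, 0] Q^{-1} \underline{W} \in R^{r \times p}$. Manipulating the above expressions we obtain

$$(87) \;\Rightarrow\; \Lambda_{\tilde{r}}^T (\Lambda_{\tilde{r}} \underline{F}) \overline{F}^T = \Lambda_{\tilde{r}}^T \overline{F}\,\underline{F}^T \underline{F}\,\overline{F}^T = \underline{F}\,\overline{F}^T \overline{F}\,\underline{F}^T \underline{F}\,\overline{F}^T$$

$$(88) \;\Rightarrow\; \underline{F} (\overline{F}^T \Lambda_{\tilde{r}}) \Lambda_{\tilde{r}}^T = \underline{F}\,\overline{F}^T \overline{F}\,\underline{F}^T \Lambda_{\tilde{r}}^T = \underline{F}\,\overline{F}^T \overline{F}\,\underline{F}^T \underline{F}\,\overline{F}^T$$

therefore

$$\Lambda_{\tilde{r}}^T \Lambda_{\tilde{r}}\, \underline{F}\,\overline{F}^T = \underline{F}\,\overline{F}^T \Lambda_{\tilde{r}} \Lambda_{\tilde{r}}^T$$

Using assumption (A) we conclude that the matrix $\Delta = \overline{F}\,\underline{F}^T \in R^{m \times r}$ is diagonal, i.e.,

$$\Delta = \overline{F}\,\underline{F}^T = V^T \overline{W}\,\underline{W}^T Q^{-T} \begin{bmatrix} D_1 \\ 0 \end{bmatrix} = \mathrm{diag}\{l_1, \ldots, l_{\tilde{r}}\} \qquad (89)$$

In addition, from (87) (or (88)) we obtain that for each $i$, either $l_i = 0$ or $l_i = \lambda_i$. Furthermore, from these equations and (89) we get $l_i = 0 \Leftrightarrow \overline{f}_i = 0 \Leftrightarrow \underline{f}_i = 0$. It follows that there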

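is an ordered set of indices $S = \{i(1), \ldots, i(\rho)\} \neq \{1, \ldots, p\}$ so that $l_j \neq 0$ iff $j \in S$. Since the rank of $\Delta$ has to be at most $p$ we have $\rho \leq p$. From the above observations it follows that there exist two matrices $\overline{M} \in R^{\rho \times p}$, $\underline{M} \in R^{\rho \times p}$ so that

$$\overline{F} = I_S^{m,\rho}\, \overline{M}, \qquad \underline{F} = I_S^{r,\rho}\, \underline{M}$$

where

$$I_S^{m,\rho} = [e_{i(1)}^{(m)} \cdots e_{i(\rho)}^{(m)}], \qquad I_S^{r,\rho} = [e_{i(1)}^{(r)} \cdots e_{i(\rho)}^{(r)}]$$

and $e_i^{(m)}$ ($e_i^{(r)}$) is the $i$th column of the $m \times m$ ($r \times r$) identity matrix. Notice that $\Delta = \overline{F}\,\underline{F}^T = I_S^{m,\rho} \Lambda_S \{I_S^{r,\rho}\}^T$, where $\Lambda_S = \mathrm{diag}\{\lambda_{i(1)}, \ldots, \lambda_{i(\rho)}\} \in R^{\rho \times \rho}$, thus substituting for $\overline{F}$ and $\underline{F}$ we obtain $\overline{M}\,\underline{M}^T = \Lambda_S$. Since $S \neq \{1, \ldots, p\}$ there exists an index $i \notin S$, with $1 \leq i \leq p$. In the sequel we will show that for any critical point $(\overline{W}, \underline{W})$ which is not a global minimum, there is an infinitesimal perturbation of the form

$$\overline{W}' = \overline{W} + \epsilon\, v_i \overline{z}^T, \qquad \underline{W}' = \underline{W} + \epsilon\, q_i \underline{z}^T$$

for some vectors $\overline{z}$, $\underline{z}$, which achieves less error than $(\overline{W}, \underline{W})$; in other words the quantity

$$E(\overline{W}', \underline{W}') - E(\overline{W}, \underline{W}) = \|\Lambda_{\tilde{r}} - \Delta'\|_F^2 - \|\Lambda_{\tilde{r}} - \Delta\|_F^2 \qquad (90)$$

with $\Delta' = V^T \overline{W}'\,\underline{W}'^T Q^{-T} [D_1, 0]^T$, is negative, thus no critical point can be a local minimum. We will treat the two cases $\rho = p$ and $\rho < p$ separately.

First assume $\rho = p$, and let $\overline{z} = \overline{M}^T e_p^{(p)}$ and $\underline{z} = d_i^{-1} \underline{M}^T e_p^{(p)}$, where $d_i$ is the $i$th diagonal element of $D_1$. Then (90) becomes

$$E(\overline{W}', \underline{W}') - E(\overline{W}, \underline{W}) = \big\|\Lambda_{\tilde{r}} - (I_S^{m,p} + \epsilon\, e_i^{(m)} \{e_p^{(p)}\}^T)\, \Lambda_S\, (\{I_S^{r,p}\}^T + \epsilon\, e_p^{(p)} \{e_i^{(r)}\}^T)\big\|_F^2 - \|\Lambda_{\tilde{r}} - \Delta\|_F^2$$

$$= \big\|\Lambda_{\tilde{r}} - \Delta - \epsilon \lambda_{i(p)} I_S^{m,p} e_p^{(p)} \{e_i^{(r)}\}^T - \epsilon \lambda_{i(p)} e_i^{(m)} \{e_p^{(p)}\}^T \{I_S^{r,p}\}^T - \epsilon^2 \lambda_{i(p)} e_i^{(m)} \{e_p^{(p)}\}^T e_p^{(p)} \{e_i^{(r)}\}^T\big\|_F^2 - \|\Lambda_{\tilde{r}} - \Delta\|_F^2$$

Since for any matrix $P$, $\|P\|_F^2 = \mathrm{trace}\{PP^T\}$, we obtain

$$E(\overline{W}', \underline{W}') - E(\overline{W}, \underline{W}) = 2\epsilon^2 \lambda_{i(p)} (\lambda_{i(p)} - \lambda_i) + O(\epsilon^3) \qquad (91)$$

Notice that

$$\{I_S^{r,p}\}^T (\Lambda_{\tilde{r}} - \Delta)^T = 0,$$

and also

$$\mathrm{trace}\big\{e_i^{(m)} \{e_p^{(p)}\}^T e_p^{(p)} \{e_i^{(r)}\}^T (\Lambda_{\tilde{r}} - \Delta)^T\big\} = \lambda_i$$

since $i \notin S$. Now, the ordered set $S$ has $p$ elements yet $S \neq \{1, \ldots, p\}$, so $i \leq p < i(p)$, therefore $\lambda_{i(p)} < \lambda_i$ and $\lambda_{i(p)} - \lambda_i < 0$. As $\epsilon \to 0$ the first term in (91) dominates over $O(\epsilon^3)$, thus

$$E(\overline{W}', \underline{W}') < E(\overline{W}, \underline{W}).$$

On the other hand, if $\rho < p$ the matrices $\overline{M}$, $\underline{M}$ have more columns than rows. Set $\overline{z}$ equal to any non-zero vector such that $\underline{M}\,\overline{z} = 0$. Then $\overline{z}$ is linearly independent with respect to the rows of $\overline{M}$; indeed, if this was not true then there would exist a non-zero vector $g$ such that $\overline{z}^T = g^T \overline{M} \Rightarrow 0 = \overline{z}^T \underline{M}^T = g^T \overline{M}\,\underline{M}^T = g^T \Lambda_S \Rightarrow g = 0$, thus contradiction. It follows that $\overline{z} \not\perp \mathrm{Null}\{\overline{M}\}$, hence there exists a non-zero vector $\underline{z} \in \mathrm{Null}\{\overline{M}\}$ so that $\overline{z}^T \underline{z} > 0$. Then (90) becomes

$$E(\overline{W}', \underline{W}') - E(\overline{W}, \underline{W}) = \big\|\Lambda_{\tilde{r}} - \Delta - \epsilon^2 d_i\, \overline{z}^T \underline{z}\; e_i^{(m)} \{e_i^{(r)}\}^T\big\|_F^2 - \|\Lambda_{\tilde{r}} - \Delta\|_F^2 = -2 \epsilon^2 d_i \lambda_i\, \overline{z}^T \underline{z} + O(\epsilon^4) < 0.$$

REFERENCES

[1] P. Baldi and K. Hornik, "Neural networks and principal component analysis: Learning from examples without local minima," Neural Networks, vol. 2, pp. 53-58, 1989.

[2] A. Benveniste and G. Ruget, "A measure of the tracking capability of recursive stochastic algorithms with constant gains," IEEE Trans. Automat. Contr., vol. 27, pp. 639-649, 1982.

[3] H. Bourlard and Y. Kamp, "Auto-association by multilayer perceptrons and singular value decomposition," Biological Cybernetics, vol. 59, pp. 291-294, 1988.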

[4] H. Chen and R. Liu, "An alternative proof of convergence for Kung-Diamantaras APEX algorithm," in Neural Networks for Signal Processing, B. H. Juang, S. Y. Kung, and C. A. Kamm, Eds. New York: IEEE, 1991, pp. 40-49.

[5] K. I. Diamantaras, "Principal Component Learning Networks and Applications," Ph.D. dissertation, Princeton University, 1992.

[6] G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd ed. Baltimore, MD: Johns Hopkins University Press, 1989.

[7] S. Y. Kung, K. S. Arun, and D. V. Bhaskar Rao, "State-space and singular-value decomposition-based approximation methods for the harmonic retrieval problem," J. Opt. Soc. Am., vol. 73, no. 12, pp. 1799-1811, Dec. 1983.

[8] S. Y. Kung and K. I. Diamantaras, "A neural network learning algorithm for adaptive principal component extraction (APEX)," in Proc. ICASSP, Albuquerque, NM, Apr. 1990, pp. 861-864.

[9] S. Y. Kung and K. I. Diamantaras, "Neural networks for extracting unsymmetric principal components," in Neural Networks for Signal Processing, B. H. Juang, S. Y. Kung, and C. A. Kamm, Eds. New York: IEEE, 1991, pp. 50-59.

[10] H. J. Kushner and D. S. Clark, Stochastic Approximation Methods for Constrained and Unconstrained Systems. New York: Springer-Verlag, 1978.

[11] H. J. Kushner and H. Huang, "Asymptotic properties of stochastic approximations with constant coefficients," SIAM J. of Control and Optimization, vol. 19, pp. 87-105, 1981.

[12] L. Ljung, "Analysis of recursive stochastic algorithms," IEEE Trans. Automat. Contr., vol. 22, no. 4, pp. 551-575, 1977.

[13] C. F. Van Loan, "Generalizing the singular value decomposition," SIAM J. Numerical Analysis, vol. 13, no. 1, pp. 76-83, 1976.

[14] E. Oja, "A simplified neuron model as a principal component analyzer," J. Math. Biology, vol. 15, pp. 267-273, 1982.

[15] B. N. Parlett, The Symmetric Eigenvalue Problem. Englewood Cliffs, NJ: Prentice Hall, 1980.

[16] R. Roy, A. Paulraj, and T. Kailath, "ESPRIT-A subspace rotation approach to estimation of parameters of cisoids in noise," IEEE Trans. Acoust., Speech, Signal Processing, vol. 34, no. 5, pp. 1340-1342, 1986.

[17] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing (PDP): Exploration in the Microstructure of Cognition, vol. 1, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Eds. Cambridge, MA: MIT Press, 1986, ch. 8, pp. 318-362.

[18] T. D. Sanger, "An optimality principle for unsupervised learning," in Advances in Neural Information Processing Systems, vol. 1, D. S. Touretzky, Ed. Palo Alto, CA: Morgan Kaufmann, 1989, pp. 11-19.

[19] T. D. Sanger, "Optimal unsupervised learning in a single-layer linear feedforward neural network," Neural Networks, vol. 2, no. 6, pp. 459-473, 1989.

[20] L. L. Scharf, "The SVD and reduced-rank signal processing," in SVD and Signal Processing II: Algorithms, Analysis and Applications, R. J. Vaccaro, Ed. Elsevier Science Publishers, B.V., 1991, pp. 3-31.

[21] G. Strang, Linear Algebra and its Applications, 2nd ed. New York: Academic Press, 1980.

Konstantinos I. Diamantaras received the diploma from the National Technical University of Athens, Greece, in 1987, and the M.A. and Ph.D. degrees from Princeton University in 1990 and 1992, respectively, all in electrical engineering. His thesis topic concerned the relationship between neural networks and statistical techniques for feature extraction and data compression, as well as issues on image coding.

He is currently working as a Research Associate with Siemens Corporate Research, Princeton, NJ, in designing parallel architectures for image processing. His research interests include neural networks, signal and image processing, parallel processing, and computer vision.

Dr. Diamantaras is a member of the International Neural Network Society (INNS), and the Technical Chamber of Greece.

Sun-Yuan Kung received the Ph.D. degree in electrical engineering from Stanford University.

In 1974, he was an Associate Engineer of Amdahl Corporation, Sunnyvale, CA. From 1977 to 1986, he was a Professor of Electrical Engineering-Systems of the University of Southern California, Los Angeles. In 1984, he was a Visiting Professor of Stanford University and Delft University of Technology. Since 1987, he has been a Professor of the Department of Electrical Engineering of Princeton University. His research interests include spectrum estimations, digital signal/image processing, VLSI array processors, and neural networks.

Since 1990, Dr. Kung has served as Editor-in-Chief of the Journal of VLSI Signal Processing. He is now an Associate Editor of the IEEE TRANSACTIONS ON NEURAL NETWORKS. He was appointed as the first Associate Editor in the VLSI area (1984) and as the first Associate Editor in the Neural Network area (1991) in the IEEE TRANSACTIONS ON SIGNAL PROCESSING. He has served as a member of the IEEE Signal Processing Society's Administration Committee (1989-1991). He is presently serving on Technical Committees on VLSI Signal Processing and on Neural Networks. He has served as a founding member and General Chairman of various IEEE conferences, including IEEE Workshops on VLSI Signal Processing in 1982 and 1986 (both in Los Angeles, CA), International Conference on Application Specific Array Processors in 1990 (Princeton, NJ) and 1991 (Barcelona, Spain), and IEEE Workshops on Neural Networks and Signal Processing in 1991 (Princeton) and 1991 (Copenhagen, Denmark). He was the Keynote Speaker for the First International Conference on Systolic Arrays, Oxford, U.K., in 1986, and the International Symposium on Computer Architecture and DSP, Hong Kong, 1989. He has authored more than 250 technical papers and two textbooks: VLSI Array Processors (Prentice Hall, 1988), and Digital Neural Networks (Prentice Hall, 1993).