
JOURNAL OF MATHEMATICAL ANALYSIS AND APPLICATIONS 91, 102-111 (1983)

An Analysis of Convergence for a Learning Version of the Subspace Method

ERKKI OJA AND JUHA KARHUNEN

Helsinki University of Technology, Department of Technical Physics, SF-02150 Espoo 15, Finland

Submitted by K. J. Astrom

The learning subspace method of pattern recognition has been earlier introduced by Kohonen et al. in a speech recognition application, where the phonemes to be classified are given as spectral representations. In that method, the class subspaces are updated recursively using special rotation matrices, which depend on the training vectors entering one at a time. Here the learning algorithm based on these operators is represented in a general mathematical form, and almost sure convergence is shown to a given criterion that is a function of the statistics of the training set as well as of a set of nonrandom but free parameters. The proof employs current techniques in stochastic approximation theory. For illustration, the resulting classification criterion is then applied to a concrete pattern recognition situation with suitably chosen parameter values.

1. INTRODUCTION

In the fields of pattern recognition and feature extraction, there exists a rich literature on reducing classification spaces to smaller dimensional ones; see, e.g., the books by Duda and Hart [4] and Young and Calvert [26]. The Karhunen-Loève expansion has been used for such purposes by many authors, e.g., [2, 6, 8, 15, 21]. On the basis of the Karhunen-Loève expansion, Watanabe [23, 24] introduced the subspace method of pattern recognition. In the basic version of this method, which Watanabe called CLAFIC, the vectors of each class are expanded in terms of the eigenvectors of the respective class autocorrelation matrix. If only a subset of eigenvectors is used, corresponding to the largest eigenvalues, then a useful data compression takes place. Geometrically, this corresponds to projecting each data vector onto a subspace spanned by these first eigenvectors. Classification can then be based on the dispersion of information over the components of the projection, or simply on the mutual lengths of the different projection vectors [23].

In the subspace method, the idea is implicit that the data have an approximately linear structure, so that each class can be represented as a subspace of relatively small dimension m < n rather than as an n-dimensional domain in the n-dimensional pattern space. For spectral data, this is a good approximation and has been used, e.g., in signal detection (see [16]).

Recently, Kohonen et al. [9, 10] introduced a learning version of the subspace classifier, the LEARNING SUBSPACE METHOD (LSM). The correlations and probability distributions of the data vectors are often such that classification based on a priori first- and second-order statistics does not necessarily yield good results. Also, sometimes the statistics are unknown and cannot be estimated because of nonstationarities; then a learning approach is the only possible one. In general, the use of learning or training in various algorithms of pattern recognition is a well-established practice followed by many authors, e.g., [7, 13, 14, 22]. Usually the convergence of the learning rules turns out to be a problem, which has commonly been solved using Dvoretzky's results [5].

The basic procedure in Kohonen's LSM method is to rotate the subspaces (equivalently, their basis vectors) using transformation matrices of the form

R_k = (I + \mu x_k x_k^T),    (1)

where x_k is the prototype data vector at step k and \mu is a numerical constant. Positive \mu tends to increase the projection length of x_k on the rotated subspace, while negative \mu tends to decrease it. The method has been applied so far to the classification of phonemes represented in spectral form, and an appreciable improvement over the conventional methods tested has been achieved [10, 18].
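As a quick numerical illustration of the rotation (1) (not taken from the paper; the dimensions, seed, and step sizes below are arbitrary), the following NumPy sketch applies R_k to a basis of a random subspace, re-orthonormalizes, and compares the projection length of x_k before and after the rotation for positive and negative \mu.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                              # prototype vector x_k (normalized for convenience)
U = np.linalg.qr(rng.standard_normal((n, 2)))[0]    # orthonormal basis of the current subspace

def proj_len(B, v):
    """Length of the orthogonal projection of v on span(B); B has orthonormal columns."""
    return np.linalg.norm(B.T @ v)

for mu in (+0.5, -0.5):
    R = np.eye(n) + mu * np.outer(x, x)             # rotation operator (1)
    U_rot = np.linalg.qr(R @ U)[0]                  # rotated and re-orthonormalized basis
    print(mu, proj_len(U, x), proj_len(U_rot, x))
# with this seed, the projection length grows for mu = +0.5 and shrinks for mu = -0.5
```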

Due to the decision-directed approach taken in the LSM algorithm, it is difficult to verify theoretically the convergence properties of the algorithm. Experimental results on the choice of \mu in (1) and on the different training strategies have been obtained [9]. If the rotations, however, are not based on the intermediate classifications but the parameters are fixed from the start, then a study of the convergence is easier.

The main emphasis of the present paper is to present the learning algorithm based on the rotations (1) in a general mathematical form, and to show almost sure convergence to a given criterion depending on a set of nonrandom but free parameters. For illustration, this criterion is then applied to a concrete pattern recognition situation with suitably chosen parameter values.

2. THE TRAINING ALGORITHM

Let there be K classes \omega^{(1)},..., \omega^{(K)}, each of which is represented by a subspace \mathscr{L}^{(i)}. The classification criterion used in the subspace method is the following:

If \|x\|_{\mathscr{L}^{(i)}} = \max_j \|x\|_{\mathscr{L}^{(j)}}, then classify x in \omega^{(i)}.    (2)


Here \|x\|_{\mathscr{L}^{(i)}} is the (Euclidean) norm of the orthogonal projection of x on the subspace \mathscr{L}^{(i)}. It should be noted that, once the orthonormal bases of all the subspaces have been computed in one way or another, criterion (2) only requires the inner products of the vector x with the relatively small number of basis vectors. Computationally, the method is therefore considerably less demanding than some standard methods like the nearest neighbour classifier or the Bayes classifier.
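Given orthonormal basis matrices for the class subspaces, criterion (2) reduces to comparing projection norms computed from inner products with the basis vectors. A minimal NumPy sketch (all names and dimensions illustrative):

```python
import numpy as np

def classify(x, bases):
    """Criterion (2): assign x to the class whose subspace carries the longest
    orthogonal projection of x; bases[i] is an n x p(i) matrix with orthonormal columns."""
    proj_norms = [np.linalg.norm(U.T @ x) for U in bases]   # ||x||_{L^{(i)}}, inner products only
    return int(np.argmax(proj_norms))

# example: two random 2-dimensional subspaces of R^5
rng = np.random.default_rng(0)
bases = [np.linalg.qr(rng.standard_normal((5, 2)))[0] for _ in range(2)]
print(classify(rng.standard_normal(5), bases))
```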

Describe the different classes of prototypes by K stochastic processes \{x_k^{(i)}\}, i = 1,..., K, k \geq 1. Each x_k^{(i)} \in \mathbb{R}^n. For further use, define the correlation matrices

C^{(i)} = E(x_k^{(i)} x_k^{(i)T}),    all k.    (3)

Throughout the paper it will be assumed that the processes \{x_k^{(i)}\} are wide-sense stationary and independent.

Now let U_{k-1}^{(1)},..., U_{k-1}^{(K)}, with U_{k-1}^{(i)} \in \mathbb{R}^{n \times p^{(i)}}, be a set of matrices, each having orthonormal columns, such that

\mathscr{R}(U_{k-1}^{(i)}) = \mathscr{L}_{k-1}^{(i)},    i = 1,..., K,    (4)

where \mathscr{R}(\cdot) denotes the range space of a matrix. The columns of U_{k-1}^{(i)} thus comprise an orthonormal basis of the subspace \mathscr{L}_{k-1}^{(i)}. Then the general algorithm of the learning version of the subspace method, generalized from Kohonen's original algorithm, can be presented as follows: Let the input prototype vector x of the algorithm at step k belong to \omega^{(i)}, i.e., x = x_k^{(i)}.

Then

\tilde{U}_k^{(i)} = (I + \mu_k^{(i,i)} x_k^{(i)} x_k^{(i)T}) U_{k-1}^{(i)},    U_k^{(i)} = \tilde{U}_k^{(i)} V_k^{(i)};

\tilde{U}_k^{(j)} = (I - \mu_k^{(j,i)} x_k^{(i)} x_k^{(i)T}) U_{k-1}^{(j)},    U_k^{(j)} = \tilde{U}_k^{(j)} V_k^{(j)},    j \neq i,    (5)

where the V_k^{(j)} are matrices performing Gram-Schmidt orthonormalization of the generally non-orthonormal columns of the matrices \tilde{U}_k^{(j)}.

The scalar parameters \mu_k^{(j,i)} > 0 determine the amount of rotation; the subspace \mathscr{L}_k^{(i)} is rotated towards x_k^{(i)}, the other subspaces away from it, by amounts depending both on the index i of the correct class of the training vector at that step and on the index j of the subspace to be rotated. With these parameters, it is possible to regulate the angles between the different subspaces selectively and so diminish the amount of misclassifications. The parameters \mu_k^{(j,i)} could be made to depend on the tentative classification results at step k; in Kohonen's algorithm [9] all the \mu_k^{(j,i)} are zero if the training vector x_k^{(i)} is correctly classified into \omega^{(i)} by the subspaces \mathscr{L}_{k-1}^{(j)} according to criterion (2). On the other hand, if the classification is erroneous and x_k^{(i)} is classified into \omega^{(r)}, r \neq i, then only \mu_k^{(i,i)} and \mu_k^{(r,i)} are nonzero and equal, and the others are zero. So only those subspaces are changed which correspond to the correct class and to the class that gave the erroneous classification. In the present paper, however, we only consider the case in which the rotation parameters are independent of the intermediate results; in fact, they form nonrandom null sequences.
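A minimal NumPy sketch of one step of (5) in the setting studied here (fixed, nonrandom rotation parameters, no decision-directed logic); QR factorization stands in for the Gram-Schmidt orthonormalization V_k^{(j)}, and all names are illustrative.

```python
import numpy as np

def lsm_step(U, x, i, mu):
    """One step of algorithm (5).

    U  : list of K basis matrices; U[j] is n x p(j) with orthonormal columns
    x  : training vector x_k, known to belong to class i
    i  : index of the correct class of x
    mu : K x K array of rotation parameters, mu[j, i] = mu_k^{(j,i)} > 0
    """
    n = len(x)
    xxT = np.outer(x, x)
    U_new = []
    for j, Uj in enumerate(U):
        sign = 1.0 if j == i else -1.0        # rotate class i towards x, the others away from it
        U_tilde = (np.eye(n) + sign * mu[j, i] * xxT) @ Uj
        Q, _ = np.linalg.qr(U_tilde)          # orthonormalization, playing the role of V_k^{(j)}
        U_new.append(Q)
    return U_new
```

QR produces an orthonormal basis of the same column space as Gram-Schmidt, which is all that matters here, since only the subspaces \mathscr{L}_k^{(j)} enter criterion (2).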

3. ALMOST SURE CONVERGENCE OF THE ALGORITHM

In this section, the assumptions concerning the parameters of (5) are made precise and convergence is established to an explicitly expressible limit. The proof involves recent techniques in stochastic approximation theory.

Assume the following:

(A1) Let \mu_k^{(j,i)} = \theta^{(j,i)} \mu_k be nonrandom numbers such that \theta^{(j,i)} is independent of k and \mu_k > 0, \sum_k \mu_k = \infty, \sum_k \mu_k^2 < \infty.

(A2) Let x_k^{(i)} be statistically independent of x_r^{(j)} for r < k and all i and j in \{1,..., K\}. Let x_k^{(i)} be almost surely bounded for all i and k. Let C^{(i)} of Eq. (3) be independent of k.

(A3) Let the choice of the class i of the input vector x_k^{(i)} take place independently of k with probabilities \pi^{(i)}, \pi^{(i)} > 0, \sum_i \pi^{(i)} = 1. Let this choice be independent of the vectors x_k^{(i)} and the basis matrices U_{k-1}^{(j)}.

(A4) Let the numbers \theta^{(i,j)} and the numbers p^{(i)} = \mathrm{rk}[U_k^{(i)}] be such that in the matrix

\bar{C}^{(i)} = \theta^{(i,i)} \pi^{(i)} C^{(i)} - \sum_{j \neq i} \pi^{(j)} \theta^{(i,j)} C^{(j)}    (6)

the p^{(i)} largest eigenvalues are distinct.

(A5) Let \mathscr{H}^{(i)} be the subspace spanned by the eigenvectors of \bar{C}^{(i)} corresponding to the p^{(i)} largest eigenvalues. There is a number \varepsilon > 0 such that in (5), the event \{\|U_k^{(i)T} z\| > \varepsilon for all z \in \mathscr{H}^{(i)} with \|z\| = 1\} occurs infinitely often with probability one.

The main result of the present paper is

THEOREM 1. Assume (A1)-(A5). Then the columns of the matrix U_k^{(i)} tend almost surely to an orthonormal basis of the subspace \mathscr{H}^{(i)}.
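Theorem 1 identifies the limit explicitly: the columns of U_k^{(i)} converge to an orthonormal basis of the dominant p^{(i)}-dimensional eigenspace of \bar{C}^{(i)} in (6). The NumPy sketch below (illustrative names, not from the paper) computes that limit subspace directly from class correlation matrices, priors \pi^{(j)}, and weights \theta^{(i,j)}.

```python
import numpy as np

def limit_subspace(C, pi, theta, i, p):
    """Orthonormal basis of H^(i): the p-dimensional dominant eigenspace of
    C_bar^(i) = theta[i,i]*pi[i]*C[i] - sum_{j != i} pi[j]*theta[i,j]*C[j]   (cf. Eq. (6))."""
    C_bar = theta[i, i] * pi[i] * C[i]
    for j in range(len(C)):
        if j != i:
            C_bar = C_bar - pi[j] * theta[i, j] * C[j]
    w, V = np.linalg.eigh(C_bar)                     # eigenvalues in ascending order
    order = np.argsort(w)[::-1]                      # largest eigenvalues first
    return V[:, order[:p]]                           # eigenvectors spanning H^(i)
```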

To prove the theorem, we first give three auxiliary lemmas. Without loss of generality, consider in (5) the first class \omega^{(1)} only. Under the independence assumption (A3), (5) implies for all k

\tilde{U}_k^{(1)} = (I + \mu_k x_k^{(1)} x_k^{(1)T}) U_{k-1}^{(1)}    w.p. \pi^{(1)},

\tilde{U}_k^{(1)} = (I - \theta^{(1,j)} \mu_k x_k^{(j)} x_k^{(j)T}) U_{k-1}^{(1)}    w.p. \pi^{(j)}, j > 1,    (7)

U_k^{(1)} = \tilde{U}_k^{(1)} V_k^{(1)}.


In (7), the parameter \theta^{(1,1)} has been absorbed in \mu_k so that \theta^{(1,1)} = 1. The other \theta^{(1,j)} must be redefined likewise.

We first present the following result, which is easy to prove, e.g., by matrix pseudoinverse techniques (see [1]):

LEMMA 1. Let P \in \mathbb{R}^{n \times n} be an orthogonal projection matrix with range space \mathscr{R}(P), let x \in \mathbb{R}^n be a vector, and let \mu \in \mathbb{R} be a scalar with \mu \neq -\|x\|^{-2}. Denote

\mathscr{L}' = \{ y \mid y = (I + \mu x x^T) z, z \in \mathscr{R}(P) \}.

The orthogonal projection matrix on the subspace \mathscr{L}' is then

P' = P + [\mu/(1 + \mu^2 \|x\|^2 + 2\mu(x^T P x))]
    \times \{P x x^T + x x^T P - 2 P x x^T P + \mu[(x^T P x) x x^T - \|x\|^2 P x x^T P]\}.    (8)

Assume now that |\mu| in (8) is sufficiently small to be able to write

\mu/(1 + \mu^2 \|x\|^2 + 2\mu(x^T P x)) = \mu[1 - (\mu^2 \|x\|^2 + 2\mu(x^T P x)) + O(\mu^2)]

with O(\mu^2) tending to zero as \mu tends to zero. Substituting this in (8) and combining terms that are of degree 2 or higher in \mu yields

P' = P + \mu[P x x^T + x x^T P - 2 P x x^T P + O(\mu)].    (9)
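A quick numerical sanity check of the first-order expansion (9), not part of the proof: the exact orthogonal projector onto (I + \mu x x^T)\mathscr{R}(P), computed here via QR, should differ from P + \mu(P x x^T + x x^T P - 2 P x x^T P) by a term of order \mu^2. All constants below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 3
U = np.linalg.qr(rng.standard_normal((n, p)))[0]
P = U @ U.T                                        # orthogonal projector with range R(P)
x = rng.standard_normal(n)
xxT = np.outer(x, x)

for mu in (1e-1, 1e-2, 1e-3):
    Q, _ = np.linalg.qr((np.eye(n) + mu * xxT) @ U)
    P_exact = Q @ Q.T                              # exact projector on the rotated subspace
    P_lin = P + mu * (P @ xxT + xxT @ P - 2 * P @ xxT @ P)
    print(mu, np.linalg.norm(P_exact - P_lin))     # error shrinks roughly like mu**2
```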

In algorithm (7), denote by P_k = U_k^{(1)} U_k^{(1)T} the orthogonal projection matrix on the subspace \mathscr{L}_k^{(1)}. Due to assumption (A1), we may safely assume that \mu_k is small. We obtain

P_k = P_{k-1} + \mu_k[H_k + O(\mu_k)]    (10)

with

H_k = P_{k-1} x_k^{(1)} x_k^{(1)T} + x_k^{(1)} x_k^{(1)T} P_{k-1} - 2 P_{k-1} x_k^{(1)} x_k^{(1)T} P_{k-1}

with probability \pi^{(1)};

H_k = -\theta^{(1,j)} [P_{k-1} x_k^{(j)} x_k^{(j)T} + x_k^{(j)} x_k^{(j)T} P_{k-1} - 2 P_{k-1} x_k^{(j)} x_k^{(j)T} P_{k-1}]    (11)

with probability \pi^{(j)}, j = 2,..., K.


Define \bar{H}(P_{k-1}) = E(H_k \mid P_{k-1}); \bar{H} depends on k through P_{k-1} only, due to (A2). Also, \bar{H} is easily formed:

\bar{H}(P_{k-1}) = \pi^{(1)}[P_{k-1} C^{(1)} + C^{(1)} P_{k-1} - 2 P_{k-1} C^{(1)} P_{k-1}]
    - \sum_{j>1} \pi^{(j)} \theta^{(1,j)} [P_{k-1} C^{(j)} + C^{(j)} P_{k-1} - 2 P_{k-1} C^{(j)} P_{k-1}]
    = P_{k-1} \bar{C} + \bar{C} P_{k-1} - 2 P_{k-1} \bar{C} P_{k-1}    (12)

with (cf. Eq. (6))

\bar{C} = \pi^{(1)} C^{(1)} - \sum_{j>1} \pi^{(j)} \theta^{(1,j)} C^{(j)}.    (13)
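The averaged direction (12) can be checked by simulation: drawing the class of the incoming vector with probabilities \pi^{(j)} and the vector from the corresponding class, the sample mean of H_k should approach P\bar{C} + \bar{C}P - 2P\bar{C}P. A rough NumPy sketch, with illustrative Gaussian class models that are not from the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
n, K, p = 4, 3, 2
pi = np.array([0.5, 0.3, 0.2])                      # class probabilities pi^{(j)}
theta = np.array([1.0, 0.7, 0.4])                   # theta^{(1,j)}, with theta^{(1,1)} = 1
A = [rng.standard_normal((n, n)) for _ in range(K)]
C = [a @ a.T for a in A]                            # class correlation matrices C^{(j)} = E(x x^T)
U = np.linalg.qr(rng.standard_normal((n, p)))[0]
P = U @ U.T                                         # a fixed projector playing the role of P_{k-1}

def H_sample():
    j = rng.choice(K, p=pi)                         # class of the incoming vector
    x = A[j] @ rng.standard_normal(n)               # a vector with E(x x^T) = C^{(j)}
    M = P @ np.outer(x, x) + np.outer(x, x) @ P - 2 * P @ np.outer(x, x) @ P
    return M if j == 0 else -theta[j] * M           # H_k as in (11)

H_mc = sum(H_sample() for _ in range(100000)) / 100000

C_bar = pi[0] * C[0] - sum(pi[j] * theta[j] * C[j] for j in range(1, K))    # Eq. (13)
H_exact = P @ C_bar + C_bar @ P - 2 * P @ C_bar @ P                         # Eq. (12)
print(np.linalg.norm(H_mc - H_exact) / np.linalg.norm(H_exact))             # small relative Monte Carlo error
```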

Now (10) can be written as

P_k = P_{k-1} + \mu_k[\bar{H}(P_{k-1}) + A_k + B_k]    (14)

with

A_k = H_k - \bar{H}(P_{k-1})    (15)

and

B_k = O(\mu_k).    (16)

Due to the assumptions, both A_k and B_k are a.s. bounded random matrices such that

E(A_k \mid P_{k-1}) = 0    for all k    (17)

and

\lim_{k \to \infty} B_k = 0    a.s.    (18)

LEMMA 2. Denote

\mathscr{P} = \{P \in \mathbb{R}^{n \times n} \mid P is symmetric and idempotent and has rank equal to p^{(1)} = \mathrm{rk}[U_k^{(1)}]\}.    (19)

Let F be a locally asymptotically stable solution of

\dot{P} = P\bar{C} + \bar{C}P - 2P\bar{C}P    (20)

in the following sense: there is an \varepsilon > 0 such that if \|P(t_0) - F\| < \varepsilon for some t_0 with P(t_0) \in \mathscr{P}, F \in \mathscr{P}, then \lim_{t \to \infty} P(t) = F. Denote the domain of attraction of F by \mathscr{D}(F). If there is a compact set \mathscr{A} \subset \mathscr{D}(F) such that P_k \in \mathscr{A} infinitely often, then a.s. in algorithm (14)

\lim_{k \to \infty} P_k = F.

Proof. We rely on Theorem 2.3.1 of Kushner and Clark [11]. It is easy to show that assumptions A.2.2.1-A.2.2.4 of that theorem are implied by our (A1)-(A3). The boundedness of P_k follows from P_k \in \mathscr{P} for each k.

Equation (20) is a symmetric matrix Riccati differential equation (see [17]) which can be solved in closed form. Knowing the solution, the asymptotically stable solutions can be found. We state without proof

LEMMA 3. In (20), let P(t_0) \in \mathscr{P}. Then the solution on (-\infty, \infty) is

P(t) = e^{\bar{C}(t-t_0)} V (V^T e^{2\bar{C}(t-t_0)} V)^{-1} V^T e^{\bar{C}(t-t_0)},    (21)

with V any n \times p^{(1)} matrix with V V^T = P(t_0), V^T V = I. P(t) is in \mathscr{P} for all t. Let F be the orthogonal projection matrix on the subspace \mathscr{H}^{(1)} defined in (A5). If

\|P(t_0) - F\|^2 = \max_{\|x\|=1} \{x^T (P(t_0) - F)^2 x\} < 1,    (22)

then P(t) tends to F as t \to \infty.
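A small numerical check of Lemma 3 as reconstructed above (the eigenvalue spectrum, step size, and time horizon below are arbitrary choices): the closed form (21) should agree with a direct Euler integration of (20), and should approach the projector F onto the dominant eigenspace of \bar{C} as t grows. Matrix exponentials of the symmetric \bar{C} are formed from its eigendecomposition.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 2
W = np.linalg.qr(rng.standard_normal((n, n)))[0]        # random orthogonal eigenvector matrix
lam = np.array([1.0, 0.8, 0.3, 0.1, -0.2, -0.5])        # chosen eigenvalues; p largest are distinct (A4)
C_bar = W @ np.diag(lam) @ W.T                          # symmetric matrix playing the role of C_bar

def expC(t):                                            # e^{C_bar * t} via the eigendecomposition
    return W @ np.diag(np.exp(lam * t)) @ W.T

V = np.linalg.qr(rng.standard_normal((n, p)))[0]        # V V^T = P(t0), V^T V = I, with t0 = 0

def P_closed(t):                                        # closed-form solution (21)
    E = expC(t)
    return E @ V @ np.linalg.inv(V.T @ expC(2.0 * t) @ V) @ V.T @ E

F = W[:, :p] @ W[:, :p].T                               # projector on the dominant eigenspace of C_bar

P, dt = V @ V.T, 1e-3                                   # crude Euler integration of (20) up to t = 5
for _ in range(5000):
    P = P + dt * (P @ C_bar + C_bar @ P - 2.0 * P @ C_bar @ P)

print(np.linalg.norm(P - P_closed(5.0)))                # small, up to the Euler discretization error
print(np.linalg.norm(P_closed(40.0) - F))               # small: P(t) approaches F as t grows
```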

Lemmas 2 and 3 imply the convergence of (5):

Proof of Theorem 1. Without loss of generality, let i = 1. The convergence of U_k^{(1)} to a matrix whose columns are an orthonormal basis of \mathscr{H}^{(1)} is equivalent to the convergence of P_k = U_k^{(1)} U_k^{(1)T} to the matrix F of Lemma 3. According to Lemma 2, a sufficient condition is the existence of a compact \mathscr{A} \subset \mathscr{D}(F) such that P_k \in \mathscr{A} i.o. It can be shown by standard methods in matrix algebra that the event \|U_k^{(1)T} z\| > \varepsilon defined in (A5) implies \|P_k - F\| < 1 - \delta for some positive number \delta. The set of P_k satisfying this is compact and, by (22), is a subset of \mathscr{D}(F). This concludes the proof.
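To make Theorem 1 concrete, the sketch below runs the single-class recursion (7) with \mu_k proportional to 1/k (so that (A1) holds) on synthetic Gaussian data and monitors \|P_k - F\|, where F is the projector onto the dominant eigenspace of \bar{C} in (13). The class models, weights, and constants are illustrative, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n, K, p = 6, 3, 2
pi = np.array([0.5, 0.3, 0.2])                          # class probabilities (A3)
theta = np.array([1.0, 0.5, 0.5])                       # theta^{(1,j)}, with theta^{(1,1)} = 1 after rescaling
A = [rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(K)]
C = [a @ a.T for a in A]                                # class correlation matrices C^{(j)}

C_bar = pi[0] * C[0] - sum(pi[j] * theta[j] * C[j] for j in range(1, K))    # Eq. (13)
w, W = np.linalg.eigh(C_bar)
top = np.argsort(w)[::-1][:p]
F = W[:, top] @ W[:, top].T                             # target projector of Lemma 3 / Theorem 1

U = np.linalg.qr(rng.standard_normal((n, p)))[0]        # U_0^{(1)}
for k in range(1, 20001):
    mu_k = 1.0 / (k + 10)                               # nonrandom null sequence: sum = inf, sum of squares < inf
    j = rng.choice(K, p=pi)                             # class of the training vector at step k
    x = A[j] @ rng.standard_normal(n)
    sign = 1.0 if j == 0 else -theta[j]
    U = np.linalg.qr((np.eye(n) + sign * mu_k * np.outer(x, x)) @ U)[0]     # one step of (7)
    if k % 5000 == 0:
        print(k, np.linalg.norm(U @ U.T - F))           # typically decreases towards 0
```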

4. SOME CLASSIFICATION RESULTS

Table I shows classification results for both the method reported here (Method 1) and the Fukunaga-Koontz method [6] (FK). As seen from the table, the performance of the present subspace method is not unfavourable compared with the one using the Fukunaga-Koontz subspaces. In this simulation, the parameters p^{(1)}, p^{(2)}, \pi^{(1)}, \pi^{(2)} as well as the statistics used for generating the pseudorandom test vectors were chosen rather arbitrarily, with no attention paid to improving the performance of one method over the other.

TABLE I

Comparison of the Classification Using Either the Present Method or the Fukunaga-Koontz Subspaces^a

Pattern    Subspace dims      A priori probs       Noise     Sample sizes      % of correct classification
space dim  p^(1)   p^(2)      pi^(1)   pi^(2)      var.      Cl. 1   Cl. 2     Method 1      FK
n                                                  sigma^2

  5         3       3          0.7      0.3        0.60       280     120        81.0        80.5
  5         3       3          0.7      0.3        1.00       280     120        76.5        74.0
 10         5       6          0.3      0.7        0.35        90     210        99.7        Yfl.!
 15         5       3          0.5      0.5        0.30       400     400        98.3        98.0
 15         5       3          0.5      0.5        0.85       400     400        97.0        Yl.8

^a Artificial data, two classes with unequal mean vectors and unequal correlation matrices, plus additive Gaussian noise with variance sigma^2.

The parameters \theta^{(1,2)} and \theta^{(2,1)} were both equal to one. In the case p^{(1)} + p^{(2)} < n this guarantees orthogonal subspaces. When p^{(1)} + p^{(2)} > n, the intersection of the subspaces has no effect on the classification, and the basis vectors orthogonal to this intersection are all mutually orthogonal. There is, however, no reason to assume that orthogonality now yields the best classification accuracy; so some values \theta^{(1,2)} = \theta^{(2,1)} < 1 would most likely yield even better results.

In another experiment, the classification was applied to speech data. The details of the data acquisition system and the data have been given elsewhere [9, 10]. The training set consisted of 533 phonemes in spectral form and the test set, respectively, of 539 phonemes. Each pattern vector had n = 30 elements, and there were in all K = 17 classes.

The application of the method was now based directly on the matrices (6) whose eigenvectors were easily computed, but each stage of the classification could be replaced by an adaptive approach.

To find the parameter values, the eigenvalues and eigenvectors of the class correlation matrices C^{(i)} were computed, and the dimension p^{(i)} = \dim(\mathscr{L}^{(i)}) for each class was based on the cumulative sum of the eigenvalues of C^{(i)}, which were normalized so that the total sum was equal to one. The number p^{(i)} was determined as the number of eigenvalues needed for the cumulative sum to reach a given limit. As a result, the dimensions varied from 3 to 8, depending on how dominant the largest eigenvalues of the correlation matrix were.
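This dimension selection rule — normalize the eigenvalues of C^{(i)} to sum to one and take as many leading eigenvalues as are needed to reach a preset cumulative limit — can be written compactly as below; the limit value 0.9 is only an illustrative choice, not one reported in the paper.

```python
import numpy as np

def subspace_dim(C_i, limit=0.9):
    """Number p^(i) of leading eigenvalues of C^(i) whose normalized cumulative sum reaches `limit`."""
    w = np.sort(np.linalg.eigvalsh(C_i))[::-1]       # eigenvalues, largest first
    cum = np.cumsum(w) / np.sum(w)                   # normalized cumulative sum
    return int(np.searchsorted(cum, limit) + 1)      # smallest p with cum[p-1] >= limit
```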

To find the rotation parameters \mu^{(i,j)}, the following scheme was used: the Karhunen-Loève subspaces of dimensions p^{(i)} were constructed for the classes using the eigenvectors computed earlier. The test data were classified using these subspaces, which corresponds to applying the CLAFIC method. Each \mu^{(i,j)} was then taken equal to the number of vectors, actually belonging to class \omega^{(i)}, that were classified into \omega^{(j)}. So the larger the classification error for a given index pair (i, j), the larger \mu^{(i,j)} and, equivalently, \theta^{(i,j)} in matrix (6).
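In this scheme the weights are read off from a CLAFIC confusion count: \theta^{(i,j)} grows with the number of class-\omega^{(i)} vectors classified into \omega^{(j)}. A hedged NumPy sketch follows; the rescaling of the counts and the unit diagonal are illustrative assumptions, not details given in the paper.

```python
import numpy as np

def rotation_weights(true_labels, clafic_labels, K):
    """Confusion-count weights: counts[i, j] = number of class-i vectors that CLAFIC
    classified into class j; theta^(i,j) is taken proportional to that count."""
    counts = np.zeros((K, K))
    for t, c in zip(true_labels, clafic_labels):
        counts[t, c] += 1
    theta = counts / max(counts.max(), 1.0)          # larger error for (i, j) -> larger theta^(i,j)
    np.fill_diagonal(theta, 1.0)                     # assumed normalization of the diagonal terms
    return theta
```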

The a priori probabilities \pi^{(i)} were estimated from the numbers of prototypes of each class in the training sample.

The classification results were very similar to the ones obtained earlier by Kohonen [9] using the learning subspace method. The test sample contained 539 vectors. For each of them, those three classes were determined, on whose subspaces the projections of the test vector were the largest in magnitude. In 80.7% of the cases, the projection on the correct subspace for that test vector was the largest; so that number gives the classification accuracy. In 91.8% of the cases, the projection on the correct class was among the two largest projections, and in 95.7% of the cases it was among the three largest.

More extensive experiments on the application of the learning subspace method to speech recognition, with up to a thousand-word vocabulary and from one to five individual speakers, have recently been reported [10, 18]. Those references also contain comparisons to some standard pattern recognition algorithms.

ACKNOWLEDGMENTS

The feasibility of learning in the subspace method of pattern recognition was pointed out to the authors by T. Kohonen. The authors are also grateful to him and the members of the speech recognition group, especially H. Riittinen, for providing the data and programs used in the phoneme classification experiment. Thanks are also due to the referee for making useful comments on the original manuscript.

REFERENCES

1. A. Albert, "Regression and the Moore-Penrose Pseudoinverse," Academic Press, New York/London, 1972.

2. Y. T. Chien and K. S. Fu, On the generalized Karhunen-Loève expansion, IEEE Trans. Inform. Theory IT-13 (1967), 518-520.

3. J. L. Doob, "Stochastic Processes," Wiley, New York, 1953.

4. R. O. Duda and P. E. Hart, "Pattern Classification and Scene Analysis," Wiley, New York, 1973.

5. A. Dvoretzky, On stochastic approximation, in "Proc. 3rd Berkeley Symp. on Math. Stat. and Prob.," No. 1, University of California Press, 1956.

6. K. Fukunaga and W. L. G. Koontz, Application of the Karhunen-Loève expansion to feature selection and ordering, IEEE Trans. Comput. C-19 (1970), 311-318.

7. R. L. Kashyap and C. C. Blaydon, Recovery of functions from noisy measurements taken at randomly selected points and its application to pattern classification, Proc. IEEE 54 (1966), 1127-1129.

8. J. Kittler, The subspace approach to pattern recognition, in "Progress in Cybernetics and Systems Research," Vol. III (R. Trappl, G. J. Klir, and L. Ricciardi, Eds.), pp. 92-97, Hemisphere, Washington, 1978.

9. T. Kohonen, G. Németh, K. Bry, M. Jalanko, and H. Riittinen, Spectral classification of phonemes by learning subspaces, in "Proc. 1979 IEEE Int. Conf. on Acoustics, Speech, and Signal Processing," Washington, D.C., pp. 97-100, April 2-4, 1979.

10. T. Kohonen, H. Riittinen, M. Jalanko, E. Reuhkala, and S. Haltsonen, A thousand-word recognition system based on the learning subspace method and redundant hash addressing, in "Proc. 5th Int. Conf. on Pattern Recognition," Miami Beach, Fla., pp. 158-165, Dec. 1-4, 1980.

11. H. Kushner and D. Clark, "Stochastic Approximation Methods for Constrained and Unconstrained Systems," Springer, New York/Heidelberg/Berlin, 1978.

12. L. Ljung, Analysis of recursive stochastic algorithms, IEEE Trans. Automat. Control AC-22 (1977), 551-575.

13. R. W. McLaren, A stochastic automaton model for the synthesis of learning systems, IEEE Trans. System Sci. Cybernet. SSC-2 (1966), 109-114.

14. Z. J. Nicolic and K. S. Fu, A mathematical model of learning in an unknown random environment, Proc. Nat. Electron. Conf. 22 (1966), 601-613.

15. Y. Noguchi, Subspace method and projection operators, in "Proc. 4th Int. Joint Conf. on Pattern Recognition," pp. 449-451, Kyoto, Japan, Nov. 7-10, 1978.

16. N. L. Owsley, Spectral signal set extraction, in "Aspects of Signal Processing" (G. Tacconi, Ed.), pp. 469-475, Academic Press, New York, 1977.

17. W. T. Reid, "Riccati Differential Equations," Academic Press, New York, 1972.

18. H. Riittinen, S. Haltsonen, E. Reuhkala, and M. Jalanko, Experiments on an isolated-word recognition system for multiple speakers, in "Proc. 1981 IEEE Int. Conf. on Acoustics, Speech, and Signal Processing," Atlanta, Ga., pp. 975-978, March 30-April 1, 1981.

19. H. Rutishauser, Computational aspects of F. L. Bauer's simultaneous iteration method, Numer. Math. 13 (1969), 4-13.

20. H. R. Schwarz, "Numerik Symmetrischer Matrizen," Teubner, Stuttgart, 1968.

21. C. W. Therrien, Eigenvalue properties of projection operators and their application to the subspace method of feature extraction, IEEE Trans. Comput. C-24 (1975), 944-948.

22. Ya. Z. Tsypkin, "Foundations of the Theory of Learning Systems" (Z. J. Nicolic, Transl.), Academic Press, New York/London, 1973.

23. S. Watanabe, P. F. Lambert, C. A. Kulikowski, J. L. Buxton, and R. Walker, Evaluation and selection of variables in pattern recognition, in "Computer and Information Sciences" (J. Tou, Ed.), Vol. 2, pp. 91-122, Academic Press, New York, 1967.

24. S. Watanabe and N. Pakvasa, Subspace method in pattern recognition, in "Proc. 1st Int. Joint Conf. on Pattern Recognition," Washington, D.C., pp. 25-32, Oct. 30-Nov. 1, 1973.

25. B. Widrow, J. R. Glover, Jr., J. M. McCool, J. Kaunitz, C. S. Williams, R. H. Hearn, J. R. Zeidler, E. Dong, Jr., and R. C. Goodlin, Adaptive noise cancelling: principles and applications, Proc. IEEE 63 (1975), 1692-1716.

26. T. Y. Young and T. W. Calvert, "Classification, Estimation, and Pattern Recognition," Elsevier, New York, 1974.