IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 8, NO. 1, JANUARY 1997

Neural and Statistical Classifiers—Taxonomy and Two Case Studies

Lasse Holmström, Member, IEEE, Petri Koistinen, Jorma Laaksonen, Student Member, IEEE, and Erkki Oja, Senior Member, IEEE

Abstract—Pattern classification using neural networks and statistical methods is discussed. We give a tutorial overview in which popular classifiers are grouped into distinct categories according to their underlying mathematical principles; also, we assess what makes a classifier neural. The overview is complemented by two case studies using handwritten digit and phoneme data that test the performance of a number of most typical neural-network and statistical classifiers. Four methods of our own are included: reduced kernel discriminant analysis, the learning k-nearest-neighbors classifier, the averaged learning subspace method, and a version of kernel discriminant analysis.

Index Terms—Classification methods, neural networks, comparison, handwritten digit recognition, phoneme classification.

I. INTRODUCTION

THE main application of artificial neural networks is pattern recognition. Understood in the sense of "artificial perception," this involves deep problems in cognitive science and artificial intelligence. Realistic systems for computer vision, even for intrinsically two-dimensional problems, are hybrids of many methodologies like signal processing, classification, and relational matching. It seems that neural networks can be used to an advantage in certain subproblems, especially in feature extraction and classification [1]. These are also problems amenable to statistical techniques, because the data representations are real vectors of measurements or feature values, and it is possible to collect training samples on which regression analysis or probability density estimation become feasible. Thus, in many cases neural techniques and statistical techniques are seen as alternatives, or in fact neural networks are seen as a subset of statistics. This approach has on the one hand led to a fruitful analysis of existing neural networks, on the other hand brought new viewpoints to current statistical methods, and sometimes produced a useful synthesis of the two fields. It must be emphasized, however, that outside some well-defined and focused application fields like multivariate classification, the two fields are different and the long-term goals of neural networks in designing autonomous machine intelligence still remain.

Manuscript received January 3, 1996; revised June 26, 1996. This work was supported by Grant 4172/95 from Tekes in the program "Adaptive and Intelligent Systems Applications."
L. Holmström and P. Koistinen are with the Rolf Nevanlinna Institute, University of Helsinki, Helsinki, Finland.
J. Laaksonen and E. Oja are with the Laboratory of Computer and Information Science, Helsinki University of Technology, Espoo, Finland.
Publisher Item Identifier S 1045-9227(97)00242-7.

In a narrow sense, the pattern recognition problem can be parceled into data acquisition and preprocessing, feature extraction, and classification. In addition to general references such as [2]–[12], questions of pattern recognition are extensively covered in the literature of many specific application areas such as speech and image recognition. The performance of statistical and neural-network methods is usually discussed in connection with feature extraction and classification. Efficient feature extraction is crucial for reliable classification and, if possible, these two subsystems should be matched optimally in the design of a complete pattern recognition system. In this article we concentrate on classification only. Further, only supervised classification using statistical and neural-network methods is considered, thus leaving out both unsupervised classification (clustering [13]) and syntactic pattern recognition [9]. In the statistics literature supervised classification is often called discriminant analysis [7].

Recently, many benchmark and comparison studies have been published on neural and statistical classifiers [14]–[19]. One of the most extensive was the Statlog project [18], in which statistical methods, machine learning, and neural networks were compared using a large number of different data sets. As a general conclusion of that study, good statistical classifiers included the k-nearest neighbor (k-NN) rule, and good neural-network classifiers included learning vector quantization (LVQ) and radial basis function (RBF) classifiers. One of the databases used in that study consisted of handwritten digits. The good performance of nearest neighbor classifiers for handwritten digits was also confirmed by Blue et al. [20], who however only compared the result to RBF and multilayer perceptron (MLP) neural classifiers. A kernel discriminant analysis type method, the probabilistic neural network (PNN), also did very well in that study.

Offline character recognition is one of the most popular practical applications of pattern recognition. During the last two decades, the research has mostly focused on handwritten characters [21]–[30]. This is mainly due to the fact that while recognition of machine-printed characters is considered a solved problem, the reliability achieved with handwritten text has not yet reached the level required in practical applications. Handwritten character recognition has become a popular application area for neural-network classifiers [31]–[33], which through their adaptive capabilities have often been able to achieve better reliability than classical statistical or knowledge-based structural recognition methods. In our first case study, we also use handwritten digits.



Another comprehensive benchmark study was done in the Elena project [19] using four real-world datasets. In order to compare our classifiers in the case of data that are very different from the handwritten digits, we chose the phoneme database of the Elena project. Contrary to the digit data, the phoneme classification problem is low dimensional, has only two classes, and the class densities are highly non-Gaussian. In addition to giving insights into the relative performances of the classifiers for two very different classification problems, a major motivation for our case studies was to discuss the various design issues one has to face when implementing the various methods.

In both of the experiments reported here, four methods of our own are compared to a carefully selected subset of neural and statistical classifiers. These four methods are: reduced kernel discriminant analysis, the learning k-NN (L-k-NN) classifier, the averaged learning subspace method (ALSM), and a variant of kernel discriminant analysis that uses smoothing parameter selection based on error rate minimization. Further, local linear regression has rarely been used previously in a classification context. In handwritten digit recognition, besides a different set of classifiers selected for comparison, the present study also differs from [20] in our systematic use of training set cross-validation in all classifier design. The final performance estimates are based on an independent testing set that had no role in classifier construction, including the choice of optimal pattern vector dimension.

While the two case studies do give some indication of the relative merits of the tested methods in the two particular applications studied, we want to emphasize that the results obtained by no means provide grounds for any objective ranking of these methods as alternative general classification schemes. The best classifier for a given task can be found by experimenting with different designs and basing the choice on criteria which, in addition to classification error, can include other issues such as computational complexity and feasibility of efficient hardware implementation. See [34] for an insightful discussion on the difficulties of benchmarking different types of classifiers.

In Section II, we first give a tutorial overview of classification methods that groups classifiers into a few main categories according to their underlying mathematical principles. We also present some characteristics of neural-network classifiers that make them different from the mainstream statistical methods. Based on these comparisons, a number of classification methods were selected for the case studies. Closer descriptions of the specific methods selected are then given in Section III, including several implementation details. The findings of the two case studies, handwritten digit classification and phoneme classification, are discussed in Sections IV and V. Section VI summarizes the results of this paper.

II. A TAXONOMY OF CLASSIFIERS

In the following, we assume that the patterns are stochastically independent and identically distributed, although taking advantage of context-dependent information in such applications as optical character recognition or speech recognition certainly is beneficial.

A d-dimensional feature or pattern vector is denoted by x and its associated class by j, where j ranges over the c possible classes. A classifier can be regarded as a function g that classifies a given pattern x to the class g(x). A pattern with its associated class is stochastically modeled as a random pair (X, J). The a priori probability of class j is denoted by P_j, its probability density function by f_j, and that of the pooled data by f = \sum_{j=1}^{c} P_j f_j. The Bayes classifier that minimizes the misclassification error [4] is defined by

\[ g_{\mathrm{B}}(x) = \operatorname*{argmax}_{1 \le j \le c} P_j f_j(x). \tag{1} \]

Here "argmax" denotes the maximizing argument, i.e., x is classified to the class j that maximizes the product P_j f_j(x). An equivalent formulation is to consider the a posteriori probability q_j(x) = P(J = j \mid X = x) of class j given X = x and use the rule

\[ g_{\mathrm{B}}(x) = \operatorname*{argmax}_{1 \le j \le c} q_j(x). \tag{2} \]

Other criteria than minimum classification error can be important in practice, including use of class-dependent misclassification costs and Neyman–Pearson style classification [3], [35]. The use of a reject class can help reduce the misclassification rate in tasks where exceptional handling (e.g., by a human expert) of particularly ambiguous cases is feasible. The decision to reject a pattern x and to handle it separately can be based on its probability to be misclassified, which for the Bayes rule is 1 - \max_j q_j(x). The highest misclassification probability occurs when the posterior probabilities are equal, and then 1 - \max_j q_j(x) = 1 - 1/c. One can therefore select a rejection threshold 0 \le t \le 1 - 1/c and reject x if

\[ \max_{1 \le j \le c} q_j(x) < 1 - t. \tag{3} \]

The rules (1) and (2) are formally equivalent since q_j(x) = P_j f_j(x) / f(x). However, in practice the classifiers (1) and (2) have to be estimated from training data (x_1, j_1), \ldots, (x_n, j_n) of pattern vectors with known classes, and then two distinct approaches emerge. The use of rule (1) requires explicit estimation of the class-conditional probability density functions f_j. For (2), some regression technique can be used to estimate the posterior probabilities q_j directly without considering the class-conditional densities separately. We elaborate upon these two approaches further in Sections II-A and II-B below.
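To make rules (1)–(3) concrete, the following is a minimal Python sketch, not taken from the paper, of a plug-in Bayes classifier: the class-conditional densities are modeled as Gaussians with sample means and covariances, priors are estimated from class frequencies, and a pattern is rejected when the estimated posterior stays below 1 − t. The class name and the integer-label convention (reject coded as −1) are illustrative assumptions.

```python
import numpy as np

class PlugInBayes:
    """Plug-in version of rules (1)-(3) with Gaussian class-conditional densities."""

    def fit(self, X, y):
        # y is assumed to contain integer class labels
        self.classes_ = np.unique(y)
        self.priors_, self.means_, self.covs_ = [], [], []
        for c in self.classes_:
            Xc = X[y == c]
            self.priors_.append(len(Xc) / len(X))
            self.means_.append(Xc.mean(axis=0))
            self.covs_.append(np.cov(Xc, rowvar=False))
        return self

    def posteriors(self, X):
        # log P_j + log f_j(x) for Gaussian f_j, then normalize to get q_j(x)
        d = X.shape[1]
        logs = []
        for p, m, S in zip(self.priors_, self.means_, self.covs_):
            diff = X - m
            Sinv = np.linalg.inv(S)
            maha = np.einsum('ij,jk,ik->i', diff, Sinv, diff)
            _, logdet = np.linalg.slogdet(S)
            logs.append(np.log(p) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha))
        logs = np.column_stack(logs)
        logs -= logs.max(axis=1, keepdims=True)   # numerical stabilization
        q = np.exp(logs)
        return q / q.sum(axis=1, keepdims=True)

    def predict(self, X, t=None):
        q = self.posteriors(X)
        labels = self.classes_[q.argmax(axis=1)]
        if t is not None:                          # reject rule (3)
            labels = np.where(q.max(axis=1) < 1.0 - t, -1, labels)
        return labels
```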

A. Density Estimation

In the density estimation approach one needs estimates both for the prior probabilities P_j and the class-conditional densities f_j in (1), but in practice the difficult part is to estimate the class-conditional densities. A classical parametric approach is to model the class-conditional densities as multivariate normals. Depending on whether unequal or equal class covariances are assumed, the logarithm of P_j f_j(x) is then either a quadratic or a linear function of x, giving rise to quadratic discriminant analysis (QDA) and linear discriminant analysis (LDA) [7]. A recent development is regularized discriminant analysis (RDA) [36], which interpolates between LDA and QDA.

In nonparametric density estimation, no fixed parametrically defined form for the estimated density is assumed. Kernel or Parzen estimates as well as k-NN methods (k large) are examples of popular nonparametric density estimation methods [37]–[39]. They give rise to kernel discriminant analysis (KDA) [40]–[42] and k-NN classification rules [4], [40]. The PNN [43] is the neural-network counterpart of KDA. In another approach the densities are estimated as finite mixtures of some standard probability densities by using the expectation-maximization (EM) algorithm or some other method [44]–[49]. Such an approach can be viewed as an economized KDA or an instance of the RBF approach [10]. The self-organizing reduced kernel density estimator introduced in [48] estimates densities in the spirit of RBF, and we refer to the corresponding classification method as reduced kernel discriminant analysis (RKDA).

B. Regression

In the second approach to classification the class posterior probabilities q_j(x) = P(J = j \mid X = x) are directly estimated using some regression technique. Parametric methods include linear and logistic regression, as well as an MLP [10], [50] with a fixed architecture, i.e., fixed layer sizes. Examples of nonparametric methodologies are projection pursuit [51], [52], additive models [53], multivariate adaptive regression splines (MARS) [54], the Nadaraya–Watson kernel regression estimator [39], [55], and local linear regression (LLR) [39], [56], [57]. The Nadaraya–Watson kernel regression estimator is also called the general regression neural network [58]. Other neural-network approaches include MLP's with a flexible architecture and RBF expansions [10], [50].

One can use "one-of-c" coding to define the response to a pattern from class j to be the unit vector e_j with "1" in the jth place. In the least squares approach one then tries to minimize

\[ \frac{1}{n} \sum_{i=1}^{n} \| e_{j_i} - m(x_i) \|^2 \tag{4} \]

over a family \mathcal{M} of \mathbb{R}^c-valued functions m, where we denote the jth component of the vector m(x) by m_j(x). The corresponding mathematical expectation is minimized by the vector of class posterior probabilities, (q_1(x), \ldots, q_c(x))^T. Of course, this ideal solution may or may not belong to the family \mathcal{M}, and besides, sampling variation will anyhow prevent us from estimating it exactly even when it does belong to \mathcal{M} [59], [60].

The least squares fitting criterion (4) can be thought to arise from using the maximum likelihood principle to estimate a regression model where errors are distributed normally. The logistic approach [7, ch. 8] uses a binomially distributed error, clearly the statistically correct model. One natural multivariate logistic regression approach is to model the posterior probabilities as the softmax [61] of the components of m(x),

\[ q_j(x) = \frac{\exp(m_j(x))}{\sum_{k=1}^{c} \exp(m_k(x))}. \tag{5} \]

Note that this also satisfies the natural condition \sum_{j=1}^{c} q_j(x) = 1. A suitable fitting criterion is to maximize the conditional log-likelihood of the observed classes given the observed pattern vectors. In the case of two classes this approach is equivalent to the use of the cross-entropy fitting criterion [10].

A very natural approach would be a regression technique that uses the error rate, i.e., the number of misclassified patterns, as the fitting criterion to be minimized [62]. Classification and regression trees (CART's) are an example of a nonparametric technique that estimates the posterior probabilities directly but uses neither the least squares nor the logistic regression approach [63].
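As an illustration of one-of-c coding and the two fitting criteria just discussed, here is a small numpy sketch (not from the paper). The regression model M is an assumption, e.g., a linear map lambda X: X @ W + b; labels are assumed to be integers 0..c-1.

```python
import numpy as np

def one_of_c(labels, c):
    """One-of-c coding: class j -> unit vector e_j (labels assumed 0..c-1)."""
    Y = np.zeros((len(labels), c))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def least_squares_criterion(M, X, Y):
    """Criterion (4): mean squared distance between e_{j_i} and m(x_i)."""
    return np.mean(np.sum((Y - M(X)) ** 2, axis=1))

def softmax_posteriors(M, X):
    """Posterior model (5): softmax of the components of m(x)."""
    Z = M(X)
    Z -= Z.max(axis=1, keepdims=True)      # numerical stabilization
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def conditional_log_likelihood(M, X, labels):
    """Fitting criterion for (5): sum_i log q_{j_i}(x_i)."""
    Q = softmax_posteriors(M, X)
    return np.sum(np.log(Q[np.arange(len(labels)), labels]))
```

For two classes, maximizing this conditional log-likelihood coincides with minimizing the usual cross-entropy criterion, which is the equivalence mentioned above.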

C. Other Classifiers

In addition to using the Bayes rule combined with either density estimation or regression, a third popular approach is to model the classes using prototypes and classify according to the shortest distance (defined in a suitable sense) to a prototype. Examples are LVQ [64] and the k-NN classifiers with a small k. A distinct family of discrimination methods are the subspace classifiers [65] that model the individual classes using linear subspaces and classify a pattern according to its shortest distance to the model subspaces. Examples include the classical CLAFIC (CLAss-Featuring Information Compression) method and the ALSM that features incremental learning.

D. What are Neural Classifiers?

In the previous discussion, we characterized some popular classification techniques in terms of the mathematical principles they are based on. In this general view, many neural networks can be seen as representatives of certain larger families of statistical techniques. However, this abstract point of view fails to identify some key features of neural networks that characterize them as a distinct methodology.

From the very beginning of neural-network research [66]–[68], the goal was to demonstrate problem solving without explicit programming. The neurons and networks were supposed to learn from examples and store this knowledge in a distributed way among the connection weights.

The original methodology was exactly opposite to the goal-driven or top-down design of statistical classifiers in terms of explicit error functions. In neural networks, the approach has been bottom-up: starting from a very simple linear neuron that computes a weighted sum of its inputs, adding a saturating smooth nonlinearity, and constructing layers of similar parallel units, it turned out that "intelligent" behavior like speech synthesis [69] emerged by simple learning rules. The computational aspect has always been central. At least in principle, everything that the neural network does should be accomplished by a large number of simple local computations using the available input and output signals, like in real neurons, but unlike in heavy numerical algorithms involving such operations as matrix inversions. Perhaps the best example of a clean-cut neural-network classifier is the LeNet system [17], [70] for handwritten digit recognition. Such a computational model supports well the implementation in regular very large scale integration (VLSI) circuits.

In the current neural-network research, these original views are clearly becoming vague as some of the most fundamental neural networks like the one-hidden-layer MLP or RBF networks have been shown to have very close connections to statistical techniques. The goal remains, however, of building much more complex artificial neural systems for demanding tasks like speech recognition [71] or computer vision [72], in which it is difficult or eventually impossible to state the exact optimization criteria for all the consequent processing stages.

Fig. 1. A schematic depiction of the neural characteristics of some classification methods.

Fig. 1 is an attempt to assess the neural characteristics of some of the classification methods discussed earlier. The horizontal axis measures the flexibility of a classifier architecture in the sense of the richness of the discriminant function family encompassed by a particular method. High flexibility of architecture is a property often associated with neural networks. In some cases (MLP, RBF, CART, MARS) the flexibility can also include algorithmic model selection during learning.

In the vertical dimension, the various classifiers are categorized on the basis of how they are designed from a training sample. Training is considered nonneural if the training vectors are used as such in classification (e.g., k-NN, KDA), or if some statistics are first estimated in batch mode and the discriminant functions are computed from them (e.g., QDA, CLAFIC). Neural learning is characterized by simple local computations in a number of real or virtual processing elements. Neural learning algorithms are typically of the error correction type; for some such algorithms, not even an explicit cost function exists. Typically, the training set is used several times (epochs) in an on-line mode. Note, however, that for some neural networks (MLP, RBF) the current implementations in fact often employ sophisticated optimization techniques which would justify moving them downwards in our map to the lower half plane.

In this schematic representation the classical LDA and QDA methods are seen as least neural, with RDA and CLAFIC possessing at least some degree of flexibility in their architecture. The architecture of KDA, k-NN, and LLR is extremely flexible. Compared to CLAFIC, the ALSM method allows for both incremental learning and flexibility of architecture in the form of subspace dimensions that can change during learning. In this view, neural classifiers are well exemplified in particular by such methods as MLP, RBF, RKDA, LVQ, and L-k-NN, but also to some degree by ALSM, CART, and MARS.

III. CLASSIFIERS USED IN THE CASE STUDIES

For our case study we chose a set of methods that includes typical representatives of the various categories discussed in Section II. We will next give a short description of each method tested.

A. QDA, LDA, RDA, KDA, and RKDA

QDA is based on the assumption that pattern vectors from class j are normally distributed with mean vector μ_j and covariance matrix Σ_j. Following the density estimation approach then leads to the rule

\[ g(x) = \operatorname*{argmax}_{j} \Big[ \log P_j - \tfrac{1}{2} \log \det \hat{\Sigma}_j - \tfrac{1}{2} (x - \hat{\mu}_j)^{T} \hat{\Sigma}_j^{-1} (x - \hat{\mu}_j) \Big] \tag{6} \]

(after taking the logarithm of P_j f_j(x) and canceling a common term). Here \hat{\mu}_j and \hat{\Sigma}_j denote the natural estimates of the corresponding theoretical quantities, namely, the sample mean and the sample covariance of those pattern vectors in the training set which originate from class j, respectively.

If one assumes that the different classes are normally distributed with different mean vectors but with a common covariance matrix Σ, then the previous formula simplifies to the LDA rule

\[ g(x) = \operatorname*{argmax}_{j} \Big[ \log P_j - \tfrac{1}{2} (x - \hat{\mu}_j)^{T} \hat{\Sigma}^{-1} (x - \hat{\mu}_j) \Big] \tag{7} \]

where a natural estimate for Σ is the pooled covariance matrix estimate.

Friedman's RDA [36] is a compromise between LDA and QDA. The decision rule is otherwise the same as (6) but instead of \hat{\Sigma}_j one uses regularized covariance estimates with two regularizing parameters. Parameter λ controls the shrinkage of the class conditional covariance estimates toward the pooled estimate and γ controls the shrinkage toward a multiple of the identity matrix. Let x_{j1}, \ldots, x_{jn_j} be the training data from class j, denote by X_j the matrix with rows (x_{ji} - \hat{\mu}_j)^T, and let S_j = X_j^T X_j. Then

\[ \hat{\Sigma}_j(\lambda, \gamma) = (1 - \gamma)\, \hat{\Sigma}_j(\lambda) + \frac{\gamma}{d} \operatorname{tr}\!\big[ \hat{\Sigma}_j(\lambda) \big]\, I \tag{8} \]

where

\[ \hat{\Sigma}_j(\lambda) = \frac{(1 - \lambda)\, S_j + \lambda \sum_k S_k}{(1 - \lambda)\, n_j + \lambda n}. \tag{9} \]

One obtains QDA when λ = γ = 0 and LDA when λ = 1, γ = 0, provided one uses the estimates \hat{\Sigma}_j = S_j / n_j and \hat{\Sigma} = \sum_k S_k / n.

In KDA one forms kernel estimates of the class-conditional densities and then applies rule (1). To define the kernel estimates, let again x_{j1}, \ldots, x_{jn_j} be the training data from class j. Then the estimate of the jth class-conditional density is

\[ \hat{f}_j(x) = \frac{1}{n_j} \sum_{i=1}^{n_j} K_{h_j}(x - x_{ji}) \tag{10} \]

where, given a fixed probability density function K called the kernel, h_j is the smoothing parameter of class j and K_h(x) = h^{-d} K(x/h) denotes the scaled kernel. This scaling ensures that each K_{h_j} and hence also each \hat{f}_j is a probability density. A popular choice, which was also used in the present study, is the Gaussian kernel K(x) = (2\pi)^{-d/2} \exp(-\|x\|^2 / 2). The choice of suitable values for the smoothing parameters is crucial and several approaches have been proposed in the literature, see [7] and [37]–[39]. We used two methods based on a cross-validated error count (cf. Section IV-B). In the first method, KDA1, we restrict all the smoothing parameters to be equal to a parameter h and evaluate the number of misclassified vectors using cross-validation. We then vary h and select the value which minimizes the cross-validated error count. In the second method, KDA2, we let the smoothing parameters h_1, \ldots, h_c vary separately and seek to minimize the cross-validated error count. Both of the methods lead to optimization problems where the object function is piecewise constant. In the first method the search space is one-dimensional, and the optimization problem can be solved simply by evaluating the object function on a suitable grid of h-values. In the second method the nonsmoothness of the object function causes more trouble. Instead of minimizing the error count directly, we found it advantageous to minimize a smoothed version of it. The particular smoothing described below is inspired by that used in [73]. Let \hat{q}_j(x) be the posterior probability estimate at x corresponding to the current smoothing parameters and define the functions φ_j by

(11)

which depend on an additional smoothing parameter. The smoothed error count is then obtained by summing 1 - φ_{j_i}(x_i) over the training data; in the limit this converges toward the true error count. In the experiments the additional parameter was kept fixed. Since the smoothed error count is a differentiable function of the smoothing parameters, one can use a gradient-based minimization method for the optimization. As an initial point for the iteration, we used the vector (h, \ldots, h), where h is the value selected in KDA1.

The standard kernel density estimate suffers from the curse of dimensionality: as the dimension of the data increases, the size of a sample required for an accurate estimate of an unknown density grows quickly. On the other hand, even if there are enough data for accurate density estimation, the application at hand may limit the complexity of the classifier one can use in practice. A kernel estimate with a large number of terms may be computationally too expensive to use. One solution is to reduce the estimate, that is, to use fewer kernels but to place them at optimal locations. One can also introduce kernel-dependent weights and smoothing parameters. Various reduction approaches were described in [45]–[47], [74]–[78]. In some cases the methods suggested have a close connection with radial basis function and mixture density estimation techniques. The self-organizing reduced kernel density estimate [48] has the form

\[ \hat{f}(x) = \sum_{i=1}^{\ell} w_i K_{h_i}(x - m_i) \tag{12} \]

where m_1, \ldots, m_\ell are the reference vectors of a self-organizing map [64], w_1, \ldots, w_\ell are nonnegative weights with \sum_i w_i = 1, and h_i is a smoothing parameter associated with the ith kernel. In order to achieve substantial reduction one takes ℓ much smaller than the sample size. The kernel locations m_i are obtained by training the self-organizing map using the whole available sample from the density in question. The weights w_i are computed iteratively and they reflect the number of training data in the Voronoi regions of the corresponding reference vectors. The smoothing parameters are optimized via stochastic gradient descent that attempts to minimize a Monte Carlo estimate of the integrated squared error. Simulations have shown that when the underlying density is multimodal, the use of the feature map algorithm gives better density estimates than k-means clustering, the approach proposed in [79]. RKDA constitutes using estimates (12) for the class-conditional densities in the classifier (1). A drawback of RKDA in the present application is that the smoothing parameters of the class-conditional density estimates used in the approximate Bayes classifier are optimized from the point of view of integrated squared error and not discrimination performance, which is the true focus of interest.
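A minimal numpy sketch of the KDA estimate (10) with a Gaussian kernel, and of a KDA1-style grid search for a common bandwidth using a cross-validated error count, is given below. This is an illustration under assumed conventions (integer labels, every class present in every training fold); function names are not from the paper.

```python
import numpy as np

def gaussian_kernel_logdensity(x, samples, h):
    """log of the kernel density estimate (10) at x with Gaussian kernel, bandwidth h."""
    d = samples.shape[1]
    sq = np.sum((samples - x) ** 2, axis=1) / h ** 2
    log_terms = -0.5 * sq - d * np.log(h) - 0.5 * d * np.log(2 * np.pi)
    m = log_terms.max()
    return m + np.log(np.exp(log_terms - m).mean())    # stable log-mean-exp

def kda_predict(x, class_samples, priors, h):
    """Rule (1) with kernel estimates of the class-conditional densities."""
    scores = [np.log(p) + gaussian_kernel_logdensity(x, S, h)
              for p, S in zip(priors, class_samples)]
    return int(np.argmax(scores))

def kda1_select_bandwidth(X, y, grid, n_folds=10, seed=0):
    """Pick a common bandwidth h minimizing the cross-validated error count."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(X))
    errors = []
    for h in grid:
        err = 0
        for f in range(n_folds):
            tr, te = folds != f, folds == f
            classes = np.unique(y[tr])
            samples = [X[tr & (y == c)] for c in classes]
            priors = [s.shape[0] / tr.sum() for s in samples]
            preds = [classes[kda_predict(x, samples, priors, h)] for x in X[te]]
            err += int(np.sum(np.array(preds) != y[te]))
        errors.append(err)
    return grid[int(np.argmin(errors))]
```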

B. MLP and LLR

We used a standard MLP with d inputs, q hidden units, and c output units, and with all the feedforward connections between adjacent layers included. We had the logistic activation function in the hidden and output layers [10], [50]. Such a network has (d+1)q + (q+1)c adaptable weights, which in our experiments were determined by minimizing the sum of squared errors criterion (4) using a conjugate gradient method [80]. Using the notation of Section II-B, we used a scaled version of the unit vector e_{j_i} as the desired output for input x_i, i.e., the target vectors were scaled to better fit within the range of the logistic function. Then the scaled outputs of the optimized network can be regarded as estimating the posterior probabilities q_j(x). We employed the heuristic of starting the local optimizations from many random initial points and keeping the weights yielding the minimum value for the sum of squared errors to prevent the network from converging to a shallow local minimum. It is advisable to scale the random initial weights so that the inputs to the logistic activation functions are of the order unity [10, ch. 7.4].

In weight decay regularization [10, ch. 9.2], one introduces a penalty for weights having a large absolute value in order to encourage smooth network mappings. When training MLP's with weight decay (MLP WD), we minimized the criterion

\[ \sum_{i=1}^{n} \| y_i - o(x_i, w) \|^2 + \lambda \sum_{v \in \mathcal{W}} v^2 \tag{13} \]

with conjugate gradient optimization. Here w comprises all the weights and biases of the network, o(x_i, w) is the network output, \mathcal{W} is the set of weights between adjacent layers excluding the biases, and λ is the weight decay parameter. The network inputs and the outputs of the hidden units should be roughly comparable before the weight decay penalty in the form given above makes sense. It may be necessary to rescale the inputs in order to achieve this.

LLR (or more generally local polynomial regression) is a nonparametric regression method which has its roots in classical methods proposed for the smoothing of time series data, see [57]. Such estimators have received more attention recently, see e.g., [81]. The particular version described below is also called LOESS [56], [57]. LLR models the regression function in the neighborhood of each point x by means of a linear function. Given training data (x_1, y_1), \ldots, (x_n, y_n), the fit at point x is calculated as follows. First one minimizes

\[ \sum_{i=1}^{n} w\!\left( \frac{\| x_i - x \|}{h(x)} \right) \| y_i - a - B(x_i - x) \|^2 \tag{14} \]

with respect to the vector a and the matrix B, and then takes a as the fit. As the weight function we used the tricube function [56], w(u) = (1 - u^3)^3 for 0 ≤ u < 1 and w(u) = 0 otherwise. The local bandwidth h(x) is controlled by a neighborhood size parameter: one takes a number k equal to that fraction of n rounded to the nearest integer, and then h(x) equal to the distance to the kth closest neighbor of x among the vectors x_1, \ldots, x_n. If the components of x are measured in different scales, then it is advisable to select the metric for the nearest neighbor calculation carefully. However, we simply used the Euclidean metric. At a given x, the weighted linear least squares problem (14) can be reduced to inverting a (d+1) × (d+1) matrix, where d is the dimensionality of x, see, e.g., [39, ch. 5].
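The following is a minimal sketch of the LLR fit (14) at a single query point, assuming one-of-c coded responses and the tricube weight function; it solves the weighted least squares problem with a generic solver rather than the explicit matrix inversion mentioned above. All names are illustrative.

```python
import numpy as np

def tricube(u):
    """Tricube weight function used in (14)."""
    return np.where(u < 1.0, (1.0 - u ** 3) ** 3, 0.0)

def llr_fit_at(x, X, Y, alpha=0.5):
    """Local linear regression (LOESS-style) fit at a query point x.

    X: (n, d) training inputs; Y: (n, c) one-of-c coded responses.
    alpha: neighborhood size parameter; h(x) is the distance to the
    k-th nearest neighbor with k = round(alpha * n).
    """
    n, d = X.shape
    dist = np.linalg.norm(X - x, axis=1)
    k = max(1, int(round(alpha * n)))
    h = max(np.sort(dist)[k - 1], 1e-12)
    w = tricube(dist / h)
    # Weighted least squares for y ~ a + B (x_i - x); the fit at x is a.
    Z = np.hstack([np.ones((n, 1)), X - x])       # design matrix [1, x_i - x]
    sw = np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(Z * sw, Y * sw, rcond=None)
    return beta[0]                                 # intercept row = a

# Classification: pick the class with the largest fitted posterior estimate,
# e.g. predicted = int(np.argmax(llr_fit_at(x_query, X_train, Y_onehot)))
```

This also makes the practical drawback noted later in the paper visible: a separate weighted least squares problem is solved for every pattern to be classified.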

C. Tree Classifier, MARS, and FDA

The introduction of tree-based models in statistics dates back to [82], although their current popularity is largely due to the seminal book [63]. For Euclidean pattern vectors, a classification tree is a binary tree where at each node the decision to branch either to left or right is based on a threshold test on a single pattern component. The cut-off values are chosen to optimize a suitable fitting criterion. The tree growing algorithm recursively splits the pattern space into hyperrectangles while trying to form maximally pure nodes, that is, subdivision rectangles that ideally contain training vectors from one class only. Stopping criteria are used to keep the trees reasonably sized, although the commonly employed strategy is to first grow a large tree that overfits the data and then use a separate pruning stage to improve its generalization performance. A terminal node is labeled according to the class with the largest number of training vectors in the associated hyperrectangle. The tree classifier therefore uses the Bayes rule with the class posterior probabilities estimated by locally constant functions. The particular tree classifier used here is available as part of the S-Plus statistical software package [83]–[85]. This implementation uses a likelihood function to select the optimal splits [86]. Pruning was performed by the minimal cost-complexity method. The cost of a subtree T is taken to be

\[ \operatorname{cost}(T) = \operatorname{err}(T) + \alpha \cdot \operatorname{size}(T) \tag{15} \]

where err(T) is an estimate of the classification error of T, the size of T is measured by the number of its terminal nodes, and α is a cost parameter. An overfitted tree is pruned by giving α increasingly large values and selecting nested subtrees that minimize the cost.

MARS [54] is a regression method that shares features with tree-based modeling. MARS estimates an unknown function using an expansion

\[ \hat{f}(x) = \sum_{k=1}^{M} a_k B_k(x) \tag{16} \]

where the functions B_k are multivariate splines. The algorithm is a two-stage procedure, beginning with a forward stepwise phase which adds basis functions to the model in a deliberate attempt to overfit the data. The second stage of the algorithm is standard linear regression backward subset selection. The maximum order of variable interactions (products of variables) allowed in the functions B_k, as well as the maximum value of M allowed in the forward stage, are parameters that need to be tuned experimentally. Backward model selection uses the generalized cross-validation criterion (GCV) introduced in [87].

The original MARS algorithm fits only scalar-valued functions and is therefore not well-suited to discrimination tasks with more than two classes. A recent proposal called flexible discriminant analysis (FDA) [88], with its publicly available S-Plus implementation in the StatLib program library, contains vector-valued MARS as one of its ingredients. However, FDA is not limited to just MARS as it allows the use of other regression techniques as its building blocks as well. In FDA one can first train separate MARS models with equal basis function sets but different coefficients to map training vectors to the corresponding unit vectors (cf. Section II-B). Then a linear map is constructed to map the regression function output space onto a lower dimensional feature space in a manner that optimally facilitates prototype classification based on the transformed class means and a weighted Euclidean distance function.
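A tiny illustration of the pruning criterion (15), not tied to any particular tree software: each candidate subtree is summarized by an error estimate and its number of terminal nodes, and increasing α favors smaller subtrees.

```python
def best_subtree(candidates, alpha):
    """candidates: list of (error_estimate, n_terminal_nodes, subtree_id) tuples.
    Returns the candidate minimizing criterion (15): err + alpha * size."""
    return min(candidates, key=lambda c: c[0] + alpha * c[1])

# Example with made-up numbers: larger alpha selects the smaller subtree.
# best_subtree([(0.02, 849, 'full'), (0.05, 54, 'default')], alpha=1e-4)
```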

D. k-NN, Learning k-NN, and LVQ Classifiers

In a k-NN classifier [4] each class is represented by a set of prototype vectors. The k closest neighbors of a pattern vector are found from among all the prototypes and the class label is decided by the majority rule. A possible tie of two or more classes is broken by decreasing k by one and revoting.

Recently, two of the authors have introduced a set of adaptation rules that can be used in iterative training of a k-NN classifier [89]. The learning rules of the proposed L-k-NN resemble those of LVQ, but at the same time the classifier utilizes the improved classification accuracy provided by majority voting. The performance of the standard k-NN classifier depends on the quality and size of the training set, and the performance of the classifier decreases if the available computing resources limit the number of training vectors one can use. In such a case the L-k-NN rule is better able to utilize the available data by using the whole training set to optimize a smaller set of prototype vectors.

The LVQ algorithm [64] produces a set of prototype or codebook pattern vectors that can be used in a 1-NN classifier. Training consists of moving a fixed number of codebook vectors iteratively toward or away from the training samples. The variations of the LVQ algorithm differ in the way the codebook vectors are updated. The LVQ learning process can be interpreted either as an iterative movement of the decision boundaries between neighboring classes, or as a way to generate a set of codebook vectors whose density reflects the shape of the function defined as

(17)

Note that the zero set of this function consists of the Bayes optimal decision boundaries.
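For illustration, a minimal LVQ1-style update (a sketch under assumed conventions, not the exact variant or schedule used in the experiments): the nearest codebook vector is attracted to a training sample of the same class and repelled otherwise, and the trained codebook is then used as a 1-NN classifier.

```python
import numpy as np

def lvq1_train(X, y, codebook, cb_labels, epochs=10, lr0=0.2, seed=0):
    """Basic LVQ1: attract/repel the nearest codebook vector; returns the codebook."""
    rng = np.random.default_rng(seed)
    C = codebook.copy()
    n = len(X)
    for e in range(epochs):
        lr = lr0 * (1.0 - e / epochs)          # linearly decreasing learning rate
        for i in rng.permutation(n):
            w = int(np.argmin(np.linalg.norm(C - X[i], axis=1)))   # winner
            sign = 1.0 if cb_labels[w] == y[i] else -1.0
            C[w] += sign * lr * (X[i] - C[w])
    return C

def lvq_classify(x, codebook, cb_labels):
    """1-NN classification against the codebook."""
    return cb_labels[int(np.argmin(np.linalg.norm(codebook - x, axis=1)))]
```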

E. CLAFIC and ALSM

The motivation for the subspace classifiers originates from compression and optimal reconstruction of multidimensional data. The use of linear subspaces as class models is based on the assumption that the data within each class approximately lies on a lower-dimensional subspace of the pattern space. A vector x from an unknown class can then be classified according to its shortest distance from the class subspaces.

The sample mean of the whole training set is first subtracted from the pattern vectors. For each class j the correlation matrix is estimated and its first few eigenvectors are used as columns of a basis matrix U_j. The classification rule of the CLAFIC [90] algorithm can then be expressed as

\[ g(x) = \operatorname*{argmax}_{j} \| U_j^{T} x \|^2. \tag{18} \]

The ALSM introduced by one of the authors [65] is an iterative learning version of CLAFIC in which the unnormalized sample class correlation matrices S_j = \sum_i x_{ji} x_{ji}^{T} are slightly modified according to the correctness of the classifications,

\[ S_j \leftarrow S_j + \alpha \sum_{i \in A_j} x_i x_i^{T} - \beta \sum_{i \in B_j} x_i x_i^{T}. \tag{19} \]

Here {x_{ji}} is the training sample from class j, α and β are small positive constants, A_j is the set of indexes i for which x_i comes from class j but is classified erroneously to a different class, and B_j consists of those indexes i for which x_i is classified to class j although it actually originates from a different class. The basis matrices U_j are recalculated after each training epoch from the dominant eigenvectors of the modified S_j. In our experiments we chose equal numbers of basis vectors both in CLAFIC and in ALSM, and for ALSM, β was chosen to be equal to α.
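A minimal numpy sketch of CLAFIC follows: per-class correlation matrices (after removing the overall training-set mean), leading eigenvectors as subspace bases, and classification by the largest squared projection norm as in (18). The function names are illustrative; the ALSM correction (19) is not included.

```python
import numpy as np

def clafic_fit(X, y, subspace_dim, classes=None):
    """Estimate one subspace basis U_j per class from the leading eigenvectors
    of the class correlation matrix (overall training mean removed first)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    classes = np.unique(y) if classes is None else classes
    bases = []
    for c in classes:
        Z = Xc[y == c]
        R = Z.T @ Z / len(Z)                     # class correlation matrix
        vals, vecs = np.linalg.eigh(R)           # eigenvalues in ascending order
        bases.append(vecs[:, -subspace_dim:])    # dominant eigenvectors as columns
    return mean, classes, bases

def clafic_predict(x, mean, classes, bases):
    """Rule (18): classify to the class with the largest squared projection norm."""
    z = x - mean
    scores = [np.sum((U.T @ z) ** 2) for U in bases]
    return classes[int(np.argmax(scores))]
```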

IV. CASE STUDY 1: CLASSIFICATION OF HANDWRITTEN DIGITS

A. Data, Preprocessing, and Feature Extraction

The first case involved handwritten digit data specifically assembled for the present study. To obtain the test material, a set of forms was filled out by randomly chosen Finnish people. The total number of forms was 894 and each contained two handwritten examples of each of the ten digits. The images were scanned in binary form with a resolution of 300 pixels per inch in both directions. In this resolution each digit was printed approximately in a 75 × 100 pixel rectangle. The bounding boxes of the digits were normalized to 32 × 32 pixels and the slant of the images was removed. No normalization of the line width was performed. The preprocessing procedure closely resembles that of [20], [91], and it produced a sample of 17 880 binary vectors, each 1024-dimensional. Some examples are displayed in the leftmost column of Fig. 2. The sample was then divided into two sets of equal size, one for training and one for testing the classifiers.

As pattern vectors we used Karhunen–Loève (KL) transforms [6] of the original preprocessed images, derived from the estimated covariance matrix of the training set. This approach is similar to that adopted in [20] and [91]. In general, the magnitude of the covariance matrix eigenvalues can be used to measure the usefulness of the corresponding KL transform components in classification but, for the digit data, the eigenvalues decreased rather steadily in magnitude and it was not possible to decide on any natural reduced pattern vector dimension. On the other hand, higher dimensions imply more computations both in classifier design and application. In our tests we used a maximum pattern dimension of 64. Cross-validation (see below) can then be used to select the best dimension for each classifier separately. The columns of Fig. 2 show how the normalized images can be reproduced from their KL transforms with progressively greater accuracy as the transform dimension is increased.
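A minimal sketch of the KL transform step, assuming the binary images have been flattened to vectors: the training-set covariance matrix is eigendecomposed and patterns are projected onto the leading eigenvectors. This is an illustration of the general technique, not the paper's exact pipeline.

```python
import numpy as np

def kl_transform_fit(X_train, n_components=64):
    """Eigendecomposition of the training-set covariance matrix."""
    mean = X_train.mean(axis=0)
    cov = np.cov(X_train - mean, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)                  # ascending eigenvalues
    order = np.argsort(vals)[::-1][:n_components]     # leading eigenvectors
    return mean, vecs[:, order]

def kl_transform_apply(X, mean, components):
    """Project centered patterns onto the principal directions."""
    return (X - mean) @ components

# mean, V = kl_transform_fit(train_images.reshape(len(train_images), -1), 64)
# features = kl_transform_apply(test_images.reshape(len(test_images), -1), mean, V)
```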


Fig. 2. In the leftmost column, some normalized 32 × 32 digit images. In the second column, the mean of the training set, and in the remaining columns, reproductions by adding projections in the principal directions up to dimension 1, 2, 5, 16, 32, and 64.

B. Cross-Validation in Classifier Design

The whole process of classifier design, including parameter estimation, the choice of various tuning parameters, and model or architecture selection, should be strictly based on the training sample only. The separate testing sample can be employed to obtain an unbiased estimate of the performance of the classifier thus designed. To utilize the available training sample efficiently, cross-validation [92] (or "rotation," cf. [4, Section 10.6.4]) can be used. In v-fold cross-validation the training sample is first divided into v disjoint subsets. One subset at a time is then put aside, a classifier is designed based on the union of the remaining subsets, and then tested for the subset left out. Cross-validation approximates the design of a classifier using all the training data and then testing it on an independent set of data. This enables defining a reasonable object function to be optimized in classifier design. For example, for a fixed classifier, the dimension of the pattern vector can be selected so that it minimizes the cross-validated error count.

For some methods such as KDA, n-fold cross-validation, where n is the size of the training set, could readily be used, whereas for methods such as MLP this would be impractical, as the design of the classifier involves a rather lengthy training stage. Besides, for our handwritten digit data such leave-one-out cross-validation seemed to deliver optimistically biased error rates, which probably is due to the fact that in the training sample there were several exemplars of each digit from each writer. As a reasonable compromise, ten-fold cross-validation was used for most methods, with the exception of RKDA and MLP, where two-fold and four-fold cross-validation was used, respectively.

Choosing an optimal pattern vector dimension is part of architecture or model selection and for most classifiers this was done using cross-validation. Cross-validation was also employed to determine the regularization parameters of RDA, the smoothing parameters of KDA and LLR, the number of reference vectors in RKDA, the size of the hidden layer of the MLP classifier, the weight decay parameter of the MLP WD classifier, the codebook sizes for the L-k-NN method and LVQ, the subspace dimension for CLAFIC and ALSM, as well as the learning coefficients of ALSM. The numbers of training epochs of L-k-NN, LVQ, and ALSM were also determined by cross-validation. The tree classifier can use cross-validation to evaluate the error estimate of (15) when pruning the overfitted model. Depending on the computational costs associated with the various methods, parameter search grids of different coarseness were used. However, as explained in Section III-A, in KDA2 we used multivariate optimization instead of a grid search.

During cross-validation of the MLP classifier, each optimization was started from two random initial points with the values of the weights drawn independently from the normal distribution, scaled such that the net inputs to the logistic sigmoids were of the order unity. Thus eight optimization runs were needed for each (dimension, hidden layer size) pair tried. In MLP training using weight decay (MLP WD), we kept the pattern vector dimension and hidden layer size at the values selected for MLP and chose the weight decay parameter with cross-validation. The training strategy in MLP WD was otherwise similar to that used in plain MLP training, but prior to training the network inputs were linearly rescaled so that each component of the input vector had mean zero and unit variance. The rescaling transformation was determined separately for each cross-validation training set and that rescaling was then applied to the cross-validation testing set.

Fig. 3 displays an example of architecture selection using cross-validation, in this case for the CLAFIC subspace classifier. The subspace dimension is selected according to the smallest cross-validated (CV) error, giving the optimum dimension 29. Note how the cross-validated error follows closely the testing error while the training error is optimistically biased and also suggests a somewhat higher optimal subspace dimension.
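A minimal sketch of this kind of v-fold parameter selection, reusing the illustrative clafic_fit/clafic_predict functions from the Section III-E sketch; the fold assignment and callable interface are assumptions, not the paper's implementation.

```python
import numpy as np

def cross_validated_error(X, y, param, fit, predict, n_folds=10, seed=0):
    """Cross-validated error count for one candidate parameter value."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(X))
    errors = 0
    for f in range(n_folds):
        tr, te = folds != f, folds == f
        model = fit(X[tr], y[tr], param)
        preds = np.array([predict(x, *model) for x in X[te]])
        errors += int(np.sum(preds != y[te]))
    return errors

def select_parameter(X, y, grid, fit, predict):
    """Pick the value (e.g., a CLAFIC subspace dimension) minimizing the CV error."""
    errs = [cross_validated_error(X, y, p, fit, predict) for p in grid]
    return grid[int(np.argmin(errs))]

# best_dim = select_parameter(X_train, y_train, range(5, 41), clafic_fit, clafic_predict)
```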

Fig. 3. Choosing the CLAFIC subspace dimension for the handwritten digit data using cross-validation. Shown are the cross-validated error, the training set error, and the testing set error. The optimal dimension 29 is marked with a star.

C. Performance Comparisons and Discussion

Table I shows the testing set classification errors for all the classifiers considered. The optimal cross-validated parameters are also shown for reference. In the range of 2.5–4.0%, where the results for our best classifiers lie, the standard deviation of the error rate as computed from the binomial distribution is about 0.2 percentage points, cf. [4, ch. 10.3]. The RKDA, MLP, MLP WD, L-k-NN, and LVQ algorithms need initialization that makes their testing set performance depend either on a set of randomly generated real numbers or the particular subset of training data used for initialization. To account for this, each of these methods was evaluated by computing the average performance and its standard deviation in ten independent trials. During testing, the MLP optimizations were started from three random initial points with the values for the weights drawn randomly, similarly as during cross-validation. The MLP having the least sum of squared errors on the training set was then tested on the testing set, and this procedure was repeated ten times to obtain the results. The MLP WD classifier was tested similarly, with the difference that the best of the three runs was determined on the basis of the weight decay criterion (13) and that the inputs were rescaled so that all the components of the training set had mean zero and unit variance. The transformation determined based on the training set was also used to rescale the testing set.

In the following, we first comment on the performance of each classifier and classifier group in turn, and then make some overall conclusions. Of the normality-based classifiers, LDA was clearly inferior to QDA at each tested dimension. RDA selected at each dimension the parameter λ as zero but, at the selected optimal dimension, the parameter γ controlling the shrinkage toward a multiple of the identity matrix was selected nonzero so that shrinkage did happen, and this turned out to be beneficial.

Of the two KDA methods, KDA2 seems to perform slightly better than KDA1. However, a kernel classifier based on a sample of this large size is slow to evaluate and might be too slow to use in practice. The classification error of RKDA is higher than that of KDA but as a tradeoff there is a dramatic drop in classifier complexity: where KDA uses 894 kernels per class, RKDA needs only 35. At the expense of increasing the complexity and the required training time, the discrimination performance of RKDA could be further improved by using more kernels.

The results obtained with the MLP classifier are not particularly good, which is probably due to overfitting. Introduction of the weight decay penalty improved the results greatly.

TABLE I
COMPARISON OF CLASSIFICATION PERFORMANCE FOR HANDWRITTEN DIGIT DATA. SHOWN ARE TESTING SET CLASSIFICATION ERRORS AND, IN PARENTHESES, ESTIMATED STANDARD DEVIATION IN TEN INDEPENDENT TRIALS FOR CERTAIN CLASSIFIERS. WHERE APPLICABLE, CROSS-VALIDATED PARAMETERS ARE GIVEN. THOSE PARAMETERS GIVEN WITHIN SQUARE BRACKETS WERE DETERMINED WITHOUT DIRECT CROSS-VALIDATION, E.G., TAKEN FROM THE CLASSIFIER ON THE PREVIOUS LINE.

Excluding the committee classifier, the best results were obtained with LLR. Unfortunately, this method is impractically slow since the weighted linear least squares problem has to be solved separately at each pattern to be classified. However, other classifiers operating on roughly the same principles might achieve good performance as well. A particularly promising candidate in this respect would be the weighted linear predictor introduced in [93].

The performance of the tree classifier in this first case study was disappointing. The natural approach is to use full 64-component pattern vectors and to let the algorithm itself decide how many variables (pattern components) an optimal tree needs. Using the default tree growing stopping parameter values led to a classifier with 54 terminal nodes that actually utilized 15 out of the 64 possible variables. The testing error was 23.6%. The complexity cost (15) decreased monotonically with increasing tree size and it was not possible to find a natural optimal pruning size. On the basis of training error, the best pattern vector dimension appeared to be 16. The default parameters were adjusted to allow for larger trees but due to the heavy memory requirements it became quickly impossible to use cross-validated pruning. The best results were obtained by growing huge trees with no training error and using them as such. The tree grown with 16-dimensional data had 849 terminal nodes and testing set error 16.8%, which is the result reported in Table I.

The natural approach for FDA/MARS is also to use the full 64-component pattern vectors and to let the algorithm pick the optimal model architecture. The default component models are of additive type where products of variables are not allowed in the expansion (16). The algorithm selected models with 116 terms and the testing error was 8.5%. Experimenting with different designs suggested that the training error could be reduced by allowing higher order interactions and more terms in the models. The required computation times and memory constraints however forced us to limit experimentation to 32-component pattern vectors and second-order interactions. The best computationally feasible classifier used 195-term MARS expansions, three times the default maximum allowed model size, and the testing error was 6.3%.

LVQ used the pattern vector dimension which had produced the best results with the 3-NN classifier. The codebook was initialized using pattern vectors from each class. The classifier was trained using ten epochs of LVQ1 and a varying number of LVQ2 epochs (for definitions of the LVQ variants, see [64]). The learning coefficient was linearly decreased from 0.2 to zero and LVQ2 used a relative window width of 0.5. Cross-validation suggested an optimal codebook size and the use of one LVQ2 training epoch.

Of the two subspace classifiers, the performance of the ALSM turned out to be very good. ALSM training was in most cases terminated when the training set error decreased to zero. Although the classification error for the independent testing sample also decreased in this process, it still probably remains above the level achievable with a greater amount of training data.

We next tested classification with the reject option. It is well known that if exceptional handling is possible, using the reject option can improve the performance of a classifier. Fig. 4 illustrates rejection using (3). We show the classification error percentage as a function of the size of the reject class for the LLR method, which was chosen because it had the best error rate of all the single classifiers tested. The results are shown for the testing sample only; the cross-validation curve resembles closely the curve shown here.

Fig. 4. Rejection curve for LLR with the handwritten digit data. The value of the threshold parameter is given at selected points. For comparison, the stars indicate the results obtained with the committee classifier. The leftmost star is with no rejection; for the next star, the pattern is rejected when all three members vote differently, and for the rightmost star the pattern is rejected unless the committee is unanimous.

Combining several different classifiers in a committee can provide an improvement over the use of a single classifier [94]. As a final experiment we formed a committee from the LLR, ALSM, and L-3-NN classifiers, all of which performed individually very well. A majority voting rule was used and, in situations where all three classifiers disagreed, the decision of the L-3-NN classifier was enforced. Using this voting strategy the testing set classification error for the committee was 2.5%. Marking the input vector as rejected if one or two of the classifiers disagree with the committee decision gives the two additional points in the reject-error plane shown in Fig. 4. Letting either the LLR or the ALSM classifier break the ties would in fact result in better performance but, in accordance with the single classifier design principles, we wanted to avoid using testing set performance to select an optimal voting strategy. Note however that this ideal is still partially compromised as the three committee members were selected on the basis of their good testing set performance.

To summarize the results for single classifiers, the LLR method was the best classifier from the point of view of classification accuracy, with the ALSM following close behind. However, the amount of computation required for these two classifiers is quite different: the computational demands of the LLR classifier would make the method impossible to use in a practical application. As for the ALSM, the classification rule is based on a number of inner products and it is therefore relatively fast and simple. Furthermore, within the statistical error, several of the other classifiers have similar performance. The QDA and RDA methods which are based on the normality assumption did surprisingly well in our study. This is probably due to the fact that, for our data, the KL transformed vectors appear to be approximately normally distributed. The L-3-NN method and ALSM were better in our study than their nonadaptive counterparts, 3-NN and CLAFIC, respectively. Although L-k-NN and ALSM (or the basic learning subspace method) are not directly neural-network classifiers, they have been strongly motivated by neural algorithms and seem to be able to improve the classification performance by decision-directed learning using misclassified sample vectors.
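A minimal sketch of the committee logic described above, assuming three member predictions per pattern: majority vote, tie-breaking by a designated member (the paper enforced the L-3-NN decision), and an optional strict reject when the vote is not unanimous. The function name and argument conventions are illustrative.

```python
import numpy as np

def committee_vote(predictions, tiebreaker_index=2, reject_on_disagreement=False):
    """Majority vote over the member predictions for one pattern.

    predictions: list of class labels, one per committee member.
    tiebreaker_index: member whose decision is used when all members disagree.
    If reject_on_disagreement is True, return None unless the vote is unanimous.
    """
    labels, counts = np.unique(predictions, return_counts=True)
    if reject_on_disagreement and len(labels) > 1:
        return None                           # reject an ambiguous pattern
    if counts.max() == 1:                     # all members disagree: use tie-breaker
        return predictions[tiebreaker_index]
    return labels[int(np.argmax(counts))]

# committee_vote([llr_pred, alsm_pred, l3nn_pred])
```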

V. CASE STUDY 2: PHONEME CLASSIFICATION

The phoneme classification data of the second case study were first studied in [95]. This data set has also recently been used in an extensive comparison study of neural and statistical classifiers [19]. There are two classes that correspond to nasal (class 1) and oral (class 2) French vowels. We chose to consider this example because its data characteristics are very different from those of the first case study. In addition to having just two classes, the dimension of the data is only five and the class distributions are highly nonnormal. The first half of the "phoneme CR" database, which consists of 3818 class 1 patterns and 1586 class 2 patterns, was used as a training set and the second half was used as an independent testing set.

TABLE II
COMPARISON OF CLASSIFICATION PERFORMANCE FOR PHONEME DATA. NOTATION AS IN TABLE I

The performance of the tested classifiers is shown in Table II. The approximate standard deviation of the error rate in the range 11-24% is 0.6-0.8%. Cross-validation similar to that employed in the first case study was used to obtain the various classifier parameters. The major difference was that the pattern vector dimension was fixed at five. Also, it was now computationally feasible to use ten-fold cross-validation with RKDA. No input rescaling was used in MLP WD since the database has already been processed so that each component of the pattern vector has mean zero and unit variance. Further, ten-fold cross-validation was used to determine the interaction order and the maximum number of model terms in the forward stage of MARS.
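The quoted standard deviations are consistent with the usual binomial approximation sqrt(p(1 - p)/n) for the standard deviation of an estimated error rate p, evaluated with n = 2702 independent testing patterns (half of the 3818 + 1586 = 5404 pattern vectors). The short computation below is included only as a check of this reading and is not taken from the paper.

```python
from math import sqrt

n = (3818 + 1586) // 2          # size of the independent testing set
for p in (0.11, 0.24):          # error rates at the ends of the quoted range
    se = sqrt(p * (1 - p) / n)  # binomial standard deviation of the estimate
    print(f"p = {p:.2f}: standard deviation {100 * se:.2f}%")
# prints approximately 0.60% and 0.82%, matching the 0.6-0.8% quoted above
```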

The default tree classifier now performed quite well, suggesting that the poor performance in the first case study was perhaps largely caused by the particular software and hardware used. The performance was slightly better for a tree with no training error, and this is the result given in Table II. Again, the complexity cost (15) decreased monotonically and there was no apparent way to prune the trees.

Exploratory data analysis using low-dimensional projections shows that the phoneme data have a very rich internal structure. The class distributions contain clusters and parts of the data appear to lie on low-dimensional manifolds. It seems quite obvious that classifiers with relatively simple decision boundaries cannot then perform well and, in fact, the classifiers with linear or quadratic discriminant functions (LDA, QDA, RDA, CLAFIC, and ALSM) clearly fall behind the more flexible methods. For best performance, a classifier no less complex than the training set itself appears to be needed. Such classifiers (KDA1, KDA2, LLR, 1-NN, and 3-NN) are able to separate the classes effectively, but their demands on computing resources (storage capacity and/or speed) may make their use difficult in practice. An alternative is provided by the more neural algorithms (RKDA, MLP, MLP WD, tree classifier, FDA/MARS, L-3-NN, and LVQ) that offer a tradeoff between complexity and classification error. Note also that the statistical error in the test results is quite large, making it impossible to rank the best classifiers with confidence.
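The paper does not specify which low-dimensional projections were used in this exploratory analysis. As one simple way to obtain such views of the five-dimensional phoneme data, the sketch below plots the two leading principal components; the arrays X and y stand in for the actual database and are assumptions of this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_leading_components(X, y):
    """Scatter plot of the data projected onto its two leading principal components.

    X : (n, 5) array of pattern vectors
    y : (n,) array of class labels (1 or 2)
    """
    Xc = X - X.mean(axis=0)                # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    Z = Xc @ Vt[:2].T                      # coordinates in the leading 2-D subspace
    for label, marker in ((1, "o"), (2, "x")):
        sel = y == label
        plt.scatter(Z[sel, 0], Z[sel, 1], s=5, marker=marker, label=f"class {label}")
    plt.xlabel("1st principal component")
    plt.ylabel("2nd principal component")
    plt.legend()
    plt.show()
```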

VI. SUMMARY

Neural and statistical classifiers were discussed both from a theoretical point of view and in the context of two case studies involving realistic pattern recognition problems. The starting point of neural-network algorithms is quite different from that of statistical classifiers. Neural networks are on-line learning systems, intrinsically nonparametric and model-free, with learning algorithms typically of the error-correction type. For some neural learning algorithms, not even an explicit cost function exists. Thus, the computational aspect is central: at least in principle, the computations should be on-line, simple, and local, and avoid heavy numerical operations that use a fixed training set in batch mode. Most statistical classifiers, on the other hand, can be interpreted as following the Bayesian classification principle, either estimating the class densities and a priori probabilities explicitly, or estimating the optimal discriminant functions directly by regression. Computational limits are set by current conventional hardware.
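For reference, the Bayesian classification principle mentioned here is the textbook rule of assigning a pattern to the class with the largest posterior probability; spelled out (a standard identity, not a contribution of this paper):

```latex
% Bayes rule for classification: the two routes mentioned in the text
% correspond to estimating the right-hand side either through the class
% densities and priors, or through the posteriors (the optimal discriminant
% functions) directly by regression.
\hat{\jmath}(\mathbf{x})
  = \arg\max_{j} P(\omega_j \mid \mathbf{x})
  = \arg\max_{j} p(\mathbf{x} \mid \omega_j)\, P(\omega_j).
```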

Despite these theoretical differences, which may be important for the long-range goals of artificial neural systems, in practice it is possible to view present-day neural classifiers in a statistical framework as well. The power of "neural" algorithms then lies in the increased flexibility of the architecture, in the sense of the richness of the discriminant function family, and in the possibility of incremental learning.

As examples we considered handwritten digit recognition with standard preprocessing and KL feature extraction, and phoneme classification. The performance of a number of the most typical neural classifiers, including the MLP and LVQ, as well as a set of various types of statistical classifiers, was estimated in these two cases. The statistical classifiers included LDA, QDA, RDA, KDA, a tree classifier, FDA with MARS, the k-NN classifier, and CLAFIC. Four methods of our own were used as well: RKDA, the L-k-NN classifier, the ALSM, and a modified version of KDA. Further, committee classifiers and classification with rejection were considered. The classifier designs were based on training set cross-validation, and the classification errors were estimated using an independent testing set.

For the digit data, the LLR method, although computationally prohibitively heavy, was the best classifier from the point of view of classification accuracy, with the ALSM following close behind. For methods having both a learning and a nonlearning version, error-correcting learning seemed to give an advantage. In phoneme classification, the data have a rich internal structure and the kernel and nearest-neighbor methods (KDA, k-NN) perform best. The more neural algorithms provide attractive alternatives by combining good classification performance and less complex design.

ACKNOWLEDGMENT

The authors wish to thank A. Hämäläinen for helping to compute the RKDA results and P. Tikkanen for running the initial tests for certain classifiers.

REFERENCES

[1] E. Oja, "Self-organizing maps and computer vision," in Neural Networks for Perception, H. Wechsler, Ed., vol. I. New York: Academic, ch. II.9, 1992, pp. 368-385.
[2] R. Duda and P. Hart, Pattern Classification and Scene Analysis. New York: Wiley, 1973.
[3] T. Y. Young and T. W. Calvert, Classification, Estimation, and Pattern Recognition. New York: Elsevier, 1974.
[4] P. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. Englewood Cliffs, NJ: Prentice-Hall, 1982.


[5] C. Therrien, Decision, Estimation, and Classification: An Introduction to Pattern Recognition and Related Topics. New York: Wiley, 1989.
[6] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. New York: Academic, 1990.
[7] G. McLachlan, Discriminant Analysis and Statistical Pattern Recognition. New York: Wiley, 1992.
[8] R. J. Schalkoff, Pattern Recognition: Statistical, Structural, and Neural Approaches. New York: Wiley, 1992.
[9] K. S. Fu, Syntactic Pattern Recognition and Applications. Englewood Cliffs, NJ: Prentice-Hall, 1982.
[10] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford, U.K.: Oxford Univ. Press, 1995.
[11] B. Ripley, Pattern Recognition and Neural Networks. Cambridge, U.K.: Cambridge Univ. Press, 1996.
[12] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag, 1996.
[13] J. Hartigan, Clustering Algorithms. New York: Wiley, 1975.
[14] B. Ripley, "Neural networks and related methods for classification," J. Roy. Statist. Soc., Series B, vol. 56, no. 3, pp. 409-456, 1994.
[15] B. Cheng and D. Titterington, "Neural networks: A review from a statistical perspective," Statist. Sci., vol. 9, no. 1, pp. 2-54, 1994.
[16] Y. Idan, J.-M. Auger, N. Darbel, M. Sales, R. Chevallier, B. Dorizzi, and G. Cazuguel, "Comparative study of neural networks and nonparametric statistical methods for off-line handwritten character recognition," in Proc. Int. Conf. Artificial Neural Networks, I. Aleksander and J. Taylor, Eds. Brighton, U.K.: North-Holland, vol. 2, Sept. 1992, pp. 1607-1610.
[17] L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, L. D. Jackel, Y. LeCun, U. A. Müller, E. Säckinger, P. Y. Simard, and V. Vapnik, "Comparison of classifier methods: A case study in handwritten digit recognition," in Proc. 12th Int. Conf. Patt. Recognition, vol. II, Oct. 1994, pp. 77-82.
[18] D. Michie, D. J. Spiegelhalter, and C. C. Taylor, Eds., Machine Learning, Neural, and Statistical Classification. London: Ellis Horwood Ltd., 1994.
[19] F. Blayo et al., "Deliverable R3-B4-P, task B4: Benchmarks," Tech. Rep., ELENA (Enhanced Learning for Evolutive Neural Architecture), ESPRIT Basic Res. Project 6891, 1995; available ftp://ftp.dice.ucl.ac.be/pub/neural-nets/ELENA/databases/Benchmarks.ps.Z
[20] J. L. Blue, G. T. Candela, P. J. Grother, R. Chellappa, and C. L. Wilson, "Evaluation of pattern classifiers for fingerprint and OCR applications," Pattern Recognition, vol. 27, no. 4, pp. 485-501, 1994.

[21] C. Y. Suen, M. Berthold, and S. Mori, "Automatic recognition of handprinted characters—The state of the art," Proc. IEEE, vol. 68, no. 4, pp. 469-487, Apr. 1980.
[22] J. Mantas, "An overview of character recognition methodologies," Pattern Recognition, vol. 19, no. 6, pp. 425-430, 1986.
[23] S. Impedovo, L. Ottaviano, and S. Occhinegro, "Optical character recognition—A survey," Int. J. Pattern Recognition Artificial Intell., vol. 5, nos. 1/2, pp. 1-24, 1991.
[24] S. Mori, C. Y. Suen, and K. Yamamoto, "Historical review of OCR research and development," Proc. IEEE, vol. 80, no. 7, pp. 1029-1058, July 1992.
[25] C. Y. Suen, C. Nadal, R. Legault, T. A. Mai, and L. Lam, "Computer recognition of unconstrained handwritten numerals," Proc. IEEE, vol. 80, no. 7, pp. 1162-1180, July 1992.
[26] C. Y. Suen, R. Legault, C. Nadal, M. Cheriet, and L. Lam, "Building a new generation of handwriting recognition systems," Pattern Recognition Lett., vol. 14, no. 4, pp. 303-315, Apr. 1993.
[27] S. N. Srihari, "Recognition of handwritten and machine-printed text for postal address interpretation," Pattern Recognition Lett., vol. 14, no. 4, pp. 291-302, Apr. 1993.
[28] M. Gilloux, "Research into the new generation of character and mailing address recognition systems at the French post office research center," Pattern Recognition Lett., vol. 14, no. 4, pp. 267-276, Apr. 1993.
[29] T. Wakahara, "Toward robust handwritten character recognition," Pattern Recognition Lett., vol. 14, no. 4, pp. 345-354, Apr. 1993.
[30] H. S. Baird, "Recognition technology frontiers," Pattern Recognition Lett., vol. 14, no. 4, pp. 327-334, Apr. 1993.
[31] K. Fukushima, "Neocognitron: A hierarchical neural network capable of visual pattern recognition," Neural Networks, vol. 1, no. 2, pp. 119-130, 1988.
[32] I. Guyon, "Applications of neural networks to character recognition," Int. J. Pattern Recognition Artificial Intell., vol. 5, nos. 1/2, pp. 353-382, 1991.
[33] D.-S. Lee, S. N. Srihari, and R. Gaborski, "Bayesian and neural-network pattern recognition: A theoretical connection and empirical results with handwritten characters," in Artificial Neural Networks and Statistical Pattern Recognition, I. K. Sethi and A. K. Jain, Eds. New York: Elsevier, 1991, pp. 89-108.

[34] R. Duin, "A note on comparing classifiers," Pattern Recognition Lett., vol. 17, pp. 529-536, 1996.
[35] L. Holmström, S. Sain, and H. Miettinen, "A new multivariate technique for top quark search," Comp. Phys. Comm., vol. 88, pp. 195-210, 1995.
[36] J. Friedman, "Regularized discriminant analysis," J. Amer. Statist. Assoc., vol. 84, no. 405, pp. 165-175, 1989.
[37] B. W. Silverman, Density Estimation for Statistics and Data Analysis. London: Chapman and Hall, 1986.
[38] D. Scott, Multivariate Density Estimation: Theory, Practice, and Visualization. New York: Wiley, 1992.
[39] M. Wand and M. Jones, Kernel Smoothing. London: Chapman and Hall, 1995.
[40] E. Fix and J. Hodges, "Discriminatory analysis—Nonparametric discrimination: Consistency properties," Project 21-49-004, USAF School of Aviation Medicine, Randolph Field, TX, Tech. Rep. 4, 1951.
[41] D. Hand, Kernel Discriminant Analysis. Chichester, U.K.: Res. Studies Press, 1982.
[42] B. Silverman and M. Jones, "E. Fix and J. L. Hodges (1951): An important contribution to nonparametric discriminant analysis and density estimation—Commentary on Fix and Hodges (1951)," Int. Statist. Rev., vol. 57, no. 3, pp. 233-247, 1989.
[43] D. F. Specht, "Probabilistic neural networks," Neural Networks, vol. 3, no. 1, pp. 109-118, 1990.
[44] R. Redner and H. Walker, "Mixture densities, maximum likelihood, and the EM algorithm," SIAM Rev., vol. 26, no. 2, Apr. 1984.
[45] H. Traven, "A neural-network approach to statistical pattern classification by semiparametric estimation of probability density functions," IEEE Trans. Neural Networks, vol. 2, pp. 366-377, 1991.
[46] C. E. Priebe and D. J. Marchette, "Adaptive mixtures: Recursive nonparametric pattern recognition," Pattern Recognition, vol. 24, no. 12, pp. 1197-1209, 1991.
[47] C. E. Priebe and D. J. Marchette, "Adaptive mixture density estimation," Pattern Recognition, vol. 26, no. 5, pp. 771-785, 1993.
[48] L. Holmström and A. Hämäläinen, "The self-organizing reduced kernel density estimator," in Proc. 1993 IEEE Int. Conf. Neural Networks, San Francisco, CA, vol. 1, Mar. 28-Apr. 1, 1993, pp. 417-421.
[49] T. Hastie and R. Tibshirani, "Discriminant analysis by Gaussian mixtures," J. Roy. Statist. Soc., Series B, vol. 58, no. 1, pp. 155-176, 1996.

[50] S. Haykin, Neural Networks: A Comprehensive Foundation. New York: Macmillan, 1994.
[51] J. Friedman and W. Stuetzle, "Projection pursuit regression," J. Amer. Statist. Assoc., vol. 76, no. 376, pp. 817-823, 1981.
[52] T. Flick, L. Jones, R. Priest, and C. Herman, "Pattern classification using projection pursuit," Pattern Recognition, vol. 23, no. 12, pp. 1367-1376, 1990.
[53] T. Hastie and R. Tibshirani, Generalized Additive Models. London: Chapman and Hall, 1990.
[54] J. Friedman, "Multivariate adaptive regression splines," Ann. Statist., vol. 19, pp. 1-141, 1991.
[55] P. Koistinen and L. Holmström, "Kernel regression and backpropagation training with noise," in Advances in Neural Information Processing Systems 4, J. Moody, S. Hanson, and R. Lippmann, Eds. San Mateo, CA: Morgan Kaufmann, 1992, pp. 1033-1039.
[56] W. Cleveland and S. Devlin, "Locally weighted regression: An approach to regression analysis by local fitting," J. Amer. Statist. Assoc., vol. 83, pp. 596-610, 1988.
[57] W. Cleveland and C. Loader, "Smoothing by local regression: Principles and methods," Tech. Rep., AT&T Bell Laboratories, 1995; available http://netlib.att.com/netlib/att/stat/doc/95.3.ps.Z
[58] D. Specht, "A general regression neural network," IEEE Trans. Neural Networks, vol. 2, pp. 568-576, Nov. 1991.
[59] H. White, "Learning in artificial neural networks: A statistical perspective," Neural Computa., vol. 1, pp. 425-464, 1989.
[60] M. D. Richard and R. P. Lippmann, "Neural network classifiers estimate Bayesian a posteriori probabilities," Neural Computa., vol. 3, no. 4, pp. 461-483, 1991.
[61] J. Bridle, "Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters," in Advances in Neural Information Processing Systems 2, D. Touretzky, Ed. San Mateo, CA: Morgan Kaufmann, 1990, pp. 211-217.
[62] W. Highleyman, "Linear decision functions with application to pattern recognition," Proc. IRE, vol. 50, pp. 1501-1514, 1962.
[63] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. London: Chapman and Hall, 1984.
[64] T. Kohonen, Self-Organizing Maps. New York: Springer-Verlag, 1995.
[65] E. Oja, Subspace Methods of Pattern Recognition. Letchworth, U.K.: Res. Studies Press, 1983.
[66] W. S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," Bull. Math. Biophys., vol. 5, pp. 115-133, 1943.


[67] F. Rosenblatt, "The perceptron: A probabilistic model for information storage and organization in the brain," Psych. Rev., vol. 65, pp. 386-408, 1958.
[68] F. Rosenblatt, Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington, D.C.: Spartan, 1961.
[69] T. J. Sejnowski and C. R. Rosenberg, "Parallel networks that learn to pronounce English text," J. Complex Syst., vol. 1, no. 1, pp. 145-168, Feb. 1987.
[70] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computa., vol. 1, pp. 541-551, 1989.
[71] T. Kohonen, "The 'neural' phonetic typewriter," Comput., vol. 21, no. 3, pp. 11-22, 1988.
[72] J. Lampinen and E. Oja, "Distortion tolerant pattern recognition based on self-organizing feature extraction," IEEE Trans. Neural Networks, vol. 6, pp. 539-547, 1995.
[73] G. Tutz, "An alternative choice of smoothing for kernel-based density estimates in discrete discriminant analysis," Biometrika, vol. 73, pp. 405-411, 1986.
[74] K. Fukunaga and R. R. Hayes, "The reduced Parzen classifier," IEEE Trans. Pattern Anal. Machine Intell., vol. 11, pp. 423-425, Apr. 1989.
[75] K. Fukunaga and J. M. Mantock, "Nonparametric data reduction," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-6, no. 1, pp. 115-118, Jan. 1984.
[76] I. Grabec, "Self-organization of neurons described by the maximum-entropy principle," Biol. Cybern., vol. 63, pp. 403-409, 1990.
[77] P. Smyth and J. Mellstrom, "Fault diagnosis of antenna pointing systems using hybrid neural network and signal processing models," in Advances in Neural Information Processing Systems 4, J. Moody, S. Hanson, and R. Lippmann, Eds. San Mateo, CA: Morgan Kaufmann, 1992, pp. 667-674.
[78] J. X. Wu and C. Chan, "A three-layer adaptive network for pattern density estimation and classification," Int. J. Neural Syst., vol. 2, no. 3, pp. 211-220, 1991.
[79] J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. Math. Statist. Prob., L. M. LeCam and J. Neyman, Eds. Berkeley, CA: U.C. Berkeley Press, 1967, pp. 281-297.
[80] D. Shanno and K. Phua, "Remark on algorithm 500," ACM Trans. Math. Software, vol. 6, no. 4, pp. 618-622, 1980.
[81] T. Hastie and C. Loader, "Local regression: Automatic kernel carpentry," Statist. Sci., vol. 8, pp. 120-143, 1993.
[82] J. Morgan and J. Sonquist, "Problems in the analysis of survey data and a proposal," J. Amer. Statist. Assoc., vol. 58, pp. 415-434, 1963.
[83] R. Becker, J. Chambers, and A. Wilks, The New S Language. New York: Chapman and Hall, 1988.
[84] J. Chambers and T. Hastie, Eds., Statistical Models in S. New York: Chapman and Hall, 1992.
[85] W. Venables and B. Ripley, Modern Applied Statistics with S-Plus. New York: Springer-Verlag, 1994.
[86] L. Clark and D. Pregibon, Tree-Based Models. New York: Chapman and Hall, ch. 9, 1992.
[87] P. Craven and G. Wahba, "Smoothing noisy data with spline functions," Numerical Math., vol. 31, pp. 317-403, 1979.
[88] T. Hastie, R. Tibshirani, and A. Buja, "Flexible discriminant analysis by optimal scoring," J. Amer. Statist. Assoc., vol. 89, pp. 1255-1270, 1994.
[89] J. Laaksonen and E. Oja, "Classification with learning k-nearest neighbors," in Proc. Int. Conf. Neural Networks, vol. 3, Washington, D.C., June 3-6, 1996, pp. 1480-1483.
[90] S. Watanabe, P. F. Lambert, C. A. Kulikowski, J. L. Buxton, and R. Walker, "Evaluation and selection of variables in pattern recognition," in Computer and Information Sciences II, J. Tou, Ed. New York: Academic, 1967.
[91] M. D. Garris, J. L. Blue, G. T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C. L. Wilson, "NIST form-based handprint recognition system," National Institute of Standards and Technology, Gaithersburg, MD, Tech. Rep., 1994.
[92] C. Stone, "Cross-validatory choice and assessment of statistical predictions," J. Roy. Statist. Soc. B, vol. 36, pp. 111-147, 1974.
[93] K. Stokbro, D. Umberger, and J. Hertz, "Exploiting neurons with localized receptive fields to learn chaos," Complex Syst., vol. 4, pp. 603-622, 1990.
[94] M. P. Perrone, "Pulling it all together: Methods for combining neural networks," in Neural Information Processing Systems 6, J. D. Cowan, G. Tesauro, and J. Alspector, Eds. San Francisco, CA: Morgan Kaufmann, 1994, pp. 1188-1189.
[95] P. Alinat, "Periodic progress report 4," Tech. Rep., ROARS Project ESPRIT II, no. 5516, Thomson rep. TS. ASM 93/S/EGS/NC/079, Feb. 1993.

Lasse Holmström (M'88) received the B.S., M.S., and Ph.L. degrees in mathematics from the University of Helsinki, Finland, in 1974, 1975, and 1977, respectively. He received the Ph.D. degree in mathematics from Clarkson College of Technology, Potsdam, NY, in 1980.
He is the Head of the Division of Mathematical Methods of Information Technology at the Rolf Nevanlinna Institute, University of Helsinki. He has held research positions at the University of Helsinki, Helsinki University of Technology, and the Academy of Finland, and Visiting Assistant Professorships at Clarkson College of Technology and Vassar College, Poughkeepsie, NY, and, most recently, a Visiting Professorship of Statistics at Rice University, Houston, TX, in 1993. His current research interests include neural networks, pattern recognition, as well as computational and mathematical statistics, in particular nonparametric function estimation. His earlier work has also included functional analysis, geometric modeling, and computer-aided design.

Petri Koistinen received the Ph.D. degree in information technology from the Helsinki University of Technology, Finland, in 1996. He is currently a Researcher in the Rolf Nevanlinna Institute at the University of Helsinki.
His research interests include the theory and applications of neural networks, pattern recognition, and computer vision.

Jorma Laaksonen (S'96) received the M.S. and Lic. Tech. degrees in 1990 and 1991, respectively, and is currently finishing his doctoral thesis at the Laboratory of Computer and Information Science, Helsinki University of Technology, Finland.
His research interests are in neural networks, pattern recognition, and image analysis.

Erkki Oja (S'76-M'77-SM'90) received the Dr. Tech. degree (with distinction) from Helsinki University of Technology, Finland, in 1977.
He is Professor of Computer Science at the Laboratory of Computer and Information Science, Helsinki University of Technology. He is the author of a number of journal papers and book chapters on pattern recognition, computer vision, and neural computing, and of the book Subspace Methods of Pattern Recognition (Letchworth, U.K.: Res. Studies Press), which has been translated into Chinese and Japanese. His present research interests include subspace, PCA, and self-organizing networks, and applying artificial neural networks to computer vision and signal processing.
Dr. Oja has served on the scientific and organization committees of a number of conferences, most recently ICANN-96, ICNN-96, ICPR-96, and ICONIP-96. He is a Member of the Finnish Academy of Sciences, Past Chairman of the Finnish Pattern Recognition Society, a Member of the Governing Board of the International Association for Pattern Recognition (IAPR), an IAPR Fellow, Vice President of the European Neural Network Society (ENNS), and a Member of the Neural Networks Technical Committee of the IEEE. He is a Member of the editorial boards of the journals Neural Computation, Neural Computing Surveys, Pattern Recognition Letters, and the IEEE TRANSACTIONS ON NEURAL NETWORKS.