
Data Mining and Knowledge Discovery, 11, 35–55, 2005. © 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.

Neural and Wavelet Network Models for Financial Distress Classification

VICTOR M. BECERRA [email protected]
Department of Cybernetics, University of Reading, Reading RG6 6AY, United Kingdom

ROBERTO K. H. GALVAO [email protected]
Instituto Tecnologico de Aeronautica, Div. Engenharia Eletronica, Sao Jose dos Campos–SP, 12228-900, Brazil

MAGDA ABOU-SEADA [email protected]
Middlesex University, Business School, London NW4 4BT, United Kingdom

Editor: Geoff Webb

Received September 14, 2004; Revised March 31, 2005; Accepted April 4, 2005

Abstract. This work analyzes the use of linear discriminant models, multi-layer perceptron neural networks and wavelet networks for corporate financial distress prediction. Although simple and easy to interpret, linear models require statistical assumptions that may be unrealistic. Neural networks are able to discriminate patterns that are not linearly separable, but the large number of parameters involved in a neural model often causes generalization problems. Wavelet networks are classification models that implement nonlinear discriminant surfaces as the superposition of dilated and translated versions of a single “mother wavelet” function. In this paper, an algorithm is proposed to select dilation and translation parameters that yield a wavelet network classifier with good parsimony characteristics. The models are compared in a case study involving failed and continuing British firms in the period 1997–2000. Problems associated with over-parameterized neural networks are illustrated and the Optimal Brain Damage pruning technique is employed to obtain a parsimonious neural model. The results, supported by a re-sampling study, show that both neural and wavelet networks may be a valid alternative to classical linear discriminant models.

Keywords: financial distress, neural networks, wavelets, finance, classification

1. Introduction

A company is said to be insolvent or under financial distress if it is unable to pay its debts as they become due, which is aggravated if the value of the firm's assets is lower than its liabilities. In most countries, once a company has become insolvent there are several courses of action covered by the relevant laws. Not all of these courses of action necessarily mean the end of a company or its business activity. The primary objective is to recover as much of the money owed to creditors as possible.

Prediction of corporate financial distress is a subject that has received a great deal of interest by many researchers in finance. Distress prediction models are used by a large number of parties, which include lenders, investors, regulatory authorities, government officials, auditors and managers (Foster, 1986).

Supported by FAPESP under Grant 00/09390-6 and by CNPq under a Research Fellowship. This work was performed while R. K. H. Galvao was a visiting lecturer at the Cybernetics Department of the University of Reading, UK.



The development of prediction models for financial distress started in the late 1960's and continues to this day. Arguably the two most influential works were those by Beaver (1966) and Altman (1968). Both presented techniques that have been replicated and improved for different types of firms and in different countries. Beaver identified a single financial ratio (cash flow/total debt) as the best predictor of corporate bankruptcy and used it for univariate classification, which is a statistical technique based on a single variable. Altman was the first to apply the multivariate technique known as discriminant analysis to the failure prediction problem.

The data required to build distress prediction models consist of financial ratios calculated from financial statements. By deflating statistics by size, the use of ratios allows a uniform treatment of both small and large companies. The models are normally validated using a different data set from the one used to build the model.

Although discriminant analysis using linear functions is the method commonly found in the distress analysis literature, it has several disadvantages. First, the patterns need to be linearly separable. Second, samples are assumed to follow a multivariate normal distribution. Third, it is hypothesized that the groups being classified have identical covariances.

To overcome the limitations of linear discriminant analysis, several studies published since 1990 proposed the use of neural network models for financial distress prediction. Neural networks are nonlinear architectures built through the interconnection of simple processing elements, often called neurons. The strength of the connections defines the input-output characteristic of the model. Neural network classifiers are able to discriminate patterns that are not linearly separable and do not require the data to follow any specific probability distribution.

Neural networks have been found to be better classifiers than discriminant analysis methods in a number of works based on financial data from American firms (Odom and Sharda, 1990; Tam and Kiang, 1990, 1992; Coats and Fant, 1993; Wilson and Sharda, 1994). A few studies have also been published on the application of neural networks for financial distress prediction of British companies. Alici (1996) reported a classification accuracy of 80% with a neural network and 73% with a linear discriminant model, using a data set of 46 failed and 46 continuing firms during the period 1987–92. Tyree and Long (1996) applied probabilistic neural networks for assessing the financial distress of 55 failed and 91 continuing British firms. This technique was also found to produce more accurate results than linear discriminant analysis.

However, Altman et al. (1994), who used data from over 1000 continuing and distressed Italian firms, concluded, in contrast to the studies listed above, that discriminant analysis compares favourably with neural networks because of the advantage of making the underlying economic and financial model transparent and easy to interpret. Furthermore, Trigueiros and Taffler (1996) emphasized that neural networks have a greater tendency to overfitting than linear models. This problem, which leads to poor generalization, is due to the large number of parameters usually involved in neural models.

Since a good trade-off between parsimony and discriminating capability is difficult to achieve via trial-and-error, a number of constructive algorithms have been proposed to assist the analyst in defining the neural network structure. Pruning algorithms start with a large number of neurons and subsequently eliminate those that are identified as redundant (Kun et al., 1990; Hassibi et al., 1993). Growing algorithms, on the other hand, start with a small network, which receives new neurons until the performance of the model is considered satisfactory (Ash, 1989; Setiono and Hui, 1995). A third possibility is the use of evolutionary algorithms (Yao, 1999), which aim at optimizing both the structure and the parameter values of the network by using global search methods inspired by the mechanisms of natural evolution. It is worth noting that the majority of these constructive methods employ search algorithms which are sensitive to the set of initial conditions. A technique that does not depend on a good choice for the starting point would thus be desirable.

In this context, the wavelet network arises as an interesting alternative. This network approximates nonlinear functional mappings as the superposition of dilated and translated versions of a single function (Zhang and Benveniste, 1992). This function is localized both in the space and frequency domains, which enables efficient constructive algorithms to be used (Zhang, 1997). Wavelet networks have been successfully applied to the identification of nonlinear dynamic systems (Cannon and Slotine, 1995; Zhang, 1997). Wavelet structures have also been employed to extract features from signals and images which are to be discriminated by linear or neural network classifiers (Szu et al., 1992; Galvao and Yoneyama, 1999). However, the use of wavelet networks to implement the classifier itself is still incipient.

Given the similarities between radial basis function (RBF) and wavelet networks, it is possible to relate the methods for selecting the structure of the former with the method proposed in this paper for structuring the latter. The classical approach to locate the RBF neuron centers is to apply clustering techniques, such as k-means clustering (Moody and Darken, 1989) and vector quantization (Kohonen, 1995), to create templates of the input. As an alternative, the input-output clustering technique (Pedrycz, 1998) determines center locations based on the input and output deviations. In addition to the clustering methods, the orthogonal forward selection algorithm (Chen et al., 1991; Gomm and Yu, 2000) is another frequently used method for RBF center selection, which introduces an orthogonal transform to facilitate the center selection procedure. However, it has been found that the selection of basis vectors produced by the orthogonal forward regression procedure is not the most compact when the approximation is performed using a nonorthogonal basis (Sherstinsky and Picard, 1996). RBF centers can also be determined using the support vector machine (SVM) method (Scholkopf et al., 1997), which determines the structure of the classifier by minimizing the bounds of training error and generalization error. It is often the case that the centers selected using SVM are close to the boundary of the decision surface. In contrast, the centers selected by clustering techniques are stereotypical patterns of the training samples. Another proposed method is to choose RBF centers based on the Fisher ratio class separability measure with the objective of achieving maximum discriminative power (Mao, 2002), which is done using a multi-step procedure that combines the Fisher ratio, an orthogonal transform, and a forward selection search method.

This paper proposes the use of a wavelet network model for the financial distress classification of British firms in a recent period (1997–2000). A constructive algorithm of the growing type is employed in order to obtain a parsimonious classifier. The resulting wavelet network compares favorably with a conventional linear discriminant model and also with a neural network of the multi-layer perceptron type trained with the Optimal Brain Damage pruning algorithm (Kun et al., 1990).

This text is organized as follows. A brief review of linear discriminant analysis is given in Section 2. Section 3 summarizes the fundamentals of multi-layer perceptrons and pruning techniques. Section 4 describes the wavelet network model and the constructive algorithm proposed. The data set employed and the classification results are detailed in Section 5. Concluding remarks and suggestions for further research are given in Section 6.

2. Linear discriminant analysis

The most widely used discriminant analysis method is the one developed by Fisher in 1936, which attempts to maximize the ratio of between-groups and within-groups variances.

Using vector-matrix notation, the following discriminant function can be derived for the case of binary (two groups) classification (Morrison, 1990):

Z(x) = (µ1 − µ2)^T S^{-1} x    (1)

where Z : R^d → R is a discriminant function, x = [x1 x2 … xd]^T is a vector of d classification variables, µ1 ∈ R^d and µ2 ∈ R^d are the sample mean vectors of each group, and S is the common sample covariance matrix with dimensions d × d.

Note that equation (1) can also be written as:

Z = w1 x1 + w2 x2 + · · · + wd xd = w^T x    (2)

where w = [w1 w2 … wd]^T is a vector of coefficients:

w = S^{-1} (µ1 − µ2)    (3)

The cut-off value for classification can be calculated as the midpoint of the mean scores of the two samples:

zc = (1/2) (µ1 − µ2)^T S^{-1} (µ1 + µ2)    (4)

A given vector x should be assigned to population 1 if Z(x) > zc, and to population 2 otherwise.
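
A minimal numpy sketch of Eqs. (1)–(4) is given below; the arrays X1 and X2, which hold the samples of each group as rows, and the use of the pooled sample covariance matrix are assumptions made for illustration.

```python
import numpy as np

def lda_coefficients(X1, X2):
    """Two-group linear discriminant of Eqs. (1)-(4).

    X1, X2: arrays of shape (n1, d) and (n2, d) holding the samples of
    groups 1 and 2 as rows (hypothetical inputs). Returns the coefficient
    vector w of Eq. (3) and the cut-off value zc of Eq. (4).
    """
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    n1, n2 = len(X1), len(X2)
    # Pooled (common) sample covariance matrix S
    S = ((n1 - 1) * np.cov(X1, rowvar=False)
         + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    w = np.linalg.solve(S, mu1 - mu2)   # Eq. (3): w = S^-1 (mu1 - mu2)
    zc = 0.5 * w @ (mu1 + mu2)          # Eq. (4), rewritten using w
    return w, zc

def classify(x, w, zc):
    # Assign x to population 1 if Z(x) = w^T x exceeds the cut-off
    return 1 if w @ x > zc else 2
```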

3. The multi-layer perceptron neural network

Multi-layer perceptrons are distributed, adaptive and generally nonlinear learning machines built from the interconnection of simple processing elements, usually called neurons. The number of elements and their interconnection pattern define the network structure. The signals flowing on the connections are scaled by adjustable parameters called synaptic weights. Each processing element has several inputs and a single output, which is obtained as a nonlinear static function of the sum of the weighted inputs.


Multi-layer perceptrons can implement arbitrary nonlinear discriminant functions. This is done by adjusting the synaptic weights in a process called training. Training data consist of input and output values that the network is required to learn.

3.1. Mathematics of the multi-layer perceptron

Define the net input to a processing element j for the kth sample as

v_j(k) = Σ_{i=1}^{d} w_ji x_i(k) + b_j    (5)

where x_i(k), i = 1, …, d are the values of the input variables for sample k, w_ji is the weight of the connection between input variable i and processing element j, and b_j is a bias.

The output of the processing element is the result of passing the scalar value v_j(k) through its activation function ϕ_j(·):

y_j(k) = ϕ_j(v_j(k))    (6)

The actual shape of the activation function ϕ_j varies between applications, but a typical choice for classification problems is the logistic sigmoid function, which has a range of values between zero and one:

ϕ_j(v_j(k)) = 1 / (1 + e^{-v_j(k)})    (7)

Figure 1 shows a typical multi-layer perceptron, with three input signals in the input layer, one hidden layer with two neurons and a single output signal.

For the multi-layer perceptron shown in figure 1 it is possible to write the output variable as follows:

y = ϕ2(W2ϕ1(W1x + b1) + b2) (8)

Figure 1. A typical multi-layer perceptron.


where ϕ1 is the activation function of the hidden layer, which is assumed to apply element-wise to its vector argument, b2 is the scalar bias in the output layer, W1 and W2 are weight matrices and b1 is a bias vector.
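
A minimal sketch of the forward pass of Eq. (8), using the logistic activation of Eq. (7); the array shapes follow the notation above and are illustrative assumptions.

```python
import numpy as np

def logistic(v):
    # Logistic sigmoid activation of Eq. (7)
    return 1.0 / (1.0 + np.exp(-v))

def mlp_output(x, W1, b1, W2, b2):
    """Output of the single-hidden-layer perceptron of Eq. (8).

    x: input vector (d,), W1: hidden-layer weights (h, d), b1: hidden
    biases (h,), W2: output weights (1, h), b2: scalar output bias.
    """
    hidden = logistic(W1 @ x + b1)      # phi_1 applied element-wise
    return logistic(W2 @ hidden + b2)   # phi_2 at the output layer
```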

The most widely used training algorithm to determine the weights of a multi-layer perceptron is known as the backpropagation algorithm (Haykin, 1999). This algorithm employs a stochastic gradient search to minimize the mean-square-error (MSE) cost function defined as

MSE = (1/M) Σ_{k=1}^{M} [y(k) − d(k)]^2    (9)

where d(k) is the desired output for the kth sample, y(k) is the actual network output and M is the total number of training samples. Other optimization algorithms can be used, such as the Levenberg–Marquardt method (Gill et al., 1981), which has a faster convergence, at the expense of more memory usage.

Training usually starts from a random set of weights and proceeds until a specified value of the mean-square-error is met or a maximum number of iterations is reached. In most cases, the cost function displays many local minima and the training result depends on the initial weight values.

3.2. Pruning techniques

The simplest pruning method consists of training the network until a local minimum of the cost function is reached and then eliminating the weights with smallest magnitudes. The problem with this approach is that, due to the nonlinearity of the structure, the removal of small weights may actually result in a large increase in the output error.

The Optimal Brain Surgeon (OBS) (Hassibi et al., 1993) and the Optimal Brain Damage (OBD) (Kun et al., 1990) techniques employ second-derivative information (the Hessian of the cost function with respect to the network parameters) to identify synaptic weights whose elimination will result in the smallest increase in the output error. They differ in the fact that OBS employs a recursive algorithm to approximate the Hessian matrix, whereas OBD assumes a diagonal Hessian matrix for computational simplicity.
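
As a sketch of the idea behind OBD (not the authors' implementation), the saliency of each weight under the diagonal-Hessian assumption can be computed as follows; hessian_diag is assumed to hold the diagonal of the Hessian of the cost with respect to the weights.

```python
import numpy as np

def obd_saliencies(weights, hessian_diag):
    """Optimal Brain Damage saliencies under the diagonal-Hessian assumption.

    The estimated increase in the cost caused by setting weight w_i to zero
    is 0.5 * H_ii * w_i**2; the weight with the smallest saliency is pruned
    and the network is retrained before the next pruning step.
    """
    return 0.5 * hessian_diag * weights ** 2

# Example: index of the next weight to prune
# idx = np.argmin(obd_saliencies(w, h_diag))
```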

4. Wavelet networks

A wavelet network can be regarded as a neural architecture with activation functions which are dilated and translated versions of a single function ψ : R^d → R, where d is the input dimension (Zhang and Benveniste, 1992; Zhang, 1997). This function, called “mother wavelet”, is localized both in the space (x) and frequency (ω) domains in the sense that |ψ(x)| and |ψ̂(ω)| rapidly decay to zero when ‖x‖ → ∞ and ‖ω‖ → ∞, respectively.


As an example, consider the so-called “Mexican Hat” mother wavelet defined as

ψ(x) = (1 − ‖x‖^2) exp(−‖x‖^2 / 2)    (10)

which is depicted in figure 2 for a two-dimensional input. For ease of visualization, a cross-section of this wavelet is presented in figure 3. This type of multidimensional wavelet is called “radial”, because it depends solely on the norm of the input vector.

A wavelet network model with L elements can be parameterized as

y(x) = Σ_{j=1}^{L} w_j ψ_j(x)    (11)

Figure 2. Mexican Hat wavelet with two-dimensional input.

Figure 3. Cross-section of the Mexican Hat wavelet.


where basis functions ψ_j(x), called “wavelets”, are dilated and translated versions of ψ(x):

ψ_j(x) = a_j^{-d/2} ψ((x − b_j) / a_j)    (12)

The dilation parameter a_j ∈ R* controls the spread of the wavelet, while the translation parameter b_j ∈ R^d determines its central position. It can be shown (Daubechies, 1992) that, if pairs (a_j, b_j) are taken from the grid

{(α^m, nβα^m); m ∈ Z, n ∈ Z^d}    (13)

for convenient values of α > 1 and β > 0, then any function y(x) in L^2(R^d) can be approximated by Eq. (11) to an arbitrary precision, given a sufficiently large number of wavelets. Therefore, wavelet networks, like multi-layer perceptrons, can be used to build nonlinear discriminant functions.
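
A minimal sketch of Eqs. (10)–(12): the Mexican Hat mother wavelet, a dilated and translated wavelet term, and the network output as their weighted superposition. Function and parameter names are illustrative assumptions.

```python
import numpy as np

def mexican_hat(x):
    # Radial "Mexican Hat" mother wavelet of Eq. (10)
    r2 = float(np.dot(x, x))
    return (1.0 - r2) * np.exp(-r2 / 2.0)

def wavelet_term(x, a, b):
    # Dilated and translated wavelet psi_j of Eq. (12)
    x, b = np.asarray(x, dtype=float), np.asarray(b, dtype=float)
    d = x.size
    return a ** (-d / 2.0) * mexican_hat((x - b) / a)

def wavelet_network(x, dilations, translations, weights):
    # Superposition of L wavelet terms, Eq. (11)
    return sum(w * wavelet_term(x, a, b)
               for a, b, w in zip(dilations, translations, weights))
```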

Arguably the main advantage of wavelet networks over other neural architectures is the availability of efficient constructive algorithms (Zhang, 1997) for defining the network structure, that is, for choosing convenient values for (m, n). After the structure has been determined, weights w_j can be obtained through linear discriminant analysis.

4.1. Defining the structure of the wavelet network

In this work, a modified version of the constructive method introduced by Zhang (1997) is proposed:

1. Normalize the modelling data to fit within the effective support S of the mother wavelet employed. For radial wavelets, S is a hypersphere in R^d with radius R. For the Mexican Hat, for instance, R can be taken as 5 (see figure 3). For computational simplicity, S is approximated as a hypercube inscribed in the hypersphere with edges parallel to the coordinate axes.

2. For each sample x_k in the modelling set, find I_k, the index set of wavelets whose effective supports contain x_k:

I_k = {(m, n) s.t. x_k ∈ S_{m,n}, m_min ≤ m ≤ m_max}, k = 1, …, M    (14)

where S_{m,n} is a hypercube centered in nβα^m with edges α^m R√2. As a rule of thumb, set the minimum and maximum scale levels to m_min = −1 and m_max = +1, respectively. In the present application, these settings proved to be adequate but, in the general case, more scale levels may be added until the performance of the wavelet network is considered satisfactory.

3. Determine the pairs (m, n) which appear in at least two sets I_k1 and I_k2, k1 ≠ k2. These are the wavelets whose effective supports include at least two samples. This step is different from the algorithm described by Zhang (1997), which allows for wavelets with effective supports containing only one sample. Since such wavelets introduce oscillations between neighboring data points, they should be excluded from the modelling process to avoid overfitting problems.

4. Let L be the number of wavelets obtained above. For simplicity of notation, replace the double index (m, n) by a single index j = 1, …, L.

5. Apply the L wavelets to the M modelling samples and gather the results in matrix form as

Ψ = [ ψ_1(x_1)  ψ_1(x_2)  · · ·  ψ_1(x_M) ]
    [ ψ_2(x_1)  ψ_2(x_2)  · · ·  ψ_2(x_M) ]
    [     ⋮          ⋮                ⋮    ]
    [ ψ_L(x_1)  ψ_L(x_2)  · · ·  ψ_L(x_M) ]    (15)

Notice that each sample is now represented by L wavelet outputs (a column of Ψ), instead of d variables. Since the mapping x → ψ(x) is a nonlinear transformation, patterns which were not linearly separable in the x-variable domain may be so in the domain of wavelet outputs. However, many wavelets resulting from steps 1–4 may be redundant or may not convey useful discriminating information. Thus, the next step consists in determining which wavelets or, alternatively, which rows of Ψ are the most relevant for the classification task.
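
The matrix of Eq. (15) can be assembled by evaluating each retained wavelet at every modelling sample. The sketch below reuses wavelet_term from the earlier example; X holds the M samples as rows, and the dilation/translation lists come from steps 1–4 (names are assumptions).

```python
import numpy as np

def build_psi(X, dilations, translations):
    """L x M matrix of Eq. (15): row j holds wavelet j applied to all samples."""
    return np.array([[wavelet_term(x, a, b) for x in X]
                     for a, b in zip(dilations, translations)])
```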

At this point, the algorithm of Zhang (1997), which was proposed in an estimation framework, makes use of the information available in the dependent variable (the variable to be estimated). That algorithm, henceforth termed “classical”, starts with the single wavelet regressor that displays the best correlation with the dependent variable. At each subsequent iteration, the remaining wavelet regressors are orthogonalized with respect to those already selected in a Gram-Schmidt manner (Bjorck, 1994). After the orthogonalization step, the network is augmented with the wavelet regressor that is best correlated with the dependent variable.

In the present case, the selection process must be guided by the binary class label available for each modelling sample. For this purpose, the Fisher Discriminant concept is employed, as defined below.

Definition 1 (Fisher Discriminant). Let ψ be a vector with elements associated to two classes (1 and 2). The Fisher Discriminant of ψ is defined as

F(ψ) = [µ1(ψ) − µ2(ψ)]^2 / ([σ1(ψ)]^2 + [σ2(ψ)]^2)    (16)

where µ1(ψ) and σ1(ψ) are respectively the mean and standard deviation of the elements of ψ associated to class 1. In the same manner, µ2(ψ) and σ2(ψ) are defined for the elements associated to class 2.


The Fisher Discriminant is a measure of how well the two classes are discriminated in ψ. In the context of classification problems, such an index plays a role similar to the correlation coefficient in least-squares regression (Taylor and Cristianini, 2004). Note that it cannot be directly used to assess the joint discriminating power of two or more vectors. However, if these vectors are used to build a single-output classification model, the Fisher Discriminant can be applied to the model output.
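
A direct implementation of Eq. (16) is sketched below; labels is assumed to hold the class (1 or 2) of each element of ψ, and the use of the sample standard deviation is an assumption.

```python
import numpy as np

def fisher_discriminant(psi, labels):
    """Fisher Discriminant of Eq. (16) for a vector of outputs psi."""
    psi, labels = np.asarray(psi, dtype=float), np.asarray(labels)
    g1, g2 = psi[labels == 1], psi[labels == 2]
    return (g1.mean() - g2.mean()) ** 2 / (g1.std(ddof=1) ** 2 + g2.std(ddof=1) ** 2)
```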

Apart from the discriminating power measured by the Fisher Discriminant, the problem of redundancy between the selected wavelet terms is also taken into account. In order to address such an issue, the following Condition Number definition is employed.

Definition 2 (Condition Number associated to a set of vectors). Let A = {ψ_j, j = 1, …, J} be a set of M-dimensional row vectors (M > J). The condition number associated with A is defined as the condition number (ratio between the maximum and the minimum singular values) of a matrix ∆ built as

∆ = [ ψ_1 ]
    [ ψ_2 ]
    [  ⋮  ]
    [ ψ_J ]    (17)

The condition number can be used as a measure of the collinearity (or “redundancy”) between the vectors in A (Lawson and Hanson, 1974). In fact, if the vectors are linearly dependent, the condition number is infinite and, if they form an orthonormal set, the condition number equals one.
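
Definition 2 can be evaluated directly from the singular values of the stacked matrix ∆, as in this short sketch:

```python
import numpy as np

def condition_number(vectors):
    """Condition number of Definition 2: ratio of the largest to the
    smallest singular value of the matrix whose rows are the given vectors."""
    s = np.linalg.svd(np.vstack(vectors), compute_uv=False)
    return s[0] / s[-1]
```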

The algorithm proposed here selects rows from Ψ in a stepwise manner, starting from the one which displays the largest Fisher Discriminant and adding a new row at each iteration.

(a) (Initialization) Let A be the set of row vectors still available for selection and B the set in which the selected vectors are stored. Initially, A = {ψ_j, j = 1, …, L}, where ψ_j = [ψ_j(x_1) ψ_j(x_2) … ψ_j(x_M)], and B = ∅.

(b) (Preliminary pruning) Eliminate from A all vectors whose norm is lower than κ max_j(‖ψ_j‖) for a fixed 0 < κ < 1.

(c) (First selection) Determine the vector in A which displays the largest Fisher Discriminant. Move this vector from A to B.

(d) (Collinearity prevention) Remove from A all vectors which, if added to B, result in a condition number larger than a fixed threshold χ > 0. If all vectors in A are eliminated, then stop.

(e) (Selection) For each of the remaining vectors in A, obtain a linear discriminant model by using this vector in conjunction with the vectors in B. Evaluate the Fisher Discriminant of the model output. Determine the vector which leads to the largest Fisher Discriminant and move it from A to B.

(f) Return to step (d).


Notice that redundancy is avoided in step (d), whereas step (e) selects the vector that displays the best synergy with those already selected.
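
The loop structure of steps (a)–(f) is sketched below, reusing fisher_discriminant and condition_number from the examples above. The helper lda_output, which would return the output of a linear discriminant model built from the selected rows, is hypothetical; this is an illustration under stated assumptions, not the authors' code.

```python
import numpy as np

def select_wavelets(Psi, labels, kappa=1e-3, chi=10.0):
    """Stepwise selection of rows of Psi (L x M) following steps (a)-(f)."""
    norms = np.linalg.norm(Psi, axis=1)
    # Steps (a)-(b): start with all rows, drop those with very small norm
    A = [j for j in range(Psi.shape[0]) if norms[j] >= kappa * norms.max()]
    B = []

    # Step (c): first selection by largest Fisher Discriminant
    first = max(A, key=lambda j: fisher_discriminant(Psi[j], labels))
    B.append(first)
    A.remove(first)

    while A:
        # Step (d): drop candidates that would make the selected set ill-conditioned
        A = [j for j in A if condition_number(Psi[B + [j]]) <= chi]
        if not A:
            break
        # Step (e): add the candidate whose joint LDA output discriminates best
        best = max(A, key=lambda j: fisher_discriminant(
            lda_output(Psi[B + [j]].T, labels), labels))  # lda_output: hypothetical helper
        B.append(best)
        A.remove(best)
        # Step (f): repeat from step (d)
    return B
```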

5. Case study based on data from British firms

Financial data from 29 failed and 31 continuing British corporations were used in this study. The data set covers the period between 1997 and 2000. The variables employed are financial ratios commonly found in the literature: x1 (working capital/total assets), x2 (accumulated retained profit/total assets), x3 (profit before interest and tax/total assets), x4 (book value of equity/book value of total liabilities), x5 (sales/total assets). Here the only difference from Altman's choice (Altman, 1968) is the use of the book value of equity, rather than the market value of equity, to calculate x4. The full data tables employed in this example are given by Galvao et al. (2004).

The data set was divided into a modelling set (21 failed and 21 continuing firms) and a validation set (8 failed and 10 continuing firms). The validation set is used to assess the predictive power of the models, that is, their ability to classify companies which were not used for model development. For this reason, it is assumed that the validation data are not to be employed at any stage of the modelling process. The same modelling/validation partitioning of the data set was employed with all classification techniques.

5.1. Linear discriminant analysis model

Using the five ratios given above and Eq. (1), the Z-score model is given by:

Z = 3.3057x1 + 0.9270x2 − 2.0891x3 + 0.4678x4 + 0.1431x5 (18)

The cut-off value for this model was calculated from Eq. (4) as zc = 1.5087.

By inspecting Eq. (18), it is possible to note that the third coefficient does not have a logical value, as it is negative. This coefficient is associated with the third ratio x3, which relates profit before interest and tax with total assets, and it should be positive as the higher the value of this ratio (higher profits), the less likely it is that the firm is in financial trouble (all other factors kept constant). The source of this problem can be understood by calculating the coefficient of multiple correlation (Ezekiel and Fox, 1959) of each ratio with respect to the other four. For ratio xi, this coefficient is defined as σ(x̂i)/σ(xi), where σ(·) stands for the sample standard deviation and x̂i is the least-squares estimate of xi obtained from the other ratios. The values obtained for x1, x2, x3, x4, x5 were 0.96, 0.75, 0.97, 0.56, 0.41, respectively. As can be seen, x3 is the ratio that can be best predicted from the other four. Thus, it is the most likely source of collinearity problems, which are known to deteriorate the prediction ability of linear discriminant models (Naes and Mevik, 2001).
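
The coefficient of multiple correlation used above can be computed as in the sketch below; X is assumed to hold the modelling samples as rows (one column per ratio), and the inclusion of an intercept in the least-squares fit is an assumption about the exact regression setup.

```python
import numpy as np

def multiple_correlation(X, i):
    """sigma(x_i_hat) / sigma(x_i), with x_i_hat the least-squares estimate
    of column i of X obtained from the remaining columns."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([others, np.ones(len(X))])   # regressors plus intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    y_hat = A @ coef
    return np.std(y_hat, ddof=1) / np.std(y, ddof=1)
```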

In order to prevent the illogical value from affecting the accuracy of the discriminant analysis model, ratio x3 was eliminated from the modelling data and a new model was calculated with four inputs (x1, x2, x4 and x5), rather than five. The resulting Z-score model is as follows:

Z = 0.2173x1 + 0.3788x2 + 0.4666x4 + 0.1244x5    (19)

with cut-off value zc = 0.7548. As can be seen, all coefficients display positive values, which is in agreement with the financial interpretation of the model. Moreover, by re-evaluating the coefficients of multiple correlation after the exclusion of x3, the values obtained for x1, x2, x4, x5 were 0.58, 0.60, 0.42, 0.31, which shows that the collinearity effects are much smaller than in the previous case. Interestingly, ratio x1 no longer exhibits a large coefficient of multiple correlation, which suggests that x1 was primarily correlated with x3.

This model yields nine errors on the modelling set and four errors on the validation set, as shown in Table 1. It is worth noting that, if ratio x3 is not discarded, the number of validation errors increases to 7, which is in agreement with the financial and statistical reasonings presented above.

Table 1. Results, Z-score model using four ratios. Error types: I (failed company classified as continuing) or II (continuing company classified as failed).

Data set     Type I errors   Type II errors   Total errors   Percent accuracy
Modelling         2                7                9               79
Validation        0                4                4               78

5.2. Multi-layer perceptron without structure optimization

To illustrate some potential problems associated to the use of neural network models, consider a multi-layer perceptron with the following architecture.

– Five inputs: financial ratios x1, x2, x3, x4, and x5.
– One output in the range (0, 1). The company is classified as failed if the output is smaller than 0.5 and as continuing otherwise.
– Two hidden layers.
– Four neurons in each hidden layer.
– Logistic sigmoid activation function in all neurons.

This network was trained with the Levenberg–Marquardt method (Gill et al., 1981). Initial weights were randomly generated and the maximum number of iterations was set to 100. To investigate the sensitivity of the training algorithm with respect to its starting point, 100 training runs were performed, each one starting from a different set of initial weights.

Figure 4 displays the outcomes of the training runs, sorted according to the final mean-square-error MSE of the network (Eq. (9) with d(k) = 0 or 1 and M = 42). The validation errors in each case are also presented. Notice that an MSE smaller than (0.5)^2/42 = 6 × 10^−3 ensures the correct classification of all modelling companies.

Figure 4. Training outcomes sorted according to the mean-square-error in the modelling set.

As seen in figure 4, most training runs resulted in MSE < 10^−10, which shows that the modelling patterns can be adequately discriminated by the neural network. However, in some cases, despite the good performance in the modelling set, the neural network yielded five validation errors, which is worse than the linear model result (four validation errors).

This generalization problem points to a lack of parsimony in the neural network structure. In fact, the number of model parameters is now 49 (40 synaptic weights and 9 biases), as compared to 5 (4 coefficients and the cut-off value) in the linear model.

It could be argued that one of the training runs actually resulted in a single validation error. However, the number of validation errors is presented here only for discussion purposes, since the validation set is not supposed to enter the modelling process at any stage. Thus, the only information which the analyst should use to favor a particular training run against the others is the final value of the cost function MSE.

According to figure 4, the training run which attained the lowest MSE resulted in a neural network which yields three validation errors. This result will be kept for the records (Table 2). However, it should be pointed out that, if the whole experiment were repeated with new sets of initial weights, the training run with the smallest MSE could lead to a different number of validation errors.

5.3. Multi-layer perceptron trained with optimal brain damage

As illustrated in the previous example, the use of over-parameterized neural networks may result in poor generalization and sensitivity to the initial weight values used in the training. In what follows, a pruning technique is employed to obtain a more parsimonious neural network.


Table 2. Results, multi-layer perceptron (best training run according to the mean-square-error criterion). Error types: I (failed company classified as continuing) or II (continuing company classified as failed).

Data set     Type I errors   Type II errors   Total errors   Percent accuracy
Modelling         0                0                0              100
Validation        1                2                3               83

Figure 5. Pruning record of the multi-layer perceptron trained with Optimal Brain Damage. The arrow marks the inflection point used to select the number of parameters.

Optimal Brain Damage was applied to a multi-layer perceptron consisting initially of one hidden layer with 10 neurons. Pruning was performed by removing one connection at a time and then retraining the network (Norgaard, 2000).

Figure 5 displays the mean square error MSE as a function of the number of remaining parameters (weights and biases) in the network. A good trade-off between model complexity and classification performance in the training set is obtained at the inflection point marked with an arrow (22 parameters).

Table 3 presents the classification performance of the pruned multi-layer perceptron on the modelling and validation sets. Notice that this network has less than half the number of parameters of the multi-layer perceptron employed in the previous subsection, with no loss in prediction capability.

These results show that Optimal Brain Damage is an efficient technique for obtaining parsimonious neural network models. However, it should be noticed that sensitivity to initial weight values still exists. In fact, consider figure 6, which was obtained by using a different set of initial weights. The inflection point of the MSE curve is now at 24 parameters and the resulting network yields four validation errors, instead of three.


Table 3. Results, multi-layer perceptron pruned with Optimal Brain Damage. Error types: I (failed company classified as continuing) or II (continuing company classified as failed).

Data set     Type I errors   Type II errors   Total errors   Percent accuracy
Modelling         0                0                0              100
Validation        0                3                3               83

Figure 6. Pruning record of the multi-layer perceptron trained with Optimal Brain Damage and a different set of initial weights. The arrow marks the inflection point used to select the number of parameters.

It could be argued that the poorer generalization ability of this network is a consequence of its being less parsimonious than the previous one, which had 22 parameters. However, the main point of this example is that the reproducibility issues raised in the previous subsection were not solved. The Optimal Brain Surgeon technique was also tested, but the results were similar.

5.4. Wavelet network

A Mexican Hat wavelet with grid parameters α = 2 and β = R/2 = 2.5 was employed. Steps 1 and 2 of the algorithm described in Section 4.1 resulted in 991 wavelets, a number which was reduced to 457 by Step 3. Of these, 258 were discarded in step (b) of the selection process with κ = 10^−3. The use of a threshold χ = 10 for the condition number caused the algorithm to stop after selecting 11 wavelets.

Figure 7 displays the number of modelling errors obtained with the Z-score model as the wavelets are added during the selection process. This graph suggests the choice of six wavelets, according to the principle that, given models with similar prediction capabilities, the one with fewest parameters should be favored. Interestingly, these wavelets are equally distributed among the three scale levels employed (m = −1, 0, +1), which justifies the importance of using a multi-scale network structure.

Figure 7. Modelling errors as a function of the number of wavelets employed. The arrow marks the inflection point used to select the number of wavelets.

Table 4. Results, wavelet network with six wavelets (proposed procedure). Error types: I (failed company classified as continuing) or II (continuing company classified as failed).

Data set     Type I errors   Type II errors   Total errors   Percent accuracy
Modelling         1                1                2               95
Validation        0                2                2               89

As shown in Table 4, the six-wavelet model leads to two errors on the validation set. When compared to the multi-layer perceptron trained with Optimal Brain Damage, this wavelet network is less successful in discriminating the training patterns, but its performance on the validation set is better. The reason may lie in the fact that, although the wavelet network has 43 parameters (6 wavelets × (1 scale + 5 translations) + 6 weights + 1 cut-off value), only 7 of them are real-valued. Thus, overall, the wavelet network structure is simpler than the multi-layer perceptrons obtained in the previous sections. Moreover, since the construction of the wavelet network does not involve random factors (such as the generation of an initial set of weights), the results are reproducible, unlike the training of a multi-layer perceptron.

For comparison, a second wavelet network was constructed by using a modified version of the classical method described in Section 4.1. The modification consisted of using the Fisher Discriminant value associated to each wavelet term instead of the correlation with the dependent variable originally employed by Zhang (1997). In this case, the best result (by using the parsimony criterion described above) was attained by using a single wavelet. As shown in Table 5, the resulting wavelet network leads to four errors in the validation set.

Table 5. Results, wavelet network with six wavelets (modified classical procedure). Error types: I (failed company classified as continuing) or II (continuing company classified as failed).

Data set     Type I errors   Type II errors   Total errors   Percent accuracy
Modelling         1                5                6               86
Validation        0                4                4               78

5.5. Re-sampling study

The results discussed so far were obtained for one given partition of the available data into modelling and validation sets. In order to perform a more thorough comparison of the classification techniques discussed in this paper, a re-sampling study was carried out. For this purpose, 100 different modelling/validation partitions were randomly generated with the same size as the one employed before (42 modelling companies and 18 validation companies). In the LDA case, 1000 partitions were employed because of the computational simplicity of the calculations involved. For each partition, the modelling procedures previously described were applied and the resulting classifiers were validated. Table 6 presents the average number of validation errors obtained in this study, as well as the standard deviation associated to each mean value (standard errors). For the wavelet network classifiers, the results of not excluding the wavelets whose supports contain only one modelling sample are also presented.

Table 6. Average number of errors and associated standard deviations obtained in the re-sampling study. The results of not excluding the wavelets whose supports contain only one modelling sample are shown in parentheses. The average percent values of classification accuracy in the modelling and validation sets are also presented.

Model type   Modelling errors   Av. modelling accuracy (%)   Validation errors   Av. validation accuracy (%)
LDA          8.49 ± 0.05        79.8                          4.71 ± 0.05         73.8
MLP-OBD      0.17 ± 0.04        99.6                          4.46 ± 0.17         75.2
WN1          1.42 ± 0.11        96.6                          4.17 ± 0.17         76.8
             (1.14 ± 0.11)      (97.3)                        (4.34 ± 0.18)       (75.9)
WN2          6.14 ± 0.20        85.4                          4.06 ± 0.15         77.4
             (5.96 ± 0.20)      (85.8)                        (4.20 ± 0.15)       (76.7)

Key: LDA—Linear discriminant analysis. MLP-OBD—Multi-layer perceptron (Optimal Brain Damage). WN1—Wavelet network (proposed procedure). WN2—Wavelet network (modified classical procedure).


Table 7. T-test study at a confidence level of 95% to check the statistical significance of the superiority of wavelet networks indicated by the results presented in Table 6.

             LDA   MLP-OBD   WN1
MLP-OBD      ≈     -         -
WN1          ≠     ≈         -
WN2          ≠     ≠         ≈

The symbol ≠ indicates a significant difference, the symbol ≈ indicates no significant difference, and a dash indicates that the test was not needed.
Key: LDA—Linear discriminant analysis. MLP-OBD—Multi-layer perceptron (Optimal Brain Damage). WN1—Wavelet network (proposed procedure). WN2—Wavelet network (modified classical procedure).


A t-test study was carried out using function ttest of Matlab's Statistics Toolbox (The Mathworks, 2004) based on the validation outcomes presented in Table 6 to check for significant differences between the mean values shown and the value of nine, which would be the expected result of using a random prediction method (nine errors in 18 validation objects). The results show that indeed the classification techniques under consideration outperform the expected result of a random prediction method. In fact, the probability of observing any of the obtained results by chance, given that the population mean were nine, is smaller than 10^−46.

A comparison of the four distress prediction models in terms of average number of validation errors points to a superiority of the wavelet networks. In order to assess the statistical significance of such a finding, t-tests at a confidence level of 95% were carried out for the validation data by using function ttest2 of Matlab's Statistics Toolbox (The Mathworks, 2004). The results, which are reported in Table 7, indicate that both wavelet networks are indeed significantly superior to LDA. Furthermore, the multi-layer perceptron is found to be equivalent to LDA. It can thus be concluded that the wavelet networks were a better alternative to the classical LDA procedure and to the multi-layer perceptron in this case study.
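
Equivalent tests can be reproduced outside Matlab; a sketch with scipy.stats is given below, where the error arrays passed to the functions are assumed to hold the validation error counts of the classifiers over the re-sampled partitions.

```python
from scipy import stats

def compare_to_random(errors, random_mean=9.0):
    """One-sample t-test of the mean validation error against the nine
    errors expected from random prediction (analogous to Matlab's ttest)."""
    return stats.ttest_1samp(errors, popmean=random_mean)

def compare_models(errors_a, errors_b):
    """Two-sample t-test between two classifiers over the re-sampled
    partitions (pooled-variance test, analogous to Matlab's ttest2 default)."""
    return stats.ttest_ind(errors_a, errors_b)

# A difference is deemed significant at the 95% level when the returned
# p-value is below 0.05.
```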

The t-test also reveals that the two wavelet network construction methods are statistically equivalent in terms of validation performance. In this comparison context, it is interesting to notice that the proposed procedure (WN1) leads to a much smaller number of modelling errors than the modified classical procedure (WN2). It might be argued that the proposed collinearity avoidance mechanism allowed the inclusion of more wavelet terms in the network (thus providing a more flexible decision surface, which better separates the modelling data) without causing overfitting problems.

Finally, Table 6 shows that the preliminary exclusion of wavelets whose supports contain only one modelling sample does not compromise the predictive ability of the resulting classifiers. In fact, such a procedure actually improves the validation performance of the wavelet networks to a small degree, apart from considerably reducing the computational workload involved in the selection of the wavelet terms.

6. Conclusions

This paper discussed the use of different classification models to address the problem of corporate financial distress prediction. The relevance of pruning techniques in neural network models was examined and a novel procedure for building wavelet network classifiers was introduced. The results, supported by a re-sampling study, showed that nonlinear models may be a valid alternative to the classical linear discriminant models employed in this context. Moreover, wavelet networks may have advantages over the conventional multi-layer perceptron structures employed in neural network frameworks.

A possibility not explored in this paper is subjecting the wavelet network to backpropagation training after selecting the wavelets to be included in the model. By fine-tuning the scale and translation parameters, this training could possibly improve the network performance. However, the advantage of a simple structure, mostly parameterized by integer values, would be lost.

Improvements might also be obtained through the use of a different set of financial ratios, as discussed elsewhere (Galvao et al., 2004). However, to keep the focus on the comparison between model architectures, the ratio selection problem was not investigated here.

Notes

1. ψ̂ is the Fourier Transform of ψ.
2. L^2(R^d) = {y : R^d → R s.t. ∫_{R^d} |y(x)|^2 dx_1 dx_2 … dx_d < ∞}.

References

Alici, Y. 1996. Neural networks in corporate failure prediction: The UK experience. In Neural Networks in Financial Engineering, A. Refenes, Y. Abu-Mostafa, and J. Moody (Eds.), London: World Scientific.

Altman, E. 1968. Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23(4):589–609.

Altman, E., Marco, G., and Varetto, F. 1994. Corporate distress diagnosis: Comparisons using linear discriminant analysis and neural networks (the Italian experience). Journal of Banking and Finance, 18:505–529.

Ash, T. 1989. Dynamic node creation. Connection Science, 1(4):365–375.

Beaver, W. 1966. Financial ratios as predictors of failure. Empirical Research in Accounting: Selected Studies, 5:71–111.

Bjorck, A. 1994. Numerics of Gram-Schmidt orthogonalization. Linear Algebra and its Applications, 197.

Cannon, M. and Slotine, J.-J.E. 1995. Space-frequency localized basis function networks for nonlinear system estimation and control. Neurocomputing, 9:293–342.

Chen, S., Cowan, C., and Grant, P. 1991. Orthogonal least squares learning algorithm for radial basis function networks. IEEE Transactions on Neural Networks, 2(2):302–309.

Coats, P. and Fant, L. 1993. Recognizing financial distress patterns using a neural network tool. Financial Management, 22:142–155.

Daubechies, I. 1992. Ten Lectures on Wavelets. Philadelphia: SIAM.

Ezekiel, M. and Fox, K.A. 1959. Methods of Correlation and Regression Analysis, 3rd edition. New York: John Wiley.

Foster, G. 1986. Financial Statement Analysis. London: Prentice-Hall.

Galvao, R.K.H., Becerra, V.M., and Abou-Seada, M. 2004. Ratio selection for classification models. Data Mining and Knowledge Discovery, 8(2):151–170.

Galvao, R.K.H. and Yoneyama, T. 1999. Improving the discriminatory capabilities of a neural classifier by using a biased-wavelet layer. International Journal of Neural Systems, 9(3):167–174.

Gill, P., Murray, W., and Wright, M. 1981. Practical Optimization. London: Academic Press.

Gomm, J. and Yu, D. 2000. Selecting radial basis function network centers with recursive orthogonal least squares training. IEEE Transactions on Neural Networks, 11(2):306–314.

Hassibi, B., Stork, D., and Wolff, G. 1993. Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, pp. 293–299.

Haykin, S. 1999. Neural Networks: A Comprehensive Foundation. London: Prentice-Hall.

Kohonen, T. 1995. Self-Organizing Maps. Berlin: Springer-Verlag.

Kun, Y., Denker, J., and Solla, S. 1990. Optimal Brain Damage. In Advances in Neural Information Processing Systems, D. Touretzky (Ed.), San Mateo, Calif.: Morgan Kaufmann, pp. 598–605.

Lawson, C.L. and Hanson, R.J. 1974. Solving Least Squares Problems. Englewood Cliffs: Prentice-Hall.

Mao, K. 2002. RBF neural network center selection based on Fisher ratio class separability measure. IEEE Transactions on Neural Networks, 13(5):1211–1217.

Moody, J. and Darken, C. 1989. Fast learning in networks of locally-tuned processing units. Neural Computation, 1:281–294.

Morrison, D. 1990. Multivariate Statistical Methods. New York: McGraw-Hill.

Naes, T. and Mevik, B.H. 2001. Understanding the collinearity problem in regression and discriminant analysis. Journal of Chemometrics, 15(4):413–426.

Norgaard, M. 2000. Neural network based system identification toolbox. Technical Report 00-E-891, Technical University of Denmark, Department of Automation.

Odom, M. and Sharda, R. 1990. A neural network model for bankruptcy prediction. In IJCNN International Joint Conference on Neural Networks, Vol. II, San Diego, California, pp. 163–167.

Pedrycz, W. 1998. Conditional fuzzy clustering in the design of radial basis function neural networks. IEEE Transactions on Neural Networks, 9(4):601–612.

Scholkopf, B., Sung, K.-K., Burges, C., Girosi, F., Niyogi, P., Poggio, T., and Vapnik, V. 1997. Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Transactions on Signal Processing, 45:2758–2765.

Setiono, R. and Hui, L.C.K. 1995. Use of a quasi-Newton method in a feedforward neural network construction algorithm. IEEE Transactions on Neural Networks, 6(1):273–277.

Sherstinsky, A. and Picard, R. 1996. On the efficiency of the orthogonal least squares training method for radial basis function networks. IEEE Transactions on Neural Networks, 7(1):195–200.

Szu, H.H., Telfer, B., and Kadambe, S. 1992. Neural network adaptive wavelets for signal representation and classification. Optical Engineering, 31(9):1907–1916.

Tam, K. and Kiang, M.Y. 1990. Predicting bank failures: A neural network approach. Applications of Artificial Intelligence, 4:265–282.

Tam, K. and Kiang, M.Y. 1992. Managerial applications of neural networks. Management Science, 38:926–947.

Taylor, J.S. and Cristianini, N. 2004. Kernel Methods for Pattern Analysis. Cambridge: Cambridge University Press.

The Mathworks. 2004. Statistics Toolbox User's Guide, Version 5. Natick, Massachusetts: The Mathworks.

Trigueiros, D. and Taffler, R. 1996. Neural networks and empirical research in accounting. Accounting and Business Research, 26:347–355.

Tyree, E. and Long, J. 1996. Assessing financial distress with probabilistic neural networks. In Neural Networks in Financial Engineering, A. Refenes, Y. Abu-Mostafa, and J. Moody (Eds.), London: World Scientific.

Wilson, R.L. and Sharda, R. 1994. Bankruptcy prediction using neural networks. Decision Support Systems, 11:545–557.

Yao, X. 1999. Evolving artificial neural networks. Proceedings of the IEEE, 87(9):1423–1447.

Zhang, Q. 1997. Using wavelet network in nonparametric estimation. IEEE Transactions on Neural Networks, 8(2):227–236.

Zhang, Q. and Benveniste, A. 1992. Wavelet networks. IEEE Transactions on Neural Networks, 3(6):889–898.

Victor M. Becerra received his first degree in Electrical Engineering from Simon Bolívar University, Caracas, Venezuela, in 1990, and his PhD in Control Engineering from City University, London, UK, in 1994. He also obtained an MSc in Financial Management from Middlesex University, London, in 2001. He was a Research Fellow at City University, London, between 1994 and 1999. Since January 2000, he has been employed as a Lecturer at the Department of Cybernetics, the University of Reading, UK. His main research interests are in the fields of control engineering and artificial intelligence.

Roberto Kawakami Harrop Galvao received his BSc degree in Electronics Engineering (Summa cum Laude) in 1995 and his PhD in Systems and Control in 1999, both from Instituto Tecnologico de Aeronautica (ITA), Brazil. In 2001, he spent a post-doctoral period at the Department of Cybernetics, the University of Reading, UK. Since 1998, he has been with the Electronics Engineering Division of ITA, as Associate Professor of Systems and Control. His main areas of interest are wavelet theory and applications, signal processing and multivariate analysis.

Magda Abou-Seada is a Senior Lecturer in Accounting at Middlesex University, London, UK. She obtained her first degree, BCom in Accounting, and Master's degree in Accounting from Cairo University, Egypt. She completed her PhD studies in 1999 at the University of the West of England, Bristol, UK. Her research interests include auditing and financial accounting.