
Journal of Intelligent and Robotic Systems 21: 117–129, 1998. © 1998 Kluwer Academic Publishers. Printed in the Netherlands.

Modular Neural Network Classifiers: A Comparative Study

GASSER AUDA and MOHAMED KAMEL
Pattern Analysis and Machine Intelligence Lab., Systems Design Engineering Department, University of Waterloo, Canada N2L 3G1; e-mail: {gasser,mkamel}@watnow.uwaterloo.ca

(Received: 21 July 1997; accepted: 23 September 1997)

Abstract. There is a wide variety of Modular Neural Network (MNN) classifiers in the literature. They differ according to the design of their architecture, task-decomposition scheme, learning procedure, and multi-module decision-making strategy. Meanwhile, there is a lack of comparative studies in the MNN literature. This paper compares ten MNN classifiers which give a good representation of design varieties, viz., Decoupled; Other-output; ART-BP; Hierarchical; Multiple-experts; Ensemble (majority vote); Ensemble (average vote); Merge-glue; Hierarchical Competitive Neural Net; and Cooperative Modular Neural Net. Two benchmark applications of different degree and nature of complexity are used for performance comparison, and the strong points and drawbacks of the different networks are outlined. The aim is to help a potential user choose an appropriate model according to the application at hand and the available computational resources.

Key words: modular neural networks, classification, cooperative decision making, performance comparison.

1. Introduction

Modular Neural Networks (MNNs) present a new trend in neural network (NN) architectural designs. Motivated by the highly modular biological network (Murre, 1992), artificial NN designers aim to build architectures which are more "scalable" and less subject to interference than the traditional non-modular NNs.

There is now a wide variety of MNN designs for classification. Non-modular classifiers tend to introduce high internal interference because of the strong coupling among their hidden-layer weights (Jacobs et al., 1991). On the other hand, complex tasks tend to introduce a wide range of overlap which, in turn, causes a wide range of deviations from efficient learning in the different regions of the input space (Auda et al., 1995). An MNN classifier attempts to reduce the effect of these problems via a "divide and conquer" approach. It generally decomposes the large-size/high-complexity task into several sub-tasks, each of which is handled by a simple, fast, and efficient module. Then, sub-solutions are integrated via a multi-module decision-making strategy. Hence, MNN classifiers have generally proved to be more efficient than the non-modular alternatives (refer to (Alpaydin, 1993) through (Waibel, 1989)).


There is a lack of comparative studies in the MNN literature which help a potential user understand their relative merits and drawbacks. This paper is an attempt to fill this gap by comparing ten MNN classifiers, which cover the different architectural varieties, using two benchmark classification problems. Special attention will be given to the features special to MNNs, such as task-decomposition algorithms and multi-module decision-making strategies. Section 2 surveys the compared MNNs and the benchmark applications. Section 3 is a detailed performance analysis, based on which Section 4 gives our suggestions for enhancing the performance of each of the considered modular designs.

2. Experimental Set-up

The compared networks are: Decoupled; Other-output (both in (de Bollivier et al., 1991)); ART-BP (Tsai et al., 1994); Hierarchical (Corwin et al., 1994; Raafat and Rashwan, 1993); Multiple-experts (Jacobs et al., 1991; Jacobs, 1990); Ensemble with majority vote (Alpaydin, 1993); Ensemble with average vote (Battiti and Colla, 1994); Merge-glue (Hackbarth and Mantel, 1991; Waibel, 1989); Hierarchical Competitive Neural Net (HCMNN) (Auda, 1996); and Cooperative Modular Neural Net (CMNN) (Auda, 1996; Auda et al., 1996; Auda et al., 1995). The networks' typical architectures are given in Figures 1–3. These networks cover the whole range of known MNN designs with respect to task-decomposition (manually, using unsupervised networks, and during learning according to error), learning (one-phase, and two-phase with merging of the first-phase networks), multi-module decision-making (competition and cooperation), and structures (decoupled, merged, hierarchical, and ensemble).

To guarantee a fair comparison among the networks, we have built our own version of these models, where all design aspects other than the architecture are unified. The supervised and unsupervised learning schemes applied in all networks are Backpropagation (BP) and the Adaptive Resonance Theory (ART), respectively. All learning parameters are unified, including the criteria for choosing the number of hidden nodes and for stopping learning.

The number of modules is unified, as well as the distribution of classes over the modules, wherever applicable. Finally, all learning, validation, and testing sets are unified. Refer to Figure 1 with the following outline of the considered models.

Decoupled modules. The version implemented here uses an ART unsupervised network (Bartfai, 1994) for decomposing the classes into several groups, each of which is assigned to one supervised module that separates its classes. Modules are trained in parallel. The final classification decision is taken according to the absolute maximum activation over all the modules (de Bollivier et al., 1991). There is no interaction among the different modules.
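
As a concrete illustration, the following minimal sketch (in Python, with hypothetical function and variable names of our own) shows the absolute-maximum decision rule applied to the outputs of decoupled modules; it is not the original implementation.

```python
import numpy as np

def decoupled_decision(module_outputs, module_classes):
    """Return the class behind the single highest activation across all modules."""
    best_class, best_act = None, -np.inf
    for outputs, classes in zip(module_outputs, module_classes):
        j = int(np.argmax(outputs))
        if outputs[j] > best_act:
            best_act, best_class = outputs[j], classes[j]
    return best_class

# Example: module 0 handles classes 1 and 2, module 1 handles classes 3 and 4.
print(decoupled_decision([np.array([0.2, 0.7]), np.array([0.9, 0.1])],
                         [[1, 2], [3, 4]]))   # prints 3
```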

Figure 1. Eight of the ten compared MNN architectures: Decoupled modules, Other-output model, ART-BP model, Hierarchical modules, Multiple experts, Ensemble (majority vote), Ensemble (average vote), and HCMNN.

Other-output model. The version implemented here uses a task-decomposition technique similar to the Decoupled model's. Modules are also trained in parallel. However, a cooperation strategy among the different modules is suggested for decision-making instead of the absolute-maximum decision. This is achieved by defining an "other" output bit at each module's output. This output gives an indication that the considered testing sample does not belong to this module's classes (de Bollivier et al., 1991).
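
A rough sketch of how the "other" output bit might be used at decision time is given below; the arbitration rule (penalizing a module's best class activation by its "other" activation) is our own assumption for illustration, not necessarily the rule of de Bollivier et al.

```python
import numpy as np

def other_output_decision(module_outputs, module_classes):
    """Each module's final output node is an 'other' bit (high = not my classes).
    Score each module's best class by its activation minus the 'other' activation,
    and take the best-scoring class overall (the subtraction is our assumption)."""
    best_class, best_score = None, -np.inf
    for outputs, classes in zip(module_outputs, module_classes):
        other = outputs[-1]              # rejection evidence from this module
        class_acts = outputs[:-1]
        j = int(np.argmax(class_acts))
        score = class_acts[j] - other
        if score > best_score:
            best_score, best_class = score, classes[j]
    return best_class
```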

ART-BP network. The version implemented here uses a similar task-decomposition technique to the Decoupled and the Other-output models. Modules are trained in parallel. The decision making relies on the ART network (previously used for task-decomposition) in "directing" the testing sample towards one of the "competing" modules. Similar to the version in (Tsai et al., 1994), a low-vigilance ART is used in order to guarantee low sensitivity for the competition process.

Hierarchical network. There are different ideas for building a hierarchy of networks before a final stage of supervised classifiers, for example, (Corwin et al., 1994) and (Raafat and Rashwan, 1993). The version implemented here is a two-level hierarchy of BPs similar to (Corwin et al., 1994). The "stem" BP classifies groups of classes (defined by a low-vigilance ART similar to the previous network), and the "leaf" BPs offer finer classification within each group.
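
The two-level inference can be sketched as follows, with the stem and leaf networks represented by hypothetical callables; this illustrates the control flow only, not the original implementation.

```python
import numpy as np

def hierarchical_decision(x, stem, leaves):
    """Two-level hierarchy: the 'stem' net picks a group of classes,
    then that group's 'leaf' net picks the class within the group.
    stem(x) returns group activations; leaves[g](x) returns
    (class_labels, activations) for group g (hypothetical callables)."""
    group = int(np.argmax(stem(x)))
    labels, acts = leaves[group](x)
    return labels[int(np.argmax(acts))]
```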

Multiple-experts network. During learning, and as learning samples are presented to the network, the gating module performs task-decomposition by assigning different "weights" to the different experts. These weights are set according to a probabilistic measure of their error improvement in response to the input sample (Jacobs et al., 1991; Jacobs, 1990).
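
A minimal sketch of the gated combination and of the per-expert "responsibility" computation, assuming the Gaussian error model of Jacobs et al. (1991), is given below; the function names and interfaces are ours.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def moe_forward(x, experts, gate):
    """Blend expert outputs with gating weights, in the spirit of Jacobs et al. (1991).
    experts: list of callables returning an output vector for x;
    gate: callable returning one raw score per expert."""
    g = softmax(gate(x))                       # gating weights, sum to 1
    ys = np.stack([e(x) for e in experts])     # one row per expert
    return g @ ys, g, ys

def responsibilities(target, ys, g):
    """Posterior 'responsibility' of each expert for the target, assuming a
    Gaussian error model; experts that improve the error most get more credit."""
    lik = np.exp(-0.5 * np.sum((target - ys) ** 2, axis=1))
    h = g * lik
    return h / h.sum()
```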

Ensemble networks. These are multiple identical (differently initialized) non-modular networks, each of which solves the whole classification task. The final classification decision is taken by a simple majority vote, where each network nominates one class (Alpaydin, 1993; Battiti and Colla, 1994). An alternative decision-making strategy for the Ensemble networks "averages" all outputs for each class, i.e., it considers the values of the NN outputs, and the maximum average is chosen as the classification decision (Alpaydin, 1993).
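
The two voting rules can be written compactly as follows (a minimal sketch; the tie-breaking behaviour and the example activations are arbitrary).

```python
import numpy as np
from collections import Counter

def majority_vote(outputs):
    """outputs: (n_networks, n_classes); each network nominates its argmax class."""
    votes = [int(np.argmax(o)) for o in outputs]
    return Counter(votes).most_common(1)[0][0]

def average_vote(outputs):
    """Average the per-class activations over the ensemble, then take the maximum."""
    return int(np.argmax(np.mean(outputs, axis=0)))

outs = np.array([[0.60, 0.30, 0.10],
                 [0.20, 0.50, 0.30],
                 [0.40, 0.45, 0.15]])
print(majority_vote(outs), average_vote(outs))   # prints: 1 1
```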

HCMNN. A two-level hierarchy of ARTs of low and high vigilance directs the sample to one supervised module. The different output nodes of each ART "compete" for firing one of the higher-level modules (Auda, 1996). Each supervised module is then dedicated to classifying a number of classes, and the modules do not interact.

Figure 2. Merge-and-glue network. Two learning phases.

Merge-and-glue network. In this model, task decomposition is performed heuristically, according to human experience with the considered problem. Hence, classes are assigned to the different supervised modules accordingly (Hackbarth and Mantel, 1991; Waibel, 1989). Modules are separately trained until they reach satisfactory performance. Then, keeping the values of all trained connection weights, all hidden and output layers are "merged" to form a large net. Learning can then be resumed, beginning from a "smarter" point in the error space. Some hidden nodes can be added as "glue", i.e., in order to discover more features than what the modules' hidden layers have discovered (Hackbarth and Mantel, 1991; Waibel, 1989) (Figure 2).
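
The merging step can be sketched as follows for single-hidden-layer modules: the trained weight blocks are copied into a larger block-structured network, and the cross-module and "glue" weights start near zero. The layout and initialization constants below are our assumptions, not the paper's.

```python
import numpy as np

def merge_and_glue(W1_list, W2_list, n_glue, rng=np.random.default_rng(0)):
    """Merge separately trained single-hidden-layer modules into one large net.
    W1_list[m]: (hidden_m, n_inputs) input-to-hidden weights of module m;
    W2_list[m]: (out_m, hidden_m) hidden-to-output weights of module m;
    n_glue: extra randomly initialized 'glue' hidden nodes.
    Trained blocks are copied; cross-module and glue weights start near zero."""
    n_in = W1_list[0].shape[1]
    n_hidden = sum(W.shape[0] for W in W1_list) + n_glue
    n_out = sum(W.shape[0] for W in W2_list)
    W1 = rng.normal(0.0, 0.01, size=(n_hidden, n_in))
    W2 = rng.normal(0.0, 0.01, size=(n_out, n_hidden))
    h0 = o0 = 0
    for A, B in zip(W1_list, W2_list):
        h, o = A.shape[0], B.shape[0]
        W1[h0:h0 + h, :] = A                 # keep the learned input-to-hidden block
        W2[o0:o0 + o, h0:h0 + h] = B         # keep the learned hidden-to-output block
        h0, o0 = h0 + h, o0 + o
    return W1, W2                            # resume backpropagation from this point
```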

Cooperative Modular Neural Network (CMNN). A 3-level hierarchy of ARTs with rising vigilance from bottom to top is used to cluster the considered classes into groups at 3 levels of overlap. The first two levels are similar to the HCMNN's. The level of non-overlapping clusters is handled by the low-vigilance ART. The middle-level groups of classes are handled by a group of supervised modules, each of which recognizes its own classes plus the outer boundary of the other groups. All modules then vote to determine the group and the class of the tested sample (Auda, 1996; Auda et al., 1995) (Figure 3). Finally, a group of "specialized modules" is assigned to the high-overlap regions in the input space detected by the high-vigilance ARTs.
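
One plausible reading of the group-voting step is sketched below for illustration only; the exact scoring used in the CMNN is described in (Auda, 1996), and the names and data layout here are ours.

```python
import numpy as np

def cmnn_group_vote(group_scores):
    """group_scores[m][g]: module m's support for group g (for its own group the
    maximum of its class activations, for other groups its 'outer boundary' output).
    Each module votes for its top group; the most-voted group wins, and that
    group's module then picks the class among its own few classes."""
    n_groups = len(group_scores[0])
    votes = np.bincount([int(np.argmax(s)) for s in group_scores], minlength=n_groups)
    return int(np.argmax(votes))
```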

Two different data sets are used for comparing all MNNs, viz., the 2-dimension 20-class recognition problem (Auda, 1996; Auda et al., 1996; Auda et al., 1995), and the Cleveland Heart Diseases database (Murphy and Aha, 1994). In the 2-dimension problem, 2-feature samples are randomly generated from Gaussian distributions around 20 random class means (Figure 4). This data set is used to give a clear illustration of the shapes of the resulting decision boundaries for the sake of performance evaluation. It is a moderately complex data set which has distinct clusters of classes and simple decision boundaries in most regions. There is, however, high overlap between classes 19 and 20.
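
A rough sketch of how such a data set can be generated is shown below; the number of samples per class and the spread of the Gaussians are placeholders, since the paper does not report them here.

```python
import numpy as np

rng = np.random.default_rng(42)
n_classes, n_per_class, sigma = 20, 100, 0.05          # counts and spread are placeholders
means = rng.uniform(0.0, 1.0, size=(n_classes, 2))     # 20 random class means in 2-D
X = np.vstack([rng.normal(mu, sigma, size=(n_per_class, 2)) for mu in means])
y = np.repeat(np.arange(n_classes), n_per_class)
print(X.shape, y.shape)   # (2000, 2) (2000,)
```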

The Cleveland Heart Diseases data set is a 5-class classification problem which represents one "healthy" and four "infected" classes of heart patients. The input space is 13-dimensional, representing 13 different patient attributes. It has a particularly difficult overlap problem among 3 of the 5 classes (classes 2, 3, and 5). None of the classes is "unique" from the classification point of view. Therefore, the Cleveland data represent a highly complex classification task with a special overlap problem.

Figure 3. Decision-making major steps in CMNN.

Figure 4. The 20-class problem.

In both sets of experiments, the following set-up is used. Three data sets are used: learning, validation, and testing sets. However, the Cleveland data does not have a validation set. When a validation set is available, the stopping criterion is to save the best network, with respect to this set, over 120,000 iterations over the data records; a test for the best network is performed every 1,000 iterations. Otherwise, we have used the RMS error at the output layer with a threshold of 0.001. The incoming weights to each layer are initialized from a uniform distribution between −1 and +1. The learning scheme used is Backpropagation with extended Delta-Bar-Delta, a tanh transfer function, and softmax outputs. The runs are carried out up to 4 times and the best performing module is chosen.
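
For concreteness, the initialization and stopping rules described above can be sketched as follows; the training loop itself is omitted and the helper names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(n_inputs, n_nodes):
    """Uniform weight initialization in [-1, +1], as used for every compared network."""
    return rng.uniform(-1.0, 1.0, size=(n_nodes, n_inputs))

def rms_error(outputs, targets):
    """RMS error at the output layer."""
    return float(np.sqrt(np.mean((np.asarray(outputs) - np.asarray(targets)) ** 2)))

# Stopping logic (training loop omitted): with a validation set, keep the best
# network seen while checking every 1,000 of 120,000 iterations; without one,
# stop once the output-layer RMS error drops below 0.001.
MAX_ITERS, CHECK_EVERY, RMS_THRESHOLD = 120_000, 1_000, 0.001

def should_stop(iteration, train_rms, has_validation_set):
    if has_validation_set:
        return iteration >= MAX_ITERS   # the best-on-validation snapshot is kept elsewhere
    return train_rms < RMS_THRESHOLD
```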

3. Performance Analysis

The correct classification percentage and "the number of modified weights" for all models with the two data sets are summarized in Figure 5. The correct classification percentage is a global measure of the "generalization" abilities of the network. Although the number of iterations is the traditional way of measuring a network's learning speed, the computational effort per iteration varies greatly from one network to the other. Hence, computational work, measured by the number of modified weights during learning, is a better measure for comparing learning speed, as well as the CPU time consumed during learning.

Decoupled modules. Due to the simplicity of the sub-problems, using separate decoupled modules requires little storage and gives very fast learning. The average accuracy of the modules (when each one is tested using its own classes) is quite high (77.53% and 85.00% for the 2-dimension and the Cleveland data, respectively). However, when tested using the whole testing set with the absolute maximum of the outputs taken as the final decision, the model gives poor performance. Because every sub-net has no information about the "others", each module has a completely different "perception" of the input space, and hence no inhibition is applied to a high reaction towards a class which belongs to a wrong group. Performance is better in all other models, which have some organization among their modules.

Other-output model. This model adds a simple "cooperation strategy" to the Decoupled model by teaching each of the two modules the "outer boundaries" of the "other" module. The result is a jump in the correct recognition percentage on the 2-dimension data. However, the "other" node provides only one more bit of information about the testing sample and is hence insufficient for problems where the number of modules is larger than two. Moreover, teaching a module the "other" category is a difficult task due to the large number of classes which form it. Also, if we assume an equal number of available training samples, teaching the "other" decision using all "other" samples would give an imbalance in the number of training samples per module output. This imbalance has been shown to degrade the overall learning efficiency (Auda et al., 1995).

Figure 5. The MNNs perform relatively similarly on the two data sets.

ART-BP. The coarse clustering of the low-vigilance ART applied to the 2-dimension data produces two groups with no overlap and a clear distinction in the feature space. Hence, it introduces no group-classification errors during testing. However, the higher-level supervised BPs are relatively large. Hence, they have "scalability problems" similar to the large non-modular net, although to a lesser extent. Therefore, the performance of the large BP is enhanced, but not up to the performance of the Other-output model, HCMNN, or CMNN. For the Cleveland data, the low-vigilance ART does not separate any classes, and hence ART-BP reduces to a non-modular BP network.

Hierarchical network. The stem BP was as perfect as the low-vigilance ART used in ART-BP for the 2-dimension data. Therefore, performance is similar to the ART-BP. However, on the Cleveland data, the stem BP, unlike the low-vigilance ART, is able to separate the 3 groups, and performance was enhanced. It should be noted that this model lags behind the ART-BP in speed and computational effort.

Multiple experts. Because the gating network depends on the modules' output error in decomposing the task, the separated sub-tasks should be clearly distinct. Otherwise, all sub-tasks will be learned equally by all modules, and performance will be equal to or only slightly better than that of the non-modular network. That is what we observed during the learning process on both data sets. The strength of this network appears in function approximation tasks, when the gating network is able to assign fundamentally different functions to the expert modules (e.g., the "what" and "where" vision tasks (Jacobs et al., 1991)). To implement the multiple experts' internal relations, a large number of connections is used (NeuralWare, 1993). This results in a large memory requirement and a large number of modified weights before convergence.


Ensemble networks. Although using different initialization schemes for the different networks may lead to different approaches to the solution, all of the modules still have the drawbacks of non-modular NNs. Hence, they tend to introduce similar errors (e.g., failure to recognize classes 3, 16, and 17 in the 2-dimension data, and confusing classes 3 and 5 in the Cleveland data). Performance is only slightly better than the average performance of the ensemble; it could not even beat the best performance given by a single network. Majority vote results in a large number of "ties" among class winners, which degraded the performance. Moreover, this voting strategy only considers the maximum output activation and does not consider the information available at the other outputs. Although we do not believe that the values at the network's outputs indicate a posteriori probabilities, we are convinced, empirically, that they indicate how near the sample is to the output's class. Hence, using the average vote of the output activations slightly enhanced the performance of both the 3- and 5-module ensembles, because more information is considered. Finally, the Ensemble requires the largest memory space and computational effort among the compared MNNs.

Merge-and-glue. Although merging the different modules occurs after a degree of learning "maturity", i.e., nearness to the global minimum of the error surface, the merged network is still very large. Therefore, it may still be subject to falling into local minima, and its hidden layer still suffers from tight coupling. Another drawback was noticed during the experiments: after learning develops on the modular level in phase one (before merging), resuming learning in phase two was a hard task, i.e., the network's global error remained high for a long time. It is clear that temporal crosstalk (Jacobs et al., 1991) among the hidden-layer segments in the second learning stage is so high that the overall classification performance degrades. It was also noticed in the experiments that the global accuracy of the model is sensitive to the way the classes are grouped. This makes the job of building the network's structure (based on human experience) more difficult than for all of the other considered MNNs.

HCMNN. When applied to the 2-dimension problem, it uses the perfect low-vigilance "stem" ART as well as small and accurate supervised NNs for classification. Compared to the CMNN, it gives slightly lower accuracy but much faster learning and much lower computational requirements. Most of its errors (8.40% out of the 9.60% error) occur at the "leaves" layer, where the high-vigilance ARTs are used for classification. However, the high-vigilance ARTs give almost no error in recognizing some groups: (1-3-4), (5), (15-19-20), and (16-17). This suggests that the HCMNN can be used, with comparable performance to the CMNN, in fairly non-overlapping regions in order to benefit from its more efficient resource utilization and faster learning. In the more complex environment of the Cleveland database, the high-vigilance ART fails more frequently in identifying the different groups, and performance is weak.

CMNN. Although the CMNN requires higher storage and computational effort than the non-modular BP, Decoupled nets, ART-BP, Hierarchical, and HCMNN, it enhanced the recognition accuracy substantially. For the 20-class problem, CMNN uses the low-vigilance ART to assign samples to one of the two distinct clusters, and hence guarantees its accuracy. Then, it uses the cooperative voting scheme to determine the winner group (module) with high accuracy. It then selects the class from among the small number of classes handled by the winner module. Finally, it focuses, with its 19-20 specialized module, on the boundary between classes 19 and 20, and succeeds in drawing it accurately. This 19-20 boundary was poorly drawn by all other models. Figure 6 illustrates the performance of the CMNN versus the best Ensemble module. The effect of the voting scheme and the specialized modules becomes apparent when performance degrades in their absence. For the Cleveland data, the significance of the cooperative voting scheme and the specialized modules is even more obvious. The 2-3-5 specialized module corrected more than 50% of the confusions among these classes made by the other models.

Figure 6. CMNN realized some of the complexities between classes 2 and 3, and 19 and 20, while the best single BP, used for the 2 types of Ensemble nets, could not.

4. Conclusion

The Decoupled-modules model can only be used for very simple classification problems. We suggest defining a "minimum threshold" between the maximum and the second-maximum output activations in order to reduce errors and enhance the performance.
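
The suggested minimum-threshold rule can be sketched as follows; the margin value is a placeholder, not a figure from the paper.

```python
import numpy as np

def thresholded_decision(outputs, classes, min_margin=0.2):
    """Absolute-maximum decision with the suggested minimum margin between the
    top two activations; reject (return None) when the margin is too small.
    The 0.2 margin is a placeholder value, not taken from the paper."""
    acts = np.asarray(outputs, dtype=float)
    order = np.argsort(acts)[::-1]
    if acts[order[0]] - acts[order[1]] < min_margin:
        return None                     # defer instead of guessing
    return classes[order[0]]
```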

The Other-output model is better than the Decoupled modules as it suggests a form of cooperation among the modules. However, it gives precise classification decisions only for 2-class problems. We suggest a form of error feedback, given during learning by the other-output node to the other modules. This feedback would help enhance the global classification performance.

The Multiple-experts model is more suitable for function approximation applications, provided that fundamentally different sub-functions can be defined, so that they can be assigned to the different modules. This model needs an error-independent criterion in the task-decomposition equation in order to suit complex classification problems.

ART-BP and HCMNN work well (fast learning and reasonable recognition accuracy) when there are clear separations between clusters of classes in the input space, i.e., when there exist groups of classes with unique common features. However, as the classification task becomes more complex, the ART, like all other unsupervised networks which use fixed-distance criteria, becomes incapable of drawing accurate decision boundaries. In this case, a Hierarchical structure of supervised modules would be better, although costly from the computational point of view.

The Ensemble model is a good alternative for complex problems. However, since it carries out no task-decomposition, it can only handle small- or moderate-size tasks. If the number of classes is large, some task-decomposition has to be carried out, and another architecture, which deals with sub-tasks in separate modules, has to be used. To enhance the Ensemble model's performance, we suggest that the design allow some cooperation among the modules during learning, for example, by applying voting during learning and using the error feedback to find a better configuration of weights.

Similarly, Merge-glue can only handle small- or moderate-size tasks. Moreover, some caution is needed to prevent the large network (after the merge) from falling into local minima. We suggest starting the second learning phase (Figure 2) before the first phase matures, i.e., before the modules' local errors decline substantially. This will decrease the effect of temporal crosstalk in phase two.

Finally, the CMNN is specifically useful for applications with a wide range of overlap in the input space. The CMNN's "group outputs" at the modules' outputs result in harder sub-tasks than those of the Decoupled, Other-output, ART-BP, Hierarchical, and Merge MNNs. However, they give enough information to enable the voting scheme to assign testing samples to their correct modules. Moreover, the specialized modules dedicated to the high-overlap regions are capable of drawing quite complex boundaries.

In general, cooperative and ensemble schemes prove to be more efficient and capable of handling more complex problems than the decoupled and competitive approaches. Their trade-off, however, is an increase in the required computational work during learning.

References

1. Alpaydin, E.: 1993, Multiple networks for function learning, in: Int. Conf. on Neural Networks, Vol. 1, CA, USA, 1993, pp. 9–14.

2. Auda, G., Kamel, M., and Raafat, H.: 1995, Voting schemes for cooperative neural network classifiers, in: IEEE International Conference on Neural Networks, ICNN'95, Vol. 3, Perth, Australia, November 1995, pp. 1240–1243.

3. Auda, G., Kamel, M., and Raafat, H.: 1996, Modular neural network architectures for classification, in: International Conference on Neural Networks (ICNN96), Vol. 2, Washington, D.C., USA, June 3–6, 1996, pp. 1279–1284.

4. Auda, G.: 1996, Cooperative modular neural network classifiers, PhD thesis, University of Waterloo, Systems Design Engineering Department, Canada.

5. Bartfai, G.: 1994, Hierarchical clustering with ART neural networks, in: World Congress on Computational Intelligence, Vol. 2, Florida, USA, June 1994, pp. 940–944.

6. Battiti, R. and Colla, A.: 1994, Democracy in neural nets: Voting schemes for classification, Neural Networks 7(4), 691–707.

7. Corwin, E., Greni, S., Logar, A., and Whitehead, K.: 1994, A multi-stage neural network classifier, in: World Congress on Neural Networks, Vol. 3, San Diego, USA, June 5–9, 1994, pp. 198–203.

8. De Bollivier, M., Gallinari, P., and Thiria, S.: 1991, Cooperation of neural nets and task decomposition, in: Int. Joint Conf. on Neural Networks, Vol. 2, Seattle, USA, 1991, pp. 573–576.

9. Hackbarth, H. and Mantel, J.: 1991, Modular connectionist structure for 100-word recognition, in: Int. Joint Conf. on Neural Networks, Vol. 2, Seattle, USA, 1991, pp. 845–849.

10. Jacobs, R.: 1990, Task decomposition through competition in a modular connectionist architecture, PhD thesis, University of Massachusetts, Amherst, MA, USA.

11. Jacobs, R., Jordan, M., and Barto, A.: 1991, Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks, Neural Computation 3, 79–87.


12. Murphy, P. and Aha, D.: 1994, UCI repository of machine learning databases, Technical report, University of California, Irvine, USA, [http://www.ics.uci.edu/~mlearn/MLRepository.html].

13. Murre, J.: 1992, Learning and Categorization in Modular Neural Networks, Harvester Wheatsheaf.

14. NeuralWare Inc.: 1993, Neural Works II/Plus: Software Manual, NeuralWare Inc., PA, USA.

15. Raafat, H. and Rashwan, M.: 1993, A tree structured neural network, in: Int. Conf. on Document Analysis and Recognition (ICDAR93), Japan, October 1993, pp. 939–942.

16. Tsai, W., Tai, H., and Reynolds, A.: 1994, An ART2-BP supervised neural net, in: World Congress on Neural Networks, Vol. 3, San Diego, USA, June 5–9, 1994, pp. 619–624.

17. Waibel, A.: 1989, Modular construction of time-delay neural networks for speech recognition, Neural Computation 1, 39–46.
