
Network: Comput. Neural Syst. 9 (1998) 279–302. Printed in the UK. PII: S0954-898X(98)91912-1

Generalization and exclusive allocation of credit in unsupervised category learning

Jonathan A Marshall† and Vinay S Gupta
Department of Computer Science, CB 3175, Sitterson Hall, University of North Carolina, Chapel Hill, NC 27599–3175, USA

Received 19 August 1997

Abstract. A new way of measuring generalization in unsupervised learning is presented. The measure is based on an exclusive allocation, or credit assignment, criterion. In a classifier that satisfies the criterion, input patterns are parsed so that the credit for each input feature is assigned exclusively to one of multiple, possibly overlapping, output categories. Such a classifier achieves context-sensitive, global representations of pattern data. Two additional constraints, sequence masking and uncertainty multiplexing, are described; these can be used to refine the measure of generalization. The generalization performance of EXIN networks, winner-take-all competitive learning networks, linear decorrelator networks, and Nigrin's SONNET-2 network is compared.

1. Generalization in unsupervised learning

The concept of generalization in pattern classification has been extensively treated in the literature on supervised learning, but rather little has been written on generalization with regard to unsupervised learning. Indeed, it has been unclear what generalization even means in unsupervised learning. This paper provides an appropriate definition for generalization in unsupervised learning, a metric for generalization quality, and a qualitative evaluation (using the metric) of generalization in several simple neural network classifiers.

The essence of generalization is the ability to appropriately categorize unfamiliar patterns, based on the categorization of familiar patterns. In supervised learning, the output categorizations for a training set of input patterns are given explicitly by an external teacher, or supervisor. Various techniques have been used to ensure that test patterns outside this set are correctly categorized, according to an external standard of correctness.

After supervised learning, a system's ability to generalize can be measured in terms of task performance. For instance, a face-recognition system can be tested using different image viewpoints or illumination conditions, and performance can be evaluated in terms of how accurately the system's outputs match the actual facial identities in images. However, in some situations, it may not be appropriate to measure the ability to generalize in terms of performance on a specific task. For example, on which task would one measure the generalization quality of the human visual system? Human vision is capable of so many tasks that no one task is appropriate as a benchmark. In fact, much of the power of the human visual system is its usefulness in completely novel tasks.

For general-purpose systems, like the human visual system, it would be useful to define generalization performance in a task-independent way, rather than in terms of a specific task.

† E-mail: [email protected]


It would be desirable to have a 'general-purpose' definition of generalization quality, such that if a system satisfies the definition, it is likely to perform well on many different tasks. This paper proposes such a general-purpose definition, based on unsupervised learning. The definition measures how well a system's internal representations correspond to the underlying structure of its input environment, under manipulations of context, uncertainty, multiplicity, and scale (Marshall 1995). For this definition, a good internal representation is the goal, rather than good performance on a particular task.

In unsupervised learning, input patterns are assigned to output categories based on some internal standard, such as similarity to other classified patterns. Patterns are drawn from a training environment, or probability distribution, in which some patterns may be more likely to occur than others. The classifications are typically determined by this input probability distribution, with frequently-occurring patterns receiving more processing and hence finer categories. The categorization of patterns with a low or zero training probability, i.e. the generalization performance, is determined partly by the higher-probability patterns. There may be classifier systems that categorize patterns in the training environment similarly, but which respond to unfamiliar patterns in different ways. In other words, the generalizations produced by different classifiers may differ.

2. A criterion for evaluating generalization

How can one judge whether the classifications and parsings that an unsupervised classifier generates are good ones? Several criteria (e.g., stability, dispersion, selectivity, convergence, and capacity) for benchmarking unsupervised neural network classifier performance have been proposed in the literature. This paper describes an additional criterion: an exclusive allocation (or credit assignment) measure (Bregman 1990, Marshall 1995). Exclusive allocation as a criterion for evaluating classifications was first discussed by Marshall (1995). This paper refines and formalizes the intuitive concept of exclusive allocation, and it describes in detail how exclusive allocation can serve as a measure for generalization in unsupervised classifiers.

This paper also describes two regularization constraints, sequence masking and uncertainty multiplexing, which can be used to evaluate further the generalization performance of unsupervised classifiers. In cases where there exist multiple possible classifications that would satisfy the exclusive allocation criterion, these regularizers allow a secondary measurement and ranking of the quality of the classifications.

The principle of credit assignment states that the 'credit' for a given input feature should be assigned, or allocated, exclusively to a single classification. In other words, any given piece of data should count as evidence for one pattern at a time and should be prevented from counting as evidence for multiple patterns simultaneously. This intuitively simple concept has not been stated in a mathematically precise way; such a precise statement is given in this paper.

There are many examples (e.g., from visual perception of orientation, stereo depth, and motion grouping, from visual segmentation, from other perceptual modalities, and from 'blind source separation' tasks) where a given datum should be allowed to count as evidence for only one pattern at a time (Bell and Sejnowski 1995, Comon et al 1991, Hubbard and Marshall 1994, Jutten and Herault 1991, Marshall 1990a, c, Marshall et al 1996, 1997, 1998, Morse 1994, Schmitt and Marshall 1998). A good example comes from visual stereopsis, where a visual feature seen by one eye can be potentially matched with many visual features seen by the other eye (the 'correspondence' problem). Human visual systems assign the credit for each such monocular visual feature to at most one unique binocular match; this property is known as the uniqueness constraint (Marr 1982, Marr and Poggio 1976). In stereo transparency (Prazdny 1985), individual visual features should be assigned to the representation of only one of multiple superimposed surfaces (Marshall et al 1996).

2.1. Neural network classifiers

A neural network categorizes an input pattern by activating some classifier output neurons. These activations constitute a representation of the input pattern, and the input features of that pattern are said to be assigned to that output representation. An input pattern that is not part of the training set, but which contains features present in two or more training patterns, can exist. Such an input is termed a superimposition of input patterns. Presentation of superimposed input patterns can lead to simultaneous activation of multiple representations (neurons).

2.2. An exclusive allocation measure

One way to define an exclusive allocation measure for a neural network classifier is to specify how input patterns (both familiar and unfamiliar) should ideally be parsed, in terms of a given training environment (the familiar patterns), and then to measure how well the network's actual parsings compare with the ideal. Consider, for instance, the network shown in figure 1, which has been trained to recognize patterns ab and bc (Marshall 1995). Each output neuron is given a 'label' (ab, bc) that reflects the familiar patterns to which the neuron responds. The parsings that the network generates are evaluated in terms of those labels. When ab or bc is presented, then the 'best' parsing is for the correspondingly labelled output neuron to become fully active and for the other output neuron to become inactive (figure 1(A)). In a linear network, when half a pattern is missing (say the input pattern is a), and the other half does not overlap with other familiar patterns, the corresponding output neuron should become half-active (figure 1(B)).

However, when the missing half renders the pattern's classification ambiguous (say the input pattern is b), the partially matching alternatives (ab and bc) should not both be half-active. Instead, the activation should be distributed among the partially matching alternatives. One such parsing, in which the activation from b is distributed equally between ab and bc, results in two activations at 25% of the maximum level (figure 1(C)). This parsing would represent the network's uncertainty about the classification of input pattern b.

Another such parsing, in which the activation from b is distributed unequally, to ab and not to bc, results in 50% activation of neuron ab (figure 1(D)). This parsing would represent a 'guess' by the network that the ambiguous input pattern b should be classified as ab. Although the distribution of credit from b to ab and bc is different in the two parsings of figure 1(C) and 1(D), both parsings allocate the same total amount of credit. (An additional criterion, 'uncertainty multiplexing,' which distinguishes between parsings like the ones in figures 1(C) and 1(D), will be presented in subsection 4.7.)

By the same reasoning, it would be incorrect to parse pattern abc as ab (figure 1(E)), because then the contribution from c is ignored. (That can happen if the inhibition between ab and bc is too strong.) It would also be incorrect to parse abc as ab + bc (figure 1(F)) because b would be represented twice.

A correct parsing in this case would be to equally activate neurons ab and bc at 75% of the maximum level (figure 1(G)). That this is correct can be verified by comparing the sum of the input signals, 1 + 1 + 1 = 3, with the sum of the 'size-normalized' output signals. Each output neuron encodes a pattern of a certain preferred 'size' (or 'scale') (Marshall 1990b, 1995), which in figure 1 is the sum of the weights of the neuron's input connections. The sum of the input weights to neuron ab is 1 + 1 + 0 = 2, and the sum of the input weights to neuron bc is 0 + 1 + 1 = 2. Thus, both of these output neurons are said to have a size of 2. The size-normalized output signal for each output neuron is computed by multiplying its activation by its size. The sum of the size-normalized output signals in figure 1(G) is (0.75 × 2) + (0.75 × 2) = 3. Because this equals the sum of the input signals, the exclusively-allocated parsing in figure 1(G) is valid (unlike the parsings in figure 1(E) and 1(F)).

Figure 1. Parsings for exclusive allocation. (A) Normal parsing; the familiar pattern ab activates the correspondingly labelled output neuron. (B) The unfamiliar pattern a half-activates the best-matching output neuron, ab. (C) The unfamiliar input pattern b matches ab and bc equally well, and its excitation credit is divided equally between the corresponding two output neurons, resulting in a 25% activation for each of the two neurons. (D) The excitation credit from b is allocated entirely to neuron ab (producing a 50% activation), and not to neuron bc. (E) Incorrect parsing in response to unfamiliar pattern abc: neuron ab is fully active, but the credit from input unit c is lost. (F) Another incorrect parsing of abc: the credit from unit b is counted twice, contributing to the full activation of both neurons ab and bc. (G) Correct parsing of abc: the credit from b is divided equally between the best matches ab and bc, resulting in a 75% activation of both neurons ab and bc. (Redrawn with permission, from Marshall (1995), copyright Elsevier Science.)
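The size-normalization check just described for figure 1(G) can be written out in a few lines. The sketch below is ours, not code from the paper; the pattern names and neuron sizes are simply read off figure 1.

```python
# Minimal sketch: checking whether total size-normalized output matches
# total input for three candidate parsings of pattern abc (figure 1).

def size_normalized_total(activations, sizes):
    """Sum of each output activation multiplied by its neuron's size."""
    return sum(a * s for a, s in zip(activations, sizes))

sizes = {"ab": 2, "bc": 2}          # sums of input weights to each neuron
total_input = 1 + 1 + 1             # input pattern abc

parsings = {
    "fig 1(E): ab only":  {"ab": 1.00, "bc": 0.00},
    "fig 1(F): ab + bc":  {"ab": 1.00, "bc": 1.00},
    "fig 1(G): 75% each": {"ab": 0.75, "bc": 0.75},
}

for name, acts in parsings.items():
    total_output = size_normalized_total(acts.values(), [sizes[k] for k in acts])
    verdict = "valid" if total_output == total_input else "invalid"
    print(f"{name}: output {total_output} vs input {total_input} -> {verdict}")
```

Only the figure 1(G) parsing balances the two totals, in line with the argument above.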

Given the examples above, exclusive allocation can be informally defined as the conjunction of the following pair of conditions. Exclusive allocation is said to be achieved when:

• Condition 1. The activation of every output neuron is accounted for exactly once by the input activations.

• Condition 2. The total input equals the total size-normalized output, as closely as possible.

These two informal exclusive allocation conditions are made more precise in subsequent sections. They are used below to evaluate the generalization performance of several neural network classifiers.

3. Generalization performance of several networks

3.1. Response to familiar and unfamiliar patterns

This section compares the generalization performance of three neural network classifiers: a winner-take-all network, an EXIN network, and a linear decorrelator network. First, each network will be described briefly.

3.1.1. Winner-take-all competitive learning. Among the simplest unsupervised learning procedures is the winner-take-all (WTA) competitive learning rule, which divides the space of input patterns into hyper-polyhedral decision regions, each centered around a 'prototype' pattern. The ART-1 network (Carpenter and Grossberg 1987) and the Kohonen network (Kohonen 1982) are examples of essentially WTA neural networks. When an input pattern first arrives, it is assigned to the one category whose prototype pattern best matches it. The activation of neurons encoding other categories is suppressed (e.g., through strong inhibition). The prototype of the winner category is then modified to make it slightly closer to the input pattern. This is done by strengthening the winner's input connection weights from features in the input pattern and/or weakening the winner's input connection weights from features not in the input pattern. In these networks, generalization is based purely on similarity of patterns to individual prototypes.
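A rough illustration of the WTA procedure just described is sketched below. This is not the ART-1 or Kohonen equations; the distance-based matching, learning rate, and initialization are our assumptions.

```python
import numpy as np

def wta_step(prototypes, x, lr=0.1):
    """One winner-take-all competitive-learning step.

    prototypes: (n_categories, n_features) array of category prototypes
    x:          input pattern, shape (n_features,)
    """
    # Winner = category whose prototype best matches the input.
    winner = int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))
    # Move only the winner's prototype slightly closer to the input pattern:
    # strengthen weights from features in the pattern, weaken the others.
    prototypes[winner] += lr * (x - prototypes[winner])
    return winner

rng = np.random.default_rng(0)
P = rng.uniform(0.4, 0.6, size=(3, 4))     # 3 categories, 4 input features a-d
patterns = [np.array(p, dtype=float) for p in ([1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1])]
for _ in range(300):
    wta_step(P, patterns[rng.integers(3)])
print(np.round(P, 2))   # each prototype drifts toward the pattern(s) it wins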

There can exist input patterns (e.g., abc or b in figure 1) that are not part of the training set but which contain features present in two or more training patterns. Such inputs may bear similarities to multiple individual prototypes and may be quite different from any one individual prototype. However, a WTA network cannot activate multiple categories simultaneously. Hence, the network cannot parse the input in a way that satisfies the second exclusive allocation condition, so generalization performance suffers.

3.1.2. EXIN networks. In the EXIN (EXcitatory + INhibitory learning) neural network model (Marshall 1990b, 1995), this problem is overcome by using an anti-Hebbian inhibitory learning rule in addition to a Hebbian excitatory learning rule. If two output neurons are frequently coactive, which would happen if the categories that they encode overlap or have common features, the lateral inhibitory weights between them become stronger. On the other hand, if the activations of the two output neurons are independent, which would happen if the neurons encode dissimilar categories, then the inhibitory weights between them become weaker. This results in category scission between independent category groupings and allows the EXIN network to generate near-optimal parsings of multiple superimposed patterns, in terms of multiple simultaneous activations (Marshall 1995).
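The division of labour between the two rules can be caricatured as follows. This is a schematic sketch of the rule structure described above, not the actual EXIN equations; the rate constants and the particular update forms are our assumptions.

```python
import numpy as np

def exin_like_update(W_exc, W_inh, x, y, lr_e=0.05, lr_i=0.05):
    """One schematic learning step.

    W_exc: (n_out, n_in)  feedforward excitatory weights
    W_inh: (n_out, n_out) lateral inhibitory weights
    x: (n_in,) input activations;  y: (n_out,) output activations
    """
    # Excitatory rule (Hebbian, instar-style): each active output neuron
    # pulls its weight vector toward the current input pattern.
    W_exc += lr_e * y[:, None] * (x[None, :] - W_exc)

    # Inhibitory rule: the inhibitory weight between two output neurons
    # tracks how often they are coactive, so neurons coding overlapping
    # categories come to inhibit each other strongly, while neurons coding
    # independent categories end up nearly uninhibited.
    W_inh += lr_i * (np.outer(y, y) - W_inh)
    np.fill_diagonal(W_inh, 0.0)      # no self-inhibition
    return W_exc, W_inh
```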

3.1.3. Linear decorrelator networks. Linear decorrelator networks (Oja 1982, Foldiak 1989) also use an anti-Hebbian inhibitory learning rule that can cause the lateral inhibitory connections to vanish during learning. This allows simultaneous neural activations. However, the linear decorrelator network responds essentially to differences, or distinctive features (Anderson et al 1977, Sattath and Tversky 1987), among the patterns, rather than to the patterns themselves (Marshall 1995).

3.1.4. Example. Figure 2 (Marshall 1995) compares the exclusive allocation performance of winner-take-all competitive learning networks, linear decorrelator networks, and EXIN networks and illustrates the intuitions on which the rest of this paper is based. The initial connectivity pattern in the three networks is identical (figure 2(A)). The networks are trained on patterns ab, abc, and cd, which occur with equal probability. Within each of the three networks, a single neuron learns to respond selectively to each of the familiar patterns. In the WTA and EXIN networks, the neuron labelled ab develops strong input connections from a and b and weak connections from c and d. Similarly, the neurons labelled abc and cd develop appropriate selective excitatory input connections. In the WTA network, the weights on the lateral inhibitory connections among the output neurons remain uniform, fixed, and strong enough to ensure WTA behaviour. In the EXIN network, the inhibition between neurons ab and abc and between neurons abc and cd becomes strong because of the overlap in the category exemplars; the inhibition between neurons ab and cd becomes weak because the category exemplars have no common features. The linear decorrelator network learns to respond to the differences among the patterns, rather than to the patterns themselves. For example, the neuron labelled abc really becomes wired to respond optimally to pattern c-and-not-d. In the linear decorrelator, the weights on the lateral connections vanish, when the responses of the three neurons become fully decorrelated.

Figure 2. Comparison of WTA, linear decorrelator, and EXIN networks. (A) Initially, neurons in the input layer project excitatory connections non-specifically to neurons in the output layer. Also, each neuron in the output layer projects lateral inhibitory connections non-specifically to all its neighbours (shaded arrows). (B), (C), (D) The excitatory learning rule causes each type of neural network to become selective for patterns ab, abc, and cd after a period of exposure to those patterns; a different neuron becomes wired to respond to each of the familiar patterns. Each network's response to pattern abc is shown. (E) In WTA, the compound pattern abcd (filled lower circles) causes the single 'nearest' neuron (abc) (filled upper circle) to become active and suppress the activation of the other output neurons. (G) In EXIN, the inhibitory learning rule weakens the strengths of inhibitory connections between neurons that code non-overlapping patterns, such as between neurons ab and cd. Then when abcd is presented, both neurons ab and cd become active (filled upper circles), representing the simultaneous presence of the familiar patterns ab and cd. (F) The linear decorrelator responds similarly to EXIN for input pattern abcd. However, in response to the unfamiliar pattern c, both WTA (H) and EXIN (J) moderately activate (partially filled circles) the neuron whose code most closely matches the pattern (cd), whereas the linear decorrelator (I) activates a more distant match (abc). (Reprinted, with permission, from Marshall (1995), copyright Elsevier Science.)

Each of the three networks responds correctly to the patterns in the training set, by activating the appropriate output neuron (figure 2(B), 2(C) and 2(D)). Now consider the response of these trained networks to the unfamiliar pattern abcd. The WTA network responds by activating neuron abc (figure 2(E)) because the input pattern is closest to this prototype. However, the response of the WTA network to pattern abcd is the same as the response to pattern abc. The activation of neuron abc can be credited to the input features a, b, and c. The input feature d is not accounted for in the output: it is not accounted for by the activation of neuron cd because the activation of neuron cd is zero. Thus, condition 1 from the pair of exclusive allocation conditions is not fully satisfied. Also, the total input (1 + 1 + 1 + 1 = 4) does not equal the total size-normalized output (0 + (1 × 3) + 0 = 3), so the WTA network does not satisfy condition 2 for input pattern abcd.

On the other hand, in the linear decorrelator and the EXIN networks, neurons ab and cd are simultaneously activated (figure 2(F) and 2(G)), because during training, the inhibition between neurons ab and cd became weak or vanished. Thus, all input features are fully represented in the output, and the exclusive allocation conditions are met for input pattern abcd, in the linear decorrelator and EXIN networks. These two networks exhibit a global context-sensitive constraint satisfaction property (Marshall 1995) in their parsing of abcd: the contextual presence or absence of small distinguishing features, or nuances (like d), dramatically alters the parsing. When abc is presented, the network groups a, b, and c together as a unit, but when d is added, the network breaks c away from a and b and binds it with d instead, forming two separate groupings, ab and cd.

It is evident from figure 2(C), 2(F) and 2(I) that the size-normalization value for each neuron must be computed not by examining the neuron's input weight values per se (which would give the wrong values for the linear decorrelator), but rather by examining the size of the training patterns to which the neuron responds. Thus, the output neuron sizes are (2, 3, 2) for all the networks shown in figure 2.

Now consider the response of the three networks to the unfamiliar pattern c. As shown in figure 2(H), 2(I) and 2(J), the WTA and the EXIN networks respond by partially activating neuron cd. However, in the linear decorrelator network, neuron abc is fully activated. Since, during training, this neuron was fully activated when the pattern abc was presented, its full activation is not accounted for by the presence of feature c alone in the input pattern. Thus, condition 1 is not satisfied by the linear decorrelator network for input pattern c. Note that abc (figure 2(I)) also does not satisfy condition 2 for pattern c, because 1 ≠ 1 × 3.

The example of figure 2 thus illustrates that allowing multiple simultaneous neural activations and learning common features, rather than distinctive features, among the input patterns enables an EXIN network to satisfy exclusive allocation constraints and to exhibit good generalization performance when presented with multiple superimposed patterns. In contrast, WTA networks (by definition) cannot represent multiple patterns simultaneously. Although linear decorrelator networks can represent multiple patterns simultaneously, they are not guaranteed to satisfy the exclusive allocation constraints.

A basic idea of this paper is that exclusive allocation provides a meaningful, self-consistent way of specifying how a network should respond to unfamiliar patterns and is therefore a valuable criterion for generalization.


3.2. Equality of total input and total output

Condition 2 will be used to compare a linear decorrelator network, an EXIN network, and a SONNET-2 (Self-Organizing Neural NETwork-2) network (Nigrin 1993). (Since a WTA network does not allow simultaneous activation of multiple category winners, it is not considered in this example.)

3.2.1. SONNET-2. SONNET-2 is a fairly complex network, involving the use of inhibition between connections (Desimone 1992, Reggia et al 1992, Yuille and Grzywacz 1989), rather than between neurons, to implement exclusive allocation. The discussion in this paper will focus on the differences in how EXIN and SONNET-2 networks implement the inhibitory competition among perceptual categories. Because the inhibition in SONNET-2 acts between connections, rather than between neurons, it is more selective. Connections from one input neuron to different output neurons compete for the 'right' to transmit signals; this competition is implemented through an inhibitory signal that is a combination of the excitatory signal on the connection and the activation of the corresponding output neuron. For example, figures 3(C), 3(F) and 3(I) show that connections from input feature b to the two output neurons compete with each other; other connections in figures 3(C), 3(F) and 3(I) do not. As in EXIN networks, the excitatory learning rule involves prototype modification of output layer competition winners, and the inhibitory learning rule is based on coactivation of the competing neurons; hence SONNET-2 displays the global context-sensitive constraint satisfaction property (abc versus abcd) and the sequence masking property (Cohen and Grossberg 1986, 1987) (abc versus c) (Nigrin 1993) displayed by EXIN networks.
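The structural difference between neuron-level and connection-level competition can be sketched as follows. This is our illustration, not Nigrin's equations; the particular way the rival signal combines a connection's excitatory signal with its target neuron's activation is an assumption.

```python
import numpy as np

def connection_level_inhibition(W, x, y):
    """Schematic inhibitory signal onto each input->output connection.

    W: (n_out, n_in) excitatory weights;  x: (n_in,) inputs;
    y: (n_out,) output activations.
    Only connections leaving the same input unit compete with one another.
    """
    exc = W * x[None, :]          # excitatory signal carried by each connection
    rival = exc * y[:, None]      # rival strength: its signal times its target's activation
    # Connection (j, i) is inhibited by the rivals in the same column i, excluding itself.
    return rival.sum(axis=0, keepdims=True) - rival
```

Under a scheme like this, the connection from a to ab and the connection from c to bc never inhibit each other (they leave different input units), which is the behaviour the text ascribes to SONNET-2 for pattern ac; a neuron-level scheme such as EXIN's would still let neurons ab and bc compete.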

3.2.2. Example. Figure 3 shows three networks (linear decorrelator, EXIN, SONNET-2) trained on patterns ab and bc, which are assumed to occur with equal training probability. The linear decorrelator network can end up in one of many possible final configurations, subject to the constraint that the output neurons are maximally decorrelated. A problem with linear decorrelators is that they are not guaranteed to come up with a configuration that responds well to unfamiliar patterns. To illustrate this point, a configuration that responds correctly to familiar patterns but does not generalize well to unfamiliar patterns has been chosen for figures 3(A), 3(D) and 3(G).

When the unfamiliar and ambiguous pattern b is presented, the EXIN and SONNET-2 networks respond correctly by activating both neuron ab and neuron bc to about 25% of their maximum level (figures 3(E) and 3(F)), thus representing the uncertainty in the classification. This parsing is considered a good one because there are two alternatives for matching input pattern b (ab and bc) and the input pattern comprises half of both alternatives. Condition 2 is thus satisfied for this input pattern. The linear decorrelator network activates both neurons ab and bc to 50% of their maximum level because the neurons receive no inhibitory input (figure 3(D)); condition 2 is not satisfied because 0 + 1 + 0 ≠ (2 × 0.5) + (2 × 0.5).

When the unfamiliar (but not ambiguous) pattern ac is presented, the two neurons in the linear decorrelator network receive a net input of zero and hence do not become active. This linear decorrelator network thus does not satisfy condition 1 for this input pattern. In the EXIN network, ab and bc are active to about 25% of their maximum activation. This behaviour arises because neurons ab and bc still exert an inhibitory influence on each other because of the overlap in their category prototypes, even though the subpatterns a and c in the pattern ac have nothing in common. The EXIN network thus does not fully satisfy condition 1 for this input pattern. On the other hand, in the SONNET-2 network, both neurons ab and bc are correctly active to 50% of their maximum level. This parsing is considered to be correct because the subpatterns a and c within the input pattern comprise half of the prototypes ab and bc respectively, and there is only one (partially) matching alternative for each subpattern. The SONNET-2 network responds in this manner because the link from input feature a to neuron ab does not compete with the link from input feature c to neuron bc.

Figure 3. Comparison of linear decorrelator, EXIN, and SONNET-2 networks. Three different networks trained on patterns ab and bc. (A), (B), (C) A different neuron becomes wired to respond to each of the familiar patterns. The response of each network to pattern ab is shown. (D) When an ambiguous pattern b is presented, in the linear decorrelator network, both neurons ab and bc are active at 50% of their maximum level; the input does not fully account for these activations (see text). (E), (F) The EXIN and SONNET-2 networks respond correctly by partially activating both neurons ab and bc, to about 25% of the maximum activation. (G) When pattern ac is presented, the two neurons in the linear decorrelator network receive a net input of zero and hence do not become active. (H) In the EXIN network, neurons ab and bc still compete with each other, even though the subpatterns a and c are disjoint. This results in incomplete representation of the input features at the output. (I) In SONNET-2, links from a to ab and from c to bc do not inhibit each other; this ensures that neurons ab and bc are sufficiently active (at about 50% of their maximum level) to fully account for the input features.

The SONNET-2 network satisfies condition 2 for all three input patterns shown in figure 3, the EXIN network satisfies condition 2 on two of the input patterns, and the linear decorrelator network satisfies condition 2 on one of the input patterns. The greater selectivity of inhibition in SONNET-2 leads to better satisfaction of the exclusive allocation constraints and thus better generalization. The example of figure 3 thus elaborates the concept of exclusive allocation and is incorporated in the formalization below.


3.3. Summary of generalization behaviour examples

Figure 4 summarizes the comparison between the networks that have been considered. A '+' in the table indicates that the given network possesses the corresponding property to a satisfactory degree; a '−' indicates that it does not. The general complexity of the neural dynamics and the architecture of the networks have also been compared; the '<' signs indicate that complexity increases from left to right in the table. WTA networks have fixed, uniform inhibitory connections and are considered to be the simplest of all the networks discussed. Linear decorrelators use a single learning rule for both excitatory and inhibitory connections; further, the inhibitory connection weights can all vanish under certain conditions. EXIN networks use slightly different learning rules for feedforward and lateral connections. SONNET-2 implements inhibition between input layer → output layer connections, rather than between neurons. Thus, sophisticated generalization performance is obtained at the cost of increased complexity.

Example                        WTA    LD    EXIN    SONNET-2
Figure 2: abc versus abcd       –      +      +        +
Figure 2: cd versus c           +      –      +        +
Figure 3: b versus ac           –      –      –        +
Complexity of network               <      <       <

Figure 4. Summary of exclusive allocation examples. The table indicates how well different networks behave on the representative examples of exclusive allocation discussed in subsections 3.1 and 3.2. Key: +, network behaved properly; −, network did not behave properly; <, network complexity increases from left to right.

4. Formal expression of generalization conditions

Section 3 compared the generalization performance of several networks qualitatively. The exclusive allocation conditions will now be framed in formal terms, so that a quantitative computation of how well a network adheres to the exclusive allocation constraints, and a quantitative measure of generalization performance, will be theoretically possible.

As mentioned earlier, classifications done by an unsupervised classifier are determined by the patterns present in the training environment. Hence, to formalize the two exclusive allocation conditions, a precise way to describe the concepts or category prototypes learned by the network must be provided. Deriving such a description is analogous to the rule-extraction task (Craven and Shavlik 1994): 'Given a trained neural network and the examples used to train it, produce a concise and accurate symbolic description of the network' (p 38). What does a neuron's activation mean? A possible description of the patterns encoded by the neuron can be obtained from the connection weights. However, because of the possible presence of lateral interactions, feedback, etc, connection weights may not provide an accurate picture of the patterns learned by the neuron.

Another approach would be to use symbolic if–then rules (Craven and Shavlik 1994). Such a description can be quite comprehensive and elaborate; however, the number of rules required to describe a network can grow exponentially with the number of input features. Moreover, the method described by Craven and Shavlik (1994) for obtaining the rules uses examples not contained in the training set; the case of real-valued input features is also not considered.

4.1. Label vectors to describe network behaviour

The approach taken in this paper is to derive a label for each output neuron, based on the neuron's activations in response to the familiar input patterns. The label is symbolized as a label vector and quantitatively summarizes the features to which the neuron responds. An advantage of this method is that the label can be computed by using examples drawn only from the training set.

An input pattern is represented by neural activations in the input layer. Each neuron in the output layer responds to one or more patterns; the label defines a prototype for this group of patterns. The label for each output neuron is expressed in terms of the input units that feed it; multilayered networks would be analysed by considering successive layers in sequence.

Consider an input pattern $X = (x_1, x_2, \ldots, x_I)$, drawn from the network's input space. Let the network's training set be defined by probability distribution $S$ on the input space, and let $p_S(X)$ be the probability of $X$ being the input pattern on any training presentation. When $X$ is presented to a network, the activation of the $i$th input neuron is $x_i$, and the activation of the $j$th output neuron is $y_j(X)$. $Y(X) = \bigl(y_1(X), y_2(X), \ldots, y_J(X)\bigr)$ is the vector of output activation values in response to input $X$. Abbreviate $y_j \equiv y_j(X)$, and assume $0 \leqslant x_i, y_j \leqslant 1$. Define
$$L'_{ij} = \int_X p_S(X) \cdot x_i \cdot y_j \, \mathrm{d}X. \qquad (1)$$
If $S$ consists of a finite number of patterns instead of a continuum, then the definition becomes
$$L'_{ij} = \sum_X p_S(X) \cdot x_i \cdot y_j. \qquad (2)$$
The $L'_{ij}$ values are normalized to obtain the label values that will be used in expressing the exclusive allocation conditions:
$$L_{ij} = \frac{L'_{ij}}{\max_k L'_{kj}}. \qquad (3)$$
The label $L_j$ of the $j$th output neuron is the vector $(L_{1j}, L_{2j}, \ldots, L_{Ij})$, where $I$ is the number of input units.

The label $L_j$ is a summary that characterizes the set of patterns to which neuron $j$ responds. The use of labels for this characterization is reasonable for most unsupervised networks, where learning is based on pattern similarity, and where the decision regions thus tend to be convex. However, labels would not be appropriate for characterizing networks with substantially non-convex decision regions, e.g., the type of network produced by many supervised learning procedures. The process of computing labels is essentially a rule extraction process, to infer the structure of a network, given knowledge only of the training input probabilities and the network's 'black box' input–output behaviour on the training data. Each component $L_{ij}$ of a label is analogous to a weight in an inferred model of the black box network. One benefit of this approach is that it facilitates comparing the generalization behaviour of different networks, without regard to differences in their internal structure or operation.
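As a concrete illustration, the finite-training-set form of the label computation (equations (2) and (3)) can be evaluated directly from a network's black-box responses. The sketch below is ours; the `respond` callable and the example data (the two-output network of figure 1) are stand-ins.

```python
import numpy as np

def compute_labels(training_patterns, probs, respond):
    """Label vectors L[i, j] from equations (2)-(3).

    training_patterns: (P, I) array of input patterns X
    probs:             (P,) training probabilities p_S(X)
    respond:           callable mapping an input pattern to its output activations Y(X)
    """
    Y = np.array([respond(x) for x in training_patterns])          # (P, J)
    L_raw = np.einsum('p,pi,pj->ij', probs, training_patterns, Y)  # equation (2)
    return L_raw / L_raw.max(axis=0, keepdims=True)                # equation (3)

# Example: the network of figure 1, trained on ab and bc with equal probability.
patterns = np.array([[1, 1, 0],      # ab
                     [0, 1, 1]])     # bc
probs = np.array([0.5, 0.5])
respond = lambda x: np.array([float((x == [1, 1, 0]).all()),   # neuron 'ab'
                              float((x == [0, 1, 1]).all())])  # neuron 'bc'
print(compute_labels(patterns, probs, respond))
# -> [[1. 0.]
#     [1. 1.]
#     [0. 1.]]   (rows: inputs a, b, c; columns: neurons ab, bc)
```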


4.2. Exclusive allocation conditions

When an input pattern is presented to a network, the network parses that pattern and represents the parsing via activations in the output layer. The activation of each input neuron can be decomposed into parts, each part being accounted for by (assigned to) a different output neuron. Thus, for condition 1 to be satisfied, the sum of these parts should equal the activation of the input neuron, and together they should be able to account for the activation of all output neurons.

One can describe the decomposition by using parsing coefficients. The parsing coefficient $C_{ij}(X, Y)$ describes how much of the 'credit' for the activation of input neuron $i$ is assigned to output neuron $j$, given an input pattern vector $X$ and an output vector $Y$. Abbreviate $C_{ij} \equiv C_{ij}\bigl(X, Y(X)\bigr)$. If the exclusive allocation constraints are fully satisfied, then for each input pattern $X$ (and its corresponding output vector $Y(X)$) in the full pattern space there should exist parsing coefficients $C_{ij} \geqslant 0$ such that for all output neurons $j$,
$$\sum_{i:\, L_{ij} \neq 0} \left( x_i \, \frac{C_{ij}}{\sum_k C_{ik}} \, \frac{L_{ij}}{\sum_k L_{kj}} \right) = y_j. \qquad (4)$$
In equation (4), the normalized label values $L_{ij}/\sum_k L_{kj}$ are analogous to the weights of a neural network. The parsing coefficients $C_{ij}/\sum_k C_{ik}$ describe how the credit for each input activation is allocated to output neurons, so that the $L$-weighted, $C$-allocated inputs exactly produce the outputs. It is assumed that $\sum_j C_{ij} > 0$ for all $i$. The sum is taken only over the non-zero $L_{ij}$ values; otherwise the $C_{ij}$ coefficients would be underconstrained.

The idea of using parsing coefficients to express exclusive allocation constraints is similar to the idea of using dynamic gating weights for credit assignment (Rumelhart and McClelland 1986, Morse 1994). The dynamic gating weights are computed using the actual or static weights on the connections in the network. In contrast, parsing coefficients are computed using the label vector for the output neurons and are independent of the actual connection weights. As seen in the examples in section 3, networks that respond identically to patterns in the training set can have very different connection weights (the weights may even have different signs). Hence it is difficult to compare the generalization properties of these networks using dynamic gating weights. On the other hand, label vectors are computed from the response of a network to patterns in the training set; the label vectors in these different networks (figure 2 or figure 3) are identical. The label vector method treats each network as a black box (independent of the network's connectivity and weights, which are internal to the box), examining just the networks' inputs and outputs. This facilitates a comparison of the generalization behaviour of the networks.

4.3. Minimization form of conditions

It is possible (e.g., in the presence of noise) that a network does not satisfy equation (4) exactly. Yet it would still be desirable to measure how close the network comes to satisfying the exclusive allocation conditions expressed by this equation. Hence the exclusive allocation requirement should instead be framed as a minimization condition. By squaring the difference between the left-hand and right-hand sides in equation (4) and summing over all output neurons, one obtains
$$E_1(X, Y) = \frac{1}{J} \sum_j \left( y_j(X) - \sum_{i:\, L_{ij} \neq 0} x_i \, \frac{C_{ij}(X, Y)}{\sum_k C_{ik}(X, Y)} \, \frac{L_{ij}}{\sum_k L_{kj}} \right)^2 \qquad (5)$$
where $X = (x_1, x_2, \ldots, x_I)$ and $Y = (y_1, y_2, \ldots, y_J)$ are placeholder variables. The normalization $1/J$ adjusts for the number of output units $J$.

By integrating over the network's parsings of all possible input patterns, one obtains the quantity
$$E_1 = \int_X E_1\bigl(X, Y(X)\bigr)\, \mathrm{d}X. \qquad (6)$$
Thus, for each input pattern $X$, the objective is to find a set of parsing coefficients $C_{ij}\bigl(X, Y(X)\bigr)$ that minimizes the measure $E_1$ of the network's exclusive allocation 'deficiency', in a least-squares sense. The measure $E_1$ is computed across all patterns in the full pattern space, whereas (as shown in equation (1)) the labels are computed across only the training set $S$. How the parsing coefficients can be obtained, for the purpose of measuring network behaviour, is a separate question, not treated in detail in this paper.

This analysis is concerned non-constructively with the existence of parsing coefficients that satisfy or minimize the equations. In practice, the minimization can be performed in a number of ways, e.g., using an iterative procedure (Morse 1994) to find the coefficients.
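For small examples, the existence of good parsing coefficients can be probed numerically. The sketch below is ours; the crude random search over allocation matrices is an assumption, adequate only for toy problems. It evaluates equation (5) for the figure 1 network.

```python
import numpy as np

def E1(x, y, L, A):
    """Equation (5) for one pattern, given normalized allocations A.

    x: (I,) inputs;  y: (J,) outputs;  L: (I, J) labels;
    A: (I, J) nonnegative allocations with each row summing to 1
       (A[i, j] plays the role of C_ij / sum_k C_ik).
    """
    Lnorm = L / L.sum(axis=0, keepdims=True)              # L_ij / sum_k L_kj
    recon = ((x[:, None] * A * Lnorm) * (L != 0)).sum(axis=0)
    return np.mean((y - recon) ** 2)

def best_E1(x, y, L, n_samples=20000, seed=0):
    """Crude search for allocation coefficients minimizing equation (5)."""
    rng = np.random.default_rng(seed)
    best = np.inf
    for _ in range(n_samples):
        A = rng.dirichlet(np.ones(L.shape[1]), size=L.shape[0])  # rows sum to 1
        best = min(best, E1(x, y, L, A))
    return best

# Figure 1: labels for neurons ab and bc; input abc with two candidate outputs.
L = np.array([[1, 0], [1, 1], [0, 1]], dtype=float)
print(best_E1(np.array([1., 1., 1.]), np.array([0.75, 0.75]), L))  # approx 0 (figure 1(G))
print(best_E1(np.array([1., 1., 1.]), np.array([1.00, 1.00]), L))  # clearly > 0 (figure 1(F))
```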

4.4. Condition 2 is necessary but not sufficient

The E1 scores produced by equation (5) can be used as a criterion to grade a network’sgeneralization behaviour on particular input pattern parsings. For instance, in figure 2(I), itis easily seen that equation (5) will produce poor scores for the linear decorrelator’s parsingof patternc, with any set of parsing coefficients.

However, for certain input patterns there can be more than one parsing that would yieldgoodE1 scores; some of these parsings may reflect better generalization behaviour thanothers. An extreme example is illustrated in figure 5, which shows a network with twoinput neurons, markeda andb, and two output neurons, markedp andq. The network istrained on two patterns,(1, 0) and(ε, 1), which occur with equal probability during training,with 0< ε � 1. By equations (1)–(3), the neuron labels in this network are

Lap = 1 Lbp = 0 Laq = ε Lbq = 1.

Suppose that, after the network has been trained, the patternX = (1, 0) is presented.As shown in figure 5(A), the network could respond by activating output neuronp fully;Y (X) = (1, 0). Using the parsing coefficients

Cap = 1 Cbp = 0 Caq = 0 Cbq = 0,

this response satisfies equation (4) for all input and output neurons. However, a networkcould instead respond as in figure 5(B), where the output isY (X) = (0, ε/(1+ ε)). In thiscase, one set of valid parsing coefficients would be

Cap = 0 Cbp = 0 Caq = 1 Cbq = 0.

Even for this parsing, equation (4) is fully satisfied. If this same relationship holds whenε is made vanishingly small, then equation (4) will always be satisfied for neuronq withthe given set of parsing coefficients. This example shows that in the limit asε → 0, theactivation of an input neuroni could be assigned to aninactive output neuronj if thecorresponding labelLij were zero, and condition 1 would be satisfied. For this reason,equation (4) excludes labels of value zero.

Condition 2 is imposed to repair further this anomaly; the parsing in figure 5(B) does not satisfy the equations listed below. Define
$$Y_1^*(X) = \Bigl\{ Y : E_1(X, Y) = \min_{\hat{Y}} E_1(X, \hat{Y}) \Bigr\}. \qquad (7)$$

Figure 5. Credit assignment example. The label vector of each of the two output neurons (marked p and q) has two elements, corresponding to the two input neurons (marked a and b). The label of neuron p is (1, 0); the label of neuron q is (ε, 1). These values are indicated by the numbers and arrows. Thin arrows denote a weak connection, with weight ε. Dashed arrows denote a connection with weight 0. (A) When pattern a is presented, the depicted network fully activates the output neuron marked p; this parsing satisfies conditions 1 and 2. (B) When pattern a is presented, the depicted network activates the output neuron marked q to a small value, ε/(1+ε); this parsing satisfies exclusive allocation condition 1, but not condition 2. (C) In response to input b, neuron p becomes active in the depicted network. However, if $C_{bp} = 1$, this parsing would satisfy condition 2 but not condition 1. (D) Here the label of neuron p is (ε, 1); the label of neuron q is (0, 1). When input pattern a is presented, the network's best response (according to the neuron labels) is to activate neuron p to the level ε/(1+ε). This parsing satisfies condition 1, and condition 2 should be designed so that this parsing is judged to satisfy it.

$Y_1^*(X)$ is the set of all output vectors that would best satisfy condition 1, given input $X$. Next, define the function
$$M_2(X, Y) = \left( \sum_i x_i - \sum_j \Bigl( y_j \sum_k L_{kj} \Bigr) \right)^2. \qquad (8)$$
This function measures the difference between the total input and the total size-normalized output. The factor $\sum_k L_{kj}$ represents the size of output neuron $j$.

Next, define
$$E_2(X, Y) = \Bigl( M_2(X, Y) - \min_{\hat{Y} \in Y_1^*(X)} M_2(X, \hat{Y}) \Bigr)^2. \qquad (9)$$
The 'min $M_2$' term represents the best condition 2 score for any parsing of any output vector that best satisfies condition 1. The equation thus measures the difference between the condition 2 score for the given parsing and the condition 2 score for the best parsing.

Finally, define
$$E_2 = \int_X E_2\bigl(X, Y(X)\bigr)\, \mathrm{d}X. \qquad (10)$$
This equation computes an overall score for how well all the network's parsings satisfy condition 2.

For input pattern $X = (1, 0)$ in figures 5(A) and 5(B), both the output vectors $Y(X) = (1, 0)$ and $Y(X) = \bigl(0, \varepsilon/(1+\varepsilon)\bigr)$ are included in the set $Y_1^*(X)$. However, the value of $M_2(X, Y)$ equals zero when $Y = (1, 0)$ and exceeds zero when $Y = \bigl(0, \varepsilon/(1+\varepsilon)\bigr)$. The minimum value of the $M_2$ function in (9) is zero. Thus, $E_2(X, Y) = 0$ when $Y = (1, 0)$, and $E_2(X, Y) > 0$ when $Y = \bigl(0, \varepsilon/(1+\varepsilon)\bigr)$. The measure $E_2$ is minimized in networks that behave like the network of figure 5(A), rather than like the network of figure 5(B).
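A direct transcription of equation (8) makes the figure 5 comparison concrete (our sketch; the numerical value of ε is arbitrary).

```python
import numpy as np

def M2(x, y, L):
    """Equation (8): squared difference between total input and
    total size-normalized output (neuron sizes = column sums of L)."""
    sizes = L.sum(axis=0)
    return (x.sum() - (y * sizes).sum()) ** 2

eps = 0.01
L = np.array([[1.0, eps],     # rows: inputs a, b; columns: neurons p, q
              [0.0, 1.0]])
x = np.array([1.0, 0.0])                        # input pattern a
print(M2(x, np.array([1.0, 0.0]), L))           # figure 5(A): 0.0
print(M2(x, np.array([0.0, eps / (1 + eps)]), L))  # figure 5(B): > 0
```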


As figures 5(A) and 5(B) illustrate, condition 2 is a necessary part of the definition of exclusive allocation. Figure 5(C) shows that condition 2 alone is not sufficient to define exclusive allocation; the parsing
$$C_{ap} = 0 \qquad C_{bp} = 1 \qquad C_{aq} = 0 \qquad C_{bq} = 0$$
satisfies condition 2 but not condition 1. Therefore, both conditions are necessary in the definition of exclusive allocation.

The equations above for condition 2 were designed to ensure that the parsing shown in figure 5(D) is considered valid. In this case, the total input (= 1) does not equal the total size-normalized output (= ε/(1+ε)). Nevertheless, the parsing is the best one possible, given the labels shown. In equation (9), the quality of a parsing is measured relative to the quality of the best parsing, rather than to an arbitrary external standard. For this reason, the clause 'as closely as possible' is included in condition 2.

4.5. Inexactness tolerance

Consider equations (7)–(9). In a realistic environment with noise, there might exist an output vector $Y$ that is spuriously excluded from the set $Y_1^*(X)$, yet whose $M_2$ value is significantly smaller than that of any output vector in $Y_1^*(X)$. This situation can occur because equation (7) requires that the $E_1(X, Y)$ value exactly equal the smallest such value for any $\hat{Y}$; but noise can preclude exact equality.

Hence, near-equality, rather than exact equality, should be required; some degree of inexactness tolerance is necessary. Equation (7) must therefore be revised. One way to remedy the problem is to replace equation (7) with
$$Y_1^*(X) = \Bigl\{ Y : E_1(X, Y) \leqslant T_1 \min_{\hat{Y}} E_1(X, \hat{Y}) \Bigr\} \qquad (11)$$
where $T_1 > 1$ is an inexactness tolerance parameter. Using this new equation, $E_2$ measures the degree to which condition 2 is satisfied, relative to the best $M_2$ value chosen from among the parsings that satisfy condition 1 tolerably well. $T_1$ thus becomes an additional free parameter of the evaluation process.

The two exclusive allocation conditions will be discussed further in section 5. But first, two additional constraints that refine the measure of generalization will be introduced. The exclusive allocation conditions leave ambiguous the choice between certain parsings (for example, between figures 1(C) and 1(D)). The two additional constraints are useful because they further limit the allowable choices, thereby regularizing or disambiguating the parsings. The additional constraints are optional: there may be some instances for which the added regularization is not needed.

4.6. Sequence masking constraint

The 'sequence masking' property (Cohen and Grossberg 1986, 1987, Marshall 1990b, 1995, Nigrin 1993) concerns the responses of a system to patterns of different sizes (or scales). It holds that large, complete output representations are better than small ones or incomplete ones. For example, it is better to parse input pattern ab as a single output category ab (figure 6(A)) than as two smaller output categories a + b (figure 6(B)). It is also better to parse input ab as the complete output category ab (figure 6(A)) than as an incomplete part of a larger output category abcd (figure 6(C)).

Figure 6. Sequence masking. Input pattern ab is presented. Possible output responses satisfying conditions 1 and 2 are shown. (A) Output is ab. (B) Output is a + b. (C) Output is 50% activation of abcd.

A new sequence masking constraint can optionally be imposed, to augment the exclusive allocation criterion, as part of the definition of generalization. The sequence masking constraint biases the network evaluation measure toward preferring parsings that exhibit the sequence masking property. The sequence masking constraint can be stated as

• Condition 3. Large, complete output representations are better than small ones or incomplete ones.

One way to implement condition 3 is as follows. Let

$$Y_2^*(X) = \Bigl\{ Y : E_2(X, Y) \leqslant T_2 \min_{\hat{Y} \in Y_1^*(X)} E_2(X, \hat{Y}) \Bigr\} \qquad (12)$$
$$M_3(Y) = \frac{1}{\sum_j \Bigl( \bigl( y_j \sum_k L_{kj} \bigr)^2 \big/ \bigl( 1 + \sum_k L_{kj} \bigr) \Bigr)} \qquad (13)$$
$$E_3(X, Y) = \Bigl( M_3(Y) - \min_{\hat{Y} \in Y_2^*(X)} M_3(\hat{Y}) \Bigr)^2. \qquad (14)$$
Here $Y_2^*(X)$ is the set of output vectors satisfying condition 1 that also best satisfy condition 2, in response to a given input pattern $X$. The parameter $T_2 > 1$ specifies the inexactness tolerance of the evaluation process with regard to satisfaction of condition 2.

The function $M_3$ computes a bias in favour of larger, complete output representations. Using this function, the network of figure 6(A) would have an $M_3$ score of 3/4, the network of figure 6(B) would have an $M_3$ score of 2/2, and the network of figure 6(C) would have an $M_3$ score of 5/4; smaller values are considered to be better.

In equation (14), the 'min $M_3$' term represents the best condition 3 score for any parsing of any output vector that best satisfies conditions 1 and 2. The equation thus measures the difference between the condition 3 score for the given parsing and the condition 3 score for the best parsing.

To compute an overall score for how well the network's parsings satisfy condition 3, define
$$E_3 = \int_X E_3\bigl(X, Y(X)\bigr)\, \mathrm{d}X. \qquad (15)$$
This equation integrates the $E_3$ scores across all possible input patterns.
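Equation (13) can be checked against the three parsings of figure 6 (our sketch; the labels and activations are read off the figure).

```python
import numpy as np

def M3(y, L):
    """Equation (13): bias toward large, complete output representations."""
    sizes = L.sum(axis=0)
    return 1.0 / np.sum((y * sizes) ** 2 / (1.0 + sizes))

# Output neurons a, b, ab, abcd (sizes 1, 1, 2, 4); input pattern is ab.
L = np.array([[1, 0, 1, 1],
              [0, 1, 1, 1],
              [0, 0, 0, 1],
              [0, 0, 0, 1]], dtype=float)
print(M3(np.array([0, 0, 1.0, 0]), L))    # figure 6(A): 0.75
print(M3(np.array([1.0, 1.0, 0, 0]), L))  # figure 6(B): 1.0
print(M3(np.array([0, 0, 0, 0.5]), L))    # figure 6(C): 1.25
```

The values 3/4, 2/2, and 5/4 reproduce the scores quoted in the text, with the complete single category ab scoring best.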


The sequence masking constraint should be imposed in measurements of generalization when larger, complete representations are more desirable than small ones or incomplete ones.

4.7. Uncertainty multiplexing constraint

If an input pattern is ambiguous (i.e., there exists more than one valid parsing), then conditions 1 and 2 do not indicate whether a particular parsing should be selected or whether the representations for the multiple parsings should be simultaneously active. For instance, in subsection 3.2.2 (figures 1(C) or 3(E) and 3(F)), when pattern b is presented, conditions 1 and 2 can be satisfied if neuron ab is half-active and neuron bc is inactive, or if ab is inactive and bc is half-active, or by an infinite number of combinations of activations between these two extreme cases.

Marshall (1990b, 1995) discussed the desirability of representing the ambiguity in such cases by partially activating the alternative representations, to equal activation values. The output in which neurons ab and bc are equally active at the 25% level (figures 1(C), 3(E) and 3(F)) would be preferred to one in which they were unequally active: for example, when ab is 50% active and bc is inactive (figure 1(D)). This type of representation expresses the network's uncertainty about the true classification of the input pattern, by multiplexing (simultaneously activating) partial activations of the best classification alternatives.

A new 'uncertainty multiplexing' constraint can optionally be imposed, to augment the exclusive allocation criterion. The uncertainty multiplexing constraint regularizes the classification ambiguities by limiting the allowable relative activations of the representations for the multiple alternative parsings for ambiguous input patterns. The uncertainty multiplexing constraint can be stated as

• Condition 4. When there is more than one best match for an input pattern, the best-matching representations divide the input signals equally.

The notion of best match is specified by conditions 1 and 2, and (optionally) 3. (Other definitions for best match can be used instead.)

One way to implement the uncertainty multiplexing constraint is as follows. Let

$$Y_3^*(X) = \Bigl\{ Y : E_3(X, Y) \leqslant T_3 \min_{\hat{Y} \in Y_2^*(X)} E_3(X, \hat{Y}) \Bigr\} \qquad (16)$$
$$Y_4^*(X) = \mathrm{mean}\bigl( Y_3^*(X) \bigr) \qquad (17)$$
$$E_4(X, Y) = \bigl( Y - Y_4^*(X) \bigr)^2 \qquad (18)$$
where $\mathrm{mean}(\alpha) \equiv \bigl(\int \alpha \, \mathrm{d}\alpha\bigr) / \|\alpha\|$ is the element-by-element mean of the set of vectors $\alpha$, where $\|\alpha\|$ describes the size or 'measure' of the region $\alpha$, and where $\alpha^2$ refers to the dot product of vector $\alpha$ with itself. Function $Y_2^*$ selects the set of output vectors satisfying condition 1 that also best satisfy condition 2. Of this set of output vectors, function $Y_3^*$ selects the subset that also best satisfies condition 3. Function $Y_4^*$ averages all the output vectors in this subset together. Finally, $E_4$ treats this average as the 'ideal' output vector and measures the deviation of a given output vector from the ideal. Parameter $T_3 > 1$ specifies the inexactness tolerance of the evaluation process with regard to satisfaction of condition 3.

By these equations, the parsing of input pattern b in figure 1(C) in which neurons ab and bc are equally active at the 25% level would be preferred to a parsing in which they were unequally active: for example, when ab is 50% active and bc is inactive.
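For the ambiguous pattern b of figure 1, the effect of condition 4 can be illustrated directly (our sketch; the candidate set of parsings is written out by hand rather than derived from equations (11)–(16)).

```python
import numpy as np

# Candidate parsings of pattern b that satisfy conditions 1 and 2 equally well:
# the activations of ab and bc trade off along a line from (0.5, 0) to (0, 0.5).
candidates = np.array([[0.5 - t, t] for t in np.linspace(0.0, 0.5, 11)])

ideal = candidates.mean(axis=0)                 # equation (17): element-wise mean
E4 = ((candidates - ideal) ** 2).sum(axis=1)    # equation (18): deviation from ideal

print(ideal)          # [0.25 0.25] -- equal 25% activations
print(E4[5], E4[0])   # 0.0 for the (0.25, 0.25) parsing; > 0 for (0.5, 0)
```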


If enforcement of uncertainty multiplexing is desired but enforcement of sequence masking is not desired, then equation (16) can be replaced by $Y_3^*(X) = Y_2^*(X)$.

To compute an overall score for how well the network's parsings satisfy condition 4, define
$$E_4 = \int_X E_4\bigl(X, Y(X)\bigr)\, \mathrm{d}X. \qquad (19)$$

The uncertainty multiplexing constraint should be imposed in measurements of generalization when balancing ambiguity among likely alternatives is more desirable than making a definite guess.

4.8. Scoring network performance

To compare objectively the generalization performance of specific networks, given a particular training environment, one can formulate a numerical score that incorporates the four criteria $E_1$, $E_2$, $E_3$, and $E_4$. One can assign each of these factors a weighting to yield an overall network performance score. For instance, the score can be defined as
$$E_{T_1, T_2, T_3} = a_1 E_1 + a_2 E_2 + a_3 E_3 + a_4 E_4 \qquad (20)$$
where $a_1$, $a_2$, $a_3$, and $a_4$ are weightings that reflect the relative importance of the four generalization conditions, and $T_1$, $T_2$, and $T_3$ are parameters of the evaluation process, specifying the degrees to which various types of inexactness are tolerated (equations (11), (12) and (16)). The score of each network can then be computed numerically (and laboriously). A full demonstration of such a computation would be an interesting next step for this research. The choice of weightings for the various factors would affect the final rankings.
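For a finite set of test patterns, the weighted combination of equation (20) reduces to a few lines (our sketch; the per-pattern scoring functions and the weightings are placeholders to be supplied from equations (5)–(19), and the integral is replaced by an average over the test patterns).

```python
import numpy as np

def overall_score(patterns, respond, e1, e2, e3, e4, a=(1.0, 1.0, 1.0, 1.0)):
    """Equation (20) over a finite set of test patterns.

    respond:        maps an input pattern X to the network's output Y(X)
    e1, e2, e3, e4: per-pattern scoring functions E_k(X, Y) from the text
    a:              weightings (a1, a2, a3, a4)
    """
    totals = np.zeros(4)
    for X in patterns:
        Y = respond(X)
        totals += [e1(X, Y), e2(X, Y), e3(X, Y), e4(X, Y)]
    return float(np.dot(a, totals / len(patterns)))
```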

It is theoretically possible to eliminate the free parameters for inexactness tolerance by replacing them with a fixed calculation, such as a standard deviation from the mean. However, such expedients have not been explored in this research. In any case, a single set of parameter values or calculations should be chosen for any comparison across different network types.

Fully comparing the generalization performance of the classifiers themselves (e.g., all linear decorrelators versus all EXIN networks) might require evaluating $\int E(S)\, dS$ across all possible training environments S. Such a calculation is obviously infeasible, except perhaps by stochastic analysis. Nevertheless, the comparisons can be understood qualitatively on the basis of key examples, like the ones presented in this paper.

5. Assessing generalization in EXIN network simulations

This section discusses how the generalization criteria can be applied to measure the performance of an EXIN network. The EXIN network was chosen as an example because it yields good but not perfect generalization; thus, it shows effectively how the criteria operate.

Figure 7(A) shows an EXIN network that has been trained with six input patterns (Marshall 1995). Figure 7(B) shows the multiplexed, context-sensitive response of the network to a variety of familiar and unfamiliar input combinations. All 64 possible binary input patterns were tested, and reasonable results were produced in each case (Marshall 1995); figure 7(B) shows 16 of the 64 tested parsings.

Given the four generalization conditions described in the preceding sections, the performance of this EXIN network (Marshall 1995) will be summarized below.


Figure 7. Simulation results. (A) The state of an EXIN network after 3000 training presentations of input patterns drawn from the set a, ab, abc, cd, de, def. The input pattern coded by each output neuron (the label) is listed above the neuron body. The approximate 'size' normalization factor of each output neuron is shown inside the neuron. Strong excitatory connections (weights between 0.992 and 0.999) are indicated by lines from input (lower) neurons to output (upper) neurons. All other excitatory connections (weights between 0 and 0.046) are omitted from the figure. Strong inhibitory connections (weights between 0.0100 and 0.0330), indicated by thick lateral arrows, remain between neurons coding patterns that overlap. The thickness of the lines is proportional to the inhibitory connection weights. All other inhibitory connections (weights between 0 and 0.0006) are omitted from the figure. (B) 16 copies of the network. Each copy illustrates the network's response to a different input pattern. Network responses are indicated by filling of active output neurons; fractional height of filling within each circle is proportional to neuron activation value. Rectangles are drawn around the networks that indicate the responses to one of the training patterns. (Redrawn with permission, from Marshall (1995), copyright Elsevier Science.)

A sample of the most illustrative parsings, and the degree to which they satisfy the conditions, will be discussed. The most complex example below, pattern abcdf, is examined in greater detail, using the equations described in the preceding section to evaluate the parsing.

5.1. EXIN network response to training patterns

Consider the response of the network to a pattern in the training set, such as a (figure 7(B)). The active output neuron has the label a. It is fully active, so it fully accounts for the input pattern a. No other output neuron is active, so the activations across the output layer are fully accounted for by the input pattern.


Thus, condition 1 is satisfied on the patterns from the training set. The total input almost exactly equals the total size-normalized output, so condition 2 is well satisfied on these patterns. Condition 3 is also satisfied for the training patterns: for example, pattern ab activates output neuron ab, not a or abc. Since the training patterns are unambiguous, condition 4 is satisfied by default on these patterns. As seen in figure 7(B), the generalization conditions are satisfied for all patterns in the training set (marked by rectangles).

5.2. EXIN network response to ambiguous patterns

Consider the response of the network to an ambiguous pattern such as d. Pattern d is part of familiar patterns cd, de, and def, and it matches cd and de most closely. The corresponding two output neurons are active, both between the 25% and 50% levels. Conditions 1 and 2 appear to be approximately satisfied on this pattern: the activation of d is accounted for by split activation across cd and de, and the activations of cd and de are accounted for by disjoint fractions of the activation of d. Condition 3 is well satisfied because neuron def is inactive. Condition 4 is also approximately (but not perfectly) satisfied, since the two neuron activations are nearly equal.
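To illustrate how condition 4 could be checked numerically on this response, the sketch below compares hypothetical activations of neurons cd and de (illustrative values consistent with the description above, not the simulation data) against the 'ideal' of equation (17), assuming that the two best parsings assign all of d's credit either to cd or to de.

```python
# The two best-matching parsings of the ambiguous input d: all of d's credit
# goes to cd, or all of it goes to de.  Each category has size 2, so full
# credit for the single feature d corresponds to a 50% activation.
parsings = [{"cd": 0.50, "de": 0.00},
            {"cd": 0.00, "de": 0.50}]

# Equation (17): the 'ideal' output is the element-wise mean of the parsings.
ideal = {n: sum(p[n] for p in parsings) / len(parsings) for n in ("cd", "de")}
# -> {'cd': 0.25, 'de': 0.25}

# Hypothetical actual activations, both between the 25% and 50% levels.
actual = {"cd": 0.33, "de": 0.31}

# Equation (18): squared deviation of the actual output from the ideal; a
# small value means condition 4 is approximately satisfied.
e4 = sum((actual[n] - ideal[n]) ** 2 for n in ("cd", "de"))
print(round(e4, 4))   # 0.01
```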

Now consider pattern b, which is part of patterns ab and abc. The network activates neuron ab at about the 50% level. Since b constitutes 50% of the pattern ab, the activation of neuron ab fully accounts for the input pattern. Likewise, the activation of ab is fully accounted for by the input pattern. Pattern b is more similar to ab than to abc, so it is correct for neuron abc to be inactive in this case, by condition 3.

Similarly, pattern c is part of abc and cd. However, neuron abc is slightly active, and neuron cd is active at a level slightly less than 50%. Condition 1 is satisfied on pattern c: the sum of the output activations attributable to c is still the same as that of the activations attributable to b in the previous example, and the activations of neurons abc and cd are attributable to disjoint fractions (approximately 25% and 75%) of the activation of c; thus, condition 2 is well satisfied. Condition 3 is not as well satisfied here as in the previous example. The difference can be explained by the weaker inhibition between abc and cd than between ab and abc; more coactivation is thus allowed.

Input pattern c is unambiguous, by condition 4. To satisfy condition 4 on an input pattern, a network must determine which output neurons represent the best matches for the input pattern. The simultaneous partial activation of abc and cd is a manifestation of some inexactness tolerance in determining the best matches, by the EXIN network. Alternatively, as described in subsection 3.2.2, greater selectivity in the interneuron inhibition (as in SONNET-2) can be used to satisfy condition 2 more exactly.

The results in figure 7 show that when presented with ambiguous patterns, the EXIN network activates the best match, and when there is more than one best match, it permits simultaneous activation of the best matches. Thus, the generalization behaviour on ambiguous patterns meets the exclusive allocation conditions satisfactorily.

5.3. EXIN network response to multiple superimposed patterns

Consider the response of the network to pattern abcd. Pattern abcd can be compared with patterns ab and cd; the response to abcd is the superposition of the separate responses to ab and cd. Conditions 1, 2 and 4 are clearly met here. As discussed in subsection 3.1.4, this is in contrast to the response of a WTA network, where the output neuron abc would become active.


Condition 3 is also met here; there is no output neuron labelled abcd, and if neuron abc were fully active, then the input from neuron d could be accounted for only by partial activation of another output neuron, such as de.

When f is added to abcd to yield the pattern abcdf, a chain reaction alters the network's response, from def down to a in the output layer. The presence of d and f causes the def neuron to become approximately 50% active. In turn, this inhibits the cd neuron more, which then becomes less active. As a result, the abc neuron receives less inhibition and becomes more active. This in turn inhibits the activation of neuron ab. Because neuron ab is less active, neuron a then becomes more active. These increases and decreases tend to balance one another, thereby keeping conditions 1 and 2 satisfied on pattern abcdf. The dominant parsing appears to be ab + cd + def, but the overlap between cd and def prevents those two neurons from becoming fully coactive. As a result, the alternative parsings involving abc or a can become partially active. No strong violations of conditions 3 and 4 are apparent. The responses to patterns cdf, abcf, and bcdf are also shown for comparison.

The patterns listed above were selected for discussion on the basis of their interesting properties. The network's response to all the other patterns can also be evaluated using the exclusive allocation criterion. In each case, the EXIN network adheres well to the four generalization conditions. Thus, the simulation indicates the high degree to which EXIN networks show exclusive allocation, sequence masking, and uncertainty multiplexing behaviours.

5.4. An example credit assignment computation

The generalization conditions can be formalized in a number of ways; the equations given above represent one such formalization. For example, a different computation could be used to express exclusive allocation deficiency, instead of the least-squares method of equations (5), (8), (9), (14) and (18). Nonlinearities could be introduced in the credit assignment scheme (equation (4)). The formalization given here expresses the generalization conditions in a relatively simple manner that is suitable from a computational viewpoint.

Figure 8 describes a computation of the extent to which the network in figure 7 satisfies the generalization conditions for a particular input pattern. The table in the rectangle describes approximate parsing coefficients for pattern abcdf. The coefficients shown in the table were estimated manually, to two decimal places. These coefficients represent the portion of the credit that is assigned between each input neuron activation and each output neuron activation. For example, the activation of input neuron a is 1; 21% of its credit is allocated to output neuron a, and 79% is allocated to ab. The input to neuron ab is 0.79 + 0.38, the sum of the contributions it receives from different input neurons weighted by the activation of the input neurons. This input is divided by the neuron's normalization factor ('size'), 2. This normalization factor is derived from the neuron's label, which is determined by the training (familiar) patterns to which the neuron responds (equations (2) and (3)). The resulting attributed activation value, 0.59, is very close to the actual activation, 0.58, of neuron ab in the simulation. The existence of parsing coefficients (e.g., those in figure 8) that produce attributed activations that are all close to the actual allocations shows that condition 1 (equation (4)) is well satisfied for the input pattern abcdf.

Condition 2 is well satisfied because $\sum_i x_i$ (which equals 5) is very close to $\sum_j \bigl(y_j \sum_k L_{kj}\bigr)$ (which equals (0.19 × 1) + (0.58 × 2) + (0.24 × 3) + (0.58 × 2) + (0.00 × 2) + (0.56 × 3) = 4.91). Numerical values for conditions 3 and 4 can also be calculated, but the calculations would be much more computationally intensive, as they call for evaluation of all possible parsings of an input pattern, within a given training environment.
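To make the bookkeeping explicit, the following sketch (a reconstruction from the values reported in figure 8, with illustrative variable names) recomputes the attributed activations for pattern abcdf and the two totals compared above.

```python
import numpy as np

# Input activations x_k for pattern abcdf, in the order a, b, c, d, e, f.
x = np.array([1, 1, 1, 1, 0, 1], dtype=float)

# Parsing coefficients C_kj (rows: input neurons a..f; columns: output
# neurons a, ab, abc, cd, de, def), as estimated manually in figure 8.
C = np.array([
    [0.21, 0.79, 0.00, 0.00, 0.00, 0.00],   # credit from a
    [0.00, 0.38, 0.62, 0.00, 0.00, 0.00],   # credit from b
    [0.00, 0.00, 0.12, 0.88, 0.00, 0.00],   # credit from c
    [0.00, 0.00, 0.00, 0.30, 0.00, 0.70],   # credit from d
    [0.00, 0.00, 0.00, 0.00, 0.00, 0.00],   # credit from e (inactive)
    [0.00, 0.00, 0.00, 0.00, 0.00, 1.00],   # credit from f
])

# 'Size' normalization factor of each output neuron (sum_k L_kj).
size = np.array([1, 2, 3, 2, 2, 3], dtype=float)

# Condition 1: attributed activation = credited input divided by size.
attributed = (x @ C) / size
print(attributed.round(2))    # approx. [0.21 0.59 0.25 0.59 0.00 0.57]

# Actual simulated activations y_j read off figure 8; they are close to the
# attributed activations, so condition 1 is well satisfied.
y = np.array([0.19, 0.58, 0.24, 0.58, 0.00, 0.56])

# Condition 2: total input versus total size-normalized output.
print(x.sum(), (y * size).sum())   # 5.0 versus approx. 4.91
```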


Parsing coefficients C_kj (credit from input neuron k to output neuron j); entries not shown in the figure are omitted:

  To output neuron   From input neuron
                     a      b      c      d      e      f
  a                  .21
  ab                 .79    .38
  abc                       .62    .12
  cd                               .88    .30
  de                                      .00    .00    .00
  def                                     .70           1.00

  Input activations x_k:  1      1      1      1      0      1

  Output   Total parse energy   Neuron size   Attributed activation          Actual
  neuron   Σ_k C_kj x_k         Σ_k L_kj      (Σ_k C_kj x_k)/(Σ_k L_kj)      activation y_j
  a        0.21                 1             .21                            .19
  ab       1.17                 2             .59                            .58
  abc      0.74                 3             .25                            .24
  cd       1.18                 2             .59                            .58
  de       0.00                 2             .00                            .00
  def      1.70                 3             .57                            .56

Figure 8. Parsing coefficients and attributed activations. The table inside the rectangle shows parsing coefficients: the inferred decomposition of the credit from each input neuron into each output neuron to produce the activations shown in figure 7. Because the results of this computation are very close to the EXIN network's simulated results (compare grey shaded columns on the right), it can be concluded that the EXIN network satisfies condition 1 for exclusive allocation very well, on pattern abcdf. (Redrawn with permission, from Marshall (1995), copyright Elsevier Science.)

6. Discussion

The exclusive allocation criterion was used to compare qualitatively the generalization performance of four unsupervised classifiers: WTA competitive learning networks, linear decorrelator networks, EXIN networks, and SONNET-2 networks. The comparisons suggest that more sophisticated generalization performance is obtained at the cost of increased complexity. The exclusive allocation behaviour of an EXIN network was examined in more detail, and one parsing was analysed quantitatively. The concept of exclusive allocation, or credit assignment, is a conceptually useful way of defining generalization because it lends itself very well to the natural problem of decomposing and identifying independent sources underlying superimposed or ambiguous signals (blind source separation) (Bell and Sejnowski 1995, Comon et al 1991, Jutten and Herault 1991).

This paper has described formal criteria for evaluating the generalization properties of unsupervised neural networks, based on the principles of exclusive allocation, sequence masking, and uncertainty multiplexing. The examples and simulations show that satisfaction of the generalization conditions can enable a network to do context-sensitive parsing, in response to multiple superimposed patterns as well as ambiguous patterns. The method describes a network in terms of its response to patterns in the training set and then places constraints on the response of the network to all patterns, both familiar and unfamiliar. The concepts of exclusive allocation, sequence masking, and uncertainty multiplexing thus provide a principled basis for evaluating the generalization capability of unsupervised classifiers.

The criteria in this paper define success for a system in terms of the quality of the system's internal representations of its input environment, rather than in terms of a particular external task. The internal representations are inferred, without actually examining the system's internal processing, weights, etc, through a black-box approach of 'labelling': observing the system's responses to its training ('familiar') inputs.


Then the system's generalization performance is evaluated by examining its responses to both familiar and unfamiliar inputs. This definition is useful when a system's generalization cannot be measured in terms of performance on a specific external task, either when objective classifications ('supervision') of input patterns are unavailable or when the system is general purpose.

Acknowledgments

This research was supported in part by the Office of Naval Research (Cognitive and Neural Sciences, N00014-93-1-0208) and by the Whitaker Foundation (Special Opportunity Grant). We thank George Kalarickal, Charles Schmitt, William Ross, and Douglas Kelly for valuable discussions.

References

Anderson J A, Silverstein J W, Ritz S A and Jones R S 1977 Distinctive features, categorical perception, and probability learning: Some applications of a neural model Psychol. Rev. 84 413–51

Bell A J and Sejnowski T J 1995 An information-maximization approach to blind separation and blind deconvolution Neural Comput. 7 1129–59

Bregman A S 1990 Auditory Scene Analysis: The Perceptual Organization of Sound (Cambridge, MA: MIT Press)

Carpenter G A and Grossberg S 1987 A massively parallel architecture for a self-organizing neural pattern recognition machine Comput. Vision, Graphics Image Process. 37 54–115

Cohen M A and Grossberg S 1986 Neural dynamics of speech and language coding: developmental programs, perceptual grouping, and competition for short term memory Human Neurobiol. 5 1–22

——1987 Masking fields: A massively parallel neural architecture for learning, recognizing and predicting multiple groupings of patterned data Appl. Opt. 26 1866–91

Comon P, Jutten C and Herault J 1991 Blind separation of sources, part II: problems statement Signal Process. 24 11–21

Craven M W and Shavlik J W 1994 Using sampling and queries to extract rules from trained neural networks Machine Learning: Proc. 11th Int. Conf. (San Francisco, CA: Morgan Kaufmann) pp 37–45

Desimone R 1992 Neural circuits for visual attention in the primate brain Neural Networks for Vision and Image Processing ed G A Carpenter and S Grossberg (Cambridge, MA: MIT Press) pp 343–64

Foldiak P 1989 Adaptive network for optimal linear feature extraction Proc. Int. Joint Conf. on Neural Networks (Washington, DC) (Piscataway, NJ: IEEE) vol I, pp 401–5

Hubbard R S and Marshall J A 1994 Self-organizing neural network model of the visual inertia phenomenon in motion perception Technical Report 94-001 Department of Computer Science, University of North Carolina at Chapel Hill, 26 pp

Jutten C and Herault J 1991 Blind separation of sources, part I: an adaptive algorithm based on neuromimetic architecture Signal Process. 24 1–10

Kohonen T 1982 Self-organized formation of topologically correct feature maps Biol. Cybern. 43 59–69

Marr D 1982 Vision: A Computational Investigation into the Human Representation and Processing of Visual Information (San Francisco, CA: Freeman)

Marr D and Poggio T 1976 Cooperative computation of stereo disparity Science 194 238–87

Marshall J A 1990a Self-organizing neural networks for perception of visual motion Neural Networks 3 45–74

——1990b A self-organizing scale-sensitive neural network Proc. Int. Joint Conf. on Neural Networks (San Diego, CA) (Piscataway, NJ: IEEE) vol III, pp 649–54

——1990c Adaptive neural methods for multiplexing oriented edges Proc. SPIE 1382 (Intelligent Robots and Computer Vision IX: Neural, Biological, and 3-D Methods, Boston, MA) ed D P Casasent, pp 282–91

——1992 Development of perceptual context-sensitivity in unsupervised neural networks: Parsing, grouping, and segmentation Proc. Int. Joint Conf. on Neural Networks (Baltimore, MD) (Piscataway, NJ: IEEE) vol III, pp 315–20

——1995 Adaptive perceptual pattern recognition by self-organizing neural networks: Context, uncertainty, multiplicity, and scale Neural Networks 8 335–62

Marshall J A, Kalarickal G J and Graves E B 1996 Neural model of visual stereomatching: Slant, transparency, and clouds Network: Comput. Neural Syst. 7 635–70

Marshall J A, Kalarickal G J and Ross W D 1997 Transparent surface segmentation and filling-in using local cortical interactions Investigative Ophthalmol. Visual Sci. 38 641

Marshall J A, Schmitt C P, Kalarickal G J and Alley R K 1998 Neural model of transfer-of-binding in visual relative motion perception Computational Neuroscience: Trends in Research, 1998 ed J M Bower, to appear

Morse B 1994 Computation of object cores from grey-level images PhD Thesis Department of Computer Science, University of North Carolina at Chapel Hill

Nigrin A 1993 Neural Networks for Pattern Recognition (Cambridge, MA: MIT Press)

Oja E 1982 A simplified neuron model as a principal component analyzer J. Math. Biol. 15 267–73

Reggia J A, D'Autrechy C L, Sutton G G and Weinrich M 1992 A competitive redistribution theory of neocortical dynamics Neural Comput. 4 287–317

Rumelhart D E and McClelland J L 1986 On the learning of past tenses of English verbs Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Volume 2: Psychological and Biological Models (Cambridge, MA: MIT Press) pp 216–71

Sattath S and Tversky A 1987 On the relation between common and distinctive feature models Psychol. Rev. 94 16–22

Schmitt C P and Marshall J A 1998 Grouping and disambiguation in visual motion perception: A self-organizing neural circuit model, in preparation

Yuille A L and Grzywacz N M 1989 A winner-take-all mechanism based on presynaptic inhibition feedback Neural Comput. 1 334–47