Effects of category diversity on learning, memory, and generalization



Memory & Cognition, 2005, 33 (2), 289-302

This article examines people's ability to learn categories by induction over exemplars, and in particular, how the diversity of training exemplars affects the rate of learning, the pattern of generalization, and the ability to distinguish new exemplars from old ones. Although we focus on perceptual categories, the relevance of diversity to generalization applies widely to people's ability to infer properties of things, as in, for example, inductive inference or conceptual reasoning, as well as perceptual classification. In normative terms, diverse evidence gives rise to stronger inductive arguments. This diversity principle has been emphasized in the philosophy of science (for a recent discussion, see, e.g., Wayne, 1995), and considerable experimental work has examined the extent to which it is adhered to in everyday judgments by both adults (e.g., Osherson, Smith, Wilkie, Lopez, & Shafir, 1990) and children (for a recent discussion, see, e.g., Heit & Hahn, 2001). With respect to semantic concepts, some categories are cognitively favored over others. An issue that has provoked much discussion is the extent to which the "coherence" of categories relies on similarities and is degraded by diversity (cf. Barsalou, 1983; Corter & Gluck, 1992; Hahn & Ramscar, 2001; Jones, 1983; Murphy & Medin, 1985; Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976). However, the relevance of diversity to perceptual categories is uncontroversial. Studies of perceptual categories have shown that even infants are sensitive to category variability (e.g., Mareschal, French, & Quinn, 2000; Quinn, Eimas, & Rosenkrantz, 1993; Younger, 1985). Many studies of adults (reviewed below) have similarly found that diversity influences category processing. The nature of that influence remains unclear.

We can distinguish several fundamental mechanisms by which diversity might influence category processing. First, individual item similarities could produce diversity effects. With respect to generalization, if one has encountered a wide range of examples in the past, a novelty selected at random from a wide range of possibilities is more likely to be similar to something familiar than if one has only a narrow base of highly similar experiences on which to draw. Alternatively, there could be general effects of diversity that are independent of individual item similarities. For example, category boundaries might be shifted away from diverse categories, or similarity relations among stimuli might be normalized according to the amount of variability among the stimuli.

Our aim in this article is to distinguish among these different mechanisms and to determine the extent to which variability gives rise to general changes in how categories are processed. To assess the generality of diversity effects across tasks, we examine category learning, generalization, and memory for category members. We also assess the generality of diversity effects across the stimulus space, from the outer edges of our categories to their centers and their boundaries.

Copyright 2005 Psychonomic Society, Inc.

L.B.C.E. was supported by the Biotechnology and Biological Sciences Research Council and the Unilever Corporation. We thank Jacky Boivin for valuable advice on the MANOVA, and Robert Nosofsky, Dorrit Billman, and several anonymous reviewers for helpful comments on earlier drafts of this article. Correspondence should be addressed to U. Hahn, School of Psychology, Cardiff University, Tower Building, Park Place, Cardiff CF10 3AT, Wales (e-mail: hahnu@cardiff.ac.uk).

Effects of category diversity on learning, memory, and generalization

ULRIKE HAHN, TODD M. BAILEY, and LUCY B. C. ELVIN
Cardiff University, Cardiff, Wales

In this study, we examined the effect of within-category diversity on people's ability to learn perceptual categories, their inclination to generalize categories to novel items, and their ability to distinguish new items from old. After learning to distinguish a control category from an experimental category that was either clustered or diverse, participants performed a test of category generalization or old–new recognition. Diversity made learning more difficult, increased generalization to novel items outside the range of training items, and made it difficult to distinguish such novel items from familiar ones. Regression analyses using the generalized context model suggested that the results could be explained in terms of similarities between old and new items, combined with a rescaling of the similarity space that varied according to the diversity of the training items. Participants who learned the diverse category were less sensitive to psychological distance than were the participants who learned a more clustered category.

290 HAHN, BAILEY, AND ELVIN

Category Learning
Several studies have reported that more-variable categories are harder to acquire than less-variable categories (Fried & Holyoak, 1984; Homa & Vosburgh, 1976; Peterson, Meagher, Chait, & Gillie, 1973; Posner, Goldsmith, & Welton, 1967; Posner & Keele, 1968). However, some of these studies are open to alternative interpretations, so the evidence that diversity affects category learning is less overwhelming than it first appears. In the studies by Posner and colleagues involving random distortions of prototype images, stimulus variability is confounded with recognizability: participants in the low-variability condition, but not the high-variability condition, recognized three out of four category prototypes as familiar shapes (a triangle, an "M", and an "F"). Although the Posner studies are therefore unconvincing as evidence that more-variable categories are harder to learn, two subsequent studies found differences in learnability due to exemplar diversity even when prototype images were random dot patterns rather than familiar images (Homa & Vosburgh, 1976; Peterson et al., 1973).

The only other study demonstrating an effect of diversity on category learning, Fried and Holyoak (1984), used random checkerboard patterns as prototypes and created exemplars of each category by randomly inverting the color of some squares relative to the prototype. Diverse categories (with many squares inverted relative to the prototypes) were harder to learn than less-variable categories (with fewer squares inverted). As Stewart and Chater (2002) point out, the probability that a sizable chunk of a checkerboard pattern remains constant is higher among less-diverse patterns than among those with more variability. Therefore, if discrimination between checkerboard categories relies on chunks common to stimuli in the same category (McLaren, 1997; Palmeri & Nosofsky, 2001), Fried and Holyoak's effect of diversity suggests that it is easier to abstract a few large features from less-diverse patterns than a larger number of small features from more-diverse patterns. A similar interpretation applies to the results discussed above involving distorted dot patterns.

If this explanation is correct (that learning difficulty is determined solely by the number of features required to distinguish the target categories), variability within a fixed number of features should not affect learnability. The categories in our study have been designed to test this prediction. They are distinguished by their locations relative to two obvious continuous perceptual dimensions; the stimuli are not composed of numerous component parts that could be aggregated into abstract features or invariant chunks.

Generalization to Novel Instances
How does diversity during category learning affect the subsequent classification of novel items? Many studies have shown that generalization is affected by the diversity of category members, but the effect of diversity appears to vary with the location of test items relative to the category prototypes. Far away from a prototype, the diversity of previously seen category members has a positive effect on generalization of that category. This has

Figure 1. Training stimuli for the clustered and diverse versions of the experimental category and for the control category. (Rows: Clustered, Control, Diverse; exemplars labeled A–I.)

CATEGORY DIVERSITY 291

been shown both for items near the boundary between categories (Cohen, Nosofsky, & Zaki, 2001; Fried & Holyoak, 1984; Rips, 1989; for examples of boundary items, see Figure 2, items A, I, and G) and for peripheral items on the outer fringes of previously seen exemplars (Flannagan, Fried, & Holyoak, 1986; Fried & Holyoak, 1984; Homa & Vosburgh, 1976; Posner & Keele, 1968; for examples of peripheral items, see Figure 2, items D, E, and F). In contrast, category diversity has a negative effect on generalization in the vicinity of the prototype (Flannagan et al., 1986; Fried & Holyoak, 1984; Homa & Vosburgh, 1976).¹ A noteworthy exception to the general pattern was reported by Peterson et al. (1973), who observed negative effects of category diversity on generalization across a wide range of distances from category prototypes. However, the criterion-learning task used by Peterson et al. meant that participants learning higher diversity categories received more training than did those learning lower diversity categories. This difference in amount of training could explain the uniformly negative effects of diversity observed by Peterson et al. (cf. Flannagan et al., 1986).

Figure 2. Positions of training stimuli (marked by labeled dots) within the parameter space defined by head area and stem height. One group of participants learned to distinguish the control category from the clustered category XC (A), and the other group learned to distinguish the same control category from the diverse category XD (B). Category prototypes (not seen during training) are indicated by stars, and a linear category boundary is shown as a dashed line midway between the prototypes.

Figure 3. Positions of novel test items (marked by asterisks) and training items (dots) within the head area and stem height parameter space, for participants in the clustered category condition (A) and the diverse category condition (B).

In summary, the effect of diversity on generalization seems to vary, so that results for boundary items do not necessarily carry over to central items or to peripheral items, and so on. One possibility is that these various effects simply reflect differences in item similarities. Exemplars of a diverse category are on average more distant from the category prototype (see Figure 3), which could account for a negative effect on generalization to a previously unseen prototype. At the same time, these exemplars are closer to the category boundary and also to fixed peripheral items, which could account for a positive effect on generalization. Are there any effects of diversity above and beyond item similarities, and if so, are these effects the same across the stimulus space? Some studies have controlled or factored out item similarities (Cohen et al., 2001; Posner & Keele, 1968; Rips, 1989; Stewart & Chater, 2002; but see the discussion in E. E. Smith & Sloman, 1994, and Nosofsky & Johansen, 2000), and some have examined generalization right across the stimulus space (Flannagan et al., 1986; Fried & Holyoak, 1984; Homa & Vosburgh, 1976; Peterson et al., 1973). However, no study has done both. Our study was designed to test for general effects of diversity after item similarities were factored out, while probing generalization across boundary, central, and peripheral items.
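The geometric point here can be illustrated with a toy calculation. The coordinates below are hypothetical (the study's actual stimulus values are given only graphically): exemplars drawn from a wider spread around the same prototype are on average farther from that prototype, yet the nearest of them lies closer to a fixed linear boundary.

```python
import numpy as np

# Hypothetical clustered vs. diverse exemplar sets sharing the same
# prototype (their mean), in a 2-D space like head area x stem height.
clustered = np.array([[4.5, 4.5], [5.5, 4.5], [5.0, 4.0], [5.0, 5.0]])
diverse   = np.array([[3.0, 4.5], [7.0, 4.5], [5.0, 2.5], [5.0, 6.5]])

def mean_dist_to_prototype(exemplars):
    """Average Euclidean distance of exemplars to their own mean."""
    prototype = exemplars.mean(axis=0)
    return np.linalg.norm(exemplars - prototype, axis=1).mean()

def min_dist_to_boundary(exemplars, a, b, c):
    """Distance from the nearest exemplar to the line ax + by + c = 0."""
    d = np.abs(exemplars @ np.array([a, b]) + c) / np.hypot(a, b)
    return d.min()

# A diagonal boundary x - y - 3 = 0, lying below/right of the prototype.
print(mean_dist_to_prototype(clustered))           # 0.5  (near prototype)
print(mean_dist_to_prototype(diverse))             # 2.0  (far from prototype)
print(min_dist_to_boundary(clustered, 1, -1, -3))  # ~1.41
print(min_dist_to_boundary(diverse, 1, -1, -3))    # ~0.35 (nearer boundary)
```

Both sets have the same mean, so the opposing effects on prototype and boundary items arise purely from dispersion.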

We also aimed to distinguish between two basic mechanisms that might alter generalization in response to variability. First, response biases (i.e., a basic tendency to favor one response category over another) might form in the classification decision process in favor of more-variable categories (Ashby & Maddox, 1990; Cohen et al., 2001; see also Stewart & Chater, 2002). Second, similarity comparisons might be rescaled or normalized by the amount of variability observed within the relevant categories, altering the perception of distance in the relevant stimulus space. Thus, if we did observe a general effect of variability, our analyses were designed to distinguish whether the effect was mediated by a response bias or some form of rescaling.

Memory for Instances
How does diversity during category learning affect our ability to distinguish new instances from old? Although some theorists maintain that categorization does not rely on memory for specific exemplars, no one denies that exemplars experienced during category-learning tasks normally form traces in (episodic) memory. Are those memory traces affected by the diversity of items experienced during learning? The mechanisms of response bias and rescaling discussed above in relation to generalization may also be relevant to instance memory. Empirically, we are aware of only one study that has examined the effect of diversity in a category-learning task on subsequent recognition memory. Neumann (1977) trained people on a single category, then asked them to rate the familiarity of novel items ("How certain are you that you saw this during training?"). Responses varied depending on the distribution of training items. This result showed that the perceived familiarity of new items was affected by the diversity of items in the training category. However, it is likely that some or all of the effect observed by Neumann reflected item similarities, so it is not clear whether more general effects of diversity were involved.

OVERVIEW OF THE PRESENT STUDY

The goal of our study was to test the effect of category diversity on learning, generalization, and instance memory. Each participant in our study learned a control category, which was the same for all participants, and an experimental category that was either clustered or diverse. This design made it possible, in principle, to detect response bias effects due to category diversity, and also to distinguish between global rescaling across the whole stimulus space and category-specific rescaling around each category according to its own variability. The distance between category prototypes, and between the nearest exemplars of the control and experimental categories, was the same for all participants, as was the amount of training.

Our study used simple two-dimensional materials: flowers that varied in stem height and head area, as shown in Figure 1. These materials allowed us to manipulate category variability without affecting the meaningfulness of the stimuli (cf. Peterson et al., 1973; Posner & Keele, 1968). Also, with these materials we could plausibly assume that people's category representations would be based on two perceptual dimensions (corresponding closely to stem height and head area), which made it possible to factor out effects of item similarity in regression models. Some reassurance for the justification of this assumption is provided by the fact that the two dimensions are assigned equal weights in best-fit regressions (see Appendixes A and B). Optimal classification accuracy for our stimuli requires equal attention to the dimensions of head area and stem height. If Cohen et al. (2001) are correct that learners distribute attention optimally, the equal weighting we observe lends credence to a close correspondence between psychological and physical dimensions for our stimuli.

Training
In a supervised classification training task, each participant learned the control category (C) and an experimental category (X). All participants had the same number of training trials for all exemplars. This controlled exemplar frequency, which can enhance generalization (e.g., Nosofsky, 1988), but did not guarantee the same level of category acquisition across both categories or both groups, as training to criterion performance would. We manipulated the dispersion of training items around the category X prototype, so that half the participants learned a clustered category (XC) and half learned a diverse category (XD), as shown in Figure 2.


If it is generally more difficult to learn highly variable categories than to learn less variable ones, even when the number of distinguishing features is held constant, participants learning XD should make more errors, require more training before achieving criterion performance, and give slower responses than participants learning XC (though previous studies have not assessed whether category diversity affects response speed during learning). Moreover, these effects might be either global, affecting categories C and X equally, or category specific, affecting only category X.

Generalization and Recognition
After the training task, each participant completed either a generalization or old–new recognition task involving the novel items shown in Figure 3. These include the previously unseen category prototypes; internal items within the region spanned by the training items; boundary items that were equidistant from the two category prototypes (on the dashed lines in Figures 2 and 3); and peripheral items beyond the outermost training items. In addition to the novel items, which were the same for all participants, the generalization task included some of the training items, to reinforce category diversity during the task. The recognition task included all of the training items in addition to the novel ones.

Participants in the generalization task classified items as members of category C, category X, or neither (as in Peterson et al., 1973). The neither response provides participants with an alternative to random guessing for peripheral items that may bear little resemblance to either training category. The availability of a neither response elicits up to 60% more information compared with a two-alternative forced-choice task, and reveals the structure of each category on the far side as well as near the boundary between them. The ratio of C versus X responses should be unaffected by the availability of a neither response (Luce, 1959). To see this, suppose that without neither responses a particular peripheral item is classified in category C rather than X 9 out of 10 times. With neither responses, suppose that 80% of the category C responses for this item change to neither, along with the same fraction of category X responses. The ratio between C and X responses remains unchanged.
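The worked example above can be checked numerically. The 9:1 split and the 80% shift are the text's hypothetical figures; the key assumption (licensed by Luce's choice axiom) is that the same fraction of each category's responses moves to "neither."

```python
# Numeric check of the ratio-invariance argument: removing the same
# fraction of C and X responses (to "neither") leaves the C:X ratio intact.
c_responses, x_responses = 9.0, 1.0   # the 9-out-of-10 item from the text
fraction_to_neither = 0.8             # hypothetical shift to "neither"

c_after = c_responses * (1 - fraction_to_neither)
x_after = x_responses * (1 - fraction_to_neither)

print(c_responses / x_responses)  # 9 before
print(c_after / x_after)          # still 9 (up to float rounding)
```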

On the basis of item similarities, and consistent with previous results for items at various locations within the stimulus space (e.g., Fried & Holyoak, 1984, and other works discussed above), we predicted that, on novel items far from the prototype, the diverse category XD would generalize more than the clustered category XC, producing more category X responses to boundary items and to peripheral items beyond the category X training exemplars. For the prototype of the experimental category, we predicted that the diverse category would generalize less, producing fewer category X responses to this item than the clustered category would. Following Fried and Holyoak, we also predicted that people who learned the diverse category would give fewer neither responses overall to novel items than would people who learned the clustered category.

In the recognition task, participants classified each item as old if they thought they had seen it during the training task, or else new. There are virtually no relevant previous studies from which to derive detailed predictions, but if diversity of training items affects recognition, people who learned the diverse category should produce more false alarms to novel items than people who learned the clustered category.

To factor out differences in item similarities between the two groups, we fit regression models that predicted generalization and recognition responses to all test items (old and new) on the basis of their similarities to the training exemplars. Analyses tested whether category diversity affected response biases, and whether there were either global or category-specific differences in distance scaling. Our regression analyses were based on the generalized context model (GCM; Nosofsky, 1984, 1986; see also Medin & Schaffer, 1978). However, our interest was not to test a particular theory of categorization or recognition, but to test whether item similarities alone were sufficient to explain any differences we observed between groups. The GCM has a well-documented capacity to model exemplar effects for perceptual classification stimuli such as ours, and it provides a powerful analytical tool to test for effects above and beyond the influence of item similarities. Details of our adaptations of the GCM are given in Appendixes A and B.
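As a rough sketch of the similarity machinery involved (not the paper's actual regression models, which add the rescaling and bias terms described in Appendixes A and B), the GCM sums a test item's exponentially decaying similarity to the stored exemplars of each category and converts the sums to choice probabilities. Parameter names and default values below are illustrative assumptions.

```python
import numpy as np

def gcm_probabilities(test_item, exemplars, labels, c=1.0, r=2.0, biases=None):
    """Minimal GCM-style choice probabilities (after Nosofsky, 1986).

    Similarity to each stored exemplar decays exponentially with
    Minkowski-r distance; each category's response probability is its
    (optionally bias-weighted) summed similarity over the total.
    """
    test_item = np.asarray(test_item, dtype=float)
    exemplars = np.asarray(exemplars, dtype=float)
    # Minkowski-r distance from the test item to every stored exemplar.
    dists = (np.abs(exemplars - test_item) ** r).sum(axis=1) ** (1.0 / r)
    sims = np.exp(-c * dists)  # exponential similarity gradient
    cats = sorted(set(labels))
    summed = np.array([sims[[lab == k for lab in labels]].sum() for k in cats])
    if biases is None:
        biases = np.ones(len(cats))
    weighted = np.asarray(biases) * summed
    return dict(zip(cats, weighted / weighted.sum()))
```

For example, a test item midway between two category C exemplars and far from an X exemplar is classified as C with high probability; raising the scaling parameter `c` shrinks similarities and sharpens that preference, which is one way to think about the rescaling question the analyses address.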

METHOD

Participants
The participants were 73 undergraduate psychology students who received course credit for participating. All but six were female. The participants were randomly assigned to one of two training conditions, with 36 participants 18–24 years old (M = 19.44, SD = 1.27) in the clustered category condition and 37 participants 18–31 years old (M = 19.54, SD = 2.19) in the diverse condition. No color-blind participants were recruited. After the training task, 41 participants did the generalization task (20 and 21 participants from the clustered and diverse conditions, respectively). The other 32 participants did the recognition task (16 each from the two training conditions).

Stimuli
The training task used three sets of nine flowers, depicted in Figure 1, comprising exemplars of the control category C, the clustered category XC, and the diverse category XD. There were three variants of each flower, with different colors in the flower head. Figure 2 shows the locations of the training exemplars within the two-dimensional space of stem height and head area. Categories XC and XD had the same prototype, that is, the same stem height and head area averaged across training exemplars. Apart from exemplar I, which is the same in XC and XD, all other exemplars were closer to the prototype in XC than in XD. Exemplars of the control category corresponded to pairwise average coordinates of XC and XD exemplars, reflected across the diagonal category boundary shown as a dashed line in Figure 2. Category C was therefore a mirror-image average of XC and XD, with an intermediate level of diversity. Neither stem height nor head area alone was sufficient to distinguish category C from XC or XD; participants had to use both stimulus dimensions to accurately classify members of the two categories.
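The mirror-image construction described here (reflecting exemplar coordinates across the diagonal category boundary) can be sketched as follows. The line and points are illustrative assumptions, since the actual stimulus coordinates appear only graphically in Figure 2.

```python
import numpy as np

def reflect_across_line(points, p0, direction):
    """Reflect 2-D points across the line through p0 with the given direction.

    Illustrates how mirror-image control exemplars could be constructed;
    the specific line and coordinates used below are hypothetical.
    """
    points = np.asarray(points, dtype=float)
    p0 = np.asarray(p0, dtype=float)
    u = np.asarray(direction, dtype=float)
    u = u / np.linalg.norm(u)
    rel = points - p0
    # Decompose each point into components parallel and perpendicular to
    # the line, then flip the perpendicular component.
    par = np.outer(rel @ u, u)
    return p0 + 2 * par - rel

# Reflecting across the 45-degree line through the origin swaps coordinates:
print(reflect_across_line([[2.0, 0.0]], [0.0, 0.0], [1.0, 1.0]))  # [[0. 2.]]
```

Reflecting twice returns the original points, a quick sanity check that the construction is a true mirror image.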


Thirteen novel flowers for the generalization and recognition tasks were located as in Figure 3. We included more peripheral items for category X, in anticipation of the diversity manipulation affecting it more than category C. The novel items for category C occupied mirror-image coordinates with respect to corresponding items for category X. All 13 novel flowers were used in the generalization task, along with training exemplars A′, D′, F′, H′, and I′ from category C and the corresponding exemplars from either XC or XD, as appropriate. The recognition task used the novel flowers plus the exemplars of category C and either XC or XD, as appropriate.

Three different flower head color schemes were used to create a task-irrelevant dimension of variation, with three variants of each flower. During training, each participant saw each flower the same number of times in each color scheme. In the generalization and recognition tasks, each flower occurred once in each color scheme. In addition to reducing ceiling effects during training, the color variations allowed us to probe recognition memory for each flower three times (once in each color scheme) without actually presenting any specific stimulus more than once.

Procedure
The participants were tested individually in a quiet room. First, the experimenter read instructions for the learning task aloud. The participants were encouraged to respond quickly but accurately. A computer controlled by a SuperLab script presented stimuli on a display screen and recorded the participants' keypresses and latencies. During training, the participants classified flowers into a "gold" category (category C) and a "silver" category (category X) by pressing keys labeled with an appropriately colored star. After each response, the flower image was replaced with feedback saying either CORRECT or INCORRECT for 500 msec before the next flower was presented. There were 15 blocks of training trials, with each of the 18 training flowers presented once per block, in a different random order. Random orders were matched across participants in the two training conditions. Colors were randomized, but each flower appeared five times in each color scheme. The participants had a short break after every three blocks.

After training, each participant performed either the generalization or old–new recognition task. Both tasks began with instructions read aloud by the experimenter. In the generalization task, the participants classified flowers into the categories "gold," "silver," or "neither" (a blue sticker marked the key for a neither response). There were three blocks of generalization trials, with 13 novel flowers and 10 old flowers in each. Colors were randomized, with each flower appearing once in each color scheme. In the old–new recognition task, the participants decided whether each flower had appeared during the training task, and responded by pressing keys labeled old or new. There were three blocks of recognition trials, with 13 novel and 18 old flowers in each. Again, colors were randomized, with each flower appearing once in each color scheme. The participants received no feedback during either the generalization or recognition tasks. Random orders were matched across participants in the clustered and diverse training conditions. After each classification or old–new response, the test flower was replaced with the question "How confident are you?", and the participants responded on a scale from 1 (low confidence) to 9 (high confidence). The participants had the opportunity for a short break between each block.

RESULTS

Learning
The first block of trials was excluded from analysis because it would primarily reflect guessing. Participants learning the diverse category generally made more errors than those learning the clustered category, as shown in Figure 4. Inferential tests were based on three measures of difficulty computed for each participant: the first training block completed with no errors, the total number of errors made, and the mean reaction time (RT) for correct responses. Here and throughout, two-tailed tests are reported and significance is evaluated at an alpha level of .05.

In computing the first error-free training block (from2ndash15) two participants learning the clustered categoryand 10 learning the diverse one did not achieve an error-free block These participants were assigned scores of16 Scores for the first error-free block did not satisfyparametric assumptions so this variable was analyzedseparately from errors and RTs Participants learning theclustered category achieved their first error-free blockearlier in training [median 40 IQR (interquartilerange) 675] than did participants learning the diverse

Figure 4. Fraction of error responses during category learning as a function of training group (clustered or diverse), category (C or X), and block of training.

CATEGORY DIVERSITY 295

category (median = 9.0, IQR = 12.0). This difference was significant in a Mann–Whitney test (U = 446.0, n1 = 36, n2 = 37, p = .014).

The total numbers of errors were computed separately for the control and experimental categories, and a square root transform was applied to the totals for each participant to reduce skewness, kurtosis, and heterogeneity of variance. The mean root errors were 2.62 and 2.77 for the clustered condition, categories C and X respectively, and 3.24 and 3.60 for the diverse condition (SE = 0.238, 0.189, 0.206, and 0.226, respectively). These mean values are detransformed and expressed as error rates in Table 1. RTs for correct responses were log transformed. Three outlying RTs (out of 16,765), less than 200 msec or greater than 5,000 msec, were excluded from analysis. The mean log RTs were 2.775 and 2.779 for the clustered condition, categories C and X, and 2.836 and 2.848 for the diverse condition (SE = 0.016, 0.016, 0.014, and 0.015, respectively). These mean values are detransformed and shown in Table 1. A two-way multivariate analysis of variance (MANOVA) was performed on root errors and log reaction times, with training diversity (clustered or diverse) as a between-subjects factor and stimulus category (control or experimental) as a within-subjects factor.2 Here and throughout, there were no multivariate outliers within levels of training diversity (p < .001). This means that the other statistics from these MANOVAs are not unduly influenced by a few unrepresentative data points.

In combination, errors and RTs were affected by category diversity [F(2,70) = 7.67, p = .001]. People learning the clustered category made fewer errors and gave faster responses than did people learning the diverse category. There was no significant difference between the control and experimental categories [F(2,70) = 1.62, p = .21], nor an interaction between diversity condition and category [F(2,70) < 1]. Thus, the diversity of category X affected responses to both categories C and X equally.

Errors and RTs made independent contributions to the combined differences between training groups, as indicated by a unique effect for each variable after factoring out the other in Roy–Bargmann stepdown analyses [Fs(1,70) > 5.24, ps < .025]. In univariate ANOVAs, the main effect of category diversity was significant both for errors [F(1,71) = 7.78, p = .007] and for RTs [F(1,71) = 9.51, p = .003].

Generalization
Average response percentages for various types of test item in the category generalization task are shown in Table 2. Asterisks identify those entries that are relevant to the predicted diversity effects we outlined above. To limit the family-wise error rate, we tested only a few key contrasts from this table and followed up with regression analyses encompassing the whole of the data. Confidence ratings for the generalization and recognition tasks have not yet been analyzed and are not reported.

Table 1
Summary Measures of Training Difficulty, Including Error Rates and Reaction Times (RTs, in Milliseconds) for Correct Responses, as a Function of Training Group (Clustered or Diverse) and Test Category (C or X)

              Error Rate (%)            RTs
Training      Category C  Category X    Category C  Category X
Clustered         5.5         6.1           595         601
Diverse           8.3        10.3           685         705

Table 2
Distribution of Generalization Responses, Showing Percentage of Each Response Alternative as a Function of Stimulus Region and Type of Novel Item

                                             Response
                                      Control       Experimental     Neither
Stimulus Region  Item Type   Training      %   SD        %   SD       %   SD
Boundary         Boundary    Clustered    54    5       18    4      28    4
                             Diverse      53    5       23    4      24    4
Control          Prototype   Clustered    95    3        2    2       3    2
                             Diverse      95    3        0    0       5    3
                 Inside      Clustered    97    2        0    0       3    2
                             Diverse      95    3        2    2       3    2
                 Peripheral  Clustered    90    6        5    5       5    4
                             Diverse      87    7        0    0      13    7
Experimental     Prototype   Clustered     2    2       88    4      10    4
                             Diverse       2    2       76    7      22    7
                 Inside      Clustered     0    0       92    3       8    3
                             Diverse       2    2       79    7      19    7
                 Peripheral  Clustered     6    2       50    4      44    4
                             Diverse       8    2       66    4      26    3
Overall                      Clustered    37    3       38    3      26    2
                             Diverse      37    3       43    3      21    2

Note—*Entries relevant to predicted diversity effects outlined in the text.

296 HAHN BAILEY AND ELVIN

Analysis of response fractions. Our initial analysis of generalization examined four measures: the fraction of neither responses across all novel items (neither), the fraction of category X responses for novel boundary items (boundaryX), the fraction of category X responses for novel items on the periphery of category X (periphX), and the fraction of category X responses for the unseen category X prototype (protoX). The protoX scores were near ceiling and could not be analyzed parametrically. Although protoX scores were slightly lower for participants trained on the diverse category compared with the clustered category, this difference between groups was not significant [Mann–Whitney (corrected for ties), Z = 1.16, n1 = 20, n2 = 21, p = .25].

A one-way MANOVA was performed on the variables neither, boundaryX, and periphX, with training diversity (clustered or diverse) as a between-subjects factor. In combination, these variables were affected by training diversity [F(3,37) = 3.19, p = .035]. People trained on the diverse category gave fewer neither responses overall, more category X responses for boundary items, and more category X responses for items on the periphery of category X, as predicted. Roy–Bargmann stepdown analysis (details not reported here) indicated that the significance of the overall effect was due primarily to the periphX variable: people trained on the diverse category gave fewer neither responses to items on the periphery of category X and classified them as members of category X instead.

Regression models. In our stimuli, similarities between test and training items necessarily varied as a function of training diversity, so it is possible that the effect of diversity on periphX, for example, reflects the fact that peripheral items were closer to some of the diverse training exemplars than to the corresponding clustered exemplars. To factor out the effect of item similarities, and to test for a general effect of variability, we fit regression models, as described in Appendix A, to predict each participant's responses to the full set of test items (both old and new). The predicted probability of a particular response was a function of the similarity between each test item and the training exemplars of categories C and X. The full model included three parameters of interest: (1) a distance scaling parameter s that determined the sensitivity of response probabilities to a given change in head area or stem height; (2) a category C relative response bias, biasC|CX, that determined the bias (on a scale from 0 to 1) to classify something as category C rather than X; and (3) a neither response bias, biasneither, that determined the bias to say that something belonged to neither category C nor X. Changes in s and biasneither can have superficially similar effects on predicted response probabilities, but a change in biasneither will not affect the ratio of C to X responses, whereas a change in sensitivity will. The two parameters consequently have distinguishable effects on regression fits.
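This distinguishability can be checked numerically. The sketch below uses hypothetical distances; the exponential-decay similarity follows the GCM, and a constant K stands in for the neither response strength (Appendix A gives the exact parametrization):

```python
import math

def gcm_probs(dists_C, dists_X, s, bias_C=0.5, K=1.0):
    """Choice probabilities for C, X, and 'neither' responses
    (exponential-decay GCM with a constant 'neither' threshold K)."""
    sim_C = bias_C * sum(math.exp(-s * d) for d in dists_C)
    sim_X = (1 - bias_C) * sum(math.exp(-s * d) for d in dists_X)
    total = sim_C + sim_X + K
    return sim_C / total, sim_X / total, K / total

# hypothetical test item: nearer to C's exemplars than to X's
dists_C, dists_X = [0.5, 0.8], [1.0, 1.4]

pC1, pX1, _ = gcm_probs(dists_C, dists_X, s=2.0, K=0.5)
pC2, pX2, _ = gcm_probs(dists_C, dists_X, s=2.0, K=2.0)  # larger 'neither' strength
# the C:X response ratio is unchanged by K ...
assert abs(pC1 / pX1 - pC2 / pX2) < 1e-12

pC3, pX3, _ = gcm_probs(dists_C, dists_X, s=4.0, K=0.5)  # higher sensitivity
# ... but it does change with the scaling parameter s
assert pC3 / pX3 > pC1 / pX1
```

Raising K redistributes probability toward neither responses without touching the C:X ratio, whereas raising s magnifies the advantage of the nearer category; this is why the two parameters can be separated in the regression fits.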

Goodness-of-fit statistics for models with and without separate biasC|CX parameters for each participant indicated that it did not vary significantly within training groups [χ2(39) = 41, p = .37] nor between training groups [χ2(1) = 0.21, p = .65]. Subsequent analyses fixed this parameter at the overall median value, which favored category C over X by a ratio of more than 2:1. This level of bias means that an item that was equally similar to both categories would be classified as a member of category C twice as often as category X. This strong bias may reflect participants' expectation that the two categories would occur equally often, whereas there was a preponderance of novel items on the category X side of the stimulus space.3

With biasC|CX held at the overall median value, we determined best-fit values of s and biasneither for each participant. Figure 5 shows response contours for the two training groups, based on median values of s and biasneither. As the 70% contours show, the inclusion of a neither response allowed us to model category structure on the far side of each category, as well as on the side near the boundary. The median values of biasneither were 0.0297 and 0.0310 for the clustered and diverse training groups, respectively (IQRs = 0.0446 and 0.0549). This difference was not significant in a Mann–Whitney test (U = 175, n1 = 20, n2 = 21, p = .36).

The median values of s were 2.72 and 2.26 for the clustered and diverse groups (IQRs = 1.40 and 0.55).

Figure 5. Categorization training and test items and predicted response contours. Markers show training items for clustered () or diverse () training groups, and novel test items (). Solid lines are for the clustered training group, showing the equiprobability category boundary and the 70% categorization contours for each category. Dotted lines show the same information for the diverse training group.


Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 131, n1 = 20, n2 = 21, p = .039). The effect of this difference in s values is illustrated in Figure 6A, which shows the influence of each training item on the classification of a test item as a function of the distance between the two. The rescaled similarity of an exemplar at distance 0.4 (where the two curves are furthest apart) is 20% greater for participants who learned the diverse category than for those who learned the clustered category. Distance 0.4 corresponds to a circle around a training item on Figure 2 or 3, extending not quite one grid unit in each direction.
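The 20% figure can be recovered directly from the median s values, assuming the GCM's exponential-decay similarity function:

```python
import math

# median best-fit s values reported above (clustered vs. diverse groups)
s_clustered, s_diverse = 2.72, 2.26
d = 0.4  # distance at which the two curves in Figure 6A differ most

def similarity(s, d):
    # exponential-decay similarity assumed by the GCM
    return math.exp(-s * d)

ratio = similarity(s_diverse, d) / similarity(s_clustered, d)
print(round(100 * (ratio - 1)))  # → 20 (percent greater exemplar influence)
```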

Did the effect of training diversity affect distance scaling for categories C and X equally? The GCM ordinarily assumes that the same scaling applies throughout the stimulus space, but Nosofsky and Johansen (2000) have suggested that category-specific scaling could explain the apparent bias toward diverse categories reported in Rips's (1989) classic pizza-or-coin study. We fit an enhanced regression model that had separate s parameters for categories C and X (for each participant) and found no significant difference in goodness of fit between the enhanced model and the original [χ2(41) = 37, p = .66]. Our results therefore suggest global rescaling across the whole stimulus space and do not indicate category-specific rescaling.

Recognition
The proportions of incorrect responses for various types of test item in the old–new recognition task are summarized in Table 3. Miss rates for training exemplars ranged from 22% to 33%, whereas false alarm rates for most new items were over 50%. The two training groups differed markedly in their responses to peripheral items beyond the outer edge of category X exemplars. People

trained on the clustered category mistook these for old items on only 22% of trials, whereas those trained on the diverse category mistook them on 50% of trials.

Analysis of false alarms. Due to the very limited amount of relevant past research, the predictions (outlined above) for the recognition task were very general, so we conducted only a limited set of direct inferential tests. Our analysis pooled novel items into three groups and examined rates of false alarms (FA) for novel items on the category C side of the stimulus space (FA_C), the category X side (FA_X), and on the boundary between categories (FA_Bndy). A one-way MANOVA was performed on the three false alarm scores, with training diversity (clustered or diverse) as a between-subjects factor. In combination, the false alarm scores were affected by training diversity [F(3,28) = 5.48, p = .004]. Roy–Bargmann stepdown analysis indicated that the significance of the overall effect was due primarily to FA_X:

Table 3
Error Rates in Old–New Recognition Task by Training Diversity, Showing Percentage for Training Exemplars (Old) and Novel Test Items (New) as a Function of Stimulus Region and Type of Novel Item

                                          Training
                                   Clustered       Diverse
Region  Stimulus      Type           %    SD        %    SD
Old     Control                     25    29       33    30
        Experimental                22    24       29    28
New     Boundary      Boundary      58    37       65    35
        Control       Prototype     83    24       79    24
                      Inside        77    29       58    26
                      Peripheral    58    35       46    34
        Experimental  Prototype     83    24       77    23
                      Inside        69    26       77    26
                      Peripheral    22    30       50    36

Figure 6. Generalized context model similarity as a function of training group (clustered or diverse) and distance from an exemplar. Derived from median best-fit values of s for each training group when tested on generalization (A) and recognition (B).


Participants trained on the clustered category made fewer false alarms on the category X side of the boundary (adjusted for FA_C and FA_Bndy: mean FA_X = 37.5%, SE = 4.47) than did participants trained on the diverse category (adjusted mean FA_X = 58.4%, SE = 4.47). Inspection of false alarm rates for individual items (Table 3) suggested that the effect was largely attributable to peripheral items.

Regression models. The effect of diversity on FA_X could be due to item similarities, since the peripheral items on the category X side were closer to the nearest exemplars of the diverse category than to those of the clustered one. To factor out item similarities, we fit regression models, as described in Appendix B, to predict each participant's responses to the full set of test items. The predicted probability of an old response was a function of the similarity between each test item and the training exemplars of both categories C and X. There were two free parameters of interest: the distance scaling parameter s, as discussed above, and a response bias, Bnew, that determined the bias (on a scale from 0 to 1) to say that something was new rather than old. Figure 7 shows response contours for the two training groups, based on median values of s and Bnew. The response probability distribution was generally broader and less peaked in the diverse condition than in the clustered condition.
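A minimal sketch of such a summed-similarity recognition rule is given below. The mapping from Bnew to a response threshold is an assumption for illustration (it mirrors the Appendix A treatment of the neither threshold), not the exact Appendix B form:

```python
import math

def p_old(dists, s, bias_new, E=1.0):
    """Sketch: probability of an 'old' response from the summed similarity of a
    test item to all training exemplars, against a 'new' response threshold.
    K = bias_new * E is an assumed parametrization, not the paper's exact one."""
    total_sim = sum(math.exp(-s * d) for d in dists)
    K = bias_new * E
    return total_sim / (total_sim + K)

# lower sensitivity s (as fit for the diverse group) makes distant
# items look more familiar, i.e., produces more false alarms
assert p_old([1.0, 1.2], s=1.17, bias_new=0.33) > p_old([1.0, 1.2], s=2.40, bias_new=0.33)
```

Lowering s flattens the similarity gradient, which broadens the old-response probability surface; that is the qualitative pattern visible in Figure 7.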

The median values of Bnew were 0.23 and 0.33 for the clustered and diverse training groups, respectively (IQRs = 0.26 and 0.17). However, the difference was not statistically significant (U = 81, n1 = 16, n2 = 16, p = .080). Thus, as a nonsignificant trend, participants who learned the clustered category were somewhat less inclined to give new responses (equivalently, they required somewhat less evidence for an old response) than were participants who learned the diverse category.

The median values of s were 2.40 and 1.17 for the clustered and diverse training groups, respectively (IQRs = 1.48 and 1.40). Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 33, n1 = 16, n2 = 16, p = .001). This difference in sensitivity echoes that observed in the generalization study. The effect of the sensitivity difference on recognition is illustrated in Figure 6B. The difference in exemplar influence is greatest at distance 0.65 (about 1 1/3 grid units in Figures 2 and 3), where the influence is 168% greater for participants who learned the diverse category than for those who learned the clustered category.

Once more, there was no significant improvement in fit for a model incorporating separate scaling parameters for categories C and X, compared with the simpler model with a single global scaling parameter [χ2(32) = 23.4, p = .86].

GENERAL DISCUSSION

Our experiments suggest that the diversity of perceptual categories affects learning, generalization, and item recognition by altering the scale of similarities, as well as the positions of exemplars within the stimulus space. People learning a diverse category made more errors and gave slower responses than did people learning a clustered category. After having the same amount of category training, people who learned a diverse category were more likely than those who learned a clustered category to accept distant peripheral items as members of the category. And people who learned a diverse category were more likely to wrongly think that they had previously seen a distant peripheral item than were people who learned a clustered category. The effects of diversity on generalization and recognition were generally consistent with responding on the basis of item similarities. However, regression modeling with the GCM revealed a further effect: participants' perception of the stimuli was altered, so that people who learned a diverse category required a greater physical difference to perceive the same psychological difference between stimuli. This perceptual rescaling effect led to lower levels of accuracy in the training task, wider generalization, and poorer item recognition, over and above effects that could be attributed to item similarities.

This study sought to distinguish different mechanisms that might produce diversity effects, and the use of three different but related tasks makes the present study particularly informative. There are numerous accounts of

Figure 7. Recognition test items and predicted response contours. Markers show training items for clustered () or diverse () training groups, and novel test items (). Solid lines are for the clustered training condition, showing 50% (outer) and 70% (inner) recognition contours. Dotted lines show the same information for the diverse training group.


category learning in the literature that could explain the reduced learning accuracy we observed in the diverse condition, but which do not simultaneously explain the effects we also observed in generalization and recognition. For example, Fried and Holyoak's (1984) category density model represents categories by their means (prototypes) and variances, estimated from a sample of exemplars. Diverse categories are learned more slowly than are clustered ones because estimates of the means and variances converge more slowly when the stimuli are less uniform. However, because the category density model is concerned only with categorization, it has nothing at all to say about the diversity effects we observed in old–new recognition. And although the category density model predicts that diversity will affect generalization, the pattern of responses in our data is predicted much better by exemplar similarities than by similarities to the category prototypes (cf. J. D. Smith & Minda, 2002; see Appendixes A and B for details).

Diversity effects could plausibly arise from item similarities, shifts in category boundaries (as reflected in categorization biases), or rescaling of similarity relations. A combination of methodological factors contributed to our ability to distinguish these different mechanisms. Regression models allowed us to separate out item similarities, to simultaneously consider response probabilities across a large set of test items (including old and new items, peripheral and internal as well as boundary items), and, in the generalization study, to simultaneously consider the distribution of responses across all three choices. We would not have detected the global rescaling effect if we had examined just a few critical items on the boundary between categories, because the effect of the diversity manipulation on individual items was quite small after item similarities were factored out. In the generalization task, the neither response increased the information obtained on each trial (just as a four-choice multiple choice test is more informative than the same number of true/false questions) and made it possible to detect changes in category structure on the far sides of the categories.

Most important, the neither response made it possible in principle to distinguish between changes in category biases and rescaling of similarities. We found no evidence that similarity relations were category specific, with different scaling factors for different regions of stimulus space according to the variability of categories in those regions. Our data also showed no evidence that response biases varied as a function of category variability. Rather, there was a global change in sensitivity to psychological distance in response to increased variability. This finding is consistent with the effect of stimulus range observed by Braida and Durlach (1972) in studies of sound intensity perception. Our results show that the effect is not confined to unidimensional stimuli, and also demonstrate that it affects standard category learning and generalization tasks, as well as item recognition.

With regard specifically to generalization, our modeling demonstrates that exemplar similarities can explain

the seemingly different results observed in previous studies for central, peripheral, and boundary items. Only a few previous studies have controlled exemplar similarities in their tests of diversity effects on generalization (Cohen et al., 2001; Rips, 1989; Stewart & Chater, 2002). These studies focused on boundary items between a clustered category and a diverse category. Rips's pizza-or-coin study found that judgments of category membership differed from judgments of category similarity, and in particular, that categorization of boundary items was biased in favor of diverse categories rather than clustered categories. Rips concluded that because judgments for the categorization task differed from those for the similarity task, categorization could not be based on similarity. Nosofsky and colleagues have suggested alternative interpretations within the framework of the GCM (Cohen et al., 2001; Nosofsky & Johansen, 2000). Moreover, our own results and those of numerous other studies (e.g., Shin & Nosofsky, 1992; see also works cited in Nosofsky & Johansen, 2000) show that similarity scaling can vary markedly between different tasks. This factor alone could explain the differences that Rips observed between judgments of category membership and category similarity.

The same confound applies to Cohen et al.'s (2001) Experiment 1, which contrasted performance on identification and categorization tasks. On the basis of a single test item, Cohen et al. concluded that high-diversity categories exert a "pull" on category judgments. However, their results could equally be explained by global effects of the sort observed in the present study; that is, by similarity scaling throughout the stimulus space, adjusted according to the particular task as well as the overall diversity of the stimulus items. In contrast, Cohen et al.'s Experiment 2 provides more compelling evidence for a category-specific effect of diversity on categorization. However, because their analysis focused on a single boundary item, there is no way to tell whether the category-specific effect involved response biases or category-specific rescaling of similarities. These different mechanisms can only be distinguished by examining responses across a wider range of stimuli.

Finally, our finding that category diversity affects global similarity relations has implications for the interpretation of Stewart and Chater's (2002) results. Their analyses of individual data indicated that, for most participants, boundary items were less likely to be classified as members of an experimental category with high diversity than with low diversity. On the basis of these results, Stewart and Chater suggest that response biases might be sensitive to category diversity. Alternatively, their results could reflect the same kind of global rescaling of similarities that we observed in the present study.

Although our results point to a global effect of diversity that affects the whole stimulus space in learning, generalization, and recognition tasks, they do not identify what aspect of diversity is relevant (nor have previous studies of diversity effects in categorization). Participants might be sensitive to the total range of stimuli (Braida & Durlach, 1972), or they might be sensitive to


average absolute deviations from the grand mean, or average variance about category means, or some other specific measure of variability. Only detailed comparisons across various stimulus distributions will eventually distinguish among these possibilities.

REFERENCES

Ashby, F. G., & Maddox, W. T. (1990). Integrating information from separable psychological dimensions. Journal of Experimental Psychology: Human Perception & Performance, 16, 598-612.

Barsalou, L. W. (1983). Ad hoc categories. Memory & Cognition, 11, 211-227.

Braida, L. D., & Durlach, N. I. (1972). Intensity perception: II. Resolution in one-interval paradigms. Journal of the Acoustical Society of America, 51, 483-502.

Cohen, A. L., Nosofsky, R. M., & Zaki, S. R. (2001). Category variability, exemplar similarity, and perceptual classification. Memory & Cognition, 29, 1165-1175.

Corter, J. E., & Gluck, M. A. (1992). Explaining basic categories: Feature predictability and information. Psychological Bulletin, 111, 291-303.

Flannagan, M. J., Fried, L. S., & Holyoak, K. J. (1986). Distributional expectations and the induction of category structure. Journal of Experimental Psychology: Learning, Memory, & Cognition, 12, 241-256.

Fried, L. S., & Holyoak, K. J. (1984). Induction of category distributions: A framework for classification learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 234-257.

Hahn, U., & Ramscar, M. C. A. (2001). Mere similarity. In U. Hahn & M. C. A. Ramscar (Eds.), Similarity and categorization (pp. 257-272). Oxford: Oxford University Press.

Heit, E., & Hahn, U. (2001). Diversity-based reasoning in children. Cognitive Psychology, 43, 243-273.

Homa, D., & Vosburgh, R. (1976). Category breadth and the abstraction of prototypical information. Journal of Experimental Psychology: Human Learning & Memory, 2, 322-330.

Jones, G. V. (1983). Identifying basic categories. Psychological Bulletin, 94, 423-428.

Luce, R. D. (1959). Individual choice behavior: A theoretical analysis. New York: Wiley.

Mareschal, D., French, R. M., & Quinn, P. C. (2000). A connectionist account of asymmetric category learning in early infancy. Developmental Psychology, 36, 635-645.

McLaren, I. P. L. (1997). Categorization and perceptual learning: An analogue of the face inversion effect. Quarterly Journal of Experimental Psychology, 50A, 257-273.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Murphy, G. L., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289-316.

Neumann, P. G. (1977). Visual prototype formation with discontinuous representation of dimensions of variability. Memory & Cognition, 5, 187-197.

Nosofsky, R. M. (1984). Choice, similarity, and the context theory of classification. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 104-114.

Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.

Nosofsky, R. M. (1988). Similarity, frequency, and category representations. Journal of Experimental Psychology: Learning, Memory, & Cognition, 14, 54-65.

Nosofsky, R. M., & Johansen, M. K. (2000). Exemplar-based accounts of "multiple-system" phenomena in perceptual categorization. Psychonomic Bulletin & Review, 7, 375-402.

Osherson, D. N., Smith, E. E., Wilkie, O., Lopez, A., & Shafir, E. (1990). Category-based induction. Psychological Review, 97, 185-200.

Palmeri, T. J., & Nosofsky, R. M. (2001). Central tendencies, extreme points, and prototype enhancement effects in ill-defined perceptual categorization. Quarterly Journal of Experimental Psychology, 54A, 197-235.

Peterson, M. J., Meagher, R. B., Jr., Chait, H., & Gillie, S. (1973). The abstraction and generalization of dot patterns. Cognitive Psychology, 4, 378-398.

Posner, M. I., Goldsmith, R., & Welton, K. E., Jr. (1967). Perceived distance and the classification of distorted patterns. Journal of Experimental Psychology, 73, 28-38.

Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353-363.

Quinn, P. C., Eimas, P. D., & Rosenkrantz, S. L. (1993). Evidence for representations of perceptually similar natural categories by 3-month-old and 4-month-old infants. Perception, 22, 463-475.

Rips, L. J. (1989). Similarity, typicality, and categorization. In S. Vosniadou & A. Ortony (Eds.), Similarity and analogical reasoning (pp. 21-59). New York: Cambridge University Press.

Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382-439.

Shin, H. J., & Nosofsky, R. M. (1992). Similarity-scaling studies of dot-pattern classification and recognition. Journal of Experimental Psychology: General, 121, 278-304.

Smith, E. E., & Sloman, S. A. (1994). Similarity- versus rule-based categorization. Memory & Cognition, 22, 377-386.

Smith, J. D., & Minda, J. P. (2002). Distinguishing prototype-based and exemplar-based processes in dot-pattern category learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 800-811.

Stewart, N., & Chater, N. (2002). The effect of category variability in perceptual categorization. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 893-907.

Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins.

Wayne, A. (1995). Bayesianism and diverse evidence. Philosophy of Science, 62, 111-121.

Younger, B. A. (1985). The segregation of items into categories by ten-month-old infants. Child Development, 56, 1574-1583.

NOTES

1. These studies all involved nonnormal distributions of exemplars, which were generally sampled from donut-shaped regions of stimulus space. The effect of changes in the variance of normally distributed training exemplars has received little, if any, experimental attention.

2. The MANOVA computes an optimal linear combination of several dependent variables (DVs) and computes an ANOVA on the resulting composite DV. If the original DVs are correlated (positively or negatively), the MANOVA protects against the inflated Type I error rate produced by separate tests of the DVs (see Tabachnick & Fidell, 1996).

3. Thanks to Dorrit Billman for suggesting this explanation.


APPENDIX ADetails of Modeling for Generalization Data

According to the generalized context model (GCM), classification decisions are based on similarity comparisons between a test item and individual exemplars of each category stored in memory. In our generalization task, participants judged whether item i was a member of category C, category X, or neither. The GCM predicts the probability of a category C response on the basis of the total weighted similarity of the test item to exemplars of category C, divided by the weighted similarity of the item to exemplars of both categories, plus a constant response threshold for neither responses:

P(C \mid i) = \frac{\mathrm{bias}_C \sum_{j \in C} e^{-s\,d_{ij}}}{\mathrm{bias}_C \sum_{j \in C} e^{-s\,d_{ij}} + \mathrm{bias}_X \sum_{j \in X} e^{-s\,d_{ij}} + K}    (A1)

The term d_ij is the distance between stimulus i and exemplar j, computed as a weighted Euclidean distance:

d_{ij} = \sqrt{w_x (x_i - x_j)^2 + w_y (y_i - y_j)^2}    (A2)

where the xs and ys refer to head area and stem height measurements, respectively, and the dimension weights w_x and w_y sum to 1. Bias_C and bias_X are response bias parameters, and s is a scaling parameter that determines the sensitivity of response probabilities to a given change in distance. The probability of a category X or neither response is computed in a similar way, by replacing the numerator of Equation A1 with the appropriate term from the denominator.

Without loss of generality, we factor the response threshold K into the product of a bias parameter and the average total similarity of each exemplar to all the others, E. Thus, K = bias_neither * E, where

E = \frac{1}{N} \sum_{j} \sum_{k \neq j} e^{-s\,d_{jk}}    (A3)

N is the total number of exemplars. Higher values of bias_neither produce more neither responses. There are two key advantages of this innovative treatment of the neither response strength. The effect of the parameter bias_neither on goodness of fit is largely orthogonal to other parameters of the model, including the scaling parameter s. As a consequence, regressions converge more quickly than with an undifferentiated K response strength parameter. Also, bias_neither is independent of the number of exemplars and the average similarity among them, so meaningful comparisons can be made between best-fit values of bias_neither obtained across different stimulus sets. In contrast, such comparisons with the undifferentiated K are generally not meaningful.

Also without loss of generality, we restrict the bias parameters (bias_C, bias_X, and bias_neither) to the range from 0 to 1 and require them to sum to 1 (there are only two degrees of freedom among these three parameters). Finally, we are usually interested in the relative magnitudes of bias_C and bias_X compared with each other, so we define bias_C = bias_C|CX * (1 - bias_neither) and bias_X = bias_X|CX * (1 - bias_neither). The parameters bias_C|CX and bias_X|CX determine the relative magnitudes of the two category response biases. They range from 0 to 1 and sum to 1 (having only one degree of freedom between them).
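Collecting Equations A1-A3 and the bias reparameterization, a minimal Python sketch of the full model might look like this (stimulus coordinates and parameter values are hypothetical):

```python
import math

def full_model(item, ex_C, ex_X, s, w_x, bias_c_cx, bias_neither):
    """Sketch of the Appendix A model (Equations A1-A3): probabilities of
    category C, category X, and 'neither' responses for one test item.
    Items are (head area, stem height) pairs; all values here are illustrative."""
    def dist(a, b):
        # Equation A2: weighted Euclidean distance (weights sum to 1)
        return math.sqrt(w_x * (a[0] - b[0]) ** 2 + (1 - w_x) * (a[1] - b[1]) ** 2)

    def total_sim(probe, exemplars):
        return sum(math.exp(-s * dist(probe, e)) for e in exemplars)

    # Equation A3: E = average total similarity of each exemplar to the others
    pool = ex_C + ex_X
    n = len(pool)
    E = sum(total_sim(pool[j], pool[:j] + pool[j + 1:]) for j in range(n)) / n
    K = bias_neither * E

    # reparameterized response biases (the three biases sum to 1)
    bias_C = bias_c_cx * (1 - bias_neither)
    bias_X = (1 - bias_c_cx) * (1 - bias_neither)

    # Equation A1 and its analogues for the X and 'neither' responses
    num_C = bias_C * total_sim(item, ex_C)
    num_X = bias_X * total_sim(item, ex_X)
    denom = num_C + num_X + K
    return num_C / denom, num_X / denom, K / denom
```

With bias_neither = 0 the threshold K vanishes and the model reduces to the standard two-category GCM choice rule.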

The full model defined above is fully specified with reference to four independent parameters: s, w_X, bias_C|CX, and bias_neither. We used this model to predict participants' classification responses, averaged across the three presentations of each flower (once in each color scheme). Thus, for each of the 23 test flowers, the model predicted the proportions of category C, category X, and neither responses out of three opportunities. Models were fit separately to the data for each participant, by minimizing a likelihood-ratio chi-square goodness-of-fit statistic, G, over the observed (O) and expected (E) response proportions.

To obtain the most accurate tests of parameter values, we eliminated parameters that did not make a significant contribution to goodness of fit. The difference in G obtained for a particular participant by two regression models that differ by the presence or absence of a single parameter can be tested against the chi-square distribution with one degree of freedom. Chi-square difference scores can be summed across all participants to obtain a composite statistic that can be tested against the chi-square distribution with degrees of freedom equal to the number of participants. Parameters that did not achieve a significance level of p < .10 on such tests were eliminated, to simplify the model as described below.
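The single-participant nested-model test above can be sketched as follows. The G values are invented for illustration; for one degree of freedom, the chi-square upper-tail probability has a closed form via the complementary error function.

```python
import math

# Illustrative goodness-of-fit values for one participant.
G_full = 60.0     # model with the extra parameter free
G_reduced = 65.2  # same model with that parameter fixed

# The difference is distributed as chi-square with 1 df under the null.
diff = G_reduced - G_full

# Upper-tail chi-square probability for 1 df: p = erfc(sqrt(x / 2)).
p = math.erfc(math.sqrt(diff / 2.0))

# Parameters failing the p < .10 criterion would be eliminated.
keep_parameter = p < 0.10
```

Summing such difference scores across participants, and testing the sum against a chi-square distribution with df equal to the number of participants, gives the composite test described in the text.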

For the full model, with four free parameters for each participant, the goodness of fit was G(164) = 604. Setting s to 0 resulted in a significantly poorer fit to the data [χ²(82) = 920, p < .001; when s = 0, the parameter w_X has no effect, so there are just two degrees of freedom per participant]. This result indicates that participants' judgments were related to the distances between test and training items (i.e., those distances gave better predictions of responses than assuming that all items were treated identically). Setting w_X to 0.5 (equal weighting for both dimensions) did not significantly reduce fitness in comparison with the full model [χ²(41) = 48, p = .22]. Setting bias_C|CX to 0.5 (equal bias for both categories) resulted in a significantly poorer

p(\text{response } C \mid i) = \frac{ bias_C \cdot \sum_{j \in C} e^{-s \cdot d_{ij}} }{ bias_C \cdot \sum_{j \in C} e^{-s \cdot d_{ij}} + bias_X \cdot \sum_{l \in X} e^{-s \cdot d_{il}} + K }    (A1)

G = 2 \sum O \ln(O / E)

302 HAHN BAILEY AND ELVIN

APPENDIX A (Continued)

fit compared with the full model [χ²(41) = 81, p < .001]. The neither response bias, bias_neither, is required, since participants did in fact avail themselves of the option to respond neither. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. Accordingly, the dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group, rather than varying freely among all participants within each group. Setting s to the median value of each group resulted in a significantly poorer fit to the data compared with including a separate free parameter for each participant [χ²(39) = 58, p = .027]. Thus, distance scaling varied within groups. Setting bias_C|CX to the median value within each group did not result in a significantly poorer fit compared with including separate parameters for each participant [χ²(39) = 41, p = .37]. A further comparison found that setting this parameter to the overall median value (bias_C|CX = .68) fit the data just as well as the model that used the median of each training group [χ²(1) = 0.21, p = .65]. Setting bias_neither to the median value of each group resulted in a significantly poorer fit to the data [χ²(39) = 74, p < .001]. This result indicates that the bias to respond neither varied among participants. Taken together, these results indicate that s and bias_neither are required for each participant. bias_C|CX did not vary among participants.

In response to a reviewer's query, we also implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average, with separate scaling parameters s for exemplar and prototype similarities. The pure prototype model was much worse than the pure instance model (goodness of fit G = 1,597 and 693, respectively). The best-fitting mixture model fit only slightly better than the pure exemplar model, and the difference was not significant [χ²(82) = 29.7, p > .10]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.

APPENDIX B
Details of Modeling for Recognition Data

In our recognition task, participants judged whether item i was old or new. The GCM predicts the probability of an old response on the basis of total weighted similarity of the test item to all exemplars, divided by this similarity plus a constant response threshold for new responses:

p(\text{old} \mid i) = \frac{ B_{old} \cdot \sum_j e^{-s \cdot d_{ij}} }{ B_{old} \cdot \sum_j e^{-s \cdot d_{ij}} + K }    (B1)

The term d_ij is the Euclidean distance between stimulus i and exemplar j, as in Appendix A. Following Appendix A, we factor K into the product of the average similarity of each exemplar to all the others, E, multiplied by a bias term. Thus, K = B_new · E. Higher values of B_new produce more new responses. B_old and B_new range from 0 to 1 and sum to 1 (there is one degree of freedom between them). This full model includes three independent parameters: s, w_X, and B_new. For each of the 31 test flowers, the model predicted the proportions of old and new responses out of three opportunities (once in each color scheme). Models were fit separately to the data for each participant.
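The recognition rule can be sketched analogously to the classification model. This is an illustrative reading of Equation B1, not the authors' code; all coordinates and parameter values are invented.

```python
import math

# Illustrative training exemplars as (head area, stem height) pairs.
exemplars = [(2.0, 3.0), (2.5, 3.5), (5.0, 6.0)]
s = 1.0

def dist(a, b):
    return math.sqrt(0.5 * (a[0] - b[0]) ** 2 + 0.5 * (a[1] - b[1]) ** 2)

def p_old(item, B_old=0.6, B_new=0.4):
    """p(old | i): biased summed similarity to all exemplars over that
    similarity plus the threshold K = B_new * E (cf. Equation B1)."""
    total_sim = sum(math.exp(-s * dist(item, ex)) for ex in exemplars)
    # E: average similarity of each exemplar to all the others.
    N = len(exemplars)
    E = sum(math.exp(-s * dist(a, b))
            for i, a in enumerate(exemplars)
            for j, b in enumerate(exemplars) if i != j) / N
    K = B_new * E
    return (B_old * total_sim) / (B_old * total_sim + K)
```

Under this rule, an item lying on a studied exemplar draws a high p(old), while an item far from all exemplars draws a low one, which is how diversity of the training set can shift false alarm rates.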

Again, we eliminated parameters that did not make a significant contribution to goodness of fit. Setting s to 0 resulted in significantly worse fits to the data compared with the full model [χ²(64) = 159, p < .001]. Setting w_X to 0.5 did not significantly reduce the fit of the model [χ²(32) = 24, p = .85]. B_new is required in order to predict any new judgments, so this parameter cannot be removed or set to 0. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. The dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group. The fit of the model with one s parameter for each group of participants was not significantly worse than the fit of the model with separate parameters for every participant [χ²(30) = 26, p = .69]. A further comparison found that setting s to the overall median of all participants resulted in a significantly worse fit to the data compared with separate values for each training group [χ²(1) = 11.5, p < .001]. This result indicates that the diversity manipulation affected s. The fit of the model with one B_new parameter for each group was significantly worse than the model with separate parameters for each participant [χ²(30) = 86, p < .001]. This result indicates that the bias to respond new varied among participants.

Again, we implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average. The pure prototype model was substantially worse than the pure instance model (goodness of fit G = 546 and 411, respectively). The best-fitting mixture model fit only slightly better than did the pure exemplar model, and the difference was not significant [χ²(64) = 14.2, p > .10]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.

(Manuscript received December 12, 2002; revision accepted for publication June 18, 2004.)



smith, & Welton, 1967; Posner & Keele, 1968). However, some of these studies are open to alternative interpretations, so the evidence that diversity affects category learning is less overwhelming than it first appears. In the studies by Posner and colleagues involving random distortions of prototype images, stimulus variability is confounded with recognizability: participants in the low-variability condition, but not the high-variability condition, recognized three out of four category prototypes as familiar shapes (a triangle, an "M", and an "F"). Although the Posner studies are therefore unconvincing as evidence that more-variable categories are harder to learn, two subsequent studies found differences in learnability due to exemplar diversity even when prototype images were random dot patterns rather than familiar images (Homa & Vosburgh, 1976; Peterson et al., 1973).

The only other study demonstrating an effect of diversity on category learning, by Fried and Holyoak (1984), used random checkerboard patterns as prototypes and created exemplars of each category by randomly inverting the color of some squares relative to the prototype. Diverse categories (with many squares inverted relative to the prototypes) were harder to learn than less-variable categories (with fewer squares inverted). As Stewart and Chater (2002) point out, the probability that a sizable chunk of a checkerboard pattern remains constant is higher among less-diverse patterns than among those with more variability. Therefore, if discrimination between checkerboard categories relies on chunks common to stimuli in the same category (McLaren, 1997; Palmeri & Nosofsky, 2001), Fried and Holyoak's effect of diversity suggests that it is easier to abstract a few large features from less-diverse patterns than a larger number of small features from more-diverse patterns. A similar interpretation applies to the results discussed above involving distorted dot patterns.

If this explanation is correct (that learning difficulty is determined solely by the number of features required to distinguish the target categories), variability within a fixed number of features should not affect learnability. The categories in our study have been designed to test this prediction. They are distinguished by their locations relative to two obvious, continuous perceptual dimensions; the stimuli are not composed of numerous component parts that could be aggregated into abstract features or invariant chunks.

Generalization to Novel Instances

How does diversity during category learning affect the subsequent classification of novel items? Many studies have shown that generalization is affected by the diversity of category members, but the effect of diversity appears to vary with the location of test items relative to the category prototypes. Far away from a prototype, the diversity of previously seen category members has a positive effect on generalization of that category. This has

Figure 1. Training stimuli for the clustered and diverse versions of the experimental category, and for the control category.


been shown both for items near the boundary between categories (Cohen, Nosofsky, & Zaki, 2001; Fried & Holyoak, 1984; Rips, 1989; for examples of boundary items, see Figure 2, items A, I, and G) and for peripheral items on the outer fringes of previously seen exemplars (Flannagan, Fried, & Holyoak, 1986; Fried & Holyoak, 1984; Homa & Vosburgh, 1976; Posner & Keele, 1968; for examples of peripheral items, see Figure 2, items D, E, and F). In contrast, category diversity has a negative effect on generalization in the vicinity of the prototype

(Flannagan et al., 1986; Fried & Holyoak, 1984; Homa & Vosburgh, 1976).1 A noteworthy exception to the general pattern was reported by Peterson et al. (1973), who observed negative effects of category diversity on generalization across a wide range of distances from category prototypes. However, the criterion-learning task used by Peterson et al. meant that participants learning higher diversity categories received more training than did those learning lower diversity categories. This difference in amount of training could explain the uniformly negative effects of diversity observed by Peterson et al. (cf. Flannagan et al., 1986).

Figure 2. Positions of training stimuli (marked by labeled dots) within the parameter space defined by head area and stem height. One group of participants learned to distinguish the control category from the clustered category XC (A), and the other group learned to distinguish the same control category from the diverse category XD (B). Category prototypes (not seen during training) are indicated by stars, and a linear category boundary is shown as a dashed line midway between the prototypes.

Figure 3. Positions of novel test items (marked by asterisks) and training items (dots) within the head area and stem height parameter space, for participants in the clustered category condition (A) and the diverse category condition (B).

In summary, the effect of diversity on generalization seems to vary, so that results for boundary items do not necessarily carry over to central items or to peripheral items, and so on. One possibility is that these various effects simply reflect differences in item similarities. Exemplars of a diverse category are on average more distant from the category prototype (see Figure 3), which could account for a negative effect on generalization to a previously unseen prototype. At the same time, these exemplars are closer to the category boundary and also to fixed peripheral items, which could account for a positive effect on generalization. Are there any effects of diversity above and beyond item similarities, and if so, are these effects the same across the stimulus space? Some studies have controlled or factored out item similarities (Cohen et al., 2001; Posner & Keele, 1968; Rips, 1989; Stewart & Chater, 2002; but see the discussion in E. E. Smith & Sloman, 1994, and Nosofsky & Johansen, 2000), and some have examined generalization right across the stimulus space (Flannagan et al., 1986; Fried & Holyoak, 1984; Homa & Vosburgh, 1976; Peterson et al., 1973). However, no study has done both. Our study was designed to test for general effects of diversity after item similarities were factored out, while probing generalization across boundary, central, and peripheral items.

We also aimed to distinguish between two basic mechanisms that might alter generalization in response to variability. First, response biases (i.e., a basic tendency to favor one response category over another) might form in the classification decision process in favor of more-variable categories (Ashby & Maddox, 1990; Cohen et al., 2001; see also Stewart & Chater, 2002). Second, similarity comparisons might be rescaled, or normalized, by the amount of variability observed within the relevant categories, altering the perception of distance in the relevant stimulus space. Thus, if we did observe a general effect of variability, our analyses were designed to distinguish whether the effect was mediated by a response bias or some form of rescaling.

Memory for Instances

How does diversity during category learning affect our ability to distinguish new instances from old? Although some theorists maintain that categorization does not rely on memory for specific exemplars, no one denies that exemplars experienced during category-learning tasks normally form traces in (episodic) memory. Are those memory traces affected by the diversity of items experienced during learning? The mechanisms of response bias and rescaling discussed above in relation to generalization may also be relevant to instance memory. Empirically, we are aware of only one study that has examined the effect of diversity in a category-learning task on subsequent recognition memory. Neumann (1977) trained people on a single category, then asked them to rate the familiarity of novel items ("How certain are you that you saw this during training?"). Responses varied depending on the distribution of training items. This result showed that the perceived familiarity of new items was affected by the diversity of items in the training category. However, it is likely that some or all of the effect observed by Neumann reflected item similarities, so it is not clear whether more general effects of diversity were involved.

OVERVIEW OF THE PRESENT STUDY

The goal of our study was to test the effect of category diversity on learning, generalization, and instance memory. Each participant in our study learned a control category, which was the same for all participants, and an experimental category that was either clustered or diverse. This design made it possible, in principle, to detect response bias effects due to category diversity, and also to distinguish between global rescaling across the whole stimulus space and category-specific rescaling around each category according to its own variability. The distance between category prototypes, and between the nearest exemplars of the control and experimental categories, was the same for all participants, as was the amount of training.

Our study used simple two-dimensional materials: flowers that varied in stem height and head area, as shown in Figure 1. These materials allowed us to manipulate category variability without affecting the meaningfulness of the stimuli (cf. Peterson et al., 1973; Posner & Keele, 1968). Also, with these materials we could plausibly assume that people's category representations would be based on two perceptual dimensions (corresponding closely to stem height and head area), which made it possible to factor out effects of item similarity in regression models. Some reassurance for the justification of this assumption is provided by the fact that the two dimensions are assigned equal weights in best-fit regressions (see Appendixes A and B). Optimal classification accuracy for our stimuli requires equal attention to the dimensions of head area and stem height. If Cohen et al. (2001) are correct that learners distribute attention optimally, the equal weighting we observe lends credence to a close correspondence between psychological and physical dimensions for our stimuli.

Training

In a supervised classification training task, each participant learned the control category (C) and an experimental category (X). All participants had the same number of training trials for all exemplars. This controlled exemplar frequency, which can enhance generalization (e.g., Nosofsky, 1988), but did not guarantee the same level of category acquisition across both categories or both groups, as training to criterion performance would. We manipulated the dispersion of training items around the category X prototype, so that half the participants learned a clustered category (XC) and half learned a diverse category (XD), as shown in Figure 2.


If it is generally more difficult to learn highly variable categories than to learn less variable ones, even when the number of distinguishing features is held constant, participants learning XD should make more errors, require more training before achieving criterion performance, and give slower responses than participants learning XC (though previous studies have not assessed whether category diversity affects response speed during learning). Moreover, these effects might be either global, affecting categories C and X equally, or category specific, affecting only category X.

Generalization and Recognition

After the training task, each participant completed either a generalization or an old-new recognition task involving the novel items shown in Figure 3. These include the previously unseen category prototypes, internal items within the region spanned by the training items, boundary items that were equidistant from the two category prototypes (on the dashed lines in Figures 2 and 3), and peripheral items beyond the outermost training items. In addition to the novel items, which were the same for all participants, the generalization task included some of the training items, to reinforce category diversity during the task. The recognition task included all of the training items in addition to the novel ones.

Participants in the generalization task classified items as members of category C, category X, or neither (as in Peterson et al., 1973). The neither response provides participants with an alternative to random guessing for peripheral items that may bear little resemblance to either training category. The availability of a neither response elicits up to 60% more information compared with a two-alternative forced-choice task, and reveals the structure of each category on the far side as well as near the boundary between them. The ratio of C versus X responses should be unaffected by the availability of a neither response (Luce, 1959). To see this, suppose that without neither responses, a particular peripheral item is classified in category C rather than X 9 out of 10 times. With neither responses, suppose that 80% of the category C responses for this item change to neither, along with the same fraction of category X responses. The ratio between C and X responses remains unchanged.
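The ratio-invariance argument above can be checked with the hypothetical counts from the text (9 C responses versus 1 X response, with 80% of each switching to neither):

```python
# Hypothetical counts from the worked example in the text.
c_count, x_count = 9, 1
ratio_before = c_count / x_count            # 9:1 in favor of C

# Suppose 80% of each category's responses switch to "neither",
# so 20% of each remain.
ratio_after = (c_count * 0.2) / (x_count * 0.2)

# The C:X ratio is unchanged (up to floating-point rounding),
# illustrating Luce's (1959) choice-axiom prediction.
```

Because the same fraction is removed from both categories, the shared factor cancels and the ratio is preserved regardless of how many responses move to neither.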

On the basis of item similarities, and consistent with previous results for items at various locations within the stimulus space (e.g., Fried & Holyoak, 1984, and other works discussed above), we predicted that on novel items far from the prototype, the diverse category XD would generalize more than the clustered category XC, producing more category X responses to boundary items and to peripheral items beyond the category X training exemplars. For the prototype of the experimental category, we predicted that the diverse category would generalize less, producing fewer category X responses to this item than the clustered category would. Following Fried and Holyoak, we also predicted that people who learned the diverse category would give fewer neither responses overall to novel items than would people who learned the clustered category.

In the recognition task, participants classified each item as old if they thought they had seen it during the training task, or else new. There are virtually no relevant previous studies from which to derive detailed predictions, but if diversity of training items affects recognition, people who learned the diverse category should produce more false alarms to novel items than people who learned the clustered category.

To factor out differences in item similarities between the two groups, we fit regression models that predicted generalization and recognition responses to all test items (old and new) on the basis of their similarities to the training exemplars. Analyses tested whether category diversity affected response biases, and whether there were either global or category-specific differences in distance scaling. Our regression analyses were based on the generalized context model (GCM; Nosofsky, 1984, 1986; see also Medin & Schaffer, 1978). However, our interest was not to test a particular theory of categorization or recognition, but to test whether item similarities alone were sufficient to explain any differences we observed between groups. The GCM has a well-documented capacity to model exemplar effects for perceptual classification stimuli such as ours, and it provides a powerful analytical tool to test for effects above and beyond the influence of item similarities. Details of our adaptations of the GCM are given in Appendixes A and B.

METHOD

Participants

The participants were 73 undergraduate psychology students who received course credit for participating. All but six were female. The participants were randomly assigned to one of two training conditions, with 36 participants 18-24 years old (M = 19.44, SD = 1.27) in the clustered category condition and 37 participants 18-31 years old (M = 19.54, SD = 2.19) in the diverse condition. No color-blind participants were recruited. After the training task, 41 participants did the generalization task (20 and 21 participants from the clustered and diverse conditions, respectively). The other 32 participants did the recognition task (16 each from the two training conditions).

Stimuli

The training task used three sets of nine flowers, depicted in Figure 1, comprising exemplars of the control category C, the clustered category XC, and the diverse category XD. There were three variants of each flower, with different colors in the flower head. Figure 2 shows the locations of the training exemplars within the two-dimensional space of stem height and head area. Categories XC and XD had the same prototype, that is, the same stem height and head area averaged across training exemplars. Apart from exemplar I, which is the same in XC and XD, all other exemplars were closer to the prototype in XC than in XD. Exemplars of the control category corresponded to pairwise average coordinates of XC and XD exemplars, reflected across the diagonal category boundary shown as a dashed line in Figure 2. Category C was therefore a mirror image average of XC and XD, with an intermediate level of diversity. Neither stem height nor head area alone was sufficient to distinguish category C from XC or XD; participants had to use both stimulus dimensions to accurately classify members of the two categories.


Thirteen novel flowers for the generalization and recognition tasks were located as in Figure 3. We included more peripheral items for category X, in anticipation of the diversity manipulation affecting it more than category C. The novel items for category C occupied mirror image coordinates with respect to corresponding items for category X. All 13 novel flowers were used in the generalization task, along with training exemplars A′, D′, F′, H′, and I′ from category C and the corresponding exemplars from either XC or XD, as appropriate. The recognition task used the novel flowers plus the exemplars of category C and either XC or XD, as appropriate.

Three different flower head color schemes were used to create a task-irrelevant dimension of variation, with three variants of each flower. During training, each participant saw each flower the same number of times in each color scheme. In the generalization and recognition tasks, each flower occurred once in each color scheme. In addition to reducing ceiling effects during training, the color variations allowed us to probe recognition memory for each flower three times (once in each color scheme) without actually presenting any specific stimulus more than once.

Procedure

The participants were tested individually in a quiet room. First, the experimenter read instructions for the learning task aloud. The participants were encouraged to respond quickly but accurately. A computer controlled by a SuperLab script presented stimuli on a display screen and recorded the participants' keypresses and latencies. During training, the participants classified flowers into a "gold" category (category C) and a "silver" category (category X) by pressing keys labeled with an appropriately colored star. After each response, the flower image was replaced with feedback saying either CORRECT or INCORRECT for 500 msec before the next flower was presented. There were 15 blocks of training trials, with each of the 18 training flowers presented once per block, in a different random order. Random orders were matched across participants in the two training conditions. Colors were randomized, but each flower appeared five times in each color scheme. The participants had a short break after every three blocks.

After training, each participant performed either the generalization or old-new recognition task. Both tasks began with instructions read aloud by the experimenter. In the generalization task, the participants classified flowers into the categories "gold," "silver," or "neither" (a blue sticker marked the key for a neither response). There were three blocks of generalization trials, with 13 novel flowers and 10 old flowers in each. Colors were randomized, with each flower appearing once in each color scheme. In the old-new recognition task, the participants decided whether each flower had appeared during the training task, and responded by pressing keys labeled old or new. There were three blocks of recognition trials, with 13 novel and 18 old flowers in each. Again, colors were randomized, with each flower appearing once in each color scheme. The participants received no feedback during either the generalization or recognition tasks. Random orders were matched across participants in the clustered and diverse training conditions. After each classification or old-new response, the test flower was replaced with the question "How confident are you?", and the participants responded on a scale from 1 (low confidence) to 9 (high confidence). The participants had the opportunity for a short break between each block.

RESULTS

Learning

The first block of trials was excluded from analysis because it would primarily reflect guessing. Participants learning the diverse category generally made more errors than those learning the clustered category, as shown in Figure 4. Inferential tests were based on three measures of difficulty, computed for each participant: the first training block completed with no errors, the total number of errors made, and the mean reaction time (RT) for correct responses. Here and throughout, two-tailed tests are reported, and significance is evaluated at an alpha level of .05.

In computing the first error-free training block (from 2-15), two participants learning the clustered category and 10 learning the diverse one did not achieve an error-free block. These participants were assigned scores of 16. Scores for the first error-free block did not satisfy parametric assumptions, so this variable was analyzed separately from errors and RTs. Participants learning the clustered category achieved their first error-free block earlier in training [median = 4.0, IQR (interquartile range) = 6.75] than did participants learning the diverse

Figure 4. Fraction of error responses during category learning as a function of training group (clustered or diverse), category (C or X), and block of training.


category (median = 9.0, IQR = 12.0). This difference was significant in a Mann-Whitney test (U = 446.0, n1 = 36, n2 = 37, p = .014).

The total numbers of errors were computed separately for the control and experimental categories, and a square root transform was applied to the totals for each participant, to reduce skewness, kurtosis, and heterogeneity of variance. The mean root errors were 2.62 and 2.77 for the clustered condition, categories C and X, respectively, and 3.24 and 3.60 for the diverse condition (SE = 0.238, 0.189, 0.206, and 0.226, respectively). These mean values are detransformed and expressed as error rates in Table 1. RTs for correct responses were log transformed. Three outlying RTs (out of 16,765), less than 200 msec or greater than 5,000 msec, were excluded from analysis. The mean log RTs were 2.775 and 2.779 for the clustered condition, categories C and X, and 2.836 and 2.848 for the diverse condition (SE = 0.016, 0.016, 0.014, and 0.015, respectively). These mean values are detransformed and shown in Table 1. A two-way multivariate analysis of variance (MANOVA) was performed on root errors and log reaction times, with training diversity (clustered or diverse) as a between-subjects factor and stimulus category (control or experimental) as a within-subjects factor.2 Here and throughout, there were no multivariate outliers within levels of training diversity (p < .001). This means that the other statistics from these MANOVAs are not unduly influenced by a few unrepresentative data points.

In combination, errors and RTs were affected by category diversity [F(2,70) = 7.67, p = .001]. People learning the clustered category made fewer errors and gave faster responses than did people learning the diverse category. There was no significant difference between the control and experimental categories [F(2,70) = 1.62, p = .21], nor an interaction between diversity condition and category [F(2,70) < 1]. Thus, the diversity of category X affected responses to both categories C and X equally.

Errors and RTs made independent contributions to the combined differences between training groups, as indicated by a unique effect for each variable after factoring out the other in Roy-Bargmann stepdown analyses [Fs(1,70) > 5.24, ps < .025]. In univariate ANOVAs, the main effect of category diversity was significant both for errors [F(1,71) = 7.78, p = .007] and for RTs [F(1,71) = 9.51, p = .003].

Generalization

Average response percentages for the various types of test item in the category generalization task are shown in Table 2. Asterisks identify those entries that are relevant to the predicted diversity effects we outlined above. To limit the family-wise error rate, we tested only a few key contrasts from this table and followed up with regression analyses encompassing the whole of the data. Confidence ratings for the generalization and recognition tasks have not yet been analyzed and are not reported.

Table 1
Summary Measures of Training Difficulty, Including Error Rates and Reaction Times (RTs, in Milliseconds) for Correct Responses, as a Function of Training Group (Clustered or Diverse) and Test Category (C or X)

             Error Rate (%)            RT (msec)
Training     Category C  Category X   Category C  Category X
Clustered        5.5         6.1         595         601
Diverse          8.3        10.3         685         705

Table 2
Distribution of Generalization Responses, Showing Percentage of Each Response Alternative as a Function of Stimulus Region and Type of Novel Item

                                                    Response
                                         Control    Experimental  Neither
Stimulus Region  Item Type   Training    %   SD       %   SD      %   SD
Boundary         Boundary    Clustered   54   5       18   4      28   4
                             Diverse     53   5       23   4      24   4
Control          Prototype   Clustered   95   3        2   2       3   2
                             Diverse     95   3        0   0       5   3
                 Inside      Clustered   97   2        0   0       3   2
                             Diverse     95   3        2   2       3   2
                 Peripheral  Clustered   90   6        5   5       5   4
                             Diverse     87   7        0   0      13   7
Experimental     Prototype   Clustered    2   2       88   4      10   4
                             Diverse      2   2       76   7      22   7
                 Inside      Clustered    0   0       92   3       8   3
                             Diverse      2   2       79   7      19   7
                 Peripheral  Clustered    6   2       50   4      44   4
                             Diverse      8   2       66   4      26   3
Overall                      Clustered   37   3       38   3      26   2
                             Diverse     37   3       43   3      21   2

Note. Entries relevant to the predicted diversity effects are outlined in the text.

296 HAHN BAILEY AND ELVIN

Analysis of response fractions. Our initial analysis of generalization examined four measures: the fraction of neither responses across all novel items (neither), the fraction of category X responses for novel boundary items (boundaryX), the fraction of category X responses for novel items on the periphery of category X (periphX), and the fraction of category X responses for the unseen category X prototype (protoX). The protoX scores were near ceiling and could not be analyzed parametrically. Although protoX scores were slightly lower for participants trained on the diverse category compared with the clustered category, this difference between groups was not significant [Mann–Whitney (corrected for ties) Z = 1.16, n1 = 20, n2 = 21, p = .25].

A one-way MANOVA was performed on the variables neither, boundaryX, and periphX, with training diversity (clustered or diverse) as a between-subjects factor. In combination, these variables were affected by training diversity [F(3,37) = 3.19, p = .035]. People trained on the diverse category gave fewer neither responses overall, more category X responses for boundary items, and more category X responses for items on the periphery of category X, as predicted. Roy–Bargmann stepdown analysis (details not reported here) indicated that the significance of the overall effect was due primarily to the periphX variable: people trained on the diverse category gave fewer neither responses to items on the periphery of category X and classified them as members of category X instead.

Regression models In our stimuli similarities be-tween test and training items necessarily varied as afunction of training diversity so it is possible that the ef-fect of diversity on periphX for example reflects the factthat peripheral items were closer to some of the diversetraining exemplars than to the corresponding clusteredexemplars To factor out the effect of item similaritiesand to test for a general effect of variability we fit re-gression models as described in Appendix A to predicteach participantrsquos responses to the full set of test items(both old and new) The predicted probability of a par-ticular response was a function of the similarity betweeneach test item and the training exemplars of categories Cand X The full model included three parameters of in-terest (1) a distance scaling parameter s that deter-mined the sensitivity of response probabilities to a givenchange in head area or stem height (2) a category C rela-tive response bias biasC | CX that determined the bias (ona scale from 0 to 1) to classify something as category Crather than X and (3) a neither response bias biasneitherthat determined the bias to say that something belongedto neither category C nor X Changes in s and biasneithercan have superficially similar effects on predicted re-sponse probabilities but a change in biasneither will notaffect the ratio of C to X responses whereas a change insensitivity will The two parameters consequently havedistinguishable effects on regression fits
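The three-parameter response model described here (and specified fully in Appendix A) can be sketched as follows. The exemplar coordinates and parameter values in the example are illustrative, not fitted values from the study:

```python
import math

def gcm_probs(item, ex_c, ex_x, s, bias_c_cx, bias_neither, wx=0.5):
    """Predicted P(C), P(X), and P(neither) for one test item, following
    the GCM variant of Appendix A: similarity decays exponentially with
    weighted Euclidean distance, and the 'neither' response strength is
    bias_neither times E, the average total similarity of each exemplar
    to all the others."""
    wy = 1.0 - wx

    def dist(a, b):
        return math.sqrt(wx * (a[0] - b[0]) ** 2 + wy * (a[1] - b[1]) ** 2)

    def total_sim(exemplars):
        return sum(math.exp(-s * dist(item, j)) for j in exemplars)

    all_ex = ex_c + ex_x
    e_avg = sum(math.exp(-s * dist(a, b))
                for a in all_ex for b in all_ex if a is not b) / len(all_ex)
    num_c = bias_c_cx * (1 - bias_neither) * total_sim(ex_c)
    num_x = (1 - bias_c_cx) * (1 - bias_neither) * total_sim(ex_x)
    num_n = bias_neither * e_avg
    z = num_c + num_x + num_n
    return num_c / z, num_x / z, num_n / z

# A test item near the C cluster, with the C response bias set to the
# 0.68 overall median reported below and a small 'neither' bias.
probs = gcm_probs((0.5, 0.5), [(0, 0), (0, 1)], [(2, 0), (2, 1)],
                  s=2.0, bias_c_cx=0.68, bias_neither=0.03)
print([round(p, 3) for p in probs])  # → [0.864, 0.118, 0.018]
```

Note how the three predicted probabilities sum to 1 and how the item's proximity to the category C exemplars, amplified by the C response bias, dominates the prediction.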

Goodness-of-fit statistics for models with and without separate bias_C|CX parameters for each participant indicated that this parameter did not vary significantly within training groups [χ2(39) = 41, p = .37] or between training groups [χ2(1) = 0.21, p = .65]. Subsequent analyses fixed this parameter at the overall median value, which favored category C over X by a ratio of more than 2:1. This level of bias means that an item equally similar to both categories would be classified as a member of category C twice as often as category X. This strong bias may reflect participants' expectation that the two categories would occur equally often, whereas there was a preponderance of novel items on the category X side of the stimulus space.3
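These nested-model comparisons refer chi-square difference scores to the chi-square distribution. For a single degree of freedom the upper-tail probability has a closed form, which reproduces the between-groups p value reported here:

```python
import math

def chi2_sf_1df(x):
    """Upper-tail probability for a chi-square variate with 1 df:
    P(X > x) = erfc(sqrt(x / 2))."""
    return math.erfc(math.sqrt(x / 2.0))

# Between-groups comparison reported above: allowing bias_C|CX to differ
# between training groups improved fit by only chi-square(1) = 0.21.
print(round(chi2_sf_1df(0.21), 2))  # → 0.65, matching the reported p = .65
```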

With bias_C|CX held at the overall median value, we determined best-fit values of s and bias_neither for each participant. Figure 5 shows response contours for the two training groups, based on median values of s and bias_neither. As the 70% contours show, the inclusion of a neither response allowed us to model category structure on the far side of each category, as well as on the side near the boundary. The median values of bias_neither were 0.0297 and 0.0310 for the clustered and diverse training groups, respectively (IQRs = 0.0446 and 0.0549). This difference was not significant in a Mann–Whitney test (U = 175, n1 = 20, n2 = 21, p = .36).

The median values of s were 2.72 and 2.26 for the clustered and diverse groups (IQRs = 1.40 and 0.55).

Figure 5. Categorization training and test items and predicted response contours. Markers show training items for the clustered and diverse training groups and novel test items. Solid lines are for the clustered training group, showing the equiprobability category boundary and the 70% categorization contours for each category. Dotted lines show the same information for the diverse training group.


Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 131, n1 = 20, n2 = 21, p = .039). The effect of this difference in s values is illustrated in Figure 6A, which shows the influence of each training item on the classification of a test item as a function of the distance between the two. The rescaled similarity of an exemplar at distance 0.4 (where the two curves are furthest apart) is 20% greater for participants who learned the diverse category than for those who learned the clustered category. Distance 0.4 corresponds to a circle around a training item in Figure 2 or 3, extending not quite one grid unit in each direction.
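Both figures in this paragraph can be checked directly from the exponential similarity function of the GCM, using the median s values just reported (2.72 clustered, 2.26 diverse):

```python
import math

s_clustered, s_diverse = 2.72, 2.26  # median best-fit scaling values

# Ratio of exemplar influence, exp(-s * d), at distance d = 0.4.
d = 0.4
ratio = math.exp(-s_diverse * d) / math.exp(-s_clustered * d)
print(round(100 * (ratio - 1)))  # → 20 (percent greater for the diverse group)

# The gap exp(-s_diverse*d) - exp(-s_clustered*d) is maximal where its
# derivative is zero: d* = ln(s_clustered / s_diverse) / (s_clustered - s_diverse).
d_star = math.log(s_clustered / s_diverse) / (s_clustered - s_diverse)
print(round(d_star, 2))  # → 0.4, the distance where the curves differ most
```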

Did the effect of training diversity affect distance scaling for categories C and X equally? The GCM ordinarily assumes that the same scaling applies throughout the stimulus space, but Nosofsky and Johansen (2000) have suggested that category-specific scaling could explain the apparent bias toward diverse categories reported in Rips's (1989) classic pizza-or-coin study. We fit an enhanced regression model that had separate s parameters for categories C and X (for each participant) and found no significant difference in goodness of fit between the enhanced model and the original [χ2(41) = 37, p = .66]. Our results therefore suggest global rescaling across the whole stimulus space and do not indicate category-specific rescaling.

Recognition

The proportions of incorrect responses for the various types of test item in the old–new recognition task are summarized in Table 3. Miss rates for training exemplars ranged from 22% to 33%, whereas false alarm rates for most new items were over 50%. The two training groups differed markedly in their responses to peripheral items beyond the outer edge of the category X exemplars: people trained on the clustered category mistook these for old items on only 22% of trials, whereas those trained on the diverse category mistook them on 50% of trials.

Analysis of false alarms. Due to the very limited amount of relevant past research, the predictions (outlined above) for the recognition task were very general, so we conducted only a limited set of direct inferential tests. Our analysis pooled novel items into three groups and examined rates of false alarms (FA) for novel items on the category C side of the stimulus space (FA_C), the category X side (FA_X), and the boundary between categories (FA_Bndy). A one-way MANOVA was performed on the three false alarm scores, with training diversity (clustered or diverse) as a between-subjects factor. In combination, the false alarm scores were affected by training diversity [F(3,28) = 5.48, p = .004]. Roy–Bargmann stepdown analysis indicated that the significance of the overall effect was due primarily to FA_X: participants trained on the clustered category made fewer false alarms on the category X side of the boundary (adjusted for FA_C and FA_Bndy, mean FA_X = 37.5%, SE = 4.47) than did participants trained on the diverse category (adjusted mean FA_X = 58.4%, SE = 4.47). Inspection of false alarm rates for individual items (Table 3) suggested that the effect was largely attributable to peripheral items.

Table 3
Error Rates in Old–New Recognition Task, by Training Diversity, Showing Percentage for Training Exemplars (Old) and Novel Test Items (New) as a Function of Stimulus Region and Type of Novel Item

                                      Clustered     Diverse
Region  Region/Type                    %   SD       %   SD
Old     Control                       25   29      33   30
        Experimental                  22   24      29   28
New     Boundary      Boundary        58   37      65   35
        Control       Prototype       83   24      79   24
                      Inside          77   29      58   26
                      Peripheral      58   35      46   34
        Experimental  Prototype       83   24      77   23
                      Inside          69   26      77   26
                      Peripheral      22   30      50   36

Figure 6. Generalized context model similarity as a function of training group (clustered or diverse) and distance from an exemplar. Derived from median best-fit values of s for each training group when tested on generalization (A) and recognition (B).

Regression models. The effect of diversity on FA_X could be due to item similarities, since the peripheral items on the category X side were closer to the nearest exemplars of the diverse category than to those of the clustered one. To factor out item similarities, we fit regression models, as described in Appendix B, to predict each participant's responses to the full set of test items. The predicted probability of an old response was a function of the similarity between each test item and the training exemplars of both categories C and X. There were two free parameters of interest: the distance scaling parameter s, as discussed above, and a response bias, B_new, that determined the bias (on a scale from 0 to 1) to say that something was new rather than old. Figure 7 shows response contours for the two training groups, based on median values of s and B_new. The response probability distribution was generally broader and less peaked in the diverse condition than in the clustered condition.
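Appendix B is not reproduced in this excerpt, so the sketch below only illustrates the general form described in this paragraph: an old response strength given by summed similarity to all training exemplars, pitted against a new response strength scaled by B_new. The parameterization is hypothetical, modeled by analogy with Appendix A; the exact Appendix B formulation may differ.

```python
import math

def p_old(item, exemplars, s, bias_new):
    """Hypothetical sketch of the old/new recognition model: the 'old'
    strength is summed similarity to all training exemplars; the 'new'
    strength is bias_new times E, the average total similarity of each
    exemplar to the others (as in Appendix A's Equation A3)."""
    def dist(a, b):
        return math.sqrt(0.5 * (a[0] - b[0]) ** 2 + 0.5 * (a[1] - b[1]) ** 2)

    sim_total = sum(math.exp(-s * dist(item, j)) for j in exemplars)
    e_avg = sum(math.exp(-s * dist(a, b))
                for a in exemplars for b in exemplars if a is not b
                ) / len(exemplars)
    return sim_total / (sim_total + bias_new * e_avg)

exemplars = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
# An item amid the exemplars should look far more 'old' than a distant one.
print(round(p_old((0.5, 0.5), exemplars, s=2.4, bias_new=0.3), 2))
print(round(p_old((3.0, 3.0), exemplars, s=2.4, bias_new=0.3), 2))
```

Lowering s (as observed for the diverse group) flattens this response surface, which is one way to produce the broader, less peaked distribution noted above.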

The median values of B_new were 0.23 and 0.33 for the clustered and diverse training groups, respectively (IQRs = 0.26 and 0.17). However, the difference was not statistically significant (U = 81, n1 = 16, n2 = 16, p = .080). Thus, as a nonsignificant trend, participants who learned the clustered category were somewhat less inclined to give new responses (equivalently, they required somewhat less evidence for an old response) than were participants who learned the diverse category.

The median values of s were 2.40 and 1.17 for the clustered and diverse training groups, respectively (IQRs = 1.48 and 1.40). Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 33, n1 = 16, n2 = 16, p < .001). This difference in sensitivity echoes that observed in the generalization study. The effect of the sensitivity difference on recognition is illustrated in Figure 6B. The difference in exemplar influence is greatest at distance 0.65 (about 1 1/3 grid units in Figures 2 and 3), where the influence is 168% greater for participants who learned the diverse category than for those who learned the clustered category.

Once more, there was no significant improvement in fit for a model incorporating separate scaling parameters for categories C and X, compared with the simpler model with a single global scaling parameter [χ2(32) = 23.4, p = .86].

GENERAL DISCUSSION

Our experiments suggest that the diversity of perceptual categories affects learning, generalization, and item recognition by altering the scale of similarities, as well as the positions of exemplars within the stimulus space. People learning a diverse category made more errors and gave slower responses than did people learning a clustered category. After the same amount of category training, people who learned a diverse category were more likely than those who learned a clustered category to accept distant peripheral items as members of the category. And people who learned a diverse category were more likely to wrongly think that they had previously seen a distant peripheral item than were people who learned a clustered category. The effects of diversity on generalization and recognition were generally consistent with responding on the basis of item similarities. However, regression modeling with the GCM revealed a further effect: participants' perception of the stimuli was altered, so that people who learned a diverse category required a greater physical difference to perceive the same psychological difference between stimuli. This perceptual rescaling effect led to lower accuracy in the training task, wider generalization, and poorer item recognition, over and above effects that could be attributed to item similarities.

This study sought to distinguish different mechanisms that might produce diversity effects, and the use of three different but related tasks makes the present study particularly informative. There are numerous accounts of

Figure 7. Recognition test items and predicted response contours. Markers show training items for the clustered and diverse training groups and novel test items. Solid lines are for the clustered training condition, showing the 50% (outer) and 70% (inner) recognition contours. Dotted lines show the same information for the diverse training group.


category learning in the literature that could explain the reduced learning accuracy we observed in the diverse condition but that do not simultaneously explain the effects we also observed in generalization and recognition. For example, Fried and Holyoak's (1984) category density model represents categories by their means (prototypes) and variances, estimated from a sample of exemplars. Diverse categories are learned more slowly than are clustered ones because estimates of the means and variances converge more slowly when the stimuli are less uniform. However, because the category density model is concerned only with categorization, it has nothing at all to say about the diversity effects we observed in old–new recognition. And although the category density model predicts that diversity will affect generalization, the pattern of responses in our data is predicted much better by exemplar similarities than by similarities to the category prototypes (cf. J. D. Smith & Minda, 2002; see Appendixes A and B for details).

Diversity effects could plausibly arise from item similarities, shifts in category boundaries (as reflected in categorization biases), or rescaling of similarity relations. A combination of methodological factors contributed to our ability to distinguish these different mechanisms. Regression models allowed us to separate out item similarities; to simultaneously consider response probabilities across a large set of test items (including old and new items, peripheral and internal as well as boundary items); and, in the generalization study, to simultaneously consider the distribution of responses across all three choices. We would not have detected the global rescaling effect if we had examined just a few critical items on the boundary between categories, because the effect of the diversity manipulation on individual items was quite small after item similarities were factored out. In the generalization task, the neither response increased the information obtained on each trial (just as a four-choice multiple-choice test is more informative than the same number of true/false questions) and made it possible to detect changes in category structure on the far sides of the categories.

Most important, the neither response made it possible, in principle, to distinguish between changes in category biases and rescaling of similarities. We found no evidence that similarity relations were category specific, with different scaling factors for different regions of stimulus space according to the variability of categories in those regions. Our data also showed no evidence that response biases varied as a function of category variability. Rather, there was a global change in sensitivity to psychological distance in response to increased variability. This finding is consistent with the effect of stimulus range observed by Braida and Durlach (1972) in studies of sound intensity perception. Our results show that the effect is not confined to unidimensional stimuli, and they also demonstrate that it affects standard category learning and generalization tasks, as well as item recognition.

With regard specifically to generalization, our modeling demonstrates that exemplar similarities can explain the seemingly different results observed in previous studies for central, peripheral, and boundary items. Only a few previous studies have controlled exemplar similarities in their tests of diversity effects on generalization (Cohen et al., 2001; Rips, 1989; Stewart & Chater, 2002). These studies focused on boundary items between a clustered category and a diverse category. Rips's pizza-or-coin study found that judgments of category membership differed from judgments of category similarity and, in particular, that categorization of boundary items was biased in favor of diverse categories rather than clustered categories. Rips concluded that because judgments for the categorization task differed from those for the similarity task, categorization could not be based on similarity. Nosofsky and colleagues have suggested alternative interpretations within the framework of the GCM (Cohen et al., 2001; Nosofsky & Johansen, 2000). Moreover, our own results and those of numerous other studies (e.g., Shin & Nosofsky, 1992; see also works cited in Nosofsky & Johansen, 2000) show that similarity scaling can vary markedly between different tasks. This factor alone could explain the differences that Rips observed between judgments of category membership and category similarity.

The same confound applies to Cohen et al.'s (2001) Experiment 1, which contrasted performance on identification and categorization tasks. On the basis of a single test item, Cohen et al. concluded that high-diversity categories exert a "pull" on category judgments. However, their results could equally be explained by global effects of the sort observed in the present study; that is, by similarity scaling throughout the stimulus space, adjusted according to the particular task as well as the overall diversity of the stimulus items. In contrast, Cohen et al.'s Experiment 2 provides more compelling evidence for a category-specific effect of diversity on categorization. However, because their analysis focused on a single boundary item, there is no way to tell whether the category-specific effect involved response biases or category-specific rescaling of similarities. These different mechanisms can be distinguished only by examining responses across a wider range of stimuli.

Finally, our finding that category diversity affects global similarity relations has implications for the interpretation of Stewart and Chater's (2002) results. Their analyses of individual data indicated that, for most participants, boundary items were less likely to be classified as members of an experimental category with high diversity than with low diversity. On the basis of these results, Stewart and Chater suggest that response biases might be sensitive to category diversity. Alternatively, their results could reflect the same kind of global rescaling of similarities that we observed in the present study.

Although our results point to a global effect of diversity that affects the whole stimulus space in learning, generalization, and recognition tasks, they do not identify what aspect of diversity is relevant (nor have previous studies of diversity effects in categorization). Participants might be sensitive to the total range of stimuli (Braida & Durlach, 1972), to average absolute deviations from the grand mean, to average variance about category means, or to some other specific measure of variability. Only detailed comparisons across various stimulus distributions will eventually distinguish among these possibilities.
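The candidate diversity measures named in this closing paragraph are easy to make concrete. The two exemplar sets below are illustrative one-dimensional stand-ins for a clustered and a diverse category with the same mean:

```python
def diversity_measures(xs):
    """Range, mean absolute deviation, and variance of a set of
    one-dimensional exemplar values."""
    n = len(xs)
    mean = sum(xs) / n
    return {
        "range": max(xs) - min(xs),
        "mean_abs_dev": sum(abs(x - mean) for x in xs) / n,
        "variance": sum((x - mean) ** 2 for x in xs) / n,
    }

clustered = [4, 5, 5, 5, 6]  # mean 5, tightly packed
diverse = [1, 3, 5, 7, 9]    # mean 5, spread out
print(diversity_measures(clustered))
print(diversity_measures(diverse))
```

All three measures are larger for the diverse set, which is why distinguishing among them requires stimulus distributions constructed to dissociate the measures, not just any clustered/diverse pair.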

REFERENCES

Ashby, F. G., & Maddox, W. T. (1990). Integrating information from separable psychological dimensions. Journal of Experimental Psychology: Human Perception & Performance, 16, 598-612.

Barsalou, L. W. (1983). Ad hoc categories. Memory & Cognition, 11, 211-227.

Braida, L. D., & Durlach, N. I. (1972). Intensity perception: II. Resolution in one-interval paradigms. Journal of the Acoustical Society of America, 51, 483-502.

Cohen, A. L., Nosofsky, R. M., & Zaki, S. R. (2001). Category variability, exemplar similarity, and perceptual classification. Memory & Cognition, 29, 1165-1175.

Corter, J. E., & Gluck, M. A. (1992). Explaining basic categories: Feature predictability and information. Psychological Bulletin, 111, 291-303.

Flannagan, M. J., Fried, L. S., & Holyoak, K. J. (1986). Distributional expectations and the induction of category structure. Journal of Experimental Psychology: Learning, Memory, & Cognition, 12, 241-256.

Fried, L. S., & Holyoak, K. J. (1984). Induction of category distributions: A framework for classification learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 234-257.

Hahn, U., & Ramscar, M. C. A. (2001). Mere similarity. In U. Hahn & M. C. A. Ramscar (Eds.), Similarity and categorization (pp. 257-272). Oxford: Oxford University Press.

Heit, E., & Hahn, U. (2001). Diversity-based reasoning in children. Cognitive Psychology, 43, 243-273.

Homa, D., & Vosburgh, R. (1976). Category breadth and the abstraction of prototypical information. Journal of Experimental Psychology: Human Learning & Memory, 2, 322-330.

Jones, G. V. (1983). Identifying basic categories. Psychological Bulletin, 94, 423-428.

Luce, R. D. (1959). Individual choice behavior: A theoretical analysis. New York: Wiley.

Mareschal, D., French, R. M., & Quinn, P. C. (2000). A connectionist account of asymmetric category learning in early infancy. Developmental Psychology, 36, 635-645.

McLaren, I. P. L. (1997). Categorization and perceptual learning: An analogue of the face inversion effect. Quarterly Journal of Experimental Psychology, 50A, 257-273.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Murphy, G. L., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289-316.

Neumann, P. G. (1977). Visual prototype formation with discontinuous representation of dimensions of variability. Memory & Cognition, 5, 187-197.

Nosofsky, R. M. (1984). Choice, similarity, and the context theory of classification. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 104-114.

Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.

Nosofsky, R. M. (1988). Similarity, frequency, and category representations. Journal of Experimental Psychology: Learning, Memory, & Cognition, 14, 54-65.

Nosofsky, R. M., & Johansen, M. K. (2000). Exemplar-based accounts of "multiple-system" phenomena in perceptual categorization. Psychonomic Bulletin & Review, 7, 375-402.

Osherson, D. N., Smith, E. E., Wilkie, O., Lopez, A., & Shafir, E. (1990). Category-based induction. Psychological Review, 97, 185-200.

Palmeri, T. J., & Nosofsky, R. M. (2001). Central tendencies, extreme points, and prototype enhancement effects in ill-defined perceptual categorization. Quarterly Journal of Experimental Psychology, 54A, 197-235.

Peterson, M. J., Meagher, R. B., Jr., Chait, H., & Gillie, S. (1973). The abstraction and generalization of dot patterns. Cognitive Psychology, 4, 378-398.

Posner, M. I., Goldsmith, R., & Welton, K. E., Jr. (1967). Perceived distance and the classification of distorted patterns. Journal of Experimental Psychology, 73, 28-38.

Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353-363.

Quinn, P. C., Eimas, P. D., & Rosenkrantz, S. L. (1993). Evidence for representations of perceptually similar natural categories by 3-month-old and 4-month-old infants. Perception, 22, 463-475.

Rips, L. J. (1989). Similarity, typicality, and categorization. In S. Vosniadou & A. Ortony (Eds.), Similarity and analogical reasoning (pp. 21-59). New York: Cambridge University Press.

Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382-439.

Shin, H. J., & Nosofsky, R. M. (1992). Similarity-scaling studies of dot-pattern classification and recognition. Journal of Experimental Psychology: General, 121, 278-304.

Smith, E. E., & Sloman, S. A. (1994). Similarity- versus rule-based categorization. Memory & Cognition, 22, 377-386.

Smith, J. D., & Minda, J. P. (2002). Distinguishing prototype-based and exemplar-based processes in dot-pattern category learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 800-811.

Stewart, N., & Chater, N. (2002). The effect of category variability in perceptual categorization. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 893-907.

Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins.

Wayne, A. (1995). Bayesianism and diverse evidence. Philosophy of Science, 62, 111-121.

Younger, B. A. (1985). The segregation of items into categories by ten-month-old infants. Child Development, 56, 1574-1583.

NOTES

1. These studies all involved nonnormal distributions of exemplars, which were generally sampled from donut-shaped regions of stimulus space. The effect of changes in the variance of normally distributed training exemplars has received little, if any, experimental attention.

2. The MANOVA computes an optimal linear combination of several dependent variables (DVs) and computes an ANOVA on the resulting composite DV. If the original DVs are correlated (positively or negatively), the MANOVA protects against the inflated Type I error rate produced by separate tests of the DVs (see Tabachnick & Fidell, 1996).

3. Thanks to Dorrit Billman for suggesting this explanation.


APPENDIX A
Details of Modeling for Generalization Data

According to the generalized context model (GCM), classification decisions are based on similarity comparisons between a test item and individual exemplars of each category stored in memory. In our generalization task, participants judged whether item i was a member of category C, category X, or neither. The GCM predicts the probability of a category C response on the basis of the total weighted similarity of the test item to exemplars of category C, divided by the weighted similarity of the item to exemplars of both categories, plus a constant response threshold for neither responses:

p(C | i) = bias_C · Σ_{j∈C} e^(−s·d_ij) / [bias_C · Σ_{j∈C} e^(−s·d_ij) + bias_X · Σ_{l∈X} e^(−s·d_il) + K].    (A1)

The term d_ij is the distance between stimulus i and exemplar j, computed as a weighted Euclidean distance,

d_ij = sqrt[w_x (x_i − x_j)² + w_y (y_i − y_j)²],    (A2)

where the xs and ys refer to head area and stem height measurements, respectively, and the dimension weights w_x and w_y sum to 1. The terms bias_C and bias_X are response bias parameters, and s is a scaling parameter that determines the sensitivity of response probabilities to a given change in distance. The probability of a category X or neither response is computed in a similar way, by replacing the numerator of Equation A1 with the appropriate term from the denominator.

Without loss of generality, we factor the response threshold K into the product of a bias parameter and the average total similarity of each exemplar to all the others, E. Thus, K = bias_neither · E, where

E = (1/N) Σ_j Σ_l e^(−s·d_jl),    (A3)

and N is the total number of exemplars. Higher values of bias_neither produce more neither responses. There are two key advantages of this innovative treatment of the neither response strength. First, the effect of the parameter bias_neither on goodness of fit is largely orthogonal to the other parameters of the model, including the scaling parameter s; as a consequence, regressions converge more quickly than with an undifferentiated K response strength parameter. Second, bias_neither is independent of the number of exemplars and the average similarity among them, so meaningful comparisons can be made between best-fit values of bias_neither obtained across different stimulus sets. In contrast, such comparisons with the undifferentiated K are generally not meaningful.

Also without loss of generality, we restrict the bias parameters (bias_C, bias_X, and bias_neither) to the range from 0 to 1 and require them to sum to 1 (there are only two degrees of freedom among these three parameters). Finally, we are usually interested in the relative magnitudes of bias_C and bias_X compared with each other, so we define bias_C = bias_C|CX · (1 − bias_neither) and bias_X = bias_X|CX · (1 − bias_neither). The parameters bias_C|CX and bias_X|CX determine the relative magnitudes of the two category response biases; they range from 0 to 1 and sum to 1 (having only one degree of freedom between them).

The full model defined above is fully specified with reference to four independent parameters: s, w_x, bias_C|CX, and bias_neither. We used this model to predict participants' classification responses, averaged across the three presentations of each flower (once in each color scheme). Thus, for each of the 23 test flowers, the model predicted the proportions of category C, category X, and neither responses out of three opportunities. Models were fit separately to the data for each participant by minimizing a likelihood-ratio chi-square goodness-of-fit statistic over the observed (O) and expected (E) response proportions:

G = 2 Σ O ln(O/E).

To obtain the most accurate tests of parameter values, we eliminated parameters that did not make a significant contribution to goodness of fit. The difference in G obtained for a particular participant by two regression models that differ by the presence or absence of a single parameter can be tested against the chi-square distribution with one degree of freedom. Chi-square difference scores can be summed across all participants to obtain a composite statistic that can be tested against the chi-square distribution with degrees of freedom equal to the number of participants. Parameters that did not achieve a significance level of p < .10 on such tests were eliminated, to simplify the model as described below.

For the full model, with four free parameters for each participant, the goodness of fit was G(164) = 604. Setting s to 0 resulted in a significantly poorer fit to the data [χ2(82) = 920, p < .001; when s = 0, the parameter w_x has no effect, so there are just two degrees of freedom per participant]. This result indicates that participants' judgments were related to the distances between test and training items (i.e., those distances gave better predictions of responses than assuming that all items were treated identically). Setting w_x to 0.5 (equal weighting for both dimensions) did not significantly reduce fitness in comparison with the full model [χ2(41) = 48, p = .22]. Setting bias_C|CX to 0.5 (equal bias for both categories) resulted in a significantly poorer

302 HAHN BAILEY AND ELVIN

APPENDIX A (Continued)

fit compared with the full model [c2(41) 81 p 001] The neither response bias biasneither is requiredsince participants did in fact avail themselves of the option to respond neither Taken together these resultssuggest that the dimension weight parameter is not needed for our data but that the other parameters are rel-evant Accordingly the dimension weight parameter was set to 5 in subsequent tests
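For readers who want to experiment with the model, the classification rule and goodness-of-fit statistic above can be sketched in Python. This is an illustrative reconstruction, not the authors' code; the default bias values echo the medians reported in this appendix, and all coordinates are placeholders.

```python
import math

def distance(a, b, wx=0.5):
    # Weighted Euclidean distance over the two stimulus dimensions;
    # the second dimension gets weight 1 - wx.
    return math.sqrt(wx * (a[0] - b[0]) ** 2 + (1 - wx) * (a[1] - b[1]) ** 2)

def similarity(a, b, s, wx=0.5):
    # Exponential similarity gradient, e^(-s * d).
    return math.exp(-s * distance(a, b, wx))

def classify(item, ex_c, ex_x, s, bias_c=0.68, bias_neither=0.03, wx=0.5):
    """Predicted probabilities of C, X, and neither responses to one item."""
    exemplars = ex_c + ex_x
    # E: average similarity of each training exemplar to all exemplars,
    # summed across exemplars (the threshold term for neither responses).
    e_term = sum(similarity(j, l, s, wx)
                 for j in exemplars for l in exemplars) / len(exemplars)
    sim_c = sum(similarity(item, j, s, wx) for j in ex_c)
    sim_x = sum(similarity(item, l, s, wx) for l in ex_x)
    denom = bias_c * sim_c + (1 - bias_c) * sim_x + bias_neither * e_term
    return (bias_c * sim_c / denom,
            (1 - bias_c) * sim_x / denom,
            bias_neither * e_term / denom)

def g_statistic(observed, expected):
    # Likelihood-ratio goodness of fit, G = 2 * sum O ln(O / E).
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)
```

Fitting then amounts to minimizing G over s, biasC|CX, and biasneither for each participant's observed response proportions.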

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group, rather than varying freely among all participants within each group. Setting s to the median value of each group resulted in a significantly poorer fit to the data, compared with including a separate free parameter for each participant [χ²(39) = 58, p = .027]. Thus, distance scaling varied within groups. Setting biasC|CX to the median value within each group did not result in a significantly poorer fit, compared with including separate parameters for each participant [χ²(39) = 41, p = .37]. A further comparison found that setting this parameter to the overall median value (biasC|CX = 0.68) fit the data just as well as the model that used the median of each training group [χ²(1) = 0.21, p = .65]. Setting biasneither to the median value of each group resulted in a significantly poorer fit to the data [χ²(39) = 74, p < .001]. This result indicates that the bias to respond neither varied among participants. Taken together, these results indicate that s and biasneither are required for each participant. BiasC|CX did not vary among participants.
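The parameter-elimination logic used throughout this appendix is a standard nested-model likelihood-ratio test. A minimal sketch (our illustration, not the authors' code; the single-degree-of-freedom p value uses the closed form of the chi-square survival function):

```python
import math

def chi2_sf_df1(x):
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

def drop_one_parameter_test(g_full, g_reduced):
    """G difference for one participant when one parameter is removed (1 df)."""
    diff = g_reduced - g_full  # the reduced model can only fit worse, so diff >= 0
    return diff, chi2_sf_df1(diff)

# Per-participant differences can be summed into a composite statistic and
# tested against chi-square with df = number of participants (that survival
# function needs a numerical routine, e.g., scipy.stats.chi2.sf).
```

For example, a G difference of 3.84 on 1 df gives p of about .05, the familiar critical value.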

In response to a reviewer's query, we also implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average, with separate scaling parameters, s, for exemplar and prototype similarities. The pure prototype model was much worse than the pure instance model (goodness of fit G = 1597 and 693, respectively). The best-fitting mixture model fit only slightly better than the pure exemplar model, and the difference was not significant [χ²(82) = 29.7, p > .1]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.

APPENDIX B
Details of Modeling for Recognition Data

In our recognition task, participants judged whether item i was old or new. The GCM predicts the probability of an old response on the basis of the total weighted similarity of the test item to all exemplars, divided by this similarity plus a constant response threshold for new responses:

    p(\mathrm{old} \mid i) = \frac{B_{old} \cdot \sum_{j} e^{-s \cdot d_{i,j}}}{B_{old} \cdot \sum_{j} e^{-s \cdot d_{i,j}} + K}.    (B1)

The term d_{i,j} is the Euclidean distance between stimulus i and exemplar j, as in Appendix A. Following Appendix A, we factor K into the product of the average similarity of each exemplar to all exemplars, E, multiplied by a bias term. Thus, K = Bnew · E. Higher values of Bnew produce more new responses. Bold and Bnew range from 0 to 1 and sum to 1 (there is one degree of freedom between them). This full model includes three independent parameters: s, wX, and Bnew. For each of the 31 test flowers, the model predicted the proportions of old and new responses out of three opportunities (once in each color scheme). Models were fit separately to the data for each participant.
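Equation B1 with the factored threshold K = Bnew · E can be sketched as follows (an illustrative reconstruction with placeholder coordinates, not the authors' code):

```python
import math

def p_old(item, exemplars, s, b_new=0.5, wx=0.5):
    # Equation B1: probability of an 'old' response, with K = B_new * E.
    def dist(a, b):
        return math.sqrt(wx * (a[0] - b[0]) ** 2 + (1 - wx) * (a[1] - b[1]) ** 2)
    # E: average similarity of each exemplar to all, summed across exemplars.
    e_term = sum(math.exp(-s * dist(j, l))
                 for j in exemplars for l in exemplars) / len(exemplars)
    b_old = 1.0 - b_new  # B_old and B_new sum to 1
    sim = sum(math.exp(-s * dist(item, j)) for j in exemplars)
    return b_old * sim / (b_old * sim + b_new * e_term)
```

A studied exemplar yields a higher predicted old rate than a distant novel item, and raising b_new lowers the old rate everywhere.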

Again, we eliminated parameters that did not make a significant contribution to goodness of fit. Setting s to 0 resulted in significantly worse fits to the data, compared with the full model [χ²(64) = 159, p < .001]. Setting wX to 0.5 did not significantly reduce the fit of the model [χ²(32) = 24, p = .85]. Bnew is required in order to predict any new judgments, so this parameter cannot be removed or set to 0. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. The dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group. The fit of the model with one s parameter for each group of participants was not significantly worse than the fit of the model with separate parameters for every participant [χ²(30) = 26, p = .69]. A further comparison found that setting s to the overall median of all participants resulted in a significantly worse fit to the data, compared with separate values for each training group [χ²(1) = 11.5, p < .001]. This result indicates that the diversity manipulation affected s. The fit of the model with one Bnew parameter for each group was significantly worse than the model with separate parameters for each participant [χ²(30) = 86, p < .001]. This result indicates that the bias to respond new varied among participants.

Again, we implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average. The pure prototype model was substantially worse than the pure instance model (goodness of fit G = 546 and 411, respectively). The best-fitting mixture model fit only slightly better than did the pure exemplar model, and the difference was not significant [χ²(64) = 14.2, p > .1]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.

(Manuscript received December 12, 2002; revision accepted for publication June 18, 2004.)


CATEGORY DIVERSITY 291

been shown both for items near the boundary between categories (Cohen, Nosofsky, & Zaki, 2001; Fried & Holyoak, 1984; Rips, 1989; for examples of boundary items, see Figure 2, items A, I, and G) and for peripheral items on the outer fringes of previously seen exemplars (Flannagan, Fried, & Holyoak, 1986; Fried & Holyoak, 1984; Homa & Vosburgh, 1976; Posner & Keele, 1968; for examples of peripheral items, see Figure 2, items D, E, and F). In contrast, category diversity has a negative effect on generalization in the vicinity of the prototype

(Flannagan et al., 1986; Fried & Holyoak, 1984; Homa & Vosburgh, 1976).1 A noteworthy exception to the general pattern was reported by Peterson et al. (1973), who observed negative effects of category diversity on generalization across a wide range of distances from category prototypes. However, the criterion-learning task used by Peterson et al. meant that participants learning higher diversity categories received more training than did those learning lower diversity categories. This difference in amount of training could explain the uniformly negative effects of diversity observed by Peterson et al. (cf. Flannagan et al., 1986).

Figure 2. Positions of training stimuli (marked by labeled dots) within the parameter space defined by head area and stem height. One group of participants learned to distinguish the control category from the clustered category XC (A), and the other group learned to distinguish the same control category from the diverse category XD (B). Category prototypes (not seen during training) are indicated by stars, and a linear category boundary is shown as a dashed line midway between the prototypes.

Figure 3. Positions of novel test items (marked by asterisks) and training items (dots) within the head area and stem height parameter space, for participants in the clustered category condition (A) and the diverse category condition (B).

In summary, the effect of diversity on generalization seems to vary, so that results for boundary items do not necessarily carry over to central items or to peripheral items, and so on. One possibility is that these various effects simply reflect differences in item similarities. Exemplars of a diverse category are on average more distant from the category prototype (see Figure 3), which could account for a negative effect on generalization to a previously unseen prototype. At the same time, these exemplars are closer to the category boundary, and also to fixed peripheral items, which could account for a positive effect on generalization. Are there any effects of diversity above and beyond item similarities, and if so, are these effects the same across the stimulus space? Some studies have controlled or factored out item similarities (Cohen et al., 2001; Posner & Keele, 1968; Rips, 1989; Stewart & Chater, 2002; but see the discussion in E. E. Smith & Sloman, 1994, and Nosofsky & Johansen, 2000), and some have examined generalization right across the stimulus space (Flannagan et al., 1986; Fried & Holyoak, 1984; Homa & Vosburgh, 1976; Peterson et al., 1973). However, no study has done both. Our study was designed to test for general effects of diversity after item similarities were factored out, while probing generalization across boundary, central, and peripheral items.

We also aimed to distinguish between two basic mechanisms that might alter generalization in response to variability. First, response biases (i.e., a basic tendency to favor one response category over another) might form in the classification decision process in favor of more variable categories (Ashby & Maddox, 1990; Cohen et al., 2001; see also Stewart & Chater, 2002). Second, similarity comparisons might be rescaled, or normalized, by the amount of variability observed within the relevant categories, altering the perception of distance in the relevant stimulus space. Thus, if we did observe a general effect of variability, our analyses were designed to distinguish whether the effect was mediated by a response bias or some form of rescaling.

Memory for Instances

How does diversity during category learning affect our ability to distinguish new instances from old? Although some theorists maintain that categorization does not rely on memory for specific exemplars, no one denies that exemplars experienced during category-learning tasks normally form traces in (episodic) memory. Are those memory traces affected by the diversity of items experienced during learning? The mechanisms of response bias and rescaling discussed above in relation to generalization may also be relevant to instance memory. Empirically, we are aware of only one study that has examined the effect of diversity in a category-learning task on subsequent recognition memory. Neumann (1977) trained people on a single category, then asked them to rate the familiarity of novel items ("How certain are you that you saw this during training?"). Responses varied depending on the distribution of training items. This result showed that the perceived familiarity of new items was affected by the diversity of items in the training category. However, it is likely that some or all of the effect observed by Neumann reflected item similarities, so it is not clear whether more general effects of diversity were involved.

OVERVIEW OF THE PRESENT STUDY

The goal of our study was to test the effect of category diversity on learning, generalization, and instance memory. Each participant in our study learned a control category, which was the same for all participants, and an experimental category that was either clustered or diverse. This design made it possible, in principle, to detect response bias effects due to category diversity, and also to distinguish between global rescaling across the whole stimulus space and category-specific rescaling around each category according to its own variability. The distance between category prototypes, and between the nearest exemplars of the control and experimental categories, was the same for all participants, as was the amount of training.

Our study used simple two-dimensional materials: flowers that varied in stem height and head area, as shown in Figure 1. These materials allowed us to manipulate category variability without affecting the meaningfulness of the stimuli (cf. Peterson et al., 1973; Posner & Keele, 1968). Also, with these materials, we could plausibly assume that people's category representations would be based on two perceptual dimensions (corresponding closely to stem height and head area), which made it possible to factor out effects of item similarity in regression models. Some reassurance for the justification of this assumption is provided by the fact that the two dimensions are assigned equal weights in best-fit regressions (see Appendixes A and B). Optimal classification accuracy for our stimuli requires equal attention to the dimensions of head area and stem height. If Cohen et al. (2001) are correct that learners distribute attention optimally, the equal weighting we observe lends credence to a close correspondence between psychological and physical dimensions for our stimuli.

Training

In a supervised classification training task, each participant learned the control category (C) and an experimental category (X). All participants had the same number of training trials for all exemplars. This controlled exemplar frequency, which can enhance generalization (e.g., Nosofsky, 1988), but did not guarantee the same level of category acquisition across both categories or both groups, as training to criterion performance would. We manipulated the dispersion of training items around the category X prototype, so that half the participants learned a clustered category (XC) and half learned a diverse category (XD), as shown in Figure 2.


If it is generally more difficult to learn highly variable categories than to learn less variable ones, even when the number of distinguishing features is held constant, participants learning XD should make more errors, require more training before achieving criterion performance, and give slower responses than participants learning XC (though previous studies have not assessed whether category diversity affects response speed during learning). Moreover, these effects might be either global, affecting categories C and X equally, or category specific, affecting only category X.

Generalization and Recognition

After the training task, each participant completed either a generalization or an old–new recognition task involving the novel items shown in Figure 3. These include the previously unseen category prototypes, internal items within the region spanned by the training items, boundary items that were equidistant from the two category prototypes (on the dashed lines in Figures 2 and 3), and peripheral items beyond the outermost training items. In addition to the novel items, which were the same for all participants, the generalization task included some of the training items, to reinforce category diversity during the task. The recognition task included all of the training items in addition to the novel ones.

Participants in the generalization task classified items as members of category C, category X, or neither (as in Peterson et al., 1973). The neither response provides participants with an alternative to random guessing for peripheral items that may bear little resemblance to either training category. The availability of a neither response elicits up to 60% more information compared with a two-alternative forced-choice task, and reveals the structure of each category on the far side, as well as near the boundary between them. The ratio of C versus X responses should be unaffected by the availability of a neither response (Luce, 1959). To see this, suppose that without neither responses, a particular peripheral item is classified in category C rather than X 9 out of 10 times. With neither responses, suppose that 80% of the category C responses for this item change to neither, along with the same fraction of category X responses. The ratio between C and X responses remains unchanged.
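The invariance argument is simple arithmetic, shown here with the numbers from the text:

```python
# Without a neither option: the item is classified C 9 times out of 10, X once.
p_c, p_x = 0.9, 0.1

# With a neither option that siphons off the same fraction (here 80%)
# of both C and X responses:
keep = 1.0 - 0.8
ratio_before = p_c / p_x
ratio_after = (p_c * keep) / (p_x * keep)
# The C:X ratio is unchanged (Luce, 1959).
```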

On the basis of item similarities, and consistent with previous results for items at various locations within the stimulus space (e.g., Fried & Holyoak, 1984, and other works discussed above), we predicted that, on novel items far from the prototype, the diverse category XD would generalize more than the clustered category XC, producing more category X responses to boundary items and to peripheral items beyond the category X training exemplars. For the prototype of the experimental category, we predicted that the diverse category would generalize less, producing fewer category X responses to this item than the clustered category would. Following Fried and Holyoak, we also predicted that people who learned the diverse category would give fewer neither responses overall to novel items than would people who learned the clustered category.

In the recognition task, participants classified each item as old if they thought they had seen it during the training task, or else new. There are virtually no relevant previous studies from which to derive detailed predictions, but if diversity of training items affects recognition, people who learned the diverse category should produce more false alarms to novel items than people who learned the clustered category.

To factor out differences in item similarities between the two groups, we fit regression models that predicted generalization and recognition responses to all test items (old and new) on the basis of their similarities to the training exemplars. Analyses tested whether category diversity affected response biases, and whether there were either global or category-specific differences in distance scaling. Our regression analyses were based on the generalized context model (GCM; Nosofsky, 1984, 1986; see also Medin & Schaffer, 1978). However, our interest was not to test a particular theory of categorization or recognition, but to test whether item similarities alone were sufficient to explain any differences we observed between groups. The GCM has a well-documented capacity to model exemplar effects for perceptual classification stimuli such as ours, and it provides a powerful analytical tool to test for effects above and beyond the influence of item similarities. Details of our adaptations of the GCM are given in Appendixes A and B.

METHOD

Participants

The participants were 73 undergraduate psychology students who received course credit for participating. All but six were female. The participants were randomly assigned to one of two training conditions, with 36 participants 18–24 years old (M = 19.44, SD = 1.27) in the clustered category condition and 37 participants 18–31 years old (M = 19.54, SD = 2.19) in the diverse condition. No color-blind participants were recruited. After the training task, 41 participants did the generalization task (20 and 21 participants from the clustered and diverse conditions, respectively). The other 32 participants did the recognition task (16 each from the two training conditions).

Stimuli

The training task used three sets of nine flowers, depicted in Figure 1, comprising exemplars of the control category C, the clustered category XC, and the diverse category XD. There were three variants of each flower, with different colors in the flower head. Figure 2 shows the locations of the training exemplars within the two-dimensional space of stem height and head area. Categories XC and XD had the same prototype, that is, the same stem height and head area averaged across training exemplars. Apart from exemplar I, which is the same in XC and XD, all other exemplars were closer to the prototype in XC than in XD. Exemplars of the control category corresponded to pairwise average coordinates of XC and XD exemplars, reflected across the diagonal category boundary shown as a dashed line in Figure 2. Category C was therefore a mirror image average of XC and XD, with an intermediate level of diversity. Neither stem height nor head area alone was sufficient to distinguish category C from XC or XD; participants had to use both stimulus dimensions to accurately classify members of the two categories.


Thirteen novel flowers for the generalization and recognition tasks were located as in Figure 3. We included more peripheral items for category X, in anticipation of the diversity manipulation affecting it more than category C. The novel items for category C occupied mirror image coordinates with respect to the corresponding items for category X. All 13 novel flowers were used in the generalization task, along with training exemplars A′, D′, F′, H′, and I′ from category C and the corresponding exemplars from either XC or XD, as appropriate. The recognition task used the novel flowers plus the exemplars of category C and either XC or XD, as appropriate.

Three different flower head color schemes were used to create a task-irrelevant dimension of variation, with three variants of each flower. During training, each participant saw each flower the same number of times in each color scheme. In the generalization and recognition tasks, each flower occurred once in each color scheme. In addition to reducing ceiling effects during training, the color variations allowed us to probe recognition memory for each flower three times (once in each color scheme) without actually presenting any specific stimulus more than once.

Procedure

The participants were tested individually in a quiet room. First, the experimenter read instructions for the learning task aloud. The participants were encouraged to respond quickly but accurately. A computer controlled by a SuperLab script presented stimuli on a display screen and recorded the participants' keypresses and latencies. During training, the participants classified flowers into a "gold" category (category C) and a "silver" category (category X) by pressing keys labeled with an appropriately colored star. After each response, the flower image was replaced with feedback saying either CORRECT or INCORRECT for 500 msec, before the next flower was presented. There were 15 blocks of training trials, with each of the 18 training flowers presented once per block, in a different random order. Random orders were matched across participants in the two training conditions. Colors were randomized, but each flower appeared five times in each color scheme. The participants had a short break after every three blocks.

After training, each participant performed either the generalization or old–new recognition task. Both tasks began with instructions read aloud by the experimenter. In the generalization task, the participants classified flowers into the categories "gold," "silver," or "neither" (a blue sticker marked the key for a neither response). There were three blocks of generalization trials, with 13 novel flowers and 10 old flowers in each. Colors were randomized, with each flower appearing once in each color scheme. In the old–new recognition task, the participants decided whether each flower had appeared during the training task, and responded by pressing keys labeled old or new. There were three blocks of recognition trials, with 13 novel and 18 old flowers in each. Again, colors were randomized, with each flower appearing once in each color scheme. The participants received no feedback during either the generalization or recognition tasks. Random orders were matched across participants in the clustered and diverse training conditions. After each classification or old–new response, the test flower was replaced with the question "How confident are you?", and the participants responded on a scale from 1 (low confidence) to 9 (high confidence). The participants had the opportunity for a short break between each block.

RESULTS

Learning

The first block of trials was excluded from analysis, because it would primarily reflect guessing. Participants learning the diverse category generally made more errors than those learning the clustered category, as shown in Figure 4. Inferential tests were based on three measures of difficulty, computed for each participant: the first training block completed with no errors, the total number of errors made, and the mean reaction time (RT) for correct responses. Here and throughout, two-tailed tests are reported, and significance is evaluated at an alpha level of .05.

In computing the first error-free training block (from 2–15), two participants learning the clustered category and 10 learning the diverse one did not achieve an error-free block. These participants were assigned scores of 16. Scores for the first error-free block did not satisfy parametric assumptions, so this variable was analyzed separately from errors and RTs. Participants learning the clustered category achieved their first error-free block earlier in training [median = 4.0, IQR (interquartile range) = 6.75] than did participants learning the diverse category (median = 9.0, IQR = 12.0). This difference was significant in a Mann–Whitney test (U = 446.0, n1 = 36, n2 = 37, p = .014).

Figure 4. Fraction of error responses during category learning, as a function of training group (clustered or diverse), category (C or X), and block of training.

The total numbers of errors were computed separately for the control and experimental categories, and a square root transform was applied to the totals for each participant, to reduce skewness, kurtosis, and heterogeneity of variance. The mean root errors were 2.62 and 2.77 for the clustered condition, categories C and X, respectively, and 3.24 and 3.60 for the diverse condition (SE = 0.238, 0.189, 0.206, and 0.226, respectively). These mean values are detransformed and expressed as error rates in Table 1. RTs for correct responses were log transformed. Three outlying RTs (out of 16,765), less than 200 msec or greater than 5,000 msec, were excluded from analysis. The mean log RTs were 2.775 and 2.779 for the clustered condition, categories C and X, and 2.836 and 2.848 for the diverse condition (SE = 0.016, 0.016, 0.014, and 0.015, respectively). These mean values are detransformed and shown in Table 1. A two-way multivariate analysis of variance (MANOVA) was performed on root errors and log reaction times, with training diversity (clustered or diverse) as a between-subjects factor and stimulus category (control or experimental) as a within-subjects factor.2 Here and throughout, there were no multivariate outliers within levels of training diversity (p < .001). This means that the other statistics from these MANOVAs are not unduly influenced by a few unrepresentative data points.
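As a rough check, the detransformation behind Table 1 works out as follows (our sketch; the per-category trial count of 9 exemplars × 14 scored blocks is our inference from the design):

```python
# Reported transformed means for the clustered condition, category C.
mean_root_errors = 2.62   # square root of the error count
mean_log_rt = 2.775       # log10 of RT in milliseconds

errors = mean_root_errors ** 2        # undo the square root transform
trials = 9 * 14                       # 9 exemplars x 14 scored training blocks
error_rate = 100.0 * errors / trials  # percent errors, cf. 5.5% in Table 1
rt = 10.0 ** mean_log_rt              # milliseconds, cf. 595 msec in Table 1
```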

In combination, errors and RTs were affected by category diversity [F(2,70) = 7.67, p = .001]. People learning the clustered category made fewer errors and gave faster responses than did people learning the diverse category. There was no significant difference between the control and experimental categories [F(2,70) = 1.62, p = .21], nor an interaction between diversity condition and category [F(2,70) < 1]. Thus, the diversity of category X affected responses to both categories C and X equally.

Errors and RTs made independent contributions to the combined differences between training groups, as indicated by a unique effect for each variable after factoring out the other in Roy–Bargmann stepdown analyses [Fs(1,70) > 5.24, ps < .025]. In univariate ANOVAs, the main effect of category diversity was significant both for errors [F(1,71) = 7.78, p = .007] and for RTs [F(1,71) = 9.51, p = .003].

Generalization

Average response percentages for various types of test item in the category generalization task are shown in Table 2. Asterisks identify those entries that are relevant to the predicted diversity effects we outlined above. To limit the family-wise error rate, we tested only a few key contrasts from this table, and followed up with regression analyses encompassing the whole of the data. Confidence ratings for the generalization and recognition tasks have not yet been analyzed and are not reported.

Table 1
Summary Measures of Training Difficulty, Including Error Rates and Reaction Times (RTs, in Milliseconds) for Correct Responses, as a Function of Training Group (Clustered or Diverse) and Test Category (C or X)

                 Error Rate (%)              RTs
Training     Category C  Category X  Category C  Category X
Clustered        5.5         6.1         595         601
Diverse          8.3        10.3         685         705

Table 2
Distribution of Generalization Responses, Showing Percentage of Each Response Alternative as a Function of Stimulus Region and Type of Novel Item

                                               Response
                                       Control      Experimental    Neither
Stimulus Region  Item Type   Training     %   SD       %   SD       %   SD
Boundary         Boundary    Clustered   54    5      18    4      28    4
                             Diverse     53    5      23    4      24    4
Control          Prototype   Clustered   95    3       2    2       3    2
                             Diverse     95    3       0    0       5    3
                 Inside      Clustered   97    2       0    0       3    2
                             Diverse     95    3       2    2       3    2
                 Peripheral  Clustered   90    6       5    5       5    4
                             Diverse     87    7       0    0      13    7
Experimental     Prototype   Clustered    2    2      88    4      10    4
                             Diverse      2    2      76    7      22    7
                 Inside      Clustered    0    0      92    3       8    3
                             Diverse      2    2      79    7      19    7
                 Peripheral  Clustered    6    2      50    4      44    4
                             Diverse      8    2      66    4      26    3
Overall                      Clustered   37    3      38    3      26    2
                             Diverse     37    3      43    3      21    2

*Entries relevant to predicted diversity effects outlined in the text.


Analysis of response fractions. Our initial analysis of generalization examined four measures: the fraction of neither responses across all novel items (neither), the fraction of category X responses for novel boundary items (boundaryX), the fraction of category X responses for novel items on the periphery of category X (periphX), and the fraction of category X responses for the unseen category X prototype (protoX). The protoX scores were near ceiling and could not be analyzed parametrically. Although protoX scores were slightly lower for participants trained on the diverse category, compared with the clustered category, this difference between groups was not significant [Mann–Whitney (corrected for ties) Z = 1.16, n1 = 20, n2 = 21, p = .25].

A one-way MANOVA was performed on the variables neither, boundaryX, and periphX, with training diversity (clustered or diverse) as a between-subjects factor. In combination, these variables were affected by training diversity [F(3,37) = 3.19, p = .035]. People trained on the diverse category gave fewer neither responses overall, more category X responses for boundary items, and more category X responses for items on the periphery of category X, as predicted. Roy–Bargmann stepdown analysis (details not reported here) indicated that the significance of the overall effect was due primarily to the periphX variable: people trained on the diverse category gave fewer neither responses to items on the periphery of category X, and classified them as members of category X instead.

Regression models. In our stimuli, similarities between test and training items necessarily varied as a function of training diversity, so it is possible that the effect of diversity on periphX, for example, reflects the fact that peripheral items were closer to some of the diverse training exemplars than to the corresponding clustered exemplars. To factor out the effect of item similarities, and to test for a general effect of variability, we fit regression models, as described in Appendix A, to predict each participant's responses to the full set of test items (both old and new). The predicted probability of a particular response was a function of the similarity between each test item and the training exemplars of categories C and X. The full model included three parameters of interest: (1) a distance scaling parameter, s, that determined the sensitivity of response probabilities to a given change in head area or stem height; (2) a category C relative response bias, biasC|CX, that determined the bias (on a scale from 0 to 1) to classify something as category C rather than X; and (3) a neither response bias, biasneither, that determined the bias to say that something belonged to neither category C nor X. Changes in s and biasneither can have superficially similar effects on predicted response probabilities, but a change in biasneither will not affect the ratio of C to X responses, whereas a change in sensitivity will. The two parameters consequently have distinguishable effects on regression fits.

Goodness of fit statistics for models with and without separate biasC|CX parameters for each participant indicated that it did not vary significantly within training groups [χ²(39) = 41, p = .37], nor between training groups [χ²(1) = 0.21, p = .65]. Subsequent analyses fixed this parameter at the overall median value, which favored category C over X by a ratio of more than 2:1. This level of bias means that an item that was equally similar to both categories would be classified as a member of category C twice as often as category X. This strong bias may reflect participants' expectation that the two categories would occur equally often, whereas there was a preponderance of novel items on the category X side of the stimulus space.3

With biasC|CX held at the overall median value, we determined best-fit values of s and biasneither for each participant. Figure 5 shows response contours for the two training groups, based on median values of s and biasneither. As the 70% contours show, the inclusion of a neither response allowed us to model category structure on the far side of each category, as well as on the side near the boundary. The median values of biasneither were .0297 and .0310 for the clustered and diverse training groups, respectively (IQRs = .0446 and .0549). This difference was not significant in a Mann–Whitney test (U = 175, n1 = 20, n2 = 21, p = .36).

The median values of s were 2.72 and 2.26 for the clustered and diverse groups (IQRs = 1.40 and 0.55).

Figure 5. Categorization training and test items and predicted response contours. Markers show training items for clustered () or diverse () training groups, and novel test items (). Solid lines are for the clustered training group, showing the equiprobability category boundary and the 70% categorization contours for each category. Dotted lines show the same information for the diverse training group.

CATEGORY DIVERSITY 297

Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 131, n1 = 20, n2 = 21, p = .039). The effect of this difference in s values is illustrated in Figure 6A, which shows the influence of each training item on the classification of a test item, as a function of the distance between the two. The rescaled similarity of an exemplar at distance 0.4 (where the two curves are furthest apart) is 20% greater for participants who learned the diverse category than for those who learned the clustered category. Distance 0.4 corresponds to a circle around a training item on Figure 2 or 3, extending not quite one grid unit in each direction.
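Given the GCM's exponential similarity function, the 20% figure can be checked directly from the median s values reported above; a minimal Python sketch (the s medians and the distance 0.4 are taken from the text):

```python
import math

# Median best-fit scaling parameters reported above.
s_clustered, s_diverse = 2.72, 2.26

def influence(s, d):
    # GCM exemplar influence: similarity decays exponentially with distance.
    return math.exp(-s * d)

# Ratio of the two influence curves at distance 0.4, where they differ most.
ratio = influence(s_diverse, 0.4) / influence(s_clustered, 0.4)
print(round(ratio, 3))  # → 1.202, i.e., about 20% greater for the diverse group
```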

Did the effect of training diversity affect distance scaling for categories C and X equally? The GCM ordinarily assumes that the same scaling applies throughout the stimulus space, but Nosofsky and Johansen (2000) have suggested that category-specific scaling could explain the apparent bias toward diverse categories reported in Rips's (1989) classic pizza-or-coin study. We fit an enhanced regression model that had separate s parameters for categories C and X (for each participant) and found no significant difference in goodness of fit between the enhanced model and the original [χ²(41) = 37, p = .66]. Our results therefore suggest global rescaling across the whole stimulus space and do not indicate category-specific rescaling.

Recognition
The proportions of incorrect responses for various types of test item in the old–new recognition task are summarized in Table 3. Miss rates for training exemplars ranged from 22% to 33%, whereas false alarm rates for most new items were over 50%. The two training groups differed markedly in their responses to peripheral items beyond the outer edge of category X exemplars. People trained on the clustered category mistook these for old items on only 22% of trials, whereas those trained on the diverse category mistook them on 50% of trials.

Analysis of false alarms. Due to the very limited amount of relevant past research, the predictions (outlined above) for the recognition task were very general, so we conducted only a limited set of direct inferential tests. Our analysis pooled novel items into three groups and examined rates of false alarms (FA) for novel items on the category C side of the stimulus space (FA_C), the category X side (FA_X), and on the boundary between categories (FA_Bndy). A one-way MANOVA was performed on the three false alarm scores, with training diversity (clustered or diverse) as a between-subjects factor. In combination, the false alarm scores were affected by training diversity [F(3,28) = 5.48, p = .004]. Roy–Bargmann stepdown analysis indicated that the significance of the overall effect was due primarily to FA_X.

Table 3
Error Rates in Old–New Recognition Task, by Training Diversity, Showing Percentages for Training Exemplars (Old) and Novel Test Items (New) as a Function of Stimulus Region and Type of Novel Item

                                    Clustered       Diverse
       Region        Type            M     SD      M     SD
Old    Control                      25     29     33     30
       Experimental                 22     24     29     28
New    Boundary      Boundary       58     37     65     35
       Control       Prototype      83     24     79     24
                     Inside         77     29     58     26
                     Peripheral     58     35     46     34
       Experimental  Prototype      83     24     77     23
                     Inside         69     26     77     26
                     Peripheral     22     30     50     36

Figure 6. Generalized context model similarity as a function of training group (clustered or diverse) and distance from an exemplar. Derived from median best-fit values of s for each training group, when tested on generalization (A) and recognition (B).

298 HAHN BAILEY AND ELVIN

Participants trained on the clustered category made fewer false alarms on the category X side of the boundary (adjusted for FA_C and FA_Bndy, mean FA_X = 37.5, SE = 4.47) than did participants trained on the diverse category (adjusted mean FA_X = 58.4, SE = 4.47). Inspection of false alarm rates for individual items (Table 3) suggested that the effect was largely attributable to peripheral items.

Regression models. The effect of diversity on FA_X could be due to item similarities, since the peripheral items on the category X side were closer to the nearest exemplars of the diverse category than to those of the clustered one. To factor out item similarities, we fit regression models, as described in Appendix B, to predict each participant's responses to the full set of test items. The predicted probability of an old response was a function of the similarity between each test item and the training exemplars of both categories C and X. There were two free parameters of interest: the distance scaling parameter s, as discussed above, and a response bias Bnew that determined the bias (on a scale from 0 to 1) to say that something was new rather than old. Figure 7 shows response contours for the two training groups, based on median values of s and Bnew. The response probability distribution was generally broader and less peaked in the diverse condition than in the clustered condition.

The median values of Bnew were .23 and .33 for the clustered and diverse training groups, respectively (IQRs = .26 and .17). However, the difference was not statistically significant (U = 81, n1 = 16, n2 = 16, p = .080). Thus, as a nonsignificant trend, participants who learned the clustered category were somewhat less inclined to give new responses (equivalently, they required somewhat less evidence for an old response) than were participants who learned the diverse category.

The median values of s were 2.40 and 1.17 for the clustered and diverse training groups, respectively (IQRs = 1.48 and 1.40). Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 33, n1 = 16, n2 = 16, p < .001). This difference in sensitivity echoes that observed in the generalization study. The effect of the sensitivity difference on recognition is illustrated in Figure 6B. The difference in exemplar influence is greatest at distance 0.65 (about 1 1/3 grid units in Figures 2 and 3), where the influence is 168% greater for participants who learned the diverse category than for those who learned the clustered category.

Once more, there was no significant improvement in fit for a model incorporating separate scaling parameters for categories C and X, compared with the simpler model with a single global scaling parameter [χ²(32) = 23.4, p = .86].

GENERAL DISCUSSION

Our experiments suggest that the diversity of perceptual categories affects learning, generalization, and item recognition by altering the scale of similarities, as well as the positions of exemplars within the stimulus space. People learning a diverse category made more errors and gave slower responses than did people learning a clustered category. After having the same amount of category training, people who learned a diverse category were more likely than those who learned a clustered category to accept distant peripheral items as members of the category. And people who learned a diverse category were more likely to wrongly think that they had previously seen a distant peripheral item than were people who learned a clustered category. The effects of diversity on generalization and recognition were generally consistent with responding on the basis of item similarities. However, regression modeling with the GCM revealed a further effect: Participants' perception of the stimuli was altered, so that people who learned a diverse category required a greater physical difference to perceive the same psychological difference between stimuli. This perceptual rescaling effect led to lower levels of accuracy in the training task, wider generalization, and poorer item recognition, over and above effects that could be attributed to item similarities.

This study sought to distinguish different mechanisms that might produce diversity effects, and the use of three different but related tasks makes the present study particularly informative. There are numerous accounts of

Figure 7. Recognition test items and predicted response contours. Markers show training items for clustered () or diverse () training groups, and novel test items (). Solid lines are for the clustered training condition, showing 50% (outer) and 70% (inner) recognition contours. Dotted lines show the same information for the diverse training group.


category learning in the literature that could explain the reduced learning accuracy we observed in the diverse condition, but which do not simultaneously explain the effects we also observed in generalization and recognition. For example, Fried and Holyoak's (1984) category density model represents categories by their means (prototypes) and variances, estimated from a sample of exemplars. Diverse categories are learned more slowly than are clustered ones because estimates of the means and variances converge more slowly when the stimuli are less uniform. However, because the category density model is concerned only with categorization, it has nothing at all to say about the diversity effects we observed in old–new recognition. And although the category density model predicts that diversity will affect generalization, the pattern of responses in our data is predicted much better by exemplar similarities than by similarities to the category prototypes (cf. J. D. Smith & Minda, 2002; see Appendixes A and B for details).

Diversity effects could plausibly arise from item similarities, from shifts in category boundaries (as reflected in categorization biases), or from rescaling of similarity relations. A combination of methodological factors contributed to our ability to distinguish these different mechanisms. Regression models allowed us to separate out item similarities, to simultaneously consider response probabilities across a large set of test items (including old and new items, peripheral and internal as well as boundary items), and, in the generalization study, to simultaneously consider the distribution of responses across all three choices. We would not have detected the global rescaling effect if we had examined just a few critical items on the boundary between categories, because the effect of the diversity manipulation on individual items was quite small after item similarities were factored out. In the generalization task, the neither response increased the information obtained on each trial (just as a four-choice multiple choice test is more informative than the same number of true/false questions) and made it possible to detect changes in category structure on the far sides of the categories.

Most important, the neither response made it possible, in principle, to distinguish between changes in category biases and rescaling of similarities. We found no evidence that similarity relations were category specific, with different scaling factors for different regions of stimulus space according to the variability of categories in those regions. Our data also showed no evidence that response biases varied as a function of category variability. Rather, there was a global change in sensitivity to psychological distance in response to increased variability. This finding is consistent with the effect of stimulus range observed by Braida and Durlach (1972) in studies of sound intensity perception. Our results show that the effect is not confined to unidimensional stimuli, and they also demonstrate that it affects standard category learning and generalization tasks, as well as item recognition.

With regard specifically to generalization, our modeling demonstrates that exemplar similarities can explain

the seemingly different results observed in previous studies for central, peripheral, and boundary items. Only a few previous studies have controlled exemplar similarities in their tests of diversity effects on generalization (Cohen et al., 2001; Rips, 1989; Stewart & Chater, 2002). These studies focused on boundary items between a clustered category and a diverse category. Rips's pizza-or-coin study found that judgments of category membership differed from judgments of category similarity, and in particular, that categorization of boundary items was biased in favor of diverse categories rather than clustered categories. Rips concluded that because judgments for the categorization task differed from those for the similarity task, categorization could not be based on similarity. Nosofsky and colleagues have suggested alternative interpretations within the framework of the GCM (Cohen et al., 2001; Nosofsky & Johansen, 2000). Moreover, our own results, and those of numerous other studies (e.g., Shin & Nosofsky, 1992; see also works cited in Nosofsky & Johansen, 2000), show that similarity scaling can vary markedly between different tasks. This factor alone could explain the differences that Rips observed between judgments of category membership and category similarity.

The same confound applies to Cohen et al.'s (2001) Experiment 1, which contrasted performance on identification and categorization tasks. On the basis of a single test item, Cohen et al. concluded that high-diversity categories exert a "pull" on category judgments. However, their results could equally be explained by global effects of the sort observed in the present study, that is, by similarity scaling throughout the stimulus space, adjusted according to the particular task as well as the overall diversity of the stimulus items. In contrast, Cohen et al.'s Experiment 2 provides more compelling evidence for a category-specific effect of diversity on categorization. However, because their analysis focused on a single boundary item, there is no way to tell whether the category-specific effect involved response biases or category-specific rescaling of similarities. These different mechanisms can only be distinguished by examining responses across a wider range of stimuli.

Finally, our finding that category diversity affects global similarity relations has implications for the interpretation of Stewart and Chater's (2002) results. Their analyses of individual data indicated that, for most participants, boundary items were less likely to be classified as members of an experimental category with high diversity than with low diversity. On the basis of these results, Stewart and Chater suggest that response biases might be sensitive to category diversity. Alternatively, their results could reflect the same kind of global rescaling of similarities that we observed in the present study.

Although our results point to a global effect of diversity that affects the whole stimulus space in learning, generalization, and recognition tasks, they do not identify what aspect of diversity is relevant (nor have previous studies of diversity effects in categorization). Participants might be sensitive to the total range of stimuli (Braida & Durlach, 1972), or they might be sensitive to


average absolute deviations from the grand mean, or average variance about category means, or some other specific measure of variability. Only detailed comparisons across various stimulus distributions will eventually distinguish among these possibilities.

REFERENCES

Ashby, F. G., & Maddox, W. T. (1990). Integrating information from separable psychological dimensions. Journal of Experimental Psychology: Human Perception & Performance, 16, 598-612.

Barsalou, L. W. (1983). Ad hoc categories. Memory & Cognition, 11, 211-227.

Braida, L. D., & Durlach, N. I. (1972). Intensity perception: II. Resolution in one-interval paradigms. Journal of the Acoustical Society of America, 51, 483-502.

Cohen, A. L., Nosofsky, R. M., & Zaki, S. R. (2001). Category variability, exemplar similarity, and perceptual classification. Memory & Cognition, 29, 1165-1175.

Corter, J. E., & Gluck, M. A. (1992). Explaining basic categories: Feature predictability and information. Psychological Bulletin, 111, 291-303.

Flannagan, M. J., Fried, L. S., & Holyoak, K. J. (1986). Distributional expectations and the induction of category structure. Journal of Experimental Psychology: Learning, Memory, & Cognition, 12, 241-256.

Fried, L. S., & Holyoak, K. J. (1984). Induction of category distributions: A framework for classification learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 234-257.

Hahn, U., & Ramscar, M. C. A. (2001). Mere similarity. In U. Hahn & M. C. A. Ramscar (Eds.), Similarity and categorization (pp. 257-272). Oxford: Oxford University Press.

Heit, E., & Hahn, U. (2001). Diversity-based reasoning in children. Cognitive Psychology, 43, 243-273.

Homa, D., & Vosburgh, R. (1976). Category breadth and the abstraction of prototypical information. Journal of Experimental Psychology: Human Learning & Memory, 2, 322-330.

Jones, G. V. (1983). Identifying basic categories. Psychological Bulletin, 94, 423-428.

Luce, R. D. (1959). Individual choice behavior: A theoretical analysis. New York: Wiley.

Mareschal, D., French, R. M., & Quinn, P. C. (2000). A connectionist account of asymmetric category learning in early infancy. Developmental Psychology, 36, 635-645.

McLaren, I. P. L. (1997). Categorization and perceptual learning: An analogue of the face inversion effect. Quarterly Journal of Experimental Psychology, 50A, 257-273.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Murphy, G. L., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289-316.

Neumann, P. G. (1977). Visual prototype formation with discontinuous representation of dimensions of variability. Memory & Cognition, 5, 187-197.

Nosofsky, R. M. (1984). Choice, similarity, and the context theory of classification. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 104-114.

Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.

Nosofsky, R. M. (1988). Similarity, frequency, and category representations. Journal of Experimental Psychology: Learning, Memory, & Cognition, 14, 54-65.

Nosofsky, R. M., & Johansen, M. K. (2000). Exemplar-based accounts of "multiple-system" phenomena in perceptual categorization. Psychonomic Bulletin & Review, 7, 375-402.

Osherson, D. N., Smith, E. E., Wilkie, O., Lopez, A., & Shafir, E. (1990). Category-based induction. Psychological Review, 97, 185-200.

Palmeri, T. J., & Nosofsky, R. M. (2001). Central tendencies, extreme points, and prototype enhancement effects in ill-defined perceptual categorization. Quarterly Journal of Experimental Psychology, 54A, 197-235.

Peterson, M. J., Meagher, R. B., Jr., Chait, H., & Gillie, S. (1973). The abstraction and generalization of dot patterns. Cognitive Psychology, 4, 378-398.

Posner, M. I., Goldsmith, R., & Welton, K. E., Jr. (1967). Perceived distance and the classification of distorted patterns. Journal of Experimental Psychology, 73, 28-38.

Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353-363.

Quinn, P. C., Eimas, P. D., & Rosenkrantz, S. L. (1993). Evidence for representations of perceptually similar natural categories by 3-month-old and 4-month-old infants. Perception, 22, 463-475.

Rips, L. J. (1989). Similarity, typicality, and categorization. In S. Vosniadou & A. Ortony (Eds.), Similarity and analogical reasoning (pp. 21-59). New York: Cambridge University Press.

Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382-439.

Shin, H. J., & Nosofsky, R. M. (1992). Similarity-scaling studies of dot-pattern classification and recognition. Journal of Experimental Psychology: General, 121, 278-304.

Smith, E. E., & Sloman, S. A. (1994). Similarity- versus rule-based categorization. Memory & Cognition, 22, 377-386.

Smith, J. D., & Minda, J. P. (2002). Distinguishing prototype-based and exemplar-based processes in dot-pattern category learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 800-811.

Stewart, N., & Chater, N. (2002). The effect of category variability in perceptual categorization. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 893-907.

Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins.

Wayne, A. (1995). Bayesianism and diverse evidence. Philosophy of Science, 62, 111-121.

Younger, B. A. (1985). The segregation of items into categories by ten-month-old infants. Child Development, 56, 1574-1583.

NOTES

1. These studies all involved nonnormal distributions of exemplars, which were generally sampled from donut-shaped regions of stimulus space. The effect of changes in the variance of normally distributed training exemplars has received little, if any, experimental attention.

2. The MANOVA computes an optimal linear combination of several dependent variables (DVs) and computes an ANOVA on the resulting composite DV. If the original DVs are correlated (positively or negatively), the MANOVA protects against the inflated Type I error rate produced by separate tests of the DVs (see Tabachnick & Fidell, 1996).

3. Thanks to Dorrit Billman for suggesting this explanation.


APPENDIX A
Details of Modeling for Generalization Data

According to the generalized context model (GCM), classification decisions are based on similarity comparisons between a test item and individual exemplars of each category stored in memory. In our generalization task, participants judged whether item i was a member of category C, category X, or neither. The GCM predicts the probability of a category C response on the basis of the total weighted similarity of the test item to exemplars of category C, divided by the weighted similarity of the item to exemplars of both categories, plus a constant response threshold for neither responses:

P(C \mid i) = \frac{\mathrm{bias}_C \cdot \sum_{j \in C} e^{-s \cdot d_{ij}}}{\mathrm{bias}_C \cdot \sum_{j \in C} e^{-s \cdot d_{ij}} + \mathrm{bias}_X \cdot \sum_{l \in X} e^{-s \cdot d_{il}} + K} \quad (A1)

The term d_ij is the distance between stimulus i and exemplar j, computed as a weighted Euclidean distance:

d_{ij} = \sqrt{w_x (x_i - x_j)^2 + w_y (y_i - y_j)^2} \quad (A2)

where the xs and ys refer to head area and stem height measurements, respectively, and the dimension weights w_x and w_y sum to 1. Bias_C and bias_X are response bias parameters, and s is a scaling parameter that determines the sensitivity of response probabilities to a given change in distance. The probability of a category X or neither response is computed in a similar way, by replacing the numerator of Equation A1 with the appropriate term from the denominator.

Without loss of generality, we factor the response threshold K into the product of a bias parameter and the average total similarity of each exemplar to all the others, E. Thus, K = bias_neither * E, where

E = \frac{1}{N} \sum_{j} \sum_{l \neq j} e^{-s \cdot d_{jl}} \quad (A3)

N is the total number of exemplars. Higher values of bias_neither produce more neither responses. There are two key advantages of this innovative treatment of the neither response strength. First, the effect of the parameter bias_neither on goodness of fit is largely orthogonal to other parameters of the model, including the scaling parameter s. As a consequence, regressions converge more quickly than with an undifferentiated K response strength parameter. Also, bias_neither is independent of the number of exemplars and the average similarity among them, so meaningful comparisons can be made between best-fit values of bias_neither obtained across different stimulus sets. In contrast, such comparisons with the undifferentiated K are generally not meaningful.

Also without loss of generality, we restrict the bias parameters (bias_C, bias_X, and bias_neither) to the range from 0 to 1 and require them to sum to 1 (there are only two degrees of freedom among these three parameters). Finally, we are usually interested in the relative magnitudes of bias_C and bias_X compared with each other, so we define bias_C = bias_C|CX * (1 - bias_neither) and bias_X = bias_X|CX * (1 - bias_neither). The parameters bias_C|CX and bias_X|CX determine the relative magnitudes of the two category response biases. They range from 0 to 1 and sum to 1 (having only one degree of freedom between them).

The full model defined above is fully specified with reference to four independent parameters: s, w_X, bias_C|CX, and bias_neither. We used this model to predict participants' classification responses, averaged across the three presentations of each flower (once in each color scheme). Thus, for each of the 23 test flowers, the model predicted the proportions of category C, category X, and neither responses out of three opportunities. Models were fit separately to the data for each participant, by minimizing a likelihood-ratio chi-square goodness-of-fit statistic over the observed (O) and expected (E) response proportions:

G = 2 \sum O \ln \left( \frac{O}{E} \right)
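As a concrete illustration, Equations A1-A3 can be sketched in a few lines of Python. The exemplar coordinates and parameter values below are hypothetical placeholders, not the actual flower stimuli or fitted estimates:

```python
import math

def distance(a, b, wx=0.5):
    # Weighted Euclidean distance (Equation A2); wy = 1 - wx.
    wy = 1.0 - wx
    return math.sqrt(wx * (a[0] - b[0])**2 + wy * (a[1] - b[1])**2)

def mean_total_similarity(exemplars, s, wx=0.5):
    # E: average similarity of each exemplar to all the others (Equation A3).
    n = len(exemplars)
    return sum(math.exp(-s * distance(x, y, wx))
               for i, x in enumerate(exemplars)
               for j, y in enumerate(exemplars) if i != j) / n

def response_probs(item, cat_c, cat_x, s, bias_c_cx, bias_neither, wx=0.5):
    # Equation A1, with K = bias_neither * E and
    # bias_C = bias_C|CX * (1 - bias_neither), similarly for bias_X.
    bias_c = bias_c_cx * (1.0 - bias_neither)
    bias_x = (1.0 - bias_c_cx) * (1.0 - bias_neither)
    sim_c = sum(math.exp(-s * distance(item, e, wx)) for e in cat_c)
    sim_x = sum(math.exp(-s * distance(item, e, wx)) for e in cat_x)
    k = bias_neither * mean_total_similarity(cat_c + cat_x, s, wx)
    denom = bias_c * sim_c + bias_x * sim_x + k
    return (bias_c * sim_c / denom,   # P(category C)
            bias_x * sim_x / denom,   # P(category X)
            k / denom)                # P(neither)

# Hypothetical exemplars (head area, stem height) and parameter values.
cat_c = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
cat_x = [(2.0, 2.0), (2.1, 1.8), (1.9, 2.2)]
p_c, p_x, p_neither = response_probs((1.05, 1.0), cat_c, cat_x,
                                     s=2.7, bias_c_cx=0.68, bias_neither=0.03)
```

By construction the three probabilities sum to 1, and an item near the category C exemplars yields a high P(category C).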

To obtain the most accurate tests of parameter values, we eliminated parameters that did not make a significant contribution to goodness of fit. The difference in G obtained for a particular participant by two regression models that differ by the presence or absence of a single parameter can be tested against the chi-square distribution with one degree of freedom. Chi-square difference scores can be summed across all participants to obtain a composite statistic that can be tested against the chi-square distribution with degrees of freedom equal to the number of participants. Parameters that did not achieve a significance level of p < .10 on such tests were eliminated, to simplify the model, as described below.
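The composite test described here can be sketched as follows. The per-participant ΔG values are hypothetical, and the closed-form chi-square survival function shown is valid only for even degrees of freedom (an actual analysis would use a statistics library):

```python
import math

def chi2_sf_even_df(x, df):
    # Survival function P(X >= x) for a chi-square variable with EVEN df:
    # exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
    assert df % 2 == 0
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

# Hypothetical per-participant G differences for nested models (1 df each).
delta_g = [1.2, 0.8, 2.5, 0.1, 1.9, 0.6]
# Composite test: sum the differences, df = number of participants.
p = chi2_sf_even_df(sum(delta_g), len(delta_g))
```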

For the full model, with four free parameters for each participant, the goodness of fit was G(164) = 604. Setting s to 0 resulted in a significantly poorer fit to the data [χ²(82) = 920, p < .001; when s = 0, the parameter w_X has no effect, so there are just two degrees of freedom per participant]. This result indicates that participants' judgments were related to the distances between test and training items (i.e., those distances gave better predictions of responses than assuming that all items were treated identically). Setting w_X to .5 (equal weighting for both dimensions) did not significantly reduce fitness in comparison with the full model [χ²(41) = 48, p = .22]. Setting bias_C|CX to .5 (equal bias for both categories) resulted in a significantly poorer


fit compared with the full model [χ²(41) = 81, p < .001]. The neither response bias, bias_neither, is required, since participants did in fact avail themselves of the option to respond neither. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. Accordingly, the dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group, rather than varying freely among all participants within each group. Setting s to the median value of each group resulted in a significantly poorer fit to the data, compared with including a separate free parameter for each participant [χ²(39) = 58, p = .027]. Thus, distance scaling varied within groups. Setting bias_C|CX to the median value within each group did not result in a significantly poorer fit, compared with including separate parameters for each participant [χ²(39) = 41, p = .37]. A further comparison found that setting this parameter to the overall median value (bias_C|CX = .68) fit the data just as well as the model that used the median of each training group [χ²(1) = 0.21, p = .65]. Setting bias_neither to the median value of each group resulted in a significantly poorer fit to the data [χ²(39) = 74, p < .001]. This result indicates that the bias to respond neither varied among participants. Taken together, these results indicate that s and bias_neither are required for each participant; bias_C|CX did not vary among participants.

In response to a reviewer's query, we also implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average, with separate scaling parameters s for exemplar and prototype similarities. The pure prototype model was much worse than the pure instance model (goodness of fit G = 1597 and 693, respectively). The best-fitting mixture model fit only slightly better than the pure exemplar model, and the difference was not significant [χ²(82) = 29.7, p > .1]. Thus, similarity to prototypes had no significant effect on responses, after similarity to exemplars was taken into account.

APPENDIX B
Details of Modeling for Recognition Data

In our recognition task, participants judged whether item i was old or new. The GCM predicts the probability of an old response on the basis of the total weighted similarity of the test item to all exemplars, divided by this similarity plus a constant response threshold for new responses:

P(\mathrm{old} \mid i) = \frac{B_{\mathrm{old}} \cdot \sum_{j} e^{-s \cdot d_{ij}}}{B_{\mathrm{old}} \cdot \sum_{j} e^{-s \cdot d_{ij}} + K} \quad (B1)

The term d_ij is the Euclidean distance between stimulus i and exemplar j, as in Appendix A. Following Appendix A, we factor K into the product of the average similarity of each exemplar to all the others, E, multiplied by a bias term. Thus, K = B_new * E. Higher values of B_new produce more new responses. B_old and B_new range from 0 to 1 and sum to 1 (there is one degree of freedom between them). This full model includes three independent parameters: s, w_X, and B_new. For each of the 31 test flowers, the model predicted the proportions of old and new responses out of three opportunities (once in each color scheme). Models were fit separately to the data for each participant.
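A minimal Python sketch of this recognition model (Equation B1, with K = B_new * E); the exemplar coordinates and parameter values are hypothetical illustrations, not the fitted estimates:

```python
import math

def gcm_p_old(item, exemplars, s, b_new, wx=0.5):
    # Equation B1: P(old | i), with K = B_new * E and B_old = 1 - B_new.
    def dist(a, b):
        return math.sqrt(wx * (a[0] - b[0])**2 + (1 - wx) * (a[1] - b[1])**2)
    b_old = 1.0 - b_new
    # Total similarity of the test item to all stored exemplars.
    sim = sum(math.exp(-s * dist(item, e)) for e in exemplars)
    # E: average similarity of each exemplar to all the others.
    n = len(exemplars)
    e_avg = sum(math.exp(-s * dist(x, y))
                for i, x in enumerate(exemplars)
                for j, y in enumerate(exemplars) if i != j) / n
    k = b_new * e_avg
    return b_old * sim / (b_old * sim + k)

# Hypothetical exemplars: an old (trained) item scores higher than a distant one.
exemplars = [(1.0, 1.0), (1.2, 0.9), (2.0, 2.0)]
p_old_near = gcm_p_old((1.0, 1.0), exemplars, s=2.4, b_new=0.23)
p_old_far = gcm_p_old((4.0, 4.0), exemplars, s=2.4, b_new=0.23)
```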

Again, we eliminated parameters that did not make a significant contribution to goodness of fit. Setting s to 0 resulted in significantly worse fits to the data, compared with the full model [χ²(64) = 159, p < .001]. Setting w_X to .5 did not significantly reduce the fit of the model [χ²(32) = 24, p = .85]. B_new is required in order to predict any new judgments, so this parameter cannot be removed or set to 0. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. The dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group. The fit of the model with one s parameter for each group of participants was not significantly worse than the fit of the model with separate parameters for every participant [χ²(30) = 26, p = .69]. A further comparison found that setting s to the overall median of all participants resulted in a significantly worse fit to the data, compared with separate values for each training group [χ²(1) = 11.5, p < .001]. This result indicates that the diversity manipulation affected s. The fit of the model with one B_new parameter for each group was significantly worse than the model with separate parameters for each participant [χ²(30) = 86, p < .001]. This result indicates that the bias to respond new varied among participants.

Again, we implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average. The pure prototype model was substantially worse than the pure instance model (goodness of fit G = 546 and 411, respectively). The best-fitting mixture model fit only slightly better than did the pure exemplar model, and the difference was not significant [χ²(64) = 14.2, p > .1]. Thus, similarity to prototypes had no significant effect on responses, after similarity to exemplars was taken into account.

(Manuscript received December 12, 2002; revision accepted for publication June 18, 2004.)

P(old | i) = Bold · Σj e^(−s·dij) / (Bold · Σj e^(−s·dij) + K)

292 HAHN BAILEY AND ELVIN

formly negative effects of diversity observed by Peterson et al. (cf. Flannagan et al., 1986).

In summary, the effect of diversity on generalization seems to vary, so that results for boundary items do not necessarily carry over to central items or to peripheral items, and so on. One possibility is that these various effects simply reflect differences in item similarities. Exemplars of a diverse category are on average more distant from the category prototype (see Figure 3), which could account for a negative effect on generalization to a previously unseen prototype. At the same time, these exemplars are closer to the category boundary and also to fixed peripheral items, which could account for a positive effect on generalization. Are there any effects of diversity above and beyond item similarities, and if so, are these effects the same across the stimulus space? Some studies have controlled or factored out item similarities (Cohen et al., 2001; Posner & Keele, 1968; Rips, 1989; Stewart & Chater, 2002; but see the discussion in E. E. Smith & Sloman, 1994, and Nosofsky & Johansen, 2000), and some have examined generalization right across the stimulus space (Flannagan et al., 1986; Fried & Holyoak, 1984; Homa & Vosburgh, 1976; Peterson et al., 1973). However, no study has done both. Our study was designed to test for general effects of diversity after item similarities were factored out, while probing generalization across boundary, central, and peripheral items.

We also aimed to distinguish between two basic mechanisms that might alter generalization in response to variability. First, response biases (i.e., a basic tendency to favor one response category over another) might form in the classification decision process in favor of more-variable categories (Ashby & Maddox, 1990; Cohen et al., 2001; see also Stewart & Chater, 2002). Second, similarity comparisons might be rescaled, or normalized, by the amount of variability observed within the relevant categories, altering the perception of distance in the relevant stimulus space. Thus, if we did observe a general effect of variability, our analyses were designed to distinguish whether the effect was mediated by a response bias or some form of rescaling.

Memory for Instances
How does diversity during category learning affect our ability to distinguish new instances from old? Although some theorists maintain that categorization does not rely on memory for specific exemplars, no one denies that exemplars experienced during category-learning tasks normally form traces in (episodic) memory. Are those memory traces affected by the diversity of items experienced during learning? The mechanisms of response bias and rescaling discussed above in relation to generalization may also be relevant to instance memory. Empirically, we are aware of only one study that has examined the effect of diversity in a category-learning task on subsequent recognition memory. Neumann (1977) trained people on a single category, then asked them to rate the familiarity of novel items ("How certain are you that you saw this during training?"). Responses varied depending on the distribution of training items. This result showed that the perceived familiarity of new items was affected by the diversity of items in the training category. However, it is likely that some or all of the effect observed by Neumann reflected item similarities, so it is not clear whether more general effects of diversity were involved.

OVERVIEW OF THE PRESENT STUDY

The goal of our study was to test the effect of category diversity on learning, generalization, and instance memory. Each participant in our study learned a control category, which was the same for all participants, and an experimental category that was either clustered or diverse. This design made it possible, in principle, to detect response bias effects due to category diversity, and also to distinguish between global rescaling across the whole stimulus space and category-specific rescaling around each category according to its own variability. The distance between category prototypes, and between the nearest exemplars of the control and experimental categories, was the same for all participants, as was the amount of training.

Our study used simple two-dimensional materials: flowers that varied in stem height and head area, as shown in Figure 1. These materials allowed us to manipulate category variability without affecting the meaningfulness of the stimuli (cf. Peterson et al., 1973; Posner & Keele, 1968). Also, with these materials we could plausibly assume that people's category representations would be based on two perceptual dimensions (corresponding closely to stem height and head area), which made it possible to factor out effects of item similarity in regression models. Some reassurance for the justification of this assumption is provided by the fact that the two dimensions are assigned equal weights in best-fit regressions (see Appendixes A and B). Optimal classification accuracy for our stimuli requires equal attention to the dimensions of head area and stem height. If Cohen et al. (2001) are correct that learners distribute attention optimally, the equal weighting we observe lends credence to a close correspondence between psychological and physical dimensions for our stimuli.

Training
In a supervised classification training task, each participant learned the control category (C) and an experimental category (X). All participants had the same number of training trials for all exemplars. This controlled exemplar frequency, which can enhance generalization (e.g., Nosofsky, 1988), but did not guarantee the same level of category acquisition across both categories or both groups, as training to criterion performance would. We manipulated the dispersion of training items around the category X prototype, so that half the participants learned a clustered category (XC) and half learned a diverse category (XD), as shown in Figure 2.


If it is generally more difficult to learn highly variable categories than to learn less variable ones, even when the number of distinguishing features is held constant, participants learning XD should make more errors, require more training before achieving criterion performance, and give slower responses than participants learning XC (though previous studies have not assessed whether category diversity affects response speed during learning). Moreover, these effects might be either global, affecting categories C and X equally, or category specific, affecting only category X.

Generalization and Recognition
After the training task, each participant completed either a generalization or an old–new recognition task involving the novel items shown in Figure 3. These include the previously unseen category prototypes, internal items within the region spanned by the training items, boundary items that were equidistant from the two category prototypes (on the dashed lines in Figures 2 and 3), and peripheral items beyond the outermost training items. In addition to the novel items, which were the same for all participants, the generalization task included some of the training items, to reinforce category diversity during the task. The recognition task included all of the training items in addition to the novel ones.

Participants in the generalization task classified items as members of category C, category X, or neither (as in Peterson et al., 1973). The neither response provides participants with an alternative to random guessing for peripheral items that may bear little resemblance to either training category. The availability of a neither response elicits up to 60% more information compared with a two-alternative forced-choice task, and reveals the structure of each category on the far side, as well as near the boundary between them. The ratio of C versus X responses should be unaffected by the availability of a neither response (Luce, 1959). To see this, suppose that without neither responses, a particular peripheral item is classified in category C rather than X 9 out of 10 times. With neither responses, suppose that 80% of the category C responses for this item change to neither, along with the same fraction of category X responses. The ratio between C and X responses remains unchanged.
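The ratio-invariance argument can be checked with a short computation, using the numbers from the example above:

```python
# 9:1 split between C and X when no "neither" option is available.
c_responses, x_responses = 9.0, 1.0
# 80% of each category's responses migrate to "neither".
c_after = c_responses * 0.2
x_after = x_responses * 0.2
neither_after = (c_responses + x_responses) * 0.8
# The C:X ratio is unchanged (Luce, 1959), even though both counts shrank.
assert abs(c_after / x_after - c_responses / x_responses) < 1e-9
```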

On the basis of item similarities, and consistent with previous results for items at various locations within the stimulus space (e.g., Fried & Holyoak, 1984, and other works discussed above), we predicted that on novel items far from the prototype, the diverse category XD would generalize more than the clustered category XC, producing more category X responses to boundary items and to peripheral items beyond the category X training exemplars. For the prototype of the experimental category, we predicted that the diverse category would generalize less, producing fewer category X responses to this item than the clustered category would. Following Fried and Holyoak, we also predicted that people who learned the diverse category would give fewer neither responses overall to novel items than would people who learned the clustered category.

In the recognition task, participants classified each item as old if they thought they had seen it during the training task, or else new. There are virtually no relevant previous studies from which to derive detailed predictions, but if diversity of training items affects recognition, people who learned the diverse category should produce more false alarms to novel items than people who learned the clustered category.

To factor out differences in item similarities between the two groups, we fit regression models that predicted generalization and recognition responses to all test items (old and new) on the basis of their similarities to the training exemplars. Analyses tested whether category diversity affected response biases, and whether there were either global or category-specific differences in distance scaling. Our regression analyses were based on the generalized context model (GCM; Nosofsky, 1984, 1986; see also Medin & Schaffer, 1978). However, our interest was not to test a particular theory of categorization or recognition, but to test whether item similarities alone were sufficient to explain any differences we observed between groups. The GCM has a well-documented capacity to model exemplar effects for perceptual classification stimuli such as ours, and it provides a powerful analytical tool to test for effects above and beyond the influence of item similarities. Details of our adaptations of the GCM are given in Appendixes A and B.

METHOD

Participants
The participants were 73 undergraduate psychology students who received course credit for participating. All but six were female. The participants were randomly assigned to one of two training conditions, with 36 participants 18–24 years old (M = 19.44, SD = 1.27) in the clustered category condition and 37 participants 18–31 years old (M = 19.54, SD = 2.19) in the diverse condition. No color-blind participants were recruited. After the training task, 41 participants did the generalization task (20 and 21 participants from the clustered and diverse conditions, respectively). The other 32 participants did the recognition task (16 each from the two training conditions).

Stimuli
The training task used three sets of nine flowers, depicted in Figure 1, comprising exemplars of the control category C, the clustered category XC, and the diverse category XD. There were three variants of each flower, with different colors in the flower head. Figure 2 shows the locations of the training exemplars within the two-dimensional space of stem height and head area. Categories XC and XD had the same prototype, that is, the same stem height and head area averaged across training exemplars. Apart from exemplar I, which is the same in XC and XD, all other exemplars were closer to the prototype in XC than in XD. Exemplars of the control category corresponded to pairwise average coordinates of XC and XD exemplars, reflected across the diagonal category boundary shown as a dashed line in Figure 2. Category C was therefore a mirror image average of XC and XD, with an intermediate level of diversity. Neither stem height nor head area alone was sufficient to distinguish category C from XC or XD; participants had to use both stimulus dimensions to classify members of the two categories accurately.


Thirteen novel flowers for the generalization and recognition tasks were located as in Figure 3. We included more peripheral items for category X, in anticipation of the diversity manipulation affecting it more than category C. The novel items for category C occupied mirror image coordinates with respect to corresponding items for category X. All 13 novel flowers were used in the generalization task, along with training exemplars A′, D′, F′, H′, and I′ from category C and the corresponding exemplars from either XC or XD, as appropriate. The recognition task used the novel flowers plus the exemplars of category C and either XC or XD, as appropriate.

Three different flower head color schemes were used to create a task-irrelevant dimension of variation, with three variants of each flower. During training, each participant saw each flower the same number of times in each color scheme. In the generalization and recognition tasks, each flower occurred once in each color scheme. In addition to reducing ceiling effects during training, the color variations allowed us to probe recognition memory for each flower three times (once in each color scheme) without actually presenting any specific stimulus more than once.

Procedure
The participants were tested individually in a quiet room. First, the experimenter read instructions for the learning task aloud. The participants were encouraged to respond quickly but accurately. A computer controlled by a SuperLab script presented stimuli on a display screen and recorded the participants' keypresses and latencies. During training, the participants classified flowers into a "gold" category (category C) and a "silver" category (category X) by pressing keys labeled with an appropriately colored star. After each response, the flower image was replaced with feedback saying either CORRECT or INCORRECT for 500 msec, before the next flower was presented. There were 15 blocks of training trials, with each of the 18 training flowers presented once per block in a different random order. Random orders were matched across participants in the two training conditions. Colors were randomized, but each flower appeared five times in each color scheme. The participants had a short break after every three blocks.

After training, each participant performed either the generalization or the old–new recognition task. Both tasks began with instructions read aloud by the experimenter. In the generalization task, the participants classified flowers into the categories "gold," "silver," or "neither" (a blue sticker marked the key for a neither response). There were three blocks of generalization trials, with 13 novel flowers and 10 old flowers in each. Colors were randomized, with each flower appearing once in each color scheme. In the old–new recognition task, the participants decided whether each flower had appeared during the training task, and responded by pressing keys labeled old or new. There were three blocks of recognition trials, with 13 novel and 18 old flowers in each. Again, colors were randomized, with each flower appearing once in each color scheme. The participants received no feedback during either the generalization or the recognition task. Random orders were matched across participants in the clustered and diverse training conditions. After each classification or old–new response, the test flower was replaced with the question "How confident are you?" and the participants responded on a scale from 1 (low confidence) to 9 (high confidence). The participants had the opportunity for a short break between blocks.

RESULTS

Learning
The first block of trials was excluded from analysis because it would primarily reflect guessing. Participants learning the diverse category generally made more errors than those learning the clustered category, as shown in Figure 4. Inferential tests were based on three measures of difficulty, computed for each participant: the first training block completed with no errors, the total number of errors made, and the mean reaction time (RT) for correct responses. Here and throughout, two-tailed tests are reported, and significance is evaluated at an alpha level of .05.

In computing the first error-free training block (from 2–15), two participants learning the clustered category and 10 learning the diverse one did not achieve an error-free block. These participants were assigned scores of 16. Scores for the first error-free block did not satisfy parametric assumptions, so this variable was analyzed separately from errors and RTs. Participants learning the clustered category achieved their first error-free block earlier in training [median = 4.0, IQR (interquartile range) = 6.75] than did participants learning the diverse

Figure 4 Fraction of error responses during category learning as a functionof training group (clustered or diverse) category (C or X) and block of training


category (median = 9.0, IQR = 12.0). This difference was significant in a Mann–Whitney test (U = 446.0, n1 = 36, n2 = 37, p = .014).

The total numbers of errors were computed separately for the control and experimental categories, and a square root transform was applied to the totals for each participant, to reduce skewness, kurtosis, and heterogeneity of variance. The mean root errors were 2.62 and 2.77 for the clustered condition, categories C and X, respectively, and 3.24 and 3.60 for the diverse condition (SE = 0.238, 0.189, 0.206, and 0.226, respectively). These mean values are detransformed and expressed as error rates in Table 1. RTs for correct responses were log transformed. Three outlying RTs (out of 16,765), less than 200 msec or greater than 5,000 msec, were excluded from analysis. The mean log RTs were 2.775 and 2.779 for the clustered condition, categories C and X, and 2.836 and 2.848 for the diverse condition (SE = 0.016, 0.016, 0.014, and 0.015, respectively). These mean values are detransformed and shown in Table 1. A two-way multivariate analysis of variance (MANOVA) was performed on root errors and log reaction times, with training diversity (clustered or diverse) as a between-subjects factor and stimulus category (control or experimental) as a within-subjects factor.2 Here and throughout, there were no multivariate outliers within levels of training diversity (p < .001). This means that the other statistics from these MANOVAs are not unduly influenced by a few unrepresentative data points.
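The detransformation from mean root errors and mean log RTs back to the Table 1 entries can be reproduced directly. The per-category trial count (14 analyzed blocks of 9 exemplars = 126 trials) is our inference from the procedure, and base-10 logs are assumed for the RTs:

```python
# Square the mean root errors and divide by trials per category to get
# error rates; raise 10 to the mean log RTs to get detransformed RTs.
trials = 14 * 9   # blocks 2-15, nine exemplars per category per block (assumed)
root_errors = [2.62, 2.77, 3.24, 3.60]
log_rts = [2.775, 2.779, 2.836, 2.848]
error_rates = [100 * r**2 / trials for r in root_errors]   # ~5.5, 6.1, 8.3, 10.3 (%)
rts = [10**x for x in log_rts]                             # ~595, 601, 685, 705 (msec)
```

The recovered values agree with the Table 1 entries to within rounding of the reported means.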

In combination, errors and RTs were affected by category diversity [F(2,70) = 7.67, p = .001]. People learning the clustered category made fewer errors and gave faster responses than did people learning the diverse category. There was no significant difference between the control and experimental categories [F(2,70) = 1.62, p = .21], nor an interaction between diversity condition and category [F(2,70) < 1]. Thus, the diversity of category X affected responses to both categories C and X equally.

Errors and RTs made independent contributions to the combined differences between training groups, as indicated by a unique effect for each variable after factoring out the other in Roy–Bargmann stepdown analyses [Fs(1,70) > 5.24, ps < .025]. In univariate ANOVAs, the main effect of category diversity was significant both for errors [F(1,71) = 7.78, p = .007] and for RTs [F(1,71) = 9.51, p = .003].

Generalization
Average response percentages for various types of test item in the category generalization task are shown in Table 2. Asterisks identify those entries that are relevant to the predicted diversity effects we outlined above. To limit the family-wise error rate, we tested only a few key contrasts from this table, and followed up with regression analyses encompassing the whole of the data. Confidence ratings for the generalization and recognition tasks have not yet been analyzed and are not reported.

Table 1
Summary Measures of Training Difficulty, Including Error Rates and Reaction Times (RTs, in Milliseconds) for Correct Responses, as a Function of Training Group (Clustered or Diverse) and Test Category (C or X)

                 Error Rate (%)             RTs
Training     Category C  Category X   Category C  Category X
Clustered        5.5         6.1          595         601
Diverse          8.3        10.3          685         705

Table 2
Distribution of Generalization Responses, Showing Percentage of Each Response Alternative as a Function of Stimulus Region and Type of Novel Item

                                                    Response
Stimulus Region  Item Type   Training    Control (SD)  Experimental (SD)  Neither (SD)
Boundary         Boundary    Clustered      54 (5)         18 (4)            28 (4)
                             Diverse        53 (5)         23 (4)            24 (4)
Control          Prototype   Clustered      95 (3)          2 (2)             3 (2)
                             Diverse        95 (3)          0 (0)             5 (3)
                 Inside      Clustered      97 (2)          0 (0)             3 (2)
                             Diverse        95 (3)          2 (2)             3 (2)
                 Peripheral  Clustered      90 (6)          5 (5)             5 (4)
                             Diverse        87 (7)          0 (0)            13 (7)
Experimental     Prototype   Clustered       2 (2)         88 (4)            10 (4)
                             Diverse         2 (2)         76 (7)            22 (7)
                 Inside      Clustered       0 (0)         92 (3)             8 (3)
                             Diverse         2 (2)         79 (7)            19 (7)
                 Peripheral  Clustered       6 (2)         50 (4)            44 (4)
                             Diverse         8 (2)         66 (4)            26 (3)
Overall                      Clustered      37 (3)         38 (3)            26 (2)
                             Diverse        37 (3)         43 (3)            21 (2)

Note—Entries relevant to predicted diversity effects outlined in the text.


Analysis of response fractions. Our initial analysis of generalization examined four measures: the fraction of neither responses across all novel items (neither), the fraction of category X responses for novel boundary items (boundaryX), the fraction of category X responses for novel items on the periphery of category X (periphX), and the fraction of category X responses for the unseen category X prototype (protoX). The protoX scores were near ceiling and could not be analyzed parametrically. Although protoX scores were slightly lower for participants trained on the diverse category compared with the clustered category, this difference between groups was not significant [Mann–Whitney (corrected for ties): Z = 1.16, n1 = 20, n2 = 21, p = .25].

A one-way MANOVA was performed on the variables neither, boundaryX, and periphX, with training diversity (clustered or diverse) as a between-subjects factor. In combination, these variables were affected by training diversity [F(3,37) = 3.19, p = .035]. People trained on the diverse category gave fewer neither responses overall, more category X responses for boundary items, and more category X responses for items on the periphery of category X, as predicted. Roy–Bargmann stepdown analysis (details not reported here) indicated that the significance of the overall effect was due primarily to the periphX variable: people trained on the diverse category gave fewer neither responses to items on the periphery of category X, and classified them as members of category X instead.

Regression models. In our stimuli, similarities between test and training items necessarily varied as a function of training diversity, so it is possible that the effect of diversity on periphX, for example, reflects the fact that peripheral items were closer to some of the diverse training exemplars than to the corresponding clustered exemplars. To factor out the effect of item similarities, and to test for a general effect of variability, we fit regression models as described in Appendix A to predict each participant's responses to the full set of test items (both old and new). The predicted probability of a particular response was a function of the similarity between each test item and the training exemplars of categories C and X. The full model included three parameters of interest: (1) a distance scaling parameter s, which determined the sensitivity of response probabilities to a given change in head area or stem height; (2) a category C relative response bias, biasC|CX, which determined the bias (on a scale from 0 to 1) to classify something as category C rather than X; and (3) a neither response bias, biasneither, which determined the bias to say that something belonged to neither category C nor X. Changes in s and biasneither can have superficially similar effects on predicted response probabilities, but a change in biasneither will not affect the ratio of C to X responses, whereas a change in sensitivity will. The two parameters consequently have distinguishable effects on regression fits.
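The structure of this generalization model can be sketched as follows. This is a hedged illustration of Appendix A rather than the authors' exact formulation: the function name and default values are ours, and the constant-strength neither term is one simple way to give biasneither the stated property of leaving the C:X ratio unchanged while s does not.

```python
import math

def classify(item, ex_C, ex_X, s=2.5, bias_C=0.67, bias_neither=0.03):
    """Choice probabilities for C, X, and neither from bias-weighted
    summed exemplar similarities (illustrative parameter values)."""
    sim = lambda a, b: math.exp(-s * math.dist(a, b))
    S_C = bias_C * sum(sim(item, e) for e in ex_C)        # biased summed similarity to C
    S_X = (1 - bias_C) * sum(sim(item, e) for e in ex_X)  # biased summed similarity to X
    S_n = bias_neither            # constant strength for the "neither" response
    total = S_C + S_X + S_n
    return {"C": S_C / total, "X": S_X / total, "neither": S_n / total}
```

Because S_n does not depend on s, changing bias_neither inflates or deflates neither responses without altering P(C)/P(X), whereas changing s rescales both summed similarities and so does alter that ratio, which is the distinction drawn in the text.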

Goodness of fit statistics for models with and without separate biasC|CX parameters for each participant indicated that this bias did not vary significantly within training groups [χ2(39) = 41, p = .37], nor between training groups [χ2(1) = 0.21, p = .65]. Subsequent analyses fixed this parameter at the overall median value, which favored category C over X by a ratio of more than 2:1. This level of bias means that an item that was equally similar to both categories would be classified as a member of category C twice as often as category X. This strong bias may reflect participants' expectation that the two categories would occur equally often, whereas there was a preponderance of novel items on the category X side of the stimulus space.3

With biasC|CX held at the overall median value, we determined best-fit values of s and biasneither for each participant. Figure 5 shows response contours for the two training groups, based on median values of s and biasneither. As the 70% contours show, the inclusion of a neither response allowed us to model category structure on the far side of each category, as well as on the side near the boundary. The median values of biasneither were 0.0297 and 0.0310 for the clustered and diverse training groups, respectively (IQRs = 0.0446 and 0.0549). This difference was not significant in a Mann–Whitney test (U = 175, n1 = 20, n2 = 21, p = .36).

The median values of s were 2.72 and 2.26 for the clustered and diverse groups (IQRs = 1.40 and 0.55).

Figure 5. Categorization training and test items and predicted response contours. Markers show training items for the clustered and diverse training groups and novel test items. Solid lines are for the clustered training group, showing the equiprobability category boundary and the 70% categorization contours for each category. Dotted lines show the same information for the diverse training group.


Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 131, n1 = 20, n2 = 21, p = .039). The effect of this difference in s values is illustrated in Figure 6A, which shows the influence of each training item on the classification of a test item, as a function of the distance between the two. The rescaled similarity of an exemplar at distance 0.4 (where the two curves are furthest apart) is 20% greater for participants who learned the diverse category than for those who learned the clustered category. Distance 0.4 corresponds to a circle around a training item in Figure 2 or 3, extending not quite one grid unit in each direction.
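Both the 20% figure and the location of the largest gap between the two curves follow directly from the median sensitivities under the GCM's exponential similarity function, e^(−s·d):

```python
import math

s_clustered, s_diverse = 2.72, 2.26    # median best-fit sensitivities
d = 0.4                                # distance where the curves differ most
ratio = math.exp(-s_diverse * d) / math.exp(-s_clustered * d)
# ratio ~ 1.20: the exemplar's similarity is about 20% greater for the
# diverse group at this distance.
d_star = math.log(s_clustered / s_diverse) / (s_clustered - s_diverse)
# d_star ~ 0.40, the distance at which exp(-s_diverse*d) - exp(-s_clustered*d)
# is maximized, matching the "furthest apart" point in Figure 6A.
```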

Did the effect of training diversity affect distance scaling for categories C and X equally? The GCM ordinarily assumes that the same scaling applies throughout the stimulus space, but Nosofsky and Johansen (2000) have suggested that category-specific scaling could explain the apparent bias toward diverse categories reported in Rips's (1989) classic pizza-or-coin study. We fit an enhanced regression model that had separate s parameters for categories C and X (for each participant), and found no significant difference in goodness of fit between the enhanced model and the original [χ2(41) = 37, p = .66]. Our results therefore suggest global rescaling across the whole stimulus space, and do not indicate category-specific rescaling.

Recognition
The proportions of incorrect responses for various types of test item in the old–new recognition task are summarized in Table 3. Miss rates for training exemplars ranged from 22% to 33%, whereas false alarm rates for most new items were over 50%. The two training groups differed markedly in their responses to peripheral items beyond the outer edge of category X exemplars: people trained on the clustered category mistook these for old items on only 22% of trials, whereas those trained on the diverse category mistook them on 50% of trials.

Analysis of false alarms. Due to the very limited amount of relevant past research, the predictions (outlined above) for the recognition task were very general, so we conducted only a limited set of direct inferential tests. Our analysis pooled novel items into three groups and examined rates of false alarms (FA) for novel items on the category C side of the stimulus space (FA_C), the category X side (FA_X), and on the boundary between categories (FA_Bndy). A one-way MANOVA was performed on the three false alarm scores, with training diversity (clustered or diverse) as a between-subjects factor. In combination, the false alarm scores were affected by training diversity [F(3,28) = 5.48, p = .004]. Roy–Bargmann stepdown analysis indicated that the significance of the overall effect was due primarily to FA_X.

Table 3
Error Rates in Old–New Recognition Task, by Training Diversity, Showing Percentage for Training Exemplars (Old) and Novel Test Items (New) as a Function of Stimulus Region and Type of Novel Item

                                    Clustered (SD)   Diverse (SD)
Region              Type
Old   Control                          25 (29)         33 (30)
      Experimental                     22 (24)         29 (28)
New   Boundary        Boundary         58 (37)         65 (35)
      Control         Prototype        83 (24)         79 (24)
                      Inside           77 (29)         58 (26)
                      Peripheral       58 (35)         46 (34)
      Experimental    Prototype        83 (24)         77 (23)
                      Inside           69 (26)         77 (26)
                      Peripheral       22 (30)         50 (36)

Figure 6. Generalized context model similarity as a function of training group (clustered or diverse) and distance from an exemplar. Derived from median best-fit values of s for each training group, when tested on generalization (A) and recognition (B).


Participants trained on the clustered category made fewer false alarms on the category X side of the boundary (adjusted for FA_C and FA_Bndy, mean FA_X = 37.5%, SE = 4.47) than did participants trained on the diverse category (adjusted mean FA_X = 58.4%, SE = 4.47). Inspection of false alarm rates for individual items (Table 3) suggested that the effect was largely attributable to peripheral items.

Regression models. The effect of diversity on FA_X could be due to item similarities, since the peripheral items on the category X side were closer to the nearest exemplars of the diverse category than to those of the clustered one. To factor out item similarities, we fit regression models as described in Appendix B to predict each participant's responses to the full set of test items. The predicted probability of an old response was a function of the similarity between each test item and the training exemplars of both categories C and X. There were two free parameters of interest: the distance scaling parameter s, as discussed above, and a response bias Bnew that determined the bias (on a scale from 0 to 1) to say that something was new rather than old. Figure 7 shows response contours for the two training groups, based on median values of s and Bnew. The response probability distribution was generally broader and less peaked in the diverse condition than in the clustered condition.

The median values of Bnew were .23 and .33 for the clustered and diverse training groups, respectively (IQRs = .26 and .17). However, the difference was not statistically significant (U = 81, n1 = 16, n2 = 16, p = .080). Thus, as a nonsignificant trend, participants who learned the clustered category were somewhat less inclined to give new responses (equivalently, they required somewhat less evidence for an old response) than were participants who learned the diverse category.

The median values of s were 2.40 and 1.17 for the clustered and diverse training groups, respectively (IQRs = 1.48 and 1.40). Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 33, n1 = 16, n2 = 16, p < .001). This difference in sensitivity echoes that observed in the generalization study. The effect of the sensitivity difference on recognition is illustrated in Figure 6B. The difference in exemplar influence is greatest at distance 0.65 (about 1 1/3 grid units in Figures 2 and 3), where the influence is 168% greater for participants who learned the diverse category than for those who learned the clustered category.

Once more, there was no significant improvement in fit for a model incorporating separate scaling parameters for categories C and X, compared with the simpler model with a single global scaling parameter [χ²(32) = 23.4, p = .86].

GENERAL DISCUSSION

Our experiments suggest that the diversity of perceptual categories affects learning, generalization, and item recognition by altering the scale of similarities as well as the positions of exemplars within the stimulus space. People learning a diverse category made more errors and gave slower responses than did people learning a clustered category. After having the same amount of category training, people who learned a diverse category were more likely than those who learned a clustered category to accept distant peripheral items as members of the category. And people who learned a diverse category were more likely to wrongly think that they had previously seen a distant peripheral item than were people who learned a clustered category. The effects of diversity on generalization and recognition were generally consistent with responding on the basis of item similarities. However, regression modeling with the GCM revealed a further effect: participants' perception of the stimuli was altered, so that people who learned a diverse category required a greater physical difference to perceive the same psychological difference between stimuli. This perceptual rescaling effect led to lower levels of accuracy in the training task, wider generalization, and poorer item recognition, over and above effects that could be attributed to item similarities.

This study sought to distinguish different mechanisms that might produce diversity effects, and the use of three different but related tasks makes the present study particularly informative. There are numerous accounts of

Figure 7. Recognition test items and predicted response contours. Markers show training items for the clustered or diverse training groups and novel test items. Solid lines are for the clustered training condition, showing .50 (outer) and .70 (inner) recognition contours. Dotted lines show the same information for the diverse training group.

CATEGORY DIVERSITY 299

category learning in the literature that could explain the reduced learning accuracy we observed in the diverse condition, but which do not simultaneously explain the effects we also observed in generalization and recognition. For example, Fried and Holyoak's (1984) category density model represents categories by their means (prototypes) and variances, estimated from a sample of exemplars. Diverse categories are learned more slowly than are clustered ones because estimates of the means and variances converge more slowly when the stimuli are less uniform. However, because the category density model is concerned only with categorization, it has nothing at all to say about the diversity effects we observed in old–new recognition. And although the category density model predicts that diversity will affect generalization, the pattern of responses in our data is predicted much better by exemplar similarities than by similarities to the category prototypes (cf. J. D. Smith & Minda, 2002; see Appendixes A and B for details).

Diversity effects could plausibly arise from item similarities, shifts in category boundaries (as reflected in categorization biases), or rescaling of similarity relations. A combination of methodological factors contributed to our ability to distinguish these different mechanisms. Regression models allowed us to separate out item similarities, to simultaneously consider response probabilities across a large set of test items (including old and new items, peripheral and internal as well as boundary items), and, in the generalization study, to simultaneously consider the distribution of responses across all three choices. We would not have detected the global rescaling effect if we had examined just a few critical items on the boundary between categories, because the effect of the diversity manipulation on individual items was quite small after item similarities were factored out. In the generalization task, the neither response increased the information obtained on each trial (just as a four-choice multiple choice test is more informative than the same number of true/false questions) and made it possible to detect changes in category structure on the far sides of the categories.

Most important, the neither response made it possible, in principle, to distinguish between changes in category biases and rescaling of similarities. We found no evidence that similarity relations were category specific, with different scaling factors for different regions of stimulus space according to the variability of categories in those regions. Our data also showed no evidence that response biases varied as a function of category variability. Rather, there was a global change in sensitivity to psychological distance in response to increased variability. This finding is consistent with the effect of stimulus range observed by Braida and Durlach (1972) in studies of sound intensity perception. Our results show that the effect is not confined to unidimensional stimuli, and they also demonstrate that it affects standard category learning and generalization tasks, as well as item recognition.

With regard specifically to generalization, our modeling demonstrates that exemplar similarities can explain the seemingly different results observed in previous studies for central, peripheral, and boundary items. Only a few previous studies have controlled exemplar similarities in their tests of diversity effects on generalization (Cohen et al., 2001; Rips, 1989; Stewart & Chater, 2002). These studies focused on boundary items between a clustered category and a diverse category. Rips's pizza-or-coin study found that judgments of category membership differed from judgments of category similarity and, in particular, that categorization of boundary items was biased in favor of diverse categories rather than clustered categories. Rips concluded that because judgments for the categorization task differed from those for the similarity task, categorization could not be based on similarity. Nosofsky and colleagues have suggested alternative interpretations within the framework of the GCM (Cohen et al., 2001; Nosofsky & Johansen, 2000). Moreover, our own results and those of numerous other studies (e.g., Shin & Nosofsky, 1992; see also works cited in Nosofsky & Johansen, 2000) show that similarity scaling can vary markedly between different tasks. This factor alone could explain the differences that Rips observed between judgments of category membership and category similarity.

The same confound applies to Cohen et al.'s (2001) Experiment 1, which contrasted performance on identification and categorization tasks. On the basis of a single test item, Cohen et al. concluded that high-diversity categories exert a "pull" on category judgments. However, their results could equally be explained by global effects of the sort observed in the present study; that is, by similarity scaling throughout the stimulus space, adjusted according to the particular task as well as the overall diversity of the stimulus items. In contrast, Cohen et al.'s Experiment 2 provides more compelling evidence for a category-specific effect of diversity on categorization. However, because their analysis focused on a single boundary item, there is no way to tell whether the category-specific effect involved response biases or category-specific rescaling of similarities. These different mechanisms can only be distinguished by examining responses across a wider range of stimuli.

Finally, our finding that category diversity affects global similarity relations has implications for the interpretation of Stewart and Chater's (2002) results. Their analyses of individual data indicated that, for most participants, boundary items were less likely to be classified as members of an experimental category with high diversity than with low diversity. On the basis of these results, Stewart and Chater suggest that response biases might be sensitive to category diversity. Alternatively, their results could reflect the same kind of global rescaling of similarities that we observed in the present study.

Although our results point to a global effect of diversity that affects the whole stimulus space in learning, generalization, and recognition tasks, they do not identify what aspect of diversity is relevant (nor have previous studies of diversity effects in categorization). Participants might be sensitive to the total range of stimuli (Braida & Durlach, 1972), or they might be sensitive to

300 HAHN BAILEY AND ELVIN

average absolute deviations from the grand mean, or average variance about category means, or some other specific measure of variability. Only detailed comparisons across various stimulus distributions will eventually distinguish among these possibilities.

REFERENCES

Ashby, F. G., & Maddox, W. T. (1990). Integrating information from separable psychological dimensions. Journal of Experimental Psychology: Human Perception & Performance, 16, 598-612.

Barsalou, L. W. (1983). Ad hoc categories. Memory & Cognition, 11, 211-227.

Braida, L. D., & Durlach, N. I. (1972). Intensity perception: II. Resolution in one-interval paradigms. Journal of the Acoustical Society of America, 51, 483-502.

Cohen, A. L., Nosofsky, R. M., & Zaki, S. R. (2001). Category variability, exemplar similarity, and perceptual classification. Memory & Cognition, 29, 1165-1175.

Corter, J. E., & Gluck, M. A. (1992). Explaining basic categories: Feature predictability and information. Psychological Bulletin, 111, 291-303.

Flannagan, M. J., Fried, L. S., & Holyoak, K. J. (1986). Distributional expectations and the induction of category structure. Journal of Experimental Psychology: Learning, Memory, & Cognition, 12, 241-256.

Fried, L. S., & Holyoak, K. J. (1984). Induction of category distributions: A framework for classification learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 234-257.

Hahn, U., & Ramscar, M. C. A. (2001). Mere similarity. In U. Hahn & M. C. A. Ramscar (Eds.), Similarity and categorization (pp. 257-272). Oxford: Oxford University Press.

Heit, E., & Hahn, U. (2001). Diversity-based reasoning in children. Cognitive Psychology, 43, 243-273.

Homa, D., & Vosburgh, R. (1976). Category breadth and the abstraction of prototypical information. Journal of Experimental Psychology: Human Learning & Memory, 2, 322-330.

Jones, G. V. (1983). Identifying basic categories. Psychological Bulletin, 94, 423-428.

Luce, R. D. (1959). Individual choice behavior: A theoretical analysis. New York: Wiley.

Mareschal, D., French, R. M., & Quinn, P. C. (2000). A connectionist account of asymmetric category learning in early infancy. Developmental Psychology, 36, 635-645.

McLaren, I. P. L. (1997). Categorization and perceptual learning: An analogue of the face inversion effect. Quarterly Journal of Experimental Psychology, 50A, 257-273.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Murphy, G. L., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289-316.

Neumann, P. G. (1977). Visual prototype formation with discontinuous representation of dimensions of variability. Memory & Cognition, 5, 187-197.

Nosofsky, R. M. (1984). Choice, similarity, and the context theory of classification. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 104-114.

Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.

Nosofsky, R. M. (1988). Similarity, frequency, and category representations. Journal of Experimental Psychology: Learning, Memory, & Cognition, 14, 54-65.

Nosofsky, R. M., & Johansen, M. K. (2000). Exemplar-based accounts of "multiple-system" phenomena in perceptual categorization. Psychonomic Bulletin & Review, 7, 375-402.

Osherson, D. N., Smith, E. E., Wilkie, O., Lopez, A., & Shafir, E. (1990). Category-based induction. Psychological Review, 97, 185-200.

Palmeri, T. J., & Nosofsky, R. M. (2001). Central tendencies, extreme points, and prototype enhancement effects in ill-defined perceptual categorization. Quarterly Journal of Experimental Psychology, 54A, 197-235.

Peterson, M. J., Meagher, R. B., Jr., Chait, H., & Gillie, S. (1973). The abstraction and generalization of dot patterns. Cognitive Psychology, 4, 378-398.

Posner, M. I., Goldsmith, R., & Welton, K. E., Jr. (1967). Perceived distance and the classification of distorted patterns. Journal of Experimental Psychology, 73, 28-38.

Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353-363.

Quinn, P. C., Eimas, P. D., & Rosenkrantz, S. L. (1993). Evidence for representations of perceptually similar natural categories by 3-month-old and 4-month-old infants. Perception, 22, 463-475.

Rips, L. J. (1989). Similarity, typicality, and categorization. In S. Vosniadou & A. Ortony (Eds.), Similarity and analogical reasoning (pp. 21-59). New York: Cambridge University Press.

Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382-439.

Shin, H. J., & Nosofsky, R. M. (1992). Similarity-scaling studies of dot-pattern classification and recognition. Journal of Experimental Psychology: General, 121, 278-304.

Smith, E. E., & Sloman, S. A. (1994). Similarity- versus rule-based categorization. Memory & Cognition, 22, 377-386.

Smith, J. D., & Minda, J. P. (2002). Distinguishing prototype-based and exemplar-based processes in dot-pattern category learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 800-811.

Stewart, N., & Chater, N. (2002). The effect of category variability in perceptual categorization. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 893-907.

Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins.

Wayne, A. (1995). Bayesianism and diverse evidence. Philosophy of Science, 62, 111-121.

Younger, B. A. (1985). The segregation of items into categories by ten-month-old infants. Child Development, 56, 1574-1583.

NOTES

1. These studies all involved nonnormal distributions of exemplars, which were generally sampled from donut-shaped regions of stimulus space. The effect of changes in the variance of normally distributed training exemplars has received little, if any, experimental attention.

2. The MANOVA computes an optimal linear combination of several dependent variables (DVs) and computes an ANOVA on the resulting composite DV. If the original DVs are correlated (positively or negatively), the MANOVA protects against the inflated Type I error rate produced by separate tests of the DVs (see Tabachnick & Fidell, 1996).

3. Thanks to Dorrit Billman for suggesting this explanation.


APPENDIX A
Details of Modeling for Generalization Data

According to the generalized context model (GCM), classification decisions are based on similarity comparisons between a test item and individual exemplars of each category stored in memory. In our generalization task, participants judged whether item i was a member of category C, category X, or neither. The GCM predicts the probability of a category C response on the basis of the total weighted similarity of the test item to exemplars of category C, divided by the weighted similarity of the item to exemplars of both categories, plus a constant response threshold for neither responses:

p(\mathrm{C} \mid i) = \frac{\mathrm{bias}_C \sum_{j \in C} e^{-s d_{ij}}}{\mathrm{bias}_C \sum_{j \in C} e^{-s d_{ij}} + \mathrm{bias}_X \sum_{l \in X} e^{-s d_{il}} + K} \quad (A1)

The term d_ij is the distance between stimulus i and exemplar j, computed as a weighted Euclidean distance:

d_{ij} = \sqrt{w_x (x_i - x_j)^2 + w_y (y_i - y_j)^2} \quad (A2)

where the xs and ys refer to head area and stem height measurements, respectively, and the dimension weights w_x and w_y sum to 1. Bias_C and bias_X are response bias parameters, and s is a scaling parameter that determines the sensitivity of response probabilities to a given change in distance. The probability of a category X or neither response is computed in a similar way, by replacing the numerator of Equation A1 with the appropriate term from the denominator.

Without loss of generality, we factor the response threshold K into the product of a bias parameter and the average total similarity of each exemplar to all the others, E. Thus K = bias_neither · E, where

E = \frac{1}{N} \sum_{j} \sum_{l \neq j} e^{-s d_{jl}} \quad (A3)

N is the total number of exemplars. Higher values of bias_neither produce more neither responses. There are two key advantages of this innovative treatment of the neither response strength. The effect of the parameter bias_neither on goodness of fit is largely orthogonal to other parameters of the model, including the scaling parameter s. As a consequence, regressions converge more quickly than with an undifferentiated K response strength parameter. Also, bias_neither is independent of the number of exemplars and the average similarity among them, so meaningful comparisons can be made between best-fit values of bias_neither obtained across different stimulus sets. In contrast, such comparisons with the undifferentiated K are generally not meaningful.

Also without loss of generality, we restrict the bias parameters (bias_C, bias_X, and bias_neither) to the range from 0 to 1 and require them to sum to 1 (there are only two degrees of freedom among these three parameters). Finally, we are usually interested in the relative magnitudes of bias_C and bias_X compared with each other, so we define bias_C = bias_C|CX · (1 - bias_neither) and bias_X = bias_X|CX · (1 - bias_neither). The parameters bias_C|CX and bias_X|CX determine the relative magnitudes of the two category response biases. They range from 0 to 1 and sum to 1 (having only one degree of freedom between them).
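As a concrete illustration (our own sketch, not the authors' implementation), the three-choice model defined by Equations A1-A3 can be written in a few lines of Python. Names follow Appendix A; the numeric defaults are arbitrary placeholders, not the fitted values reported in the text.

```python
import numpy as np

def gcm_choice_probs(item, exemplars_C, exemplars_X,
                     s=1.0, w_x=0.5, bias_C_CX=0.5, bias_neither=0.2):
    """Three-choice GCM response probabilities (Equations A1-A3)."""
    w = np.array([w_x, 1.0 - w_x])  # dimension weights sum to 1

    def sim(a, b):
        # Weighted Euclidean distance (Equation A2), exponential similarity
        d = np.sqrt(np.sum(w * (np.asarray(a, float) - np.asarray(b, float)) ** 2))
        return np.exp(-s * d)

    sim_C = sum(sim(item, e) for e in exemplars_C)
    sim_X = sum(sim(item, e) for e in exemplars_X)

    # Response threshold K = bias_neither * E, where E is the average
    # similarity of each exemplar to all the others (Equation A3)
    pool = list(exemplars_C) + list(exemplars_X)
    N = len(pool)
    E = sum(sim(a, b) for i, a in enumerate(pool)
            for j, b in enumerate(pool) if i != j) / N
    K = bias_neither * E

    # bias_C + bias_X + bias_neither = 1
    bias_C = bias_C_CX * (1.0 - bias_neither)
    bias_X = (1.0 - bias_C_CX) * (1.0 - bias_neither)

    denom = bias_C * sim_C + bias_X * sim_X + K
    return {"C": bias_C * sim_C / denom,  # Equation A1
            "X": bias_X * sim_X / denom,
            "neither": K / denom}
```

By construction the three probabilities sum to 1, and an item close to the category C exemplars receives a higher C than X probability.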

The full model defined above is fully specified with reference to four independent parameters: s, w_X, bias_C|CX, and bias_neither. We used this model to predict participants' classification responses, averaged across the three presentations of each flower (once in each color scheme). Thus, for each of the 23 test flowers, the model predicted the proportions of category C, category X, and neither responses out of three opportunities. Models were fit separately to the data for each participant by minimizing a likelihood-ratio chi-square goodness-of-fit statistic over the observed (O) and expected (E) response proportions:

G = 2 \sum O \ln\left(\frac{O}{E}\right)

To obtain the most accurate tests of parameter values, we eliminated parameters that did not make a significant contribution to goodness of fit. The difference in G obtained for a particular participant by two regression models that differ by the presence or absence of a single parameter can be tested against the chi-square distribution with one degree of freedom. Chi-square difference scores can be summed across all participants to obtain a composite statistic that can be tested against the chi-square distribution with degrees of freedom equal to the number of participants. Parameters that did not achieve a significance level of p < .10 on such tests were eliminated, to simplify the model, as described below.
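The likelihood-ratio statistic G = 2 Σ O ln(O/E) and the nested-model comparison it supports can be sketched as follows. This is our illustration of the standard procedure, not the authors' code; SciPy's `chi2.sf` is assumed for the chi-square tail probability.

```python
import numpy as np
from scipy.stats import chi2

def g_statistic(observed, expected):
    """Likelihood-ratio goodness of fit, G = 2 * sum(O * ln(O / E)).

    Cells with zero observed proportion contribute nothing to the sum.
    """
    O = np.asarray(observed, dtype=float)
    E = np.asarray(expected, dtype=float)
    mask = O > 0
    return 2.0 * np.sum(O[mask] * np.log(O[mask] / E[mask]))

def nested_test(g_simple, g_full, df):
    """Test a G difference between nested models against chi-square.

    `df` is the number of parameters dropped (summed over participants
    when per-participant difference scores are pooled).
    """
    diff = g_simple - g_full
    return diff, chi2.sf(diff, df)
```

A perfect fit (O = E everywhere) gives G = 0; a G difference large relative to its degrees of freedom indicates that the dropped parameter was doing real work.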

For the full model, with four free parameters for each participant, the goodness of fit was G(164) = 604. Setting s to 0 resulted in a significantly poorer fit to the data [χ²(82) = 920, p < .001; when s = 0, the parameter w_X has no effect, so there are just two degrees of freedom per participant]. This result indicates that participants' judgments were related to the distances between test and training items (i.e., those distances gave better predictions of responses than assuming that all items were treated identically). Setting w_X to 0.5 (equal weighting for both dimensions) did not significantly reduce fit in comparison with the full model [χ²(41) = 48, p = .22]. Setting bias_C|CX to 0.5 (equal bias for both categories) resulted in a significantly poorer fit compared with the full model [χ²(41) = 81, p < .001]. The neither response bias, bias_neither, is required, since participants did in fact avail themselves of the option to respond neither. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. Accordingly, the dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group, rather than varying freely among all participants within each group. Setting s to the median value of each group resulted in a significantly poorer fit to the data, compared with including a separate free parameter for each participant [χ²(39) = 58, p = .027]. Thus, distance scaling varied within groups. Setting bias_C|CX to the median value within each group did not result in a significantly poorer fit, compared with including separate parameters for each participant [χ²(39) = 41, p = .37]. A further comparison found that setting this parameter to the overall median value (bias_C|CX = .68) fit the data just as well as the model that used the median of each training group [χ²(1) = 0.21, p = .65]. Setting bias_neither to the median value of each group resulted in a significantly poorer fit to the data [χ²(39) = 74, p < .001]. This result indicates that the bias to respond neither varied among participants. Taken together, these results indicate that s and bias_neither are required for each participant; bias_C|CX did not vary among participants.

In response to a reviewer's query, we also implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average, with separate scaling parameters s for exemplar and prototype similarities. The pure prototype model was much worse than the pure instance model (goodness of fit G = 1,597 and 693, respectively). The best-fitting mixture model fit only slightly better than the pure exemplar model, and the difference was not significant [χ²(82) = 29.7, p > .1]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.

APPENDIX B
Details of Modeling for Recognition Data

In our recognition task, participants judged whether item i was old or new. The GCM predicts the probability of an old response on the basis of the total weighted similarity of the test item to all exemplars, divided by this similarity plus a constant response threshold for new responses:

p(\mathrm{old} \mid i) = \frac{B_{\mathrm{old}} \sum_j e^{-s d_{ij}}}{B_{\mathrm{old}} \sum_j e^{-s d_{ij}} + K} \quad (B1)

The term d_ij is the Euclidean distance between stimulus i and exemplar j, as in Appendix A. Following Appendix A, we factor K into the product of the average similarity of each exemplar to all the others, E, multiplied by a bias term. Thus K = B_new · E. Higher values of B_new produce more new responses. B_old and B_new range from 0 to 1 and sum to 1 (there is one degree of freedom between them). This full model includes three independent parameters: s, w_X, and B_new. For each of the 31 test flowers, the model predicted the proportions of old and new responses out of three opportunities (once in each color scheme). Models were fit separately to the data for each participant.
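Again purely as an illustration (ours, not the authors' code), Equation B1 can be sketched in Python, reusing the Appendix A conventions; the numeric defaults are placeholders, not the fitted medians reported in the text.

```python
import numpy as np

def gcm_old_prob(item, exemplars, s=1.0, w_x=0.5, B_new=0.3):
    """Probability of an 'old' response under Equation B1."""
    w = np.array([w_x, 1.0 - w_x])

    def sim(a, b):
        # Weighted Euclidean distance, exponential similarity (as in Appendix A)
        d = np.sqrt(np.sum(w * (np.asarray(a, float) - np.asarray(b, float)) ** 2))
        return np.exp(-s * d)

    total_sim = sum(sim(item, e) for e in exemplars)

    # K = B_new * E, with E the average similarity of each exemplar
    # to all the others
    N = len(exemplars)
    E = sum(sim(a, b) for i, a in enumerate(exemplars)
            for j, b in enumerate(exemplars) if i != j) / N
    K = B_new * E

    B_old = 1.0 - B_new
    return B_old * total_sim / (B_old * total_sim + K)  # Equation B1
```

Because the response probability increases monotonically with total similarity to the training set, an old training item always yields a higher predicted old rate than a distant novel item.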

Again, we eliminated parameters that did not make a significant contribution to goodness of fit. Setting s to 0 resulted in significantly worse fits to the data, compared with the full model [χ²(64) = 159, p < .001]. Setting w_X to 0.5 did not significantly reduce the fit of the model [χ²(32) = 24, p = .85]. B_new is required in order to predict any new judgments, so this parameter cannot be removed or set to 0. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. The dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group. The fit of the model with one s parameter for each group of participants was not significantly worse than the fit of the model with separate parameters for every participant [χ²(30) = 26, p = .69]. A further comparison found that setting s to the overall median of all participants resulted in a significantly worse fit to the data, compared with separate values for each training group [χ²(1) = 11.5, p < .001]. This result indicates that the diversity manipulation affected s. The fit of the model with one B_new parameter for each group was significantly worse than the model with separate parameters for each participant [χ²(30) = 86, p < .001]. This result indicates that the bias to respond new varied among participants.

Again, we implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average. The pure prototype model was substantially worse than the pure instance model (goodness of fit G = 546 and 411, respectively). The best-fitting mixture model fit only slightly better than did the pure exemplar model, and the difference was not significant [χ²(64) = 14.2, p > .1]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.

(Manuscript received December 12, 2002; revision accepted for publication June 18, 2004.)


If it is generally more difficult to learn highly variable categories than to learn less variable ones, even when the number of distinguishing features is held constant, participants learning XD should make more errors, require more training before achieving criterion performance, and give slower responses than participants learning XC (though previous studies have not assessed whether category diversity affects response speed during learning). Moreover, these effects might be either global, affecting categories C and X equally, or category specific, affecting only category X.

Generalization and Recognition
After the training task, each participant completed either a generalization or an old–new recognition task involving the novel items shown in Figure 3. These include the previously unseen category prototypes, internal items within the region spanned by the training items, boundary items that were equidistant from the two category prototypes (on the dashed lines in Figures 2 and 3), and peripheral items beyond the outermost training items. In addition to the novel items, which were the same for all participants, the generalization task included some of the training items, to reinforce category diversity during the task. The recognition task included all of the training items in addition to the novel ones.

Participants in the generalization task classified items as members of category C, category X, or neither (as in Peterson et al., 1973). The neither response provides participants with an alternative to random guessing for peripheral items that may bear little resemblance to either training category. The availability of a neither response elicits up to 60% more information compared with a two-alternative forced-choice task, and it reveals the structure of each category on the far side, as well as near the boundary between the categories. The ratio of C versus X responses should be unaffected by the availability of a neither response (Luce, 1959). To see this, suppose that without neither responses, a particular peripheral item is classified in category C rather than X 9 out of 10 times. With neither responses, suppose that 80% of the category C responses for this item change to neither, along with the same fraction of category X responses. The ratio between C and X responses remains unchanged.
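The invariance claimed here is easy to verify numerically. The following sketch (ours, with the 80% absorption figure taken from the example above) shows the 9:1 ratio surviving the introduction of neither responses:

```python
# Two-choice performance: a peripheral item classified C rather than X
# 9 times out of 10, so the C:X ratio is 9:1.
p_C, p_X = 0.9, 0.1

# Introduce a neither option that absorbs the same fraction (here 80%)
# of both C and X responses.
absorbed = 0.8
p_C3 = p_C * (1 - absorbed)   # C responses remaining
p_X3 = p_X * (1 - absorbed)   # X responses remaining

# The C:X ratio is unchanged, as Luce's choice axiom requires.
assert abs(p_C / p_X - p_C3 / p_X3) < 1e-9
```

The ratio is preserved for any absorption fraction, provided the same fraction applies to both categories.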

On the basis of item similarities, and consistent with previous results for items at various locations within the stimulus space (e.g., Fried & Holyoak, 1984, and other works discussed above), we predicted that on novel items far from the prototype, the diverse category XD would generalize more than the clustered category XC, producing more category X responses to boundary items and to peripheral items beyond the category X training exemplars. For the prototype of the experimental category, we predicted that the diverse category would generalize less, producing fewer category X responses to this item than the clustered category would. Following Fried and Holyoak, we also predicted that people who learned the diverse category would give fewer neither responses overall to novel items than would people who learned the clustered category.

In the recognition task, participants classified each item as old if they thought they had seen it during the training task, or else new. There are virtually no relevant previous studies from which to derive detailed predictions, but if diversity of training items affects recognition, people who learned the diverse category should produce more false alarms to novel items than people who learned the clustered category.

To factor out differences in item similarities between the two groups, we fit regression models that predicted generalization and recognition responses to all test items (old and new) on the basis of their similarities to the training exemplars. Analyses tested whether category diversity affected response biases, and whether there were either global or category-specific differences in distance scaling. Our regression analyses were based on the generalized context model (GCM; Nosofsky, 1984, 1986; see also Medin & Schaffer, 1978). However, our interest was not to test a particular theory of categorization or recognition, but to test whether item similarities alone were sufficient to explain any differences we observed between groups. The GCM has a well-documented capacity to model exemplar effects for perceptual classification stimuli such as ours, and it provides a powerful analytical tool to test for effects above and beyond the influence of item similarities. Details of our adaptations of the GCM are given in Appendixes A and B.

METHOD

Participants
The participants were 73 undergraduate psychology students who received course credit for participating. All but six were female. The participants were randomly assigned to one of two training conditions, with 36 participants 18–24 years old (M = 19.44, SD = 1.27) in the clustered category condition and 37 participants 18–31 years old (M = 19.54, SD = 2.19) in the diverse condition. No color-blind participants were recruited. After the training task, 41 participants did the generalization task (20 and 21 participants from the clustered and diverse conditions, respectively). The other 32 participants did the recognition task (16 each from the two training conditions).

Stimuli
The training task used three sets of nine flowers, depicted in Figure 1, comprising exemplars of the control category C, the clustered category XC, and the diverse category XD. There were three variants of each flower, with different colors in the flower head. Figure 2 shows the locations of the training exemplars within the two-dimensional space of stem height and head area. Categories XC and XD had the same prototype, that is, the same stem height and head area averaged across training exemplars. Apart from exemplar I, which is the same in XC and XD, all other exemplars were closer to the prototype in XC than in XD. Exemplars of the control category corresponded to pairwise average coordinates of XC and XD exemplars, reflected across the diagonal category boundary shown as a dashed line in Figure 2. Category C was therefore a mirror-image average of XC and XD, with an intermediate level of diversity. Neither stem height nor head area alone was sufficient to distinguish category C from XC or XD; participants had to use both stimulus dimensions to accurately classify members of the two categories.


Thirteen novel flowers for the generalization and recognition tasks were located as in Figure 3. We included more peripheral items for category X, in anticipation of the diversity manipulation affecting it more than category C. The novel items for category C occupied mirror-image coordinates with respect to corresponding items for category X. All 13 novel flowers were used in the generalization task, along with training exemplars A′, D′, F′, H′, and I′ from category C and the corresponding exemplars from either XC or XD, as appropriate. The recognition task used the novel flowers plus the exemplars of category C and either XC or XD, as appropriate.

Three different flower head color schemes were used to create a task-irrelevant dimension of variation, with three variants of each flower. During training, each participant saw each flower the same number of times in each color scheme. In the generalization and recognition tasks, each flower occurred once in each color scheme. In addition to reducing ceiling effects during training, the color variations allowed us to probe recognition memory for each flower three times (once in each color scheme) without actually presenting any specific stimulus more than once.

Procedure
The participants were tested individually in a quiet room. First, the experimenter read instructions for the learning task aloud. The participants were encouraged to respond quickly but accurately. A computer controlled by a SuperLab script presented stimuli on a display screen and recorded the participants' keypresses and latencies. During training, the participants classified flowers into a "gold" category (category C) and a "silver" category (category X) by pressing keys labeled with an appropriately colored star. After each response, the flower image was replaced with feedback saying either CORRECT or INCORRECT for 500 msec before the next flower was presented. There were 15 blocks of training trials, with each of the 18 training flowers presented once per block in a different random order. Random orders were matched across participants in the two training conditions. Colors were randomized, but each flower appeared five times in each color scheme. The participants had a short break after every three blocks.

After training, each participant performed either the generalization or old–new recognition task. Both tasks began with instructions read aloud by the experimenter. In the generalization task, the participants classified flowers into the categories "gold," "silver," or "neither" (a blue sticker marked the key for a neither response). There were three blocks of generalization trials, with 13 novel flowers and 10 old flowers in each. Colors were randomized, with each flower appearing once in each color scheme. In the old–new recognition task, the participants decided whether each flower had appeared during the training task and responded by pressing keys labeled old or new. There were three blocks of recognition trials, with 13 novel and 18 old flowers in each. Again, colors were randomized, with each flower appearing once in each color scheme. The participants received no feedback during either the generalization or the recognition task. Random orders were matched across participants in the clustered and diverse training conditions. After each classification or old–new response, the test flower was replaced with the question "How confident are you?" and the participants responded on a scale from 1 (low confidence) to 9 (high confidence). The participants had the opportunity for a short break between blocks.

RESULTS

Learning
The first block of trials was excluded from analysis because it would primarily reflect guessing. Participants learning the diverse category generally made more errors than those learning the clustered category, as shown in Figure 4. Inferential tests were based on three measures of difficulty, computed for each participant: the first training block completed with no errors, the total number of errors made, and the mean reaction time (RT) for correct responses. Here and throughout, two-tailed tests are reported and significance is evaluated at an alpha level of .05.

In computing the first error-free training block (from 2 to 15), two participants learning the clustered category and 10 learning the diverse one did not achieve an error-free block. These participants were assigned scores of 16. Scores for the first error-free block did not satisfy parametric assumptions, so this variable was analyzed separately from errors and RTs. Participants learning the clustered category achieved their first error-free block earlier in training [median = 4.0, IQR (interquartile range) = 6.75] than did participants learning the diverse

Figure 4. Fraction of error responses during category learning as a function of training group (clustered or diverse), category (C or X), and block of training.


category (median = 9.0, IQR = 12.0). This difference was significant in a Mann–Whitney test (U = 446.0, n1 = 36, n2 = 37, p = .014).

The total numbers of errors were computed separately for the control and experimental categories, and a square root transform was applied to the totals for each participant to reduce skewness, kurtosis, and heterogeneity of variance. The mean root errors were 2.62 and 2.77 for the clustered condition, categories C and X, respectively, and 3.24 and 3.60 for the diverse condition (SEs = 0.238, 0.189, 0.206, and 0.226, respectively). These mean values are detransformed and expressed as error rates in Table 1. RTs for correct responses were log transformed. Three outlying RTs (out of 16,765), less than 200 msec or greater than 5,000 msec, were excluded from analysis. The mean log RTs were 2.775 and 2.779 for the clustered condition, categories C and X, and 2.836 and 2.848 for the diverse condition (SEs = 0.016, 0.016, 0.014, and 0.015, respectively). These mean values are detransformed and shown in Table 1. A two-way multivariate analysis of variance (MANOVA) was performed on root errors and log reaction times, with training diversity (clustered or diverse) as a between-subjects factor and stimulus category (control or experimental) as a within-subjects factor.2 Here and throughout, there were no multivariate outliers within levels of training diversity (p < .001). This means that the other statistics from these MANOVAs are not unduly influenced by a few unrepresentative data points.
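The detransformation can be checked by hand. Squaring a mean root-error score and dividing by the 126 scored trials per category (14 blocks of 9 exemplars, after excluding block 1) recovers the error rates in Table 1, and raising 10 to a mean log RT recovers the RTs. A quick sketch, not part of the original analysis:

```python
# Detransforming the summary statistics reported in the text back into
# the error rates and RTs of Table 1. The count of 126 scored trials per
# category follows from the design: 15 blocks x 9 exemplars per category,
# minus the excluded first block (14 x 9 = 126).
TRIALS_PER_CATEGORY = 14 * 9

def error_rate_pct(mean_root_errors):
    """Square the mean root-error score; express as % of scored trials."""
    return 100 * mean_root_errors ** 2 / TRIALS_PER_CATEGORY

def rt_msec(mean_log_rt):
    """Invert the log10 transform applied to correct-response RTs."""
    return 10 ** mean_log_rt

print(round(error_rate_pct(3.24), 1))  # 8.3 (diverse condition, category C)
print(round(rt_msec(2.848)))           # 705 (diverse condition, category X)
```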

In combination, errors and RTs were affected by category diversity [F(2,70) = 7.67, p = .001]. People learning the clustered category made fewer errors and gave faster responses than did people learning the diverse category. There was no significant difference between the control and experimental categories [F(2,70) = 1.62, p = .21], nor an interaction between diversity condition and category [F(2,70) < 1]. Thus, the diversity of category X affected responses to both categories C and X equally.

Errors and RTs made independent contributions to the combined differences between training groups, as indicated by a unique effect for each variable after factoring out the other in Roy–Bargmann stepdown analyses [Fs(1,70) > 5.24, ps < .025]. In univariate ANOVAs, the main effect of category diversity was significant both for errors [F(1,71) = 7.78, p = .007] and for RTs [F(1,71) = 9.51, p = .003].

Generalization
Average response percentages for various types of test item in the category generalization task are shown in Table 2. Asterisks identify those entries that are relevant to the predicted diversity effects we outlined above. To limit the family-wise error rate, we tested only a few key contrasts from this table and followed up with regression analyses encompassing the whole of the data. Confidence ratings for the generalization and recognition tasks have not yet been analyzed and are not reported.

Table 1
Summary Measures of Training Difficulty, Including Error Rates and Reaction Times (RTs, in Milliseconds) for Correct Responses, as a Function of Training Group (Clustered or Diverse) and Test Category (C or X)

             Error Rate (%)            RTs (msec)
Training     Category C  Category X   Category C  Category X
Clustered        5.5         6.1         595         601
Diverse          8.3        10.3         685         705

Table 2
Distribution of Generalization Responses, Showing Percentage of Each Response Alternative as a Function of Stimulus Region and Type of Novel Item

                                              Response
                                     Control     Experimental   Neither
Stimulus Region  Item Type   Training     %  SD        %  SD       %  SD
Boundary         Boundary    Clustered   54   5       18   4      28   4
                             Diverse     53   5       23   4      24   4
Control          Prototype   Clustered   95   3        2   2       3   2
                             Diverse     95   3        0   0       5   3
                 Inside      Clustered   97   2        0   0       3   2
                             Diverse     95   3        2   2       3   2
                 Peripheral  Clustered   90   6        5   5       5   4
                             Diverse     87   7        0   0      13   7
Experimental     Prototype   Clustered    2   2       88   4      10   4
                             Diverse      2   2       76   7      22   7
                 Inside      Clustered    0   0       92   3       8   3
                             Diverse      2   2       79   7      19   7
                 Peripheral  Clustered    6   2       50   4      44   4
                             Diverse      8   2       66   4      26   3
Overall                      Clustered   37   3       38   3      26   2
                             Diverse     37   3       43   3      21   2
*Entries relevant to predicted diversity effects outlined in the text.


Analysis of response fractions. Our initial analysis of generalization examined four measures: the fraction of neither responses across all novel items (neither), the fraction of category X responses for novel boundary items (boundaryX), the fraction of category X responses for novel items on the periphery of category X (periphX), and the fraction of category X responses for the unseen category X prototype (protoX). The protoX scores were near ceiling and could not be analyzed parametrically. Although protoX scores were slightly lower for participants trained on the diverse category compared with the clustered category, this difference between groups was not significant [Mann–Whitney (corrected for ties) Z = 1.16, n1 = 20, n2 = 21, p = .25].

A one-way MANOVA was performed on the variables neither, boundaryX, and periphX, with training diversity (clustered or diverse) as a between-subjects factor. In combination, these variables were affected by training diversity [F(3,37) = 3.19, p = .035]. People trained on the diverse category gave fewer neither responses overall, more category X responses for boundary items, and more category X responses for items on the periphery of category X, as predicted. Roy–Bargmann stepdown analysis (details not reported here) indicated that the significance of the overall effect was due primarily to the periphX variable: people trained on the diverse category gave fewer neither responses to items on the periphery of category X and classified them as members of category X instead.

Regression models. In our stimuli, similarities between test and training items necessarily varied as a function of training diversity, so it is possible that the effect of diversity on periphX, for example, reflects the fact that peripheral items were closer to some of the diverse training exemplars than to the corresponding clustered exemplars. To factor out the effect of item similarities and to test for a general effect of variability, we fit regression models, as described in Appendix A, to predict each participant's responses to the full set of test items (both old and new). The predicted probability of a particular response was a function of the similarity between each test item and the training exemplars of categories C and X. The full model included three parameters of interest: (1) a distance scaling parameter, s, that determined the sensitivity of response probabilities to a given change in head area or stem height; (2) a category C relative response bias, biasC|CX, that determined the bias (on a scale from 0 to 1) to classify something as category C rather than X; and (3) a neither response bias, biasneither, that determined the bias to say that something belonged to neither category C nor X. Changes in s and biasneither can have superficially similar effects on predicted response probabilities, but a change in biasneither will not affect the ratio of C to X responses, whereas a change in sensitivity will. The two parameters consequently have distinguishable effects on regression fits.
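As a concrete sketch, the three-way choice rule can be computed roughly as follows. The exponential similarity function is the standard GCM form, and the exact way the two biases enter the rule is our assumption based on the description above and Appendix A:

```python
import math

def similarity(item, exemplar, s):
    """Exponential decay of similarity with Euclidean distance (GCM-style)."""
    return math.exp(-s * math.dist(item, exemplar))

def response_probabilities(item, exemplars_c, exemplars_x,
                           s, bias_c, bias_neither):
    """Predicted P(C), P(X), and P(neither) for one test item.

    s            -- distance scaling (sensitivity) parameter
    bias_c       -- relative bias (0-1) for category C over X
    bias_neither -- constant threshold for 'neither' responses
    Parameter roles follow the text; the exact algebra is an assumption.
    """
    sim_c = bias_c * sum(similarity(item, e, s) for e in exemplars_c)
    sim_x = (1 - bias_c) * sum(similarity(item, e, s) for e in exemplars_x)
    total = sim_c + sim_x + bias_neither
    return sim_c / total, sim_x / total, bias_neither / total

# An item equally similar to both categories, under a 2:1 bias toward C
# (bias_c = 2/3), is classified as C twice as often as X.
p_c, p_x, p_n = response_probabilities(
    (0.0, 0.0), [(-1.0, 0.0)], [(1.0, 0.0)],
    s=2.7, bias_c=2 / 3, bias_neither=0.03)
```

Note how bias_neither enters only the denominator: raising it shifts responses from both categories toward neither without changing the C-to-X ratio, which is exactly the property that makes it distinguishable from s in the fits.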

Goodness of fit statistics for models with and without separate biasC|CX parameters for each participant indicated that this bias did not vary significantly within training groups [χ2(39) = 41, p = .37] nor between training groups [χ2(1) = 0.21, p = .65]. Subsequent analyses fixed this parameter at the overall median value, which favored category C over X by a ratio of more than 2:1. This level of bias means that an item that was equally similar to both categories would be classified as a member of category C twice as often as category X. This strong bias may reflect participants' expectation that the two categories would occur equally often, whereas there was a preponderance of novel items on the category X side of the stimulus space.3

With biasC|CX held at the overall median value, we determined best-fit values of s and biasneither for each participant. Figure 5 shows response contours for the two training groups, based on median values of s and biasneither. As the 70% contours show, the inclusion of a neither response allowed us to model category structure on the far side of each category, as well as on the side near the boundary. The median values of biasneither were 0.0297 and 0.0310 for the clustered and diverse training groups, respectively (IQRs = 0.0446 and 0.0549). This difference was not significant in a Mann–Whitney test (U = 175, n1 = 20, n2 = 21, p = .36).

The median values of s were 2.72 and 2.26 for the clustered and diverse groups (IQRs = 1.40 and 0.55).

Figure 5. Categorization training and test items and predicted response contours. Markers show training items for the clustered or diverse training groups and novel test items. Solid lines are for the clustered training group, showing the equiprobability category boundary and the 70% categorization contours for each category. Dotted lines show the same information for the diverse training group.


Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 131, n1 = 20, n2 = 21, p = .039). The effect of this difference in s values is illustrated in Figure 6A, which shows the influence of each training item on the classification of a test item as a function of the distance between the two. The rescaled similarity of an exemplar at distance 0.4 (where the two curves are furthest apart) is 20% greater for participants who learned the diverse category than for those who learned the clustered category. Distance 0.4 corresponds to a circle around a training item on Figure 2 or 3, extending not quite one grid unit in each direction.
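For readers who want to reproduce the size of this effect: assuming the standard exponential GCM similarity function, the 20% figure follows directly from the median s values (a quick check, not part of the original analysis):

```python
import math

s_clustered, s_diverse = 2.72, 2.26  # median best-fit sensitivities
d = 0.4  # distance at which the two influence curves differ most

# Relative influence of an exemplar at distance d under each scaling.
ratio = math.exp(-s_diverse * d) / math.exp(-s_clustered * d)
print(f"{100 * (ratio - 1):.0f}% greater")  # about 20% greater
```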

Did the effect of training diversity affect distance scaling for categories C and X equally? The GCM ordinarily assumes that the same scaling applies throughout the stimulus space, but Nosofsky and Johansen (2000) have suggested that category-specific scaling could explain the apparent bias toward diverse categories reported in Rips's (1989) classic pizza-or-coin study. We fit an enhanced regression model that had separate s parameters for categories C and X (for each participant) and found no significant difference in goodness of fit between the enhanced model and the original [χ2(41) = 37, p = .66]. Our results therefore suggest global rescaling across the whole stimulus space and do not indicate category-specific rescaling.

Recognition
The proportions of incorrect responses for various types of test item in the old–new recognition task are summarized in Table 3. Miss rates for training exemplars ranged from 22% to 33%, whereas false alarm rates for most new items were over 50%. The two training groups differed markedly in their responses to peripheral items beyond the outer edge of category X exemplars. People trained on the clustered category mistook these for old items on only 22% of trials, whereas those trained on the diverse category mistook them on 50% of trials.

Analysis of false alarms. Due to the very limited amount of relevant past research, the predictions (outlined above) for the recognition task were very general, so we conducted only a limited set of direct inferential tests. Our analysis pooled novel items into three groups and examined rates of false alarms (FA) for novel items on the category C side of the stimulus space (FA_C), on the category X side (FA_X), and on the boundary between categories (FA_Bndy). A one-way MANOVA was performed on the three false alarm scores, with training diversity (clustered or diverse) as a between-subjects factor. In combination, the false alarm scores were affected by training diversity [F(3,28) = 5.48, p = .004]. Roy–Bargmann stepdown analysis indicated that the significance of the overall effect was due primarily to FA_X: participants trained on the clustered category made fewer false alarms on the category X side of the boundary (adjusted for FA_C and FA_Bndy, mean FA_X = 37.5%, SE = 4.47) than did participants trained on the diverse category (adjusted mean FA_X = 58.4%, SE = 4.47). Inspection of false alarm rates for individual items (Table 3) suggested that the effect was largely attributable to peripheral items.

Table 3
Error Rates in Old–New Recognition Task by Training Diversity, Showing Percentage for Training Exemplars (Old) and Novel Test Items (New) as a Function of Stimulus Region and Type of Novel Item

                                        Training
                                 Clustered      Diverse
Region  Type                      %    SD        %    SD
Old     Control                  25    29       33    30
        Experimental             22    24       29    28
New     Boundary                 58    37       65    35
        Control prototype        83    24       79    24
        Control inside           77    29       58    26
        Control peripheral       58    35       46    34
        Experimental prototype   83    24       77    23
        Experimental inside      69    26       77    26
        Experimental peripheral  22    30       50    36

Figure 6. Generalized context model similarity as a function of training group (clustered or diverse) and distance from an exemplar. Derived from median best-fit values of s for each training group when tested on generalization (A) and recognition (B).

Regression models. The effect of diversity on FA_X could be due to item similarities, since the peripheral items on the category X side were closer to the nearest exemplars of the diverse category than to those of the clustered one. To factor out item similarities, we fit regression models, as described in Appendix B, to predict each participant's responses to the full set of test items. The predicted probability of an old response was a function of the similarity between each test item and the training exemplars of both categories C and X. There were two free parameters of interest: the distance scaling parameter, s, as discussed above, and a response bias, Bnew, that determined the bias (on a scale from 0 to 1) to say that something was new rather than old. Figure 7 shows response contours for the two training groups, based on median values of s and Bnew. The response probability distribution was generally broader and less peaked in the diverse condition than in the clustered condition.
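A rough sketch of this recognition model: familiarity is summed similarity to all training exemplars, in the spirit of exemplar-based recognition models (e.g., Nosofsky, 1988). How exactly Bnew converts to a response criterion is our assumption, not the authors' specification:

```python
import math

def p_old(item, exemplars, s, bias_new):
    """Predicted probability of an 'old' response for one test item.

    exemplars -- all training exemplars of categories C and X combined
    s         -- distance scaling (sensitivity) parameter
    bias_new  -- bias (0-1) to respond 'new' rather than 'old'
    """
    familiarity = sum(math.exp(-s * math.dist(item, e)) for e in exemplars)
    # Convert the 0-1 bias into a response criterion (assumed form):
    # higher bias_new demands more familiarity before an 'old' response.
    criterion = bias_new / (1.0 - bias_new)
    return familiarity / (familiarity + criterion)
```

Lower s (as in the diverse group) flattens the familiarity gradient around each exemplar, which is one way to produce the broader, less peaked response distribution visible in Figure 7.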

The median values of Bnew were 0.23 and 0.33 for the clustered and diverse training groups, respectively (IQRs = 0.26 and 0.17). However, the difference was not statistically significant (U = 81, n1 = 16, n2 = 16, p = .080). Thus, as a nonsignificant trend, participants who learned the clustered category were somewhat less inclined to give new responses (equivalently, they required somewhat less evidence for an old response) than were participants who learned the diverse category.

The median values of s were 2.40 and 1.17 for the clustered and diverse training groups, respectively (IQRs = 1.48 and 1.40). Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 33, n1 = 16, n2 = 16, p = .001). This difference in sensitivity echoes that observed in the generalization study. The effect of the sensitivity difference on recognition is illustrated in Figure 6B. The difference in exemplar influence is greatest at distance 0.65 (about 1 1/3 grid units in Figures 2 and 3), where the influence is 168% greater for participants who learned the diverse category than for those who learned the clustered category.

Once more, there was no significant improvement in fit for a model incorporating separate scaling parameters for categories C and X, compared with the simpler model with a single global scaling parameter [χ2(32) = 23.4, p = .86].

GENERAL DISCUSSION

Our experiments suggest that the diversity of perceptual categories affects learning, generalization, and item recognition by altering the scale of similarities, as well as the positions of exemplars within the stimulus space. People learning a diverse category made more errors and gave slower responses than did people learning a clustered category. After having the same amount of category training, people who learned a diverse category were more likely than those who learned a clustered category to accept distant peripheral items as members of the category. And people who learned a diverse category were more likely to wrongly think that they had previously seen a distant peripheral item than were people who learned a clustered category. The effects of diversity on generalization and recognition were generally consistent with responding on the basis of item similarities. However, regression modeling with the GCM revealed a further effect: Participants' perception of the stimuli was altered, so that people who learned a diverse category required a greater physical difference to perceive the same psychological difference between stimuli. This perceptual rescaling effect led to lower levels of accuracy in the training task, wider generalization, and poorer item recognition, over and above effects that could be attributed to item similarities.

This study sought to distinguish different mechanisms that might produce diversity effects, and the use of three different but related tasks makes the present study particularly informative. There are numerous accounts of category learning in the literature that could explain the reduced learning accuracy we observed in the diverse condition but that do not simultaneously explain the effects we also observed in generalization and recognition. For example, Fried and Holyoak's (1984) category density model represents categories by their means (prototypes) and variances, estimated from a sample of exemplars. Diverse categories are learned more slowly than are clustered ones because estimates of the means and variances converge more slowly when the stimuli are less uniform. However, because the category density model is concerned only with categorization, it has nothing at all to say about the diversity effects we observed in old–new recognition. And although the category density model predicts that diversity will affect generalization, the pattern of responses in our data is predicted much better by exemplar similarities than by similarities to the category prototypes (cf. J. D. Smith & Minda, 2002; see Appendixes A and B for details).

Figure 7. Recognition test items and predicted response contours. Markers show training items for the clustered or diverse training groups and novel test items. Solid lines are for the clustered training condition, showing 50% (outer) and 70% (inner) recognition contours. Dotted lines show the same information for the diverse training group.

Diversity effects could plausibly arise from item similarities, shifts in category boundaries (as reflected in categorization biases), or rescaling of similarity relations. A combination of methodological factors contributed to our ability to distinguish these different mechanisms. Regression models allowed us to separate out item similarities, to simultaneously consider response probabilities across a large set of test items (including old and new items, peripheral and internal as well as boundary items), and, in the generalization study, to simultaneously consider the distribution of responses across all three choices. We would not have detected the global rescaling effect if we had examined just a few critical items on the boundary between categories, because the effect of the diversity manipulation on individual items was quite small after item similarities were factored out. In the generalization task, the neither response increased the information obtained on each trial (just as a four-choice multiple choice test is more informative than the same number of true/false questions) and made it possible to detect changes in category structure on the far sides of the categories.

Most important, the neither response made it possible in principle to distinguish between changes in category biases and rescaling of similarities. We found no evidence that similarity relations were category specific, with different scaling factors for different regions of stimulus space according to the variability of categories in those regions. Our data also showed no evidence that response biases varied as a function of category variability. Rather, there was a global change in sensitivity to psychological distance in response to increased variability. This finding is consistent with the effect of stimulus range observed by Braida and Durlach (1972) in studies of sound intensity perception. Our results show that the effect is not confined to unidimensional stimuli and also demonstrate that it affects standard category learning and generalization tasks, as well as item recognition.

With regard specifically to generalization, our modeling demonstrates that exemplar similarities can explain the seemingly different results observed in previous studies for central, peripheral, and boundary items. Only a few previous studies have controlled exemplar similarities in their tests of diversity effects on generalization (Cohen et al., 2001; Rips, 1989; Stewart & Chater, 2002). These studies focused on boundary items between a clustered category and a diverse category. Rips's pizza-or-coin study found that judgments of category membership differed from judgments of category similarity and, in particular, that categorization of boundary items was biased in favor of diverse categories rather than clustered categories. Rips concluded that because judgments for the categorization task differed from those for the similarity task, categorization could not be based on similarity. Nosofsky and colleagues have suggested alternative interpretations within the framework of the GCM (Cohen et al., 2001; Nosofsky & Johansen, 2000). Moreover, our own results and those of numerous other studies (e.g., Shin & Nosofsky, 1992; see also works cited in Nosofsky & Johansen, 2000) show that similarity scaling can vary markedly between different tasks. This factor alone could explain the differences that Rips observed between judgments of category membership and category similarity.

The same confound applies to Cohen et al.'s (2001) Experiment 1, which contrasted performance on identification and categorization tasks. On the basis of a single test item, Cohen et al. concluded that high-diversity categories exert a "pull" on category judgments. However, their results could equally be explained by global effects of the sort observed in the present study, that is, by similarity scaling throughout the stimulus space, adjusted according to the particular task as well as the overall diversity of the stimulus items. In contrast, Cohen et al.'s Experiment 2 provides more compelling evidence for a category-specific effect of diversity on categorization. However, because their analysis focused on a single boundary item, there is no way to tell whether the category-specific effect involved response biases or category-specific rescaling of similarities. These different mechanisms can only be distinguished by examining responses across a wider range of stimuli.

Finally, our finding that category diversity affects global similarity relations has implications for the interpretation of Stewart and Chater's (2002) results. Their analyses of individual data indicated that, for most participants, boundary items were less likely to be classified as members of an experimental category with high diversity than with low diversity. On the basis of these results, Stewart and Chater suggest that response biases might be sensitive to category diversity. Alternatively, their results could reflect the same kind of global rescaling of similarities that we observed in the present study.

Although our results point to a global effect of diversity that affects the whole stimulus space in learning, generalization, and recognition tasks, they do not identify what aspect of diversity is relevant (nor have previous studies of diversity effects in categorization). Participants might be sensitive to the total range of stimuli (Braida & Durlach, 1972), or they might be sensitive to average absolute deviations from the grand mean, or average variance about category means, or some other specific measure of variability. Only detailed comparisons across various stimulus distributions will eventually distinguish among these possibilities.

REFERENCES

Ashby, F. G., & Maddox, W. T. (1990). Integrating information from separable psychological dimensions. Journal of Experimental Psychology: Human Perception & Performance, 16, 598-612.

Barsalou, L. W. (1983). Ad hoc categories. Memory & Cognition, 11, 211-227.

Braida, L. D., & Durlach, N. I. (1972). Intensity perception: II. Resolution in one-interval paradigms. Journal of the Acoustical Society of America, 51, 483-502.

Cohen, A. L., Nosofsky, R. M., & Zaki, S. R. (2001). Category variability, exemplar similarity, and perceptual classification. Memory & Cognition, 29, 1165-1175.

Corter, J. E., & Gluck, M. A. (1992). Explaining basic categories: Feature predictability and information. Psychological Bulletin, 111, 291-303.

Flannagan, M. J., Fried, L. S., & Holyoak, K. J. (1986). Distributional expectations and the induction of category structure. Journal of Experimental Psychology: Learning, Memory, & Cognition, 12, 241-256.

Fried, L. S., & Holyoak, K. J. (1984). Induction of category distributions: A framework for classification learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 234-257.

Hahn, U., & Ramscar, M. C. A. (2001). Mere similarity. In U. Hahn & M. C. A. Ramscar (Eds.), Similarity and categorization (pp. 257-272). Oxford: Oxford University Press.

Heit, E., & Hahn, U. (2001). Diversity-based reasoning in children. Cognitive Psychology, 43, 243-273.

Homa, D., & Vosburgh, R. (1976). Category breadth and the abstraction of prototypical information. Journal of Experimental Psychology: Human Learning & Memory, 2, 322-330.

Jones, G. V. (1983). Identifying basic categories. Psychological Bulletin, 94, 423-428.

Luce, R. D. (1959). Individual choice behavior: A theoretical analysis. New York: Wiley.

Mareschal, D., French, R. M., & Quinn, P. C. (2000). A connectionist account of asymmetric category learning in early infancy. Developmental Psychology, 36, 635-645.

McLaren, I. P. L. (1997). Categorization and perceptual learning: An analogue of the face inversion effect. Quarterly Journal of Experimental Psychology, 50A, 257-273.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Murphy, G. L., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289-316.

Neumann, P. G. (1977). Visual prototype formation with discontinuous representation of dimensions of variability. Memory & Cognition, 5, 187-197.

Nosofsky, R. M. (1984). Choice, similarity, and the context theory of classification. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 104-114.

Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.

Nosofsky, R. M. (1988). Similarity, frequency, and category representations. Journal of Experimental Psychology: Learning, Memory, & Cognition, 14, 54-65.

Nosofsky, R. M., & Johansen, M. K. (2000). Exemplar-based accounts of "multiple-system" phenomena in perceptual categorization. Psychonomic Bulletin & Review, 7, 375-402.

Osherson, D. N., Smith, E. E., Wilkie, O., Lopez, A., & Shafir, E. (1990). Category-based induction. Psychological Review, 97, 185-200.

Palmeri, T. J., & Nosofsky, R. M. (2001). Central tendencies, extreme points, and prototype enhancement effects in ill-defined perceptual categorization. Quarterly Journal of Experimental Psychology, 54A, 197-235.

Peterson, M. J., Meagher, R. B., Jr., Chait, H., & Gillie, S. (1973). The abstraction and generalization of dot patterns. Cognitive Psychology, 4, 378-398.

Posner, M. I., Goldsmith, R., & Welton, K. E., Jr. (1967). Perceived distance and the classification of distorted patterns. Journal of Experimental Psychology, 73, 28-38.

Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353-363.

Quinn, P. C., Eimas, P. D., & Rosenkrantz, S. L. (1993). Evidence for representations of perceptually similar natural categories by 3-month-old and 4-month-old infants. Perception, 22, 463-475.

Rips, L. J. (1989). Similarity, typicality, and categorization. In S. Vosniadou & A. Ortony (Eds.), Similarity and analogical reasoning (pp. 21-59). New York: Cambridge University Press.

Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382-439.

Shin, H. J., & Nosofsky, R. M. (1992). Similarity-scaling studies of dot-pattern classification and recognition. Journal of Experimental Psychology: General, 121, 278-304.

Smith, E. E., & Sloman, S. A. (1994). Similarity- versus rule-based categorization. Memory & Cognition, 22, 377-386.

Smith, J. D., & Minda, J. P. (2002). Distinguishing prototype-based and exemplar-based processes in dot-pattern category learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 800-811.

Stewart, N., & Chater, N. (2002). The effect of category variability in perceptual categorization. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 893-907.

Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins.

Wayne, A. (1995). Bayesianism and diverse evidence. Philosophy of Science, 62, 111-121.

Younger, B. A. (1985). The segregation of items into categories by ten-month-old infants. Child Development, 56, 1574-1583.

NOTES

1. These studies all involved nonnormal distributions of exemplars, which were generally sampled from donut-shaped regions of stimulus space. The effect of changes in the variance of normally distributed training exemplars has received little, if any, experimental attention.

2. The MANOVA computes an optimal linear combination of several dependent variables (DVs) and computes an ANOVA on the resulting composite DV. If the original DVs are correlated (positively or negatively), the MANOVA protects against the inflated Type I error rate produced by separate tests of the DVs (see Tabachnick & Fidell, 1996).

3. Thanks to Dorrit Billman for suggesting this explanation.

CATEGORY DIVERSITY 301

APPENDIX A
Details of Modeling for Generalization Data

According to the generalized context model (GCM), classification decisions are based on similarity comparisons between a test item and individual exemplars of each category stored in memory. In our generalization task, participants judged whether item i was a member of category C, category X, or neither. The GCM predicts the probability of a category C response on the basis of the total weighted similarity of the test item to exemplars of category C, divided by the weighted similarity of the item to exemplars of both categories plus a constant response threshold for neither responses:

\[ p(\mathrm{C} \mid i) = \frac{\mathit{bias}_C \cdot \sum_{j \in C} e^{-s \cdot d_{ij}}}{\mathit{bias}_C \cdot \sum_{j \in C} e^{-s \cdot d_{ij}} + \mathit{bias}_X \cdot \sum_{l \in X} e^{-s \cdot d_{il}} + K} \tag{A1} \]

The term d_ij is the distance between stimulus i and exemplar j, computed as a weighted Euclidean distance:

\[ d_{ij} = \sqrt{w_x (x_i - x_j)^2 + w_y (y_i - y_j)^2} \tag{A2} \]

where the xs and ys refer to head area and stem height measurements, respectively, and the dimension weights w_x and w_y sum to 1. Bias_C and bias_X are response bias parameters, and s is a scaling parameter that determines the sensitivity of response probabilities to a given change in distance. The probability of a category X or neither response is computed in a similar way, by replacing the numerator of Equation A1 with the appropriate term from the denominator.

Without loss of generality, we factor the response threshold K into the product of a bias parameter and the average total similarity of each exemplar to all the others, E. Thus, K = bias_neither · E, where

\[ E = \frac{1}{N} \sum_{j} \sum_{l \neq j} e^{-s \cdot d_{jl}} \tag{A3} \]

N is the total number of exemplars. Higher values of bias_neither produce more neither responses. There are two key advantages of this treatment of the neither response strength. First, the effect of the parameter bias_neither on goodness of fit is largely orthogonal to the other parameters of the model, including the scaling parameter s; as a consequence, regressions converge more quickly than with an undifferentiated K response strength parameter. Second, bias_neither is independent of the number of exemplars and the average similarity among them, so meaningful comparisons can be made between best-fit values of bias_neither obtained across different stimulus sets. In contrast, such comparisons with the undifferentiated K are generally not meaningful.

Also without loss of generality, we restrict the bias parameters (bias_C, bias_X, and bias_neither) to the range from 0 to 1 and require them to sum to 1 (there are only two degrees of freedom among these three parameters). Finally, we are usually interested in the relative magnitudes of bias_C and bias_X compared with each other, so we define bias_C = bias_C|CX · (1 − bias_neither) and bias_X = bias_X|CX · (1 − bias_neither). The parameters bias_C|CX and bias_X|CX determine the relative magnitudes of the two category response biases. They range from 0 to 1 and sum to 1 (having only one degree of freedom between them).

The full model defined above is fully specified with reference to four independent parameters: s, w_x, bias_C|CX, and bias_neither. We used this model to predict participants' classification responses, averaged across the three presentations of each flower (once in each color scheme). Thus, for each of the 23 test flowers, the model predicted the proportions of category C, category X, and neither responses out of three opportunities. Models were fit separately to the data for each participant by minimizing a likelihood-ratio chi-square goodness-of-fit statistic over the observed (O) and expected (E) response proportions:

\[ G = 2 \sum O \ln(O / E) \]
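As an illustration, Equations A1-A3 can be sketched in a few lines of Python. This is an illustrative reimplementation of the model as described above, not the authors' fitting code; the function name, placeholder stimuli, and parameter values are ours.

```python
import math

def gcm_classify(test, exemplars_C, exemplars_X, s, w_x, bias_C_CX, bias_neither):
    """Predict P(C), P(X), P(neither) for one test item (cf. Equations A1-A3).

    Items are (head_area, stem_height) pairs. All inputs here are
    illustrative placeholders, not the paper's fitted values.
    """
    def dist(a, b):  # Equation A2: weighted Euclidean distance
        return math.sqrt(w_x * (a[0] - b[0])**2 + (1 - w_x) * (a[1] - b[1])**2)

    # Total similarity of the test item to each category's exemplars
    sim_C = sum(math.exp(-s * dist(test, e)) for e in exemplars_C)
    sim_X = sum(math.exp(-s * dist(test, e)) for e in exemplars_X)

    # Equation A3: average total similarity of each exemplar to all others
    all_ex = exemplars_C + exemplars_X
    N = len(all_ex)
    E = sum(math.exp(-s * dist(all_ex[j], all_ex[l]))
            for j in range(N) for l in range(N) if j != l) / N

    # Biases: bias_C + bias_X + bias_neither = 1; K = bias_neither * E
    bias_C = bias_C_CX * (1 - bias_neither)
    bias_X = (1 - bias_C_CX) * (1 - bias_neither)
    K = bias_neither * E

    denom = bias_C * sim_C + bias_X * sim_X + K  # denominator of Equation A1
    return (bias_C * sim_C / denom, bias_X * sim_X / denom, K / denom)
```

With arbitrary exemplars and parameters, the three returned probabilities sum to 1, and a test item lying on a category C exemplar is classified as C with high probability.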

To obtain the most accurate tests of parameter values, we eliminated parameters that did not make a significant contribution to goodness of fit. The difference in G obtained for a particular participant by two regression models that differ by the presence or absence of a single parameter can be tested against the chi-square distribution with one degree of freedom. Chi-square difference scores can be summed across all participants to obtain a composite statistic that can be tested against the chi-square distribution with degrees of freedom equal to the number of participants. Parameters that did not achieve a significance level of p < .10 on such tests were eliminated to simplify the model, as described below.
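A minimal sketch of this fitting criterion and the nested-model logic, assuming response counts as inputs (helper names and the example numbers are ours):

```python
import math

def g_statistic(observed, expected):
    """Likelihood-ratio goodness of fit: G = 2 * sum(O * ln(O / E)).

    Cells with O == 0 contribute nothing to the sum.
    """
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

# Nested-model comparison: the increase in G when a single parameter is
# removed is tested against chi-square with 1 df per participant
# (summed over participants, df = number of participants).
G_full = g_statistic([3, 0, 0], [2.8, 0.1, 0.1])     # hypothetical full model
G_reduced = g_statistic([3, 0, 0], [1.0, 1.0, 1.0])  # one parameter removed
delta_G = G_reduced - G_full   # compare with the chi-square(1) critical value
```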

For the full model, with four free parameters for each participant, the goodness of fit was G(164) = 604. Setting s to 0 resulted in a significantly poorer fit to the data [χ²(82) = 920, p < .001; when s = 0, the parameter w_x has no effect, so there are just two degrees of freedom per participant]. This result indicates that participants' judgments were related to the distances between test and training items (i.e., those distances gave better predictions of responses than assuming that all items were treated identically). Setting w_x to 0.5 (equal weighting for both dimensions) did not significantly reduce fitness in comparison with the full model [χ²(41) = 48, p = .22]. Setting bias_C|CX to 0.5 (equal bias for both categories) resulted in a significantly poorer


302 HAHN BAILEY AND ELVIN

APPENDIX A (Continued)

fit compared with the full model [χ²(41) = 81, p < .001]. The neither response bias, bias_neither, is required, since participants did in fact avail themselves of the option to respond neither. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. Accordingly, the dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group, rather than varying freely among all participants within each group. Setting s to the median value of each group resulted in a significantly poorer fit to the data, compared with including a separate free parameter for each participant [χ²(39) = 58, p = .027]. Thus, distance scaling varied within groups. Setting bias_C|CX to the median value within each group did not result in a significantly poorer fit, compared with including separate parameters for each participant [χ²(39) = 41, p = .37]. A further comparison found that setting this parameter to the overall median value (bias_C|CX = 0.68) fit the data just as well as the model that used the median of each training group [χ²(1) = 0.21, p = .65]. Setting bias_neither to the median value of each group resulted in a significantly poorer fit to the data [χ²(39) = 74, p < .001]. This result indicates that the bias to respond neither varied among participants. Taken together, these results indicate that s and bias_neither are required for each participant; bias_C|CX did not vary among participants.

In response to a reviewer's query, we also implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average, with separate scaling parameters s for exemplar and prototype similarities. The pure prototype model was much worse than the pure instance model (goodness of fit G = 1,597 and 693, respectively). The best-fitting mixture model fit only slightly better than the pure exemplar model, and the difference was not significant [χ²(82) = 29.7, p > .1]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.

APPENDIX B
Details of Modeling for Recognition Data

In our recognition task, participants judged whether item i was old or new. The GCM predicts the probability of an old response on the basis of the total weighted similarity of the test item to all exemplars, divided by this similarity plus a constant response threshold for new responses:

\[ p(\mathrm{old} \mid i) = \frac{B_{\mathrm{old}} \cdot \sum_{j} e^{-s \cdot d_{ij}}}{B_{\mathrm{old}} \cdot \sum_{j} e^{-s \cdot d_{ij}} + K} \tag{B1} \]

The term d_ij is the Euclidean distance between stimulus i and exemplar j, as in Appendix A. Following Appendix A, we factor K into the product of the average similarity of each exemplar to all the others, E, multiplied by a bias term. Thus, K = B_new · E. Higher values of B_new produce more new responses. B_old and B_new range from 0 to 1 and sum to 1 (there is one degree of freedom between them). This full model includes three independent parameters: s, w_x, and B_new. For each of the 31 test flowers, the model predicted the proportions of old and new responses out of three opportunities (once in each color scheme). Models were fit separately to the data for each participant.
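Equation B1 can be sketched analogously to the Appendix A model. Again this is an illustrative reimplementation with names of our choosing, assuming equal dimension weights (w_x = 0.5), as in the final model:

```python
import math

def gcm_recognize(test, exemplars, s, B_new):
    """P(old) per Equation B1: similarity to all exemplars vs. threshold K.

    K = B_new * E, where E is the average total similarity of each
    exemplar to all the others, as in Appendix A.
    """
    def dist(a, b):  # Euclidean distance with equal dimension weights
        return math.sqrt(0.5 * (a[0] - b[0])**2 + 0.5 * (a[1] - b[1])**2)

    sim = sum(math.exp(-s * dist(test, e)) for e in exemplars)
    N = len(exemplars)
    E = sum(math.exp(-s * dist(exemplars[j], exemplars[l]))
            for j in range(N) for l in range(N) if j != l) / N
    K = B_new * E
    B_old = 1 - B_new
    return B_old * sim / (B_old * sim + K)
```

A test item that coincides with a training exemplar yields a higher predicted P(old) than a distant novel item, as expected.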

Again, we eliminated parameters that did not make a significant contribution to goodness of fit. Setting s to 0 resulted in significantly worse fits to the data, compared with the full model [χ²(64) = 159, p < .001]. Setting w_x to 0.5 did not significantly reduce the fit of the model [χ²(32) = 24, p = .85]. B_new is required in order to predict any new judgments, so this parameter cannot be removed or set to 0. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. The dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group. The fit of the model with one s parameter for each group of participants was not significantly worse than the fit of the model with separate parameters for every participant [χ²(30) = 26, p = .69]. A further comparison found that setting s to the overall median of all participants resulted in a significantly worse fit to the data, compared with separate values for each training group [χ²(1) = 11.5, p < .001]. This result indicates that the diversity manipulation affected s. The fit of the model with one B_new parameter for each group was significantly worse than the model with separate parameters for each participant [χ²(30) = 86, p < .001]. This result indicates that the bias to respond new varied among participants.

Again, we implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average. The pure prototype model was substantially worse than the pure instance model (goodness of fit G = 546 and 411, respectively). The best-fitting mixture model fit only slightly better than did the pure exemplar model, and the difference was not significant [χ²(64) = 14.2, p > .1]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.

(Manuscript received December 12, 2002; revision accepted for publication June 18, 2004.)



Thirteen novel flowers for the generalization and recognition tasks were located as in Figure 3. We included more peripheral items for category X in anticipation of the diversity manipulation affecting it more than category C. The novel items for category C occupied mirror-image coordinates with respect to corresponding items for category X. All 13 novel flowers were used in the generalization task, along with training exemplars A′, D′, F′, H′, and I′ from category C and the corresponding exemplars from either XC or XD, as appropriate. The recognition task used the novel flowers plus the exemplars of category C and either XC or XD, as appropriate.

Three different flower head color schemes were used to create a task-irrelevant dimension of variation, with three variants of each flower. During training, each participant saw each flower the same number of times in each color scheme. In the generalization and recognition tasks, each flower occurred once in each color scheme. In addition to reducing ceiling effects during training, the color variations allowed us to probe recognition memory for each flower three times (once in each color scheme) without actually presenting any specific stimulus more than once.

Procedure
The participants were tested individually in a quiet room. First, the experimenter read instructions for the learning task aloud. The participants were encouraged to respond quickly but accurately. A computer controlled by a SuperLab script presented stimuli on a display screen and recorded the participants' keypresses and latencies. During training, the participants classified flowers into a "gold" category (category C) and a "silver" category (category X) by pressing keys labeled with an appropriately colored star. After each response, the flower image was replaced with feedback saying either CORRECT or INCORRECT for 500 msec before the next flower was presented. There were 15 blocks of training trials, with each of the 18 training flowers presented once per block in a different random order. Random orders were matched across participants in the two training conditions. Colors were randomized, but each flower appeared five times in each color scheme. The participants had a short break after every three blocks.

After training, each participant performed either the generalization or old–new recognition task. Both tasks began with instructions read aloud by the experimenter. In the generalization task, the participants classified flowers into the categories "gold," "silver," or "neither" (a blue sticker marked the key for a neither response). There were three blocks of generalization trials, with 13 novel flowers and 10 old flowers in each. Colors were randomized, with each flower appearing once in each color scheme. In the old–new recognition task, the participants decided whether each flower had appeared during the training task and responded by pressing keys labeled OLD or NEW. There were three blocks of recognition trials, with 13 novel and 18 old flowers in each. Again, colors were randomized, with each flower appearing once in each color scheme. The participants received no feedback during either the generalization or recognition tasks. Random orders were matched across participants in the clustered and diverse training conditions. After each classification or old–new response, the test flower was replaced with the question "How confident are you?", and the participants responded on a scale from 1 (low confidence) to 9 (high confidence). The participants had the opportunity for a short break between each block.

RESULTS

Learning
The first block of trials was excluded from analysis because it would primarily reflect guessing. Participants learning the diverse category generally made more errors than those learning the clustered category, as shown in Figure 4. Inferential tests were based on three measures of difficulty, computed for each participant: the first training block completed with no errors, the total number of errors made, and the mean reaction time (RT) for correct responses. Here and throughout, two-tailed tests are reported, and significance is evaluated at an alpha level of .05.

In computing the first error-free training block (from 2 to 15), two participants learning the clustered category and 10 learning the diverse one did not achieve an error-free block. These participants were assigned scores of 16. Scores for the first error-free block did not satisfy parametric assumptions, so this variable was analyzed separately from errors and RTs. Participants learning the clustered category achieved their first error-free block earlier in training [median = 4.0, IQR (interquartile range) = 6.75] than did participants learning the diverse

Figure 4. Fraction of error responses during category learning, as a function of training group (clustered or diverse), category (C or X), and block of training.


category (median = 9.0, IQR = 12.0). This difference was significant in a Mann–Whitney test (U = 446.0, n1 = 36, n2 = 37, p = .014).

The total numbers of errors were computed separately for the control and experimental categories, and a square root transform was applied to the totals for each participant to reduce skewness, kurtosis, and heterogeneity of variance. The mean root errors were 2.62 and 2.77 for the clustered condition, categories C and X, respectively, and 3.24 and 3.60 for the diverse condition (SE = 0.238, 0.189, 0.206, and 0.226, respectively). These mean values are detransformed and expressed as error rates in Table 1. RTs for correct responses were log transformed. Three outlying RTs (out of 16,765), less than 200 msec or greater than 5,000 msec, were excluded from analysis. The mean log RTs were 2.775 and 2.779 for the clustered condition, categories C and X, and 2.836 and 2.848 for the diverse condition (SE = 0.016, 0.016, 0.014, and 0.015, respectively). These mean values are detransformed and shown in Table 1. A two-way multivariate analysis of variance (MANOVA) was performed on root errors and log reaction times, with training diversity (clustered or diverse) as a between-subjects factor and stimulus category (control or experimental) as a within-subjects factor.2 Here and throughout, there were no multivariate outliers within levels of training diversity (p < .001). This means that the other statistics from these MANOVAs are not unduly influenced by a few unrepresentative data points.
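The detransformations described above can be reproduced directly. The trial count per category (14 scored blocks × 9 exemplars = 126) is our inference from the Method section, not a figure stated in this passage:

```python
# Square-root-transformed error means are detransformed by squaring,
# then expressed as a percentage of scored trials (assumed 14 x 9 = 126).
mean_root_errors = 2.62                              # clustered condition, category C
error_rate = 100 * mean_root_errors**2 / (14 * 9)    # ~5.4%, cf. Table 1

# Log-transformed RT means are detransformed with 10**x (RTs in msec).
mean_log_rt = 2.775                                  # clustered condition, category C
rt_ms = 10 ** mean_log_rt                            # ~596 msec, cf. Table 1
```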

In combination, errors and RTs were affected by category diversity [F(2,70) = 7.67, p < .001]. People learning the clustered category made fewer errors and gave faster responses than did people learning the diverse category. There was no significant difference between the control and experimental categories [F(2,70) = 1.62, p = .21], nor an interaction between diversity condition and category [F(2,70) < 1]. Thus, the diversity of category X affected responses to both categories C and X equally.

Errors and RTs made independent contributions to the combined differences between training groups, as indicated by a unique effect for each variable after factoring out the other in Roy–Bargmann stepdown analyses [Fs(1,70) > 5.24, ps < .025]. In univariate ANOVAs, the main effect of category diversity was significant both for errors [F(1,71) = 7.78, p = .007] and for RTs [F(1,71) = 9.51, p = .003].

Generalization
Average response percentages for various types of test item in the category generalization task are shown in Table 2. Asterisks identify those entries that are relevant to the predicted diversity effects we outlined above. To limit the family-wise error rate, we tested only a few key contrasts from this table and followed up with regression analyses encompassing the whole of the data. Confidence ratings for the generalization and recognition tasks have not yet been analyzed and are not reported.

Table 1
Summary Measures of Training Difficulty, Including Error Rates and Reaction Times (RTs, in Milliseconds) for Correct Responses, as a Function of Training Group (Clustered or Diverse) and Test Category (C or X)

             Error Rate (%)              RTs
Training     Category C   Category X    Category C   Category X
Clustered        5.5          6.1           595          601
Diverse          8.3         10.3           685          705

Table 2
Distribution of Generalization Responses, Showing Percentage of Each Response Alternative as a Function of Stimulus Region and Type of Novel Item

                                                  Response
                                         Control      Experimental    Neither
Stimulus Region   Item Type   Training    %   SD        %   SD        %   SD
Boundary          Boundary    Clustered  54    5       18    4       28    4
                              Diverse    53    5       23    4       24    4
Control           Prototype   Clustered  95    3        2    2        3    2
                              Diverse    95    3        0    0        5    3
                  Inside      Clustered  97    2        0    0        3    2
                              Diverse    95    3        2    2        3    2
                  Peripheral  Clustered  90    6        5    5        5    4
                              Diverse    87    7        0    0       13    7
Experimental      Prototype   Clustered   2    2       88    4       10    4
                              Diverse     2    2       76    7       22    7
                  Inside      Clustered   0    0       92    3        8    3
                              Diverse     2    2       79    7       19    7
                  Peripheral  Clustered   6    2       50    4       44    4
                              Diverse     8    2       66    4       26    3
Overall                       Clustered  37    3       38    3       26    2
                              Diverse    37    3       43    3       21    2

*Entries relevant to predicted diversity effects outlined in text.


Analysis of response fractions. Our initial analysis of generalization examined four measures: the fraction of neither responses across all novel items (neither); the fraction of category X responses for novel boundary items (boundaryX); the fraction of category X responses for novel items on the periphery of category X (periphX); and the fraction of category X responses for the unseen category X prototype (protoX). The protoX scores were near ceiling and could not be analyzed parametrically. Although protoX scores were slightly lower for participants trained on the diverse category compared with the clustered category, this difference between groups was not significant [Mann–Whitney (corrected for ties), Z = 1.16, n1 = 20, n2 = 21, p = .25].

A one-way MANOVA was performed on the variables neither, boundaryX, and periphX, with training diversity (clustered or diverse) as a between-subjects factor. In combination, these variables were affected by training diversity [F(3,37) = 3.19, p = .035]. People trained on the diverse category gave fewer neither responses overall, more category X responses for boundary items, and more category X responses for items on the periphery of category X, as predicted. Roy–Bargmann stepdown analysis (details not reported here) indicated that the significance of the overall effect was due primarily to the periphX variable: people trained on the diverse category gave fewer neither responses to items on the periphery of category X and classified them as members of category X instead.

Regression models. In our stimuli, similarities between test and training items necessarily varied as a function of training diversity, so it is possible that the effect of diversity on periphX, for example, reflects the fact that peripheral items were closer to some of the diverse training exemplars than to the corresponding clustered exemplars. To factor out the effect of item similarities and to test for a general effect of variability, we fit regression models, as described in Appendix A, to predict each participant's responses to the full set of test items (both old and new). The predicted probability of a particular response was a function of the similarity between each test item and the training exemplars of categories C and X. The full model included three parameters of interest: (1) a distance scaling parameter, s, that determined the sensitivity of response probabilities to a given change in head area or stem height; (2) a category C relative response bias, bias_C|CX, that determined the bias (on a scale from 0 to 1) to classify something as category C rather than X; and (3) a neither response bias, bias_neither, that determined the bias to say that something belonged to neither category C nor X. Changes in s and bias_neither can have superficially similar effects on predicted response probabilities, but a change in bias_neither will not affect the ratio of C to X responses, whereas a change in sensitivity will. The two parameters consequently have distinguishable effects on regression fits.

Goodness-of-fit statistics for models with and without separate bias_C|CX parameters for each participant indicated that this parameter did not vary significantly within training groups [χ²(39) = 41, p = .37], nor between training groups [χ²(1) = 0.21, p = .65]. Subsequent analyses fixed this parameter at the overall median value, which favored category C over X by a ratio of more than 2:1. This level of bias means that an item that was equally similar to both categories would be classified as a member of category C twice as often as category X. This strong bias may reflect participants' expectation that the two categories would occur equally often, whereas there was a preponderance of novel items on the category X side of the stimulus space.3

With bias_C|CX held at the overall median value, we determined best-fit values of s and bias_neither for each participant. Figure 5 shows response contours for the two training groups, based on median values of s and bias_neither. As the 70% contours show, the inclusion of a neither response allowed us to model category structure on the far side of each category, as well as on the side near the boundary. The median values of bias_neither were 0.0297 and 0.0310 for the clustered and diverse training groups, respectively (IQRs = 0.0446 and 0.0549). This difference was not significant in a Mann–Whitney test (U = 175, n1 = 20, n2 = 21, p = .36).

The median values of s were 2.72 and 2.26 for the clustered and diverse groups (IQRs = 1.40 and 0.55).

Figure 5. Categorization training and test items and predicted response contours. Markers show training items for clustered () or diverse () training groups, and novel test items (). Solid lines are for the clustered training group, showing the equiprobability category boundary and the 70% categorization contours for each category. Dotted lines show the same information for the diverse training group.


Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 131, n1 = 20, n2 = 21, p = .039). The effect of this difference in s values is illustrated in Figure 6A, which shows the influence of each training item on the classification of a test item as a function of the distance between the two. The rescaled similarity of an exemplar at distance 0.4 (where the two curves are furthest apart) is 20% greater for participants who learned the diverse category than for those who learned the clustered category. Distance 0.4 corresponds to a circle around a training item on Figure 2 or 3, extending not quite one grid unit in each direction.
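The 20% figure can be reproduced from the median s values reported above. The expression for the distance of maximal difference is our own derivation (set the derivative of the gap between the two similarity curves to zero); it is not given in the text:

```python
import math

# Median best-fit distance-scaling values reported in the text
s_clustered, s_diverse = 2.72, 2.26

def similarity(s, d):
    """GCM influence of an exemplar at distance d: exp(-s * d)."""
    return math.exp(-s * d)

# At distance 0.4, the diverse group's similarity is about 20% greater
ratio = similarity(s_diverse, 0.4) / similarity(s_clustered, 0.4)  # ~1.20

# Distance at which the two similarity curves differ most:
# maximize exp(-s2*d) - exp(-s1*d)  =>  d* = ln(s1/s2) / (s1 - s2)
d_star = math.log(s_clustered / s_diverse) / (s_clustered - s_diverse)  # ~0.40
```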

Did the effect of training diversity affect distance scaling for categories C and X equally? The GCM ordinarily assumes that the same scaling applies throughout the stimulus space, but Nosofsky and Johansen (2000) have suggested that category-specific scaling could explain the apparent bias toward diverse categories reported in Rips's (1989) classic pizza-or-coin study. We fit an enhanced regression model that had separate s parameters for categories C and X (for each participant) and found no significant difference in goodness of fit between the enhanced model and the original [χ²(41) = 37, p = .66]. Our results therefore suggest global rescaling across the whole stimulus space and do not indicate category-specific rescaling.

Recognition
The proportions of incorrect responses for various types of test item in the old–new recognition task are summarized in Table 3. Miss rates for training exemplars ranged from 22% to 33%, whereas false alarm rates for most new items were over 50%. The two training groups differed markedly in their responses to peripheral items beyond the outer edge of category X exemplars. People trained on the clustered category mistook these for old items on only 22% of trials, whereas those trained on the diverse category mistook them on 50% of trials.

Analysis of false alarms. Due to the very limited amount of relevant past research, the predictions (outlined above) for the recognition task were very general, so we conducted only a limited set of direct inferential tests. Our analysis pooled novel items into three groups and examined rates of false alarms (FA) for novel items on the category C side of the stimulus space (FA_C), the category X side (FA_X), and on the boundary between categories (FA_Bndy). A one-way MANOVA was performed on the three false alarm scores, with training diversity (clustered or diverse) as a between-subjects factor. In combination, the false alarm scores were affected by training diversity [F(3,28) = 5.48, p = .004]. Roy–Bargmann stepdown analysis indicated that the significance of the overall effect was due primarily to FA_X: par-

Table 3
Error Rates in Old–New Recognition Task, by Training Diversity, Showing Percentage for Training Exemplars (Old) and Novel Test Items (New) as a Function of Stimulus Region and Type of Novel Item

                                           Training
                                   Clustered       Diverse
       Region        Type           %    SD         %    SD
Old    Control                     25    29        33    30
       Experimental                22    24        29    28
New    Boundary      Boundary      58    37        65    35
       Control       Prototype     83    24        79    24
                     Inside        77    29        58    26
                     Peripheral    58    35        46    34
       Experimental  Prototype     83    24        77    23
                     Inside        69    26        77    26
                     Peripheral    22    30        50    36

Figure 6. Generalized context model similarity as a function of training group (clustered or diverse) and distance from an exemplar. Derived from median best-fit values of s for each training group when tested on generalization (A) and recognition (B).


ticipants trained on the clustered category made fewer false alarms on the category X side of the boundary (adjusted for FA_C and FA_Bndy, mean FA_X = 37.5%, SE = 4.47) than did participants trained on the diverse category (adjusted mean FA_X = 58.4%, SE = 4.47). Inspection of false alarm rates for individual items (Table 3) suggested that the effect was largely attributable to peripheral items.

Regression models. The effect of diversity on FA_X could be due to item similarities, since the peripheral items on the category X side were closer to the nearest exemplars of the diverse category than to those of the clustered one. To factor out item similarities, we fit regression models, as described in Appendix B, to predict each participant's responses to the full set of test items. The predicted probability of an old response was a function of the similarity between each test item and the training exemplars of both categories C and X. There were two free parameters of interest: the distance scaling parameter, s, as discussed above, and a response bias, B_new, that determined the bias (on a scale from 0 to 1) to say that something was new rather than old. Figure 7 shows response contours for the two training groups, based on median values of s and B_new. The response probability distribution was generally broader and less peaked in the diverse condition than in the clustered condition.

The median values of B_new were .23 and .33 for the clustered and diverse training groups, respectively (IQRs = .26 and .17). However, the difference was not statistically significant (U = 81, n1 = 16, n2 = 16, p = .080). Thus, as a nonsignificant trend, participants who learned the clustered category were somewhat less inclined to give new responses (equivalently, they required somewhat less evidence for an old response) than were participants who learned the diverse category.

The median values of s were 2.40 and 1.17 for the clustered and diverse training groups, respectively (IQRs = 1.48 and 1.40). Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 33, n1 = 16, n2 = 16, p < .001). This difference in sensitivity echoes that observed in the generalization study. The effect of the sensitivity difference on recognition is illustrated in Figure 6B. The difference in exemplar influence is greatest at distance 0.65 (about 1 1/3 grid units in Figures 2 and 3), where the influence is 168% greater for participants who learned the diverse category than for those who learned the clustered category.

Once more, there was no significant improvement in fit for a model incorporating separate scaling parameters for categories C and X, compared with the simpler model with a single global scaling parameter [χ²(32) = 23.4, p = .86].

GENERAL DISCUSSION

Our experiments suggest that the diversity of perceptual categories affects learning, generalization, and item recognition by altering the scale of similarities as well as the positions of exemplars within the stimulus space. People learning a diverse category made more errors and gave slower responses than did people learning a clustered category. After having the same amount of category training, people who learned a diverse category were more likely than those who learned a clustered category to accept distant peripheral items as members of the category. And people who learned a diverse category were more likely to wrongly think that they had previously seen a distant peripheral item than were people who learned a clustered category. The effects of diversity on generalization and recognition were generally consistent with responding on the basis of item similarities. However, regression modeling with the GCM revealed a further effect: Participants' perception of the stimuli was altered, so that people who learned a diverse category required a greater physical difference to perceive the same psychological difference between stimuli. This perceptual rescaling effect led to lower levels of accuracy in the training task, wider generalization, and poorer item recognition, over and above effects that could be attributed to item similarities.

This study sought to distinguish different mechanisms that might produce diversity effects, and the use of three different but related tasks makes the present study particularly informative. There are numerous accounts of

Figure 7. Recognition test items and predicted response contours. Markers show training items for the clustered () or diverse () training groups, and novel test items (). Solid lines are for the clustered training condition, showing the 50% (outer) and 70% (inner) recognition contours. Dotted lines show the same information for the diverse training group.

CATEGORY DIVERSITY 299

category learning in the literature that could explain the reduced learning accuracy we observed in the diverse condition, but which do not simultaneously explain the effects we also observed in generalization and recognition. For example, Fried and Holyoak's (1984) category density model represents categories by their means (prototypes) and variances, estimated from a sample of exemplars. Diverse categories are learned more slowly than are clustered ones because estimates of the means and variances converge more slowly when the stimuli are less uniform. However, because the category density model is concerned only with categorization, it has nothing at all to say about the diversity effects we observed in old–new recognition. And although the category density model predicts that diversity will affect generalization, the pattern of responses in our data is predicted much better by exemplar similarities than by similarities to the category prototypes (cf. J. D. Smith & Minda, 2002; see Appendixes A and B for details).

Diversity effects could plausibly arise from item similarities, shifts in category boundaries (as reflected in categorization biases), or rescaling of similarity relations. A combination of methodological factors contributed to our ability to distinguish these different mechanisms. Regression models allowed us to separate out item similarities, to simultaneously consider response probabilities across a large set of test items (including old and new items, peripheral and internal as well as boundary items), and, in the generalization study, to simultaneously consider the distribution of responses across all three choices. We would not have detected the global rescaling effect if we had examined just a few critical items on the boundary between categories, because the effect of the diversity manipulation on individual items was quite small after item similarities were factored out. In the generalization task, the neither response increased the information obtained on each trial (just as a four-choice multiple choice test is more informative than the same number of true/false questions) and made it possible to detect changes in category structure on the far sides of the categories.

Most important, the neither response made it possible in principle to distinguish between changes in category biases and rescaling of similarities. We found no evidence that similarity relations were category specific, with different scaling factors for different regions of stimulus space according to the variability of categories in those regions. Our data also showed no evidence that response biases varied as a function of category variability. Rather, there was a global change in sensitivity to psychological distance in response to increased variability. This finding is consistent with the effect of stimulus range observed by Braida and Durlach (1972) in studies of sound intensity perception. Our results show that the effect is not confined to unidimensional stimuli, and also demonstrate that it affects standard category learning and generalization tasks, as well as item recognition.

With regard specifically to generalization, our modeling demonstrates that exemplar similarities can explain the seemingly different results observed in previous studies for central, peripheral, and boundary items. Only a few previous studies have controlled exemplar similarities in their tests of diversity effects on generalization (Cohen et al., 2001; Rips, 1989; Stewart & Chater, 2002). These studies focused on boundary items between a clustered category and a diverse category. Rips's pizza-or-coin study found that judgments of category membership differed from judgments of category similarity, and in particular, that categorization of boundary items was biased in favor of diverse categories rather than clustered categories. Rips concluded that because judgments for the categorization task differed from those for the similarity task, categorization could not be based on similarity. Nosofsky and colleagues have suggested alternative interpretations within the framework of the GCM (Cohen et al., 2001; Nosofsky & Johansen, 2000). Moreover, our own results and those of numerous other studies (e.g., Shin & Nosofsky, 1992; see also works cited in Nosofsky & Johansen, 2000) show that similarity scaling can vary markedly between different tasks. This factor alone could explain the differences that Rips observed between judgments of category membership and category similarity.

The same confound applies to Cohen et al.'s (2001) Experiment 1, which contrasted performance on identification and categorization tasks. On the basis of a single test item, Cohen et al. concluded that high-diversity categories exert a "pull" on category judgments. However, their results could equally be explained by global effects of the sort observed in the present study, that is, by similarity scaling throughout the stimulus space, adjusted according to the particular task as well as the overall diversity of the stimulus items. In contrast, Cohen et al.'s Experiment 2 provides more compelling evidence for a category-specific effect of diversity on categorization. However, because their analysis focused on a single boundary item, there is no way to tell whether the category-specific effect involved response biases or category-specific rescaling of similarities. These different mechanisms can only be distinguished by examining responses across a wider range of stimuli.

Finally, our finding that category diversity affects global similarity relations has implications for the interpretation of Stewart and Chater's (2002) results. Their analyses of individual data indicated that, for most participants, boundary items were less likely to be classified as members of an experimental category with high diversity than with low diversity. On the basis of these results, Stewart and Chater suggest that response biases might be sensitive to category diversity. Alternatively, their results could reflect the same kind of global rescaling of similarities that we observed in the present study.

Although our results point to a global effect of diversity that affects the whole stimulus space in learning, generalization, and recognition tasks, they do not identify what aspect of diversity is relevant (nor have previous studies of diversity effects in categorization). Participants might be sensitive to the total range of stimuli (Braida & Durlach, 1972), or they might be sensitive to

300 HAHN BAILEY AND ELVIN

average absolute deviations from the grand mean, or average variance about category means, or some other specific measure of variability. Only detailed comparisons across various stimulus distributions will eventually distinguish among these possibilities.

REFERENCES

Ashby, F. G., & Maddox, W. T. (1990). Integrating information from separable psychological dimensions. Journal of Experimental Psychology: Human Perception & Performance, 16, 598-612.

Barsalou, L. W. (1983). Ad hoc categories. Memory & Cognition, 11, 211-227.

Braida, L. D., & Durlach, N. I. (1972). Intensity perception: II. Resolution in one-interval paradigms. Journal of the Acoustical Society of America, 51, 483-502.

Cohen, A. L., Nosofsky, R. M., & Zaki, S. R. (2001). Category variability, exemplar similarity, and perceptual classification. Memory & Cognition, 29, 1165-1175.

Corter, J. E., & Gluck, M. A. (1992). Explaining basic categories: Feature predictability and information. Psychological Bulletin, 111, 291-303.

Flannagan, M. J., Fried, L. S., & Holyoak, K. J. (1986). Distributional expectations and the induction of category structure. Journal of Experimental Psychology: Learning, Memory, & Cognition, 12, 241-256.

Fried, L. S., & Holyoak, K. J. (1984). Induction of category distributions: A framework for classification learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 234-257.

Hahn, U., & Ramscar, M. C. A. (2001). Mere similarity. In U. Hahn & M. C. A. Ramscar (Eds.), Similarity and categorization (pp. 257-272). Oxford: Oxford University Press.

Heit, E., & Hahn, U. (2001). Diversity-based reasoning in children. Cognitive Psychology, 43, 243-273.

Homa, D., & Vosburgh, R. (1976). Category breadth and the abstraction of prototypical information. Journal of Experimental Psychology: Human Learning & Memory, 2, 322-330.

Jones, G. V. (1983). Identifying basic categories. Psychological Bulletin, 94, 423-428.

Luce, R. D. (1959). Individual choice behavior: A theoretical analysis. New York: Wiley.

Mareschal, D., French, R. M., & Quinn, P. C. (2000). A connectionist account of asymmetric category learning in early infancy. Developmental Psychology, 36, 635-645.

McLaren, I. P. L. (1997). Categorization and perceptual learning: An analogue of the face inversion effect. Quarterly Journal of Experimental Psychology, 50A, 257-273.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Murphy, G. L., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289-316.

Neumann, P. G. (1977). Visual prototype formation with discontinuous representation of dimensions of variability. Memory & Cognition, 5, 187-197.

Nosofsky, R. M. (1984). Choice, similarity, and the context theory of classification. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 104-114.

Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.

Nosofsky, R. M. (1988). Similarity, frequency, and category representations. Journal of Experimental Psychology: Learning, Memory, & Cognition, 14, 54-65.

Nosofsky, R. M., & Johansen, M. K. (2000). Exemplar-based accounts of "multiple-system" phenomena in perceptual categorization. Psychonomic Bulletin & Review, 7, 375-402.

Osherson, D. N., Smith, E. E., Wilkie, O., Lopez, A., & Shafir, E. (1990). Category-based induction. Psychological Review, 97, 185-200.

Palmeri, T. J., & Nosofsky, R. M. (2001). Central tendencies, extreme points, and prototype enhancement effects in ill-defined perceptual categorization. Quarterly Journal of Experimental Psychology, 54A, 197-235.

Peterson, M. J., Meagher, R. B., Jr., Chait, H., & Gillie, S. (1973). The abstraction and generalization of dot patterns. Cognitive Psychology, 4, 378-398.

Posner, M. I., Goldsmith, R., & Welton, K. E., Jr. (1967). Perceived distance and the classification of distorted patterns. Journal of Experimental Psychology, 73, 28-38.

Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353-363.

Quinn, P. C., Eimas, P. D., & Rosenkrantz, S. L. (1993). Evidence for representations of perceptually similar natural categories by 3-month-old and 4-month-old infants. Perception, 22, 463-475.

Rips, L. J. (1989). Similarity, typicality, and categorization. In S. Vosniadou & A. Ortony (Eds.), Similarity and analogical reasoning (pp. 21-59). New York: Cambridge University Press.

Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382-439.

Shin, H. J., & Nosofsky, R. M. (1992). Similarity-scaling studies of dot-pattern classification and recognition. Journal of Experimental Psychology: General, 121, 278-304.

Smith, E. E., & Sloman, S. A. (1994). Similarity- versus rule-based categorization. Memory & Cognition, 22, 377-386.

Smith, J. D., & Minda, J. P. (2002). Distinguishing prototype-based and exemplar-based processes in dot-pattern category learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 800-811.

Stewart, N., & Chater, N. (2002). The effect of category variability in perceptual categorization. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 893-907.

Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins.

Wayne, A. (1995). Bayesianism and diverse evidence. Philosophy of Science, 62, 111-121.

Younger, B. A. (1985). The segregation of items into categories by ten-month-old infants. Child Development, 56, 1574-1583.

NOTES

1. These studies all involved nonnormal distributions of exemplars, which were generally sampled from donut-shaped regions of stimulus space. The effect of changes in the variance of normally distributed training exemplars has received little, if any, experimental attention.

2. The MANOVA computes an optimal linear combination of several dependent variables (DVs) and computes an ANOVA on the resulting composite DV. If the original DVs are correlated (positively or negatively), the MANOVA protects against the inflated Type I error rate produced by separate tests of the DVs (see Tabachnick & Fidell, 1996).

3. Thanks to Dorrit Billman for suggesting this explanation.


APPENDIX A
Details of Modeling for Generalization Data

According to the generalized context model (GCM), classification decisions are based on similarity comparisons between a test item and individual exemplars of each category stored in memory. In our generalization task, participants judged whether item i was a member of category C, category X, or neither. The GCM predicts the probability of a category C response on the basis of the total weighted similarity of the test item to exemplars of category C, divided by the weighted similarity of the item to exemplars of both categories, plus a constant response threshold for neither responses:

$$p(C \mid i) = \frac{\mathrm{bias}_C \cdot \sum_{j \in C} e^{-s \cdot d_{ij}}}{\mathrm{bias}_C \cdot \sum_{j \in C} e^{-s \cdot d_{ij}} + \mathrm{bias}_X \cdot \sum_{l \in X} e^{-s \cdot d_{il}} + K} \quad \text{(A1)}$$

The term d_ij is the distance between stimulus i and exemplar j, computed as a weighted Euclidean distance:

$$d_{ij} = \sqrt{w_x (x_i - x_j)^2 + w_y (y_i - y_j)^2} \quad \text{(A2)}$$

where the xs and ys refer to head area and stem height measurements, respectively, and the dimension weights w_x and w_y sum to 1. Bias_C and bias_X are response bias parameters, and s is a scaling parameter that determines the sensitivity of response probabilities to a given change in distance. The probability of a category X or neither response is computed in a similar way, by replacing the numerator of Equation A1 with the appropriate term from the denominator.
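As a concrete illustration, the choice rule of Equations A1 and A2 can be sketched in a few lines of Python. This is only a sketch: the function names, parameter values, and exemplar coordinates below are our own illustrative choices, not values from the study.

```python
import math

def distance(item, exemplar, wx=0.5, wy=0.5):
    # Weighted Euclidean distance between two (head area, stem height)
    # points, as in Equation A2; the weights wx and wy sum to 1.
    (xi, yi), (xj, yj) = item, exemplar
    return math.sqrt(wx * (xi - xj) ** 2 + wy * (yi - yj) ** 2)

def summed_similarity(item, exemplars, s):
    # Total exponentially decaying similarity of a test item to a
    # category's stored exemplars.
    return sum(math.exp(-s * distance(item, ex)) for ex in exemplars)

def gcm_probs(item, cat_c, cat_x, s, bias_c, bias_x, K):
    # Equation A1: choice probabilities for C, X, and neither.
    sim_c = bias_c * summed_similarity(item, cat_c, s)
    sim_x = bias_x * summed_similarity(item, cat_x, s)
    total = sim_c + sim_x + K
    return sim_c / total, sim_x / total, K / total

# Hypothetical coordinates (not the study's stimuli): two exemplars
# per category and a test item close to category C.
cat_c = [(0.2, 0.3), (0.3, 0.2)]
cat_x = [(0.8, 0.9), (0.9, 0.8)]
p_c, p_x, p_neither = gcm_probs((0.25, 0.25), cat_c, cat_x,
                                s=2.4, bias_c=0.45, bias_x=0.45, K=0.1)
```

Because the three response strengths share one denominator, the three probabilities necessarily sum to 1, and a test item near category C's exemplars receives a high P(C).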

Without loss of generality, we factor the response threshold K into the product of a bias parameter and the average total similarity of each exemplar to all the others, E. Thus, K = bias_neither · E, where

$$E = \frac{1}{N} \sum_{j} \sum_{l \neq j} e^{-s \cdot d_{jl}} \quad \text{(A3)}$$

N is the total number of exemplars. Higher values of bias_neither produce more neither responses. There are two key advantages of this innovative treatment of the neither response strength. The effect of the parameter bias_neither on goodness of fit is largely orthogonal to other parameters of the model, including the scaling parameter s. As a consequence, regressions converge more quickly than with an undifferentiated K response strength parameter. Also, bias_neither is independent of the number of exemplars and the average similarity among them, so meaningful comparisons can be made between best-fit values of bias_neither obtained across different stimulus sets. In contrast, such comparisons with the undifferentiated K are generally not meaningful.
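The quantity E, and the threshold K derived from it, can be sketched as follows. For simplicity this sketch uses a plain (equal-weight) Euclidean distance; the exemplar coordinates and the bias value are hypothetical, not fitted values from the study.

```python
import math

def mean_total_similarity(exemplars, s):
    # E in Equation A3: the average, over exemplars, of each exemplar's
    # total similarity to all the *other* exemplars (self-similarity
    # is excluded from the inner sum).
    n = len(exemplars)
    return sum(math.exp(-s * math.dist(a, b))
               for i, a in enumerate(exemplars)
               for j, b in enumerate(exemplars) if i != j) / n

# Hypothetical exemplar coordinates (not the study's).
exemplars = [(0.2, 0.3), (0.3, 0.2), (0.8, 0.8)]
E = mean_total_similarity(exemplars, s=2.4)
K = 0.03 * E  # response threshold, with bias_neither = 0.03
```

Because K is expressed as a fraction of the average exemplar-to-exemplar similarity, the fitted bias_neither stays comparable across stimulus sets of different sizes and spreads, which is the point made in the text above.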

Also without loss of generality, we restrict the bias parameters (bias_C, bias_X, and bias_neither) to the range from 0 to 1 and require them to sum to 1 (there are only two degrees of freedom among these three parameters). Finally, we are usually interested in the relative magnitudes of bias_C and bias_X compared with each other, so we define bias_C = bias_C|CX · (1 − bias_neither) and bias_X = bias_X|CX · (1 − bias_neither). The parameters bias_C|CX and bias_X|CX determine the relative magnitudes of the two category response biases. They range from 0 to 1 and sum to 1 (having only one degree of freedom between them).
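The reparameterization above can be checked with a line or two of arithmetic (the numbers are illustrative, not fitted values):

```python
# Recover the three raw biases from the two-parameter description:
# bias_neither plus the relative category bias bias_C|CX.
bias_neither = 0.03      # illustrative value
bias_c_given_cx = 0.68   # illustrative relative bias favoring category C
bias_c = bias_c_given_cx * (1 - bias_neither)
bias_x = (1 - bias_c_given_cx) * (1 - bias_neither)
# By construction, bias_c + bias_x + bias_neither == 1, so the three
# raw biases carry only two degrees of freedom, as stated in the text.
```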

The full model defined above is fully specified with reference to four independent parameters: s, w_X, bias_C|CX, and bias_neither. We used this model to predict participants' classification responses, averaged across the three presentations of each flower (once in each color scheme). Thus, for each of the 23 test flowers, the model predicted the proportions of category C, category X, and neither responses out of three opportunities. Models were fit separately to the data for each participant by minimizing a likelihood-ratio chi-square goodness of fit statistic, G = 2 Σ O ln(O/E), over the observed (O) and expected (E) response proportions.

To obtain the most accurate tests of parameter values, we eliminated parameters that did not make a significant contribution to goodness of fit. The difference in G obtained for a particular participant by two regression models that differ by the presence or absence of a single parameter can be tested against the chi-square distribution with one degree of freedom. Chi-square difference scores can be summed across all participants to obtain a composite statistic that can be tested against the chi-square distribution with degrees of freedom equal to the number of participants. Parameters that did not reach significance at the p < .10 level on such tests were eliminated, to simplify the model, as described below.
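The nested-model comparison logic can be sketched as follows. The observed and predicted proportions here are invented for illustration; in the actual analyses G was computed per participant and the differences summed across participants.

```python
import math

def g_statistic(observed, expected):
    # Likelihood-ratio goodness of fit, G = 2 * sum(O * ln(O / E)).
    # Cells with a zero observed proportion contribute nothing.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

# Hypothetical response proportions (C, X, neither) for one item set.
observed   = [0.60, 0.30, 0.10]
full_model = [0.55, 0.35, 0.10]  # predictions with the parameter free
reduced    = [0.40, 0.40, 0.20]  # predictions with the parameter fixed

g_full = g_statistic(observed, full_model)
g_reduced = g_statistic(observed, reduced)
# The increase in G when a parameter is removed is tested against
# chi-square with one degree of freedom per removed parameter.
delta_g = g_reduced - g_full
```

Removing a useful parameter can only worsen (increase) G, so delta_g is the chi-square difference score referred to in the text.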

For the full model, with four free parameters for each participant, the goodness of fit was G(164) = 604. Setting s to 0 resulted in a significantly poorer fit to the data [χ²(82) = 920, p < .001; when s = 0, the parameter w_X has no effect, so there are just two degrees of freedom per participant]. This result indicates that participants' judgments were related to the distances between test and training items (i.e., those distances gave better predictions of responses than assuming that all items were treated identically). Setting w_X to .5 (equal weighting for both dimensions) did not significantly reduce fitness in comparison with the full model [χ²(41) = 48, p = .22]. Setting bias_C|CX to .5 (equal bias for both categories) resulted in a significantly poorer



APPENDIX A (Continued)

fit compared with the full model [χ²(41) = 81, p < .001]. The neither response bias, bias_neither, is required, since participants did in fact avail themselves of the option to respond neither. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. Accordingly, the dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group, rather than varying freely among all participants within each group. Setting s to the median value of each group resulted in a significantly poorer fit to the data, compared with including a separate free parameter for each participant [χ²(39) = 58, p = .027]. Thus, distance scaling varied within groups. Setting bias_C|CX to the median value within each group did not result in a significantly poorer fit, compared with including separate parameters for each participant [χ²(39) = 41, p = .37]. A further comparison found that setting this parameter to the overall median value (bias_C|CX = .68) fit the data just as well as the model that used the median of each training group [χ²(1) = 0.21, p = .65]. Setting bias_neither to the median value of each group resulted in a significantly poorer fit to the data [χ²(39) = 74, p < .001]. This result indicates that the bias to respond neither varied among participants. Taken together, these results indicate that s and bias_neither are required for each participant; bias_C|CX did not vary among participants.

In response to a reviewer's query, we also implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average, with separate scaling parameters s for exemplar and prototype similarities. The pure prototype model was much worse than the pure instance model (goodness of fit G = 1597 and 693, respectively). The best-fitting mixture model fit only slightly better than the pure exemplar model, and the difference was not significant [χ²(82) = 29.7, p > .1]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.

APPENDIX B
Details of Modeling for Recognition Data

In our recognition task, participants judged whether item i was old or new. The GCM predicts the probability of an old response on the basis of the total weighted similarity of the test item to all exemplars, divided by this similarity plus a constant response threshold for new responses:

$$p(\mathrm{old} \mid i) = \frac{B_{\mathrm{old}} \cdot \sum_{j} e^{-s \cdot d_{ij}}}{B_{\mathrm{old}} \cdot \sum_{j} e^{-s \cdot d_{ij}} + K} \quad \text{(B1)}$$

The term d_ij is the Euclidean distance between stimulus i and exemplar j, as in Appendix A. Following Appendix A, we factor K into the product of the average similarity of each exemplar to all the others, E, multiplied by a bias term. Thus, K = B_new · E. Higher values of B_new produce more new responses. B_old and B_new range from 0 to 1 and sum to 1 (there is one degree of freedom between them). This full model includes three independent parameters: s, w_X, and B_new. For each of the 31 test flowers, the model predicted the proportions of old and new responses out of three opportunities (once in each color scheme). Models were fit separately to the data for each participant.
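Equation B1, together with the K = B_new · E factorization, can be sketched as below. The exemplar coordinates, scaling value, and bias are illustrative assumptions, not the study's fitted values.

```python
import math

def p_old(item, exemplars, s, b_new):
    # Equation B1, with K = B_new * E; B_old and B_new sum to 1.
    b_old = 1 - b_new
    sim = b_old * sum(math.exp(-s * math.dist(item, ex))
                      for ex in exemplars)
    # E: average total similarity of each exemplar to all the others.
    n = len(exemplars)
    E = sum(math.exp(-s * math.dist(a, b))
            for i, a in enumerate(exemplars)
            for j, b in enumerate(exemplars) if i != j) / n
    return sim / (sim + b_new * E)

# Hypothetical training exemplars and test items (not the study's).
train = [(0.2, 0.2), (0.3, 0.3), (0.7, 0.7)]
p_near = p_old((0.25, 0.25), train, s=2.0, b_new=0.3)
p_far = p_old((1.5, 1.5), train, s=2.0, b_new=0.3)
# A test item near the training exemplars should attract more 'old'
# responses than a distant peripheral item, so p_near > p_far.
```

Lowering s flattens the similarity gradient, which is how the model captures the broader, less peaked recognition contours of the diverse training group.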

Again, we eliminated parameters that did not make a significant contribution to goodness of fit. Setting s to 0 resulted in significantly worse fits to the data, compared with the full model [χ²(64) = 159, p < .001]. Setting w_X to .5 did not significantly reduce the fit of the model [χ²(32) = 24, p = .85]. B_new is required in order to predict any new judgments, so this parameter cannot be removed or set to 0. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. The dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group. The fit of the model with one s parameter for each group of participants was not significantly worse than the fit of the model with separate parameters for every participant [χ²(30) = 26, p = .69]. A further comparison found that setting s to the overall median of all participants resulted in a significantly worse fit to the data, compared with separate values for each training group [χ²(1) = 11.5, p < .001]. This result indicates that the diversity manipulation affected s. The fit of the model with one B_new parameter for each group was significantly worse than the model with separate parameters for each participant [χ²(30) = 86, p < .001]. This result indicates that the bias to respond new varied among participants.

Again, we implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average. The pure prototype model was substantially worse than the pure instance model (goodness of fit G = 546 and 411, respectively). The best-fitting mixture model fit only slightly better than did the pure exemplar model, and the difference was not significant [χ²(64) = 14.2, p > .1]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.

(Manuscript received December 12, 2002; revision accepted for publication June 18, 2004.)



category (median = 90, IQR = 120). This difference was significant in a Mann–Whitney test (U = 446.0, n1 = 36, n2 = 37, p = .014).

The total numbers of errors were computed separately for the control and experimental categories, and a square root transform was applied to the totals for each participant, to reduce skewness, kurtosis, and heterogeneity of variance. The mean root errors were 2.62 and 2.77 for the clustered condition, categories C and X, respectively, and 3.24 and 3.60 for the diverse condition (SEs = 0.238, 0.189, 0.206, and 0.226, respectively). These mean values are detransformed and expressed as error rates in Table 1. RTs for correct responses were log transformed. Three outlying RTs (out of 16,765), less than 200 msec or greater than 5,000 msec, were excluded from analysis. The mean log RTs were 2.775 and 2.779 for the clustered condition, categories C and X, and 2.836 and 2.848 for the diverse condition (SEs = 0.016, 0.016, 0.014, and 0.015, respectively). These mean values are detransformed and shown in Table 1. A two-way multivariate analysis of variance (MANOVA) was performed on root errors and log reaction times, with training diversity (clustered or diverse) as a between-subjects factor and stimulus category (control or experimental) as a within-subjects factor.2 Here and throughout, there were no multivariate outliers within levels of training diversity (p < .001). This means that the other statistics from these MANOVAs are not unduly influenced by a few unrepresentative data points.

In combination, errors and RTs were affected by category diversity [F(2,70) = 7.67, p = .001]. People learning the clustered category made fewer errors and gave faster responses than did people learning the diverse category. There was no significant difference between the control and experimental categories [F(2,70) = 1.62, p = .21], nor an interaction between diversity condition and category [F(2,70) < 1]. Thus, the diversity of category X affected responses to both categories C and X equally.

Errors and RTs made independent contributions to the combined differences between training groups, as indicated by a unique effect for each variable after factoring out the other in Roy–Bargmann stepdown analyses [Fs(1,70) > 5.24, ps < .025]. In univariate ANOVAs, the main effect of category diversity was significant both for errors [F(1,71) = 7.78, p = .007] and for RTs [F(1,71) = 9.51, p = .003].

Generalization
Average response percentages for various types of test item in the category generalization task are shown in Table 2. Asterisks identify those entries that are relevant to the predicted diversity effects we outlined above. To limit the family-wise error rate, we tested only a few key contrasts from this table, and followed up with regression analyses encompassing the whole of the data. Confidence ratings for the generalization and recognition tasks have not yet been analyzed and are not reported.

Table 1
Summary Measures of Training Difficulty, Including Error Rates and Reaction Times (RTs, in Milliseconds) for Correct Responses, as a Function of Training Group (Clustered or Diverse) and Test Category (C or X)

                 Error Rate (%)             RT (msec)
Training     Category C   Category X   Category C   Category X
Clustered        5.5          6.1          595          601
Diverse          8.3         10.3          685          705

Table 2
Distribution of Generalization Responses, Showing Percentage of Each Response Alternative as a Function of Stimulus Region and Type of Novel Item

                                              Response
                                       Control     Experimental    Neither
Stimulus Region   Item Type   Training    %   SD       %   SD       %   SD
Boundary          Boundary   Clustered   54    5      18    4      28    4
                             Diverse     53    5      23    4      24    4
Control           Prototype  Clustered   95    3       2    2       3    2
                             Diverse     95    3       0    0       5    3
                  Inside     Clustered   97    2       0    0       3    2
                             Diverse     95    3       2    2       3    2
                  Peripheral Clustered   90    6       5    5       5    4
                             Diverse     87    7       0    0      13    7
Experimental      Prototype  Clustered    2    2      88    4      10    4
                             Diverse      2    2      76    7      22    7
                  Inside     Clustered    0    0      92    3       8    3
                             Diverse      2    2      79    7      19    7
                  Peripheral Clustered    6    2      50    4      44    4
                             Diverse      8    2      66    4      26    3
Overall                      Clustered   37    3      38    3      26    2
                             Diverse     37    3      43    3      21    2

*Entries relevant to the predicted diversity effects outlined in the text.

296 HAHN BAILEY AND ELVIN

Analysis of response fractions. Our initial analysis of generalization examined four measures: the fraction of neither responses across all novel items (neither), the fraction of category X responses for novel boundary items (boundaryX), the fraction of category X responses for novel items on the periphery of category X (periphX), and the fraction of category X responses for the unseen category X prototype (protoX). The protoX scores were near ceiling and could not be analyzed parametrically. Although protoX scores were slightly lower for participants trained on the diverse category, compared with the clustered category, this difference between groups was not significant [Mann–Whitney (corrected for ties) Z = 1.16, n1 = 20, n2 = 21, p = .25].

A one-way MANOVA was performed on the variables neither, boundaryX, and periphX, with training diversity (clustered or diverse) as a between-subjects factor. In combination, these variables were affected by training diversity [F(3,37) = 3.19, p = .035]. People trained on the diverse category gave fewer neither responses overall, more category X responses for boundary items, and more category X responses for items on the periphery of category X, as predicted. Roy–Bargmann stepdown analysis (details not reported here) indicated that the significance of the overall effect was due primarily to the periphX variable: people trained on the diverse category gave fewer neither responses to items on the periphery of category X, and classified them as members of category X instead.

Regression models. In our stimuli, similarities between test and training items necessarily varied as a function of training diversity, so it is possible that the effect of diversity on periphX, for example, reflects the fact that peripheral items were closer to some of the diverse training exemplars than to the corresponding clustered exemplars. To factor out the effect of item similarities, and to test for a general effect of variability, we fit regression models, as described in Appendix A, to predict each participant's responses to the full set of test items (both old and new). The predicted probability of a particular response was a function of the similarity between each test item and the training exemplars of categories C and X. The full model included three parameters of interest: (1) a distance scaling parameter s, which determined the sensitivity of response probabilities to a given change in head area or stem height; (2) a category C relative response bias, bias_C|CX, which determined the bias (on a scale from 0 to 1) to classify something as category C rather than X; and (3) a neither response bias, bias_neither, which determined the bias to say that something belonged to neither category C nor X. Changes in s and bias_neither can have superficially similar effects on predicted response probabilities, but a change in bias_neither will not affect the ratio of C to X responses, whereas a change in sensitivity will. The two parameters consequently have distinguishable effects on regression fits.

Goodness-of-fit statistics for models with and without separate biasC|CX parameters for each participant indicated that this parameter did not vary significantly within training groups [χ2(39) = 41, p = .37] or between training groups [χ2(1) = 0.21, p = .65]. Subsequent analyses fixed this parameter at the overall median value, which favored category C over X by a ratio of more than 2:1. This level of bias means that an item that was equally similar to both categories would be classified as a member of category C twice as often as category X. This strong bias may reflect participants' expectation that the two categories would occur equally often, whereas there was a preponderance of novel items on the category X side of the stimulus space.3

With biasC|CX held at the overall median value, we determined best-fit values of s and biasneither for each participant. Figure 5 shows response contours for the two training groups, based on median values of s and biasneither. As the 70% contours show, the inclusion of a neither response allowed us to model category structure on the far side of each category, as well as on the side near the boundary. The median values of biasneither were 0.0297 and 0.0310 for the clustered and diverse training groups, respectively (IQRs = 0.0446 and 0.0549). This difference was not significant in a Mann–Whitney test (U = 175, n1 = 20, n2 = 21, p = .36).

The median values of s were 2.72 and 2.26 for the clustered and diverse groups, respectively (IQRs = 1.40 and 0.55).

Figure 5. Categorization training and test items and predicted response contours. Markers show training items for the clustered or diverse training groups and novel test items. Solid lines are for the clustered training group, showing the equiprobability category boundary and the 70% categorization contours for each category. Dotted lines show the same information for the diverse training group.

CATEGORY DIVERSITY 297

Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 131, n1 = 20, n2 = 21, p = .039). The effect of this difference in s values is illustrated in Figure 6A, which shows the influence of each training item on the classification of a test item as a function of the distance between the two. The rescaled similarity of an exemplar at distance 0.4 (where the two curves are furthest apart) is 20% greater for participants who learned the diverse category than for those who learned the clustered category. Distance 0.4 corresponds to a circle around a training item on Figure 2 or 3, extending not quite one grid unit in each direction.
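These two figures follow directly from the exponential similarity function of the GCM; a short check using only the reported median s values reproduces them (the closed-form location of the maximum separation between two exponential decay curves is our own arithmetic, not from the article):

```python
import math

# Median best-fit sensitivity values reported above for the two groups.
s_clustered, s_diverse = 2.72, 2.26

# GCM exemplar influence decays exponentially with distance: exp(-s * d).
# For two such curves, the separation is largest at
# d* = ln(s1 / s2) / (s1 - s2)  (set the derivatives equal).
d_star = math.log(s_clustered / s_diverse) / (s_clustered - s_diverse)

# Relative advantage of the shallower (diverse-group) curve at d*.
ratio = math.exp(-s_diverse * d_star) / math.exp(-s_clustered * d_star)

print(round(d_star, 2))          # 0.4
print(round(100 * (ratio - 1)))  # 20
```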

Did the effect of training diversity affect distance scaling for categories C and X equally? The GCM ordinarily assumes that the same scaling applies throughout the stimulus space, but Nosofsky and Johansen (2000) have suggested that category-specific scaling could explain the apparent bias toward diverse categories reported in Rips' (1989) classic pizza-or-coin study. We fit an enhanced regression model that had separate s parameters for categories C and X (for each participant) and found no significant difference in goodness of fit between the enhanced model and the original [χ2(41) = 37, p = .66]. Our results therefore suggest global rescaling across the whole stimulus space and do not indicate category-specific rescaling.

Recognition

The proportions of incorrect responses for various types of test item in the old–new recognition task are summarized in Table 3. Miss rates for training exemplars ranged from 22% to 33%, whereas false alarm rates for most new items were over 50%. The two training groups differed markedly in their responses to peripheral items beyond the outer edge of category X exemplars. People trained on the clustered category mistook these for old items on only 22% of trials, whereas those trained on the diverse category mistook them on 50% of trials.

Analysis of false alarms. Because of the very limited amount of relevant past research, the predictions (outlined above) for the recognition task were very general, so we conducted only a limited set of direct inferential tests. Our analysis pooled novel items into three groups and examined rates of false alarms (FA) for novel items on the category C side of the stimulus space (FA_C), the category X side (FA_X), and the boundary between categories (FA_Bndy). A one-way MANOVA was performed on the three false alarm scores, with training diversity (clustered or diverse) as a between-subjects factor. In combination, the false alarm scores were affected by training diversity [F(3,28) = 5.48, p = .004]. Roy–Bargmann stepdown analysis indicated that the significance of the overall effect was due primarily to the FA_X variable.

Table 3
Error Rates in Old–New Recognition Task by Training Diversity, Showing Percentage for Training Exemplars (Old) and Novel Test Items (New) as a Function of Stimulus Region and Type of Novel Item

                                   Clustered       Diverse
Region             Type            M     SD        M     SD
Old                Control         25    29        33    30
                   Experimental    22    24        29    28
New: Boundary      Boundary        58    37        65    35
New: Control       Prototype       83    24        79    24
                   Inside          77    29        58    26
                   Peripheral      58    35        46    34
New: Experimental  Prototype       83    24        77    23
                   Inside          69    26        77    26
                   Peripheral      22    30        50    36

Figure 6. Generalized context model similarity as a function of training group (clustered or diverse) and distance from an exemplar. Derived from median best-fit values of s for each training group when tested on generalization (A) and recognition (B).

298 HAHN BAILEY AND ELVIN

Participants trained on the clustered category made fewer false alarms on the category X side of the boundary (adjusted for FA_C and FA_Bndy, mean FA_X = 37.5%, SE = 4.47) than did participants trained on the diverse category (adjusted mean FA_X = 58.4%, SE = 4.47). Inspection of false alarm rates for individual items (Table 3) suggested that the effect was largely attributable to peripheral items.

Regression models. The effect of diversity on FA_X could be due to item similarities, since the peripheral items on the category X side were closer to the nearest exemplars of the diverse category than to those of the clustered one. To factor out item similarities, we fit regression models, as described in Appendix B, to predict each participant's responses to the full set of test items. The predicted probability of an old response was a function of the similarity between each test item and the training exemplars of both categories C and X. There were two free parameters of interest: the distance scaling parameter s, as discussed above, and a response bias Bnew that determined the bias (on a scale from 0 to 1) to say that something was new rather than old. Figure 7 shows response contours for the two training groups, based on median values of s and Bnew. The response probability distribution was generally broader and less peaked in the diverse condition than in the clustered condition.

The median values of Bnew were .23 and .33 for the clustered and diverse training groups, respectively (IQRs = .26 and .17). However, the difference was not statistically significant (U = 81, n1 = 16, n2 = 16, p = .080). Thus, as a nonsignificant trend, participants who learned the clustered category were somewhat less inclined to give new responses (equivalently, they required somewhat less evidence for an old response) than were participants who learned the diverse category.

The median values of s were 2.40 and 1.17 for the clustered and diverse training groups, respectively (IQRs = 1.48 and 1.40). Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 33, n1 = 16, n2 = 16, p < .001). This difference in sensitivity echoes that observed in the generalization study. The effect of the sensitivity difference on recognition is illustrated in Figure 6B. The difference in exemplar influence is greatest at distance 0.65 (about 1 1/3 grid units in Figures 2 and 3), where the influence is 168% greater for participants who learned the diverse category than for those who learned the clustered category.

Once more, there was no significant improvement in fit for a model incorporating separate scaling parameters for categories C and X, compared with the simpler model with a single global scaling parameter [χ2(32) = 23.4, p = .86].

GENERAL DISCUSSION

Our experiments suggest that the diversity of perceptual categories affects learning, generalization, and item recognition by altering the scale of similarities, as well as the positions of exemplars within the stimulus space. People learning a diverse category made more errors and gave slower responses than did people learning a clustered category. After having the same amount of category training, people who learned a diverse category were more likely than those who learned a clustered category to accept distant peripheral items as members of the category. And people who learned a diverse category were more likely to wrongly think that they had previously seen a distant peripheral item than were people who learned a clustered category. The effects of diversity on generalization and recognition were generally consistent with responding on the basis of item similarities. However, regression modeling with the GCM revealed a further effect: Participants' perception of the stimuli was altered, so that people who learned a diverse category required a greater physical difference to perceive the same psychological difference between stimuli. This perceptual rescaling effect led to lower levels of accuracy in the training task, wider generalization, and poorer item recognition, over and above effects that could be attributed to item similarities.

This study sought to distinguish different mechanisms that might produce diversity effects, and the use of three different but related tasks makes the present study particularly informative. There are numerous accounts of

Figure 7. Recognition test items and predicted response contours. Markers show training items for the clustered or diverse training groups and novel test items. Solid lines are for the clustered training condition, showing the 50% (outer) and 70% (inner) recognition contours. Dotted lines show the same information for the diverse training group.


category learning in the literature that could explain the reduced learning accuracy we observed in the diverse condition, but which do not simultaneously explain the effects we also observed in generalization and recognition. For example, Fried and Holyoak's (1984) category density model represents categories by their means (prototypes) and variances, estimated from a sample of exemplars. Diverse categories are learned more slowly than are clustered ones because estimates of the means and variances converge more slowly when the stimuli are less uniform. However, because the category density model is concerned only with categorization, it has nothing at all to say about the diversity effects we observed in old–new recognition. And although the category density model predicts that diversity will affect generalization, the pattern of responses in our data is predicted much better by exemplar similarities than by similarities to the category prototypes (cf. J. D. Smith & Minda, 2002; see Appendixes A and B for details).

Diversity effects could plausibly arise from item similarities, shifts in category boundaries (as reflected in categorization biases), or rescaling of similarity relations. A combination of methodological factors contributed to our ability to distinguish these different mechanisms. Regression models allowed us to separate out item similarities, to simultaneously consider response probabilities across a large set of test items (including old and new items, peripheral and internal as well as boundary items), and, in the generalization study, to simultaneously consider the distribution of responses across all three choices. We would not have detected the global rescaling effect if we had examined just a few critical items on the boundary between categories, because the effect of the diversity manipulation on individual items was quite small after item similarities were factored out. In the generalization task, the neither response increased the information obtained on each trial (just as a four-choice multiple choice test is more informative than the same number of true/false questions) and made it possible to detect changes in category structure on the far sides of the categories.

Most important, the neither response made it possible, in principle, to distinguish between changes in category biases and rescaling of similarities. We found no evidence that similarity relations were category specific, with different scaling factors for different regions of stimulus space according to the variability of categories in those regions. Our data also showed no evidence that response biases varied as a function of category variability. Rather, there was a global change in sensitivity to psychological distance in response to increased variability. This finding is consistent with the effect of stimulus range observed by Braida and Durlach (1972) in studies of sound intensity perception. Our results show that the effect is not confined to unidimensional stimuli, and they also demonstrate that it affects standard category learning and generalization tasks, as well as item recognition.

With regard specifically to generalization, our modeling demonstrates that exemplar similarities can explain

the seemingly different results observed in previous studies for central, peripheral, and boundary items. Only a few previous studies have controlled exemplar similarities in their tests of diversity effects on generalization (Cohen et al., 2001; Rips, 1989; Stewart & Chater, 2002). These studies focused on boundary items between a clustered category and a diverse category. Rips' pizza-or-coin study found that judgments of category membership differed from judgments of category similarity, and in particular, that categorization of boundary items was biased in favor of diverse categories rather than clustered categories. Rips concluded that because judgments for the categorization task differed from those for the similarity task, categorization could not be based on similarity. Nosofsky and colleagues have suggested alternative interpretations within the framework of the GCM (Cohen et al., 2001; Nosofsky & Johansen, 2000). Moreover, our own results and those of numerous other studies (e.g., Shin & Nosofsky, 1992; see also works cited in Nosofsky & Johansen, 2000) show that similarity scaling can vary markedly between different tasks. This factor alone could explain the differences that Rips observed between judgments of category membership and category similarity.

The same confound applies to Cohen et al.'s (2001) Experiment 1, which contrasted performance on identification and categorization tasks. On the basis of a single test item, Cohen et al. concluded that high-diversity categories exert a "pull" on category judgments. However, their results could equally be explained by global effects of the sort observed in the present study, that is, by similarity scaling throughout the stimulus space, adjusted according to the particular task as well as the overall diversity of the stimulus items. In contrast, Cohen et al.'s Experiment 2 provides more compelling evidence for a category-specific effect of diversity on categorization. However, because their analysis focused on a single boundary item, there is no way to tell whether the category-specific effect involved response biases or category-specific rescaling of similarities. These different mechanisms can only be distinguished by examining responses across a wider range of stimuli.

Finally, our finding that category diversity affects global similarity relations has implications for the interpretation of Stewart and Chater's (2002) results. Their analyses of individual data indicated that, for most participants, boundary items were less likely to be classified as members of an experimental category with high diversity than with low diversity. On the basis of these results, Stewart and Chater suggest that response biases might be sensitive to category diversity. Alternatively, their results could reflect the same kind of global rescaling of similarities that we observed in the present study.

Although our results point to a global effect of diversity that affects the whole stimulus space in learning, generalization, and recognition tasks, they do not identify what aspect of diversity is relevant (nor have previous studies of diversity effects in categorization). Participants might be sensitive to the total range of stimuli (Braida & Durlach, 1972), or they might be sensitive to


average absolute deviations from the grand mean, or average variance about category means, or some other specific measure of variability. Only detailed comparisons across various stimulus distributions will eventually distinguish among these possibilities.
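The candidate measures are easy to state precisely. The sketch below computes them for two invented one-dimensional exemplar sets, one clustered and one diverse (the values are illustrative only; they are not the stimuli used in the experiments):

```python
# Three candidate measures of category diversity mentioned above,
# computed for two hypothetical exemplar sets.
clustered = [4.8, 4.9, 5.0, 5.1, 5.2]
diverse = [3.0, 4.0, 5.0, 6.0, 7.0]

def total_range(xs):
    # Total range of the stimuli (cf. Braida & Durlach, 1972).
    return max(xs) - min(xs)

def mean_abs_dev(xs):
    # Average absolute deviation from the mean.
    m = sum(xs) / len(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

def variance(xs):
    # Average squared deviation about the category mean.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

for xs in (clustered, diverse):
    print(round(total_range(xs), 2),
          round(mean_abs_dev(xs), 2),
          round(variance(xs), 2))
# 0.4 0.12 0.02   (clustered set)
# 4.0 1.2 2.0     (diverse set)
```

Because the measures scale differently (range linearly, variance quadratically), stimulus distributions can in principle be constructed that hold one constant while varying another, which is what the comparisons envisaged above would require.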

REFERENCES

Ashby, F. G., & Maddox, W. T. (1990). Integrating information from separable psychological dimensions. Journal of Experimental Psychology: Human Perception & Performance, 16, 598-612.

Barsalou, L. W. (1983). Ad hoc categories. Memory & Cognition, 11, 211-227.

Braida, L. D., & Durlach, N. I. (1972). Intensity perception: II. Resolution in one-interval paradigms. Journal of the Acoustical Society of America, 51, 483-502.

Cohen, A. L., Nosofsky, R. M., & Zaki, S. R. (2001). Category variability, exemplar similarity, and perceptual classification. Memory & Cognition, 29, 1165-1175.

Corter, J. E., & Gluck, M. A. (1992). Explaining basic categories: Feature predictability and information. Psychological Bulletin, 111, 291-303.

Flannagan, M. J., Fried, L. S., & Holyoak, K. J. (1986). Distributional expectations and the induction of category structure. Journal of Experimental Psychology: Learning, Memory, & Cognition, 12, 241-256.

Fried, L. S., & Holyoak, K. J. (1984). Induction of category distributions: A framework for classification learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 234-257.

Hahn, U., & Ramscar, M. C. A. (2001). Mere similarity. In U. Hahn & M. C. A. Ramscar (Eds.), Similarity and categorization (pp. 257-272). Oxford: Oxford University Press.

Heit, E., & Hahn, U. (2001). Diversity-based reasoning in children. Cognitive Psychology, 43, 243-273.

Homa, D., & Vosburgh, R. (1976). Category breadth and the abstraction of prototypical information. Journal of Experimental Psychology: Human Learning & Memory, 2, 322-330.

Jones, G. V. (1983). Identifying basic categories. Psychological Bulletin, 94, 423-428.

Luce, R. D. (1959). Individual choice behavior: A theoretical analysis. New York: Wiley.

Mareschal, D., French, R. M., & Quinn, P. C. (2000). A connectionist account of asymmetric category learning in early infancy. Developmental Psychology, 36, 635-645.

McLaren, I. P. L. (1997). Categorization and perceptual learning: An analogue of the face inversion effect. Quarterly Journal of Experimental Psychology, 50A, 257-273.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Murphy, G. L., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289-316.

Neumann, P. G. (1977). Visual prototype formation with discontinuous representation of dimensions of variability. Memory & Cognition, 5, 187-197.

Nosofsky, R. M. (1984). Choice, similarity, and the context theory of classification. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 104-114.

Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.

Nosofsky, R. M. (1988). Similarity, frequency, and category representations. Journal of Experimental Psychology: Learning, Memory, & Cognition, 14, 54-65.

Nosofsky, R. M., & Johansen, M. K. (2000). Exemplar-based accounts of "multiple-system" phenomena in perceptual categorization. Psychonomic Bulletin & Review, 7, 375-402.

Osherson, D. N., Smith, E. E., Wilkie, O., Lopez, A., & Shafir, E. (1990). Category-based induction. Psychological Review, 97, 185-200.

Palmeri, T. J., & Nosofsky, R. M. (2001). Central tendencies, extreme points, and prototype enhancement effects in ill-defined perceptual categorization. Quarterly Journal of Experimental Psychology, 54A, 197-235.

Peterson, M. J., Meagher, R. B., Jr., Chait, H., & Gillie, S. (1973). The abstraction and generalization of dot patterns. Cognitive Psychology, 4, 378-398.

Posner, M. I., Goldsmith, R., & Welton, K. E., Jr. (1967). Perceived distance and the classification of distorted patterns. Journal of Experimental Psychology, 73, 28-38.

Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353-363.

Quinn, P. C., Eimas, P. D., & Rosenkrantz, S. L. (1993). Evidence for representations of perceptually similar natural categories by 3-month-old and 4-month-old infants. Perception, 22, 463-475.

Rips, L. J. (1989). Similarity, typicality, and categorization. In S. Vosniadou & A. Ortony (Eds.), Similarity and analogical reasoning (pp. 21-59). New York: Cambridge University Press.

Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382-439.

Shin, H. J., & Nosofsky, R. M. (1992). Similarity-scaling studies of dot-pattern classification and recognition. Journal of Experimental Psychology: General, 121, 278-304.

Smith, E. E., & Sloman, S. A. (1994). Similarity- versus rule-based categorization. Memory & Cognition, 22, 377-386.

Smith, J. D., & Minda, J. P. (2002). Distinguishing prototype-based and exemplar-based processes in dot-pattern category learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 800-811.

Stewart, N., & Chater, N. (2002). The effect of category variability in perceptual categorization. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 893-907.

Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins.

Wayne, A. (1995). Bayesianism and diverse evidence. Philosophy of Science, 62, 111-121.

Younger, B. A. (1985). The segregation of items into categories by ten-month-old infants. Child Development, 56, 1574-1583.

NOTES

1. These studies all involved nonnormal distributions of exemplars, which were generally sampled from donut-shaped regions of stimulus space. The effect of changes in the variance of normally distributed training exemplars has received little, if any, experimental attention.

2. The MANOVA computes an optimal linear combination of several dependent variables (DVs) and computes an ANOVA on the resulting composite DV. If the original DVs are correlated (positively or negatively), the MANOVA protects against the inflated Type I error rate produced by separate tests of the DVs (see Tabachnick & Fidell, 1996).

3. Thanks to Dorrit Billman for suggesting this explanation.


APPENDIX A
Details of Modeling for Generalization Data

According to the generalized context model (GCM), classification decisions are based on similarity comparisons between a test item and individual exemplars of each category stored in memory. In our generalization task, participants judged whether item i was a member of category C, category X, or neither. The GCM predicts the probability of a category C response on the basis of the total weighted similarity of the test item to exemplars of category C, divided by the weighted similarity of the item to exemplars of both categories, plus a constant response threshold for neither responses:

p(\text{C} \mid i) = \frac{\text{bias}_C \sum_{j \in C} e^{-s \cdot d_{ij}}}{\text{bias}_C \sum_{j \in C} e^{-s \cdot d_{ij}} + \text{bias}_X \sum_{l \in X} e^{-s \cdot d_{il}} + K}    (A1)

The term d_ij is the distance between stimulus i and exemplar j, computed as a weighted Euclidean distance:

d_{ij} = \sqrt{w_x (x_i - x_j)^2 + w_y (y_i - y_j)^2}    (A2)

where the xs and ys refer to head area and stem height measurements, respectively, and the dimension weights w_x and w_y sum to 1. biasC and biasX are response bias parameters, and s is a scaling parameter that determines the sensitivity of response probabilities to a given change in distance. The probability of a category X or neither response is computed in a similar way, by replacing the numerator of Equation A1 with the appropriate term from the denominator.

Without loss of generality, we factor the response threshold K into the product of a bias parameter and the average total similarity of each exemplar to all the others, E. Thus, K = biasneither · E, where

E = \frac{1}{N} \sum_{j} \sum_{l \neq j} e^{-s \cdot d_{jl}}    (A3)

N is the total number of exemplars. Higher values of biasneither produce more neither responses. There are two key advantages of this innovative treatment of the neither response strength. The effect of the parameter biasneither on goodness of fit is largely orthogonal to other parameters of the model, including the scaling parameter s. As a consequence, regressions converge more quickly than with an undifferentiated K response strength parameter. Also, biasneither is independent of the number of exemplars and the average similarity among them, so meaningful comparisons can be made between best-fit values of biasneither obtained across different stimulus sets. In contrast, such comparisons with the undifferentiated K are generally not meaningful.
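Equations A1-A3 translate directly into code. The sketch below is our transcription (the exemplar coordinates are invented; the value .68 for biasC|CX matches the overall median reported in the main text, but the predictions shown are purely illustrative):

```python
import math

def euclid(i, j, wx=0.5):
    # Equation A2: weighted Euclidean distance over the two stimulus
    # dimensions (head area, stem height); wy = 1 - wx.
    return math.sqrt(wx * (i[0] - j[0]) ** 2 + (1 - wx) * (i[1] - j[1]) ** 2)

def gcm_probs(item, ex_C, ex_X, s, bias_C_CX, bias_neither):
    """Transcription of Equations A1-A3; stimulus coordinates invented."""
    exemplars = ex_C + ex_X
    N = len(exemplars)
    # Equation A3: average total similarity of each exemplar to the others.
    E = sum(math.exp(-s * euclid(exemplars[a], exemplars[b]))
            for a in range(N) for b in range(N) if a != b) / N
    K = bias_neither * E  # "neither" response threshold
    # Recover bias_C and bias_X from the relative bias (see text).
    bias_C = bias_C_CX * (1 - bias_neither)
    bias_X = (1 - bias_C_CX) * (1 - bias_neither)
    sim_C = sum(math.exp(-s * euclid(item, j)) for j in ex_C)
    sim_X = sum(math.exp(-s * euclid(item, j)) for j in ex_X)
    denom = bias_C * sim_C + bias_X * sim_X + K  # Equation A1 denominator
    return {"C": bias_C * sim_C / denom,
            "X": bias_X * sim_X / denom,
            "neither": K / denom}

probs = gcm_probs((2.0, 2.0),
                  ex_C=[(1.0, 1.0), (1.2, 0.9)],
                  ex_X=[(3.0, 3.0), (3.4, 2.8)],
                  s=2.72, bias_C_CX=0.68, bias_neither=0.03)
print(round(sum(probs.values()), 6))  # 1.0
```

Because the three response strengths share a common denominator, the predicted probabilities always sum to 1 across the three response options.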

Also without loss of generality, we restrict the bias parameters (biasC, biasX, and biasneither) to the range from 0 to 1 and require them to sum to 1 (there are only two degrees of freedom among these three parameters). Finally, we are usually interested in the relative magnitudes of biasC and biasX compared with each other, so we define biasC = biasC|CX · (1 − biasneither) and biasX = biasX|CX · (1 − biasneither). The parameters biasC|CX and biasX|CX determine the relative magnitudes of the two category response biases. They range from 0 to 1 and sum to 1 (having only one degree of freedom between them).

The full model defined above is fully specified with reference to four independent parameters: s, wX, biasC|CX, and biasneither. We used this model to predict participants' classification responses, averaged across the three presentations of each flower (once in each color scheme). Thus, for each of the 23 test flowers, the model predicted the proportions of category C, category X, and neither responses out of three opportunities. Models were fit separately to the data for each participant by minimizing a likelihood-ratio chi-square goodness-of-fit statistic over the observed (O) and expected (E) response proportions:

G = 2 \sum O \ln(O / E)

To obtain the most accurate tests of parameter values, we eliminated parameters that did not make a significant contribution to goodness of fit. The difference in G obtained for a particular participant by two regression models that differ by the presence or absence of a single parameter can be tested against the chi-square distribution with one degree of freedom. Chi-square difference scores can be summed across all participants to obtain a composite statistic that can be tested against the chi-square distribution with degrees of freedom equal to the number of participants. Parameters that did not achieve a significance level of p < .10 on such tests were eliminated, to simplify the model as described below.
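For the single-parameter (1 df) case, the chi-square tail probability has a closed form via the complementary error function. The sketch below reproduces the p value reported for fixing biasC|CX at the overall median (the summed, many-df composite tests would instead need a general chi-square tail function, e.g., from a statistics library):

```python
import math

def chi2_sf_1df(x):
    # Upper-tail probability of a chi-square variate with 1 df:
    # P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

# Difference score reported in the text for fixing bias_C|CX at the
# overall median rather than the group medians: chi-square(1) = 0.21.
p = chi2_sf_1df(0.21)
print(round(p, 2))  # 0.65
```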

For the full model, with four free parameters for each participant, the goodness of fit was G(164) = 604. Setting s to 0 resulted in a significantly poorer fit to the data [χ2(82) = 920, p < .001; when s = 0, the parameter wX has no effect, so there are just two degrees of freedom per participant]. This result indicates that participants' judgments were related to the distances between test and training items (i.e., those distances gave better predictions of responses than assuming that all items were treated identically). Setting wX to 0.5 (equal weighting for both dimensions) did not significantly reduce fitness in comparison with the full model [χ2(41) = 48, p = .22]. Setting biasC|CX to 0.5 (equal bias for both categories) resulted in a significantly poorer


fit compared with the full model [χ2(41) = 81, p < .001]. The neither response bias, biasneither, is required, since participants did in fact avail themselves of the option to respond neither. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. Accordingly, the dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group, rather than varying freely among all participants within each group. Setting s to the median value of each group resulted in a significantly poorer fit to the data, compared with including a separate free parameter for each participant [χ2(39) = 58, p = .027]. Thus, distance scaling varied within groups. Setting biasC|CX to the median value within each group did not result in a significantly poorer fit, compared with including separate parameters for each participant [χ2(39) = 41, p = .37]. A further comparison found that setting this parameter to the overall median value (biasC|CX = 0.68) fit the data just as well as the model that used the median of each training group [χ2(1) = 0.21, p = .65]. Setting biasneither to the median value of each group resulted in a significantly poorer fit to the data [χ2(39) = 74, p < .001]. This result indicates that the bias to respond neither varied among participants. Taken together, these results indicate that s and biasneither are required for each participant; biasC|CX did not vary among participants.

In response to a reviewer's query, we also implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average, with separate scaling parameters s for exemplar and prototype similarities. The pure prototype model was much worse than the pure instance model (goodness of fit G = 1597 and 693, respectively). The best-fitting mixture model fit only slightly better than the pure exemplar model, and the difference was not significant [χ2(82) = 29.7, p > .1]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.

APPENDIX B
Details of Modeling for Recognition Data

In our recognition task, participants judged whether item i was old or new. The GCM predicts the probability of an old response on the basis of the total weighted similarity of the test item to all exemplars, divided by this similarity plus a constant response threshold for new responses:

p(\text{old} \mid i) = \frac{B_\text{old} \sum_{j} e^{-s \cdot d_{ij}}}{B_\text{old} \sum_{j} e^{-s \cdot d_{ij}} + K}    (B1)

The term d_ij is the Euclidean distance between stimulus i and exemplar j, as in Appendix A. Following Appendix A, we factor K into the product of the average similarity of each exemplar to all the others, E, multiplied by a bias term. Thus, K = Bnew · E. Higher values of Bnew produce more new responses. Bold and Bnew range from 0 to 1 and sum to 1 (there is one degree of freedom between them). This full model includes three independent parameters: s, wX, and Bnew. For each of the 31 test flowers, the model predicted the proportions of old and new responses out of three opportunities (once in each color scheme). Models were fit separately to the data for each participant.

Again, we eliminated parameters that did not make a significant contribution to goodness of fit. Setting s to 0 resulted in significantly worse fits to the data, compared with the full model [χ2(64) = 159, p < .001]. Setting wX to 0.5 did not significantly reduce the fit of the model [χ2(32) = 24, p = .85]. Bnew is required in order to predict any new judgments, so this parameter cannot be removed or set to 0. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. The dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group. The fit of the model with one s parameter for each group of participants was not significantly worse than the fit of the model with separate parameters for every participant [χ2(30) = 26, p = .69]. A further comparison found that setting s to the overall median of all participants resulted in a significantly worse fit to the data, compared with separate values for each training group [χ2(1) = 11.5, p < .001]. This result indicates that the diversity manipulation affected s. The fit of the model with one Bnew parameter for each group was significantly worse than that of the model with separate parameters for each participant [χ2(30) = 86, p < .001]. This result indicates that the bias to respond new varied among participants.

Again we implemented a post hoc prototype mixture model that combined similarity to instances and to pro-totypes as a weighted average The pure prototype model was substantially worse than the pure instance model(goodness of fit G 546 and 411 respectively) The best-fitting mixture model fit only slightly better than didthe pure exemplar model and the difference was not significant [c2(64) 142 p 1] Thus similarity toprototypes had no significant effect on responses after similarity to exemplars was taken into account

(Manuscript received December 12 2002revision accepted for publication June 18 2004)


296 HAHN BAILEY AND ELVIN

Analysis of response fractions. Our initial analysis of generalization examined four measures: the fraction of neither responses across all novel items (neither), the fraction of category X responses for novel boundary items (boundaryX), the fraction of category X responses for novel items on the periphery of category X (periphX), and the fraction of category X responses for the unseen category X prototype (protoX). The protoX scores were near ceiling and could not be analyzed parametrically. Although protoX scores were slightly lower for participants trained on the diverse category compared with the clustered category, this difference between groups was not significant [Mann–Whitney (corrected for ties), Z = 1.16, n1 = 20, n2 = 21, p = .25].

A one-way MANOVA was performed on the variables neither, boundaryX, and periphX, with training diversity (clustered or diverse) as a between-subjects factor. In combination, these variables were affected by training diversity [F(3,37) = 3.19, p = .035]. People trained on the diverse category gave fewer neither responses overall, more category X responses for boundary items, and more category X responses for items on the periphery of category X, as predicted. Roy–Bargmann stepdown analysis (details not reported here) indicated that the significance of the overall effect was due primarily to the periphX variable: people trained on the diverse category gave fewer neither responses to items on the periphery of category X and classified them as members of category X instead.

Regression models. In our stimuli, similarities between test and training items necessarily varied as a function of training diversity, so it is possible that the effect of diversity on periphX, for example, reflects the fact that peripheral items were closer to some of the diverse training exemplars than to the corresponding clustered exemplars. To factor out the effect of item similarities, and to test for a general effect of variability, we fit regression models, as described in Appendix A, to predict each participant's responses to the full set of test items (both old and new). The predicted probability of a particular response was a function of the similarity between each test item and the training exemplars of categories C and X. The full model included three parameters of interest: (1) a distance scaling parameter s that determined the sensitivity of response probabilities to a given change in head area or stem height; (2) a category C relative response bias, biasC|CX, that determined the bias (on a scale from 0 to 1) to classify something as category C rather than X; and (3) a neither response bias, biasneither, that determined the bias to say that something belonged to neither category C nor X. Changes in s and biasneither can have superficially similar effects on predicted response probabilities, but a change in biasneither will not affect the ratio of C to X responses, whereas a change in sensitivity will. The two parameters consequently have distinguishable effects on regression fits.

Goodness-of-fit statistics for models with and without separate biasC|CX parameters for each participant indicated that this parameter did not vary significantly within training groups [χ2(39) = 41, p = .37], nor between training groups [χ2(1) = 0.21, p = .65]. Subsequent analyses fixed this parameter at the overall median value, which favored category C over X by a ratio of more than 2:1. This level of bias means that an item that was equally similar to both categories would be classified as a member of category C twice as often as category X. This strong bias may reflect participants' expectation that the two categories would occur equally often, whereas there was a preponderance of novel items on the category X side of the stimulus space.3

With biasC|CX held at the overall median value, we determined best-fit values of s and biasneither for each participant. Figure 5 shows response contours for the two training groups, based on median values of s and biasneither. As the 70% contours show, the inclusion of a neither response allowed us to model category structure on the far side of each category, as well as on the side near the boundary. The median values of biasneither were 0.0297 and 0.0310 for the clustered and diverse training groups, respectively (IQRs = 0.0446 and 0.0549). This difference was not significant in a Mann–Whitney test (U = 175, n1 = 20, n2 = 21, p = .36).

The median values of s were 2.72 and 2.26 for the clustered and diverse groups (IQRs = 1.40 and 0.55).

Figure 5. Categorization training and test items and predicted response contours. Markers show training items for the clustered () or diverse () training groups, and novel test items (). Solid lines are for the clustered training group, showing the equiprobability category boundary and the 70% categorization contours for each category. Dotted lines show the same information for the diverse training group.


Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 131, n1 = 20, n2 = 21, p = .039). The effect of this difference in s values is illustrated in Figure 6A, which shows the influence of each training item on the classification of a test item, as a function of the distance between the two. The rescaled similarity of an exemplar at distance 0.4 (where the two curves are furthest apart) is 20% greater for participants who learned the diverse category than for those who learned the clustered category. Distance 0.4 corresponds to a circle around a training item on Figure 2 or 3, extending not quite one grid unit in each direction.
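The 20% figure follows directly from the GCM's exponential similarity function, e^(−s·d). A minimal check in Python, using the median s values reported above (the function name is our own; only the s values and the distance come from the text):

```python
import math

def influence(s: float, d: float) -> float:
    """GCM similarity (influence) of an exemplar at distance d, given scaling s."""
    return math.exp(-s * d)

s_clustered = 2.72  # median best-fit s, clustered training group
s_diverse = 2.26    # median best-fit s, diverse training group

d = 0.4  # distance at which the two influence curves differ most
ratio = influence(s_diverse, d) / influence(s_clustered, d)
print(f"Diverse/clustered influence ratio at d = {d}: {ratio:.3f}")  # ~1.20, i.e., 20% greater
```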

Did the effect of training diversity affect distance scaling for categories C and X equally? The GCM ordinarily assumes that the same scaling applies throughout the stimulus space, but Nosofsky and Johansen (2000) have suggested that category-specific scaling could explain the apparent bias toward diverse categories reported in Rips's (1989) classic pizza-or-coin study. We fit an enhanced regression model that had separate s parameters for categories C and X (for each participant) and found no significant difference in goodness of fit between the enhanced model and the original [χ2(41) = 37, p = .66]. Our results therefore suggest global rescaling across the whole stimulus space and do not indicate category-specific rescaling.

Recognition

The proportions of incorrect responses for various types of test item in the old–new recognition task are summarized in Table 3. Miss rates for training exemplars ranged from 22% to 33%, whereas false alarm rates for most new items were over 50%. The two training groups differed markedly in their responses to peripheral items beyond the outer edge of category X exemplars. People trained on the clustered category mistook these for old items on only 22% of trials, whereas those trained on the diverse category mistook them on 50% of trials.

Analysis of false alarms. Due to the very limited amount of relevant past research, the predictions (outlined above) for the recognition task were very general, so we conducted only a limited set of direct inferential tests. Our analysis pooled novel items into three groups and examined rates of false alarms (FA) for novel items on the category C side of the stimulus space (FA_C), the category X side (FA_X), and on the boundary between categories (FA_Bndy). A one-way MANOVA was performed on the three false alarm scores, with training diversity (clustered or diverse) as a between-subjects factor. In combination, the false alarm scores were affected by training diversity [F(3,28) = 5.48, p = .004]. Roy–Bargmann stepdown analysis indicated that the significance of the overall effect was due primarily to FA_X.

Table 3
Error Rates in Old–New Recognition Task by Training Diversity, Showing Percentage for Training Exemplars (Old) and Novel Test Items (New) as a Function of Stimulus Region and Type of Novel Item

                                 Clustered        Diverse
Region         Type              M      SD        M      SD
Old
  Control                        25     29        33     30
  Experimental                   22     24        29     28
New
  Boundary     Boundary          58     37        65     35
  Control      Prototype         83     24        79     24
               Inside            77     29        58     26
               Peripheral        58     35        46     34
  Experimental Prototype         83     24        77     23
               Inside            69     26        77     26
               Peripheral        22     30        50     36

Figure 6. Generalized context model similarity as a function of training group (clustered or diverse) and distance from an exemplar. Derived from median best-fit values of s for each training group, when tested on generalization (A) and recognition (B).


Participants trained on the clustered category made fewer false alarms on the category X side of the boundary (adjusted for FA_C and FA_Bndy: mean FA_X = 37.5%, SE = 4.47) than did participants trained on the diverse category (adjusted mean FA_X = 58.4%, SE = 4.47). Inspection of false alarm rates for individual items (Table 3) suggested that the effect was largely attributable to peripheral items.

Regression models. The effect of diversity on FA_X could be due to item similarities, since the peripheral items on the category X side were closer to the nearest exemplars of the diverse category than to those of the clustered one. To factor out item similarities, we fit regression models, as described in Appendix B, to predict each participant's responses to the full set of test items. The predicted probability of an old response was a function of the similarity between each test item and the training exemplars of both categories C and X. There were two free parameters of interest: the distance scaling parameter s, as discussed above, and a response bias Bnew that determined the bias (on a scale from 0 to 1) to say that something was new rather than old. Figure 7 shows response contours for the two training groups, based on median values of s and Bnew. The response probability distribution was generally broader and less peaked in the diverse condition than in the clustered condition.

The median values of Bnew were .23 and .33 for the clustered and diverse training groups, respectively (IQRs = .26 and .17). However, the difference was not statistically significant (U = 81, n1 = 16, n2 = 16, p = .080). Thus, as a nonsignificant trend, participants who learned the clustered category were somewhat less inclined to give new responses (equivalently, they required somewhat less evidence for an old response) than were participants who learned the diverse category.

The median values of s were 2.40 and 1.17 for the clustered and diverse training groups, respectively (IQRs = 1.48 and 1.40). Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 33, n1 = 16, n2 = 16, p < .001). This difference in sensitivity echoes that observed in the generalization study. The effect of the sensitivity difference on recognition is illustrated in Figure 6B. The difference in exemplar influence is greatest at distance 0.65 (about 1 1/3 grid units in Figures 2 and 3), where the influence is 168% greater for participants who learned the diverse category than for those who learned the clustered category.

Once more, there was no significant improvement in fit for a model incorporating separate scaling parameters for categories C and X, compared with the simpler model with a single global scaling parameter [χ2(32) = 23.4, p = .86].

GENERAL DISCUSSION

Our experiments suggest that the diversity of perceptual categories affects learning, generalization, and item recognition by altering the scale of similarities, as well as the positions of exemplars within the stimulus space. People learning a diverse category made more errors and gave slower responses than did people learning a clustered category. After having the same amount of category training, people who learned a diverse category were more likely than those who learned a clustered category to accept distant peripheral items as members of the category. And people who learned a diverse category were more likely to wrongly think that they had previously seen a distant peripheral item than were people who learned a clustered category. The effects of diversity on generalization and recognition were generally consistent with responding on the basis of item similarities. However, regression modeling with the GCM revealed a further effect: Participants' perception of the stimuli was altered, so that people who learned a diverse category required a greater physical difference to perceive the same psychological difference between stimuli. This perceptual rescaling effect led to lower levels of accuracy in the training task, wider generalization, and poorer item recognition, over and above effects that could be attributed to item similarities.

This study sought to distinguish different mechanisms that might produce diversity effects, and the use of three different but related tasks makes the present study particularly informative. There are numerous accounts of

Figure 7. Recognition test items and predicted response contours. Markers show training items for the clustered () or diverse () training groups, and novel test items (). Solid lines are for the clustered training condition, showing the 50% (outer) and 70% (inner) recognition contours. Dotted lines show the same information for the diverse training group.


category learning in the literature that could explain the reduced learning accuracy we observed in the diverse condition, but which do not simultaneously explain the effects we also observed in generalization and recognition. For example, Fried and Holyoak's (1984) category density model represents categories by their means (prototypes) and variances, estimated from a sample of exemplars. Diverse categories are learned more slowly than are clustered ones because estimates of the means and variances converge more slowly when the stimuli are less uniform. However, because the category density model is concerned only with categorization, it has nothing at all to say about the diversity effects we observed in old–new recognition. And although the category density model predicts that diversity will affect generalization, the pattern of responses in our data is predicted much better by exemplar similarities than by similarities to the category prototypes (cf. J. D. Smith & Minda, 2002; see Appendixes A and B for details).

Diversity effects could plausibly arise from item similarities, shifts in category boundaries (as reflected in categorization biases), or rescaling of similarity relations. A combination of methodological factors contributed to our ability to distinguish these different mechanisms. Regression models allowed us to separate out item similarities; to simultaneously consider response probabilities across a large set of test items (including old and new items, peripheral and internal as well as boundary items); and, in the generalization study, to simultaneously consider the distribution of responses across all three choices. We would not have detected the global rescaling effect if we had examined just a few critical items on the boundary between categories, because the effect of the diversity manipulation on individual items was quite small after item similarities were factored out. In the generalization task, the neither response increased the information obtained on each trial (just as a four-choice multiple-choice test is more informative than the same number of true/false questions) and made it possible to detect changes in category structure on the far sides of the categories.

Most important, the neither response made it possible in principle to distinguish between changes in category biases and rescaling of similarities. We found no evidence that similarity relations were category specific, with different scaling factors for different regions of stimulus space according to the variability of categories in those regions. Our data also showed no evidence that response biases varied as a function of category variability. Rather, there was a global change in sensitivity to psychological distance in response to increased variability. This finding is consistent with the effect of stimulus range observed by Braida and Durlach (1972) in studies of sound intensity perception. Our results show that the effect is not confined to unidimensional stimuli, and they also demonstrate that it affects standard category learning and generalization tasks, as well as item recognition.

With regard specifically to generalization, our modeling demonstrates that exemplar similarities can explain

the seemingly different results observed in previous studies for central, peripheral, and boundary items. Only a few previous studies have controlled exemplar similarities in their tests of diversity effects on generalization (Cohen et al., 2001; Rips, 1989; Stewart & Chater, 2002). These studies focused on boundary items between a clustered category and a diverse category. Rips's pizza-or-coin study found that judgments of category membership differed from judgments of category similarity, and in particular that categorization of boundary items was biased in favor of diverse categories rather than clustered categories. Rips concluded that because judgments for the categorization task differed from those for the similarity task, categorization could not be based on similarity. Nosofsky and colleagues have suggested alternative interpretations within the framework of the GCM (Cohen et al., 2001; Nosofsky & Johansen, 2000). Moreover, our own results and those of numerous other studies (e.g., Shin & Nosofsky, 1992; see also works cited in Nosofsky & Johansen, 2000) show that similarity scaling can vary markedly between different tasks. This factor alone could explain the differences that Rips observed between judgments of category membership and category similarity.

The same confound applies to Cohen et al.'s (2001) Experiment 1, which contrasted performance on identification and categorization tasks. On the basis of a single test item, Cohen et al. concluded that high-diversity categories exert a "pull" on category judgments. However, their results could equally be explained by global effects of the sort observed in the present study, that is, by similarity scaling throughout the stimulus space, adjusted according to the particular task as well as the overall diversity of the stimulus items. In contrast, Cohen et al.'s Experiment 2 provides more compelling evidence for a category-specific effect of diversity on categorization. However, because their analysis focused on a single boundary item, there is no way to tell whether the category-specific effect involved response biases or category-specific rescaling of similarities. These different mechanisms can only be distinguished by examining responses across a wider range of stimuli.

Finally, our finding that category diversity affects global similarity relations has implications for the interpretation of Stewart and Chater's (2002) results. Their analyses of individual data indicated that, for most participants, boundary items were less likely to be classified as members of an experimental category with high diversity than with low diversity. On the basis of these results, Stewart and Chater suggest that response biases might be sensitive to category diversity. Alternatively, their results could reflect the same kind of global rescaling of similarities that we observed in the present study.

Although our results point to a global effect of diversity that affects the whole stimulus space in learning, generalization, and recognition tasks, they do not identify what aspect of diversity is relevant (nor have previous studies of diversity effects in categorization). Participants might be sensitive to the total range of stimuli (Braida & Durlach, 1972), or they might be sensitive to


average absolute deviations from the grand mean, or average variance about category means, or some other specific measure of variability. Only detailed comparisons across various stimulus distributions will eventually distinguish among these possibilities.

REFERENCES

Ashby, F. G., & Maddox, W. T. (1990). Integrating information from separable psychological dimensions. Journal of Experimental Psychology: Human Perception & Performance, 16, 598-612.

Barsalou, L. W. (1983). Ad hoc categories. Memory & Cognition, 11, 211-227.

Braida, L. D., & Durlach, N. I. (1972). Intensity perception: II. Resolution in one-interval paradigms. Journal of the Acoustical Society of America, 51, 483-502.

Cohen, A. L., Nosofsky, R. M., & Zaki, S. R. (2001). Category variability, exemplar similarity, and perceptual classification. Memory & Cognition, 29, 1165-1175.

Corter, J. E., & Gluck, M. A. (1992). Explaining basic categories: Feature predictability and information. Psychological Bulletin, 111, 291-303.

Flannagan, M. J., Fried, L. S., & Holyoak, K. J. (1986). Distributional expectations and the induction of category structure. Journal of Experimental Psychology: Learning, Memory, & Cognition, 12, 241-256.

Fried, L. S., & Holyoak, K. J. (1984). Induction of category distributions: A framework for classification learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 234-257.

Hahn, U., & Ramscar, M. C. A. (2001). Mere similarity. In U. Hahn & M. C. A. Ramscar (Eds.), Similarity and categorization (pp. 257-272). Oxford: Oxford University Press.

Heit, E., & Hahn, U. (2001). Diversity-based reasoning in children. Cognitive Psychology, 43, 243-273.

Homa, D., & Vosburgh, R. (1976). Category breadth and the abstraction of prototypical information. Journal of Experimental Psychology: Human Learning & Memory, 2, 322-330.

Jones, G. V. (1983). Identifying basic categories. Psychological Bulletin, 94, 423-428.

Luce, R. D. (1959). Individual choice behavior: A theoretical analysis. New York: Wiley.

Mareschal, D., French, R. M., & Quinn, P. C. (2000). A connectionist account of asymmetric category learning in early infancy. Developmental Psychology, 36, 635-645.

McLaren, I. P. L. (1997). Categorization and perceptual learning: An analogue of the face inversion effect. Quarterly Journal of Experimental Psychology, 50A, 257-273.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Murphy, G. L., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289-316.

Neumann, P. G. (1977). Visual prototype formation with discontinuous representation of dimensions of variability. Memory & Cognition, 5, 187-197.

Nosofsky, R. M. (1984). Choice, similarity, and the context theory of classification. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 104-114.

Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.

Nosofsky, R. M. (1988). Similarity, frequency, and category representations. Journal of Experimental Psychology: Learning, Memory, & Cognition, 14, 54-65.

Nosofsky, R. M., & Johansen, M. K. (2000). Exemplar-based accounts of "multiple-system" phenomena in perceptual categorization. Psychonomic Bulletin & Review, 7, 375-402.

Osherson, D. N., Smith, E. E., Wilkie, O., Lopez, A., & Shafir, E. (1990). Category-based induction. Psychological Review, 97, 185-200.

Palmeri, T. J., & Nosofsky, R. M. (2001). Central tendencies, extreme points, and prototype enhancement effects in ill-defined perceptual categorization. Quarterly Journal of Experimental Psychology, 54A, 197-235.

Peterson, M. J., Meagher, R. B., Jr., Chait, H., & Gillie, S. (1973). The abstraction and generalization of dot patterns. Cognitive Psychology, 4, 378-398.

Posner, M. I., Goldsmith, R., & Welton, K. E., Jr. (1967). Perceived distance and the classification of distorted patterns. Journal of Experimental Psychology, 73, 28-38.

Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353-363.

Quinn, P. C., Eimas, P. D., & Rosenkrantz, S. L. (1993). Evidence for representations of perceptually similar natural categories by 3-month-old and 4-month-old infants. Perception, 22, 463-475.

Rips, L. J. (1989). Similarity, typicality, and categorization. In S. Vosniadou & A. Ortony (Eds.), Similarity and analogical reasoning (pp. 21-59). New York: Cambridge University Press.

Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382-439.

Shin, H. J., & Nosofsky, R. M. (1992). Similarity-scaling studies of dot-pattern classification and recognition. Journal of Experimental Psychology: General, 121, 278-304.

Smith, E. E., & Sloman, S. A. (1994). Similarity- versus rule-based categorization. Memory & Cognition, 22, 377-386.

Smith, J. D., & Minda, J. P. (2002). Distinguishing prototype-based and exemplar-based processes in dot-pattern category learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 800-811.

Stewart, N., & Chater, N. (2002). The effect of category variability in perceptual categorization. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 893-907.

Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins.

Wayne, A. (1995). Bayesianism and diverse evidence. Philosophy of Science, 62, 111-121.

Younger, B. A. (1985). The segregation of items into categories by ten-month-old infants. Child Development, 56, 1574-1583.

NOTES

1. These studies all involved nonnormal distributions of exemplars, which were generally sampled from donut-shaped regions of stimulus space. The effect of changes in the variance of normally distributed training exemplars has received little, if any, experimental attention.

2. The MANOVA computes an optimal linear combination of several dependent variables (DVs) and computes an ANOVA on the resulting composite DV. If the original DVs are correlated (positively or negatively), the MANOVA protects against the inflated Type I error rate produced by separate tests of the DVs (see Tabachnick & Fidell, 1996).

3. Thanks to Dorrit Billman for suggesting this explanation.


APPENDIX A
Details of Modeling for Generalization Data

According to the generalized context model (GCM), classification decisions are based on similarity comparisons between a test item and individual exemplars of each category stored in memory. In our generalization task, participants judged whether item i was a member of category C, category X, or neither. The GCM predicts the probability of a category C response on the basis of the total weighted similarity of the test item to exemplars of category C, divided by the weighted similarity of the item to exemplars of both categories, plus a constant response threshold for neither responses:

p(C | i) = [biasC · Σj∈C e^(−s·dij)] / [biasC · Σj∈C e^(−s·dij) + biasX · Σl∈X e^(−s·dil) + K]    (A1)

The term dij is the distance between stimulus i and exemplar j, computed as a weighted Euclidean distance:

dij = √[wx(xi − xj)² + wy(yi − yj)²]    (A2)

where the xs and ys refer to head area and stem height measurements, respectively, and the dimension weights wx and wy sum to 1. BiasC and biasX are response bias parameters, and s is a scaling parameter that determines the sensitivity of response probabilities to a given change in distance. The probability of a category X or neither response is computed in a similar way, by replacing the numerator of Equation A1 with the appropriate term from the denominator.

Without loss of generality, we factor the response threshold K into the product of a bias parameter and the average total similarity of each exemplar to all the others, E. Thus, K = biasneither · E, where

E = (1/N) · Σj Σl e^(−s·djl)    (A3)

N is the total number of exemplars. Higher values of biasneither produce more neither responses. There are two key advantages of this innovative treatment of the neither response strength. The effect of the parameter biasneither on goodness of fit is largely orthogonal to other parameters of the model, including the scaling parameter s. As a consequence, regressions converge more quickly than with an undifferentiated K response strength parameter. Also, biasneither is independent of the number of exemplars and the average similarity among them, so meaningful comparisons can be made between best-fit values of biasneither obtained across different stimulus sets. In contrast, such comparisons with the undifferentiated K are generally not meaningful.

Also without loss of generality, we restrict the bias parameters (biasC, biasX, and biasneither) to the range from 0 to 1 and require them to sum to 1 (there are only two degrees of freedom among these three parameters). Finally, we are usually interested in the relative magnitudes of biasC and biasX compared with each other, so we define biasC = biasC|CX · (1 − biasneither) and biasX = biasX|CX · (1 − biasneither). The parameters biasC|CX and biasX|CX determine the relative magnitudes of the two category response biases. They range from 0 to 1 and sum to 1 (having only one degree of freedom between them).

The full model defined above is fully specified with reference to four independent parameters: s, wX, biasC|CX, and biasneither. We used this model to predict participants' classification responses, averaged across the three presentations of each flower (once in each color scheme). Thus, for each of the 23 test flowers, the model predicted the proportions of category C, category X, and neither responses out of three opportunities. Models were fit separately to the data for each participant, by minimizing a likelihood-ratio chi-square goodness-of-fit statistic, G = 2 · Σ O ln(O/E), over the observed (O) and expected (E) response proportions.
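Equations A1-A3 can be expressed compactly in code. The following Python sketch is our own illustration: the exemplar coordinates and parameter values are invented, and only the equations themselves come from the text.

```python
import math

def distance(a, b, wx=0.5):
    """Weighted Euclidean distance between two (head area, stem height) points
    (Equation A2); the dimension weights wx and wy = 1 - wx sum to 1."""
    return math.sqrt(wx * (a[0] - b[0]) ** 2 + (1.0 - wx) * (a[1] - b[1]) ** 2)

def gcm_probabilities(item, cat_C, cat_X, s, wx, bias_C_CX, bias_neither):
    """Probabilities of category C, category X, and neither responses (Equation A1),
    with the neither threshold K = bias_neither * E (Equation A3)."""
    exemplars = cat_C + cat_X
    # E: average total similarity of each exemplar to all exemplars (Equation A3)
    E = sum(math.exp(-s * distance(j, l, wx))
            for j in exemplars for l in exemplars) / len(exemplars)
    K = bias_neither * E
    # Category biases scaled so that all three biases sum to 1
    bias_C = bias_C_CX * (1.0 - bias_neither)
    bias_X = (1.0 - bias_C_CX) * (1.0 - bias_neither)
    sim_C = sum(math.exp(-s * distance(item, j, wx)) for j in cat_C)
    sim_X = sum(math.exp(-s * distance(item, l, wx)) for l in cat_X)
    denom = bias_C * sim_C + bias_X * sim_X + K
    return bias_C * sim_C / denom, bias_X * sim_X / denom, K / denom

# Invented exemplars and parameter values, for illustration only
cat_C = [(0.0, 0.0), (0.2, 0.1)]
cat_X = [(1.0, 0.0), (1.2, 0.1)]
pC, pX, pN = gcm_probabilities((0.1, 0.0), cat_C, cat_X,
                               s=2.7, wx=0.5, bias_C_CX=0.68, bias_neither=0.03)
print(f"p(C) = {pC:.3f}, p(X) = {pX:.3f}, p(neither) = {pN:.3f}")
```

The three response probabilities sum to 1 by construction; as the text notes, the X and neither probabilities are obtained by swapping the appropriate term of the denominator into the numerator.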

To obtain the most accurate tests of parameter values, we eliminated parameters that did not make a significant contribution to goodness of fit. The difference in G obtained for a particular participant by two regression models that differ by the presence or absence of a single parameter can be tested against the chi-square distribution with one degree of freedom. Chi-square difference scores can be summed across all participants to obtain a composite statistic that can be tested against the chi-square distribution with degrees of freedom equal to the number of participants. Parameters that did not achieve a significance level of p < .10 on such tests were eliminated, to simplify the model as described below.
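The G statistic and the nested-model comparison described above can be sketched as follows; the per-participant G values here are invented for illustration:

```python
import math

def g_statistic(observed, expected):
    """Likelihood-ratio goodness of fit, G = 2 * sum(O * ln(O / E)), over observed
    and expected response counts (zero observed counts contribute nothing)."""
    return 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# G for one participant's responses to a single test item (counts out of three
# opportunities; numbers invented for illustration)
g = g_statistic([2, 1, 0], [1.65, 1.05, 0.30])

# Nested-model comparison: the G difference, summed across participants, is tested
# against chi-square with one df per dropped parameter per participant.
g_full    = [10.2,  8.7, 12.1]   # hypothetical per-participant G, full model
g_reduced = [11.0,  9.1, 15.3]   # same participants, one parameter removed
delta = sum(r - f for f, r in zip(g_full, g_reduced))
df = len(g_full)
print(f"chi2({df}) = {delta:.1f}")
```

The summed difference is then compared with the chi-square critical value (7.81 at p = .05 for 3 df); a nonsignificant difference indicates the dropped parameter is not needed.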

For the full model, with four free parameters for each participant, the goodness of fit was G(164) = 604. Setting s to 0 resulted in a significantly poorer fit to the data [χ2(82) = 920, p < .001; when s = 0, the parameter wX has no effect, so there are just two degrees of freedom per participant]. This result indicates that participants' judgments were related to the distances between test and training items (i.e., those distances gave better predictions of responses than assuming that all items were treated identically). Setting wX to 0.5 (equal weighting for both dimensions) did not significantly reduce fitness in comparison with the full model [χ2(41) = 48, p = .22]. Setting biasC|CX to 0.5 (equal bias for both categories) resulted in a significantly poorer



fit compared with the full model [χ2(41) = 81, p < .001]. The neither response bias, biasneither, is required, since participants did in fact avail themselves of the option to respond neither. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. Accordingly, the dimension weight parameter was set to 0.5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group, rather than varying freely among all participants within each group. Setting s to the median value of each group resulted in a significantly poorer fit to the data, compared with including a separate free parameter for each participant [χ2(39) = 58, p = .027]. Thus, distance scaling varied within groups. Setting biasC|CX to the median value within each group did not result in a significantly poorer fit, compared with including separate parameters for each participant [χ2(39) = 41, p = .37]. A further comparison found that setting this parameter to the overall median value (biasC|CX = 0.68) fit the data just as well as the model that used the median of each training group [χ2(1) = 0.21, p = .65]. Setting biasneither to the median value of each group resulted in a significantly poorer fit to the data [χ2(39) = 74, p < .001]. This result indicates that the bias to respond neither varied among participants. Taken together, these results indicate that s and biasneither are required for each participant; biasC|CX did not vary among participants.

In response to a reviewer's query, we also implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average, with separate scaling parameters s for exemplar and prototype similarities. The pure prototype model was much worse than the pure instance model (goodness of fit G = 1597 and 693, respectively). The best-fitting mixture model fit only slightly better than the pure exemplar model, and the difference was not significant [χ²(82) = 29.7, p > .1]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.

APPENDIX B
Details of Modeling for Recognition Data

In our recognition task, participants judged whether item i was old or new. The GCM predicts the probability of an old response on the basis of the total weighted similarity of the test item to all exemplars, divided by this similarity plus a constant response threshold for new responses:

p(old | i) = [B_old · Σ_j e^(-s·d_ij)] / [B_old · Σ_j e^(-s·d_ij) + K]    (B1)

The term d_ij is the Euclidean distance between stimulus i and exemplar j, as in Appendix A. Following Appendix A, we factor K into the product of the average similarity of each exemplar to all, E, multiplied by a bias term; thus K = B_new · E. Higher values of B_new produce more new responses. B_old and B_new range from 0 to 1 and sum to 1 (there is one degree of freedom between them). This full model includes three independent parameters: s, w_X, and B_new. For each of the 31 test flowers, the model predicted the proportions of old and new responses out of three opportunities (once in each color scheme). Models were fit separately to the data for each participant.
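As a concrete sketch, the old/new rule in Equation B1 can be written out as follows. This is our illustration, not the authors' fitting code; the function name, example coordinates, and default parameter values are all hypothetical.

```python
import math

def p_old(test, exemplars, s=1.0, wx=0.5, b_new=0.5):
    """Sketch of Equation B1: probability of an 'old' response.

    test: (x, y) stimulus; exemplars: list of (x, y) training items.
    s is the distance-scaling parameter, wx the weight on the first
    dimension (wy = 1 - wx), and b_new the bias toward 'new' responses.
    """
    wy = 1.0 - wx
    def dist(a, b):  # weighted Euclidean distance, as in Appendix A
        return math.sqrt(wx * (a[0] - b[0])**2 + wy * (a[1] - b[1])**2)
    # Total weighted similarity of the test item to all exemplars.
    sim = sum(math.exp(-s * dist(test, ex)) for ex in exemplars)
    # K = B_new * E, where E is the average similarity of each exemplar
    # to all exemplars; B_old = 1 - B_new.
    n = len(exemplars)
    E = sum(math.exp(-s * dist(j, l)) for j in exemplars for l in exemplars) / n
    b_old = 1.0 - b_new
    return b_old * sim / (b_old * sim + b_new * E)
```

With equal biases, a test item identical to a lone exemplar is judged old with probability .5, and the probability falls as the test item moves away from the training items.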

Again, we eliminated parameters that did not make a significant contribution to goodness of fit. Setting s to 0 resulted in significantly worse fits to the data compared with the full model [χ²(64) = 159, p < .001]. Setting w_X to .5 did not significantly reduce the fit of the model [χ²(32) = 24, p = .85]. B_new is required in order to predict any new judgments, so this parameter cannot be removed or set to 0. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. The dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group. The fit of the model with one s parameter for each group of participants was not significantly worse than the fit of the model with separate parameters for every participant [χ²(30) = 26, p = .69]. A further comparison found that setting s to the overall median of all participants resulted in a significantly worse fit to the data compared with separate values for each training group [χ²(1) = 11.5, p < .001]. This result indicates that the diversity manipulation affected s. The fit of the model with one B_new parameter for each group was significantly worse than the model with separate parameters for each participant [χ²(30) = 86, p < .001]. This result indicates that the bias to respond new varied among participants.

Again, we implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average. The pure prototype model was substantially worse than the pure instance model (goodness of fit G = 546 and 411, respectively). The best-fitting mixture model fit only slightly better than did the pure exemplar model, and the difference was not significant [χ²(64) = 14.2, p > .1]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.
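A minimal sketch of such a mixture, assuming a weighted average of mean exemplar similarity and similarity to the centroid prototype. The weighting scheme (w = 1 gives a pure exemplar model, w = 0 a pure prototype model) and all names are our illustration, not the authors' implementation.

```python
import math

def mixture_similarity(test, exemplars, w=0.5, s_ex=1.0, s_proto=1.0):
    """Weighted average of exemplar and prototype similarity (sketch).

    Separate scaling parameters s_ex and s_proto apply to exemplar
    and prototype similarities, as in the mixture model described above.
    """
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    # Exemplar term: mean similarity to the stored instances.
    sim_ex = sum(math.exp(-s_ex * dist(test, e)) for e in exemplars) / len(exemplars)
    # Prototype term: similarity to the centroid of the exemplars.
    proto = (sum(e[0] for e in exemplars) / len(exemplars),
             sum(e[1] for e in exemplars) / len(exemplars))
    sim_proto = math.exp(-s_proto * dist(test, proto))
    return w * sim_ex + (1.0 - w) * sim_proto
```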

(Manuscript received December 12, 2002; revision accepted for publication June 18, 2004.)



Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 131, n1 = 20, n2 = 21, p = .039). The effect of this difference in s values is illustrated in Figure 6A, which shows the influence of each training item on the classification of a test item as a function of the distance between the two. The rescaled similarity of an exemplar at distance 0.4 (where the two curves are furthest apart) is 20% greater for participants who learned the diverse category than for those who learned the clustered category. Distance 0.4 corresponds to a circle around a training item on Figure 2 or 3, extending not quite one grid unit in each direction.

Did the effect of training diversity affect distance scaling for categories C and X equally? The GCM ordinarily assumes that the same scaling applies throughout the stimulus space, but Nosofsky and Johansen (2000) have suggested that category-specific scaling could explain the apparent bias toward diverse categories reported in Rips' (1989) classic pizza-or-coin study. We fit an enhanced regression model that had separate s parameters for categories C and X (for each participant) and found no significant difference in goodness of fit between the enhanced model and the original [χ²(41) = 37, p = .66]. Our results therefore suggest global rescaling across the whole stimulus space and do not indicate category-specific rescaling.

Recognition

The proportions of incorrect responses for various types of test item in the old–new recognition task are summarized in Table 3. Miss rates for training exemplars ranged from 22% to 33%, whereas false alarm rates for most new items were over 50%. The two training groups differed markedly in their responses to peripheral items beyond the outer edge of category X exemplars. People trained on the clustered category mistook these for old items on only 22% of trials, whereas those trained on the diverse category mistook them on 50% of trials.

Analysis of false alarms. Due to the very limited amount of relevant past research, the predictions (outlined above) for the recognition task were very general, so we conducted only a limited set of direct inferential tests. Our analysis pooled novel items into three groups and examined rates of false alarms (FA) for novel items on the category C side of the stimulus space (FA_C), the category X side (FA_X), and on the boundary between categories (FA_Bndy). A one-way MANOVA was performed on the three false alarm scores, with training diversity (clustered or diverse) as a between-subjects factor. In combination, the false alarm scores were affected by training diversity [F(3,28) = 5.48, p = .004]. Roy–Bargmann stepdown analysis indicated that the significance of the overall effect was due primarily to FA_X: participants trained on the clustered category made fewer false alarms on the category X side of the boundary (adjusted for FA_C and FA_Bndy, mean FA_X = 37.5%, SE = 4.47) than did participants trained on the diverse category (adjusted mean FA_X = 58.4%, SE = 4.47). Inspection of false alarm rates for individual items (Table 3) suggested that the effect was largely attributable to peripheral items.

Table 3
Error Rates in Old–New Recognition Task by Training Diversity, Showing Percentage for Training Exemplars (Old) and Novel Test Items (New) as a Function of Stimulus Region and Type of Novel Item

                                   Clustered        Diverse
Region  Type                       M      SD        M      SD
Old     Control                    25     29        33     30
        Experimental               22     24        29     28
New     Boundary     Boundary      58     37        65     35
        Control      Prototype     83     24        79     24
                     Inside        77     29        58     26
                     Peripheral    58     35        46     34
        Experimental Prototype     83     24        77     23
                     Inside        69     26        77     26
                     Peripheral    22     30        50     36

Figure 6. Generalized context model similarity as a function of training group (clustered or diverse) and distance from an exemplar. Derived from median best-fit values of s for each training group, when tested on generalization (A) and recognition (B).

Regression models. The effect of diversity on FA_X could be due to item similarities, since the peripheral items on the category X side were closer to the nearest exemplars of the diverse category than to those of the clustered one. To factor out item similarities, we fit regression models, as described in Appendix B, to predict each participant's responses to the full set of test items. The predicted probability of an old response was a function of the similarity between each test item and the training exemplars of both categories C and X. There were two free parameters of interest: the distance scaling parameter s, as discussed above, and a response bias B_new that determined the bias (on a scale from 0 to 1) to say that something was new rather than old. Figure 7 shows response contours for the two training groups, based on median values of s and B_new. The response probability distribution was generally broader and less peaked in the diverse condition than in the clustered condition.

The median values of B_new were .23 and .33 for the clustered and diverse training groups, respectively (IQRs = .26 and .17). However, the difference was not statistically significant (U = 81, n1 = 16, n2 = 16, p = .080). Thus, as a nonsignificant trend, participants who learned the clustered category were somewhat less inclined to give new responses (equivalently, they required somewhat less evidence for an old response) than were participants who learned the diverse category.

The median values of s were 2.40 and 1.17 for the clustered and diverse training groups, respectively (IQRs = 1.48 and 1.40). Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 33, n1 = 16, n2 = 16, p < .001). This difference in sensitivity echoes that observed in the generalization study. The effect of the sensitivity difference on recognition is illustrated in Figure 6B. The difference in exemplar influence is greatest at distance 0.65 (about 1 1/3 grid units in Figures 2 and 3), where the influence is 168% greater for participants who learned the diverse category than for those who learned the clustered category.

Once more, there was no significant improvement in fit for a model incorporating separate scaling parameters for categories C and X, compared with the simpler model with a single global scaling parameter [χ²(32) = 23.4, p = .86].

GENERAL DISCUSSION

Our experiments suggest that the diversity of perceptual categories affects learning, generalization, and item recognition by altering the scale of similarities, as well as the positions of exemplars within the stimulus space. People learning a diverse category made more errors and gave slower responses than did people learning a clustered category. After having the same amount of category training, people who learned a diverse category were more likely than those who learned a clustered category to accept distant peripheral items as members of the category. And people who learned a diverse category were more likely to wrongly think that they had previously seen a distant peripheral item than were people who learned a clustered category. The effects of diversity on generalization and recognition were generally consistent with responding on the basis of item similarities. However, regression modeling with the GCM revealed a further effect: participants' perception of the stimuli was altered, so that people who learned a diverse category required a greater physical difference to perceive the same psychological difference between stimuli. This perceptual rescaling effect led to lower levels of accuracy in the training task, wider generalization, and poorer item recognition, over and above effects that could be attributed to item similarities.

This study sought to distinguish different mechanisms that might produce diversity effects, and the use of three different but related tasks makes the present study particularly informative. There are numerous accounts of category learning in the literature that could explain the reduced learning accuracy we observed in the diverse condition, but which do not simultaneously explain the effects we also observed in generalization and recognition. For example, Fried and Holyoak's (1984) category density model represents categories by their means (prototypes) and variances, estimated from a sample of exemplars. Diverse categories are learned more slowly than are clustered ones because estimates of the means and variances converge more slowly when the stimuli are less uniform. However, because the category density model is concerned only with categorization, it has nothing at all to say about the diversity effects we observed in old–new recognition. And although the category density model predicts that diversity will affect generalization, the pattern of responses in our data is predicted much better by exemplar similarities than by similarities to the category prototypes (cf. J. D. Smith & Minda, 2002; see Appendixes A and B for details).

Figure 7. Recognition test items and predicted response contours. Markers show training items for the clustered or diverse training groups, and novel test items. Solid lines are for the clustered training condition, showing 50% (outer) and 70% (inner) recognition contours. Dotted lines show the same information for the diverse training group.

Diversity effects could plausibly arise from item similarities, shifts in category boundaries (as reflected in categorization biases), or rescaling of similarity relations. A combination of methodological factors contributed to our ability to distinguish these different mechanisms. Regression models allowed us to separate out item similarities, to simultaneously consider response probabilities across a large set of test items (including old and new items, peripheral and internal as well as boundary items), and, in the generalization study, to simultaneously consider the distribution of responses across all three choices. We would not have detected the global rescaling effect if we had examined just a few critical items on the boundary between categories, because the effect of the diversity manipulation on individual items was quite small after item similarities were factored out. In the generalization task, the neither response increased the information obtained on each trial (just as a four-choice multiple choice test is more informative than the same number of true/false questions) and made it possible to detect changes in category structure on the far sides of the categories.

Most important, the neither response made it possible in principle to distinguish between changes in category biases and rescaling of similarities. We found no evidence that similarity relations were category specific, with different scaling factors for different regions of stimulus space according to the variability of categories in those regions. Our data also showed no evidence that response biases varied as a function of category variability. Rather, there was a global change in sensitivity to psychological distance in response to increased variability. This finding is consistent with the effect of stimulus range observed by Braida and Durlach (1972) in studies of sound intensity perception. Our results show that the effect is not confined to unidimensional stimuli, and they also demonstrate that it affects standard category learning and generalization tasks, as well as item recognition.

With regard specifically to generalization, our modeling demonstrates that exemplar similarities can explain the seemingly different results observed in previous studies for central, peripheral, and boundary items. Only a few previous studies have controlled exemplar similarities in their tests of diversity effects on generalization (Cohen et al., 2001; Rips, 1989; Stewart & Chater, 2002). These studies focused on boundary items between a clustered category and a diverse category. Rips' pizza-or-coin study found that judgments of category membership differed from judgments of category similarity, and in particular that categorization of boundary items was biased in favor of diverse categories rather than clustered categories. Rips concluded that because judgments for the categorization task differed from those for the similarity task, categorization could not be based on similarity. Nosofsky and colleagues have suggested alternative interpretations within the framework of the GCM (Cohen et al., 2001; Nosofsky & Johansen, 2000). Moreover, our own results and those of numerous other studies (e.g., Shin & Nosofsky, 1992; see also works cited in Nosofsky & Johansen, 2000) show that similarity scaling can vary markedly between different tasks. This factor alone could explain the differences that Rips observed between judgments of category membership and category similarity.

The same confound applies to Cohen et al.'s (2001) Experiment 1, which contrasted performance on identification and categorization tasks. On the basis of a single test item, Cohen et al. concluded that high-diversity categories exert a "pull" on category judgments. However, their results could equally be explained by global effects of the sort observed in the present study, that is, by similarity scaling throughout the stimulus space, adjusted according to the particular task as well as the overall diversity of the stimulus items. In contrast, Cohen et al.'s Experiment 2 provides more compelling evidence for a category-specific effect of diversity on categorization. However, because their analysis focused on a single boundary item, there is no way to tell whether the category-specific effect involved response biases or category-specific rescaling of similarities. These different mechanisms can only be distinguished by examining responses across a wider range of stimuli.

Finally, our finding that category diversity affects global similarity relations has implications for the interpretation of Stewart and Chater's (2002) results. Their analyses of individual data indicated that, for most participants, boundary items were less likely to be classified as members of an experimental category with high diversity than with low diversity. On the basis of these results, Stewart and Chater suggest that response biases might be sensitive to category diversity. Alternatively, their results could reflect the same kind of global rescaling of similarities that we observed in the present study.

Although our results point to a global effect of diversity that affects the whole stimulus space in learning, generalization, and recognition tasks, they do not identify what aspect of diversity is relevant (nor have previous studies of diversity effects in categorization). Participants might be sensitive to the total range of stimuli (Braida & Durlach, 1972), or they might be sensitive to average absolute deviations from the grand mean, or average variance about category means, or some other specific measure of variability. Only detailed comparisons across various stimulus distributions will eventually distinguish among these possibilities.

REFERENCES

Ashby, F. G., & Maddox, W. T. (1990). Integrating information from separable psychological dimensions. Journal of Experimental Psychology: Human Perception & Performance, 16, 598-612.

Barsalou, L. W. (1983). Ad hoc categories. Memory & Cognition, 11, 211-227.

Braida, L. D., & Durlach, N. I. (1972). Intensity perception: II. Resolution in one-interval paradigms. Journal of the Acoustical Society of America, 51, 483-502.

Cohen, A. L., Nosofsky, R. M., & Zaki, S. R. (2001). Category variability, exemplar similarity, and perceptual classification. Memory & Cognition, 29, 1165-1175.

Corter, J. E., & Gluck, M. A. (1992). Explaining basic categories: Feature predictability and information. Psychological Bulletin, 111, 291-303.

Flannagan, M. J., Fried, L. S., & Holyoak, K. J. (1986). Distributional expectations and the induction of category structure. Journal of Experimental Psychology: Learning, Memory, & Cognition, 12, 241-256.

Fried, L. S., & Holyoak, K. J. (1984). Induction of category distributions: A framework for classification learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 234-257.

Hahn, U., & Ramscar, M. C. A. (2001). Mere similarity. In U. Hahn & M. C. A. Ramscar (Eds.), Similarity and categorization (pp. 257-272). Oxford: Oxford University Press.

Heit, E., & Hahn, U. (2001). Diversity-based reasoning in children. Cognitive Psychology, 43, 243-273.

Homa, D., & Vosburgh, R. (1976). Category breadth and the abstraction of prototypical information. Journal of Experimental Psychology: Human Learning & Memory, 2, 322-330.

Jones, G. V. (1983). Identifying basic categories. Psychological Bulletin, 94, 423-428.

Luce, R. D. (1959). Individual choice behavior: A theoretical analysis. New York: Wiley.

Mareschal, D., French, R. M., & Quinn, P. C. (2000). A connectionist account of asymmetric category learning in early infancy. Developmental Psychology, 36, 635-645.

McLaren, I. P. L. (1997). Categorization and perceptual learning: An analogue of the face inversion effect. Quarterly Journal of Experimental Psychology, 50A, 257-273.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Murphy, G. L., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289-316.

Neumann, P. G. (1977). Visual prototype formation with discontinuous representation of dimensions of variability. Memory & Cognition, 5, 187-197.

Nosofsky, R. M. (1984). Choice, similarity, and the context theory of classification. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 104-114.

Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.

Nosofsky, R. M. (1988). Similarity, frequency, and category representations. Journal of Experimental Psychology: Learning, Memory, & Cognition, 14, 54-65.

Nosofsky, R. M., & Johansen, M. K. (2000). Exemplar-based accounts of "multiple-system" phenomena in perceptual categorization. Psychonomic Bulletin & Review, 7, 375-402.

Osherson, D. N., Smith, E. E., Wilkie, O., Lopez, A., & Shafir, E. (1990). Category-based induction. Psychological Review, 97, 185-200.

Palmeri, T. J., & Nosofsky, R. M. (2001). Central tendencies, extreme points, and prototype enhancement effects in ill-defined perceptual categorization. Quarterly Journal of Experimental Psychology, 54A, 197-235.

Peterson, M. J., Meagher, R. B., Jr., Chait, H., & Gillie, S. (1973). The abstraction and generalization of dot patterns. Cognitive Psychology, 4, 378-398.

Posner, M. I., Goldsmith, R., & Welton, K. E., Jr. (1967). Perceived distance and the classification of distorted patterns. Journal of Experimental Psychology, 73, 28-38.

Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353-363.

Quinn, P. C., Eimas, P. D., & Rosenkrantz, S. L. (1993). Evidence for representations of perceptually similar natural categories by 3-month-old and 4-month-old infants. Perception, 22, 463-475.

Rips, L. J. (1989). Similarity, typicality, and categorization. In S. Vosniadou & A. Ortony (Eds.), Similarity and analogical reasoning (pp. 21-59). New York: Cambridge University Press.

Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382-439.

Shin, H. J., & Nosofsky, R. M. (1992). Similarity-scaling studies of dot-pattern classification and recognition. Journal of Experimental Psychology: General, 121, 278-304.

Smith, E. E., & Sloman, S. A. (1994). Similarity- versus rule-based categorization. Memory & Cognition, 22, 377-386.

Smith, J. D., & Minda, J. P. (2002). Distinguishing prototype-based and exemplar-based processes in dot-pattern category learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 800-811.

Stewart, N., & Chater, N. (2002). The effect of category variability in perceptual categorization. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 893-907.

Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins.

Wayne, A. (1995). Bayesianism and diverse evidence. Philosophy of Science, 62, 111-121.

Younger, B. A. (1985). The segregation of items into categories by ten-month-old infants. Child Development, 56, 1574-1583.

NOTES

1. These studies all involved nonnormal distributions of exemplars, which were generally sampled from donut-shaped regions of stimulus space. The effect of changes in the variance of normally distributed training exemplars has received little, if any, experimental attention.

2. The MANOVA computes an optimal linear combination of several dependent variables (DVs) and computes an ANOVA on the resulting composite DV. If the original DVs are correlated (positively or negatively), the MANOVA protects against the inflated Type I error rate produced by separate tests of the DVs (see Tabachnick & Fidell, 1996).

3. Thanks to Dorrit Billman for suggesting this explanation.


APPENDIX A
Details of Modeling for Generalization Data

According to the generalized context model (GCM), classification decisions are based on similarity comparisons between a test item and individual exemplars of each category stored in memory. In our generalization task, participants judged whether item i was a member of category C, category X, or neither. The GCM predicts the probability of a category C response on the basis of the total weighted similarity of the test item to exemplars of category C, divided by the weighted similarity of the item to exemplars of both categories, plus a constant response threshold for neither responses:

p(C | i) = [bias_C · Σ_{j∈C} e^(-s·d_ij)] / [bias_C · Σ_{j∈C} e^(-s·d_ij) + bias_X · Σ_{l∈X} e^(-s·d_il) + K]    (A1)

The term d_ij is the distance between stimulus i and exemplar j, computed as a weighted Euclidean distance:

d_ij = sqrt[ w_x·(x_i - x_j)² + w_y·(y_i - y_j)² ]    (A2)

where the xs and ys refer to head area and stem height measurements, respectively, and the dimension weights w_x and w_y sum to 1. bias_C and bias_X are response bias parameters, and s is a scaling parameter that determines the sensitivity of response probabilities to a given change in distance. The probability of a category X or neither response is computed in a similar way, by replacing the numerator of Equation A1 with the appropriate term from the denominator.

Without loss of generality, we factor the response threshold K into the product of a bias parameter and the average total similarity of each exemplar to all the others, E. Thus K = bias_neither · E, where

E = (1/N) · Σ_j Σ_l e^(-s·d_jl)    (A3)

N is the total number of exemplars. Higher values of bias_neither produce more neither responses. There are two key advantages of this innovative treatment of the neither response strength. First, the effect of the parameter bias_neither on goodness of fit is largely orthogonal to the other parameters of the model, including the scaling parameter s; as a consequence, regressions converge more quickly than with an undifferentiated K response strength parameter. Second, bias_neither is independent of the number of exemplars and the average similarity among them, so meaningful comparisons can be made between best-fit values of bias_neither obtained across different stimulus sets. In contrast, such comparisons with the undifferentiated K are generally not meaningful.

Also without loss of generality, we restrict the bias parameters (bias_C, bias_X, and bias_neither) to the range from 0 to 1 and require them to sum to 1 (there are only two degrees of freedom among these three parameters). Finally, we are usually interested in the relative magnitudes of bias_C and bias_X compared with each other, so we define bias_C = bias_C|CX · (1 - bias_neither) and bias_X = bias_X|CX · (1 - bias_neither). The parameters bias_C|CX and bias_X|CX determine the relative magnitudes of the two category response biases; they range from 0 to 1 and sum to 1 (having only one degree of freedom between them).
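Putting Equations A1-A3 and this reparameterization together, the classification rule can be sketched in Python. This is our illustration; the function name, coordinates, and parameter values are hypothetical, not the study's stimuli or fitted estimates.

```python
import math

def classify_probs(test, cat_C, cat_X, s=1.0, wx=0.5,
                   bias_C_given_CX=0.5, bias_neither=0.2):
    """Sketch of Equations A1-A3: P(C), P(X), and P(neither) for a test item.

    cat_C and cat_X are lists of (x, y) exemplars of the two categories.
    """
    wy = 1.0 - wx
    def dist(a, b):  # Equation A2: weighted Euclidean distance
        return math.sqrt(wx * (a[0] - b[0])**2 + wy * (a[1] - b[1])**2)
    exemplars = cat_C + cat_X
    n = len(exemplars)
    # Equation A3: average total similarity of each exemplar to all exemplars.
    E = sum(math.exp(-s * dist(j, l)) for j in exemplars for l in exemplars) / n
    K = bias_neither * E
    # Recover bias_C and bias_X from the relative and neither biases.
    bias_C = bias_C_given_CX * (1.0 - bias_neither)
    bias_X = (1.0 - bias_C_given_CX) * (1.0 - bias_neither)
    sim_C = sum(math.exp(-s * dist(test, j)) for j in cat_C)
    sim_X = sum(math.exp(-s * dist(test, l)) for l in cat_X)
    denom = bias_C * sim_C + bias_X * sim_X + K  # Equation A1 denominator
    return (bias_C * sim_C / denom,   # P(category C)
            bias_X * sim_X / denom,   # P(category X)
            K / denom)                # P(neither)
```

Because the three probabilities share one denominator, they always sum to 1, and a test item equidistant from both categories (with equal category biases) receives equal C and X probabilities.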

The full model defined above is fully specified with reference to four independent parameters: s, w_X, bias_C|CX, and bias_neither. We used this model to predict participants' classification responses, averaged across the three presentations of each flower (once in each color scheme). Thus, for each of the 23 test flowers, the model predicted the proportions of category C, category X, and neither responses out of three opportunities. Models were fit separately to the data for each participant, by minimizing a likelihood-ratio chi-square goodness of fit statistic over the observed (O) and expected (E) response proportions:

G = 2 · Σ O · ln(O/E)

To obtain the most accurate tests of parameter values, we eliminated parameters that did not make a significant contribution to goodness of fit. The difference in G obtained for a particular participant by two regression models that differ by the presence or absence of a single parameter can be tested against the chi-square distribution with one degree of freedom. Chi-square difference scores can be summed across all participants to obtain a composite statistic that can be tested against the chi-square distribution with degrees of freedom equal to the number of participants. Parameters that did not achieve a significance level of p < .10 on such tests were eliminated, to simplify the model as described below.
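The statistic and the composite difference test just described can be sketched as follows; this is an illustration of the formulas, not the authors' actual fitting code.

```python
import math

def g_statistic(observed, expected):
    """Likelihood-ratio goodness of fit: G = 2 * sum(O * ln(O / E)).

    observed and expected are matched response counts (or proportions);
    cells with O = 0 contribute nothing to the sum.
    """
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

def composite_g_difference(g_restricted, g_full):
    """Sum per-participant differences in G between two nested models.

    The result is compared against the chi-square distribution, with
    degrees of freedom equal to the number of parameters removed,
    summed across participants.
    """
    return sum(r - f for r, f in zip(g_restricted, g_full))
```

When the model reproduces the observations exactly, G is 0; larger values of G indicate poorer fit.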


Again we implemented a post hoc prototype mixture model that combined similarity to instances and to pro-totypes as a weighted average The pure prototype model was substantially worse than the pure instance model(goodness of fit G 546 and 411 respectively) The best-fitting mixture model fit only slightly better than didthe pure exemplar model and the difference was not significant [c2(64) 142 p 1] Thus similarity toprototypes had no significant effect on responses after similarity to exemplars was taken into account

(Manuscript received December 12 2002revision accepted for publication June 18 2004)

p i

e

e K

s di j

j

s di j

j

response

B

Bold

old

old

|

( ) =

+

- loz

- loz

Acirc

Acirc

298 HAHN BAILEY AND ELVIN

ticipants trained on the clustered category made fewer false alarms on the category X side of the boundary (adjusted for FA_C and FA_Bndy, mean FA_X = 3.75, SE = .447) than did participants trained on the diverse category (adjusted mean FA_X = 5.84, SE = .447). Inspection of false alarm rates for individual items (Table 3) suggested that the effect was largely attributable to peripheral items.

Regression models. The effect of diversity on FA_X could be due to item similarities, since the peripheral items on the category X side were closer to the nearest exemplars of the diverse category than to those of the clustered one. To factor out item similarities, we fit regression models, as described in Appendix B, to predict each participant's responses to the full set of test items. The predicted probability of an old response was a function of the similarity between each test item and the training exemplars of both categories C and X. There were two free parameters of interest, including the distance scaling parameter s, as discussed above, and a response bias Bnew that determined the bias (on a scale from 0 to 1) to say that something was new rather than old. Figure 7 shows response contours for the two training groups, based on median values of s and Bnew. The response probability distribution was generally broader and less peaked in the diverse condition than in the clustered condition.

The median values of Bnew were .23 and .33 for the clustered and diverse training groups, respectively (IQRs = .26 and .17). However, the difference was not statistically significant (U = 81, n1 = 16, n2 = 16, p = .080). Thus, as a nonsignificant trend, participants who learned the clustered category were somewhat less inclined to give new responses (equivalently, they required somewhat less evidence for an old response) than were participants who learned the diverse category.

The median values of s were 2.40 and 1.17 for the clustered and diverse training groups, respectively (IQRs = 1.48 and 1.40). Participants who learned the diverse category had significantly lower sensitivity to distance than did those who learned the clustered category (U = 33, n1 = 16, n2 = 16, p < .001). This difference in sensitivity echoes that observed in the generalization study. The effect of the sensitivity difference on recognition is illustrated in Figure 6B. The difference in exemplar influence is greatest at distance 0.65 (about 1 1/3 grid units in Figures 2 and 3), where the influence is 168% greater for participants who learned the diverse category than for those who learned the clustered category.

Once more, there was no significant improvement in fit for a model incorporating separate scaling parameters for categories C and X, compared with the simpler model with a single global scaling parameter [χ²(32) = 23.4, p = .86].

GENERAL DISCUSSION

Our experiments suggest that the diversity of perceptual categories affects learning, generalization, and item recognition by altering the scale of similarities as well as the positions of exemplars within the stimulus space. People learning a diverse category made more errors and gave slower responses than did people learning a clustered category. After having the same amount of category training, people who learned a diverse category were more likely than those who learned a clustered category to accept distant peripheral items as members of the category. And people who learned a diverse category were more likely to wrongly think that they had previously seen a distant peripheral item than were people who learned a clustered category. The effects of diversity on generalization and recognition were generally consistent with responding on the basis of item similarities. However, regression modeling with the GCM revealed a further effect: Participants' perception of the stimuli was altered, so that people who learned a diverse category required a greater physical difference to perceive the same psychological difference between stimuli. This perceptual rescaling effect led to lower levels of accuracy in the training task, wider generalization, and poorer item recognition, over and above effects that could be attributed to item similarities.

Figure 7. Recognition test items and predicted response contours. Markers show training items for clustered () or diverse () training groups, and novel test items (). Solid lines are for the clustered training condition, showing .50 (outer) and .70 (inner) recognition contours. Dotted lines show the same information for the diverse training group.

This study sought to distinguish different mechanisms that might produce diversity effects, and the use of three different but related tasks makes the present study particularly informative. There are numerous accounts of category learning in the literature that could explain the reduced learning accuracy we observed in the diverse condition, but which do not simultaneously explain the effects we also observed in generalization and recognition. For example, Fried and Holyoak's (1984) category density model represents categories by their means (prototypes) and variances, estimated from a sample of exemplars. Diverse categories are learned more slowly than are clustered ones because estimates of the means and variances converge more slowly when the stimuli are less uniform. However, because the category density model is concerned only with categorization, it has nothing at all to say about the diversity effects we observed in old–new recognition. And although the category density model predicts that diversity will affect generalization, the pattern of responses in our data is predicted much better by exemplar similarities than by similarities to the category prototypes (cf. J. D. Smith & Minda, 2002; see Appendixes A and B for details).

Diversity effects could plausibly arise from item similarities, shifts in category boundaries (as reflected in categorization biases), or rescaling of similarity relations. A combination of methodological factors contributed to our ability to distinguish these different mechanisms. Regression models allowed us to separate out item similarities, to simultaneously consider response probabilities across a large set of test items (including old and new items, peripheral and internal as well as boundary items), and, in the generalization study, to simultaneously consider the distribution of responses across all three choices. We would not have detected the global rescaling effect if we had examined just a few critical items on the boundary between categories, because the effect of the diversity manipulation on individual items was quite small after item similarities were factored out. In the generalization task, the neither response increased the information obtained on each trial (just as a four-choice multiple choice test is more informative than the same number of true/false questions) and made it possible to detect changes in category structure on the far sides of the categories.

Most important, the neither response made it possible in principle to distinguish between changes in category biases and rescaling of similarities. We found no evidence that similarity relations were category specific, with different scaling factors for different regions of stimulus space according to the variability of categories in those regions. Our data also showed no evidence that response biases varied as a function of category variability. Rather, there was a global change in sensitivity to psychological distance in response to increased variability. This finding is consistent with the effect of stimulus range observed by Braida and Durlach (1972) in studies of sound intensity perception. Our results show that the effect is not confined to unidimensional stimuli, and also demonstrate that it affects standard category learning and generalization tasks, as well as item recognition.

With regard specifically to generalization, our modeling demonstrates that exemplar similarities can explain the seemingly different results observed in previous studies for central, peripheral, and boundary items. Only a few previous studies have controlled exemplar similarities in their tests of diversity effects on generalization (Cohen et al., 2001; Rips, 1989; Stewart & Chater, 2002). These studies focused on boundary items between a clustered category and a diverse category. Rips' pizza-or-coin study found that judgments of category membership differed from judgments of category similarity, and in particular that categorization of boundary items was biased in favor of diverse categories rather than clustered categories. Rips concluded that because judgments for the categorization task differed from those for the similarity task, categorization could not be based on similarity. Nosofsky and colleagues have suggested alternative interpretations within the framework of the GCM (Cohen et al., 2001; Nosofsky & Johansen, 2000). Moreover, our own results and those of numerous other studies (e.g., Shin & Nosofsky, 1992; see also works cited in Nosofsky & Johansen, 2000) show that similarity scaling can vary markedly between different tasks. This factor alone could explain the differences that Rips observed between judgments of category membership and category similarity.

The same confound applies to Cohen et al.'s (2001) Experiment 1, which contrasted performance on identification and categorization tasks. On the basis of a single test item, Cohen et al. concluded that high-diversity categories exert a "pull" on category judgments. However, their results could equally be explained by global effects of the sort observed in the present study; that is, by similarity scaling throughout the stimulus space, adjusted according to the particular task as well as the overall diversity of the stimulus items. In contrast, Cohen et al.'s Experiment 2 provides more compelling evidence for a category-specific effect of diversity on categorization. However, because their analysis focused on a single boundary item, there is no way to tell whether the category-specific effect involved response biases or category-specific rescaling of similarities. These different mechanisms can only be distinguished by examining responses across a wider range of stimuli.

Finally, our finding that category diversity affects global similarity relations has implications for the interpretation of Stewart and Chater's (2002) results. Their analyses of individual data indicated that, for most participants, boundary items were less likely to be classified as members of an experimental category with high diversity than with low diversity. On the basis of these results, Stewart and Chater suggest that response biases might be sensitive to category diversity. Alternatively, their results could reflect the same kind of global rescaling of similarities that we observed in the present study.

Although our results point to a global effect of diversity that affects the whole stimulus space in learning, generalization, and recognition tasks, they do not identify what aspect of diversity is relevant (nor have previous studies of diversity effects in categorization). Participants might be sensitive to the total range of stimuli (Braida & Durlach, 1972), or they might be sensitive to average absolute deviations from the grand mean, or average variance about category means, or some other specific measure of variability. Only detailed comparisons across various stimulus distributions will eventually distinguish among these possibilities.

REFERENCES

Ashby, F. G., & Maddox, W. T. (1990). Integrating information from separable psychological dimensions. Journal of Experimental Psychology: Human Perception & Performance, 16, 598-612.

Barsalou, L. W. (1983). Ad hoc categories. Memory & Cognition, 11, 211-227.

Braida, L. D., & Durlach, N. I. (1972). Intensity perception: II. Resolution in one-interval paradigms. Journal of the Acoustical Society of America, 51, 483-502.

Cohen, A. L., Nosofsky, R. M., & Zaki, S. R. (2001). Category variability, exemplar similarity, and perceptual classification. Memory & Cognition, 29, 1165-1175.

Corter, J. E., & Gluck, M. A. (1992). Explaining basic categories: Feature predictability and information. Psychological Bulletin, 111, 291-303.

Flannagan, M. J., Fried, L. S., & Holyoak, K. J. (1986). Distributional expectations and the induction of category structure. Journal of Experimental Psychology: Learning, Memory, & Cognition, 12, 241-256.

Fried, L. S., & Holyoak, K. J. (1984). Induction of category distributions: A framework for classification learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 234-257.

Hahn, U., & Ramscar, M. C. A. (2001). Mere similarity. In U. Hahn & M. C. A. Ramscar (Eds.), Similarity and categorization (pp. 257-272). Oxford: Oxford University Press.

Heit, E., & Hahn, U. (2001). Diversity-based reasoning in children. Cognitive Psychology, 43, 243-273.

Homa, D., & Vosburgh, R. (1976). Category breadth and the abstraction of prototypical information. Journal of Experimental Psychology: Human Learning & Memory, 2, 322-330.

Jones, G. V. (1983). Identifying basic categories. Psychological Bulletin, 94, 423-428.

Luce, R. D. (1959). Individual choice behavior: A theoretical analysis. New York: Wiley.

Mareschal, D., French, R. M., & Quinn, P. C. (2000). A connectionist account of asymmetric category learning in early infancy. Developmental Psychology, 36, 635-645.

McLaren, I. P. L. (1997). Categorization and perceptual learning: An analogue of the face inversion effect. Quarterly Journal of Experimental Psychology, 50A, 257-273.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Murphy, G. L., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289-316.

Neumann, P. G. (1977). Visual prototype formation with discontinuous representation of dimensions of variability. Memory & Cognition, 5, 187-197.

Nosofsky, R. M. (1984). Choice, similarity, and the context theory of classification. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 104-114.

Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.

Nosofsky, R. M. (1988). Similarity, frequency, and category representations. Journal of Experimental Psychology: Learning, Memory, & Cognition, 14, 54-65.

Nosofsky, R. M., & Johansen, M. K. (2000). Exemplar-based accounts of "multiple-system" phenomena in perceptual categorization. Psychonomic Bulletin & Review, 7, 375-402.

Osherson, D. N., Smith, E. E., Wilkie, O., Lopez, A., & Shafir, E. (1990). Category-based induction. Psychological Review, 97, 185-200.

Palmeri, T. J., & Nosofsky, R. M. (2001). Central tendencies, extreme points, and prototype enhancement effects in ill-defined perceptual categorization. Quarterly Journal of Experimental Psychology, 54A, 197-235.

Peterson, M. J., Meagher, R. B., Jr., Chait, H., & Gillie, S. (1973). The abstraction and generalization of dot patterns. Cognitive Psychology, 4, 378-398.

Posner, M. I., Goldsmith, R., & Welton, K. E., Jr. (1967). Perceived distance and the classification of distorted patterns. Journal of Experimental Psychology, 73, 28-38.

Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353-363.

Quinn, P. C., Eimas, P. D., & Rosenkrantz, S. L. (1993). Evidence for representations of perceptually similar natural categories by 3-month-old and 4-month-old infants. Perception, 22, 463-475.

Rips, L. J. (1989). Similarity, typicality, and categorization. In S. Vosniadou & A. Ortony (Eds.), Similarity and analogical reasoning (pp. 21-59). New York: Cambridge University Press.

Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382-439.

Shin, H. J., & Nosofsky, R. M. (1992). Similarity-scaling studies of dot-pattern classification and recognition. Journal of Experimental Psychology: General, 121, 278-304.

Smith, E. E., & Sloman, S. A. (1994). Similarity- versus rule-based categorization. Memory & Cognition, 22, 377-386.

Smith, J. D., & Minda, J. P. (2002). Distinguishing prototype-based and exemplar-based processes in dot-pattern category learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 800-811.

Stewart, N., & Chater, N. (2002). The effect of category variability in perceptual categorization. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 893-907.

Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins.

Wayne, A. (1995). Bayesianism and diverse evidence. Philosophy of Science, 62, 111-121.

Younger, B. A. (1985). The segregation of items into categories by ten-month-old infants. Child Development, 56, 1574-1583.

NOTES

1. These studies all involved nonnormal distributions of exemplars, which were generally sampled from donut-shaped regions of stimulus space. The effect of changes in the variance of normally distributed training exemplars has received little, if any, experimental attention.

2. The MANOVA computes an optimal linear combination of several dependent variables (DVs) and computes an ANOVA on the resulting composite DV. If the original DVs are correlated (positively or negatively), the MANOVA protects against the inflated Type I error rate produced by separate tests of the DVs (see Tabachnick & Fidell, 1996).

3. Thanks to Dorrit Billman for suggesting this explanation.


APPENDIX A
Details of Modeling for Generalization Data

According to the generalized context model (GCM), classification decisions are based on similarity comparisons between a test item and individual exemplars of each category stored in memory. In our generalization task, participants judged whether item i was a member of category C, category X, or neither. The GCM predicts the probability of a category C response on the basis of the total weighted similarity of the test item to exemplars of category C, divided by the weighted similarity of the item to exemplars of both categories plus a constant response threshold for neither responses:

\[
P(C \mid i) = \frac{\mathrm{bias}_C \cdot \sum_{j \in C} e^{-s \cdot d_{ij}}}{\mathrm{bias}_C \cdot \sum_{j \in C} e^{-s \cdot d_{ij}} + \mathrm{bias}_X \cdot \sum_{l \in X} e^{-s \cdot d_{il}} + K} \tag{A1}
\]

The term d_ij is the distance between stimulus i and exemplar j, computed as a weighted Euclidean distance:

\[
d_{ij} = \sqrt{w_x (x_i - x_j)^2 + w_y (y_i - y_j)^2} \tag{A2}
\]

where the xs and ys refer to head area and stem height measurements, respectively, and the dimension weights w_x and w_y sum to 1. Bias_C and bias_X are response bias parameters, and s is a scaling parameter that determines the sensitivity of response probabilities to a given change in distance. The probability of a category X or neither response is computed in a similar way, by replacing the numerator of Equation A1 with the appropriate term from the denominator.

Without loss of generality, we factor the response threshold K into the product of a bias parameter and the average total similarity of each exemplar to all the others, E. Thus K = bias_neither · E, where

\[
E = \frac{1}{N} \sum_j \sum_l e^{-s \cdot d_{jl}} \tag{A3}
\]

N is the total number of exemplars. Higher values of bias_neither produce more neither responses. There are two key advantages of this innovative treatment of the neither response strength. The effect of the parameter bias_neither on goodness of fit is largely orthogonal to other parameters of the model, including the scaling parameter s. As a consequence, regressions converge more quickly than with an undifferentiated K response strength parameter. Also, bias_neither is independent of the number of exemplars and the average similarity among them, so meaningful comparisons can be made between best-fit values of bias_neither obtained across different stimulus sets. In contrast, such comparisons with the undifferentiated K are generally not meaningful.

Also without loss of generality, we restrict the bias parameters (bias_C, bias_X, and bias_neither) to the range from 0 to 1 and require them to sum to 1 (there are only two degrees of freedom among these three parameters). Finally, we are usually interested in the relative magnitudes of bias_C and bias_X compared with each other, so we define bias_C = bias_C|CX · (1 - bias_neither) and bias_X = bias_X|CX · (1 - bias_neither). The parameters bias_C|CX and bias_X|CX determine the relative magnitudes of the two category response biases. They range from 0 to 1 and sum to 1 (having only one degree of freedom between them).
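The choice rule in Equations A1-A3, with the bias reparameterization just described, can be sketched in Python. This is our illustration only: function and parameter names are ours, the parameter values are arbitrary, and whether self-similarities are excluded from E is ambiguous in the text (we include them here).

```python
import numpy as np

def gcm_generalization_probs(item, exemplars_C, exemplars_X,
                             s=1.0, w_x=0.5,
                             bias_C_given_CX=0.5, bias_neither=0.2):
    """Sketch of Equations A1-A3: P(C), P(X), and P(neither) for one item.

    item: (head_area, stem_height); exemplars_*: arrays of shape (n, 2).
    Parameter values are arbitrary illustrations, not fitted estimates.
    """
    w = np.array([w_x, 1.0 - w_x])

    # Equation A2: weighted Euclidean distance to each exemplar
    def dists(exemplars):
        return np.sqrt(np.sum(w * (exemplars - item) ** 2, axis=-1))

    sim_C = np.sum(np.exp(-s * dists(exemplars_C)))
    sim_X = np.sum(np.exp(-s * dists(exemplars_X)))

    # Equation A3: average total similarity among all exemplars
    # (assumption: the double sum includes self-pairs)
    all_ex = np.vstack([exemplars_C, exemplars_X])
    d_all = np.sqrt(np.sum(w * (all_ex[:, None, :] - all_ex[None, :, :]) ** 2,
                           axis=-1))
    E = np.sum(np.exp(-s * d_all)) / len(all_ex)
    K = bias_neither * E

    # Recover biasC and biasX from the one-degree-of-freedom parameterization
    bias_C = bias_C_given_CX * (1.0 - bias_neither)
    bias_X = (1.0 - bias_C_given_CX) * (1.0 - bias_neither)

    denom = bias_C * sim_C + bias_X * sim_X + K  # Equation A1 denominator
    return bias_C * sim_C / denom, bias_X * sim_X / denom, K / denom
```

By construction the three probabilities sum to 1, matching the paper's device of replacing the numerator of Equation A1 with each term of the denominator in turn.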

The full model defined above is fully specified with reference to four independent parameters: s, wX, bias_C|CX, and bias_neither. We used this model to predict participants' classification responses, averaged across the three presentations of each flower (once in each color scheme). Thus, for each of the 23 test flowers, the model predicted the proportions of category C, category X, and neither responses out of three opportunities. Models were fit separately to the data for each participant by minimizing a likelihood-ratio chi-square goodness of fit statistic over the observed (O) and expected (E) response proportions:

\[
G = 2 \sum O \ln\left(\frac{O}{E}\right)
\]

To obtain the most accurate tests of parameter values, we eliminated parameters that did not make a significant contribution to goodness of fit. The difference in G obtained for a particular participant by two regression models that differ by the presence or absence of a single parameter can be tested against the chi-square distribution with one degree of freedom. Chi-square difference scores can be summed across all participants to obtain a composite statistic that can be tested against the chi-square distribution with degrees of freedom equal to the number of participants. Parameters that did not achieve a significance level of p < .10 on such tests were eliminated, to simplify the model as described below.
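The fit statistic and the nested-model comparison described above can be sketched as follows. This is our illustration, not the authors' code; SciPy's chi-square survival function stands in for a table lookup.

```python
import numpy as np
from scipy.stats import chi2

def g_statistic(observed, expected):
    """Likelihood-ratio fit statistic G = 2 * sum(O * ln(O / E)).

    Cells with zero observed count contribute 0 (the limit of x*ln(x) as
    x -> 0), so they are masked out before taking logs.
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    mask = observed > 0
    return 2.0 * np.sum(observed[mask] * np.log(observed[mask] / expected[mask]))

def nested_model_test(g_restricted, g_full, df):
    """Test the G difference between nested models against chi-square(df).

    df is the number of parameters dropped, summed across participants when
    difference scores are pooled as in the text.
    """
    delta = g_restricted - g_full
    return delta, chi2.sf(delta, df)
```

For example, a pooled G difference of 58 on 39 degrees of freedom yields a p value near .027, matching the scale of the comparisons reported in this appendix.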

For the full model, with four free parameters for each participant, the goodness of fit was G(164) = 604. Setting s to 0 resulted in a significantly poorer fit to the data [χ²(82) = 920, p < .001; when s = 0 the parameter wX has no effect, so there are just two degrees of freedom per participant]. This result indicates that participants' judgments were related to the distances between test and training items (i.e., those distances gave better predictions of responses than assuming that all items were treated identically). Setting wX to .5 (equal weighting for both dimensions) did not significantly reduce fitness in comparison with the full model [χ²(41) = 48, p = .22]. Setting bias_C|CX to .5 (equal bias for both categories) resulted in a significantly poorer fit compared with the full model [χ²(41) = 81, p < .001]. The neither response bias, bias_neither, is required, since participants did in fact avail themselves of the option to respond neither. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. Accordingly, the dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group, rather than varying freely among all participants within each group. Setting s to the median value of each group resulted in a significantly poorer fit to the data, compared with including a separate free parameter for each participant [χ²(39) = 58, p = .027]. Thus, distance scaling varied within groups. Setting bias_C|CX to the median value within each group did not result in a significantly poorer fit, compared with including separate parameters for each participant [χ²(39) = 41, p = .37]. A further comparison found that setting this parameter to the overall median value (bias_C|CX = .68) fit the data just as well as the model that used the median of each training group [χ²(1) = 0.21, p = .65]. Setting bias_neither to the median value of each group resulted in a significantly poorer fit to the data [χ²(39) = 74, p < .001]. This result indicates that the bias to respond neither varied among participants. Taken together, these results indicate that s and bias_neither are required for each participant; bias_C|CX did not vary among participants.

In response to a reviewer's query, we also implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average, with separate scaling parameters s for exemplar and prototype similarities. The pure prototype model was much worse than the pure instance model (goodness of fit G = 1,597 and 693, respectively). The best-fitting mixture model fit only slightly better than the pure exemplar model, and the difference was not significant [χ²(82) = 29.7, p > .1]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.
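The mixture just described can be sketched as a weighted average of summed exemplar similarity and prototype similarity, each with its own scaling parameter. The names and the exact combination rule below are our reading of the text, not the authors' code.

```python
import numpy as np

def mixture_similarity(item, exemplars, prototype,
                       s_ex=1.0, s_proto=1.0, lam=0.5):
    """Sketch of the post hoc prototype mixture model.

    lam = 1 recovers the pure exemplar (instance) model; lam = 0 the pure
    prototype model. Separate scaling parameters s_ex and s_proto apply to
    exemplar and prototype similarities, as stated in the text.
    """
    d_ex = np.linalg.norm(exemplars - item, axis=-1)
    sim_ex = np.sum(np.exp(-s_ex * d_ex))           # summed exemplar similarity
    sim_proto = np.exp(-s_proto * np.linalg.norm(prototype - item))
    return lam * sim_ex + (1.0 - lam) * sim_proto
```

Fitting lam per participant and comparing the mixture against the lam = 1 model with the G-difference test is what the reported chi-square comparison amounts to.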

APPENDIX B
Details of Modeling for Recognition Data

In our recognition task, participants judged whether item i was old or new. The GCM predicts the probability of an old response on the basis of total weighted similarity of the test item to all exemplars, divided by this similarity plus a constant response threshold for new responses:

\[
P(\mathrm{old} \mid i) = \frac{B_{\mathrm{old}} \cdot \sum_j e^{-s \cdot d_{ij}}}{B_{\mathrm{old}} \cdot \sum_j e^{-s \cdot d_{ij}} + K} \tag{B1}
\]

The term d_ij is the Euclidean distance between stimulus i and exemplar j, as in Appendix A. Following Appendix A, we factor K into the product of the average similarity of each exemplar to all the others, E, multiplied by a bias term. Thus K = B_new · E. Higher values of B_new produce more new responses. B_old and B_new range from 0 to 1 and sum to 1 (there is one degree of freedom between them). This full model includes three independent parameters: s, wX, and B_new. For each of the 31 test flowers, the model predicted the proportions of old and new responses out of three opportunities (once in each color scheme). Models were fit separately to the data for each participant.
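Equation B1, with the K = B_new · E factoring, can be sketched as follows. Function and parameter names are ours and the parameter values are arbitrary illustrations, not fitted estimates.

```python
import numpy as np

def gcm_recognition_prob_old(item, exemplars, s=1.0, w_x=0.5, B_new=0.3):
    """Sketch of Equation B1: probability of an 'old' response.

    exemplars: array of shape (n, 2) of training items; B_old = 1 - B_new.
    """
    w = np.array([w_x, 1.0 - w_x])
    d = np.sqrt(np.sum(w * (exemplars - item) ** 2, axis=-1))
    total_sim = np.sum(np.exp(-s * d))  # summed similarity to all exemplars

    # K = B_new * E, with E the average similarity of each exemplar to all
    # (assumption: self-pairs included, as in our Appendix A sketch)
    d_all = np.sqrt(np.sum(w * (exemplars[:, None, :] - exemplars[None, :, :]) ** 2,
                           axis=-1))
    E = np.sum(np.exp(-s * d_all)) / len(exemplars)
    K = B_new * E

    B_old = 1.0 - B_new
    return B_old * total_sim / (B_old * total_sim + K)
```

Because K is constant for a given participant, the old-response probability rises with summed similarity, so training items and near neighbors attract more old responses than distant novel items.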

Again, we eliminated parameters that did not make a significant contribution to goodness of fit. Setting s to 0 resulted in significantly worse fits to the data, compared with the full model [χ²(64) = 159, p < .001]. Setting wX to .5 did not significantly reduce the fit of the model [χ²(32) = 24, p = .85]. B_new is required in order to predict any new judgments, so this parameter cannot be removed or set to 0. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. The dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group. The fit of the model with one s parameter for each group of participants was not significantly worse than the fit of the model with separate parameters for every participant [χ²(30) = 26, p = .69]. A further comparison found that setting s to the overall median of all participants resulted in a significantly worse fit to the data, compared with separate values for each training group [χ²(1) = 11.5, p < .001]. This result indicates that the diversity manipulation affected s. The fit of the model with one B_new parameter for each group was significantly worse than that of the model with separate parameters for each participant [χ²(30) = 86, p < .001]. This result indicates that the bias to respond new varied among participants.

Again, we implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average. The pure prototype model was substantially worse than the pure instance model (goodness of fit G = 546 and 411, respectively). The best-fitting mixture model fit only slightly better than did the pure exemplar model, and the difference was not significant [χ²(64) = 14.2, p > .1]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.

(Manuscript received December 12, 2002; revision accepted for publication June 18, 2004.)


CATEGORY DIVERSITY 299

category learning in the literature that could explain thereduced learning accuracy we observed in the diverse con-dition but which do not simultaneously explain the ef-fects we also observed in generalization and recognitionFor example Fried and Holyoakrsquos (1984) category den-sity model represents categories by their means (proto-types) and variances estimated from a sample of exem-plars Diverse categories are learned more slowly thanare clustered ones because estimates of the means andvariances converge more slowly when the stimuli are lessuniform However because the category density modelis concerned only with categorization it has nothing atall to say about the diversity effects we observed in oldndashnew recognition And although the category densitymodel predicts that diversity will affect generalizationthe pattern of responses in our data are predicted muchbetter by exemplar similarities than by similarities to thecategory prototypes (cf J D Smith amp Minda 2002 seeAppendixes A and B for details)

Diversity effects could plausibly arise from item simi-larities shifts in category boundaries (as reflected in cat-egorization biases) or rescaling of similarity relations Acombination of methodological factors contributed to ourability to distinguish these different mechanisms Re-gression models allowed us to separate out item similar-ities to simultaneously consider response probabilitiesacross a large set of test items (including old and newitems peripheral and internal as well as boundary items)and in the generalization study to simultaneously con-sider the distribution of responses across all three choicesWe would not have detected the global rescaling effect ifwe had examined just a few critical items on the bound-ary between categories because the effect of the diversitymanipulation on individual items was quite small afteritem similarities were factored out In the generalizationtask the neither response increased the information ob-tained on each trial (just as a four-choice multiple choicetest is more informative than the same number of truefalsequestions) and made it possible to detect changes in cate-gory structure on the far sides of the categories

Most important, the neither response made it possible, in principle, to distinguish between changes in category biases and rescaling of similarities. We found no evidence that similarity relations were category specific, with different scaling factors for different regions of stimulus space according to the variability of categories in those regions. Our data also showed no evidence that response biases varied as a function of category variability. Rather, there was a global change in sensitivity to psychological distance in response to increased variability. This finding is consistent with the effect of stimulus range observed by Braida and Durlach (1972) in studies of sound intensity perception. Our results show that the effect is not confined to unidimensional stimuli, and also demonstrate that it affects standard category learning and generalization tasks as well as item recognition.

With regard specifically to generalization, our modeling demonstrates that exemplar similarities can explain the seemingly different results observed in previous studies for central, peripheral, and boundary items. Only a few previous studies have controlled exemplar similarities in their tests of diversity effects on generalization (Cohen et al., 2001; Rips, 1989; Stewart & Chater, 2002). These studies focused on boundary items between a clustered category and a diverse category. Rips's pizza or coin study found that judgments of category membership differed from judgments of category similarity, and in particular that categorization of boundary items was biased in favor of diverse categories rather than clustered categories. Rips concluded that because judgments for the categorization task differed from those for the similarity task, categorization could not be based on similarity. Nosofsky and colleagues have suggested alternative interpretations within the framework of the GCM (Cohen et al., 2001; Nosofsky & Johansen, 2000). Moreover, our own results and those of numerous other studies (e.g., Shin & Nosofsky, 1992; see also works cited in Nosofsky & Johansen, 2000) show that similarity scaling can vary markedly between different tasks. This factor alone could explain the differences that Rips observed between judgments of category membership and category similarity.

The same confound applies to Cohen et al.'s (2001) Experiment 1, which contrasted performance on identification and categorization tasks. On the basis of a single test item, Cohen et al. concluded that high-diversity categories exert a "pull" on category judgments. However, their results could equally be explained by global effects of the sort observed in the present study; that is, by similarity scaling throughout the stimulus space, adjusted according to the particular task as well as the overall diversity of the stimulus items. In contrast, Cohen et al.'s Experiment 2 provides more compelling evidence for a category-specific effect of diversity on categorization. However, because their analysis focused on a single boundary item, there is no way to tell whether the category-specific effect involved response biases or category-specific rescaling of similarities. These different mechanisms can only be distinguished by examining responses across a wider range of stimuli.

Finally, our finding that category diversity affects global similarity relations has implications for the interpretation of Stewart and Chater's (2002) results. Their analyses of individual data indicated that, for most participants, boundary items were less likely to be classified as members of an experimental category with high diversity than with low diversity. On the basis of these results, Stewart and Chater suggest that response biases might be sensitive to category diversity. Alternatively, their results could reflect the same kind of global rescaling of similarities that we observed in the present study.

300 HAHN BAILEY AND ELVIN

Although our results point to a global effect of diversity that affects the whole stimulus space in learning, generalization, and recognition tasks, they do not identify what aspect of diversity is relevant (nor have previous studies of diversity effects in categorization). Participants might be sensitive to the total range of stimuli (Braida & Durlach, 1972), or they might be sensitive to average absolute deviations from the grand mean, or average variance about category means, or some other specific measure of variability. Only detailed comparisons across various stimulus distributions will eventually distinguish among these possibilities.

REFERENCES

Ashby, F. G., & Maddox, W. T. (1990). Integrating information from separable psychological dimensions. Journal of Experimental Psychology: Human Perception & Performance, 16, 598-612.

Barsalou, L. W. (1983). Ad hoc categories. Memory & Cognition, 11, 211-227.

Braida, L. D., & Durlach, N. I. (1972). Intensity perception: II. Resolution in one-interval paradigms. Journal of the Acoustical Society of America, 51, 483-502.

Cohen, A. L., Nosofsky, R. M., & Zaki, S. R. (2001). Category variability, exemplar similarity, and perceptual classification. Memory & Cognition, 29, 1165-1175.

Corter, J. E., & Gluck, M. A. (1992). Explaining basic categories: Feature predictability and information. Psychological Bulletin, 111, 291-303.

Flannagan, M. J., Fried, L. S., & Holyoak, K. J. (1986). Distributional expectations and the induction of category structure. Journal of Experimental Psychology: Learning, Memory, & Cognition, 12, 241-256.

Fried, L. S., & Holyoak, K. J. (1984). Induction of category distributions: A framework for classification learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 234-257.

Hahn, U., & Ramscar, M. C. A. (2001). Mere similarity. In U. Hahn & M. C. A. Ramscar (Eds.), Similarity and categorization (pp. 257-272). Oxford: Oxford University Press.

Heit, E., & Hahn, U. (2001). Diversity-based reasoning in children. Cognitive Psychology, 43, 243-273.

Homa, D., & Vosburgh, R. (1976). Category breadth and the abstraction of prototypical information. Journal of Experimental Psychology: Human Learning & Memory, 2, 322-330.

Jones, G. V. (1983). Identifying basic categories. Psychological Bulletin, 94, 423-428.

Luce, R. D. (1959). Individual choice behavior: A theoretical analysis. New York: Wiley.

Mareschal, D., French, R. M., & Quinn, P. C. (2000). A connectionist account of asymmetric category learning in early infancy. Developmental Psychology, 36, 635-645.

McLaren, I. P. L. (1997). Categorization and perceptual learning: An analogue of the face inversion effect. Quarterly Journal of Experimental Psychology, 50A, 257-273.

Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological Review, 85, 207-238.

Murphy, G. L., & Medin, D. L. (1985). The role of theories in conceptual coherence. Psychological Review, 92, 289-316.

Neumann, P. G. (1977). Visual prototype formation with discontinuous representation of dimensions of variability. Memory & Cognition, 5, 187-197.

Nosofsky, R. M. (1984). Choice, similarity, and the context theory of classification. Journal of Experimental Psychology: Learning, Memory, & Cognition, 10, 104-114.

Nosofsky, R. M. (1986). Attention, similarity, and the identification–categorization relationship. Journal of Experimental Psychology: General, 115, 39-57.

Nosofsky, R. M. (1988). Similarity, frequency, and category representations. Journal of Experimental Psychology: Learning, Memory, & Cognition, 14, 54-65.

Nosofsky, R. M., & Johansen, M. K. (2000). Exemplar-based accounts of "multiple-system" phenomena in perceptual categorization. Psychonomic Bulletin & Review, 7, 375-402.

Osherson, D. N., Smith, E. E., Wilkie, O., Lopez, A., & Shafir, E. (1990). Category-based induction. Psychological Review, 97, 185-200.

Palmeri, T. J., & Nosofsky, R. M. (2001). Central tendencies, extreme points, and prototype enhancement effects in ill-defined perceptual categorization. Quarterly Journal of Experimental Psychology, 54A, 197-235.

Peterson, M. J., Meagher, R. B., Jr., Chait, H., & Gillie, S. (1973). The abstraction and generalization of dot patterns. Cognitive Psychology, 4, 378-398.

Posner, M. I., Goldsmith, R., & Welton, K. E., Jr. (1967). Perceived distance and the classification of distorted patterns. Journal of Experimental Psychology, 73, 28-38.

Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental Psychology, 77, 353-363.

Quinn, P. C., Eimas, P. D., & Rosenkrantz, S. L. (1993). Evidence for representations of perceptually similar natural categories by 3-month-old and 4-month-old infants. Perception, 22, 463-475.

Rips, L. J. (1989). Similarity, typicality, and categorization. In S. Vosniadou & A. Ortony (Eds.), Similarity and analogical reasoning (pp. 21-59). New York: Cambridge University Press.

Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8, 382-439.

Shin, H. J., & Nosofsky, R. M. (1992). Similarity-scaling studies of dot-pattern classification and recognition. Journal of Experimental Psychology: General, 121, 278-304.

Smith, E. E., & Sloman, S. A. (1994). Similarity- versus rule-based categorization. Memory & Cognition, 22, 377-386.

Smith, J. D., & Minda, J. P. (2002). Distinguishing prototype-based and exemplar-based processes in dot-pattern category learning. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 800-811.

Stewart, N., & Chater, N. (2002). The effect of category variability in perceptual categorization. Journal of Experimental Psychology: Learning, Memory, & Cognition, 28, 893-907.

Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins.

Wayne, A. (1995). Bayesianism and diverse evidence. Philosophy of Science, 62, 111-121.

Younger, B. A. (1985). The segregation of items into categories by ten-month-old infants. Child Development, 56, 1574-1583.

NOTES

1. These studies all involved nonnormal distributions of exemplars, which were generally sampled from donut-shaped regions of stimulus space. The effect of changes in the variance of normally distributed training exemplars has received little, if any, experimental attention.

2. The MANOVA computes an optimal linear combination of several dependent variables (DVs) and computes an ANOVA on the resulting composite DV. If the original DVs are correlated (positively or negatively), the MANOVA protects against the inflated Type I error rate produced by separate tests of the DVs (see Tabachnick & Fidell, 1996).

3. Thanks to Dorrit Billman for suggesting this explanation.

CATEGORY DIVERSITY 301

APPENDIX A
Details of Modeling for Generalization Data

According to the generalized context model (GCM), classification decisions are based on similarity comparisons between a test item and individual exemplars of each category stored in memory. In our generalization task, participants judged whether item i was a member of category C, category X, or neither. The GCM predicts the probability of a category C response on the basis of the total weighted similarity of the test item to exemplars of category C, divided by the weighted similarity of the item to exemplars of both categories, plus a constant response threshold for neither responses:

$$P(C \mid i) = \frac{\mathrm{bias}_C \cdot \sum_{j \in C} e^{-s \cdot d_{ij}}}{\mathrm{bias}_C \cdot \sum_{j \in C} e^{-s \cdot d_{ij}} + \mathrm{bias}_X \cdot \sum_{l \in X} e^{-s \cdot d_{il}} + K} \tag{A1}$$

The term d_ij is the distance between stimulus i and exemplar j, computed as a weighted Euclidean distance:

$$d_{ij} = \sqrt{w_x (x_i - x_j)^2 + w_y (y_i - y_j)^2} \tag{A2}$$

where the xs and ys refer to head area and stem height measurements, respectively, and the dimension weights w_x and w_y sum to 1. Bias_C and bias_X are response bias parameters, and s is a scaling parameter that determines the sensitivity of response probabilities to a given change in distance. The probability of a category X or neither response is computed in a similar way, by replacing the numerator of Equation A1 with the appropriate term from the denominator.

Without loss of generality, we factor the response threshold K into the product of a bias parameter and the average total similarity of each exemplar to all the others, E. Thus K = bias_neither · E, where

$$E = \frac{1}{N} \sum_{j} \sum_{l \neq j} e^{-s \cdot d_{jl}} \tag{A3}$$

N is the total number of exemplars. Higher values of bias_neither produce more neither responses. There are two key advantages of this innovative treatment of the neither response strength. The effect of the parameter bias_neither on goodness of fit is largely orthogonal to other parameters of the model, including the scaling parameter s. As a consequence, regressions converge more quickly than with an undifferentiated K response strength parameter. Also, bias_neither is independent of the number of exemplars and the average similarity among them, so meaningful comparisons can be made between best-fit values of bias_neither obtained across different stimulus sets. In contrast, such comparisons with the undifferentiated K are generally not meaningful.

Also without loss of generality, we restrict the bias parameters (bias_C, bias_X, and bias_neither) to the range from 0 to 1 and require them to sum to 1 (there are only two degrees of freedom among these three parameters). Finally, we are usually interested in the relative magnitudes of bias_C and bias_X compared with each other, so we define bias_C = bias_C|CX · (1 − bias_neither) and bias_X = bias_X|CX · (1 − bias_neither). The parameters bias_C|CX and bias_X|CX determine the relative magnitudes of the two category response biases. They range from 0 to 1 and sum to 1 (having only one degree of freedom between them).
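The three-choice model of Equations A1-A3, with the bias parameterization just described, can be sketched as follows. This is our reconstruction from the text, not the authors' code; the parameter values and flower items (head area, stem height) are made up for illustration:

```python
import numpy as np

def gcm_generalization(test_item, cat_C, cat_X, s, w_x=0.5,
                       bias_neither=0.2, bias_C_given_CX=0.5):
    """Predict P(C), P(X), and P(neither) for a test item under the GCM
    variant of Appendix A. Items are (head_area, stem_height) pairs."""
    w = np.array([w_x, 1.0 - w_x])

    def dist(a, b):  # Equation A2: weighted Euclidean distance
        return np.sqrt((w * (np.asarray(a) - np.asarray(b)) ** 2).sum())

    def sim_to(item, pool):  # summed exponential-decay similarity
        return sum(np.exp(-s * dist(item, ex)) for ex in pool)

    exemplars = list(cat_C) + list(cat_X)
    # Equation A3: average total similarity of each exemplar to the others
    E = sum(sim_to(ex, exemplars[:k] + exemplars[k + 1:])
            for k, ex in enumerate(exemplars)) / len(exemplars)
    K = bias_neither * E  # neither-response threshold

    bias_C = bias_C_given_CX * (1.0 - bias_neither)
    bias_X = (1.0 - bias_C_given_CX) * (1.0 - bias_neither)

    num_C = bias_C * sim_to(test_item, cat_C)
    num_X = bias_X * sim_to(test_item, cat_X)
    denom = num_C + num_X + K  # Equation A1
    return num_C / denom, num_X / denom, K / denom

# A test item sitting on a category C exemplar, far from category X
probs = gcm_generalization((1.0, 1.0),
                           cat_C=[(1.0, 1.0), (1.2, 0.9)],
                           cat_X=[(3.0, 3.0), (3.2, 2.9)],
                           s=2.0)
```

By construction the three probabilities sum to 1, and the neither response competes with both categories through the common threshold K.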

The full model defined above is fully specified with reference to four independent parameters: s, w_x, bias_C|CX, and bias_neither. We used this model to predict participants' classification responses, averaged across the three presentations of each flower (once in each color scheme). Thus, for each of the 23 test flowers, the model predicted the proportions of category C, category X, and neither responses out of three opportunities. Models were fit separately to the data for each participant by minimizing a likelihood-ratio chi-square goodness of fit statistic over the observed (O) and expected (E) response proportions:

$$G = 2 \sum O \ln\!\left(\frac{O}{E}\right)$$
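The likelihood-ratio statistic G = 2 Σ O ln(O/E) can be sketched as follows (our sketch; zero-frequency cells are skipped, on the standard convention that O · ln(O/E) → 0 as O → 0):

```python
import numpy as np

def g_statistic(observed, expected):
    """Likelihood-ratio chi-square goodness of fit over observed and
    model-predicted response frequencies."""
    o = np.asarray(observed, dtype=float)
    e = np.asarray(expected, dtype=float)
    mask = o > 0  # empty cells contribute nothing
    return 2.0 * np.sum(o[mask] * np.log(o[mask] / e[mask]))
```

A perfect fit gives G = 0, and G grows as predictions diverge from the data, which is why differences in G between nested models can be referred to the chi-square distribution.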

To obtain the most accurate tests of parameter values, we eliminated parameters that did not make a significant contribution to goodness of fit. The difference in G obtained for a particular participant by two regression models that differ by the presence or absence of a single parameter can be tested against the chi-square distribution with one degree of freedom. Chi-square difference scores can be summed across all participants to obtain a composite statistic that can be tested against the chi-square distribution with degrees of freedom equal to the number of participants. Parameters that did not achieve a significance level of p < .10 on such tests were eliminated to simplify the model, as described below.

For the full model, with four free parameters for each participant, the goodness of fit was G(164) = 604. Setting s to 0 resulted in a significantly poorer fit to the data [χ²(82) = 920, p < .001; when s = 0, the parameter w_x has no effect, so there are just two degrees of freedom per participant]. This result indicates that participants' judgments were related to the distances between test and training items (i.e., those distances gave better predictions of responses than assuming that all items were treated identically). Setting w_x to 0.5 (equal weighting for both dimensions) did not significantly reduce fitness in comparison with the full model [χ²(41) = 48, p = .22]. Setting bias_C|CX to 0.5 (equal bias for both categories) resulted in a significantly poorer


fit compared with the full model [χ²(41) = 81, p < .001]. The neither response bias, bias_neither, is required, since participants did in fact avail themselves of the option to respond neither. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. Accordingly, the dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group, rather than varying freely among all participants within each group. Setting s to the median value of each group resulted in a significantly poorer fit to the data, compared with including a separate free parameter for each participant [χ²(39) = 58, p = .027]. Thus, distance scaling varied within groups. Setting bias_C|CX to the median value within each group did not result in a significantly poorer fit, compared with including separate parameters for each participant [χ²(39) = 41, p = .37]. A further comparison found that setting this parameter to the overall median value (bias_C|CX = 0.68) fit the data just as well as the model that used the median of each training group [χ²(1) = 0.21, p = .65]. Setting bias_neither to the median value of each group resulted in a significantly poorer fit to the data [χ²(39) = 74, p < .001]. This result indicates that the bias to respond neither varied among participants. Taken together, these results indicate that s and bias_neither are required for each participant. Bias_C|CX did not vary among participants.

In response to a reviewer's query, we also implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average, with separate scaling parameters s for exemplar and prototype similarities. The pure prototype model was much worse than the pure instance model (goodness of fit G = 1,597 and 693, respectively). The best-fitting mixture model fit only slightly better than the pure exemplar model, and the difference was not significant [χ²(82) = 29.7, p > .10]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.

APPENDIX B
Details of Modeling for Recognition Data

In our recognition task, participants judged whether item i was old or new. The GCM predicts the probability of an old response on the basis of the total weighted similarity of the test item to all exemplars, divided by this similarity plus a constant response threshold for new responses:

$$P(\mathrm{old} \mid i) = \frac{B_{\mathrm{old}} \cdot \sum_{j} e^{-s \cdot d_{ij}}}{B_{\mathrm{old}} \cdot \sum_{j} e^{-s \cdot d_{ij}} + K} \tag{B1}$$

The term d_ij is the Euclidean distance between stimulus i and exemplar j, as in Appendix A. Following Appendix A, we factor K into the product of the average similarity of each exemplar to all the others, E, multiplied by a bias term. Thus K = B_new · E. Higher values of B_new produce more new responses. B_old and B_new range from 0 to 1 and sum to 1 (there is one degree of freedom between them). This full model includes three independent parameters: s, w_x, and B_new. For each of the 31 test flowers, the model predicted the proportions of old and new responses out of three opportunities (once in each color scheme). Models were fit separately to the data for each participant.
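Equation B1, with the factoring K = B_new · E, can be sketched as follows. Again this is our reconstruction from the text, with illustrative parameter values:

```python
import numpy as np

def gcm_recognition(test_item, exemplars, s, w_x=0.5, bias_new=0.3):
    """Predict P(old) for a test item under the recognition model of
    Appendix B. Items are (head_area, stem_height) pairs."""
    w = np.array([w_x, 1.0 - w_x])

    def dist(a, b):  # weighted Euclidean distance, as in Appendix A
        return np.sqrt((w * (np.asarray(a) - np.asarray(b)) ** 2).sum())

    def sim_to(item, pool):
        return sum(np.exp(-s * dist(item, ex)) for ex in pool)

    # K = B_new * E, with E the average similarity of each exemplar to the others
    exemplars = list(exemplars)
    E = sum(sim_to(ex, exemplars[:k] + exemplars[k + 1:])
            for k, ex in enumerate(exemplars)) / len(exemplars)
    K = bias_new * E

    b_old = 1.0 - bias_new
    old_strength = b_old * sim_to(test_item, exemplars)
    return old_strength / (old_strength + K)  # Equation B1
```

A test item close to the training exemplars yields a higher P(old) than a distant new item, and raising bias_new shifts all predictions toward new responses.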

Again, we eliminated parameters that did not make a significant contribution to goodness of fit. Setting s to 0 resulted in significantly worse fits to the data compared with the full model [χ²(64) = 159, p < .001]. Setting w_x to 0.5 did not significantly reduce the fit of the model [χ²(32) = 24, p = .85]. B_new is required in order to predict any new judgments, so this parameter cannot be removed or set to 0. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. The dimension weight parameter was set to .5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group. The fit of the model with one s parameter for each group of participants was not significantly worse than the fit of the model with separate parameters for every participant [χ²(30) = 26, p = .69]. A further comparison found that setting s to the overall median of all participants resulted in a significantly worse fit to the data, compared with separate values for each training group [χ²(1) = 11.5, p < .001]. This result indicates that the diversity manipulation affected s. The fit of the model with one B_new parameter for each group was significantly worse than the model with separate parameters for each participant [χ²(30) = 86, p < .001]. This result indicates that the bias to respond new varied among participants.

Again, we implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average. The pure prototype model was substantially worse than the pure instance model (goodness of fit G = 546 and 411, respectively). The best-fitting mixture model fit only slightly better than did the pure exemplar model, and the difference was not significant [χ²(64) = 14.2, p > .10]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.

(Manuscript received December 12, 2002; revision accepted for publication June 18, 2004.)


APPENDIX BDetails of Modeling for Recognition Data

In our recognition task participants judged whether item i was old or new The GCM predicts the proba-bility of an old response on the basis of total weighted similarity of the test item to all exemplars divided bythis similarity plus a constant response threshold for new responses

(B1)

The term dij is the Euclidean distance between stimulus i and exemplar j as in Appendix A Following Ap-pendix A we factor K into the product of the average similarity of each exemplar to all E multiplied by abias term Thus K Bnew E Higher values of Bnew produce more new responses Bold and Bnew range from0 to 1 and sum to 1 (there is one degree of freedom between them) This full model includes three indepen-dent parameters s wX and Bnew For each of the 31 test flowers the model predicted the proportions of oldand new responses out of three opportunities (once in each color scheme) Models were fit separately to thedata for each participant

Again we eliminated parameters that did not make a significant contribution to goodness of fit Setting sto 0 resulted in significantly worse fits to the data compared with the full model [c2(64) 159 p 001]Setting wX to 05 did not significantly reduce the fit of the model [c2(32) 24 p 85] Bnew is required inorder to predict any new judgments so this parameter cannot be removed or set to 0 Taken together these re-sults suggest that the dimension weight parameter is not needed for our data but that the other parameters arerelevant The dimension weight parameter was set to 5 in subsequent tests

A further set of regressions tested whether the remaining parameters could be set to the median value foreach participant group The fit of the model with one s parameter for each group of participants was not sig-nificantly worse than the fit of the model with separate parameters for every participant [c2(30) 26 p 69] A further comparison found that setting s to the overall median of all participants resulted in a signifi-cantly worse fit to the data compared with separate values for each training group [c2(1) 115 p 001]This result indicates that the diversity manipulation affected s The fit of the model with one Bnew parameterfor each group was significantly worse than the model with separate parameters for each participant [c 2(30) 86 p 001] This result indicates that the bias to respond new varied among participants

Again we implemented a post hoc prototype mixture model that combined similarity to instances and to pro-totypes as a weighted average The pure prototype model was substantially worse than the pure instance model(goodness of fit G 546 and 411 respectively) The best-fitting mixture model fit only slightly better than didthe pure exemplar model and the difference was not significant [c2(64) 142 p 1] Thus similarity toprototypes had no significant effect on responses after similarity to exemplars was taken into account

(Manuscript received December 12 2002revision accepted for publication June 18 2004)

p i

e

e K

s di j

j

s di j

j

response

B

Bold

old

old

|

( ) =

+

- loz

- loz

Acirc

Acirc

CATEGORY DIVERSITY 301

APPENDIX A
Details of Modeling for Generalization Data

According to the generalized context model (GCM), classification decisions are based on similarity comparisons between a test item and individual exemplars of each category stored in memory. In our generalization task, participants judged whether item i was a member of category C, category X, or neither. The GCM predicts the probability of a category C response on the basis of the total weighted similarity of the test item to exemplars of category C, divided by the weighted similarity of the item to exemplars of both categories, plus a constant response threshold for neither responses:

$$p(\text{response} = C \mid i) \;=\; \frac{\text{bias}_C \cdot \sum_{j \in C} e^{-s \cdot d_{ij}}}{\text{bias}_C \cdot \sum_{j \in C} e^{-s \cdot d_{ij}} \;+\; \text{bias}_X \cdot \sum_{l \in X} e^{-s \cdot d_{il}} \;+\; K} \tag{A1}$$

The term d_ij is the distance between stimulus i and exemplar j, computed as a weighted Euclidean distance:

$$d_{ij} = \sqrt{w_x (x_i - x_j)^2 + w_y (y_i - y_j)^2} \tag{A2}$$

where the x's and y's refer to head area and stem height measurements, respectively, and the dimension weights w_x and w_y sum to 1. bias_C and bias_X are response bias parameters, and s is a scaling parameter that determines the sensitivity of response probabilities to a given change in distance. The probability of a category X or neither response is computed in a similar way, by replacing the numerator of Equation A1 with the appropriate term from the denominator.

Without loss of generality, we factor the response threshold K into the product of a bias parameter and the average total similarity of each exemplar to all the others, E. Thus K = bias_neither · E, where

$$E = \frac{1}{N} \sum_{j} \sum_{l \neq j} e^{-s \cdot d_{jl}} \tag{A3}$$

N is the total number of exemplars. Higher values of bias_neither produce more neither responses. There are two key advantages of this treatment of the neither response strength. First, the effect of the parameter bias_neither on goodness of fit is largely orthogonal to the other parameters of the model, including the scaling parameter s; as a consequence, regressions converge more quickly than with an undifferentiated K response strength parameter. Second, bias_neither is independent of the number of exemplars and the average similarity among them, so meaningful comparisons can be made between best-fit values of bias_neither obtained across different stimulus sets. In contrast, such comparisons with the undifferentiated K are generally not meaningful.

Also without loss of generality, we restrict the bias parameters (bias_C, bias_X, and bias_neither) to the range from 0 to 1 and require them to sum to 1 (there are only two degrees of freedom among these three parameters). Finally, we are usually interested in the relative magnitudes of bias_C and bias_X compared with each other, so we define bias_C = bias_C|CX · (1 − bias_neither) and bias_X = bias_X|CX · (1 − bias_neither). The parameters bias_C|CX and bias_X|CX determine the relative magnitudes of the two category response biases; they range from 0 to 1 and sum to 1 (having only one degree of freedom between them).
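As a concrete illustration, Equations A1-A3 and the bias reparameterization above can be assembled into a short function. This is a minimal sketch of the computation, not the fitting code used in the study; the exemplar coordinates and parameter values in the example are hypothetical.

```python
import math

def gcm_probs(item, ex_C, ex_X, s, w_x, bias_C_CX, bias_neither):
    """Classification probabilities for one test item (Equations A1-A3).
    item and exemplars are (head_area, stem_height) pairs; all values
    used with this sketch are illustrative, not the study's stimuli."""
    w = (w_x, 1.0 - w_x)                        # dimension weights sum to 1

    def d(a, b):                                # Eq. A2: weighted Euclidean distance
        return math.sqrt(sum(wk * (ak - bk) ** 2 for wk, ak, bk in zip(w, a, b)))

    def sim(exemplars):                         # summed similarity to a set
        return sum(math.exp(-s * d(item, e)) for e in exemplars)

    # Eq. A3: average total similarity of each exemplar to all the others
    all_ex = list(ex_C) + list(ex_X)
    E = sum(math.exp(-s * d(a, b))
            for j, a in enumerate(all_ex)
            for l, b in enumerate(all_ex) if l != j) / len(all_ex)

    K = bias_neither * E                        # factored response threshold
    bias_C = bias_C_CX * (1.0 - bias_neither)   # reparameterized category biases
    bias_X = (1.0 - bias_C_CX) * (1.0 - bias_neither)

    num_C = bias_C * sim(ex_C)                  # Eq. A1 numerator
    num_X = bias_X * sim(ex_X)
    denom = num_C + num_X + K
    return num_C / denom, num_X / denom, K / denom   # P(C), P(X), P(neither)
```

By construction the three probabilities sum to 1, and a test item close to category C's exemplars yields P(C) > P(X).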

The full model defined above is fully specified with reference to four independent parameters: s, w_x, bias_C|CX, and bias_neither. We used this model to predict participants' classification responses, averaged across the three presentations of each flower (once in each color scheme). Thus, for each of the 23 test flowers, the model predicted the proportions of category C, category X, and neither responses out of three opportunities. Models were fit separately to the data for each participant by minimizing a likelihood-ratio chi-square goodness-of-fit statistic over the observed (O) and expected (E) response proportions:

$$G = 2 \sum O \ln\left(\frac{O}{E}\right)$$

To obtain the most accurate tests of parameter values, we eliminated parameters that did not make a significant contribution to goodness of fit. The difference in G obtained for a particular participant by two regression models that differ by the presence or absence of a single parameter can be tested against the chi-square distribution with one degree of freedom. Chi-square difference scores can be summed across all participants to obtain a composite statistic that can be tested against the chi-square distribution with degrees of freedom equal to the number of participants. Parameters that did not achieve a significance level of p < .10 on such tests were eliminated to simplify the model, as described below.
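The elimination procedure rests on two computations: the G statistic itself and a chi-square test on a G difference between nested models. A minimal sketch (the function names are ours; for the one-degree-of-freedom case the chi-square survival function has a closed form via erfc):

```python
import math

def g_statistic(observed, expected):
    """Likelihood-ratio goodness of fit: G = 2 * sum(O * ln(O / E)).
    Terms with O == 0 contribute 0 (the limit of O * ln O)."""
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

def chi2_sf_1df(x):
    """Survival function of chi-square with 1 df: P(X >= x) = erfc(sqrt(x/2)).
    Used to test the G difference from dropping one parameter for one
    participant; composite tests sum differences across participants."""
    return math.erfc(math.sqrt(x / 2.0))
```

For example, a G difference of 3.84 on 1 df gives p ≈ .05, the familiar critical value.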

For the full model with four free parameters for each participant, the goodness of fit was G(164) = 604. Setting s to 0 resulted in a significantly poorer fit to the data [χ²(82) = 920, p < .001; when s = 0, the parameter w_x has no effect, so there are just two degrees of freedom per participant]. This result indicates that participants' judgments were related to the distances between test and training items (i.e., those distances gave better predictions of responses than assuming that all items were treated identically). Setting w_x to 0.5 (equal weighting for both dimensions) did not significantly reduce fit in comparison with the full model [χ²(41) = 48, p = .22]. Setting bias_C|CX to 0.5 (equal bias for both categories) resulted in a significantly poorer


302 HAHN BAILEY AND ELVIN

APPENDIX A (Continued)

fit compared with the full model [χ²(41) = 81, p < .001]. The neither response bias, bias_neither, is required, since participants did in fact avail themselves of the option to respond neither. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. Accordingly, the dimension weight parameter was set to 0.5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group, rather than varying freely among all participants within each group. Setting s to the median value of each group resulted in a significantly poorer fit to the data compared with including a separate free parameter for each participant [χ²(39) = 58, p = .027]. Thus, distance scaling varied within groups. Setting bias_C|CX to the median value within each group did not result in a significantly poorer fit compared with including separate parameters for each participant [χ²(39) = 41, p = .37]. A further comparison found that setting this parameter to the overall median value (bias_C|CX = 0.68) fit the data just as well as the model that used the median of each training group [χ²(1) = 0.21, p = .65]. Setting bias_neither to the median value of each group resulted in a significantly poorer fit to the data [χ²(39) = 74, p < .001]. This result indicates that the bias to respond neither varied among participants. Taken together, these results indicate that s and bias_neither are required for each participant; bias_C|CX did not vary among participants.

In response to a reviewer's query, we also implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average, with separate scaling parameters s for exemplar and prototype similarities. The pure prototype model was much worse than the pure instance model (goodness of fit G = 1597 and 693, respectively). The best-fitting mixture model fit only slightly better than the pure exemplar model, and the difference was not significant [χ²(82) = 29.7, p > .1]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.
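The text does not give the mixture equation, so the sketch below is one reading of it: a category's similarity score becomes a weighted average of the summed exemplar similarities and a prototype (centroid) similarity, with separate scaling parameters. The mixture weight alpha and the scaling of the prototype term by the number of exemplars are our assumptions for illustration, not details from the article.

```python
import math

def mixture_similarity(item, exemplars, alpha, s_ex, s_proto):
    """Weighted average of exemplar and prototype similarity (assumed form).
    alpha = 1 reduces to the pure exemplar model; alpha = 0 to pure prototype.
    The prototype is the centroid of the category's exemplars, and its
    similarity is scaled by the number of exemplars (an assumption) so the
    two components are on a comparable footing."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    sim_ex = sum(math.exp(-s_ex * dist(item, e)) for e in exemplars)
    proto = tuple(sum(col) / len(exemplars) for col in zip(*exemplars))
    sim_proto = len(exemplars) * math.exp(-s_proto * dist(item, proto))
    return alpha * sim_ex + (1.0 - alpha) * sim_proto
```

These mixed similarity scores would then replace the pure exemplar sums in Equation A1.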

APPENDIX B
Details of Modeling for Recognition Data

In our recognition task, participants judged whether item i was old or new. The GCM predicts the probability of an old response on the basis of the total weighted similarity of the test item to all exemplars, divided by this similarity plus a constant response threshold for new responses:

$$p(\text{response} = \text{old} \mid i) \;=\; \frac{B_{\text{old}} \cdot \sum_{j} e^{-s \cdot d_{ij}}}{B_{\text{old}} \cdot \sum_{j} e^{-s \cdot d_{ij}} + K} \tag{B1}$$

The term d_ij is the Euclidean distance between stimulus i and exemplar j, as in Appendix A. Following Appendix A, we factor K into the product of the average similarity of each exemplar to all the others, E, multiplied by a bias term. Thus K = B_new · E. Higher values of B_new produce more new responses. B_old and B_new range from 0 to 1 and sum to 1 (there is one degree of freedom between them). This full model includes three independent parameters: s, w_x, and B_new. For each of the 31 test flowers, the model predicted the proportions of old and new responses out of three opportunities (once in each color scheme). Models were fit separately to the data for each participant.
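Equation B1 with the factoring K = B_new · E can be sketched in a few lines. As before, this is an illustrative sketch with hypothetical stimulus values, not the study's fitting code.

```python
import math

def p_old(item, exemplars, s, w_x, B_new):
    """P(old | item) under Equation B1, with K = B_new * E."""
    w = (w_x, 1.0 - w_x)                        # dimension weights sum to 1

    def d(a, b):                                # weighted Euclidean distance
        return math.sqrt(sum(wk * (ak - bk) ** 2 for wk, ak, bk in zip(w, a, b)))

    sim = sum(math.exp(-s * d(item, e)) for e in exemplars)
    # E: average similarity of each exemplar to all the others, as in Appendix A
    E = sum(math.exp(-s * d(a, b))
            for i, a in enumerate(exemplars)
            for j, b in enumerate(exemplars) if i != j) / len(exemplars)
    B_old = 1.0 - B_new                         # B_old and B_new sum to 1
    K = B_new * E
    return B_old * sim / (B_old * sim + K)
```

A test item near the training exemplars should yield a higher P(old) than a distant one.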

Again, we eliminated parameters that did not make a significant contribution to goodness of fit. Setting s to 0 resulted in significantly worse fits to the data compared with the full model [χ²(64) = 159, p < .001]. Setting w_x to 0.5 did not significantly reduce the fit of the model [χ²(32) = 24, p = .85]. B_new is required in order to predict any new judgments, so this parameter cannot be removed or set to 0. Taken together, these results suggest that the dimension weight parameter is not needed for our data, but that the other parameters are relevant. The dimension weight parameter was set to 0.5 in subsequent tests.

A further set of regressions tested whether the remaining parameters could be set to the median value for each participant group. The fit of the model with one s parameter for each group of participants was not significantly worse than the fit of the model with separate parameters for every participant [χ²(30) = 26, p = .69]. A further comparison found that setting s to the overall median of all participants resulted in a significantly worse fit to the data compared with separate values for each training group [χ²(1) = 11.5, p < .001]. This result indicates that the diversity manipulation affected s. The fit of the model with one B_new parameter for each group was significantly worse than the model with separate parameters for each participant [χ²(30) = 86, p < .001]. This result indicates that the bias to respond new varied among participants.

Again, we implemented a post hoc prototype mixture model that combined similarity to instances and to prototypes as a weighted average. The pure prototype model was substantially worse than the pure instance model (goodness of fit G = 546 and 411, respectively). The best-fitting mixture model fit only slightly better than did the pure exemplar model, and the difference was not significant [χ²(64) = 14.2, p > .1]. Thus, similarity to prototypes had no significant effect on responses after similarity to exemplars was taken into account.

(Manuscript received December 12, 2002; revision accepted for publication June 18, 2004.)
