Reduction of Uncertainty in Human Sequential Learning: Evidence from Artificial Grammar Learning

Luca Onnis ([email protected])
Department of Psychology, University of Warwick, Coventry, CV4 7AL, UK

Morten H. Christiansen ([email protected])
Department of Psychology, Cornell University, Ithaca, NY 14853, USA

Nick Chater ([email protected])
Institute for Applied Cognitive Science and Department of Psychology, University of Warwick, Coventry, CV4 7AL, UK

Rebecca Gómez ([email protected])
Department of Psychology, University of Arizona, Tucson, AZ 85721, USA

Abstract

Research on statistical learning in adults and infants has shown that humans are particularly sensitive to statistical properties of the input. Early experiments in artificial grammar learning, for instance, show a sensitivity to transitional n-gram probabilities. It has been argued, however, that this source of information may not help in detecting nonadjacent dependencies in the presence of substantial variability of the intervening material, suggesting instead a focus of attention on change versus non-change (Gómez, 2002). Following Gómez's proposal, we contend that learners may attend to alternative sources of information simultaneously in an attempt to reduce uncertainty. With several potential cues in competition, performance crucially depends on which cue is strong enough to be relied upon. By carefully manipulating the statistical environment, it is possible to weigh the contribution of each cue. Several implications for the field of statistical learning and language development are drawn.

Introduction

Research in artificial grammar learning (AGL) and artificial language learning (ALL) in infants and adults has revealed that humans are extremely sensitive to the statistical properties of the environment they are exposed to. This has opened up a new line of investigation aimed at determining empirically the processes involved in so-called statistical learning.

Several mechanisms have been proposed as the default that learners use to detect structure, although crucially there is no consensus in the literature over which is most plausible, or whether there is a default at all. Some researchers have shown that learners are particularly sensitive to transitional probabilities of bigrams (Saffran, Aslin, & Newport, 1996): confronted with a stream of unfamiliar concatenated speech-like sounds, they tend to infer word boundaries between two syllables that rarely occur adjacently in the sequence. Sensitivity to transitional probabilities seems to be present across modalities, for instance in the segmentation of streams of tones (Saffran, Johnson, Aslin, & Newport, 1999) and in the temporal presentation of visual shapes (Fiser & Aslin, 2002). Other researchers have proposed exemplar- or fragment-based models, based on knowledge of memorised chunks of bigrams and trigrams (Dulany, Carlson, & Dewey, 1984; Perruchet & Pacteau, 1990; Servan-Schreiber & Anderson, 1990) and on learning of whole items (Vokey & Brooks, 1992). Yet others have postulated rule learning in transfer tasks (Reber, 1967; Marcus, Vijayan, Bandi Rao, & Vishton, 1999). In addition, knowledge of chained events such as sentences in natural languages requires learners to track nonadjacent dependencies, where transitional probabilities are of little help (Gómez, 2002).

In this paper we propose that there may be no default process in human sequential learning. Instead, learners may be actively engaged in a search for good sources of reduction in uncertainty. In this quest, they seek alternative sources of predictability by capitalizing on the information that is likely to be the most statistically reliable. This hypothesis originates with Gómez (2002) and is consistent with several theoretical formulations, such as reduction of uncertainty (Gibson, 1991) and the simplicity principle (Chater, 1996), according to which the cognitive system attempts to seek the simplest hypothesis about the available data. Given performance constraints, the cognitive system may be biased to focus on data that are likely to reduce uncertainty as far as possible.¹ Specifically, whether the system focuses on transitional probabilities or on non-adjacent dependencies may depend on the statistical properties of the

¹ We assume that this process of selection is not necessarily conscious, and might for example involve the distribution of processing activity in a neural network.

environment that is being sampled. Therefore, by manipulating the statistical structure of that environment, it is perhaps possible to investigate whether active search is at work in detecting structure.

In two experiments, we investigated participants' degree of success at detecting invariant structure in an AGL task in 5 conditions where the test items and test task are the same but the probabilistic environment is manipulated so as to change the statistical landscape substantially. We propose that a small number of alternative statistical cues might be available to learners. We aim to show that, counter to intuition, orthogonal sources of reliability might be at work in different experimental conditions, leading to successful or unsuccessful learning. We also asked whether our results are robust across perceptual modalities by running two variations of the same experiment, one in the auditory modality and one in the visual modality. Our experiments are an extension of a study by Gómez (2002), which we first introduce.

Detection of invariant structure through context variability

Many sequential patterns in the world involve tracking nonadjacent dependencies. In English, for example, auxiliaries and inflectional morphemes (e.g., am cooking, has travelled), as well as dependencies in number agreement (the books on the shelf are dusty), are separated by varied intervening linguistic material. One potential source of learning in this case might be the embedding of first-order conditionals such as bigrams into higher-order conditionals such as trigrams. That learners attend to n-gram statistics in a chunking fashion is evident in a number of studies (Schvaneveldt & Gómez, 1998; Cohen, Ivry, & Keele, 1990). In the example above, chunking involves noting that am and cook, as well as cook and ing, are highly frequent, and subsequently noting that am cooking is highly frequent too as a trigram. Hence we may safely argue that higher-order n-gram statistics represent a useful source of information for detecting nonadjacent dependencies.

However, sequences in natural languages typically involve some items belonging to a relatively small set (functor words and morphemes like am, the, -ing, -s, are) interspersed with items belonging to a very large set (e.g., nouns, verbs, adjectives). Crucially, this asymmetry translates into patterns of highly invariant nonadjacent items separated by highly variable material (am cooking, am working, am going, etc.). Gómez (2002) suggested that knowledge of n-gram conditionals cannot be invoked for detecting invariant structure in highly variable contexts, because first-order transitional probabilities, P(Y|X), decrease as the set size of Y increases. Similarly, second-order transitional probabilities, P(Z|XY), decrease as a function of the set size of X. Hence, statistical estimates for these transitional probabilities tend to be unreliable.

Gómez exposed infants and adult participants to sentences of an artificial language of the form A-X-B. The language contained three families of nonadjacent pairs, notably A1—B1, A2—B2, and A3—B3. She manipulated the set size of the middle element X in four conditions, systematically increasing the number of word-like middle elements from 2 to 6 to 12 to 24. In this way, conditional bigram and trigram probabilities decreased as a function of the number of middle words. In the test phase, participants were required to discriminate correct nonadjacent dependencies (e.g., A2-X1-B2) from incorrect ones (*A2-X1-B1). Notice that the incorrect sentences were new as trigrams, although both single words and bigrams had appeared in the same positions during the training phase. Hence the test requires very fine distinctions to be made. Gómez hypothesized that if learners were focusing on n-gram dependencies, they should learn nonadjacent dependencies better when exposed to small sets of middle items, because transitional probabilities between adjacent elements are higher for smaller than for larger set sizes. Conversely, if learners spotted the invariant structure better in the larger set sizes, Gómez hypothesized that increasing variability in the context must have led them to disregard the highly variable middle elements. Her results support the latter hypothesis: learners performed poorly with low variability, whereas they were particularly good when the set size of the middle item was largest (24 middle items; see Figure 1).
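The inverse relation between middle-item set size and adjacent transitional probabilities can be illustrated with a small simulation. This is a hypothetical reconstruction of the design (the frame and middle-item labels are ours, not Gómez's actual stimuli):

```python
from collections import Counter

def adjacent_tp(set_size, total_tokens=432):
    """Estimate P(middle | first word) in a corpus built like the
    A-X-B design: 3 nonadjacent frames crossed with `set_size`
    middle items, repeated to a fixed number of token strings."""
    frames = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]
    middles = ["x%d" % (i + 1) for i in range(set_size)]
    types = [(a, x, b) for a, b in frames for x in middles]
    corpus = types * (total_tokens // len(types))
    first = Counter(s[0] for s in corpus)
    bigram = Counter((s[0], s[1]) for s in corpus)
    # P(x1 | a1) = count(a1, x1) / count(a1)
    return bigram[("a1", "x1")] / first["a1"]

for n in (2, 6, 12, 24):
    print("set size %2d: P(x1|a1) = %.3f" % (n, adjacent_tp(n)))
```

Because each initial word co-occurs with every middle item equally often, the estimated probability is simply 1/set size (0.500, 0.167, 0.083, 0.042), which is the drop in adjacent predictability that the text describes.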

Testing the zero-variability hypothesis

Gómez proposed that both infant and adult learners are sensitive to change versus non-change, and use this sensitivity to capitalize on stable structure. Learners might opportunistically entertain different strategies in detecting invariant structure, driven by a reduction-of-uncertainty principle. In this study we take this proposal further by exploring what happens when the variability of the end-item pairs and of the middle items is reversed in the input. Gómez attributed poor results in the middle set sizes to low variability: the variability effect seems to be exploited reliably only in the presence of a critical mass of middle items. However, an alternative explanation is that in the small set-size conditions both the nonadjacent dependencies and the middle items vary, neither considerably more than the other. This may confuse learners, in that it is not clear which structure is invariant. With larger set sizes, middle items are considerably more variable than first-last item pairings, making the nonadjacent pairs stand out as invariant. We asked what happens when variability in the middle position is eliminated, thus making the nonadjacent items the variable ones. We replicated Gómez's experiment with adults and added a new condition, namely the zero-variability condition, in which there is only one middle element (e.g., A3-X1-B3, A1-X1-B1). Our prediction is that the non-variability of the middle item will make the end-items stand out, and hence that detecting the appropriate nonadjacent relationships will become easier, increasing mean performance rates. Intuitively, sampling transitional probabilities under large context variability results in low information gain, as the data are too few to be reliable; by the same token, the lack of variability should produce low information gain for transitional probabilities as well, because it is obvious what the bigram structure is. By contrast, this should make nonadjacent dependencies stand out as a potentially more informative source of information.
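This information-gain intuition can be made concrete with Shannon entropy. Assuming, as a simplification, that the middle item is drawn uniformly, the uncertainty about the word following an initial item is zero at set size 1 and grows with set size, so at both extremes bigram sampling is unattractive, for opposite reasons:

```python
import math

def entropy_bits(probs):
    """Shannon entropy H = -sum(p * log2 p) of a distribution."""
    s = sum(p * math.log2(p) for p in probs if p > 0)
    return -s if s else 0.0

# Uniform distribution over the middle items at each set size:
for n in (1, 2, 6, 12, 24):
    h = entropy_bits([1.0 / n] * n)
    print("set size %2d: H = %.2f bits" % (n, h))
```

At set size 1 the entropy is 0 bits (nothing to learn from the bigram), while at set size 24 it is log2(24), about 4.58 bits (too much to estimate reliably from limited exposure); the middle conditions fall in between, where bigram sampling remains worthwhile.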

The final predicted picture is a U-shaped learning curve in detecting nonadjacent dependencies, on the assumption that learning is a flexible and adaptive process.

Figure 1. Total percentage endorsements from Gómez (2002) for the different conditions of variability of the middle item.

Experiment 1

Method

Participants Sixty undergraduate and postgraduate students at the University of Warwick participated and were paid £3 each.

Materials In the training phase participants listened to auditory strings generated by one of two artificial languages (L1 or L2). Strings in L1 had the form aXd, bXe, and cXf. L2 strings had the form aXe, bXf, and cXd. Variability was manipulated in 5 conditions, by drawing X from a pool of either 1, 2, 6, 12, or 24 elements. The strings, recorded by a female voice, were the same that Gómez used in her study, and were originally chosen as tokens among several recorded sample strings in order to eliminate talker-induced differences between individual strings.

The elements a, b, and c were instantiated as pel, vot, and dak; d, e, and f were instantiated as rud, jic, and tood. The 24 middle items were wadim, kicey, puser, fengle, coomo, loga, gople, taspu, hiftam, deecha, vamey, skiger, benez, gensim, feenam, laeljeen, chla, roosa, plizet, balip, malsig, suleb, nilbo, and wiffle. Following the design of Gómez (2002), the group of 12 middle elements was drawn from the first 12 words in the list, the set of 6 from the first 6, the set of 2 from the first 2, and the set of 1 from the first word. Three strings in each language were common to all five groups, and they were used as test stimuli. The three L2 items served as foils for the L1 condition and vice versa. In Gómez (2002) there were six sentences generated by each language, because the smallest set size had 2 middle items. To keep the number of test items equal to Gómez's, we presented the 6 test stimuli twice in two blocks, randomizing within blocks for each participant. Words were separated by 250-ms pauses and strings by 750-ms pauses.

Procedure Six participants were recruited in each of the five set-size conditions (1, 2, 6, 12, 24) and for each of the two language conditions (L1, L2), resulting in 12 participants per set size. Learners were asked to listen and pay close attention to sentences of an invented language, and they were told that there would be a series of simple questions relating to the sentences after the listening phase. During training, participants in all 5 conditions listened to the same overall number of strings, a total of 432 token strings. This way, frequency of exposure to the nonadjacent dependencies was held constant across conditions. For instance, participants in set size 24 heard six iterations of each of 72 type strings (3 dependencies x 24 middle items), participants in set size 12 encountered each string twice as often as those exposed to set size 24, and so forth. Hence, whereas exposure to the nonadjacent dependencies was held constant, transitional probabilities decreased as set size increased.
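The exposure arithmetic described above can be checked directly; this merely reproduces the counts stated in the text:

```python
TOTAL_TOKENS = 432   # same overall exposure in every condition
FRAMES = 3           # three nonadjacent A...B dependencies

for set_size in (1, 2, 6, 12, 24):
    n_types = FRAMES * set_size      # distinct sentence types
    reps = TOTAL_TOKENS // n_types   # iterations of each type
    print("set size %2d: %2d types x %3d repetitions"
          % (set_size, n_types, reps))
```

The token total divides evenly in every condition (144, 72, 24, 12, and 6 repetitions per type, respectively), so each nonadjacent pair is heard 144 times in all conditions while individual type strings become rarer as set size grows.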

Training lasted about 18 minutes. Before the test, participants were told that the sentences they had heard were generated according to a set of rules involving word order, and that they would now hear 12 strings, 6 of which would violate the rules. They were asked to press "Y" on a keyboard if they thought a sentence followed the rules and to press "N" otherwise.

Results and Discussion

An analysis of variance with Set Size (1 vs. 2 vs. 6 vs. 12 vs. 24) and Language (L1 vs. L2) as between-subjects variables and Grammaticality (trained vs. untrained strings) as a within-subjects variable resulted in a main effect of Grammaticality, F(1,50)=24.70, p<.001, a main effect of Set Size, F(4,50)=3.85, p<.008, and a Language x Set Size interaction, F(4,50)=2.59, p<.047. We were particularly interested in determining whether performance across the different set-size conditions would result in a U-shaped function. Consistent with our prediction, a polynomial trend analysis yielded a significant quadratic effect, F(1,50)=5.85, p<.05. In contrast to Gómez (2002), there was not a significant increase between set size 12 and set size 24, t(22)=.57, p=.568. This leveling off is responsible for a significant cubic effect, F(1,50)=9.49, p<.005. Figure 2 summarizes total percentage endorsements for correct answers.

Figure 2. Total percentage endorsements in Experiment 1 for the different conditions of variability.

Experiment 2

Method

Participants Sixty undergraduate and postgraduate students at the University of Warwick participated and were paid £3 each. None of them had participated in Experiment 1.

Materials The stimuli were identical to those used in Experiment 1, except that they were presented visually instead of auditorily.

Procedure Exactly the same procedure as in Experiment 1 was used. Participants sat and looked at the strings as they appeared on the screen. Training lasted approximately 18 minutes, as in Experiment 1. Each string from the language was flashed up in black typeface against a white background on a computer screen. Each string stayed on the screen for 2 seconds and was followed by a 750-ms white screen, so that the strings could be perceived as independent of one another. These values were chosen so that training lasted as long as in Experiment 1. The test phase was the same as in Experiment 1, except that test stimuli were presented visually on the screen.

Results and Discussion

An analysis of variance with Set Size (1 vs. 2 vs. 6 vs. 12 vs. 24) and Language (L1 vs. L2) as between-subjects variables and Grammaticality (trained vs. untrained strings) as a within-subjects variable resulted in a main effect of Grammaticality, F(1,50)=16.39, p<.001, but no significant Grammaticality x Set Size interaction, F(4,50)=.971, p=.505. There were no other main effects or interactions. In contrast to Experiment 1, a polynomial trend analysis did not show a significant quadratic effect, F<1. Figure 3 presents the percentage of endorsements for total accuracy in each of the five set-size conditions.

Figure 3. Total percentage endorsements in Experiment 2 for the different conditions of variability.

General discussion

We used a simple artificial language to enquire into the way learners track remote dependencies. Knowledge of sequential events in the world, including language, involves detecting fixed nonadjacent dependencies interspersed with highly variable material. Gómez (2002) found what we dub a variability effect, i.e., a facilitatory effect in detecting invariant structure when the context is highly variable, but not when it is moderately or minimally variable. In general, this points to a specific sensitivity to change versus non-change. Conditions 2 to 4 in our Experiment 1 replicate her findings, although performance in terms of percent accuracy seems to improve only gradually from set size 2 to 24, whereas Gómez found a significant difference between set size 12 and 24.

Overall, Gómez's original results do not square well with recent findings of learners' striking sensitivity to n-gram transitional probabilities. Because transitional probabilities are higher in set sizes 2, 6, and 12, performance should be better there. Instead, the opposite is the case. We reasoned that perhaps variability in both the middle items and the end-point items leaves learners in doubt as to what the invariant structure is. Hence, by eliminating variability in the middle item in a new condition, the variability of the nonadjacent items stands out again, this time reversed. However, the effect is, quite counter-intuitively, not reversed. Indeed, similar performance results are obtained for set size 1 and set size 24. In set size 1, performance is near 100% and significantly better than in set size 2 (Experiment 1). One could argue that word trigrams, if recorded perfectly, could suffice to account for performance in set size 1, thus trivializing our results and explaining away the variability effect in this condition. However, as a counter to that, it would be reasonable to expect good performance in the set size 2 condition too, given the high number of repetitions (72) of only six type strings. A control condition is currently being run involving learning six frames (instead of three), each with 1 different middle item (e.g., A3-X3-B3, A6-X6-B6), so as to reproduce the same type and token frequencies as set size 2, but with no middle item being shared by different frames. Similarly, one could argue that good performance in set size 24 could be achieved by memorizing, strikingly but not impossibly, 72 type strings. However, this would imply good performance in all smaller set sizes as well, and this runs counter to the data.

Notice also that in all conditions, including set size 1, bigram transitional probabilities by themselves are not sufficient for discriminating the correct string pel wadim rud from the incorrect one *pel wadim jic (example taken from L1), as pel wadim, wadim rud, and wadim jic all appear as bigrams during training. Moreover, Gómez (2002) conjectured that perhaps low discrimination rates in small set sizes are due to overexposure to string tokens during training, resulting in boredom and distraction. Our findings disconfirm this hypothesis: if it held true, performance would drop even lower in the zero-variability condition, as the type/token ratio decreases even further. Crucially, the finding that there is a statistically significant difference in learning between the two conditions is intriguing for several reasons.
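The point that bigrams cannot discriminate the test pair can be verified mechanically. The check below uses the zero-variability L1 strings, with wadim as the single middle item as in the example above:

```python
# Zero-variability L1 training strings: frames aXd, bXe, cXf
# with the single middle item "wadim".
training = [("pel", "wadim", "rud"),
            ("vot", "wadim", "jic"),
            ("dak", "wadim", "tood")]
trained_bigrams = {(s[i], s[i + 1]) for s in training for i in range(2)}

def all_bigrams_trained(string):
    """True if every adjacent pair in the string was seen in training."""
    return all((string[i], string[i + 1]) in trained_bigrams
               for i in range(len(string) - 1))

correct = ("pel", "wadim", "rud")
foil = ("pel", "wadim", "jic")   # ungrammatical as a trigram in L1
print(all_bigrams_trained(correct), all_bigrams_trained(foil))  # True True
```

Both strings pass the bigram check, because wadim jic occurs in the trained string vot wadim jic; only the trigram (or the nonadjacent pel...rud pairing) separates the correct string from the foil.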

A larger project underway examines the extent to which the U-shaped learning curve is modality-independent. In Experiment 2, training and test stimuli were presented visually on a computer screen. The obtained U-shaped curve is less marked. One possible explanation is that attending to visually presented word-like strings is less demanding cognitively, suggesting a ceiling effect. This explanation is preliminary and needs further evidence. However, the fact that the results in Experiment 2 show the same trend as in Experiment 1 is encouraging.

The implications of our findings might inform, in various degrees, both the AGL community and researchers of language development. AGL researchers working mainly with adults have long debated whether there are one or more mechanisms at work in learning structured events from experience. Our results suggest that associative learning based on adjacent material may not be the only source of information. There seems to be a striking tendency to detect variant versus invariant structure, and the way learners do this is extremely adaptive to the informational demands of their input. Without claiming exhaustiveness, we explored two putative sources of information, namely n-gram transitional probabilities and the variability effect. At this stage we can only give an informal explanation of the reduction-of-uncertainty hypothesis. Intuitively, sampling bigrams involving middle items under no variability yields no information gain, as the middle item is always the same. Under this condition, learners may be driven to shift attention towards nonadjacent structure. Likewise, sampling bigrams under large variability yields no reduction of uncertainty, as bigram transitional probabilities are very low. In a similar way, then, learners may be led to focus on nonadjacent dependencies. With low variability, sampling bigrams may be reliable enough, hence "distracting" learners away from nonadjacent structure. Other sources may be at work, and disentangling the contribution of each of them to learning is an empirical project yet to be undertaken. For instance, post-test verbal reports from the majority of our participants suggest that, regardless of their performance, they were aware of the positional dependencies of single words in the strings. This piece of information may be misleading for our task: on the one hand, it reduces uncertainty by eliminating irrelevant hypotheses about words in multiple positions (each word is either initial, middle, or final); on the other hand, distinguishing pel wadim rud from *pel wadim jic requires more than positional knowledge. We believe that positional knowledge deserves more attention in the current AGL literature. Studies of sequential learning have found that it is an important source of information. However, many nonadjacent dependencies are free-ranging and hence not positionally dependent. Further experiments are needed to investigate whether people can detect such non-positionally dependent constraints, such as A_x_y_B, A_x_y_w_B, and A_x_y_w_z_B, equally well.

Our results have been modeled successfully using a connectionist model. Onnis et al. (submitted) use simple recurrent networks (SRNs) trained in experimental conditions akin to the adult data reported here, obtaining a very similar U-shaped curve. SRNs can be thought of as reducing uncertainty, in that their predictions tend to converge towards the optimal conditional probabilities of observing a particular successive item given the sequence presented up to that point. The SRNs' specific task was to predict the third, nonadjacent element Bi correctly. Minimizing the summed squared error maximizes the probability of the next element given previously occurring adjacent elements (McClelland, 1998). This is equivalent to exploiting bigram probabilities. As we have seen, conditional probability matching only yields suboptimal behaviour here. To overcome this, SRNs possess a set of context units that help them maintain information about previously encountered material. Crucially, they maintain a trace of the correct non-adjacent item Ai only under either no variability or large variability. This happens by forming separate graded representations in the hidden units for each nonadjacent dependency.
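As a rough illustration of this prediction setup (not the authors' actual simulations, whose details are in Onnis et al., submitted), an Elman-style SRN can be sketched as follows; for brevity, only the output weights are trained here, a substantial simplification of full backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)

class TinySRN:
    """Elman-style network: the hidden state feeds back as context.
    One-hot words in, softmax distribution over the next word out."""
    def __init__(self, vocab, hidden=10, lr=0.5):
        self.Wxh = rng.normal(0, 0.5, (hidden, vocab))
        self.Whh = rng.normal(0, 0.5, (hidden, hidden))
        self.Why = rng.normal(0, 0.5, (vocab, hidden))
        self.h = np.zeros(hidden)
        self.lr = lr

    def reset(self):
        self.h = np.zeros_like(self.h)

    def step(self, x):
        """Consume one word; return the predicted next-word distribution."""
        self.h = np.tanh(self.Wxh @ x + self.Whh @ self.h)
        z = self.Why @ self.h
        e = np.exp(z - z.max())
        return e / e.sum()

    def train_step(self, x, target):
        """Cross-entropy gradient step on the output layer only."""
        y = self.step(x)
        self.Why -= self.lr * np.outer(y - target, self.h)

def onehot(i, n=4):
    v = np.zeros(n)
    v[i] = 1.0
    return v

# Toy corpus: word 0 is always followed by word 2.
net = TinySRN(vocab=4)
for _ in range(500):
    net.reset()
    net.train_step(onehot(0), onehot(2))

net.reset()
print(net.step(onehot(0))[2])  # probability of the deterministic successor
```

With a deterministic successor the predicted probability converges towards 1, which is the sense in which prediction error minimization approximates the conditional probabilities discussed above; capturing the variability conditions themselves would require training all three weight matrices on full A-X-B sequences.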

The reduction-of-uncertainty hypothesis may also be given a formal account in terms of active data selection (MacKay, 1992; Oaksford & Chater, 1994), a form of rational analysis (Anderson, 1990). However, the details of such a model are outside the scope of this paper (see Monaghan, Chater & Onnis, in preparation).


Overall, framing our results within a reduction-of-uncertainty principle should prompt new research aimed at discovering in which carefully controlled statistical environments multiple sources are attended to and either discarded or integrated.

Finally, our findings might inform research in language development. Gómez (2002) found that infants attend to the variability effect. We are currently investigating whether the U-shaped curve found in our experiments applies to infant learning as well. The fact that performance in the zero-variability condition is very good is consistent with various findings that children develop productive linguistic knowledge only gradually, building from fixed item-based constructions. According to the Verb Island hypothesis, for example (for a review, see Tomasello, 2000), early knowledge of verbs and verb frames is extremely idiosyncratic for each specific verb. In addition, morphological markings are unevenly distributed across verbs. In this view, I-am-eat-ing is first learnt as an unanalyzed chunk, and it takes the child a critical mass of verbs to realize that the frame am—ing can be used productively with different verbs. Two- and three-year-olds have been found to generalize minimally, their repertoire consisting of a high number of conservative utterances and a low number of productive ones. The speculation is that a critical number of exemplars is vital for triggering schematization. Perhaps, then, young children exploit n-gram statistics as a default option, because their knowledge of language is limited to a few type items. This situation is similar to learning in small set sizes, and it only works if each string is learnt as a separate item. When children's repertoire is variable enough (arguably at ages three to four), switching to change versus non-change as a source of information becomes more relevant and helps the learner reduce uncertainty by detecting variant versus invariant structure. Although our experiments do not test for generalisation, the fact that learners in the large set size discard the middle item could be interpreted as a form of generalisation over material in the middle-item position. At this stage the link between AGL results and language learning can only be speculative, but it invites intriguing research for the immediate future.

Acknowledgments

Luca Onnis and Nick Chater were supported by European Union Project HPRN-CT-1999-00065. Morten Christiansen was supported by the Human Frontiers Science Program. Rebecca Gómez was supported by Grant NIH RO1 HD42170-01.

References

Anderson, J. (1990). The Adaptive Character of Thought. Hillsdale, NJ: Erlbaum Associates.

Chater, N. (1996). Reconciling simplicity and likelihood principles in perceptual organization. Psychological Review, 103, 566-581.

Dulany, D.E., Carlson, R.A., & Dewey, G.I. (1984). A case of syntactical learning and judgement: How conscious and how abstract? Journal of Experimental Psychology: General, 113, 541-555.

Fiser, J., & Aslin, R.N. (2002). Statistical learning of higher-order temporal structure from visual shape-sequences. Journal of Experimental Psychology: Learning, Memory, and Cognition, 130, 658-680.

Gibson, E.J. (1991). An Odyssey in Learning and Perception. Cambridge, MA: MIT Press.

Gómez, R. (2002). Variability and detection of invariant structure. Psychological Science, 13, 431-436.

MacKay, D.J.C. (1992). Information-based objective functions for active data selection. Neural Computation, 4, 589-603.

Marcus, G.F., Vijayan, S., Bandi Rao, S., & Vishton, P.M. (1999). Rule learning by seven-month-old infants. Science, 283, 77-80.

McClelland, J.L. (1998). Connectionist models and Bayesian inference. In M. Oaksford & N. Chater (Eds.), Rational models of cognition. Oxford: Oxford University Press.

Monaghan, P., Chater, N., & Onnis, L. (in preparation). Optimal data selection in sequential AGL learning.

Oaksford, M., & Chater, N. (1994). A rational analysis of the selection task as optimal data selection. Psychological Review, 101(4), 608-631.

Onnis, L., Destrebecq, A., Christiansen, M., Chater, N., & Cleeremans, A. (submitted). The U-shape nature of the Variability Effect: A connectionist model.

Perruchet, P., & Pacteau, C. (1990). Synthetic grammar learning: Implicit rule abstraction or explicit fragmentary knowledge? Journal of Experimental Psychology: General, 119, 264-275.

Saffran, J.R., Aslin, R.N., & Newport, E.L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926-1928.

Saffran, J.R., Johnson, E.K., Aslin, R.N., & Newport, E.L. (1999). Statistical learning of tone sequences by human infants and adults. Cognition, 70, 27-52.

Schvaneveldt, R.W., & Gómez, R.L. (1998). Attention and probabilistic sequence learning. Psychological Research, 61, 175-190.

Servan-Schreiber, E., & Anderson, J.R. (1990). Learning artificial grammars with competitive chunking. Journal of Experimental Psychology: Learning, Memory, and Cognition, 16, 592-608.

Tomasello, M. (2000). The item-based nature of children's early syntactic development. Trends in Cognitive Sciences, 4, 156-163.

Vokey, J.R., & Brooks, L.R. (1992). Salience of item knowledge in learning artificial grammar. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 328-344.