
Learning mechanisms: statistical, rule-like, or both?

Graziela Pigatto Bohn May 2014

University of Toronto

1 Introduction

In Generative Theory, the patterns found in natural languages have traditionally been

accounted for by means of rules (Chomsky and Halle 1968). As to language acquisition, under

the scope of Generativism, the child learns to map underlying representations into surface

structures, and vice-versa, from exposure to a limited set of stimuli by making use of rules and generalizing them to novel elements (Marcus et al 1999). An alternative mechanism, opposed to the postulation of rule-learning, is statistical-learning, which has been widely investigated in current

studies. Saffran et al (1996:1926) define statistical learning as ‘a mechanism by which learners

acquire information about the distribution of elements in the input’. Thus, in this model, the

learner’s role is to figure out how likely it is that some elements tend to occur in relation to one

another.

In recent years, the divergences raised by proponents of each of these two mechanisms,

rule-learning and statistical-learning, have culminated in a vast number of heated discussions in

the fields of both psychology and linguistics. For instance, artificial-language experiments carried out by Marcus et al (1999) aimed at undermining statistical-learning models, mainly by arguing that such models can only be tested with elements to which learners have had previous exposure and, therefore, are unrealistic when it comes to generalizations involving novel elements, even though


the authors do not completely exclude the statistical model. This study has received a number of

replies from supporters of statistical-learning models, namely Christiansen and Curtin (1999)

who replicate the experiments conducted by Marcus et al (1999) and report that statistical models

can fit the data without invoking any kind of rules. In Altmann and Dienes’s (1999) commentary

on Marcus et al’s experiments it is also argued that statistical models not only account for their

data, but also for substantially more complex data than those in the experiment (p. 875a).

However, while these different views have claimed that statistical learning and rule learning are

separate mechanisms, there are proposals, such as that of Aslin and Newport (2012), in which a

unifying perspective is suggested.

This paper explores each of these proposals more extensively and analyzes the arguments

presented in favor of each mechanism. §2 and §3 lay out the fundamentals of both statistical

and rule learning mechanisms, respectively. These sections also present some of the important

studies carried out to test each mechanism. §4 discusses some of the arguments which have been put forward to support each mechanism, and §5 introduces a proposal which unifies both statistical and rule learning. §6 presents my personal assessment of both mechanisms and concludes.

2 Statistical-learning mechanism

The interest in statistical learning arose from a change from symbolic models to

probabilistic models that emphasize the distribution of exemplars and the importance of the

richness of the input (Aslin & Newport 2008). The term statistical learning was first used in

Charniak’s (1993, apud Aslin & Newport 2008) description of algorithms in computational

linguistics and has become a term used to describe the acquisition of


structured information from the auditory or visual environment via sensitivity to

frequency or probability distributions without overt reinforcement or direct

feedback, but by mere exposure to language. (Aslin & Newport 2008:16)

But how does the learner who is being exposed to language for the first time know which sounds

are relevant for the computation? That is, how does a learner distinguish relevant speech sounds

from noise? One valid point made by Aslin & Newport is that, even though statistical learning

seems to be an implausible learning mechanism, several psycholinguistic experiments do control stimuli for word and phoneme frequency, thus suggesting that subjects are indeed

sensitive to these variables and have access to distributional information at the word, syllable,

and phoneme levels (2008:20). What allows a statistical learning mechanism to operate without attempting to compute too many statistics, or irrelevant ones, is, according to these authors, a set of constraints – some innate and some learned from the input. For example, work

conducted by Newport, Weiss, Wonnacott & Aslin (2004, apud Aslin & Newport 2008) suggests that learners pay attention to segment and syllable information when segmenting words – depending on whether the word boundaries were at the syllable level or the segment level.1 Also, another study

by Yu, Ballard & Aslin (2005) with adults learning Mandarin suggests that non-linguistic cues

such as eye gaze can also help word segmentation and, consequently, word learning.

The studies which will be presented in the following subsection briefly summarize what

has been achieved in terms of statistical learning by means of experimental analyses with both

infant and adult learners as well as computer models.

1 Other studies (Cutler & Butterfield 1992; Jusczyk, Hohne & Bauman 1999; Johnson & Jusczyk 2001) argue that prosodic cues such as strong and weak stress also play an important role in segmenting words from a continuous acoustic signal.


2.1 Studies based on statistical learning

A seminal study on statistical learning was conducted by Saffran, Aslin & Newport

(1996) who argue that segmentation of words from fluent speech can be accomplished by 8-

month-old infants based solely on statistical relationships between neighboring speech sounds.

Their study is grounded on three fundamental questions: (i) how can children segment words

from fluent speech and subsequently recognize them in isolation? (ii) what information is used

by infants to discover word boundaries? (iii) is there any acoustic correlate which helps learners

discover word boundaries? The authors underline that most studies so far have attempted to

answer these questions, emphasizing experience-independent internal structures over the role of

experience-dependent factors. To Saffran, Aslin & Newport, however, it is irrefutable that

experience-dependent mechanisms also have a crucial role in language acquisition, so the

authors address the above questions from a different perspective in which learning, more

specifically statistical learning, plays a primary part in helping learners acquire the structure of a

language. Based on a previous experiment in which adults and children used information about

transitional probabilities to discover word boundaries in an artificial language corpus with no

acoustic cues, Saffran, Aslin & Newport perform a similar analysis to examine if 8-month-old

infants also make use of statistical learning to segment speech into words. The authors make use

of the familiarization-preference procedure in which infants are exposed to auditory material that

serves as potential learning experience. In the first phase of the experiment infants are

familiarized with 2 minutes of a continuous speech stream consisting of four three-syllable

nonsense words. In order to avoid having any acoustic information about word boundaries, the

speech stream was generated by a speech synthesizer in a monotone female voice so that the only


cue to word boundaries was the transitional probabilities between syllable pairs. For instance, in a stream like bidakupadotigolabubidaku the only cues to word boundaries are the transitional probabilities between syllable pairs – which were higher (more restricted) within words than

between words. In the second part, infants are presented with two types of test stimuli: items

that are contained within the familiarization material and items that are highly similar. During

the test trial, infants showed a significant discrimination between word and non-word stimuli,

paying more attention to nonwords. That is, infants recognized the difference between the novel

and the familiar ordering of the three-syllable string, which suggests, according to Saffran, Aslin

& Newport, that 8-month-olds are capable of extracting serial-order information after only 2 minutes of listening experience (p.1927). However, according to the authors, the learner must also

be able to extract the transitional frequency of co-occurrence of sound pairs (how likely it is that

some sounds tend to occur in relation to one another). In order to examine that, Saffran, Aslin &

Newport carry out a second experiment in which 8-month-old infants are required to perform a

more difficult statistical computation to distinguish a word-internal syllable pair like pre-tty –

words – from a word-external syllable pair like ty#ba – part-words – as in pretty#baby. The test

trials’ results have shown that babies listen longer to part-words, suggesting that 2 minutes of

exposure is sufficient for them to extract information about the sequential statistics of syllables.

According to Saffran, Aslin & Newport, the fact that these infants could extract word and

syllable segmentation after only 2 minutes of exposure suggests that learners have access to a

powerful mechanism for the computation of statistical properties of the language input. Also,

although experience in the real world is unlikely to be as concentrated as it was in these

experiments, Saffran, Aslin & Newport believe that infants in more natural settings can also

profit from other types of cues correlated with statistical information. Hence, based on how little


input and time these infants had to extract word and syllable segmentation from an artificial

language, they argue that computational abilities may play a more crucial role in language

acquisition than existing theories have suggested.
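To make the statistic concrete, the sketch below computes, for adjacent syllables, the transitional probability TP(y | x) = frequency of the pair xy divided by the frequency of x, and posits a word boundary wherever that probability dips. It is only an illustration: the stream is generated from the three nonsense words quoted above, and neither the code nor the materials are the authors'.

```python
# A minimal sketch of segmentation by transitional probabilities (TPs).
# The word list and the random stream are illustrative, not the original stimuli.
import random
from collections import Counter

WORDS = ["bidaku", "padoti", "golabu"]      # three-syllable nonsense words

def to_syllables(word):
    """Split a CVCVCV string into its three CV syllables."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]

def make_stream(n_words=300, seed=0):
    """Concatenate randomly ordered words into one continuous syllable stream."""
    rng = random.Random(seed)
    stream = []
    for _ in range(n_words):
        stream.extend(to_syllables(rng.choice(WORDS)))
    return stream

def transitional_probabilities(stream):
    """TP(y | x) = count of x immediately followed by y / count of x."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])
    return {(x, y): c / first_counts[x] for (x, y), c in pair_counts.items()}

def segment(stream, tps, threshold=0.5):
    """Posit a word boundary wherever the TP to the next syllable is low."""
    words, current = [], [stream[0]]
    for x, y in zip(stream, stream[1:]):
        if tps[(x, y)] < threshold:         # low TP -> likely word boundary
            words.append("".join(current))
            current = []
        current.append(y)
    words.append("".join(current))
    return words

stream = make_stream()
tps = transitional_probabilities(stream)
print(segment(stream, tps)[:6])
# Within-word TPs come out at 1.0 and between-word TPs near 1/3, mirroring the
# "higher within words than between words" contrast described above.
```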

In a subsequent study, Saffran, Newport & Aslin (1996) tested whether an integration

between transitional probabilities and prosodic cues, in this case word-final lengthening, helped

enhance the learner’s performance on the task of identifying word boundaries. For this

experiment twenty-four adult subjects were recruited and tested on nonsense trisyllabic words with

three different patterns of lengthening: initial lengthening (first syllable was lengthened); final

lengthening (third syllable was lengthened); and no lengthening (no prosodic cues were added,

leaving transitional probabilities as the only cue to word boundaries). To Saffran, Newport &

Aslin, having a consistent prosodic cue combined with transitional probabilities may facilitate

word segmentation. Also, according to them, third syllable lengthening is expected to facilitate

performance in determining word boundaries mainly because of the existence of word-final

lengthening in many languages. The results show that subjects exposed to initial lengthening

performed with a mean score of 21.9 (61%); subjects exposed to no lengthening performed with

a mean score of 23.4 (65%); and subjects exposed to final lengthening performed with a mean

score of 29 (80%). To Saffran, Newport & Aslin, these results support the hypothesis that final

syllable lengthening facilitates word segmentation, suggesting that these subjects used their

knowledge of word-lengthening to segment speech. Based on this, the authors entertain the idea

that the awareness of prosodic cues may be a part of the human perceptual set of strategies for


dealing with continuous speech, but these cues are not sufficient to dispense with the statistical

learning mechanism.2
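A toy way to picture the cue integration at issue is to let each potential boundary accumulate evidence both from a low transitional probability and from lengthening of the current syllable. The scoring scheme below is invented, not the authors' analysis; the weights and the 220 ms reference duration are arbitrary assumptions.

```python
# Hypothetical combination of a statistical cue (low TP) with a prosodic cue
# (final-syllable lengthening) into a single word-boundary score.
def boundary_score(tp, duration_ms, mean_duration_ms=220.0,
                   w_tp=1.0, w_prosody=0.5):
    """Higher score = more likely that a word boundary follows this syllable.

    tp          -- transitional probability to the next syllable (low favors a boundary)
    duration_ms -- duration of the current syllable (lengthening favors a boundary)
    """
    tp_evidence = 1.0 - tp
    lengthening = max(0.0, duration_ms / mean_duration_ms - 1.0)
    return w_tp * tp_evidence + w_prosody * lengthening

# Final lengthening reinforces a TP dip at a real word boundary ...
print(boundary_score(tp=0.33, duration_ms=300))   # ~0.85
# ... but lengthening alone, word-internally, yields little boundary evidence.
print(boundary_score(tp=1.0, duration_ms=300))    # ~0.18
```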

Following the aforementioned studies conducted by Saffran, Newport & Aslin (1996) and

Saffran, Aslin & Newport (1996) showing that learners seem to make use of statistical information as well as prosodic cues to segment the speech stream, Allen & Christiansen (1996) run a

simulation in which two different neural networks are trained on two different sets of cues: the

first network was trained on a set of 15 trisyllabic (CVCVCV) words – ‘vtp’ words (from

variable transitional probabilities) - in which the word internal transitional probabilities are

varied so as to serve as a possible source of information about lexical boundaries, that is, some

syllables occur in some words and in more locations within words than others (also, there are

restrictions to which speech sounds could begin or end words); the second network was trained

on what Allen & Christiansen call ‘flat words’ in which the probability of a given syllable

following any other syllable is the same for all syllables and each syllable was equally likely to

appear in any position within a word, showing, thus, no phonotactic restrictions. By this, a

syllable that appears with higher probability at the end of words than at other positions is more

likely to mark word boundary than a syllable that appears with equal frequency at any position in

the word. Likewise, a syllable that appears more frequently word-initially is less likely to be

followed by a word boundary. Allen & Christiansen’s (1996) purpose is, thus, to see how the

trained neural networks use these probabilistic differences to determine word boundaries. More

specifically, the authors’ expectation is that a combination of cues – transitional probabilities and phonotactics – facilitates the task of identifying word boundaries. The results indeed show that the network trained on a flat vocabulary makes almost no discrimination between end-of-word and non-end-of-word positions, while the network trained on vtp words predicts a boundary with significantly higher confidence at lexical boundaries than at word-internal positions.

2 To Saffran, Newport & Aslin (1996), prosodic cues are quite helpful in initially breaking up the speech stream – evidence of which is that infants’ initial productions consist predominantly of those syllables which are stressed or word-final in adult speech – but the problem of correctly determining the structure and boundaries of many multisyllabic words remains in later stages (p. 608). Saffran, Newport & Aslin also point out that it is unknown whether all languages provide reliable prosodic cues to word boundaries (compare English, with word-initial stress, and French, with word-final stress). Another problem, according to them, is that to access language-specific prosodic patterns such as those of English or French, the learner must have some knowledge of word boundaries to determine whether the stress corresponds to the beginning, middle, or end of words (1996:608).

Also, in

order to assess generalizations made by the neural network, Allen & Christiansen tested the vtp

trained network on a set of novel words that are in line with the artificial language and non-

words which violate the phonotactic restrictions. The average activation of boundaries for novel

words is .26 compared to .006 for non-words, suggesting that the vtp trained network is able to

discriminate impossible from possible words. According to these results, Allen &

Christiansen argue that the variability in transitional probabilities between syllables may play an

important role in helping learners perform better at identifying word boundaries than a network

with access to only word boundary information.
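The positional statistic described above can be written down directly: for each syllable, estimate how strongly it predicts a following word boundary from where it occurs in the vocabulary. The sketch below does this for two invented toy vocabularies standing in for the 'vtp' and 'flat' conditions; it is not Allen & Christiansen's connectionist model.

```python
# P(word boundary follows | syllable), estimated from a training vocabulary in
# which each word is given as a list of syllables. Both vocabularies are toys.
from collections import Counter

def boundary_predictiveness(vocabulary):
    occurrences = Counter()
    final_occurrences = Counter()
    for word in vocabulary:
        occurrences.update(word)
        final_occurrences[word[-1]] += 1
    return {s: final_occurrences[s] / occurrences[s] for s in occurrences}

# "vtp"-style vocabulary: some syllables are restricted to word-final position.
vtp_vocab = [["ba", "di", "ku"], ["po", "di", "ku"], ["ba", "go", "ku"]]
# "flat"-style vocabulary: every syllable occurs in every position.
flat_vocab = [["ba", "di", "ku"], ["di", "ku", "ba"], ["ku", "ba", "di"]]

print(boundary_predictiveness(vtp_vocab))   # 'ku' -> 1.0, all others -> 0.0
print(boundary_predictiveness(flat_vocab))  # every syllable -> about 0.33
```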

Another study examining statistical learning was conducted by Christiansen, Allen &

Seidenberg (1998) in which the problem of segmenting utterances into words by learners was

once again addressed. According to these authors, language acquisition involves three tasks: a

primary task, in which the learner attends to the linguistic input; an immediate task, in which the

learner encodes statistically salient regularities of the input concerning phonology, lexical stress,

and utterance boundaries; and a derived task, in which knowledge about linguistic structures that

are not explicitly marked in the input is acquired (for instance, boundaries between words). The

focus of their work is the integration of cues - distributional and acoustic - and how such

integration can facilitate the discovery of derived linguistic knowledge by means of a computer

model. However, in this study, instead of using an artificial language, Christiansen, Allen & Seidenberg expose the model to a phonetically transcribed corpus of child-directed speech so


that the input has the same characteristics as the speech human learners are exposed to. The corpus used,

from CHILDES database, consists of 27,467 words distributed over 9,108 utterances and was

randomly divided into a training corpus and a test corpus. The training corpus consists of mainly

monosyllabic words (86.8%), compared with 12.3% bisyllabic words and 0.9% trisyllabic words.

Also, most bisyllabic and trisyllabic words show a strong-weak stress pattern (77.3% and 77.6%,

respectively). The distribution of words in the test corpus is very similar to that of the training

corpus: 86.5% of monosyllabic words; 12.7% of bisyllabic words; and 0.7% of trisyllabic words.

The model was also trained under five different training conditions which varied according to the

combination of cues provided: (i) phonological, utterance boundary, and stress information; (ii)

phonological and utterance boundary information; (iii) phonological and stress information; (iv)

phonological information; and (v) stress and utterance boundary information. This experiment’s

results show that the network seems to have the necessary attributes to integrate multiple cues

when segmenting speech, suggesting that interactions between cues may form an additional

source of information for word segmentation. According to Christiansen, Allen & Seidenberg, the

networks are able to perform the immediate task of predicting the next speech sound in a

phonological string, which indicates learning the phonology and phonotactics of the language,

as well as to perform the derived task of word segmentation. Overall, the network could segment

44% of the speech stream. However, the authors point out that because the network was trained

on child-directed speech to infants between the ages of 6 and 16 weeks, it is difficult to compare

it with humans because it is still unknown how well infants do at this age. They also highlight

that these results should not be used to account for adult-level word segmentation without


additional cues, training, and/or architectural augmentation.3
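One way to visualize how the five training conditions differ is in the input representation each gives the network: a phoneme's feature vector is optionally extended with a stress bit and an utterance-boundary bit. The sketch below only illustrates that design; the feature table, condition names, and encoding are invented and are not the authors' implementation.

```python
# Hypothetical input encodings for the five cue-combination conditions described
# above: phonological features, lexical stress, and utterance-boundary information.
PHON_FEATURES = {                 # made-up stand-in for real phonological features
    "b": [1, 0, 0], "i": [0, 1, 0], "g": [1, 0, 1],
}

CONDITIONS = {                    # which extra cues each training condition adds
    "phon+boundary+stress": ("stress", "boundary"),
    "phon+boundary":        ("boundary",),
    "phon+stress":          ("stress",),
    "phon":                 (),
    "boundary+stress":      ("stress", "boundary"),   # no phonological features
}

def encode(phoneme, stressed, at_utterance_end, condition):
    """Build one input vector for the network under a given training condition."""
    cues = CONDITIONS[condition]
    vec = [] if condition == "boundary+stress" else list(PHON_FEATURES[phoneme])
    if "stress" in cues:
        vec.append(1 if stressed else 0)
    if "boundary" in cues:
        vec.append(1 if at_utterance_end else 0)
    return vec

print(encode("b", stressed=True, at_utterance_end=False,
             condition="phon+boundary+stress"))    # [1, 0, 0, 1, 0]
```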

Finally, in a study on the role of consonants and vowels in continuous speech processing,

Bonatti et al (2005) take as their starting point the assumption that vowels and consonants carry

different kinds of information, the latter being more tied to word identification and the former to

grammar. That being so, in a word segmentation task involving speech streams, learners will

track transitional probabilities among consonants, but not among vowels. That is, consonants are

more transparent to statistical computations than vowels, suggesting that when learners compute

transitional probabilities over consonants, they will tend to extract not the actual sequence of

syllables that compose the words encountered in the stream, but the sequence of consonants.

According to Nespor et al (2003), consonants predominate in an on-line task of word

segmentation because they preferentially contribute to lexical processing of words, whereas

grammatical variations rest mostly on vocalic segments. On the basis of these observations,

Bonatti et al propose that the task of interpreting the lexicon falls more on consonants than on

vowels. And because vowels and consonants behave differently in the language, they may elicit

different computations in word segmentation. The results of an experiment conducted by these

authors show that learners can compute transitional probabilities between consonants to segment

speech into words, suggesting that the consonantal tier plays a role in language processing. As to

vowels, learners clearly failed to compute transitional probabilities of vocalic sounds to segment

speech. This study suggests that learners can break continuous speech into words when relying

on consonants, but they are apparently unable to do so when relying on vowels. The language,

thus, may limit which representations are open to statistical computations and which are not.
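The asymmetry Bonatti et al describe can be made concrete by computing transitional probabilities separately over the consonant tier and the vowel tier of a syllable stream. In the sketch below the three consonant frames and the freely varying vowels are invented for illustration; they are not the original materials.

```python
# Transitional probabilities computed over the consonant tier vs the vowel tier
# of a CV syllable stream (illustrative materials only).
import random
from collections import Counter

FRAMES = ["bdk", "ptg", "mnl"]    # word-defining consonant skeletons
VOWELS = "aeiou"                  # vowels vary freely within each frame

def make_stream(n_words=400, seed=1):
    rng = random.Random(seed)
    syllables = []
    for _ in range(n_words):
        frame = rng.choice(FRAMES)
        syllables.extend(c + rng.choice(VOWELS) for c in frame)
    return syllables

def tier_tps(syllables, tier):
    """TPs over one tier of the stream: tier 0 = consonants, tier 1 = vowels."""
    units = [syl[tier] for syl in syllables]
    pairs = Counter(zip(units, units[1:]))
    firsts = Counter(units[:-1])
    return {(x, y): c / firsts[x] for (x, y), c in pairs.items()}

stream = make_stream()
c_tps = tier_tps(stream, tier=0)
v_tps = tier_tps(stream, tier=1)
print(c_tps[("b", "d")], c_tps[("p", "t")])   # within-frame consonant TPs: 1.0
print(max(v_tps.values()))                    # vowel TPs stay near chance (~0.2)
```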

3 Ways of improving the model so that it is comparable to human word segmentation include: exposure to more data, and, therefore, a greater variety of words (the model was trained on less than a month’s worth of data); a change in stress patterns (the simulation performed here included only trochaic words); vowel quality, including vowel reduction in weak syllables; and correlation between speech input and non-linguistic stimuli (Christiansen, Allen & Seidenberg, 1998:252-256).


The five studies briefly summarized here are grounded on a common driving question -

whether learners track statistics in linguistic input to segment words in a speech stream – and

have reached a common answer: yes. However, we have seen that when statistical learning

incorporates more aspects of the speech stream, such as phonotactics and prosodic cues, learning

is enhanced, and this can serve as an indicator that a learner does not rely solely on statistical

information. While the primary focus of these studies has been how sensitive a learner is to

regularities in the input, they have made it clear that a single cue is insufficient and that cues are

not independent of each other. These studies have, however, either tested humans on artificial

languages in which words had unvarying size (usually trisyllabic), or tested computer models on

both artificial and natural languages. This means that statistical learning has not been tested with

a rich input characteristic of natural language and human subjects at the same time.

Alternatively, according to Saffran, Aslin & Newport (1996), it is possible that the complexity of

natural languages actually facilitates learning because it contains non-linguistic stimuli as well as

other cues that may correlate with statistical cues. But because no study so far has tested this

difference, it is also possible that a learner may fail when faced with the variability inherent in

natural speech (Romberg & Saffran 2010).

3 Rule-learning mechanism

Similarly to statistical learning, the mechanism of rule learning also posits the rapid

learning of regularities of a language from exposure to a limited set of stimuli. However, one of

the differences between rule learning and statistical learning has to do with the open-ended

abstract relationship present in the former but not in the latter. According to Marcus et al


(1999:77), in an equation such as y = x + 2, x can be substituted by any value. Likewise, languages

are also governed by rules which allow generalizations, that is, open-ended relationships, and as

soon as a rule is mastered, the learner is capable of applying it to novel items. In the following

subsection, some of the studies regarding the rule-learning mechanism will be briefly summarized. As will be noticed, all of these studies share a common ground by focusing on the aspect of

abstraction and generalization of language structures.

3.1 Studies based on rule-learning mechanism

In a study aimed to test rule learning by 7-month-old infants, Marcus et al (1999)

consider that learners might possess at least two learning mechanisms, one for learning statistical

information and another for learning algebraic rules. In this study, Marcus et al tested 7-month-old infants4 in three experiments in which, according to them, a statistical learning mechanism would not suffice. In each experiment, infants were familiarized with 3-word sentences from an

artificial language and tested on 3-word sentences formed with words which did not appear in

the habituation phase. The infants were tested on whether the test sentences were consistent or inconsistent with the grammar they had been familiarized with. Marcus et al highlight that, because

none of the test words appeared in the habituation phase, infants could not rely on transitional

probabilities. Also, because sentences were all the same length and generated by a computer,

infants could not rely on statistical properties such as number of syllables or prosody either

(p.78). In the first experiment, infants were exposed for 2 minutes to either an ‘ABA’ condition (‘ga ti ga’, for instance) or an ‘ABB’ condition (‘ga ti ti’, for instance).

4 According to Marcus et al, these subjects are old enough to be able to distinguish words in a fluent stream of speech.

In the test phase, infants were presented with entirely new words, such as ‘wo fe wo’ or ‘wo fe fe’, for instance. Half of

the sentences were consistent with the grammar presented in the habituation phase and the other

half was constructed from the grammar the infant had not been exposed to. The results show that

15 of 16 infants attended more to the inconsistent sentences. This result is in accordance with

Marcus et al’s expectations. To them, if infants can abstract a rule and generalize it to novel

items, they should attend longer to inconsistent items because they differ from the rule learned.

Marcus et al notice, however, that even though test sentences were made up of new items, there is still overlap in phonetic features (sentences in both the habituation and test phases beginning with a word that starts with a voiced consonant followed by a word that starts with a voiceless consonant, for

example). The regularities involving phonetic features may, thus, be helping infants make

predictions of what to expect when presented with novel items. In order to rule out the possibility

that learners might be learning a sequence based on phonetic features rather than abstracting a

rule and generalizing it, Marcus et al conduct a second experiment in which words were more

carefully constructed and arranged. Similarly to experiment 1, the results of experiment 2 show

that 15 of 16 infants paid more attention to inconsistent items during the test phase. Another problem

found by the authors involves the reduplication found in ABB and not in ABA which could be

leading learners to differentiate one grammar from another without making use of any rule. To

solve this, Marcus et al designed a third experiment in which they compared ABB and AAB

sentences so that reduplication was contained in both grammars. Once more, infants paid more

attention to inconsistent items during the test phase. After having carried out these three

experiments, Marcus et al conclude that a mechanism which relies solely on statistical learning

and is, therefore, sensitive to transitional probabilities between words, could not achieve these


results, mainly because all items presented in the test phase were novel. In other words, a learner

could not rely on transitional probabilities based on phonetic cues, for example, because these

cues were not consistent across the habituation and test phase. According to Marcus et al, only a

rule-learning mechanism allows infants to extract abstract rules between elements, such as x then

y, and generalize them to new items independently of transitional probabilities. Also, it is argued by

them that the subjects tested in the experiments seem to be able to extract rules rapidly from

a small amount of data.
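The contrast Marcus et al draw can be stated very compactly: an 'algebraic' rule is a relation over variables, so it applies to syllables never heard in training, whereas statistics over specific items are silent about novel items. The sketch below illustrates this with invented syllables in the spirit of their stimuli; it is not the experimental procedure itself.

```python
# An open-ended ABA rule (a relation over variables) versus item-specific
# bigram statistics. Syllables are invented examples.
from collections import Counter

def matches_aba(sentence):
    """The rule as a relation over variables: first element equals third element."""
    a, b, c = sentence
    return a == c and a != b

def bigram_counts(training_sentences):
    """Item-specific statistics: counts of attested adjacent syllable pairs."""
    counts = Counter()
    for s in training_sentences:
        counts.update(zip(s, s[1:]))
    return counts

habituation = [["ga", "ti", "ga"], ["li", "na", "li"], ["ta", "gi", "ta"]]
test_consistent = ["wo", "fe", "wo"]      # novel syllables, same ABA pattern
test_inconsistent = ["wo", "fe", "fe"]    # novel syllables, ABB pattern

print(matches_aba(test_consistent), matches_aba(test_inconsistent))  # True False

counts = bigram_counts(habituation)
# No bigram of either test sentence was ever attested, so statistics over
# specific items give no basis for preferring one test sentence over the other.
print(counts[("wo", "fe")], counts[("fe", "wo")], counts[("fe", "fe")])  # 0 0 0
```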

Gomez & Gerken (1999) also tested the ability infants have to abstract beyond a specific

set of sentences, but unlike the studies presented so far, Gomez and Gerken were

interested in extending such research question to the acquisition of syntax. The authors

conducted 4 different experiments in which 12-month-old infants were not tested on their

memory for familiar grammatical strings constructed from an artificial language, but were asked

to generalize the acquired structures to new instances. Because the sentences used during the

familiarization phase were similar but not identical to the ones presented during the test phase,

especially because infants were exposed to grammars characterized by variable word order, they could not rely on transitional probabilities to abstract long-distance dependencies. Experiment 1

asked whether infants were able to generalize to new instances by distinguishing new

grammatical strings from illegal ones. Experiment 2 asked the same question as experiment 1, but

this time the strings showed violations of grammatical word order. Experiment 3 tested if infants

could discriminate new strings of their training grammar from strings produced by another

grammar. In this experiment, both strings began and ended with the same words, but the ordering

of words differed within the strings. Finally, in Experiment 4, infants were trained in one

vocabulary and tested on a new vocabulary. In all experiments, infants were able to distinguish


the learned grammar from the unlearned one. Also, the fact that in experiment 3 and 4 word

ordering and vocabulary in the test phase differed from the habituation phase suggests that

infants showed the discrimination regardless of their training grammar. This means that the

ability to discriminate grammars in these experiments resulted from the infants’ ability to extract

information by abstracting beyond specific word order. To Gomez & Gerken, a statistical

learning mechanism almost certainly plays a role in word segmentation, but the syntactic use of

units must certainly involve some degree of abstraction.
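The kind of abstraction required by their fourth experiment can be sketched as follows: if the grammar is represented over abstract categories rather than specific words, then once novel words are mapped onto those categories, entirely new strings can be judged as grammatical or not. The grammar, lexicons, and nonsense words below are invented for illustration and are not Gomez & Gerken's actual materials.

```python
# A toy category grammar: legal transitions are stated over categories, not words,
# so strings built from a brand-new vocabulary can still be evaluated.
GRAMMAR = {                       # hypothetical allowed category transitions
    "START": {"A"},
    "A": {"B", "C"},
    "B": {"C"},
    "C": {"A", "END"},
}

def accepts(string, lexicon):
    """Check a word string against the category grammar via a word-to-category map."""
    state = "START"
    for word in string:
        category = lexicon[word]
        if category not in GRAMMAR[state]:
            return False
        state = category
    return "END" in GRAMMAR[state]

training_lexicon = {"vot": "A", "pel": "B", "jic": "C"}
novel_lexicon = {"kav": "A", "rud": "B", "tam": "C"}    # entirely new vocabulary

print(accepts(["vot", "pel", "jic"], training_lexicon))  # True
print(accepts(["kav", "rud", "tam"], novel_lexicon))     # True: same structure
print(accepts(["rud", "kav", "tam"], novel_lexicon))     # False: illegal order
```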

Johnson et al (2009) conducted an experiment on rule-learning based on Marcus et al

(1999) in order to look into infants’ sensitivity to structured relations among stimulus features

from visual and auditory input. According to Marcus et al (2007), infants are better at learning

rules from speech than from other domains of auditory stimuli, such as musical tones, timbres, and natural animal sounds.5 The goal of Johnson et al’s study is to examine which mechanisms are

involved in infant rule learning. Because previous studies, such as that by Marcus et al (2007),

have shown that rule learning is facilitated by familiar stimuli – speech, for instance – Johnson

et al test if the use of unfamiliar stimuli might challenge the process of learning rules. First,

infants were exposed to sequences of colored shapes organized according to ABB, ABA or AAB

patterns (e.g., octagon-square-octagon, bowtie-star-bowtie). Following the habituation phase, a

new set of colored shapes was shown based either on the familiar rule or a novel rule to which

infants had not been exposed before (two test trials consisted of the three possible three-shape

sequences that had a familiar pattern, and two test trials with the same three-shape sequences arranged in a different order to ensure that infants were not distracted by being tested on a different color or shape).

5 In this study, Marcus et al (2007) exposed infants to structured sequences consisting of either pure speech syllables (naturally sung) or nonspeech sounds (pure tones, instrument timbres, and animal sounds) and asked if infants could extract rules from these sequences. Their results show that infants were able to extract rules only when exposed to speech sequences. However, if infants first heard a rule based on speech, they were able to generalize the rule to sequences of nonspeech sounds. To them, this indicates that a learned rule can also be transferred to a different domain. Most importantly, their conclusions suggest that speech can facilitate rule learning in domains where humans might not acquire rules in a direct manner.

The first experiment tested whether 11-month-old and 8-month-old infants are

able to distinguish a sequence of ABB shapes from a sequence of AAB (late and early

repetitions), a discrimination that can be accomplished by infants when they are exposed to

speech (Marcus et al. 1999, Marcus et al 2007). After habituation to ABB, 11-month-old infants

showed the ability to discriminate both ‘grammars’, while 8-month-olds did not show such

ability. This suggests that the first group of infants could extract the ABB pattern and apply it to new sequences. In order to test whether the younger group of infants failed to distinguish both grammars

due to a common characteristic between them – it is possible that the pattern of repetition in both

ABB and AAB led infants to see both grammars as similar – Johnson et al conducted a second

experiment in which the discrimination between ABB and ABA was tested. The results show

that the ABB rule might be easier to discriminate within an ABB-ABA contrast because infants

learned the repetition during the habituation phase, but were not required to detect its position in

the sequence in the test phase. To the authors, it might be that 8-month-olds acquired the rule ‘identity anywhere’ or ‘there is a repetition’. Based on this, Johnson et al conclude that one

abstract relation that is acquired relatively early is that of repetition, and then that of position of

elements is acquired. However, the goal of this study was not to address the earliest age at which

rules can be acquired, but to examine how infants react when learning rules from a visual input.

Their results present evidence that infants are able to detect abstract relations from sets of

unfamiliar stimuli that share no surface features in common. The authors also highlight infants’

ability to notice a pattern and extend it to analogous cases, suggesting the development of

analogy.


4 Which mechanism?

The studies presented so far have shown that while statistical learning is a rapid and

efficient learning mechanism that enables both adults and infants to learn patterns from both

auditory and visual inputs, it has been argued that only rule-like mechanisms make

generalization to novel items possible.

According to Marcus (2000), some of the things we know about language are very

specific – for instance, the st sequence is usually followed by r and never by x – while others are

more general and abstract – the fact that we can form a sentence by combining any plural noun

with a plural verb or that we can add –ing to verb stems to form the progressive tense. Both

specific and abstract knowledge are, to Marcus, statistically reliable reflections of the world.

However, how to learn these types of knowledge may stem from two different mechanisms: one

pertaining to how specific units relate to each other (statistical learning); and the other pertaining

to open-ended schemas which allow free substitutions and generalizations (rule learning).

According to Marcus et al (1999), a system that relies solely on relations between specific items

is completely ruled out when it comes to generalizations to novel items. Based on their results,

the authors claim that infants were not using relations between specific elements to distinguish

grammars, but had indeed acquired an open-ended rule which could be generalized to new

instances. Marcus et al, however, do not completely disregard statistical learning as a

mechanism to acquire language. The results from their first experiments have indeed shown that

infants could be relying on relations between phonetic features presented during the habituation

phase and using these relations to distinguish both grammars. However, in a follow-up

experiment in which input was carefully designed not to provide crucial phonetic information

which could be used as a cue, infants were still able to distinguish grammars satisfactorily. To


them, the crucial aspect is that learners only acquire an abstract system by making use of

variables and operations and this is what allows generalizations. More specifically, they state that

‘[…] such mechanisms cannot account for how humans generalize rules to new items that do not

overlap with the items that appeared in the training’ (1999:79). However, contrary to this claim,

Christiansen & Curtin (1999) argue that a statistical mechanism modeled by a neural network

architecture typically trained to predict what the next item in a sequence of inputs is can indeed

fit Marcus et al’s data without invoking any kind of abstract rule. By means of binary

phonological features the network was trained on Marcus et al’s (1999) habituation sentences

and then tested with different items based on the same grammar from the habituation phase.

Even though Christiansen & Curtin predict that their model can fit Marcus et al’s data, when

presented with the test sentences, the model was better at predicting phonemes occurring in

inconsistent than in consistent sentences. Although it may seem unclear why this happened,

Christiansen & Curtin suggest that the inconsistent items are more salient to the learner and,

therefore, attract more of their attention. Regarding speech segmentation, Christiansen & Curtin

argue that no algebraic rules are needed. However, to them, it remains to be determined whether rules

are needed outside this domain of speech segmentation.
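For concreteness, the sketch below shows a minimal Elman-style simple recurrent network trained to predict the next item in a sequence, which is the general kind of architecture Christiansen & Curtin appeal to. It is not their implementation: it uses one-hot syllables rather than binary phonological features, and the network size, data, and training regime are invented.

```python
# A minimal Elman-style recurrent network for next-item prediction (illustrative
# sizes and data; one-step truncated backpropagation).
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["ga", "ti", "li", "na"]                # toy syllable inventory
IDX = {s: i for i, s in enumerate(VOCAB)}
V, H = len(VOCAB), 8                            # vocabulary size, hidden units

W_xh = rng.normal(0, 0.5, (H, V))               # input   -> hidden
W_hh = rng.normal(0, 0.5, (H, H))               # context -> hidden
W_hy = rng.normal(0, 0.5, (V, H))               # hidden  -> output

def one_hot(i):
    v = np.zeros(V)
    v[i] = 1.0
    return v

def step(x, h_prev):
    h = np.tanh(W_xh @ x + W_hh @ h_prev)       # new hidden state
    y = np.exp(W_hy @ h)
    return h, y / y.sum()                       # softmax over the next syllable

def train(sequences, epochs=200, lr=0.1):
    global W_xh, W_hh, W_hy
    for _ in range(epochs):
        for seq in sequences:
            h = np.zeros(H)
            for cur, nxt in zip(seq, seq[1:]):
                x, target = one_hot(IDX[cur]), IDX[nxt]
                h_prev = h
                h, y = step(x, h_prev)
                dy = y.copy()
                dy[target] -= 1.0               # cross-entropy gradient at the output
                dh = (W_hy.T @ dy) * (1 - h ** 2)
                W_hy -= lr * np.outer(dy, h)
                W_xh -= lr * np.outer(dh, x)
                W_hh -= lr * np.outer(dh, h_prev)

train([["ga", "ti", "ga"], ["li", "na", "li"]])
h, y = step(one_hot(IDX["ga"]), np.zeros(H))
print(VOCAB[int(np.argmax(y))])                 # the network's guess after "ga"
```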

Even though it is not yet clear which mechanism best accounts for language acquisition,

it is evident that they are still seen as two separate and independent mechanisms. There is,

however, a proposal in which a unifying perspective is put forth, arguing for a statistical-learning

mechanism which accounts for both the learning of patterns and the generalization to novel

instances. The following subsection discusses this particular proposal.


5 Statistical and rule learning: a unifying proposal

According to Aslin & Newport (2012), it is possible to have a balance between instance-

learning and generalization based on two facts: a language learner cannot acquire a structure

without gathering regularities from the input; likewise, the learner cannot wait for every possible

language structure to be in the input before inferences are made. Aslin & Newport highlight that

even though statistical-learning enables learners to extract statistics from input and to use the

information to make decisions about recurrent materials, it does not address the question of how

learners make rules, that is, how learners go beyond what they have seen or heard. Instead of

separating both mechanisms, Aslin & Newport suggest that they are part of one learning

mechanism. But how? The hypothesis formulated by these authors is that some stimuli are naturally more salient than others, and if stimuli are coded based on their salient aspects rather than specific details, learners will be able to generalize the saliency to all stimuli that present the same salient aspects, even when these stimuli have never been heard or seen before. The salient aspects of the input can also constrain the encoding of the structure presented, guiding learners

to what they should attend to. As to rules, these are acquired when patterns in the input indicate

that elements can occur interchangeably in the same structure (e.g., category of verbs). The same

contrast between the word ‘dog’, which is generalized to all kinds of dogs, and the word ‘Huck’,

which is used to refer to a specific dog, holds when learning language rules and items.

According to recent studies conducted by Reeder, Newport & Aslin (2009, 2010), it is the consistency in the input that leads learners to generalize rules to novel strings, while the inconsistency of cues leads them to treat certain strings as exceptions. Hence, learning stems from

a single mechanism in which the result either applies to elements that have been experienced in

the input or to generalizations beyond experienced elements.
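A rough sketch of the kind of distributional evidence this proposal relies on: words that occur in largely overlapping contexts are grouped into one category, licensing generalization to strings that were never heard, while a word whose contexts do not overlap with the category's is kept apart as an exception. The corpus, nonsense words, and threshold below are invented, and the overlap measure is a simplification, not Reeder, Newport & Aslin's procedure.

```python
# Grouping words by overlap of their distributional contexts (toy corpus).
from collections import defaultdict

def contexts(corpus):
    """Map each word to the set of (previous, next) words it appears between."""
    ctx = defaultdict(set)
    for sentence in corpus:
        padded = ["<s>"] + sentence + ["</s>"]
        for prev, word, nxt in zip(padded, padded[1:], padded[2:]):
            ctx[word].add((prev, nxt))
    return ctx

def same_category(w1, w2, ctx, threshold=0.5):
    """Generalize w1 and w2 to one category if their contexts overlap enough."""
    shared = ctx[w1] & ctx[w2]
    smaller = min(len(ctx[w1]), len(ctx[w2]))
    return len(shared) / smaller >= threshold

corpus = [
    ["klidum", "blit", "mib"], ["klidum", "blit", "daf"],
    ["zomper", "blit", "mib"], ["zomper", "blit", "daf"],
    ["klidum", "mib", "tink"], ["zomper", "mib", "tink"],
]
ctx = contexts(corpus)
print(same_category("klidum", "zomper", ctx))  # True: fully overlapping contexts
print(same_category("blit", "tink", ctx))      # False: "tink" remains an exception
```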


6 Final Considerations

This paper has focused on two learning mechanisms which have been believed to operate

separately in the acquisition of grammar. While proponents of statistical learning argue that this is a

fundamental mechanism which allows learners to acquire patterns presented in the input,

supporters of the rule-like mechanism state that it is only by means of acquisition of rules that

language abstractions and generalizations are possible.

As to statistical learning, I have observed that many of the studies conducted so far

remain within one very specific domain of grammar: word segmentation. Also, the languages used in the experiments are usually artificial, with very controlled structures, such as word length, syllable

shape, and limited vocabulary. These studies also do not seem to account for linguistic variation,

such as allophones, which occur in real language data. That is, how would statistical learning

function given a combination of human learners and the rich input characteristic of natural

language? Also, if the statistical learning mechanism relies solely on regularities found in surface

forms, how does a learner acquire more complex structures which are not so obvious in the

input?

As to rule-learning, it seems to me that the argument that this mechanism allows the

acquisition of more complex structures, abstractions and generalizations does not suffice to

dispense with statistical learning. In order to formulate rules and extend them to novel items, it

seems reasonable that the learner first attends to regularities from the input. Also, if statistical

learning is completely disregarded as a mechanism, what will be the starting point of a learner

when acquiring rules?

A unifying mechanism, as suggested by Aslin & Newport (2012), seems to offer a

plausible answer to these questions and should, thus, be explored in future studies.


References

Aslin, R.N., & Newport, E.L. (2008) What statistical learning can and can't tell us about language acquisition. In J. Colombo, P. McCardle, and L. Freund (eds.), Infant Pathways to Language: Methods, Models, and Research Directions. Mahwah, NJ: Lawrence Erlbaum Associates.

Aslin, R. N., & Newport, E. L. (2012). Statistical learning: From acquiring specific items to forming general rules. Current Directions in Psychological Science, 21(3), 170-176.

Allen, J., & Christiansen, M.H. (1996). Integrating multiple cues in word segmentation: A connectionist model using hints. In Proceedings of the 18th annual Cognitive Science Society conference (pp. 370–375). Mahwah, NJ: Lawrence Erlbaum Associates Inc.

Bonatti, L. L., Pena, M., Nespor, M., & Mehler, J. (2005). Linguistic Constraints on Statistical Computations: the role of consonants and vowels in continuous speech processing. Psychological Science, 16(6), 451-459

Christiansen, M., Allen, J., & Seidenberg, M. (1998). Learning to segment speech using multiple cues: A connectionist model. Language and Cognitive Processes, 13, 221–268.

Christiansen, M. H., & Curtin, S. L. (1999). The power of statistical learning: No need for algebraic rules. In Proceedings of the 21st Annual Conference of the Cognitive Science Society (pp. 114-119).

Cutler, A., Butterfield, S. (1992). Rhythmic cues to speech segmentation: Evidence from juncture misperception. Journal of Memory and Language, 31, 218–236.

Johnson, E., & Jusczyk, P. (2001). Word segmentation by 8-month-olds: When speech cues count more than statistics. Journal of Memory and Language, 44, 548–567.

Jusczyk, P.W., Hohne, E. A., & Bauman, A. (1999). Infants’ sensitivity to allophonic cues for word segmentation. Perception and Psychophysics, 61, 1465–1476.

Gomez, R. L., & Gerken, L. (1999). Artificial grammar learning by 1-year-olds leads to specific and abstract knowledge. Cognition, 70(2), 109-135.

Marcus, G. F. (2000). Pabiku and Ga Ti Ga: Two mechanisms infants use to learn about the world. Current Directions in Psychological Science, 9(5), 145-147.

Marcus, G. F. et al (1999). Rule learning by seven-month-old infants. Science, 283(5398), 77-80.

Marcus, G. F. et al (2007). Infant rule learning facilitated by speech. Psychological Science, 18(5), 387-391.

Nespor, M., Peña, M., & Mehler, J. (2003). On the different roles of vowels and consonants in speech processing and language acquisition. Lingue e linguaggio, 2(2), 203-229.

Reeder, P. A., Newport, E. L., & Aslin, R. N. (2009). The role of distributional information in linguistic category formation. In N. Taatgen & H. van Rijn (Eds.), Proceedings of the 31st Annual Conference of the Cognitive Science Society (pp. 2564–2569). Austin, TX: Cognitive Science Society.


Reeder, P. A., Newport, E. L., & Aslin, R. N. (2010). Novel words in novel contexts: The role of distributional information in formclass category learning. In S. Ohlsson & R. Catrambone (Eds.), Proceedings of the 32nd Annual Conference of the Cognitive Science Society (pp. 2063–2068). Austin, TX: Cognitive Science Society.

Romberg, A. R., & Saffran, J. R. (2010). Statistical learning and language acquisition. Wiley Interdisciplinary Reviews: Cognitive Science, 1(6), 906-914.

Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926-1928.

Saffran, J. R., Newport, E. L., & Aslin, R. N. (1996). Word segmentation: The role of distributional cues. Journal of Memory and Language, 35, 606–621.

Yu, C., Ballard, D. H., & Aslin, R. N. (2005). The role of embodied intention in early lexical acquisition. Cognitive Science, 29, 961-1005.