Learning mechanisms: statistical, rule-like, or both?
Graziela Pigatto Bohn May 2014
University of Toronto
1 Introduction
In Generative Theory, the patterns found in natural languages have traditionally been
accounted for by means of rules (Chomsky and Halle 1968). As to language acquisition, within this framework the child learns, from exposure to a limited set of stimuli, to map underlying representations onto surface structures, and vice-versa, by making use of rules and generalizing them to novel elements (Marcus et al 1999). An alternative mechanism, opposed to the postulation of rule learning, is statistical learning, which has been widely investigated in recent studies. Saffran et al (1996:1926) define statistical learning as ‘a mechanism by which learners acquire information about the distribution of elements in the input’. Thus, in this model, the learner’s role is to figure out how likely elements are to occur in relation to one another.
In recent years, the divergences raised by proponents of each of these two mechanisms,
rule-learning and statistical-learning, have culminated in a vast number of heated discussions in
the field of both psychology and linguistics. For instance, word segmentation experiments
carried out by Marcus et al (1999) aimed at undermining statistical-learning models, mainly by arguing that such models can only be tested with elements to which learners have had previous exposure and are therefore unrealistic when it comes to generalizations involving novel elements, even though the authors do not completely exclude the statistical model. This study has received a number of replies from supporters of statistical-learning models, namely from Christiansen and Curtin (1999), who replicated the experiments conducted by Marcus et al (1999) and report that statistical models can fit the data without invoking any kind of rules. In Altmann and Dienes’s (1999) commentary on Marcus et al’s experiments it is also argued that statistical models account not only for their data, but also for substantially more complex data than those in the experiment (p. 875a). However, while these different views have claimed that statistical learning and rule learning are separate mechanisms, there are proposals, such as that of Aslin and Newport (2012), in which a unifying perspective is suggested.
This paper explores each of these proposals more extensively and analyzes the arguments
presented in favor of each mechanism. §2 and §3 lay out the fundamentals of the statistical and rule learning mechanisms, respectively. These sections also present some of the important studies carried out to test each mechanism. §4 discusses some of the arguments which have been put forward to support each mechanism and §5 presents a proposal which unifies both statistical and rule learning. §6 presents my personal assessment of both mechanisms and concludes.
2 Statistical-learning mechanism
The interest in statistical learning arose from a change from symbolic models to
probabilistic models that emphasize the distribution of exemplars and the importance of the
richness of the input (Aslin & Newport 2008). The term statistical learning was first used in
Charniak’s (1993, apud Aslin & Newport 2008) description of algorithms in computational
linguistics and has become a term used to describe the acquisition of
structured information from the auditory or visual environment via sensitivity to
frequency or probability distributions without overt reinforcement or direct
feedback, but by mere exposure to language. (Aslin & Newport 2008:16)
But how does the learner who is being exposed to language for the first time know which sounds
are relevant for the computation? That is, how does a learner distinguish relevant speech sounds
from noise? One valid point made by Aslin & Newport is that, even though statistical learning seems to be an implausible learning mechanism, several psycholinguistic experiments do control stimuli for word and phoneme frequency, suggesting that subjects are indeed sensitive to these variables and have access to distributional information at the word, syllable, and phoneme levels (2008:20). What allows a statistical learning mechanism to operate without attempting to compute too many statistics, or irrelevant ones, is, according to these authors, a set of constraints – some innate and some learned from the input. For example, work conducted by Newport, Weiss, Wonnacott & Aslin (2004, apud Aslin & Newport 2008) suggests that learners pay attention to segment and syllable information to segment words – depending on whether the word boundaries were at the syllable level or the segment level.1 Also, another study
by Yu, Ballard & Aslin (2005) with adults learning Mandarin suggests that non-linguistic cues
such as eye gaze can also help word segmentation and, consequently, word learning.
The studies which will be presented in the following subsection briefly summarize what
has been achieved in terms of statistical learning by means of experimental analyses with both
infant and adult learners as well as computer models.
1 Other studies (Cutler & Butterfield 1992; Jusczyk, Hohne & Bauman 1999; Johnson & Jusczyk 2001) argue that prosodic cues such as strong and weak stress also play an important role in segmenting words from a continuous acoustic signal.
2.1 Studies based on statistical learning
A seminal study on statistical learning was conducted by Saffran, Aslin & Newport
(1996) who argue that segmentation of words from fluent speech can be accomplished by 8-
month-old infants based solely on statistical relationships between neighboring speech sounds.
Their study is grounded on three fundamental questions: (i) how can children segment words
from fluent speech and subsequently recognize them in isolation? (ii) what information is used
by infants to discover word boundaries? (iii) is there any acoustic correlate which helps learners
discover word boundaries? The authors underline that most studies so far have attempted to answer these questions by emphasizing experience-independent internal structures over the role of
experience-dependent factors. To Saffran, Aslin & Newport, however, it is irrefutable that
experience-dependent mechanisms also have a crucial role in language acquisition, so the
authors address the above questions from a different perspective in which learning, more
specifically statistical learning, plays a primary part in helping learners acquire the structure of a
language. Based on a previous experiment in which adults and children used information about
transitional probabilities to discover word boundaries in an artificial language corpus with no
acoustic cues, Saffran, Aslin & Newport perform a similar analysis to examine if 8-month-old
infants also make use of statistical learning to segment speech into words. The authors make use
of the familiarization-preference procedure in which infants are exposed to auditory material that
serves as potential learning experience. In the first phase of the experiment infants are
familiarized with 2 minutes of a continuous speech stream consisting of four three-syllable
nonsense words. In order to avoid having any acoustic information about word boundaries, the
speech stream was generated by a speech synthesizer in a monotone female voice so that the only
cue to word boundaries was the transitional probability between syllable pairs. For instance, in a stream like bidakupadotigolabubidaku the only cues to word boundaries are the transitional probabilities between syllable pairs – which were higher (more restricted) within words than
between words. In the second part, infants are presented with two types of test stimuli: items
that are contained within the familiarization material and items that are highly similar. During
the test trial, infants showed a significant discrimination between word and non-word stimuli,
paying more attention to nonwords. That is, infants recognized the difference between the novel and the familiar ordering of the three-syllable string, which suggests, according to Saffran, Aslin & Newport, that 8-month-olds are capable of extracting serial-order information after only 2 minutes of listening experience (p.1927). However, according to the authors, the learner must also
be able to extract the transitional frequency of co-occurrence of sound pairs (how likely it is that
some sounds tend to occur in relation to one another). In order to examine that, Saffran, Aslin &
Newport carry out a second experiment in which 8-month-old infants are required to perform a
more difficult statistical computation to distinguish a word-internal syllable pair like pre-tty –
words – from a word-external syllable pair like ty#ba – part-words – as in pretty#baby. The test
trials’ results have shown that babies listen longer to part-words, suggesting that 2 minutes of
exposure is sufficient for them to extract information about the sequential statistics of syllables.
According to Saffran, Aslin & Newport, the fact that these infants could extract word and
syllable segmentation after only 2 minutes of exposure suggests that learners have access to a
powerful mechanism for the computation of statistical properties of the language input. Also,
although experience in the real world is unlikely to be as concentrated as it was in these
experiments, Saffran, Aslin & Newport believe that infants in more natural settings can also
profit from other types of cues correlated with statistical information. Hence, based on how little
input and time these infants had to extract word and syllable segmentation from an artificial
language, they argue that computational abilities may play a more crucial role in language
acquisition than existing theories have suggested.
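The transitional-probability computation at the heart of these experiments can be sketched in a few lines of code. The sketch below is my own illustration, not the authors’ procedure: the toy stream, the bisyllable chunking, and the boundary threshold are all hypothetical, and infants presumably use no explicit threshold.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Estimate P(next | current) from a flat syllable sequence."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

def segment(syllables, tps, threshold=0.75):
    """Posit a word boundary wherever the TP between adjacent syllables dips below threshold."""
    words, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if tps[(a, b)] < threshold:
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words

# A stream built from the nonsense words "bidaku", "padoti", "golabu":
stream = "bidaku padoti golabu bidaku golabu padoti bidaku".split()
syllables = [w[i:i + 2] for w in stream for i in (0, 2, 4)]
tps = transitional_probabilities(syllables)
print(segment(syllables, tps))
# → ['bidaku', 'padoti', 'golabu', 'bidaku', 'golabu', 'padoti', 'bidaku']
```

Word-internal transitions (bi→da) always occur together and so get TP 1.0, while cross-word transitions (ku→pa) are diluted across several continuations, which is exactly the dip the segmenter exploits.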
In a subsequent study, Saffran, Newport & Aslin (1996) tested whether an integration
between transitional probabilities and prosodic cues, in this case word-final lengthening, helped
enhance the learner’s performance on the task of identifying word boundaries. For this
experiment twenty-four adult subjects were recruited and tested on nonsense trisyllabic words with
three different patterns of lengthening: initial lengthening (first syllable was lengthened); final
lengthening (third syllable was lengthened); and no lengthening (no prosodic cues were added,
leaving transitional probabilities as the only cue to word boundaries). To Saffran, Newport &
Aslin, having a consistent prosodic cue combined with transitional probabilities may facilitate
word segmentation. Also, according to them, third syllable lengthening is expected to facilitate
performance in determining word boundaries mainly because of the existence of word-final
lengthening in many languages. The results show that subjects exposed to initial lengthening
performed with a mean score of 21.9 (61%); subjects exposed to no lengthening performed with
a mean score of 23.4 (65%); and subjects exposed to final lengthening performed with a mean
score of 29 (80%). To Saffran, Newport & Aslin, these results support the hypothesis that final
syllable lengthening facilitates word segmentation, suggesting that these subjects used their knowledge of word-final lengthening to segment speech. Based on this, the authors entertain the idea
that the awareness of prosodic cues may be a part of the human perceptual set of strategies for
dealing with continuous speech, but these cues are not sufficient to dispense with the statistical
learning mechanism.2
Following the aforementioned studies conducted by Saffran, Newport & Aslin (1996) and
Saffran, Aslin & Newport (1996) showing that learners seem to make use of statistical information as well as prosodic cues to segment the speech stream, Allen & Christiansen (1996) run a
simulation in which two different neural networks are trained on two different sets of cues: the
first network was trained on a set of 15 trisyllabic (CVCVCV) words – ‘vtp’ words (from
variable transitional probabilities) - in which the word internal transitional probabilities are
varied so as to serve as a possible source of information about lexical boundaries, that is, some
syllables occur in some words and in more locations within words than others (also there are
restrictions to which speech sounds could begin or end words); the second network was trained
on what Allen & Christiansen call ‘flat words’ in which the probability of a given syllable
following any other syllable is the same for all syllables and each syllable was equally likely to
appear in any position within a word, showing, thus, no phonotactic restrictions. By this, a
syllable that appears with higher probability at the end of words than at other positions is more
likely to mark word boundary than a syllable that appears with equal frequency at any position in
the word. Likewise, a syllable that appears more frequently word-initially is less likely to be
followed by a word boundary. Allen & Christiansen’s (1996) purpose is, thus, to see how the
trained neural networks use these probabilistic differences to determine word boundaries. More
specifically, the authors’ expectation is that a combination of cues – transitional probabilities and
2 To Saffran, Newport & Aslin (1996), prosodic cues are quite helpful in initially breaking up the speech stream – evidence of which is infants’ initial productions consisting predominantly of those syllables which are stressed or word-final in adult speech – but the problem of correctly determining the structure and boundaries of many multisyllabic words remains in later stages (p. 608). Saffran, Newport & Aslin also point out that it is unknown whether all languages provide reliable prosodic cues to word boundaries, such as English with word-initial stress and French with word-final stress, for example. Another problem, according to them, is that to access language-specific prosodic patterns such as that of English or French, the learner must have some knowledge of word boundaries to determine if the stress corresponds to the beginning, middle or end of words (1996:608).
phonotactics – facilitates the task of identifying word boundaries. The results indeed show that the
network trained on a flat vocabulary makes almost no discrimination between end-of-word and non-end-of-word positions, while the network trained on vtp words predicts a boundary with
significantly higher confidence at lexical boundaries than at word internal positions. Also, in
order to assess generalizations made by the neural network, Allen & Christiansen tested the vtp
trained network on a set of novel words that are in line with the artificial language and non-
words which violate the phonotactic restrictions. The average activation of boundaries for novel
words is .26 compared to .006 for non-words, suggesting that the vtp-trained network is able to discriminate impossible from possible words. Based on these results, Allen & Christiansen argue that the variability in transitional probabilities between syllables may play an important role in identifying word boundaries, helping learners perform better than a network with access only to word-boundary information.
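Allen & Christiansen’s contrast between ‘vtp’ and ‘flat’ vocabularies can be illustrated without a neural network at all: when transitional probabilities vary, word boundaries show up as dips; when they are flat, there is nothing to detect. The toy lexicons and stream length below are my own hypothetical examples, and a direct TP computation stands in for the authors’ network.

```python
import random
from collections import Counter

random.seed(0)

def transitional_probabilities(sylls):
    pairs = Counter(zip(sylls, sylls[1:]))
    firsts = Counter(sylls[:-1])
    return {(a, b): n / firsts[a] for (a, b), n in pairs.items()}

def stream(words, n=2000):
    """Concatenate n randomly chosen bisyllabic words into one syllable stream."""
    return [w[i:i + 2] for w in random.choices(words, k=n) for i in (0, 2)]

# 'vtp' lexicon: each first syllable fully predicts the second, so word-internal
# TPs are 1.0 while cross-word TPs hover around 1/3.
vtp_words = ["bada", "tigu", "kopa"]
# 'flat' lexicon: every syllable occurs in every position, so all TPs hover
# around the same value and carry no boundary information.
sylls = ["ba", "da", "ti", "gu"]
flat_words = [a + b for a in sylls for b in sylls]

def spread(words):
    """Range of TP values in a stream – large when boundaries are detectable."""
    tps = transitional_probabilities(stream(words))
    return max(tps.values()) - min(tps.values())

print(spread(vtp_words) > spread(flat_words))  # the vtp stream shows far more TP variability
```

The same range statistic comes out near zero for the flat vocabulary, mirroring the finding that the flat-trained network makes almost no end-of-word discrimination.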
Another study examining statistical learning was conducted by Christiansen, Allen &
Seidenberg (1998) in which the problem of segmenting utterances into words by learners was
once again addressed. According to these authors, language acquisition involves three tasks: a
primary task, in which the learner attends to the linguistic input; an immediate task, in which the
learner encodes statistically salient regularities of the input concerning phonology, lexical stress,
and utterance boundaries; and a derived task, in which knowledge about linguistic structures that
are not explicitly marked in the input is acquired (for instance, boundaries between words). The
focus of their work is the integration of cues - distributional and acoustic - and how such
integration can facilitate the discovery of derived linguistic knowledge by means of a computer
model. However, in this study, instead of using an artificial language, Christiansen, Allen & Seidenberg expose the model to a phonetically transcribed corpus of child-directed speech so
that the input has the same characteristics as the speech human learners are exposed to. The corpus, taken from the CHILDES database, consists of 27,467 words distributed over 9,108 utterances and was
randomly divided into a training corpus and a test corpus. The training corpus consists of mainly
monosyllabic words (86.8%), compared with 12.3% bisyllabic words and 0.9% trisyllabic words.
Also, most bisyllabic and trisyllabic words show a strong-weak stress pattern (77.3% and 77.6%,
respectively). The distribution of words in the test corpus is very similar to that of the training
corpus: 86.5% of monosyllabic words; 12.7% of bisyllabic words; and 0.7% of trisyllabic words.
The model was also trained under five different training conditions which varied according to the
combination of cues provided: (i) phonological, utterance boundary, and stress information; (ii)
phonological and utterance boundary information; (iii) phonological and stress information; (iv)
phonological information; and (v) stress and utterance boundary information. This experiment’s
results show that the network seems to have the necessary attributes to integrate multiple cues
when segmenting speech, suggesting that interactions between cues may form an additional
source of information for word segmentation. According to Christiansen, Allen & Seidenberg, the
networks are able to perform the immediate task of predicting the next speech sound in a
phonological string, which indicates learning the phonology and phonotactics of the language,
as well as to perform the derived task of word segmentation. Overall, the network could segment
44% of the speech stream. However, the authors point out that because the network was trained
on child-directed speech to infants between the ages of 6 and 16 weeks, it is difficult to compare
it with humans because it is yet unknown how well infants do at this age. They also highlight
that these results should not be used to account for adult-level word segmentation without
additional cues, training, and/or architectural augmentation.3
Finally, in a study on the role of consonants and vowels in continuous speech processing,
Bonatti et al (2005) take as their starting point the assumption that vowels and consonants carry
different kinds of information, the latter being more tied to word identification and the former to
grammar. That being so, in a word segmentation task involving speech streams, learners will
track transitional probabilities among consonants, but not among vowels. That is, consonants are
more transparent to statistical computations than vowels, suggesting that when learners compute
transitional probabilities over consonants, they will tend to extract not the actual sequence of
syllables that compose the words encountered in the stream, but the sequence of consonants.
According to Nespor et al (2003), consonants predominate in an on-line task of word
segmentation because they preferentially contribute to lexical processing of words, whereas
grammatical variations rest mostly on vocalic segments. On the basis of these observations,
Bonatti et al propose that the task of interpreting the lexicon falls more on consonants than on
vowels. And because vowels and consonants behave differently in the language, they may elicit
different computations in word segmentation. The results of an experiment conducted by these
authors show that learners can compute transitional probabilities between consonants to segment
speech into words, suggesting that the consonantal tier plays a role in language processing. As to
vowels, learners clearly failed to compute transitional probabilities of vocalic sounds to segment
speech. This study suggests that learners can break a continuous speech into words when relying
on consonants, but they are apparently unable to do so when relying on vowels. The language,
thus, may limit which representations are open to statistical computations and which are not.
3 Ways of improving the model so that it is comparable to human word segmentation include: exposure to more data, and, therefore, a greater variety of words (the model was trained on less than a month’s worth of data); a change in stress patterns (the simulation performed here included only trochaic words); vowel quality including vowel reduction in weak syllables; and correlation between speech input and non-linguistic stimuli (Allen & Christiansen, 1998:252-256).
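The intuition that statistics might be computed over a consonantal tier, abstracting away from vowels, can be made concrete with a small sketch. The nonsense words below are my own invented examples, not Bonatti et al’s stimuli:

```python
from collections import Counter

VOWELS = set("aeiou")

def consonant_frame(word):
    """Project a toy, orthographic CVCVCV word onto its consonantal tier."""
    return "".join(ch for ch in word if ch not in VOWELS)

# Six words that differ freely in their vowels but share two consonant frames,
# so statistics computed on the consonant tier collapse the vowel variation:
words = ["pureki", "pariko", "poruke", "tamegu", "timagu", "tomuga"]
frames = Counter(consonant_frame(w) for w in words)
print(frames)  # Counter({'prk': 3, 'tmg': 3})
```

A learner tracking transitional probabilities over these frames would treat all three p_r_k_ words as repetitions of one item, which is the sense in which consonants are said to be “more transparent” to statistical computation than vowels.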
The five studies briefly summarized here are grounded on a common driving question -
whether learners track statistics in linguistic input to segment words in a speech stream – and
have reached a common answer: yes. However, we have seen that when statistical learning
incorporates more aspects of the speech stream, such as phonotactics and prosodic cues, learning
is enhanced, and this can serve as an indicator that a learner does not rely solely on statistical
information. While the primary focus of these studies has been how sensitive a learner is to
regularities in the input, they have made it clear that a single cue is insufficient and that cues are
not independent of each other. These studies have, however, either tested humans on artificial
languages in which words had unvarying size (usually trisyllabic), or tested computer models on
both artificial and natural languages. This means that statistical learning has not been tested with
a rich input characteristic of natural language and human subjects at the same time.
Alternatively, according to Saffran, Aslin & Newport (1996), it is possible that the complexity of natural languages actually facilitates learning, because they contain non-linguistic stimuli as well as other cues that may correlate with statistical cues. But because no study so far has tested this
difference, it is also possible that a learner may fail when faced with the variability inherent in
natural speech (Romberg & Saffran 2010).
3 Rule-learning mechanism
Similarly to statistical learning, the mechanism of rule learning also posits the rapid
learning of regularities of a language from exposure to a limited set of stimuli. However, one of
the differences between rule learning and statistical learning has to do with the open-ended
abstract relationship present in the former but not in the latter. According to Marcus et al (1999:77), in an equation such as y = x + 2, x can be substituted by any value. Likewise, languages are also governed by rules which allow generalizations, that is, open-ended relationships, and as
soon as a rule is mastered, the learner is capable of applying it to novel items. In the following
subsection some of the studies regarding rule-learning mechanism will be briefly summarized.
As will be noticed, all of these studies share common ground in focusing on the
abstraction and generalization of language structures.
3.1 Studies based on rule-learning mechanism
In a study aimed to test rule learning by 7-month-old infants, Marcus et al (1999)
consider that learners might possess at least two learning mechanisms, one for learning statistical
information and another for learning algebraic rules. In this study, Marcus et al tested 7-month-old infants4 in three experiments in which, according to them, a statistical learning mechanism
would not suffice. In each experiment, infants were familiarized to 3-word sentences from an
artificial language and tested on 3-word sentences formed with words which did not appear in
the habituation phase. The infants were tested on whether the test sentences were consistent or inconsistent with the grammar they had been familiarized with. Marcus et al highlight that, because
none of the test words appeared in the habituation phase, infants could not rely on transitional
probabilities. Also, because sentences were all the same length and generated by a computer,
infants could not rely on statistical properties such as number of syllables or prosody either
(p.78). In the first experiment, infants were exposed for 2 minutes to either an ‘ABA’ condition
4 According to Marcus et al, these subjects are old enough to be able to distinguish words in fluent stream of speech.
(‘ga ti ga’, for instance) or an ‘ABB’ condition (‘ga ti ti’, for instance). In the test phase, infants
were presented with entirely new words, such as ‘wo fe wo’ or ‘wo fe fe’, for instance. Half of
the sentences were consistent with the grammar presented in the habituation phase and the other
half was constructed from the grammar the infant had not been exposed to. The results show that
15 of 16 infants attended more to the inconsistent sentences. This result is in accordance with
Marcus et al’s expectations. To them, if infants can abstract a rule and generalize it to novel
items, they should attend longer to inconsistent items because they differ from the rule learned.
Marcus et al notice however that, even though test sentences were made up of new items, there is
still overlapping regarding phonetic features (sentences in both habituation and test phase
starting with voiced consonant and followed by a word starting with a voiceless consonant, for
example). The regularities involving phonetic features may, thus, be helping infants make
predictions of what to expect when presented with novel items. In order to rule out the possibility
that learners might be learning a sequence based on phonetic feature rather than abstracting a
rule and generalizing it, Marcus et al conduct a second experiment in which words were more
carefully constructed and arranged. Similarly to experiment 1, the results of experiment 2 show that 15 of 16 infants paid more attention to inconsistent items during the test phase. Another problem
found by the authors involves the reduplication found in ABB and not in ABA which could be
leading learners to differentiate one grammar from another without making use of any rule. To
solve this, Marcus et al designed a third experiment in which they compared ABB and AAB
sentences so that reduplication was contained in both grammars. Once more, infants paid more
attention to inconsistent items during the test phase. After having carried out these three
experiments, Marcus et al conclude that a mechanism which relies solely on statistical learning
and is, therefore, sensitive to transitional probabilities between words, could not achieve these
results, mainly because all items presented in the test phase were novel. In other words, a learner
could not rely on transitional probabilities based on phonetic cues, for example, because these
cues were not consistent across the habituation and test phase. According to Marcus et al, only a
rule-learning mechanism allows infants to extract abstract rules between elements, such as x then y, and generalize them to new items independently of transitional probabilities. Also, it is argued by
them that the subjects tested in the experiments seem to be able to extract rules rapidly from
small amount of data.
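The ‘algebraic’ character of such a rule can be made explicit in code: classifying a sentence as ABA or ABB inspects only identity relations between positions, never the words themselves, so entirely novel words are handled for free. This is my own sketch of the idea, not Marcus et al’s model:

```python
def pattern(sentence):
    """Reduce a three-word sentence to its abstract pattern, ignoring the words."""
    a, b, c = sentence.split()
    if c == a and b != a:
        return "ABA"
    if c == b and b != a:
        return "ABB"
    if b == a and c != a:
        return "AAB"
    return "other"

# The rule abstracted during habituation transfers to entirely new words:
print(pattern("ga ti ga"))  # ABA – a habituation-style item
print(pattern("wo fe wo"))  # ABA – novel words, consistent with an ABA grammar
print(pattern("wo fe fe"))  # ABB – inconsistent with an ABA grammar
```

A purely item-based statistical learner has no transition counts for ‘wo’ or ‘fe’ after ABA habituation; the variable-based classifier above is indifferent to which words fill the slots, which is exactly the contrast Marcus et al draw.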
Gomez & Gerken (1999) also tested the ability infants have to abstract beyond a specific
set of sentences, but differently from the studies presented so far, Gomez and Gerken were
interested in extending such research question to the acquisition of syntax. The authors
conducted 4 different experiments in which 10-month-old infants were not tested on their
memory for familiar grammatical strings constructed from an artificial language, but were asked
to generalize the acquired structures to new instances. Because the sentences used during the
familiarization phase were similar but not identical to the ones presented during the test phase, especially because infants were exposed to grammars characterized by variable word order, they could not rely on transitional probabilities to abstract long-distance dependencies. Experiment 1
asked whether infants were able to generalize to new instances by distinguishing new grammatical strings from illegal ones. Experiment 2 asked the same question as experiment 1, but
this time the strings showed violations of grammatical word order. Experiment 3 tested if infants
could discriminate new strings of their training grammar from strings produced by another
grammar. In this experiment, both strings began and ended with the same words, but the ordering
of words differed within the strings. Finally, in Experiment 4, infants were trained in one
vocabulary and tested on a new vocabulary. In all experiments, infants were able to distinguish
the learned grammar from the unlearned one. Also, the fact that in experiments 3 and 4 word ordering and vocabulary in the test phase differed from the habituation phase suggests that
infants showed the discrimination regardless of their training grammar. This means that the
ability to discriminate grammars in these experiments resulted from the infants’ ability to extract
information by abstracting beyond specific word order. To Gomez & Gerken, a statistical
learning mechanism almost certainly plays a role in word segmentation, but the syntactic use of
units must certainly involve some degree of abstraction.
Johnson et al (2009) conducted an experiment on rule-learning based on Marcus et al
(1999) in order to look into infants’ sensitivity to structured relations among stimulus features
from visual and auditory input. According to Marcus et al (2007), infants are better at learning
rules from speech than other domains of auditory stimulus, such as musical tones, timbres, and
natural animal sounds.5 The goal of Johnson et al’s study is to examine which mechanisms are
involved in infant rule learning. Because previous studies, such as that by Marcus et al (2007),
have shown that rule learning is facilitated by familiar stimuli – speech, for instance – Johnson et al test whether the use of unfamiliar stimuli might challenge the process of learning rules. First,
infants were exposed to sequences of colored shapes organized according to ABB, ABA or AAB
patterns (e.g., octagon-square-octagon, bowtie-star-bowtie). Following the habituation phase, a
new set of colored shapes was shown based either on the familiar rule or a novel rule to which
infants had not been exposed before (two test trials consisted of the three possible three-shape
sequences that had a familiar pattern, and two test trials with the same three-shape sequences
5 In this study, Marcus et al (2007) exposed infants to structured sequences consisting of either pure speech syllables (naturally sung) or nonspeech sounds (pure tones, instrument timbres, and animal sounds) and asked if infants could extract rules from these sequences. Their results show that infants were able to extract rules only when exposed to speech sequences. However, if infants first heard a rule based on speech, they were able to generalize the rule to sequences of nonspeech sounds. To them, this indicates that a learned rule can also be transferred to a different domain. Most importantly, their conclusions suggest that speech can facilitate rule learning in domains where humans might not acquire rules in a direct manner.
arranged in a different order to ensure that infants were not distracted by being tested on a
different color or shape). The first experiment tested if 11-month-old and 8-month-old infants are
able to distinguish a sequence of ABB shapes from a sequence of AAB (late and early
repetitions), a discrimination that can be accomplished by infants when they are exposed to
speech (Marcus et al. 1999, Marcus et al 2007). After habituation to ABB, 11-month-old infants
showed the ability to discriminate both ‘grammars’, while 8-month-olds did not show such
ability. This suggests that the first group of infants could extract the ABB pattern and apply it to new sequences. In order to test if the younger group of infants failed to distinguish the two grammars due to a common characteristic between them – it is possible that the pattern of repetition in both ABB and AAB led infants to see both grammars as similar – Johnson et al conducted a second
experiment in which the discrimination between ABB and ABA was tested. The results show
that ABB rule might be easier to discriminate within an ABB-ABA contrast because infants
learned the repetition during the habituation phase, but were not required to detect its position in
the sequence in the test phase. To the authors, it might be that 8-month-old acquired the rule
‘identiy anywhere’ or ‘there is a repetition’. Based on this, Johnson et al conclude that one
abstract relation that infants acquire relatively early is that of repetition, followed by that of the
position of elements. However, the goal of this study was not to address the earliest age at which
rules can be acquired, but to examine how infants react when learning rules from a visual input.
Their results present evidence that infants are able to detect abstract relations from sets of
unfamiliar stimuli that share no surface features. The authors also highlight infants’
ability to notice a pattern and extend it to analogous cases, suggesting the development of
analogy.
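To make concrete what such an abstract, identity-based rule amounts to, the sketch below (my own illustration in Python; it is not code from any of the studies discussed) classifies a three-element sequence by its repetition structure alone, ignoring what the elements are. The same rule therefore covers shapes and syllables alike, which is the sense in which it generalizes to novel items.

```python
def pattern_of(seq):
    """Classify a three-element sequence by its repetition structure alone,
    ignoring the identity of the elements: the sense in which the rule is abstract."""
    a, b, c = seq
    if a == b and b != c:
        return "AAB"
    if a == c and a != b:
        return "ABA"
    if b == c and a != b:
        return "ABB"
    return "other"

# The same identity-based rule applies to shapes and syllables alike:
print(pattern_of(["octagon", "square", "octagon"]))  # ABA
print(pattern_of(["wo", "fe", "fe"]))                # ABB
```

Note that the function never inspects the tokens themselves, only the equality relations between positions; this is what a purely item-based statistical learner, by hypothesis, cannot do for tokens it has never encountered.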
4 Which mechanism?
The studies presented so far have shown that statistical learning is a rapid and
efficient mechanism that enables both adults and infants to learn patterns from both
auditory and visual input; it has been argued, however, that only rule-like mechanisms make
generalization to novel items possible.
According to Marcus (2000), some of the things we know about language are very
specific – for instance, the st sequence is usually followed by r and never by x – while others are
more general and abstract – the fact that we can form a sentence by combining any plural noun
with a plural verb or that we can add –ing to verb stems to form the progressive tense. Both
specific and abstract knowledge are, to Marcus, statistically reliable reflections of the world.
However, learning these two types of knowledge may stem from two different mechanisms: one
pertaining to how specific units relate to each other (statistical learning), and the other pertaining
to open-ended schemas which allow free substitution and generalization (rule learning).
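The statistical side of this division can be made concrete with a small sketch. Assuming a Saffran-style continuous syllable stream (the syllables below are illustrative, not the actual experimental stimuli), a statistical learner tracks the transitional probability between adjacent units:

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Estimate P(next syllable | current syllable) from a stream:
    the statistic that Saffran et al's learners are assumed to track."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(x, y): n / first_counts[x] for (x, y), n in pair_counts.items()}

# An illustrative stream in which 'pa bi ku' recurs as a 'word'
stream = "pa bi ku ti bu do pa bi ku go la tu pa bi ku".split()
tps = transitional_probabilities(stream)
# Within-word transitions come out high (pa->bi is 1.0), while transitions
# across a word boundary come out lower (ku->ti is 0.5), cueing a boundary.
```

This is exactly the "how specific units relate to each other" knowledge Marcus describes; nothing in these counts supports substituting a never-heard syllable into the pattern, which is what the open-ended schema is meant to capture.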
According to Marcus et al (1999), a system that relies solely on relations between specific items
is completely ruled out when it comes to generalizations to novel items. Based on their results,
the authors claim that infants were not using relations between specific elements to distinguish
the grammars, but had indeed acquired an open-ended rule which could be generalized to new
instances. Marcus et al, however, do not completely disregard statistical learning as a
mechanism to acquire language. The results from their first experiments have indeed shown that
infants could be relying on relations between phonetic features presented during the habituation
phase and using these relations to distinguish both grammars. However, in a follow-up
experiment in which input was carefully designed not to provide crucial phonetic information
which could be used as a cue, infants were still able to distinguish grammars satisfactorily. To
them, the crucial point is that learners acquire an abstract system only by making use of
variables and operations, and it is this that allows generalization. More specifically, they state that
‘[…] such mechanisms cannot account for how humans generalize rules to new items that do not
overlap with the items that appeared in the training’ (1999:79). However, contrary to this claim,
Christiansen & Curtin (1999) argue that a statistical mechanism modeled by a neural network
architecture typically trained to predict what the next item in a sequence of inputs is can indeed
fit Marcus et al’s data without invoking any kind of abstract rule. By means of binary
phonological features the network was trained on Marcus et al’s (1999) habituation sentences
and then tested with different items based on the same grammar from the habituation phase.
Even though Christiansen & Curtin show that their model can fit Marcus et al’s data, when
presented with the test sentences the model was better at predicting phonemes occurring in
inconsistent than in consistent sentences. Although it may seem unclear why this happened,
Christiansen & Curtin suggest that the inconsistent items are more salient to the learner and,
therefore, attract more of their attention. Regarding speech segmentation, Christiansen & Curtin
argue that no algebraic rules are needed. To them, however, it remains to be determined whether
rules are needed outside the domain of speech segmentation.
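As a deliberately simplified caricature of this debate (Christiansen & Curtin's actual model was a connectionist network trained over phonological features, not the toy counts used here; the syllables below are my own illustrations), the sketch shows how a learner that only tracks co-occurrence statistics can still score grammar-consistent test items above grammar-inconsistent ones, without ever consulting an explicit rule:

```python
from collections import Counter

def train_bigrams(sentences):
    """Count adjacent syllable pairs in the habituation input."""
    counts = Counter()
    for s in sentences:
        counts.update(zip(s, s[1:]))
    return counts

def familiarity(model, sentence):
    """Score a test item by how familiar its transitions are;
    no explicit rule is ever consulted."""
    return sum(model[pair] for pair in zip(sentence, sentence[1:]))

# Illustrative ABB-style habituation items (not Marcus et al's actual stimuli)
habituation = [["ga", "ti", "ti"], ["li", "na", "na"], ["ga", "na", "na"]]
model = train_bigrams(habituation)

consistent = ["ga", "ti", "ti"]    # ABB: reuses seen transitions
inconsistent = ["ti", "ga", "ti"]  # ABA: mostly unseen transitions
```

For genuinely novel syllables these surface counts are all zero, which is precisely Marcus et al's objection; Christiansen & Curtin sidestep it by representing items as overlapping phonological features rather than atomic symbols, so that novel items partially resemble trained ones.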
Even though it is not yet clear which mechanism best accounts for language acquisition,
it is evident that they are still seen as two separate and independent mechanisms. There is,
however, a proposal in which a unifying perspective is put forth, arguing for a statistical-learning
mechanism which accounts for both the learning of patterns and the generalization to novel
instances. The following subsection discusses this particular proposal.
5 Statistical and rule learning: a unifying proposal
According to Aslin & Newport (2012), it is possible to have a balance between instance-
learning and generalization based on two facts: a language learner cannot acquire a structure
without gathering regularities from the input; likewise, the learner cannot wait for every possible
language structure to be in the input before inferences are made. Aslin & Newport highlight that
even though statistical-learning enables learners to extract statistics from input and to use the
information to make decisions about recurrent materials, it does not address the question of how
learners make rules, that is, how learners go beyond what they have seen or heard. Instead of
separating both mechanisms, Aslin & Newport suggest that they are part of one learning
mechanism. But how? The hypothesis formulated by these authors is that some stimuli are
naturally more salient than others; if stimuli are coded based on their salient aspects rather
than on their specific details, learners will be able to generalize to all stimuli that present the
same salient aspects, even when these stimuli have never been heard or seen before. The salient
aspects of the input can also constrain the encoding of the structure presented, guiding learners
to what they should attend to. As to rules, these are acquired when patterns in the input indicate
that elements can occur interchangeably in the same structure (e.g., category of verbs). The same
contrast between the word ‘dog’, which is generalized to all kinds of dogs, and the word ‘Huck’,
which refers to one specific dog, holds when learning language rules and items.
According to a recent study conducted by Reeder, Newport & Aslin (2009, 2010), it is the
consistency of cues in the input that leads learners to generalize rules to novel strings, while the
inconsistency of cues leads them to treat certain strings as exceptions. Hence, learning stems from
a single mechanism in which the result either applies to elements that have been experienced in
the input or to generalizations beyond experienced elements.
6 Final Considerations
This paper has focused on two learning mechanisms which have been believed to operate
separately in the acquisition of grammar. While proponents of statistical learning argue that this
is a fundamental mechanism which allows learners to acquire patterns presented in the input,
supporters of the rule-like mechanism state that it is only by means of acquisition of rules that
language abstractions and generalizations are possible.
As to statistical learning, I have observed that many of the studies conducted so far
focus on one very specific domain of grammar: word segmentation. Also, the languages used in
the experiments are usually artificial, with very controlled structures such as word length, syllable
shape, and limited vocabulary. These studies also do not seem to account for linguistic variation,
such as allophones, which occur in real language data. That is, how would statistical learning
function given a combination of human learners and the rich input characteristic of natural
language? Also, if the statistical learning mechanism relies solely on regularities found in surface
forms, how does a learner acquire more complex structures which are not so obvious in the
input?
As to rule-learning, it seems to me that the argument that this mechanism allows the
acquisition of more complex structures, abstractions and generalizations does not suffice to
dispense with statistical learning. In order to formulate rules and extend them to novel items, it
seems reasonable that the learner first attends to regularities from the input. Also, if statistical
learning is completely disregarded as a mechanism, what will be the starting point of a learner
when acquiring rules?
A unifying mechanism, as suggested by Aslin & Newport (2012), seems to offer a
plausible answer to these questions and should, thus, be explored in future studies.
References
Aslin, R.N., & Newport, E.L. (2008) What statistical learning can and can't tell us about language acquisition. In J. Colombo, P. McCardle, and L. Freund (eds.), Infant Pathways to Language: Methods, Models, and Research Directions. Mahwah, NJ: Lawrence Erlbaum Associates.
Aslin, R. N., & Newport, E. L. (2012). Statistical learning: From acquiring specific items to forming general rules. Current Directions in Psychological Science, 21(3), 170-176.
Allen, J., & Christiansen, M.H. (1996). Integrating multiple cues in word segmentation: A connectionist model using hints. In Proceedings of the 18th annual Cognitive Science Society conference (pp. 370–375). Mahwah, NJ: Lawrence Erlbaum Associates Inc.
Bonatti, L. L., Pena, M., Nespor, M., & Mehler, J. (2005). Linguistic Constraints on Statistical Computations: the role of consonants and vowels in continuous speech processing. Psychological Science, 16(6), 451-459
Christiansen, M., Allen, J., & Seidenberg, M. (1998). Learning to segment speech using multiple cues: A connectionist model. Language and Cognitive Processes, 13, 221–268.
Christiansen, M. H., & Curtin, S. L. (1999, August). The power of statistical learning: No need for algebraic rules. In Proceedings of the 21st annual conference of the Cognitive Science Society (Vol. 114, p. 119).
Cutler, A., Butterfield, S. (1992). Rhythmic cues to speech segmentation: Evidence from juncture misperception. Journal of Memory and Language, 31, 218–236.
Johnson, E., & Jusczyk, P. (2001). Word segmentation by 8-month-olds: When speech cues count more than statistics. Journal of Memory and Language, 44, 548–567.
Jusczyk, P.W., Hohne, E. A., & Bauman, A. (1999). Infants’ sensitivity to allophonic cues for word segmentation. Perception and Psychophysics, 61, 1465–1476.
Gomez, R. L., & Gerken, L. (1999). Artificial grammar learning by 1-year-olds leads to specific and abstract knowledge. Cognition, 70(2), 109-135.
Marcus, G. F. (2000). Pabiku and Ga Ti Ga: Two mechanisms infants use to learn about the world. Current Directions in Psychological Science, 9(5), 145-147.
Marcus, G. F. et al (1999). Rule learning by seven-month-old infants. Science, 283(5398), 77-80.
Marcus, G. F. et al (2007). Infant rule learning facilitated by speech. Psychological Science, 18(5), 387-391.
Nespor, M., Peña, M., & Mehler, J. (2003). On the different roles of vowels and consonants in speech processing and language acquisition. Lingue e linguaggio, 2(2), 203-229.
Reeder, P. A., Newport, E. L., & Aslin, R. N. (2009). The role of distributional information in linguistic category formation. In N. Taatgen & H. van Rijn (Eds.), Proceedings of the 31st Annual Conference of the Cognitive Science Society (pp. 2564–2569). Austin, TX: Cognitive Science Society.
Reeder, P. A., Newport, E. L., & Aslin, R. N. (2010). Novel words in novel contexts: The role of distributional information in formclass category learning. In S. Ohlsson & R. Catrambone (Eds.), Proceedings of the 32nd Annual Conference of the Cognitive Science Society (pp. 2063–2068). Austin, TX: Cognitive Science Society.
Romberg, A. R., & Saffran, J. R. (2010). Statistical learning and language acquisition. Wiley Interdisciplinary Reviews: Cognitive Science, 1(6), 906-914.
Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274, 1926-1928.
Saffran, J. R., Newport, E. L., & Aslin, R. N. (1996). Word segmentation: The role of distributional cues. Journal of Memory and Language, 35, 606–621.
Yu, C., Ballard, D. H., & Aslin, R. N. (2005). The role of embodied intention in early lexical acquisition. Cognitive Science, 29, 961-1005.