Modelling the early stages of language acquisition


Implicit structure in continuous speech

This study investigates how implicit linguistic structure available in the speech signal can be derived without pre-programmed linguistic knowledge. The aim is to relate infants’ learning of the ambient language to computational methods of speech processing. A model is designed to mimic human biological constraints rather than adult knowledge of language and rules. Our hypothesis is that the system should be able to find possible word candidates and gain some sense of linguistic structure simply by taking advantage of the regularities available in the speech signal. We evaluate the model using infant-directed speech (IDS) and illustrate how implicit linguistic structure can be derived from the speech signal as an unintended consequence of a general pattern-matching process. We discuss how the ideas underlying the model can also prove useful in accounting for other aspects of information processing.

Nina Svärd Marty
Master thesis in computational linguistics (magisteruppsats i datorlingvistik)
Department of Linguistics, Stockholm University
December 2003
Supervisor: Francisco Lacerda

Lisa Gustavsson
Master thesis in phonetics (magisteruppsats i fonetik)
Department of Linguistics, Stockholm University
December 2003
Supervisor: Francisco Lacerda

1. Introduction
2. Background
   2.1. Language acquisition in infants
   2.2. Theories on brain development and learning in computational systems
   2.3. Speech processing in computational systems
   2.4. Shifting the focus – applying the infants’ perspective
3. The model
   3.1. Operating of the model
   3.2. Model Flowchart
4. Evaluation
5. Discussion
   5.1. Conclusion
6. References


1. Introduction

The purpose of this study is to investigate the possibility of finding implicit structure already available in the speech signal, without the restriction of a priori expectations of how it “should” or “must” be as seen from a phonological perspective. A model is designed to test this idea using as little background information as possible. In order to create a very basic model, free from pre-programmed linguistic knowledge, we relate the way traditional computational systems process speech to the way infants seem to learn the ambient language. Our hypothesis is that, simply by taking advantage of the regularities (or irregularities) available in the speech signal, such as spectral information, the system should be able to find possible word candidates and gain some sense of linguistic structure. Furthermore, the aim is to do this by adopting the infant’s point of view. We believe that essentially similar ideas could also prove useful in accounting for other aspects of information processing.

The following chapter will serve as a conceptual framework within which speech processing will be considered in the light of comprehensive theories on learning and development. Various problems concerning speech processing will be introduced and they will be related to theories and ideas on brain development and intelligence. A summary of current automatic speech recognition (ASR) systems will also be provided. Finally, our theoretical starting point, related to these comprehensive theories on learning and development, is discussed in more detail.

2. Background

Processing speech is obviously not a problem for us humans. We do it every day, often without even reflecting on what exactly it is that we do. We are able to understand speech under various circumstances; we can, for example, understand speech that is distorted by background noise, slurred speech, mispronunciations, etc. (Warren, 1982). But what happens when we try to apply our talent for speech processing to computational systems? Automated speech recognition technology and human speech interaction with computers have been topics of research for many years. Yet computational systems are far from being as talented as humans when it comes to processing and recognising speech. There are, of course, several possible reasons for this lack of talent. Maybe we should start by asking ourselves what precisely it is that we need to know in order to make systems process speech more successfully. Do we need to know more about the structure of language and about the way we use language? Are machines intelligent enough to mimic human behaviour in these respects? Or do we simply need better programming skills? We will address these questions and try to investigate them from a different angle than the traditional one, where knowledge is viewed from a dogmatic perspective. From our point of view, the infant (or any system) need not figure out what words, phonemes or syntactic structure are, or even that there are such things, in order to learn a language. We will introduce a tentative model with the purpose of discovering implicit structure in continuous speech. The only built-in constraints in this model are a filtering of the speech signal to simulate the human auditory process and a matching procedure based on acoustic similarities. Since this model attempts to mimic the initial processes in language acquisition, we will give a brief summary of current issues in infants’ discovery of spoken language.


2.1. Language acquisition in infants

How is it possible that infants all manage, seemingly effortlessly, to learn a language regardless of complex and varying input? There is no clear answer to this question, even though the subject has been a topic of research and debate for decades. The theories of early language acquisition and the perspectives adopted by researchers are manifold and sometimes contradictory. One major source of conflict is the question of whether language acquisition is a purely genetic predisposition in humans (the nativistic theory) or a mere learning process. However, researchers seem to agree that infants make use of certain clues in order to perform the very first tasks involved in language acquisition, for example segmenting the speech signal. To an adult listener this task may seem completely effortless and is most likely taken for granted as an automatic process. To the infant, on the other hand, it is one of the very first hurdles it must overcome in order to make sense of the ambient language. The closest we can get to understanding the difficulty of speech segmentation is perhaps to adopt the view of someone exposed to a foreign language for the very first time. We will doubtlessly face the problem of not being able to tell where one word ends and the next begins. It is a common misperception that speakers of a foreign language speak much faster than we do in our native language. This idea probably arises from our not being able to segment the stream of words that we hear. With this in mind, we may begin to understand the difficulties facing the infant.

In order to segment the speech signal the infant must of course be able to distinguish between the different speech sounds. The infant is able to discriminate not only between speech sounds relevant to the ambient language but between all speech sounds it is exposed to (Vihman, 1996). This general ability to discriminate between all speech sounds decreases towards the end of the child’s first year, when it can no longer discriminate between sounds which do not play a significant role in the ambient language (Polka and Werker, 1994). Thus, the infant refines the acquisition process by filtering away “irrelevant” knowledge and focusing on the acoustic dimensions important for the language in question.

So what are the clues the infant uses when segmenting the speech signal? Perception studies have shown that some time between 6 and 9 months of age infants seem to be aware of different phonological properties of the language which may serve as clues for the segmentation task (Brent, 1999). One example is rhythmical clues: in English, stressed syllables are usually word-initial, so word candidates in which the stressed syllable is not initial can be ruled out as likely beginnings of words. Another example of a phonological clue is the phonotactic restrictions of a language: how frequent or infrequent (or even “illegal”) certain combinations of sound segments are. For example, in English words the phone [ŋ] can occur at the end of a syllable, as in the word rang [ræŋ], but cannot occur at the beginning of a syllable (Jusczyk, 1997).

There are of course many more things involved in the language acquisition process than merely being able to discriminate between speech sounds and segmenting the signal. Already during the second trimester of the first year, lexical development seems to kick off (Jusczyk, 1997). At the age of 4½ months the infant recognises its own name, and by 7½ months the infant can recognise words in a continuous speech stream (Jusczyk, 2001). Parallel with the infant’s perceptual development, a development in production takes place, from early vocalizations (already at birth) to speech-like babbling by 10 months. However, there seems to be a discrepancy between the perceptual and articulatory development in the first year: perception precedes production in the sense that infants understand more words than they can produce (Vihman, 1996). The question of how perception and production are linked together is still unanswered. On the one hand, it would be reasonable to assume that production and perception go hand in hand if articulation required an understanding of the speech signal (Lieberman et al., 1967), or if the articulatory gestures could be directly derived from the speech signal (Best, 1995). On the other hand, early babbling in infants could be seen as just spontaneous, playful vocalization, with the link between production and perception not established until an understanding of the ambient language has taken place (Locke, 1983). There is empirical evidence indicating that the infant is influenced by perception at a very early age and that babbling drifts towards the surrounding language (Boysson-Bardies et al., 1992), but there is also evidence that language-specific effects in babbling appear late in many infants (Engstrand et al., 2003). None of these findings is very surprising according to Engstrand and colleagues; they explain that perception probably influences production at a very early age in some children and later in others. The initial babbling could be seen as universal in the sense that infants produce sounds that can be found in any human language (in the same way as they can discriminate between all sounds). Sooner or later the repertoire becomes more attuned towards sounds that are specific to their ambient language (in the same way as they lose their ability to discriminate between all sounds). Exactly when this happens is probably determined by the infant’s individual developmental stage (which may differ significantly between infants) rather than by a fixed age.

The ambient language of almost every infant comes in the form of Infant Directed Speech (IDS), a typical speech style used by adults when communicating with infants. IDS highlights the language-specific properties in the speech signal and there is no doubt that this structured speech style helps the infant on its way towards a spoken language. Fernald explains IDS from an evolutionary perspective. She claims that adults may have evolved a specific style of speech that infants respond to and that this facilitates both social development and language acquisition. In a series of studies she investigated the role of IDS in the early communication between caretaker and infant. When mothers were asked to highlight certain words when talking to their infants, they generally used an exaggerated pitch contour in those words and put them in phrase-final position (Fernald and Mazzie, 1991). 15-month-old infants also preferred listening to this modulated and highly repetitive speech style rather than to adult-directed speech (Fernald and McRoberts, 1991). When the same study was carried out with 18-month-old infants, the speech style (infant-directed or adult-directed) did not have any impact on their performance in word-object mapping (Fernald, 1994). She explains these results with a scheme of the developmental functions of IDS. The initial function of IDS (at the beginning of the first year) could be seen as social bonding: the adult uses the voice to soothe, please or alert the infant. At the end of the first year the adult rather uses IDS to acoustically highlight words or linguistically meaningful units of speech. During the second year it seems that the infant has already acquired the basic structure of the ambient language.

The infant has to connect the sound sequences it hears with meaning. Whereas communication between adults is usually about exchanging information and emotions, speech directed to infants is of a more referential nature. The adult refers to objects, people and events in the world surrounding the infant. Because of this it is rather unavoidable that the sound sequences the child hears will eventually coincide with seeing or feeling the actual object or event the sound refers to. The repetitive structure of IDS may here play an important role in helping the infant arrive at a plausible word-object link. There is, of course, no guarantee that the sound sequence the child links to the object will be the correct word: it is possible that too large or too small a segment will be linked to the object at this stage. However, continued exposure to spoken language is likely to help the infant narrow the pool of possible word candidates or revise the sound-object links (Lacerda, 2003). In our view this ability to separate representations of identified sound chunks from the remaining unknown sound sequences will eventually lead to the construction of a lexicon. We attempt to illustrate this idea with our model. Our view is that the language acquisition process need not be very complex in its organization and that it is in fact a rather self-explanatory learning process. Nevertheless, we wish to give a broader picture of the reasoning behind the question of whether language acquisition is a genetic predisposition in humans or a learning process.

2.2. Theories on brain development and learning in computational systems

Traditionally, advocates of nativistic theories have argued that the brain, even before learning, consists of separate modules, each dedicated to processing a specific task (Pinker, 1994a; Chomsky, 1998). Speech processing is seen as being carried out by discrete operations, described in terms of rules. This reasoning has been subject to discussion. For example, primates seem to organise their brains very much depending on what input they are exposed to during their development. Prompted by the discovery that blind people use their visual cortex for non-visual processes, Sur et al. (1999) and von Melchner (2000) carried out a series of studies investigating brain plasticity in young animals. They found that ferrets, which had their visual pathways rewired to their auditory cortex and their auditory pathways rewired to their visual cortex, began to process visual stimuli with their auditory cortex and vice versa. This indicates that the brain is not as task-specific as previously assumed. Traditionally, however, this has not been taken into consideration when trying to make machines perform tasks in human environments. In the area of speech recognition, researchers have assumed that the human brain has some kind of innate behavioural scheme for communicating. Therefore, attempts have been made to imitate the way the human brain processes speech, using a set of rules implemented in the speech recognition system. An example of such a rule is the phonotactic constraints mentioned previously: in English it is phonotactically illegal for a cluster like /zt/ to occur inside words, whereas the cluster /st/ occurs very frequently (Mattys and Jusczyk, 2001). If a system is already programmed with all the complex structures and rules of the language, it will have difficulty solving new or unexpected problems, since it lacks the crucial basic processing necessary to incrementally structure more complex information: it cannot handle problems that are not covered by the programmed behavioural rules.

Of course, there are more flexible systems that, when something goes wrong, try a different solution. But according to Minsky (1985b), not even this makes the system intelligent, since “A distinctive aspect of what we call intelligence is this ability to solve very wide ranges of new, different kinds of problems” (Minsky, 1985a). Intelligent beings (like us humans) look for causal explanations and, when we find them, add them to our networks of belief and understanding. We learn intelligently from our own experiences. A task-specific programmed system cannot learn from its own experience because it does not have any. Minsky questions why we make programs do “grown-up” things before we make them do “childish” things. He argues that it is necessary for a system to discover and learn basic thinking in order to reach advanced knowledge and to develop one very important learning mechanism, namely common sense. He further describes intelligence as the ability to solve hard problems and the specific ways in which humans solve them. Among the different approaches Minsky refers to as intelligent ways of solving problems, we can mention the use of sub-goals (breaking problems down into smaller parts), sub-objects (making descriptions based on parts and relations), cause-symbols (explaining and understanding how things work), memories (accumulating experience about similar problems) and economies (dividing scarce resources efficiently). These are desirable qualities from which ASR systems could perhaps benefit. We will later return to strategies for problem solving when discussing our choice of method. Since humans qualify as intelligent beings, one could compare the problems of ASR systems with the situation facing the infant when acquiring its first language. The task of processing a speech signal in order to sooner or later find syntactic structure is the same in both cases. Infants are very sensitive to variation in the pitch contour of phrases, syllable duration, intensity, spectral variation and repetitive patterns, and several studies indicate that infants start very early to make use of that type of information in order to find structure in an utterance (Crystal, 1975; Jusczyk et al., 1992; Fernald, 1984; Kuhl and Miller, 1982). However, there is a fundamental difference between how infants, who have not yet acquired a language, and adults interpret speech sounds. Since infants do not share the linguistic conventions that adults have, they cannot search for linguistic information in the speech signal in the same way as adults do. Sundberg (1998) describes how variation in the pitch contour of phrases, syllable duration, intensity and repetitive patterns serves as something “attention-catching” to the infant. As mentioned earlier, this speech style is characteristic of IDS. To an adult listener, however, the same variation carries explicit linguistic information.

Given the variation in speech, it would seem impossible for infants to find any linguistic structure in the varying information flow of the ambient language unless they have some form of innate linguistic knowledge, a genetically programmed module enabling language learning (Pinker, 1994b). Maybe one should question this reasoning, since it closes all doors to other possible explanations. Elman (1999), for example, suggests that one does not necessarily have to view the information flow as a problem children must struggle with when learning words; instead, perhaps one should view it as an asset. Since language is used to refer to the world, it is likely that when a number of sensory inputs are stored continuously and simultaneously (i.e. visual and auditory stimuli), systematic correlations will eventually emerge. Children’s associations are therefore constrained, because the child connects the sensory inputs that attract the most attention with the most frequent words in a specific situation. As we store more and more information, we learn more words that bring us closer to a linguistic structure. According to Guttentag (1995) this is, from the very beginning, an automatic process during which we simply connect what we experience with what we hear.

Elman (1999) compares infants with computational network models. In his view, infants have an immature memory and can therefore only process simple sentences and structures. We assume in our model that infants’ internal representations of the world are much rougher than adults’ in the sense that they are not focused on categories that will be adequate later on. The infants’ ability to structure information in more detail is therefore viewed as an incremental development, in line with Elman’s suggestion that it is the improvement in memory capacities and the acquisition of basic primitive representations (ambient-language sound chunks stored in the brain) that allows the infant to process more complex sentences. However, the input data are also of great importance when the initial capacity of a system is limited but matures with time. Lack of variation in the information may cause the system to make wrong generalisations, and too much variation in the information may slow down the learning process until enough data have been gathered to make generalisations. This seemed to be the case in a study on sound-meaning connections made by infants (Koponen et al., 2003), where the results showed that too much variation in the speech material tends to prevent the infants from structuring the speech signal in a meaningful way. Similar results were presented in a study on obtaining linguistic structure from orthographically transcribed speech (Svärd et al., 2002).

The nature of the input seems to be of great importance when it comes to developing linguistic representations. Lacerda and Lindblom (1996) suggest that from the very beginning the infant relies on automatic learning processes to handle information. As the infant stores sensory input and associates different kinds of information with each other, e.g. auditory with visual, the word acquisition process gets started and the infant soon creates a linguistic scaffold for itself. Therefore, for a system to mimic human strategies, learning in computational systems should start with data which permit the system to learn primitive representations while the system is young and plastic. In typical infant-adult interactions, the speech used by the adult is attention-catching, repetitive and context-bound (easily related to external objects and actors). The infant is continuously exposed to parallel sensory input (auditory, visual, tactile, etc.) and stores these sensory exposures in memory. According to Lacerda and Sundberg (2001) this eventually leads to the emergence of the underlying structures, since memory decay effectively filters out infrequent exposures. Elman (1993, p. 95) describes the infant’s immature memory as “…like a protective veil shielding the infant from stimuli which may either be irrelevant or require prior learning to be interpreted.” Lacerda and Sundberg (2001) suggest that language acquisition does not need complex ad hoc explanations based on genetically pre-programmed linguistic knowledge of any sort. Instead, they describe language acquisition as the result of the interaction between universal biological prerequisites (such as the functioning of our hearing, motoric constraints and memory capacity) and conditions created by the linguistic surroundings (such as the nature and quality of the input).

Within nativistic theories (Chomsky, 1988) a line is drawn between the acquisition of primitive representations and the acquisition of grammatical structure, and they are viewed as two separate modules. However, Bates and Goodman (1999) have chosen to turn that reasoning on its head. They assume that there is in fact no separate grammatical module, and that grammatical structure arises naturally as the child tries to find a way to handle the information flow when accumulating more and more words. In a similar way, Elman (1993) suggests that with time our memory develops and its capacities increase. This implies that the conditions for learning have changed. Elman further suggests that the memory capacities we have as young infants filter the input to avoid overloading the system, and that as the memory develops, less input is filtered out. As a consequence of the growth of input, the system starts to accept greater variance. Hence, the conditions for structuring grammatical information in the speech signal become more favourable. Nowak (2000) describes the emergence of grammatical structure in a similar fashion to Bates, Goodman and Elman. Syntax arises when the number of signals in a system reaches a certain threshold: the vocabulary becomes too large to handle. As the amount of information increases, the signals are reused and start to structure themselves in order to make the process more economical. Thus, syntax arises as an emergent result of the limited resources available for processing an increasing amount of exposure to structured information. He further suggests that as the number of signals to remember increases, using syntax becomes more profitable. The system becomes more flexible, and one does not have to learn formulations in advance in order to convey new messages.

2.3. Speech processing in computational systems

All ASR systems can be described as methods of converting continuous signals into discrete sets of symbols such as phonemes, phones or words. A system runs a probabilistic check of the input signal’s correspondence to a stored vocabulary, based either on whole-word units or, more commonly, on phonemes, diphones, triphones or pronunciation variants. ASR systems can be either speaker-dependent or speaker-independent. Speaker-dependent ASR technology requires the user to train the system to recognise his or her own voice. In a speaker-dependent system, a personal set of discrete sounds or phonemes is created as the user trains the program, whereas a fixed, default set of discrete sounds is provided in a speaker-independent program. Speaker-independent systems are designed to recognise any speaker’s voice, but this type of system is usually less accurate than speaker-dependent systems because of its difficulty in recognising non-standard speaking patterns such as foreign accents and dialects. Speech recognition techniques can handle either continuous speech or discrete speech. Continuous speech recognition allows the speaker to talk normally, with a natural speech rate, whereas discrete speech recognition requires the speaker to pause between each word.

Riederer (2000) summarises the present, traditional approaches in the ASR research area, including the acoustic-phonetic approach, the statistical approach (also referred to as the hidden Markov model, HMM, approach) and the artificial intelligence (AI) approach. Acoustic-phonetic techniques are very sophisticated in the sense that they make use of all the information available in the speech signal. Stable regions in the speech signal are labelled according to their acoustic properties, e.g. nasality, frication, formant locations and voicing. The labelled segments are then matched against words/phonemes/phones in the lexicon. This is a very sensitive method which is seldom used on its own but is often included in other systems.

The statistical approach is the most common approach used in speech processing today. Statistical methods use pattern recognition, and the underlying principles of the initial acoustic analysis are usually the same as in the acoustic-phonetic approach mentioned above. The speech waveform is split into a sequence of acoustic vectors, each vector representing some milliseconds of speech spectrum. The statistical method uses acoustic analysis to match the segments against a pronouncing dictionary and converts the segments into a sequence of phones. Each phone has a hidden Markov model. Transition probabilities between different states (phones) yield possible word candidates, each matching the uttered word with a certain probability. Then, given the preceding words, the probability of a certain word occurring in the utterance is calculated, and it is therefore possible to determine the most probable word sequence.
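As a toy illustration of this last step (not taken from the thesis or from any particular ASR system), the sketch below combines invented per-segment acoustic log-likelihoods with invented word-transition probabilities and picks the highest-scoring word sequence. All words, scores and probabilities are made up for the example.

```python
# Toy sketch: combining acoustic evidence with a bigram language model to rank
# candidate word sequences. Every number below is invented for illustration.
import math
from itertools import product

# Hypothetical acoustic log-likelihoods: how well each candidate word matches
# each of two observed speech segments.
acoustic = {
    0: {"the": -2.0, "a": -2.5},
    1: {"dog": -1.0, "fog": -1.2},
}

# Hypothetical bigram language model: log P(word | previous word).
bigram = {
    ("<s>", "the"): math.log(0.6), ("<s>", "a"): math.log(0.4),
    ("the", "dog"): math.log(0.5), ("the", "fog"): math.log(0.1),
    ("a", "dog"): math.log(0.3), ("a", "fog"): math.log(0.2),
}

def score(words):
    """Total log score = acoustic evidence + language-model prior."""
    total, prev = 0.0, "<s>"
    for t, w in enumerate(words):
        total += acoustic[t][w] + bigram[(prev, w)]
        prev = w
    return total

hypotheses = list(product(acoustic[0], acoustic[1]))
best = max(hypotheses, key=score)
print(best, score(best))   # the most probable word sequence under this toy model
```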


The AI approach can be seen as a hybrid of acoustic-phonetic and statistical models. Furthermore, it aims to mimic human intelligence in the sense that the system uses a number of knowledge sources, just as humans do, in order to solve the problem in question. Modules of cognitive linguistic capabilities, such as grammar, semantics, pragmatics and syntax, are added to the system. The drawback of all these methods is that they require a priori knowledge of the phonetic units. This is impossible to obtain exhaustively, since pronunciations vary endlessly between speakers and within the same speaker across situations. The AI approach, with its idea of a system able to learn from experience, may be an excellent theoretical approach. However, it does not yet offer a flexible fusion of the different knowledge sources. So what should we require from an optimal and successful speech processing system? What we must try to achieve is a system that perceives speech in the same manner as humans. Riederer (2000) puts it this way:

“…a universal and maximally beneficial speech recogniser should be capable of recognizing continuous speech from multiple speakers with different accents, speaking styles (including poor articulation), vocabularies, grammatical tendencies and multiple languages in noisy environments. Moreover, the recognizing system should be sufficiently small, low-cost and work in real time. Finally, the system should be able to adapt and learn new lexical, syntactic, semantic and pragmatic information as a human can.”

Nevertheless, the speech recognition systems used today often work very well, provided nothing occurs that they are not programmed to handle. If errors do occur, the recovery time can be quite long, causing the programs to be very slow. This is unavoidable when dealing with systems that match sequences against a default set of sounds. Batliner et al. (2001) criticise the systems used today because they use an intermediate phonological level of description between the abstract function and the concrete phonetic form. Such a system possesses stored knowledge about the language which is similar to the linguistic knowledge of an adult. They argue that better results could be achieved if the systems functioned without such an intermediate level, using instead a direct link between the syntactic/semantic function and the phonetic form. Such systems would store new knowledge on their own, in a similar way to infants. Speech communication is traditionally seen from the perspective of the adult, and naturally evaluations of speech recognition systems are also, in most cases, related to adults’ ability to recognise speech. It is possible that speech recognition technology would benefit from relating speech recognition systems to infants’ ability to recognise speech.

In certain respects, the infant’s perspective is already considered in the development of intelligent humanoid robots (robots used for interaction with humans in human environments). Weng and Zhang (2002) suggest a developmental approach for robot systems in order to make it possible for robots to adapt and learn new tasks in human environments. In contrast to other AI systems, which implement different knowledge modules from the beginning, their robot SAIL (Self-organizing, Autonomous, Incremental Learner) builds up its own knowledge during its life span, and the knowledge sources become interconnected in a functional way. Traditional robots are usually programmed to perform specific tasks. This task-bound programming results in rigid systems that cannot handle new or unexpected situations. A traditional task-specific robot requires human programmers to fully understand the domains of the tasks and to be able to predict them. The developmental approach, on the other hand, is motivated by human cognitive and behavioural development from infancy to adulthood. It aims to provide a system with broad and unified developmental skills instead of separate knowledge modules. The learning mechanism in a developmental system must be able to perform systematic self-organisation and develop its cognitive and behavioural skills through direct interactions with its environment (the physical world). Such a learning mechanism enables the machine to learn new tasks that a human programmer did not know about at the time of programming. In the developmental approach, as suggested by Weng and Zhang, robots are based on the behaviour of the infant instead of the adult. We think it would be profitable to adopt a similar approach to information processing in general. Or, as Minsky (1985b) puts it:

“Most of the world of "Computer Science" is involved with building large, useful, but shallow practical systems, a few courageous students are trying to make computers use other kinds of thinking, representing different kinds of knowledge, sometimes, in several different ways, so that their programs won't get stuck at fixed ideas.”

There is unfortunately one problem with the method Weng and Zhang use when implementing their ideas: they do not make use of the nature of the input from the beginning, in the very simple and basic way that we do. Instead they jump straight to the concept of words and grammar when teaching their robot. Looking closer at their very sophisticated system architecture (Weng and Zhang, 2001), there is nothing in it that speaks against the robot learning words and grammar from continuous IDS if it were only given the chance to do so. But instead they start out by teaching the robot to recognise single, separate words, and that is not the reality for an infant.

In theory, however, the developmental approach to learning in AI resembles our own theoretical starting point for handling the speech signal. We believe that it is necessary to start over and to start small: starting with “childish” things before starting with “grown-up” things in order to achieve more flexible systems. Making a system grow up without having had a “childhood” will only lead to temporary solutions. Adopting a more naïve and developmental approach will lead, if not to better and more flexible systems that can recover from errors, at least to an attempt to address the long-discussed problems within speech processing from a different angle. We try to test this naïve approach in our model by applying what we believe is the infant’s perspective.

2.4. Shifting the focus – applying the infants’ perspective

We would like to shift the focus from the more traditional starting point, where systems are ascribed actual linguistic knowledge and restrictions, to a more general and universal attitude where systematic variance in the speech signal is the only available clue. Our starting point is at a very basic level. In our model we try to use non-linguistic knowledge to obtain linguistic information. Our intention is to evaluate what we believe are more general learning principles that better correspond to the early language learning process in infants, and in doing so perhaps contribute to a different approach to solving problems within speech processing systems. Within the area of computational linguistics, for example, it is becoming popular to implement self-developing principles when building semantic lexicons, and such implementations could probably benefit from knowledge about initial speech structuring in humans. Shifting the focus from the traditional point of view to a more general and universal scope may in the long run lead to more flexible speech recognition systems.

Our model is not restricted by built-in knowledge that reduces the search space; instead, the system learns through the regularities present in the signal, such as the frequencies of sound sequences. The implicit structure may give rise to “meaningful errors”. The result may not necessarily be correct from a phonological perspective, but the errors are meaningful in the sense that they constitute a systematic grouping of strings or sequences. An example of a “meaningful error” is the name of the Portuguese town Porto, which means ’harbour’. In Portugal, the town is often referred to as Oporto (the definite article o + the name of the town Porto), which means ’the harbour’. Since non-Portuguese speakers do not know about this article, and since the article is always included when mentioning this town but no other towns, it is interpreted as belonging to the name itself. A similar example of a “meaningful error” can be found in the evaluation of a segmentation model presented by Svärd et al. (2002). The results showed how the Portuguese article o, present in the speech material, was perceived by the system (and by the authors, who had no knowledge of Portuguese) as belonging to the adjacent word. The fact that we make “mistakes”, including mistakes in perceptual interpretation, is almost certainly a universal characteristic of the learning process of intelligent beings, and any intelligent machine will make them too.

In our model of early speech perception we try to implement the behaviour of an infant perceiving speech. The infant is not ascribed any innate linguistic knowledge but still manages to learn complex linguistic tasks. Since it would be desirable to develop computational systems that are able to learn complex linguistic tasks, the comparison between an infant’s learning behaviour and that of a computational system is motivated. In both cases, it is a matter of extracting recurrent acoustic patterns from a speech stream. Initially, the segmentation is probably merely a coarse structuring of the continuous speech signal into relatively large segments, which stand out in the acoustic signal due to some recurrent properties the infant learns to recognise. If the infant spontaneously focuses on possible recurrent patterns in the speech signal, these patterns may function as an excellent foundation for further structuring. It may therefore be of value to study infants’ speech processing as an approach to solving problems encountered within the research area of ASR and other applications of computational linguistics. In the following chapter we present our model of the initial structuring of speech. The model is based solely on the characteristics of a plastic system and its memory constraints. We test the model using infant-directed speech as input and simulate only humans’ ability to hear and forget. By taking advantage of information available in the speech signal, such as spectral variation, similarities and frequencies, our hypothesis is that the system should be able to find possible word candidates and an embryo of phrasal structure.


3. The model

As mentioned, the model aims to resemble the early language acquisition of the infant, where the plastic and immature memory is crucial for utterances to be stored efficiently (Elman, 1999). On the basis of acoustic similarities between sound sequences of different lengths, the model suggests probable word candidates that are prominent in the input. The basis for this study is a model for structuring continuous speech (Svärd et al., 2002) which was designed to suggest possible word candidates on the basis of the relative frequency of orthographically transcribed speech strings. What we refer to as “words” could of course be any meaningful speech unit. Unlike the models used in ASR systems, which are based on already acquired knowledge stored in memory, this model has no stored knowledge. We assume that the infant processes the speech signal according to spectral variation (the patterning of energy over time) and recognisable acoustic borders, such as silence. We dare to make these assumptions since perception studies have shown that infants can discriminate isolated sequences of all speech sounds they are exposed to (Aslin et al., 1998). Studies also indicate that infants can recognise familiar sound patterns in a continuous flow of speech (Jusczyk and Aslin, 1995). In order to investigate the potential relevance of prosodic patterns for the initial structuring of speech, a study was carried out by Gustafsson and Lacerda (2002). Pitch contours in IDS were analysed to disclose recurrent patterns that could be described in terms of transitional probabilities. Essential prosodic properties in the input material could be captured by generating artificial pitch contours according to these transitional probabilities. In a similar fashion, the syntax of the input used in our model could be revealed by transitional probabilities of suggested word candidates. This illustrates the kind of information in the speech signal that emerges on-line without any linguistic knowledge.
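To make the last point concrete, the sketch below (our own illustration, not part of the thesis) estimates transitional probabilities between suggested word candidates from simple adjacency counts; the candidate sequence is invented, loosely echoing the evaluation examples.

```python
# Illustrative sketch (not from the thesis): estimating transitional probabilities
# between suggested word candidates from adjacency counts. The sequence is invented.
from collections import Counter, defaultdict

candidates = ["este", "oniko", "eumlind", "oniko", "este", "oniko", "eumlind", "oniko"]

pair_counts = Counter(zip(candidates, candidates[1:]))  # adjacent candidate pairs
left_counts = Counter(candidates[:-1])                  # how often each candidate is a left context

transitions = defaultdict(dict)
for (left, right), n in pair_counts.items():
    transitions[left][right] = n / left_counts[left]    # P(right | left)

print(dict(transitions))
# e.g. P(oniko | este) = 1.0 for this toy sequence
```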

The model is designed to imitate the “fluid intelligence” characteristic of humans (Horn and Cattell, 1966). Fluid intelligence is the ability to deal with novel problem-solving situations for which we lack experience, and it depends largely on the ability to manage information in working memory. Over our life span we progress from using fluid intelligence to depending more on crystallised intelligence. Crystallised intelligence is the ability to apply previously acquired knowledge to current problems, and it depends largely on the ability to retrieve information and previously learned problem-solving schemas from long-term memory. We believe that omitting the stage of fluid intelligence and using crystallised intelligence directly results in the many rigid systems we see today. Without ever having had the plasticity of fluid intelligence, a system will naturally find it very hard to be anything but rigid, and it will certainly not be flexible enough to handle new problems. This reasoning falls in line with Minsky’s approaches to intelligence, where a dynamic memory and an economical structuring of information are two main factors for efficient problem solving.

Our memory mechanism is one important factor in how we learn language, but the way our ears are designed is crucial for how we perceive speech, and this cannot be omitted when trying to understand language learning. We do not know exactly how infants perceive a speech signal, but several psychoacoustic studies (Werner, 1992) indicate that there are differences between how infants and adults process sound with regard to intensity and frequency. This is not very surprising according to Werner et al., given that the task of the developing auditory system is different from that of the mature system. It seems likely that the processing of auditory inputs changes during development. This, however, is probably more a psychological than a physiological phenomenon, and we do not address the issue further in our model. What we do know about the ear is that temporal and spatial factors influence speech perception. Psychoacoustic studies have shown that the frequency range covered by the basilar membrane in the human ear can be divided into smaller areas which influence our subjective interpretation of the speech signal (Gelfand, 1998). These areas are referred to as critical bands, and each band has its own frequency width, which varies in size. The frequency resolution of the basilar membrane is not a direct match to the acoustic signal. The critical bands can be said to function as filters, which to a certain extent enhance certain sounds. This means that our subjective perception of the speech signal’s loudness changes when the perceived speech signal passes a critical-band border, whereas the perceived loudness is even throughout one critical band. One critical band represents one Bark [1], and one Bark corresponds to a certain length of the basilar membrane, which in turn corresponds to a constant number of hair cells.
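The thesis does not state which analytic Hz-to-Bark approximation, if any, underlies the 21-band filter bank described below, so the following is purely illustrative: a minimal sketch using Traunmüller's widely cited formula, under which the example frequencies from Table 2 land close to integer Bark values.

```python
# Illustrative sketch only: the thesis does not say which analytic Bark approximation
# (if any) was used for the 21-band filter bank. This uses Traunmüller's formula,
# a widely used Hz-to-Bark approximation.
def hz_to_bark(f_hz: float) -> float:
    """Approximate critical-band rate (Bark) for a frequency in hertz."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

if __name__ == "__main__":
    # A few band frequencies from Table 2; they land close to integer Bark values.
    for f in (119, 1081, 3702, 7992):
        print(f, round(hz_to_bark(f), 2))
```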

3.1. Operating of the model

To represent the initial steps of the natural language acquisition process we used IDS, which is characterised by a high-pitched and modulated melody with many pauses and repetitions (Sundberg, 1998), to evaluate the model. A battery of sentences, produced in this IDS style by a female speaker, was digitally recorded in a sound-treated room. The signal was sampled at 44.1 kHz, 16 bit, and subsequently downsampled to 16 kHz to save computational space before further processing.

First, a crude auditory representation (cf. Carlson & Granström, 1982), which can be imagined to correspond to critical bands (see Table 2), is generated by processing the speech signal with a filter bank consisting of 21 digital band-pass filters covering adjacent 1-Bark bands up to 8 kHz, followed by full-wave rectifiers [2] and integrators [3]. The speech material is sampled in 5-millisecond slices. Each slice is viewed as a 21-dimensional vector representing the spectrum of the signal for that time slice. To simulate auditory representations, the level of the input signal, which is calibrated in dBSPL [4], is converted to hearing levels (dBHL) [5] by subtracting the hearing threshold within each Bark band. The auditory representations are stored in a general-purpose memory buffer after being processed to preserve only reasonably audible sounds: sounds with a total intensity of less than 25 dBHL were considered silence (the corresponding auditory representation was set to 0 dB for all 21 bands), since signals below this hearing level are generally difficult to hear in a typical everyday acoustic environment.

[1] The Bark scale exhibits a non-linear relationship to the frequency scale. There is a direct link between the place of the basilar membrane’s maximum oscillation for a certain frequency and the Bark scale. This relationship has given rise to the Bark scale also being referred to as tonotopic (from Greek, “sound-place”).

[2] Full-wave rectification is an engineering technique for obtaining the absolute amplitude values.

[3] Integrators are low-pass filters that reduce the ripple of the signal. They perform an engineering analogue of a running average within the integrator’s time window.

[4] dBSPL = decibel “Sound Pressure Level” (i.e. relative to 20 µPa, 1 kHz). dB is a relative intensity measure, but by defining a reference level it can be used as an absolute physical measure of intensity. The perceptual significance of intensity levels expressed in dBSPL is not accounted for.

[5] dBHL = decibel “Hearing Level”. This is a measure of intensity intended to reflect our auditory perception by taking into account the frequency-dependent absolute hearing threshold.
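As a minimal numerical sketch of the thresholding step just described, the code below assumes that band-pass filtering, rectification and integration have already produced 21 band levels in dBSPL for each 5 ms slice. How the "total intensity" of a slice is computed is our own assumption (a power sum across bands), and the example slice is invented.

```python
# Hedged sketch of the silence-gating step. The hearing thresholds are those of
# Table 2; the power-sum definition of "total intensity" and the clipping to 0 dB
# are assumptions, not details taken from the thesis.
import numpy as np

HEARING_THRESHOLDS = np.array([
    45.00, 35.25, 25.50, 18.50, 11.50, 10.38, 9.25, 8.13, 7.00, 6.75, 6.50,
    7.75, 9.00, 9.30, 9.60, 10.00, 9.75, 9.50, 12.50, 15.50, 13.00])

def to_auditory_representation(slice_dbspl: np.ndarray, silence_db: float = 25.0) -> np.ndarray:
    """Convert one 21-band dBSPL slice to dBHL and gate inaudible slices to silence."""
    dbhl = slice_dbspl - HEARING_THRESHOLDS           # subtract threshold per Bark band
    dbhl = np.clip(dbhl, 0.0, None)                   # nothing below the hearing threshold
    total = 10 * np.log10(np.sum(10 ** (dbhl / 10)))  # assumed: total intensity = dB power sum
    if total < silence_db:                            # treat weak slices as silence
        return np.zeros_like(dbhl)
    return dbhl

example_slice = np.full(21, 30.0)                     # invented: 30 dBSPL in every band
print(to_auditory_representation(example_slice))
```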


Time     cb01      cb02      cb03      cb04      cb05      cb06      cb07      cb08      cb09      cb10
96.000   3.34E+01  2.00E+01  2.24E+01  3.80E+01  1.95E+01  2.16E+01  1.38E+01  1.27E+01  9.93E+00  1.01E+01
96.005   3.45E+01  1.76E+01  2.36E+01  3.79E+01  2.02E+01  2.10E+01  1.28E+01  1.35E+01  1.20E+01  9.26E+00
96.010   3.54E+01  1.74E+01  2.47E+01  3.81E+01  2.07E+01  2.04E+01  1.46E+01  1.61E+01  1.49E+01  9.86E+00
96.015   3.52E+01  1.90E+01  2.54E+01  3.83E+01  2.07E+01  2.14E+01  1.57E+01  1.78E+01  1.48E+01  9.27E+00
96.020   3.46E+01  2.22E+01  2.59E+01  3.87E+01  2.03E+01  2.16E+01  1.55E+01  1.60E+01  1.23E+01  4.63E+00
96.025   3.43E+01  2.52E+01  2.62E+01  3.87E+01  2.02E+01  2.06E+01  1.36E+01  1.35E+01  9.40E+00  3.08E+00
96.030   3.36E+01  2.63E+01  2.55E+01  3.83E+01  1.90E+01  1.89E+01  1.45E+01  1.29E+01  9.79E+00  4.53E+00
96.035   3.22E+01  2.47E+01  2.50E+01  3.81E+01  1.91E+01  2.02E+01  1.61E+01  1.51E+01  9.34E+00  9.26E+00
96.040   3.24E+01  2.23E+01  2.49E+01  3.80E+01  2.02E+01  2.17E+01  1.37E+01  1.57E+01  1.01E+01  9.39E+00
96.045   3.42E+01  2.03E+01  2.57E+01  3.79E+01  2.04E+01  2.09E+01  1.36E+01  1.41E+01  1.26E+01  3.33E+00
96.050   3.49E+01  1.90E+01  2.61E+01  3.81E+01  2.13E+01  1.94E+01  1.58E+01  1.55E+01  1.50E+01  8.40E+00
96.055   3.48E+01  2.02E+01  2.56E+01  3.84E+01  2.09E+01  1.88E+01  1.53E+01  1.51E+01  1.45E+01  9.56E+00
96.060   3.46E+01  2.21E+01  2.57E+01  3.88E+01  2.02E+01  1.92E+01  1.58E+01  1.30E+01  1.13E+01  1.01E+01
96.065   3.48E+01  2.38E+01  2.57E+01  3.88E+01  1.97E+01  1.76E+01  1.35E+01  1.31E+01  7.68E+00  7.36E+00
96.070   3.45E+01  2.48E+01  2.53E+01  3.84E+01  1.80E+01  1.82E+01  1.47E+01  1.30E+01  2.77E+00  6.02E+00
96.075   3.38E+01  2.35E+01  2.45E+01  3.82E+01  1.81E+01  2.08E+01  1.52E+01  1.31E+01  7.59E+00  1.04E+01
96.080   3.35E+01  2.26E+01  2.42E+01  3.79E+01  1.92E+01  2.20E+01  1.28E+01  1.30E+01  9.93E+00  9.08E+00
96.085   3.45E+01  2.21E+01  2.42E+01  3.77E+01  1.89E+01  2.14E+01  1.32E+01  1.31E+01  9.47E+00  9.95E+00

Table 1. Example of the filtered speech material used as input. Each column is one of the 21 critical bands of the filtered speech signal (only bands cb01–cb10 are shown); each row is one 5-millisecond time slice, with time given in seconds.

In this way the incoming signals are continuously represented by paths in the 21-dimensional auditory memory space, where the weaker portions of the signal have been equated to total silence. In this memory space, similarity is measured by the inverse of the distance between the auditory representations (i.e. the distance between the vectors mapping the auditory representations in the memory space). The actual computation of the distances between representations in the memory space was carried out using Mathematica 4.2, in which a city-block distance metric [6] was implemented. The algorithm allows for a generalized concept of distance between any two vectors associated with the auditory representations [7]. The distance metric can be used to study a signal’s internal stability (by relating the vector from a reference time slice with vectors from other time slices within the same signal) or, as we will exemplify below, to study similarity between arbitrary portions of two different input signals.

Since we do not want to include linguistic assumptions in the model, we solve the problem of defining “relevant” chunks of sound by picking up whatever sound sequences have intensities above 25 dBHL without interruptions longer than 0.5 seconds. The duration of the delimiting silence period was selected to reflect the duration of pauses in typical speech communication situations. In spite of this reference to speech communication, the duration of the pauses is obviously not associated with linguistic knowledge but rather with typical breathing and phonation rhythms.

[6] The city-block distance is the sum of the absolute differences along each of the N dimensions, normalised by N:

$$\mathrm{dist}(\mathrm{ref}, x) = \frac{1}{N}\sum_{i=1}^{N}\left|\mathrm{dBHL}_{i,\mathrm{ref}} - \mathrm{dBHL}_{i,x}\right|$$

[7] In a further implementation of the model, a more realistic interpretation of these distances would be according to Stevens’ power law (Stevens, 1955). It would cost more computationally but would give more accurate values in terms of auditory perception.
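For concreteness, a minimal sketch of this distance is given below, assuming two dBHL vectors for single 5 ms slices; the numbers are invented and only three bands are shown instead of 21, and reading the garbled formula as a mean (rather than a raw sum) is our interpretation.

```python
# Hedged sketch of the city-block (per-band mean absolute difference) distance used
# to compare auditory representations. `a` and `b` stand in for dBHL vectors of one
# 5 ms slice each; the values are invented.
import numpy as np

def cityblock_distance(ref: np.ndarray, x: np.ndarray) -> float:
    """Mean absolute per-band difference between two auditory spectra (in dB)."""
    return float(np.mean(np.abs(ref - x)))

a = np.array([33.4, 20.0, 22.4])   # invented 3-band example instead of 21 bands
b = np.array([34.5, 17.6, 23.6])
print(cityblock_distance(a, b))    # small value -> spectrally similar slices
```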


The matching process is implemented in Mathematica, but the storage of the signals for further matching is done manually (in Excel) to illustrate how the model works (see the Evaluation section for a few examples). These different blocks of implementation can of course be fused into a more sophisticated program in the future, but for the moment we want to evaluate the principal components of our method. One can think of the memory as a three-dimensional map of sound representations (see figures 2a and 2b) in a Time × Frequency × ActivationLevel space. The spectral information of the incoming signal is continuously passed on to this memory as long as the input level of the signal is above the 25 dBHL threshold. When the level of the input signal stays below the threshold for at least 0.5 seconds, the mapping process is discontinued and the map is stored as a memory buffer that is affected by memory decay. The activation levels of this memory buffer therefore decrease as a function of time (memory decay). A typical memory decay factor may be defined as a 50% reduction of the activation level per 5-second period [8]. When the input level again exceeds the 25 dBHL threshold, a new mapping procedure is initiated until there is a new period of silence. The latest representation is transferred to a new memory buffer that is subsequently compared with the previously stored buffers that still have some activity. The information in the buffers is compared by calculating the frame-by-frame similarity between the stored patterns. Whenever the similarity between two portions of buffers reaches a critical threshold, the matching portions are considered relevant recurrent patterns and stored in a library of potential “lexical candidates”.

[8] The specific value of the decay factor is tentative. It is intended to capture the fact that infants may be able to remember the acoustic detail of utterances heard a short while ago, but that such a memory vanishes with time. An empirical basis for a more realistic decay time is yet to be established.

Band   Hz     Hearing threshold (dB)
1      119    45.00
2      204    35.25
3      297    25.50
4      399    18.50
5      509    11.50
6      631    10.38
7      765    9.25
8      915    8.13
9      1081   7.00
10     1268   6.75
11     1479   6.50
12     1720   7.75
13     1997   9.00
14     2319   9.30
15     2698   9.60
16     3152   10.00
17     3702   9.75
18     4386   9.50
19     5258   12.50
20     6407   15.50
21     7992   13.00

Table 2. Critical bands: frequency and hearing threshold for each of the 21 filter bands.
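The buffering, decay and matching just described can be sketched as follows. The decay rate (50% per 5 s) and the 2 dB match threshold used later in the Evaluation follow the text, while the buffer layout, the minimum match length and the exhaustive frame-by-frame scan are our own illustrative assumptions rather than the authors' Mathematica/Excel implementation.

```python
# Hedged sketch of buffering, memory decay and frame-by-frame matching. Parameter
# values follow the text; data structures and the scan strategy are assumptions.
import numpy as np

DECAY_HALF_LIFE_S = 5.0     # activation halves every 5 seconds
MATCH_THRESHOLD_DB = 2.0    # mean per-band distance below this counts as a match
FRAME_S = 0.005             # 5 ms slices

def decay(activation: float, elapsed_s: float) -> float:
    """Exponential memory decay: 50% reduction per 5-second period."""
    return activation * 0.5 ** (elapsed_s / DECAY_HALF_LIFE_S)

def find_recurrent_patterns(new_buf: np.ndarray, old_buf: np.ndarray, min_frames: int = 20):
    """Slide a window over both buffers and collect stretches that match well."""
    candidates = []
    for offset in range(old_buf.shape[0] - min_frames + 1):
        for start in range(new_buf.shape[0] - min_frames + 1):
            a = old_buf[offset:offset + min_frames]
            b = new_buf[start:start + min_frames]
            dist = np.mean(np.abs(a - b))          # city-block distance per frame pair
            if dist < MATCH_THRESHOLD_DB:
                candidates.append((offset, start, float(dist)))
    return candidates

# Invented buffers: 100 frames x 21 bands each, sharing one 30-frame chunk.
rng = np.random.default_rng(0)
shared = rng.uniform(0, 40, size=(30, 21))
old = np.vstack([rng.uniform(0, 40, size=(70, 21)), shared])
new = np.vstack([shared, rng.uniform(0, 40, size=(70, 21))])
print(len(find_recurrent_patterns(new, old)))      # > 0: the shared chunk is detected
print(decay(1.0, 10.0))                            # activation after 10 s of decay
```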

When the segments are stored in memory, they are also associated with whatever contextual information that might have been available in the input material. A variant of this has been done by Rumelheart and McClelland (1994). By mimicking the general trends seen in the input, their Parallel Distributed Processing (PDP) model learns past tenses of English verbs solely on the basis of statistical regularities. With this learning model they want to show that rules are all right for describing a phenomenon but do not necessarily have anything to do with the underlying processes of that phenomenon. This idea could be illustrated by the regular patterns of a beehive that is nothing more than an emergent result of wax that gets compressed (Bates, 1979). The regularities of the beehive can be described in terms of rules but the mechanisms that produce it do not account for these rules. Rumelheart and McClelland chose to investigate verbs since it is usually claimed that children cannot learn their inflections if they have not acquired the rules of regular and irregular verbs. Their PDP model takes no notice of this; instead it stores all kinds of verbs in a memory that reinforces similar patterns. After a certain amount of input verbs the system finds regularities in the structure of the verbs and generalises them to novel verbs. There are no rules that determine the correct inflection but statistical

8 The specific value of the decay factor is tentative. It is intended to capture the fact that infants may be able to remember the acoustic detail of utterances heard a short while ago, but that such a memory fades with time. An empirical basis for a more realistic decay time has yet to be established.
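Expressed as a formula, this tentative assumption means that a buffer stored with initial activation A_0 retains, after t seconds,

    A(t) = A_0 \cdot 2^{-t/5}

so that, for example, only a quarter of the original activation remains after 10 seconds.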


In their view, children do not have to discover the rules that could describe a language in order to master it. We agree with their reasoning and adopt this approach in our own tentative model. Furthermore, by including a multi-sensory representation of the input, we could capture the emergence of syntactic structure in the speech material. Indeed, the buffering procedure described above can equally well be applied to any other sensory information that is available to the system and synchronous with the acoustic signal, allowing the underlying linguistic structure to be exposed through the interwoven associations of the multi-sensory information.

3.2. Model Flowchart

Figure 1. The theoretical design of the model.

Read input (simulating peripheral hearing)
• input is read chronologically and continuously
• input is filtered into 21 critical frequency bands (1 Bark/band)
• intensity below the hearing threshold within each critical band is wiped out
• silence is defined as energy under 25 dBA in all bands
• silence (of at least 0.5 sec) delimits sound chunks

Auditory similarity (in working memory buffer)
• (internal analysis of distances – stable sound segments)
• stored sounds are matched against new input
• analysis of distances between auditory (spectrographic) patterns
• similar auditory patterns (under pre-defined thresholds) are stored in memory

Memory decay
• exponential decay affects memory representations

Storage and structuring (in long-term memory; activity during silence)
• scanning and updating of stored sounds
• stored sounds are tagged using synchronous (multidimensional) information from the input

Markov models (morphotactics/syntax)
• combinatorial possibilities and probabilities of stored sounds (illustrated by the sketch following this flowchart)
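The Markov-model block is not yet implemented; the Python sketch below (with hypothetical unit labels of our own) merely illustrates the kind of combinatorial statistics it is meant to collect, i.e. how often one stored sound is followed by another:

    from collections import Counter, defaultdict

    def transition_probabilities(sequences):
        """Estimate P(next_unit | unit) from observed sequences of stored sound units."""
        counts = defaultdict(Counter)
        for seq in sequences:
            for current, nxt in zip(seq, seq[1:]):
                counts[current][nxt] += 1
        return {u: {v: n / sum(c.values()) for v, n in c.items()} for u, c in counts.items()}

    # Hypothetical unit sequences recovered from the evaluation examples:
    # "o" tends to precede "niko", which is the kind of regularity this block should expose.
    probs = transition_probabilities([["este", "o", "niko"], ["eumlind", "o", "niko"]])
    print(probs["o"])   # {'niko': 1.0}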


4. Evaluation

To illustrate how the model works we present two examples of input signals: esteoniko (figure 2a) and eumlindoniko (figure 2b). Using the city-block distance metric to measure the similarity between figure 2a and figure 2b, we obtain the dB-calibrated distance displayed in figures 3 and 4. A match is indicated by a dB value below a calibrated threshold, which we define here as 2 dB. Figure 4 also illustrates how a best match can be obtained by sliding the reference pattern over the auditory spectrogram of figure 2b. That is, figure 4 shows a generalization of the distance function over shifts in both time and frequency. The y-axis shows the relative shift between the frequency scales of the stored and the incoming signal, the x-axis represents the time coordinate (frames), and the darkness of the shaded areas indicates the degree of similarity. Figure 4 indicates that the best match between the stored and the incoming signal is found from about frame 170 to the end of the incoming signal (figure 2b), against the last 100 frames of the stored signal (figure 2a). No frequency shift is needed to obtain this best match. This is a perfectly reasonable result, detecting the similarity between the sound sequence oniko shared by the input signals esteoniko and eumlindoniko.
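A Python sketch of this matching step (the actual implementation is in Mathematica; the 2 dB threshold, the 100-frame window and the city-block metric come from the text, while the function names are ours and we assume the calibrated distance is the mean absolute band-by-band difference):

    import numpy as np

    MATCH_THRESHOLD_DB = 2.0   # a mean city-block distance below this counts as a match

    def cityblock_distance(stored, incoming):
        """Mean absolute dB difference between two equally sized band-by-frame patterns."""
        return np.mean(np.abs(stored - incoming))

    def best_match(stored, incoming, window=100, max_shift=2):
        """Slide the last `window` frames of the stored pattern over the incoming pattern,
        also allowing the frequency scale to shift by a few bands, and return the best fit."""
        best = (np.inf, None, None)          # (distance, start frame in incoming, band shift)
        ref = stored[:, -window:]            # e.g. the last 100 frames of the stored signal
        for shift in range(-max_shift, max_shift + 1):
            shifted = np.roll(incoming, shift, axis=0)   # simplification: roll wraps around
            for start in range(incoming.shape[1] - window + 1):
                d = cityblock_distance(ref, shifted[:, start:start + window])
                if d < best[0]:
                    best = (d, start, shift)
        return best                          # a match if best[0] < MATCH_THRESHOLD_DB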

Figure 2a. Auditory spectrogram of the stored signal esteoniko (x-axis: time frames, 0-400; y-axis: critical bands, Bark).

Figure 2b. Auditory spectrogram of the incoming signal eumlindoniko (x-axis: time frames, 0-400; y-axis: critical bands, Bark).


Figure 3. For every time frame in figure 2b we calculate the distance to every frame in figure 2a, scanning 100 time frames forward from the current point. The best match is obtained at the end of 2a (the stored signal) and at the end of 2b (the incoming signal): time frame 170 in the incoming signal, looking 100 time frames forward, has the shortest distance to the last 100 time frames in the stored signal. The match is the sound sequence oniko.

Figure 4. The signals can also be shifted up and down in frequency, sliding one over the other in order to find a match (for example comparing band 14 in figure 2a to band 15 in figure 2b, and so on). At frame 170 we obtain the best-matching sound sequence, oniko. Y-axis: frequency shift in bands (0 = no shift). X-axis: time frames in the incoming signal. Shading: darker = more similar.

The matched sound sequence oniko is stored in the memory buffer, but when it is compared with the input signal chamaseniko the article o in oniko will not be matched, since it is missing in the incoming signal. This time the sound sequences o, niko and chamase will be stored in memory.

One more example uses the input signals eonosamigoniko and eonosamigobeto. The beginnings of the two sound sequences are a good match, eonos (figure 5), and when the frequencies of the incoming signal eonosamigobeto are shifted +1 Bark we get another match, amigo (figure 6). The endings of the signals, niko and beto, are a mismatch (except for the aspirated o in the endings, but at this point we take no interest in such short stretches of the incoming signals). This procedure can be applied over and over again to new signals. Increasing the amount of input tends to lead to more distinct matches, since the probability of spurious matches decreases as the information in memory becomes more structured.
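The bookkeeping that turns such matches into stored candidates can be pictured as splitting each utterance at the matched span. The sketch below (Python; a helper of our own operating on symbolic strings rather than real spectrograms) illustrates the idea:

    def split_by_match(utterance, match_start, match_end):
        """Return the unmatched left part, the matched span and the unmatched right part
        of an utterance as candidate units, dropping any empty pieces."""
        left = utterance[:match_start]
        matched = utterance[match_start:match_end]
        right = utterance[match_end:]
        return [part for part in (left, matched, right) if part]

    # 'oniko' matched at the end of 'esteoniko' yields the candidates 'este' and 'oniko';
    # a later match covering only 'niko' would further split 'oniko' into 'o' and 'niko'.
    print(split_by_match("esteoniko", 4, 9))   # ['este', 'oniko']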


Figure 5. Match eonos. Figure 6. Match amigo.

5. Discussion

The implicit linguistic structure in language can be derived from the speech signal without any previous knowledge. This is what we believe we have illustrated in this paper. We do not explicitly search for linguistic structure; the model converges on linguistic structure because it is present in the input, and the structure of the language emerges when the model, unintentionally, observes and stores recurrent auditory patterns. The model can find units of speech and their relations to each other. It would be possible to feed our model with any input and eventually obtain a lexicon and a grammar representative of that input. There would be little point in doing so, however, without also adding substance in the form of meaning: linguistic structure can only serve its purpose as long as the speech signal can be related to something else, supplied by other sensory input. It is our intention to find means of representing other sensory input in future development of the model. This would not only give a more realistic picture of early language acquisition; it would also speed up the computational processing.

Our simple examples suggest that the nature of the input may be of great importance for developing linguistic representations. As discussed earlier, IDS is characterized by modulated pitch and repetitions, and very often the target word (for example the name of an object) is in focus. Adults also tend to adjust their communicative target to the immediate surroundings of the infant; that is, if the infant does not look at the object in question, the adult will talk about whatever object the infant is focusing on. When interacting with infants, the primary intention is not really to exchange information (as is often the case in adult communication); it is rather a referential act of communication. By adding other sensory input to the auditory signal, for example showing the infant the object being talked about, the referential act is likely to further enhance the linguistic structure of the infant's ambient language. The probability that a particular visual stimulus and a particular auditory stimulus co-occur twice just by chance is quite small, considering the enormous mass of visual, tactile and auditory information the infant is exposed to. Having access to all kinds of sensory input within this vast search space therefore increases the likelihood that a recurring sound and, for example, an object are matched because they genuinely belong together, and this will speed up the process of making word-object connections.

In the further development of the model, other sensory input could perhaps most fittingly be put into practice by adding visual information, such as a film sequence illustrating what is being said.


Since the motivation to communicate is one of the main reasons for language to arise in the first place, and feedback from the caretaker is one reason for language development to take place, this has to be implemented in the model as well. It could be done through a reward system that sets in when the model makes a suitable match, although this needs to be done with caution, since in real life the caretaker tends to reward any kind of response from the infant. Also, for our model to reach the stage of syntactic development, the amount of input needs to be increased; this is probably the easiest way to improve the accuracy of the model. It is important to remember that the accuracy of the model does not stand in any direct relation to correct segmentation of words, but rather to how closely it corresponds to the acquisition process in the child. And since we know that no child acquires a language without making mistakes, we should not aim for a model that is free from mistakes, as long as the mistakes are "meaningful". At the very first stage the model makes the expected mistake of treating the indefinite article o as belonging to the word niko, much as an infant is expected to do. As illustrated in the evaluation section of this paper, the model later learns to separate the two words. This can be seen as the very first sign of syntactic development and is in line with Minsky's reasoning about intelligent ways of solving problems: to break problems down into smaller parts and to make descriptions based on parts and relations. Furthermore, it supports the reasoning of Bates and Goodman (1999) that grammatical structure arises naturally as the child tries to find a way to handle the information flow while accumulating more and more words. When the model is fed with several representations containing the word niko, the grammatical structure arises naturally: the indefinite article is separated from niko and the model knows it can occur in front of the word niko. The model of Rumelhart and McClelland (1994), which learns past tenses of English verbs on the basis of statistical regularities, also supports the idea that grammatical structure need not be seen as a separate component of rule-based knowledge but can instead be embedded in the internal organization of the input. The nativist perspective, which views the acquisition of grammatical structure as a module separate from the acquisition of words, then seems less plausible. In the further development of our model, we hope to be able to tap into the emergence of syntactic structure as the amount and nature of the input is extended.

Other researchers share our point of view that human intelligence should not be seen as modules of built-in knowledge, but as something we develop through interaction with our surroundings. We can for instance mention the Artificial Intelligence Laboratory at MIT, where two humanoid robots are being "raised" with a focus on social interaction. One of the humanoids is built with a physical body, perception and motor control; the motivation is that a robot that is to develop human-like intelligence must have a human-like body in order to develop similar representations, and leaving those things out would deprive us of insights into the nature of human intelligence. Another important, though little investigated, aspect is the motivation for this social interaction. Inspired by infant social development, psychology and evolution, the other humanoid is being raised to intuitively learn social cues from the visual and auditory channels in parent-infant interaction. This reasoning, we believe, is also crucial for language acquisition, whether it concerns infants or computational systems. These humanoids are still mechanical robots rather than human-like hybrids, but they are important contributions on the way to insights into human skills.


5.1. Conclusion

Processing speech is obviously not a problem for us humans. At the beginning of this study, we asked ourselves what it is that we need to know in order to make systems process speech more successfully. Do we need to know more about language, or is it a matter of acquiring better programming skills? We can now attempt to answer these questions by saying that no, we do not need to know more about language; we need to know more about how we ourselves process speech and about the way we structure information in our brains. We have suggested that we should start from the beginning, looking at how it is done in infancy. And if, as we believe, infants process speech without any pre-programmed knowledge, we need to bear that in mind when designing speech systems. The programming needs to mimic our biological constraints and social motivation rather than our adult knowledge of language and rules.


6. References

Aslin, R. N., Jusczyk, P. W. and Pisoni, D. B. (1998). Speech and auditory processing during infancy: Constraints on and precursors to language. In Kuhn, D. and Siegler, R. (Eds.), Handbook of Child Psychology: Cognition, Perception, and Language, vol. 2. New York: Wiley.

Bates, E. (1979). Emergence of Symbols. New York: Academic Press.

Bates, E. and Goodman, J. C. (1999). On the Emergence of Grammar From the Lexicon. In MacWhinney, B. (Ed.), The Emergence of Language. Lawrence Erlbaum Associates, London.

Batliner, A., Möbius, B., Möhler, G., Schweitzer, A. and Nöth, E. (2001). Prosodic models, automatic speech understanding and speech synthesis: towards the common ground. Eurospeech 2001 – Scandinavia.

Best, C. T. (1995). Learning to perceive the sound patterns of English. In Rovee-Collier, C. and Lipsitt, L. P. (Eds.), Advances in Infancy Research. Norwood, NJ: Ablex.

Boysson-Bardies, B., Vihman, M. M., Roug-Hellichius, L., Durand, C., Landberg, I. and Arao, F. (1992). Material evidence of infant selection from the target language: A cross-linguistic phonetic study. In Ferguson, Menn and Stoel-Gammon (Eds.), Phonological Development: Models, Research, Implications. York Press, Timonium.

Brent, M. R. (1999). Speech segmentation and word discovery: A computational perspective. Trends in Cognitive Sciences, 3(8).

Carlson, R. and Granström, B. (1982). Towards an auditory spectrograph. In Carlson, R. and Granström, B. (Eds.), The Representation of Speech in the Peripheral Auditory System, pp. 109-114. Amsterdam: Elsevier Biomedical Press.

Chomsky, N. (1988). Language and Problems of Knowledge. Cambridge, MA: MIT Press.

Chomsky, N. (1972). Language and Mind. New York: Harcourt.

Crystal, D. (1975). The English Tone of Voice: Essays in Intonation, Prosody and Paralanguage. London: Arnold.

Elman, J. L. (1999). The Emergence of Language: A Conspiracy Theory. In MacWhinney, B. (Ed.), The Emergence of Language. Lawrence Erlbaum Associates, London.

Elman, J. L. (1993). Learning and development in neural networks: The importance of starting small. Cognition, 48, 71-99.

Engstrand, O., Williams, K. and Lacerda, F. (2003). Does Babbling Sound Native? Listener Responses to Vocalizations Produced by Swedish and American 12- and 18-Month-Olds. Phonetica, 60, 17-44.

Fernald, A. (1994). Vocalizations to infants as biologically relevant signals. In Bloom, P. (Ed.), Language Acquisition. Cambridge, MA: MIT Press.

Fernald, A. (1984). The perceptual and affective salience of mother's speech to infants. In Feagan, L., Garvey, C. and Golinkoff, R. (Eds.), The Origins and Growth of Communication. Norwood, NJ: Ablex, 5-29.


Fernald, A. and Mazzie, C. (1991). Prosody and focus in speech to infants and adults. Developmental Psychology, 27, 209-21.

Fernald, A. and McRoberts, G. (1991). Prosody and Early Lexical Comprehension. Society for Research in Child Development, Seattle. (Reported in Bloom, P. (Ed.), Language Acquisition. Cambridge, MA: MIT Press, 1994.)

Gelfand, S. A. (1998). Hearing – An Introduction to Psychological and Physiological Acoustics. Marcel Dekker, Inc., New York.

Gustafsson, L. and Lacerda, F. (2002). Assessing F0 patterns in infant-directed speech: A tentative stochastic model. In TMH-QPSR Vol. 44, Fonetik 2002, Stockholm.

Guttentag, R. (1995). Children's associative learning: Automatic and deliberate encoding of meaningful associations. American Journal of Psychology, 108, 99-114.

Horn, J. L. and Cattell, R. C. (1966). Refinement and test of the theory of fluid and crystallized general intelligences. Journal of Educational Psychology, 57. 253-270.

Humanoid Robotics Group at MIT. http://www.ai.mit.edu/projects/humanoid-robotics-group

Jusczyk, P. W., Hirsh-Pasek, K., Kemler-Nelson, D. G., Kennedy, L. J., Woodward, A. and Piwoz, J. (1992). Perception of acoustic correlates of major phrasal units by young infants. Cognitive Psychology, 24, 252-293.

Jusczyk, P. W. and Aslin, R. N. (1995). Infants' detection of the sound patterns of words in fluent speech. Cognitive Psychology, 29, 1-23.

Jusczyk, P. W. (1997). The Discovery of Spoken Language. MIT Press, Cambridge.

Jusczyk, P. W. (2001). In the beginning, was the word… In Lacerda, F., von Hofsten, C. and Heimann, M. (Eds.), Emerging Cognitive Abilities in Early Infancy. Lawrence Erlbaum Associates, Mahwah, New Jersey.

Koponen, E., Gustafsson, L. and Lacerda, F. (2003). Effects of Linguistic Variance on Sound-Meaning-Connections in Early Stages of Language Acquisition. Proceedings of the 15th International Congress of Phonetic Sciences, Barcelona.

Kuhl, P. K. and Miller, J. D. (1982). Discrimination of auditory target dimensions in the presence or absence of variation in a second dimension by infants. Perception & Psychophysics, 31, 279-292.

Lacerda, F. and Lindblom, B. (1996). Modeling the early stages of language acquisition. In Strömqvist, S. (Ed.), COST-96.

Lacerda, F. and Sundberg, U. (2001). Emergent fonologi: Experimentella studier av spädbarnets talperceptionsutveckling ur ett ekologiskt perspektiv [Emergent phonology: Experimental studies of the infant's development of speech perception from an ecological perspective]. Project plan, PK.

Lacerda, F. (2003). Modelling Interactive Language Learning. Project plan, RJ.

Liberman, A. M., Cooper, F. S., Shankweiler, D. P. and Studdert-Kennedy, M. G. (1967). Perception of the speech code. Psychological Review, 74, 431-461.


Locke, J. L. (1983). Phonological acquisition and change. New York: Academic Press.

Mattys, S. L. and Jusczyk, P. W. (2001) Phonotactic Cues for Segmentation of Fluent Speech by Infants. Cognition, 78, 91-121.

Minsky, M. (1985a). Why Intelligent Aliens will be Intelligible. In Regis, E. JR. Extraterrestrials: Science and Alien Intelligence. Cambridge University Press.

Minsky, M. (1985b). Why People Think Computers Can’t. In Donnelly (Ed.) The Computer Culture. Associated Univ. Presses, Cranbury NJ.

Nowak, M. A., Plotkin, J. B. and Jansen, V. A. A. (2000). The evolution of syntactic communication. Nature, 404, 495-498.

Pinker, S. (1994a). The language instinct: How the mind creates language. New York: Morrow.

Pinker, S. (1994b). Rules of language. In Bloom, P. (Ed.), Language Acquisition. Cambridge, MA: MIT Press.

Polka, L. and Werker, J. F. (1994). Developmental changes in perception of non-native vowel contrasts. Journal of Experimental Psychology: Human Perception and Performance, 20.

Riederer, K. A. J. (2000). Large Vocabulary Continuous Speech Recognition. www.lce.hut.fi/teaching/S-114.240/k2000/reports/17_riederer_final.pdf

Rumelhart, D. E. and McClelland, J. L. (1994). On learning the past tenses of English verbs. In Bloom, P. (Ed.), Language Acquisition. Cambridge, MA: MIT Press.

Stevens, S. S. (1955). The measurement of loudness. Journal of the Acoustical Society of America, 27, 815-829.

Sundberg, U. (1998). Mother tongue - Phonetic aspects of infant-directed speech. PhD Dissertation in Phonetics, Stockholm University.

Sur, M., Angelucci, A. and Sharma, J. (1999). Rewiring cortex: The role of patterned activity in development and plasticity of neocortical circuits. Journal of Neurobiology, 41, 33-43.

Svärd, N., Nehme, P. and Lacerda, F. (2002). Obtaining Linguistic Structure in Continuous Speech. In TMH-QPSR Vol. 44. Fonetik 2002, Stockholm.

Vihman, M. M. (1996). Phonological Development. Blackwell Publishers Inc. Cambridge.

von Melchner, L., Pallas, S. L. and Sur, M. (2000). Visual behavior induced by retinal projections directed to the auditory pathway. Nature, 404, 871-875.

Warren, R. M. (1982). Auditory Perception: A New Synthesis. Pergamon General Psychology series. Pergamon Press Inc, New York.

Weng, J. and Zhang, Y. (2002). Developmental Robots: A New Paradigm. An invited paper in Proc. of Second International Workshop on Epigenetic Robotics: Modeling Cognitive Development in Robotic Systems. Edinburgh, Scotland, August 10 - 11, 2002

Weng, J. and Zhang, Y. (2001). Grounded auditory development by a developmental robot. In Proc. INNS-IEEE International Joint Conference on Neural Networks, 1059-1064. Washington, DC.

Werner, L. (1992). Interpreting developmental psychoacoustics. In Werner, L. and Rubel, E. (Eds.), Developmental Psychoacoustics, pp. 47-88. Washington, DC: American Psychological Association.
