Cortical speech processing unplugged: a timely subcortico-cortical framework



Sonja A. Kotz and Michael Schwartze

Max Planck Institute for Human Cognitive and Brain Sciences, IRG “Neurocognition of Rhythm in Communication”, Stephanstrasse 1a, 04103 Leipzig, Germany

Opinion

Corresponding author: Kotz, S.A. ([email protected]).

1364-6613/$ – see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.tics.2010.06.005 Trends in Cognitive Sciences 14 (2010) 392–399

Speech is inherently tied to time. This fundamental quality has long been deemed secondary, and has consequently not received appropriate recognition in speech processing models. We develop an integrative speech processing framework by synthesizing evolutionary, anatomical and neurofunctional concepts of auditory, temporal and speech processing. These processes converge in a network that extends cortical speech processing systems with cortical and subcortical systems associated with motor control. This subcortico-cortical multifunctional network is based on temporal processing and predictive coding of events to optimize interactions between the organism and the environment. The framework we outline provides a novel perspective on speech processing and has implications for future studies on learning, proficient use, and developmental and acquired disorders of speech production and perception.

Glossary

Oscillation: An oscillation describes the unfolding of repeating events in terms of frequency, i.e. the number of events repeated in a specific amount of time.

Serial order: Succession of events in the temporal dimension. Serial order precludes simultaneity and is thus not identical to temporal structure.

Speech event: Speech events manifest as a set of linguistic and paralinguistic categories (e.g. phoneme, syllable, word, voice, stress or phrase) with partly overlapping borders that can be combined or decomposed into other speech events.

Speech processing: Analogous to temporal processing, the term speech processing comprises all mechanisms involved in the perception and production of verbal expressions.

Synchronization: Temporal alignment of two or more oscillations. Synchronization is achieved when at least one oscillation adjusts its phase and/or period to match that of another.

Temporal processing: The neurocognitive mechanisms that underlie the encoding, decoding and evaluation of temporal structure in perception and production. Temporal processing refers exclusively to duration and temporal relations but not to the formal aspect of information.

Temporal structure: Arrangement of events in the temporal dimension. An event in the temporal dimension originates from a contrast between static and dynamic changes that result in a subdivision of time. Temporal structure can be characterized in terms of categories such as duration, order, tempo or regularity of events. Duration describes an implicit property of the signal, whereas the latter categories describe temporal relations and can be considered explicit temporal information.

The temporal nature of speech

Speech essentially conveys patterns of energy distributed over time. However, the temporal nature of speech more or less vanished from linguistics in the wake of structuralist and generative theories of language. This separation renders language a phenomenon independent of temporal and contextual variation. However, neurofunctional data do not consistently support such separation. In the following, we argue that the temporal nature of speech needs to be reappraised to develop a naturalistic model of brain–language function [1].

The speech signal constitutes a rich source of information that is mirrored by sensitive mechanisms for temporal and spectral integration in hearing [2]. The auditory periphery ensures that central processing systems have access to a detailed representation of the acoustic signal. To achieve the main purpose of speech perception (the inference of meaning) and in view of contextual, physiological and temporal variability, it is plausible that speech perception makes immediate and opportunistic use of all information sources available: from sound characteristics to syntax and pragmatics. This perspective implies that nonlinguistic and linguistic processes interact to facilitate this objective. For example, temporal processing mechanisms (i.e. mechanisms underlying the explicit encoding, decoding and evaluation of temporal information) need to be involved in the interpretation of the temporal structure of speech.

Timing is fundamental to efficient behavior and originates in evolutionarily primitive brain structures such as the cerebellum (CE) and the basal ganglia (BG) [3]. During speech acquisition their capacity can be used to establish basic routines that advance more sophisticated behavior. Once these routines are acquired, the BG contribution can be reduced to a supplementary and corrective function, whereas the CE remains actively engaged in the computation of sensory information [4].

In this article we argue that temporal structure is used to support well-studied fronto-temporal speech processing networks that therefore need to be extended by temporal processing systems. We propose that a subcortico-cortical network that includes the CE and BG is engaged in constant attempts to detect temporal regularities in sensory input and to predict the future course of events to optimize cognitive and behavioral performance, including speech processing.

Speech constitutes events in time

Unlike spatial information, acoustic information is entirely dependent on time. Time and temporal structure are coupled to change. In speech, changes generate events – such as vowels, stressed or unstressed syllables, and


Figure 1. Visualizations of the utterance “Time must flow for sound to exist” [5] in the form of a waveform (top) and a spectrogram (bottom), created using the software PRAAT (developed by Paul Boersma and David Weenink of the Institute of Phonetic Sciences of the University of Amsterdam, http://www.praat.org/). This utterance can be decomposed into smaller speech events and categories such as syllables, phonemes or vowels. However, as both visualizations illustrate, neither events nor categories are discrete due to coarticulation, i.e. the anterograde or retrograde effect of continuously moving articulators on the acoustic signal.

Opinion Trends in Cognitive Sciences Vol.14 No.9

phrases – that evolve in time. Depending on the level of analysis, speech events comprise different categories and hierarchies (Figure 1).

Speech events can be combined or decomposed into shorter and longer events (e.g. the onset of a vowel, consonants within a syllable, or words in a phrase). However, the arrangement of these events is not incidental. Concatenated events follow an order that converges into specific speech patterns. The order of events in a pattern can be strictly sequential, with one event determining an immediately following event, and/or hierarchical, with one event determining another in the presence of intervening events. However, patterns in speech also imply serial order. Lashley [6] remarked that serial order in behavior relies on the transition of spatially distributed memory representations into temporal sequence. This operation can be attributed to syntax. In the broadest sense, syntax “denotes the organization of any combinatorial system in the mind” ([7] p. 276). Syntax can thus be defined as a “set of principles governing the combination of discrete structural elements into sequences” [8].

Another important ordering principle is recurrence. Similar to serial order, recurrence allows the generation of predictions about when specific events are likely to occur; thereby temporal regularity can facilitate speech and cognitive processing. In its simplest form, recurrence is periodic and can be depicted as an oscillation whose period reflects the temporal relation of successive events. Notably, this form of oscillation can complement gamma and theta band oscillations that constitute an important component of speech processing [9,10]. However, recognition of relational information, such as temporal regularity, necessitates an explicit internal representation of temporal structure generated by the CE and BG temporal processing systems. We propose that speech processing exploits both: temporal structure even if it is not regular, but also temporal regularity to optimize comprehension. This requires an early and fast interaction of auditory and temporal processing in a neurofunctional network that comprises overlapping neural correlates of auditory and temporal processing (Box 1). This approach also provides a more general perspective on the interaction between an


Box 1. Early auditory input to the temporal processing network

A recent activation-likelihood-estimation meta-analysis [4] showed that several regions of the CE respond to auditory stimulation. Earlier tracer studies and unit recordings in animals [11–13] identified a neural pathway for rapid auditory transmission [14,15] between the CE and the cochlear nuclei, where the fibers of the auditory nerve terminate. This pathway can transmit auditory information to the cerebellar temporal processing system. In turn, the nonmotor part of the cerebellar dentate [16], one of the primary output nuclei to the thalamus, projects to the frontal cortex [17] (preSMA in monkeys) that then connects to the BG.

Huang and Liu [18] assume that the CE serves as an interface between the auditory and motor systems, possibly initiating tracking behavior. Although the cerebellar neurons seem unfit to process detailed frequency, intensity or duration information, they display special sensitivity to temporal and intensity differences [19], functions that are both important to signal when an event occurs and to track temporal structure.

Click-evoked responses in the frontal monkey cortex after complete removal or destruction of the temporal lobe, the CE and the medial thalamus raise the question whether there are direct projections to the frontal cortex from the thalamus [20]. The suprageniculate nucleus (SG) is one possible candidate for such transmission. The SG is responsive to auditory stimulation [21] and displays fine temporal tuning characteristics [22]. After injections into the prefrontal cortex, Kobler et al. [23] found labeled cells in the SG of bats, whereas Kurokawa et al. [24] found labeled terminals in the frontal cortex and auditory areas of the temporal cortex after injections into the SG. Reciprocal connections were identified with injections into both the SG and the fastigial nucleus of the CE [25] as well as the superior colliculus [26]. Connections from the SG to the frontal cortex consist of separate neuronal groups of different sizes and shapes [27], with Fr2, the target location in the frontal cortex, corresponding to monkey SMA. Thus, the SG could constitute a relay between the frontal and the cerebellar cortex that is connected to several cortical auditory fields [28]. These are connected to the pontine nuclei, indicating cerebellar involvement in a circular architecture. The CE projects via dentato-thalamo-cortical pathways to areas from which it receives input via the cortico-ponto-cerebellar pathways [29]. These pathways form a link between cerebellar, temporal and fronto-striatal circuitry in which the thalamus plays a pivotal role in earlier and later processing stages.


ever-changing external environment and an equally dynamic cognitive environment.
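The claim that temporal regularity enables predictions about when events will occur can be made concrete with a toy sketch. This is purely illustrative and not part of the authors' framework: the function name, the mean-interval estimator and the example onset values are all hypothetical choices, standing in for whatever computation the CE/BG system actually performs.

```python
# Toy sketch: detect temporal regularity in a train of event onsets and
# predict when the next event should occur. Illustrative only; the
# mean-interval estimator is an arbitrary stand-in for neural computation.

def predict_next_onset(onsets):
    """Estimate the mean inter-onset interval and extrapolate one step ahead."""
    if len(onsets) < 2:
        raise ValueError("need at least two onsets to estimate a period")
    intervals = [b - a for a, b in zip(onsets, onsets[1:])]
    period = sum(intervals) / len(intervals)  # estimated oscillation period
    return onsets[-1] + period

# Quasi-periodic syllable onsets in seconds, roughly at a 4 Hz syllabic rate
onsets = [0.00, 0.26, 0.49, 0.76, 1.00]
print(round(predict_next_onset(onsets), 2))  # → 1.25
```

The point of the sketch is that even an irregular sequence supports a graded expectancy: the closer the input is to periodic, the sharper the prediction of the next onset becomes.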

A subcortico-cortical framework of speech perception and production

If auditory processing is indeed coupled with temporal processing, then auditory and temporal processing systems need to interface to create a representation of auditory temporal structure, the backbone of speech. There is some consensus that classical motor systems are involved in temporal processing. Hence, in speech perception we distinguish two parallel auditory pathways: (i) the preattentive encoding of event-based temporal structure in the CE, which forwards temporal information to the frontal cortex via the thalamus, and (ii) the retrieval of memory representations in temporal cortex, which are projected to frontal cortex. The presupplementary motor area (preSMA) binds temporal structure [30] and receives information from the CE [17]. It further transmits information to the dorsolateral prefrontal cortex (DLPFC), which then integrates memory representations and temporal information to optimize comprehension. Additionally, the attention-dependent BG temporal processing system and its thalamo-cortical connections continuously evaluate temporal relations and support the extraction of temporal regularity. It also engages in reanalysis and resequencing of information when the temporal structure of a stimulus is unfamiliar or incongruent (Figure 2A).

In the planning of speech production, the preSMA and the BG in concert with the CE serve as a pacemaker that provides basic temporal structure. An interplay of SMA-proper, premotor and primary motor cortex then utilizes this temporal structure to guide articulation (Figure 2B). This structural dissociation of the SMA is in line with evidence for preSMA involvement in word selection, encoding of word form and the control of syllable sequencing, whereas the SMA-proper supports overt articulation [31].

The preSMA connects to the rostral striatum and to the superior/inferior frontal gyrus, whereas the SMA-proper is connected to the caudal putamen, the precentral gyrus and the corticospinal tract [32–34]. Similar circuitry forms the


basis of temporal processing in the BG that depends on ensembles of cortical oscillations conveying a signature of temporal structure to the BG [3]. The CE in turn is functionally connected to the dorsolateral, medial and anterior prefrontal cortex [35], and via the thalamus to the SMA. Evidently, subregions of these classical motor areas (e.g. the nonmotor part of the dentate [16], the rostral striatum or the preSMA) primarily engage in perception, whereas other subregions (e.g. the motor part of the dentate, the caudal putamen and the SMA-proper) are involved in production. Crucially, the thalamus mediates information flow in this framework (Box 2).

The most basic function of the motor system (to modify body posture to produce proactive and reactive movements) improves with precise timing. Moreover, the mutual influence of motor and cognitive processes such as temporal processing could represent one of the driving forces in the development of sophisticated motor and cognitive skills such as speech processing.

On the origin and development of speech processing capacities

Ultimately, functional differentiation goes hand in hand with structural change. Converging morphological evidence supports the view that the motor system has reconfigured to meet the challenges posed by developing communicative and cognitive skills [41]. Simultaneous enlargement of the lateral CE and the frontal cortex as well as the formation of a cerebello-cortical loop reflect developing speech processing capacity [42]. The lateral CE engages in cognitive tasks, and its increased size in hominids is most likely to be accompanied by a similar increase in the brain's information processing capacity [43,44]. Contralateral connections from the CE to the cortex and from the cortex to the CE substantiate speculations about lateralization at the subcortical and cortical level, as evidenced by functional temporal asymmetry in both hemispheres (Box 3).

These reciprocal connections establish a cerebello-thalamo-cortical circuit that is comparable to cortico-striato-thalamo-cortical circuits. Together they provide a powerful


Figure 2. (a) A framework for speech perception. In speech perception, auditory information is transmitted to the auditory cortex via the thalamus (a, blue) and to the CE (b), where the temporal relationship between successive events is encoded (1) and transmitted to the frontal cortex (red). The seminal AST model of speech perception [50] accounts for differences in temporal sensitivity (L = left; R = right; orange letters reflect short temporal windows of integration; blue letters reflect longer windows of integration). Auditory information is mapped onto memory representations (3a) that are transmitted to the frontal cortex (c) to be integrated with temporal event structure (4) that is conveyed via a cerebello-thalamo-preSMA (3b) pathway (red). Temporal information is transmitted to the BG (5) via connections from the preSMA and from the frontal cortex (d). The BG evaluate temporal relations and transmit this information back to the cortex (6) via the thalamus. The CE and the thalamus also provide direct input to the BG (f), thereby possibly modulating BG processing. The descending auditory pathway (g) could modulate processing in the whole network. (b) A framework for speech production. Memory representations are transmitted from the temporal (1) to the frontal cortex (a, blue), where they are mapped onto temporal event structure (b) generated by the preSMA (2) in concert with the CE and the BG (4). Furthermore, the CE (5) is involved in the temporal shaping of syllables (d). The integrated information is then used in motor control of articulation (e, green), interfacing the SMA-proper (6), premotor and primary motor cortex. The CE and the thalamus also provide direct input to the BG (f), thereby possibly modulating BG processing.


computational basis because information in these circuits can be processed rapidly and repeatedly. Moreover, progression from simple to increasingly more complex behavior necessitates additional sequencing and patterning capacity, a quality that has been attributed to the BG [55]. Although each of these brain structures has probably contributed to the evolution of speech in isolation, the combination and resulting functional differentiation of subcortical structures provides a novel perspective for speech processing. Most importantly, this differentiation extends findings on the ontogenetic and phylogenetic development and maturation of white matter fiber tracts responsible for information flow between gray matter cortical areas [56]. A prominent example is the left-hemisphere accentuated arcuate fasciculus that connects

Wernicke's and Broca's region. This white matter fiber bundle is only fully developed in humans; in chimpanzees, it is nonexistent [57]. The same can apply to macaques [57], where it is at most rudimentary and less specified [58]. The function of these white matter tracts is to convey memory representations to the DLPFC, where this formal information can be integrated with explicit temporal structure to either comprehend or produce speech sequences. In speech perception, temporal structure complements formal predictions about upcoming events in a sequence. In speech production, the preSMA/BG pacemaker generates a grid for the temporal alignment of memory representations. In a similar vein, MacNeilage and Davis [59] propose that speech evolved from a simple biomechanical mechanism (i.e. biphasic opening and closing mouth movements).


Box 2. Two thalamic firing modes

Sherman and Guillery [36] emphasize that understanding of cortical function depends on knowledge about the nature of thalamic input to the cortex. As auditory information passes through the thalamus, the question arises as to how it treats this information. Thalamic cells respond to input in either 'tonic' or 'burst' mode [36]. Tonic mode preserves input linearity, whereas burst mode is more efficient in detecting input. Consequently, thalamic cells in burst mode send a 'wake-up call' to the cortex that is evoked by sudden novel or unexpected input. Importantly, bursts follow the temporal properties of stimulation and enhance sensory event detection [37]. For example, in the visual domain, bursts occur approximately at phase zero of the oscillation underlying a periodic stimulation [36] (Figure I). Furthermore, they can signal salient input because bursts affect the postsynaptic neuron more strongly than single spikes [38].

We speculate that burst-firing marks input events characterized by salient changes at the energy level (e.g. onsets, offsets or more intense parts of an acoustic signal). For instance, in speech, these events might correspond to pauses, vowels or stress correlates. In analogy to visual processing, we consider that bursts preferentially occur at vowel onsets. Hence, thalamic bursts could transmit the temporal relation between events for subsequent cortical processing and also amplify the neural representation of these events.

Computational simulations of a bursting neuron [39] show that the biophysical mechanisms of spike generation enable individual neurons to encode different stimulus features into distinct spike patterns. However, burst timing is more precise than the timing of single spikes. Accordingly, the driving input from the cerebellar, event-based temporal processing system to the thalamus could be encoded via burst firing to forward precise temporal markers of events. In parallel, a more linear and continuous stimulus representation, delivered via the auditory pathway, is primarily encoded via tonic firing to preserve detailed spectro-temporal structure.

Burst firing is characterized by inter-spike intervals of approximately 100 ms, whereas tonic firing features intervals of around 10–30 ms. Pöppel [40] hypothesized that temporal processing proceeds in an oscillatory fashion in which sensory input registered within 30 ms is treated as co-temporal. One can speculate that at a sampling rate of approximately 30–40 Hz, perception of temporal order is constrained by thalamic 'packaging' of information in tonic mode and cortical sensitivity to these packages (e.g. the phonemes of a syllable). However, a better understanding of thalamic function is clearly necessary to model speech and temporal processing.

Figure I. Responses of lateral geniculate nucleus relay cells of cats during sinusoidal grating in either tonic (a) or burst (b) mode. Adapted from Sherman and Guillery [36]. Tonic firing preserves linearity of the input, whereas burst firing selectively encodes parts of the stimulation that correspond to changes in the energy level of a stimulus.
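The inter-spike-interval ranges quoted in Box 2 (around 100 ms for burst firing versus around 10–30 ms for tonic firing) can be operationalized in a toy classifier. This is a hypothetical sketch, not a physiological model: the function name and the 65 ms cut-off are arbitrary illustrative choices derived only from the two ranges stated in the text.

```python
# Toy sketch: label a spike train 'burst' or 'tonic' from its mean
# inter-spike interval (ISI), using the ranges quoted in Box 2.
# The 65 ms threshold is an arbitrary midpoint, not a physiological constant.

def firing_mode(spike_times_ms):
    """Classify a spike train by mean ISI: ~100 ms -> burst, ~10-30 ms -> tonic."""
    if len(spike_times_ms) < 2:
        raise ValueError("need at least two spikes to compute an interval")
    isis = [b - a for a, b in zip(spike_times_ms, spike_times_ms[1:])]
    mean_isi = sum(isis) / len(isis)
    return "burst" if mean_isi > 65 else "tonic"

print(firing_mode([0, 20, 45, 70, 90]))   # ISIs of 20-25 ms -> "tonic"
print(firing_mode([0, 100, 210, 305]))    # ISIs of ~100 ms  -> "burst"
```

The sketch only illustrates that the two modes quoted in the text occupy clearly separable interval ranges, which is what would let downstream cortical areas treat them as distinct timing channels.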


At a conceptual level, the biphasic movements of the mouth correspond to a blank 'syllabic frame' [60]. Consonant and vowel sounds (e.g. /ba/) constitute content elements that are inserted into the syllabic frame during articulation. Furthermore, the temporal structure of concatenated syllables (e.g. /bababa/) rests on input from the SMA. At this stage, sequencing capacity and precise temporal processing should be coupled to establish the serial order of frames and content elements.

Temporal structure and temporal alignment in speech processing

Which speech events convey suitable temporal structure? One way to classify the temporal structure of speech is to distinguish envelope, periodicity and fine structure levels [61]. In speech production these levels converge into a unitary acoustic signal, whereas in speech perception two aspects are fundamental: (i) the envelope conveys information about duration, rhythm, tempo and stress, and (ii) speech can be understood when all other cues besides the slowly varying temporal envelope are degraded [62]. This implies that important speech events are captured by these categories. However, the question remains


which part of the signal is used to assess them. Greenberg [63] developed a syllable-centric framework that describes the syllable as an 'energy arc' (spectro-temporal profile). Typically, the syllabic nucleus protrudes as it correlates with maximum 'oral aperture' in the articulatory gesture. The prominence of the vocalic nucleus is accompanied by a steep rise and a subsequent peak in acoustic energy because it is typically more intense than a consonant. In addition to duration, a relative increase in intensity also distinguishes rhythmically prominent from nonprominent syllables [64]. Greenberg proposes that the nucleus sets the 'spectro-temporal register' on which the rest of the syllable is mapped. This notion seems comparable to MacNeilage's term “the general-purpose carrier, which we know today as the syllable” ([60] p. 105).

Importantly, there is a related concept in speech perception. Following the notion of perceptual centres or 'p-centres' [65], Port [66] describes the vocalic nucleus as the carrier of a perceptual beat that renders this event particularly salient. Port refers to Dynamic Attending Theory (DAT) [67] by linking the saliency of the vowel to the pulse of a 'neurocognitive attentional oscillation'. DAT proposes that the allocation of attention depends

Box 4. Questions for future research

• What are the anatomical and functional commonalities and differences regarding human and animal temporal processing networks?

• What is the function of direct and reciprocal subcortico-subcortical pathways besides subcortico-cortical connections, and do these circuits interact?

• How do temporal processing and predictions generated on the basis of temporal structure relate to other formal and contextual aspects of speech (e.g. phonotactic rules or semantic priming)?

• Is the proposed temporal aspect of the framework the same in speech and music? Moreover, is it restricted to the auditory domain or is it involved across sensory and cognitive domains?

• Are there therapeutic applications (e.g. overemphasizing the predictive value in the context of pathological speech processing, or combining specific temporal patterns with movement)?

Box 3. Lateralized sensitivity to temporal structure

It is well known that the left cortical hemisphere specializes in fine temporal discrimination, whereas the right hemisphere holds information over longer periods of time [45]. In a similar vein, the language-dominant hemisphere differentiates fine temporal input, whereas the complementary hemisphere integrates information across longer time spans [46]. More specifically, at the tonal level, the core bilateral auditory cortices are sensitive to temporal and spectral variation. Spectral variation is weighted towards the right hemisphere; the left hemisphere specializes in rapid temporal processing [47]. In speech (i.e. syllables, words), both auditory cortices engage in a general temporal processing mechanism. Rapid variations in temporal sound structure are preferably processed in the left hemisphere [48], whereas the contour of the speech envelope concurs with right-hemisphere areas [49]. In speech perception, 'Asymmetric Sampling in Time' (AST) [50] ascribes a short window of integration (20–50 ms) to the left hemisphere, and a long window (150–250 ms) to the right hemisphere. For example, spontaneous power fluctuations of intrinsic oscillations in right-hemisphere regions of Heschl's gyrus correspond to the dominant syllabic rate, between 3 and 6 Hz, and to rates between 28 and 40 Hz in the left hemisphere [10]. Contributions of preceding stages of auditory processing in addition to those at the cortical level need to be considered. A similar temporal dissociation is proposed for the CE. The right CE responds more strongly to high-frequency information and speech, whereas the left CE is more sensitive to low-frequency information and singing [51]. Lateralization also impacts on auditory information processing at all stages of the auditory pathway, including the thalamus and the brain stem [52].

The bilateral auditory cortices use implicit discharge rate codes and explicit temporal codes to represent the temporal structure of auditory signals [53]. Discharge rate codes integrate stimulus features in discrete 30 ms windows that could reflect cortical sensitivity to thalamic 'packaging' in tonic mode. Animal evidence [54] indicates a slowdown of temporal response rates along the ascending auditory pathway from the thalamus (10 ms) to the auditory cortex (30 ms). Thus, co-temporal information [40] could correspond to input sampled in a short window of integration within the left hemisphere.
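The two-window logic of AST (Box 3) can be caricatured numerically. In this hedged sketch, only the window lengths (~25 ms and ~200 ms) are taken from the AST proposal; the signal, function and variable names are our own illustrative choices:

```python
import numpy as np

def windowed_rms(x, sr, win_ms):
    """Integrate the signal's energy in non-overlapping windows of win_ms milliseconds."""
    win = int(sr * win_ms / 1000)
    n = (len(x) // win) * win
    return np.sqrt((x[:n].reshape(-1, win) ** 2).mean(axis=1))

sr = 16000
t = np.arange(0, 1.0, 1 / sr)
# Toy 'speech': a 4 Hz syllable-rate amplitude modulation on a 300 Hz carrier.
signal = (1 + np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 300 * t)

short = windowed_rms(signal, sr, 25)    # short AST window (~left hemisphere)
long_ = windowed_rms(signal, sr, 200)   # long AST window (~right hemisphere)
```

The short-window stream samples the fast amplitude structure in fine steps, whereas the long-window stream smooths it into a few coarse values, so the same input yields two complementary temporal descriptions.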

Opinion Trends in Cognitive Sciences Vol.14 No.9

on synchronization between self-sustained, adaptive internal oscillations and external temporal structure. This synchronization results in stimulus-driven allocation of attention. With respect to speech perception, this implies that we can attempt to synchronize an internal attention oscillation with the temporal structure of external speech events such as the rhythmic succession of vocalic nuclei. Thus, the rise in acoustic energy and the resulting energy maximum in sound structure, and the perceived prominence of the p-centre can constitute complementary phenomena. This event can then be used to: guide attention to points in time when important information appears, to set a spectro-temporal register [63] and to open a frame [59] for the subsequent integration of content elements from memory.
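A minimal sketch of such entrainment, assuming a simple linear phase- and period-correction rule: the gains and the rule itself are our illustrative choices, not DAT's formal oscillator model. An internal pulse adapts until its asynchrony with the external events vanishes:

```python
def entrain(onsets, period0, alpha=0.7, beta=0.2):
    """Adapt an internal pulse train to event onsets (phase + period correction)."""
    period = period0
    expected = onsets[0] + period           # first internal pulse after the first event
    asynchronies = []
    for onset in onsets[1:]:
        error = onset - expected            # event-vs-pulse asynchrony
        asynchronies.append(error)
        period += beta * error              # slow adaptation of the oscillation period
        expected += alpha * error + period  # phase correction, then the next pulse
    return asynchronies, period

# Regular events every 0.5 s; the oscillator starts with a mismatched 0.4 s period.
onsets = [0.5 * k for k in range(20)]
asyn, final_period = entrain(onsets, period0=0.4)
```

The asynchronies shrink geometrically as the oscillation locks onto the external rhythm, which is the stimulus-driven allocation of attention the text describes.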

Information about temporal structure is then used to align memory representations (temporal cortex) with the point in time that an event is maximally salient, to ensure temporal coherence and to optimize speech processing. If temporal structure conveys periodicity and allows the extraction of a regular pattern, then one can conceive of such an alignment as an incidence of synchronization between an external stimulus-inherent oscillation and an internal stimulus-driven oscillation. In line with DAT, information about successive events can provide attractors for an attention oscillation that are used to

synchronize attentional resources, and to generate frames for content integration. In other words, speech perception can involve integration of spectro-temporal content or formal information ('what') and temporal information delivered via the event-based temporal processing system ('when'). The formal aspect involves mapping from sound to meaning in the temporal cortex and the transmission of memory content via white matter fiber tracts [68]. This goes hand in hand with the functional differentiation of a ventral and a dorsal stream in speech processing. The dorsal stream maps acoustic or phonological representations to articulatory or motor representations, whereas the ventral stream maps sensory or phonological representations onto lexical conceptual representations [50,69]. Formal information transfer via ventral and dorsal pathways is further subdivided into 'what', 'how' and 'where' streams [70].

Furthermore, the cortico-striato-thalamo-cortical attention-dependent temporal processing system adds higher-order processing routines, such as interval estimation and comparison, as well as the extraction of temporal regularity. This information can then be used to generate predictions concerning the temporal locus of important speech events. Once regularity is perceived, the system can tolerate small perturbations, whereas strong regularity-based predictions would allow the maintenance of a pattern even in the presence of displaced or omitted events. This function differentiates the proposed role of the motor system in processing temporal aspects such as tracking rhythm, speech rate and turn taking in communication [71,72].
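The regularity-extraction and prediction step can be sketched with a deliberately simple median-interval heuristic (our own illustration, not the article's model), including tolerance of a displaced or omitted event:

```python
def predict_grid(onsets, n_future=3, tolerance=0.05):
    """Infer a regular interval, flag deviant events, extrapolate future ones."""
    intervals = [b - a for a, b in zip(onsets, onsets[1:])]
    period = sorted(intervals)[len(intervals) // 2]   # median interval = regularity
    # Events whose preceding interval deviates strongly (e.g. after an omission)
    deviants = [b for a, b in zip(onsets, onsets[1:])
                if abs((b - a) - period) > tolerance]
    future = [onsets[-1] + period * k for k in range(1, n_future + 1)]
    return period, deviants, future

# Regular 0.4 s train with one omitted event (the onset expected at 1.6 s is missing).
onsets = [0.0, 0.4, 0.8, 1.2, 2.0, 2.4]
period, deviants, future = predict_grid(onsets)
```

Because the inferred period survives the outlier interval, the pattern is maintained and extrapolated through the omission, which is the robustness the text attributes to strong regularity-based predictions.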

Concluding remarks

In this article we have outlined a neurofunctional framework of speech processing that emphasizes two elementary aspects of speech. First, information conveyed in an acoustic signal is entirely time-dependent. Temporal characteristics of the signal and related temporal processing should therefore play a significant role in both speech production and perception. Second, the evolution of speech as a complex motor behavior originates in subcortico-cortical motor systems and their capacity to temporally structure behavioral sequences. We propose that speech production and perception have retained characteristics of this primordial interaction between motor timing and sequencing



capacities, and a developing cognitive competence. Neuroanatomically, this fundamental interaction can be retraced in ontogenetic and phylogenetic development, in which primitive subcortical structures set in motion basic computational mechanisms in support of refined neocortical functions. In line with a recent proposal on speech production [73] we highlight the necessary contributions of cortical and subcortical brain structures to speech processing. We offer a framework within which to: (i) further investigate how different aspects of uni- and multimodal information converge in time to form unitary percepts [74], (ii) explain how developmental and compensatory mechanisms in speech disorders impact on speech processing (e.g. [75]) and (iii) elucidate how the underlying perspective transfers to other domains such as music (Box 4).

Acknowledgments

The authors would like to thank D. Yves von Cramon for expert neuroanatomical input and discussion, and Kathrin Rothermich, Iris N. Knierim, Maren Schmidt-Kassow and Anna S. Hasting for continual feedback. Special thanks to Robert F. Port and Angela D. Friederici as well as to the anonymous reviewers for constructive comments on an earlier version of the manuscript, and Richard Ivry for valuable discussion of the concept. Lastly, thanks to Kerstin Flake for graphics support.

References

1 Poeppel, D. and Hickok, G. (2004) Towards a new functional anatomy of language. Cognition 92, 1–12
2 Moore, B.C.J. (2003) Temporal integration and context effects in hearing. J. Phonetics 31, 563–574
3 Buhusi, C.V. and Meck, W.H. (2005) What makes us tick? Functional and neural mechanisms of interval timing. Nat. Rev. Neurosci. 6, 755–765
4 Petacchi, A. et al. (2005) Cerebellum and auditory function: an ALE meta-analysis of functional neuroimaging studies. Hum. Brain Mapp. 25, 118–128
5 de Cheveigné, A. (2003) Time-domain auditory processing of speech. J. Phonetics 31, 547–561
6 Lashley, K.S. (1951) The problem of serial order in behavior. In Cerebral Mechanisms in Behavior (Jeffress, L.A., ed.), pp. 112–136, Wiley
7 Jackendoff, R. (2002) Foundations of Language, Oxford University Press
8 Patel, A.D. (2003) Language, music, syntax and the brain. Nat. Neurosci. 6, 674–681
9 Ghitza, O. and Greenberg, S. (2009) On the possible role of brain rhythms in speech perception: intelligibility of time-compressed speech with periodic and aperiodic insertions of silence. Phonetica 66, 113–126
10 Giraud, A. et al. (2007) Endogenous cortical rhythms determine cerebral specialization for speech perception and production. Neuron 56, 1127–1134
11 Huang, C.M. et al. (1982) Projections from the cochlear nucleus to the cerebellum. Brain Res. 244, 1–8
12 Woody, C.D. et al. (1998) Acoustic transmission in the dentate nucleus I. Changes in activity and excitability after conditioning. Brain Res. 789, 74–83
13 Morest, D.K. et al. (1997) Neuronal and transneuronal degeneration of auditory axons in the brainstem after cochlear lesions in the chinchilla: cochleotopic and non-cochleotopic patterns. Hearing Res. 103, 151–168
14 Wang, X.F. et al. (1991) The dentate nucleus is a short-latency relay of a primary auditory transmission pathway. Neuroreport 2, 361–364
15 Xi, M.C. et al. (1994) Identification of short latency auditory responsive neurons in the cat dentate nucleus. Neuroreport 5, 1567–1570
16 Dum, R.P. and Strick, P.L. (2003) An unfolded map of the cerebellar dentate nucleus and its projections to the cerebral cortex. J. Neurophysiol. 89, 634–639
17 Akkal, D. et al. (2007) Supplementary motor area and presupplementary motor area: targets of basal ganglia and cerebellar output. J. Neurosci. 27, 10659–10673
18 Huang, C. and Liu, G. (1990) Organization of the auditory area in the posterior cerebellar vermis of the cat. Exp. Brain Res. 81, 377–383
19 Altman, J.A. et al. (1976) Electrical responses of the auditory area of the cerebellar cortex to acoustic stimulation. Exp. Brain Res. 26, 285–298
20 Bignall, K.E. (1970) Auditory input to frontal polysensory cortex of the squirrel monkey: possible pathways. Brain Res. 19, 77–86
21 Benedek, G. et al. (1997) Visual, somatosensory, auditory and nociceptive modality properties in the feline suprageniculate nucleus. Neuroscience 78, 179–189
22 Paroczy, Z. et al. (2006) Spatial and temporal visual properties of single neurons in the suprageniculate nucleus of the thalamus. Neuroscience 137, 1397–1404
23 Kobler, J.B. et al. (1987) Auditory pathways to the frontal cortex of the mustache bat, Pteronotus parnellii. Science 236, 824–826
24 Kurokawa, T. et al. (1990) Frontal cortical projections from the suprageniculate nucleus in the rat, as demonstrated with the PHA-L method. Neurosci. Lett. 120, 259–262
25 Katoh, Y. and Deura, S. (1993) Direct projections from the cerebellar fastigial nucleus to the thalamic suprageniculate nucleus in the cat studied with the anterograde and retrograde axonal transport of wheat germ agglutinin-horseradish peroxidase. Brain Res. 617, 155–158
26 Katoh, Y. et al. (1994) Bilateral projections from the superior colliculus to the suprageniculate nucleus in the cat: a WGA-HRP/double fluorescent tracing study. Brain Res. 669, 298–302
27 Kurokawa, T. and Saito, H. (1995) Retrograde axonal transport of different fluorescent tracers from the neocortex to the suprageniculate nucleus in the rat. Hearing Res. 85, 103–108
28 Budinger, E. (2000) Functional organization of auditory cortex in the Mongolian gerbil (Meriones unguiculatus). IV. Connections with anatomically characterized subcortical structures. Eur. J. Neurosci. 12, 2452–2474
29 Pastor, M.A. et al. (2008) Frequency-specific coupling in the cortico-cerebellar auditory system. J. Neurophysiol. 100, 1699–1705
30 Pastor, M.A. et al. (2006) The neural basis of temporal auditory discrimination. NeuroImage 30, 512–520
31 Alario, F. et al. (2006) The role of the supplementary motor area (SMA) in word production. Brain Res. 1076, 129–143
32 Lehericy, S. et al. (2004) 3-D diffusion tensor axonal tracking shows distinct SMA and pre-SMA projections to the human striatum. Cereb. Cortex 14, 1302–1309
33 Postuma, R.B. and Dagher, A. (2006) Basal ganglia functional connectivity based on a meta-analysis of 126 positron emission tomography and functional magnetic resonance imaging publications. Cereb. Cortex 16, 1508–1521
34 Middleton, F.A. and Strick, P.L. (2000) Basal ganglia and cerebellar loops: motor and cognitive circuits. Brain Res. Rev. 31, 236–250
35 Krienen, F.M. and Buckner, R.L. (2009) Segregated fronto-cerebellar circuits revealed by intrinsic functional connectivity. Cereb. Cortex 19, 2485–2497
36 Sherman, S.M. and Guillery, R.W. (2005) Exploring the Thalamus and Its Role in Cortical Function, MIT Press
37 He, J. and Hu, B. (2002) Differential distribution of burst and single-spike responses in auditory thalamus. J. Neurophysiol. 88, 2152–2156
38 Izhikevich, E.M. (2004) Which model to use for cortical spiking neurons? IEEE Trans. Neural Netw. 15, 1063–1070
39 Kepecs, A. and Lisman, J. (2003) Information encoding and computation with spikes and bursts. Netw. Comput. Neural Syst. 14, 103–118
40 Pöppel, E. (1997) A hierarchical model of temporal perception. Trends Cogn. Sci. 1, 56–61
41 Lieberman, P. (2002) On the nature and evolution of the neural bases of human language. Yearb. Phys. Anthropol. 45, 36–62
42 Leiner, H.C. et al. (1993) Cognitive and language functions of the human cerebellum. Trends Neurosci. 16, 444–447
43 MacLeod, C.E. et al. (2003) Expansion of the neocerebellum in Hominoidea. J. Hum. Evol. 44, 401–429
44 Weaver, A.H. (2005) Reciprocal evolution of the cerebellum and neocortex in fossil humans. Proc. Natl. Acad. Sci. U. S. A. 102, 3576–3580
45 Allard, F. and Scott, B.L. (1975) Burst cues, transition cues, and hemispheric specialization with real speech sounds. Q. J. Exp. Psychol. 27, 487–497
46 Hammond, G.R. (1982) Hemispheric differences in temporal resolution. Brain Cogn. 1, 95–118
47 Zatorre, R.J. and Belin, P. (2001) Spectral and temporal processing in human auditory cortex. Cereb. Cortex 11, 946–953
48 Liégeois-Chauvel, C. et al. (1999) Specialization of left auditory cortex for speech processing in man depends on temporal coding. Cereb. Cortex 9, 484–496
49 Abrams, D.A. et al. (2008) Right-hemisphere auditory cortex is dominant for coding syllable patterns in speech. J. Neurosci. 28, 3958–3965
50 Hickok, G. and Poeppel, D. (2007) The cortical organization of speech processing. Nat. Rev. Neurosci. 8, 393–402
51 Callan, D.E. et al. (2007) Speech and song: the role of the cerebellum. Cerebellum 6, 321–327
52 Schönwiesner, M. et al. (2007) Hemispheric asymmetry for auditory processing in the human auditory brain stem, thalamus, and cortex. Cereb. Cortex 17, 492–499
53 Wang, X. et al. (2003) Cortical processing of temporal modulations. Speech Commun. 41, 107–121
54 Wang, X. et al. (2008) Neural coding of temporal information in auditory thalamus and cortex. Neuroscience 154, 294–303
55 Graybiel, A.M. (1997) The basal ganglia and cognitive pattern generators. Schizophr. Bull. 23, 459–469
56 Friederici, A.D. (2009) Pathways to language: fiber tracts in the human brain. Trends Cogn. Sci. 13, 175–181
57 Rilling, J.K. et al. (2008) The evolution of the arcuate fasciculus revealed with comparative DTI. Nat. Neurosci. 11, 426–428
58 Catani, M. et al. (2005) Perisylvian language networks of the human brain. Ann. Neurol. 57, 8–16
59 MacNeilage, P.F. and Davis, B.L. (2001) Motor mechanisms in speech ontogeny: phylogenetic, neurobiological and linguistic implications. Curr. Opin. Neurobiol. 11, 696–700
60 MacNeilage, P.F. (2008) The Origin of Speech, Oxford University Press
61 Rosen, S. (1992) Temporal information in speech: acoustic, auditory and linguistic aspects. Philos. Trans. R. Soc. B 336, 367–373
62 Shannon, R.V. et al. (1995) Speech recognition with primarily temporal cues. Science 270, 303–304
63 Greenberg, S. et al. (2003) Temporal properties of spontaneous speech – a syllable-centric perspective. J. Phonetics 31, 465–485
64 Kochanski, G. and Orphanidou, C. (2008) What marks the beat of speech? J. Acoust. Soc. Am. 123, 2780–2791
65 Morton, J. et al. (1976) Perceptual centres (p-centres). Psychol. Rev. 83, 405–408
66 Port, R.F. (2003) Meter and speech. J. Phonetics 31, 599–611
67 Large, E.W. and Jones, M.R. (1999) The dynamics of attending: how people track time-varying events. Psychol. Rev. 106, 119–159
68 Glasser, M.F. and Rilling, J.K. (2008) DTI tractography of the human brain's language pathways. Cereb. Cortex 18, 2471–2482
69 Saur, D. et al. (2010) Combining functional and anatomical connectivity reveals brain networks for auditory language comprehension. NeuroImage 49, 3187–3197
70 Rauschecker, J.P. and Scott, S.K. (2009) Maps and streams in the auditory cortex: nonhuman primates illuminate human speech processing. Nat. Neurosci. 12, 718–724
71 Kotz, S.A. et al. (2009) Non-motor basal ganglia functions: a review and proposal for a model of sensory predictability in auditory language perception. Cortex 45, 982–990
72 Scott, S.K. et al. (2009) A little more conversation, a little less action – candidate roles for the motor cortex in speech perception. Nat. Rev. Neurosci. 10, 295–302
73 Guenther, F.H. (2006) Cortical interactions underlying the production of speech sounds. J. Commun. Disord. 39, 350–365
74 Schroeder, C.E. et al. (2008) Neuronal oscillations and visual amplification of speech. Trends Cogn. Sci. 12, 106–113
75 Corriveau, K.H. and Goswami, U. (2009) Rhythmic motor entrainment in children with speech and language impairment: tapping to the beat. Cortex 45, 119–130
