
Parallel Structure in an Integrated Speech-Recognition Network

M. Fleury, A. C. Downton, and A. F. Clark

Department of Electronic Systems Engineering, University of Essex, Wivenhoe Park, Colchester, CO4 3SQ, U.K.

e-mail: fleum@essex.ac.uk

Abstract. Large-vocabulary continuous-speech recognition (LVCR) speaker-independent systems which integrate cross-word context-dependent acoustic models and n-gram language models are difficult to parallelize because of their interwoven structure, large dynamic data structures, and complex object-oriented software design. This paper shows how retrospective decomposition can be achieved if a quantitative analysis is made of dynamic system behaviour. A design which accommodates unforeseen effects and future modifications is presented.

1 Introduction

Two varieties of LVCR system exist: a pipelined structure in which the components of acoustic matching and language modelling are separated, and an approach which integrates cross-word context-dependent acoustic models and n-gram language models into the search. The former has been thought to be more computationally tractable [�], while the latter has delivered a low mean error rate, ��% per word in ARPA evaluation, for a ��k vocabulary, tri-gram language model [2]. This paper examines whether an integrated system could also be parallelised, as has been achieved [3] for the pipelined structure.

On a high-performance workstation, even after introducing efficient memory management of dynamic data structures and optimising inner loops, timings on a �k vocabulary application (perplexity¹ ���) indicate that a further fivefold increase in execution speed is needed to achieve real-time performance. Increasingly complex future applications are likely to maintain this requirement even as uniprocessor performance increases through Moore's law. This paper proposes a minimum-cost redesign of such a sequential system aimed at achieving real-time performance for prototyping, rapid performance evaluation, and demonstrations in the development environment. The imminent ETSI standard for front-end processing enables such systems to act as servers to thin clients, possibly on mobile stations. To this end a preliminary parallelisation of a �k vocabulary application has been made.

¹ Perplexity is a measure of average recognition network branching.

2 Scale of the problem

A standard stochastic modelling approach to speech recognition has improved both recognition accuracy and the speed of computation [4]. Mel-frequency cepstrum acoustic feature vectors, hidden Markov models (HMMs) [5] to capture temporal and acoustic variance, tri-phone sub-word representation, and Gaussian probability distribution mixture sub-word models [6] are amongst the algorithmic components that have led to the emergence of LVCR. Any parallelisation should seek to preserve this stable structure, onto which further algorithmic innovations have been conveniently hung. Tied states and modes within HMMs for sub-word acoustic matching improve training accuracy for "unseen" cross-word triphones but imply shared data. Such common data also reduce computation during a recognition run on a uniprocessor or a multiprocessor with a shared address space, but pose a problem for a distributed-memory parallel implementation.
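For concreteness, the state output probability at the core of this computation is, in the standard Gaussian-mixture formulation (the notation here is generic, not the BT toolkit's),

    \[
      b_j(\mathbf{o}_t) = \sum_{m=1}^{M} c_{jm}\,
        \mathcal{N}(\mathbf{o}_t;\,\boldsymbol{\mu}_{jm},\boldsymbol{\Sigma}_{jm}),
      \qquad
      \mathcal{N}(\mathbf{o};\boldsymbol{\mu},\boldsymbol{\Sigma}) =
        \frac{\exp\!\bigl(-\tfrac{1}{2}(\mathbf{o}-\boldsymbol{\mu})^{\mathsf T}
              \boldsymbol{\Sigma}^{-1}(\mathbf{o}-\boldsymbol{\mu})\bigr)}
             {\sqrt{(2\pi)^{d}\,\lvert\boldsymbol{\Sigma}\rvert}},
    \]

where the c_jm are mixture weights and d is the feature-vector dimension. Tying states or modes amounts to sharing the component means and covariances across several triphone models, which is precisely the shared data referred to above.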

Speaker-independent integrated systems are being contemplated for database-enquiry telephone services. Development of an LVCR system requires the considerable resources available to large organisations. It may be unrealistic to think that a parallel algorithm, for example [7], will now be newly applied to existing systems. British Telecom (BT) have developed a toolkit for constructing LVCR applications which employs a one-pass time-synchronous speech decoder in which tokens carry scores (the sum of sub-word probabilities accrued), maintaining pointers to the n-best recognition network paths. The toolkit paradigm allows a variety of applications to be constructed from a core class library. One such application is considered in this paper. However, even when an application has been developed, care must be taken that a subsequent, unconsidered parallelisation does not prematurely fix the algorithmic content. It is not always possible to predict the side-effects a change might have, or the restriction a change might place on future algorithmic development.
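As an illustration of the token-passing scheme just described (the type and member names below are invented for this sketch and are not those of the BT class library), a token and its Viterbi-style propagation might look like:

    // Sketch of time-synchronous token passing; illustrative names only.
    #include <memory>

    struct PathRecord {                       // traceback record for the n-best paths
        int wordId;                           // word hypothesised at this boundary
        std::shared_ptr<PathRecord> prev;     // pointer back along the network
    };

    struct Token {
        double score = -1e300;                    // accumulated log probability
        std::shared_ptr<PathRecord> history;      // path pointer carried by the token
    };

    // On each frame a node keeps the best of its incoming tokens, adding the
    // transition (aprob) and state output (bprob) log probabilities.
    inline void propagate(Token& dst, const Token& src,
                          double logAprob, double logBprob)
    {
        const double candidate = src.score + logAprob + logBprob;
        if (candidate > dst.score) {
            dst.score = candidate;
            dst.history = src.history;
        }
    }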

Achieving speaker-independent recognition in real time is significantly harder than for speaker-dependent systems. Compare the IBM Tangora PC system [8], which uses an iterative search to reach real-time performance after the recognition network has been trained. Speaker-independent systems must model differences in speech intonation such as accent, dialect, age, and gender. Telephonic applications must also cope with a �� kHz bandwidth restriction, and noise reduction is needed on mobile stations. However, reduction of data storage and algorithmic complexity is less of an obstacle than when squeezing a recognition interface onto a PC. A conversational interface, additionally requiring speech understanding, is possible on a larger system. The complete BT system is distributed, connected under the CORBA distributed object standard [9], with LVCR as one component. Portability, usually through standardized software, is also an important consideration when adding a parallel extension to the LVCR system. In this respect, the standard message-passing libraries, PVM and MPI, seem suitable.

Given the logistic difficulties of developing the BT system, there are also short-term benefits from parallelising an LVCR system. Though algorithmic development is ongoing, there is an interim need to demonstrate real-time behaviour to potential clients. An additional benefit of speed-up is a reduction in performance-tuning run times. Test code in the BT system is included for all major modules to validate correct behaviour. However, in comparative performance runs, the complete system must be reset even if a minor change is made. For this purpose, speed-up need not be real time. In short, any improvement in speed is welcome, whether it results from code optimisations, algorithmic innovation within the existing system, parallelisation, or a combination of effects.

3 Effect of object-oriented design

The BT LVCR system has been written in C++. In a run for a �k-word vocabulary, ���� different functions were called. Object-orientation is a necessary way of coping with this complexity. Even so, a class browser, such as SNiFF+ [10], is essential for retrospective analysis of the system. Object-orientation involves partitioning of data. Within an object, data are normally held privately or in a protected state (available only to derived classes). Partitioning of data deters performance optimisations arising from merging functions². Equally, careful consideration has to be given to how a system is decomposed as a prelude to parallelisation. Experimental systems that combine parallelism and the object-oriented paradigm are documented in [11]. A class of synchronisation and communication primitives which do not extend the language is provided in [12], based on Parmacs. However, [13] notes that simply adding these primitives does not encapsulate parallelism, and suggests adding path expressions to remove the inflexibility. In the present design, we were constrained by pragmatic considerations in introducing a parallel structure.

4 Processing difficulties

Processing on workstations is an order of magnitude away from real time, assuming a �� ms frame acquisition window, if an n-best single-pass search is made. Formation of the initial feature vector is a task that is well understood and can be delegated to Digital Signal Processors (DSPs). The Viterbi search algorithm [14], based on a simple maximal optimality condition, has made the subsequent network search at least feasible on uniprocessors. The Viterbi search is breadth-first and synchronous, not asynchronous and depth-first, which might be more suited to parallel computation. A beam search [15] is a further pruning option, whereby available routes through the network are thresholded. Beam pruning with two-tiered score thresholding, signatures [16], and path merging [17] have been added to the BT system to further reduce search complexity. However, it is at the network decoding phase that more processing power still needs to be deployed if no further radical pruning heuristics are forthcoming.
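A single-threshold beam prune over the active node list can be sketched as follows; the BT system's two-tiered thresholding, signatures and path merging are omitted, and the names are illustrative rather than those of the toolkit:

    // Sketch of beam pruning: retire nodes whose best score has fallen more
    // than beamWidth below the best score seen in the current frame.
    #include <list>

    struct Node { double bestScore; /* model state, token, list links ... */ };

    void beamPrune(std::list<Node*>& active, std::list<Node*>& trash,
                   double beamWidth)
    {
        double best = -1e300;
        for (const Node* n : active)
            if (n->bestScore > best) best = n->bestScore;

        const double threshold = best - beamWidth;
        for (auto it = active.begin(); it != active.end(); ) {
            if ((*it)->bestScore < threshold) {
                trash.push_back(*it);         // later swept back into the memory pool
                it = active.erase(it);
            } else {
                ++it;
            }
        }
    }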

² In this context, the term "function" seems more appropriate than the object-oriented term "method".

BBN's HARC system took about twenty times real time for an n-best search on the ���-word vocabulary DARPA test with word-pair perplexity of ��. Dual TI C�� DSPs were used to find the feature vectors, while a Sun workstation performed network parsing. To improve the speed [18] to double real time, a two-pass iteration was made, though language parsing must be added on to this time. AT&T implemented a system for the same task on a symmetric multiprocessor (SMP) with four MIPS R��� processors, but for a full search again recorded twenty times real time. A multicomputer was designed [3] with �� DSPs in a proprietary interconnection topology, though the system employed to take a standard DARPA test was one node consisting of sixteen processors. Using localised memory and store-and-forward communication on the AT&T system results in non-linear growth in the communication overhead, and hence in the number of processors needed for larger vocabularies.

The AT&T system has a pipeline architecture, possible because the component parts of the decoder system are separable into feature extraction, mixture scoring, phone scoring, word scoring, and phrase scoring. Other systems, such as the Cambridge University HTK system, also with high-accuracy scoring, are organised as a token-passing network [�], which is not easily broken into a pipeline. As mentioned in Section 2, the BT LVCR system is also of the token-passing network type.

5 The BT System

5.1 Problems to be overcome

The existing BT design, Fig. 1 [20], resists decomposition due to the close coupling of the network update procedures. ��-way acoustic feature vectors (frames), arriving every �� ms, are applied to each active node of the recognition network. Real, noise, and null nodes embody models for, respectively, speech, noise, and word connections. The nodes are kept in global lists, necessary because a variety of update procedures are applied. In particular, dormant nodes are reused from application-maintained memory pools, without the variable delay due to system memory allocation. Large networks, for unconstrained speech or language models beyond bi-gram, are dynamically extended when a token reaches a network boundary. Network extension makes parallel decomposition by statically forming sub-networks problematic, because of the need to load balance and hence repeatedly re-divide the network.
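The reuse of dormant nodes can be pictured as a simple free list maintained by the application; this is a minimal sketch, not the BT pool implementation:

    // Application-maintained node pool: nodes swept off the trash list are
    // recycled, so node "allocation" during decoding is a pointer pop rather
    // than a call into the system allocator with its variable delay.
    #include <vector>

    struct NetworkNode { /* model reference, token, list links ... */ };

    class NodePool {
    public:
        NetworkNode* acquire() {
            if (free_.empty())
                free_.push_back(new NetworkNode());   // grow only when the pool is empty
            NetworkNode* n = free_.back();
            free_.pop_back();
            return n;
        }
        void release(NetworkNode* n) { free_.push_back(n); }   // e.g. when the trash list is swept
    private:
        std::vector<NetworkNode*> free_;
    };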

5.2 Execution analysis of the LVCR system

Fig. 1. Network processing cycle: the node lists (active node list, extend node list, trash list, null lists, noise registers, memory pools) and the update procedures (get new frame, propagate real, propagate null, prune nodes, extend network, sweep into memory pools, stepforward) that add nodes to, remove nodes from, or use those lists.

The top-�� functions call graph, Fig. 2, for �� utterances on the �k Wall Street Journal test with bi-grams, showed that ��% of total computation time, including load time, was taken up by the "feedforward" update. The branch of function calls, Fig. 3, resulting in the calculation of state output probabilities (bprobs) was uncharacteristically free of sub-function calls, which otherwise can give rise to unforeseen data dependencies. Other parameters, such as the state transition probabilities (aprobs), remain fixed. The seemingly redundant level of indirection for mode-level checks enables future sharing of modes, which model variety in speech intonation. ��% of the time is spent calculating a quadratic part of the sum forming the mixture of unimodal Gaussian densities which comprise the core of any state. ����� nodes were present on average over ��� frames, representing ��s of speech.
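The dominant quadratic term is, per mixture component and assuming the diagonal covariances usual for cepstral features, essentially the following loop (a sketch, not the BT code):

    // Quadratic part of one log Gaussian component: (o - mu)^T Sigma^{-1} (o - mu)
    // for a diagonal covariance, i.e. a weighted sum of squared differences.
    #include <cstddef>

    double quadraticTerm(const float* obs, const float* mean,
                         const float* invVar, std::size_t dim)
    {
        double acc = 0.0;
        for (std::size_t i = 0; i < dim; ++i) {
            const double d = obs[i] - mean[i];
            acc += d * d * invVar[i];         // the profiled hot spot
        }
        return acc;    // the mixture weight and normalising constant are added elsewhere
    }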

6 Parallel architecture

Fig. 2. Extract from high-level call graph, showing call intensity by link width

The parallel architecture that was arrived at can be considered to be a pipeline, Fig. 4, though no overlapped processing takes place across the pipeline stages because of the synchronous nature of the processing. The first of the two pipeline stages employs a data-farm. A data-manager farms out the computationally intensive low-level probability calculations to a set of worker processes, with some work taking place local to the data-farmer while communication occurs.

The standard PVM library of message-passing communication primitives was used in the prototype, run over a network of HP Apollo ��� series workstations. Worker processes each hold copies of a pool of ����� tri-state models (and �� one-state noise models). State-level parallelism requires broadcast of the current model identity (� bytes), the prototype system needing no global knowledge if started at the ��% point in Fig. 3. By a small breach of object demarcation, whereby at the node object level (with embedded HMM) the state object update history was inspected, the number of messages was sharply reduced, as only one in twelve state bprobs for any frame were newly calculated. A small overhead from manipulation of the node active list enables the parallel ratio to reach the ��% point, collection of thresholding levels then being centralised. Mode-level checking, when introduced, would check for replications at the local level, thus limiting the loss in efficiency.
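The data-farming of the bprob calculations can be sketched with standard PVM calls as below; the message tags, packing layout and function names are assumptions made for illustration, not the prototype's actual protocol:

    // Data-manager side of the farm (sketch). Workers hold their own copies of
    // the model pool, so only the model identity is broadcast and only the
    // newly required state probabilities are returned.
    #include <pvm3.h>

    const int TAG_MODEL  = 1;
    const int TAG_RESULT = 2;

    void farmModel(int* workerTids, int nWorkers, int modelId, int nStatesExpected)
    {
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&modelId, 1, 1);                       // a few bytes per broadcast
        pvm_mcast(workerTids, nWorkers, TAG_MODEL);

        // ... some probability work is done locally while the workers compute ...

        for (int i = 0; i < nStatesExpected; ++i) {      // gather state-level results
            pvm_recv(-1, TAG_RESULT);                    // from any worker
            int   stateId;
            float logBprob;
            pvm_upkint(&stateId, 1, 1);
            pvm_upkfloat(&logBprob, 1, 1);
            // ... merge logBprob into the network update for stateId ...
        }
    }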

Fig. 3. Probability calculation function hierarchy with call ratios: model level (update model probabilities using aprobs, bprobs and the maximum incoming score), state level (summation of weighted bprob modes across sub-frames), mode-set and mode levels (summation of bprob component parts and calculation of the quadratic bprob), and sub-mode and mode-component levels (checks that bprobs and mode calculations have not previously been made), annotated with the percentage of execution time at each level.

POSIX-standard pthreads are proposed for the second pipeline stage, whereby the residual system is parallelised. Propagation of null nodes and real nodes, sweeping-up of nodes (thus avoiding over-use of the free store), and the recognition network pruning functions all have a similar structure. For example, the prune function first establishes pruning levels, which are then globally available to all spawned threads. Once the workers are spawned, the host thread of control, i.e. the prune function, is descheduled until it is reawoken by the completion of its worker threads. Worker threads proceed by taking a node (or nodes) from the active list, deciding whether pruning should take place, and updating the trash list and active list if pruning takes place. The large number of active nodes allows granularity to be adapted to circumstances, and the few points of potential serialization, requiring locks, increase the potential scale-up.
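A possible shape for the multi-threaded prune step is sketched below with POSIX threads; the structure names and the single mutex are assumptions made for the illustration, and the two-tiered thresholds and adaptive granularity of the BT design are omitted:

    // The host (prune) thread fixes the threshold, spawns the workers and then
    // blocks in pthread_join until they finish; workers file each node on the
    // kept or trash list, the only serialisation being the brief list updates.
    #include <pthread.h>
    #include <list>
    #include <vector>

    struct Node { double bestScore; };

    struct PruneShared {
        std::list<Node*> active;            // nodes still to be examined
        std::list<Node*> kept;              // survivors of the prune
        std::list<Node*> trash;             // nodes to be swept back to the pool
        double threshold = 0.0;             // established before the workers are spawned
        pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    };

    void* pruneWorker(void* arg)
    {
        PruneShared* s = static_cast<PruneShared*>(arg);
        for (;;) {
            pthread_mutex_lock(&s->lock);
            if (s->active.empty()) { pthread_mutex_unlock(&s->lock); return nullptr; }
            Node* n = s->active.front();
            s->active.pop_front();
            pthread_mutex_unlock(&s->lock);

            const bool keep = (n->bestScore >= s->threshold);   // decided outside the lock

            pthread_mutex_lock(&s->lock);
            (keep ? s->kept : s->trash).push_back(n);
            pthread_mutex_unlock(&s->lock);
        }
    }

    void prune(PruneShared& s, int nThreads)
    {
        std::vector<pthread_t> workers(nThreads);
        for (pthread_t& t : workers) pthread_create(&t, nullptr, pruneWorker, &s);
        for (pthread_t& t : workers) pthread_join(t, nullptr);    // host descheduled here
        s.active.swap(s.kept);              // survivors become the new active list
    }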

7 Future implementation on an SMP

We considered whether a widely available type of parallel machine would be sufficient to parallelize the complete system. On a symmetric multiprocessor (SMP), the thread manager would share one processor with the data manager. Efficient message-passing is available for SMPs [21], in addition to threads. Tri-phones, usual for continuous speech, restrict potential parallelism, but with node-level decomposition (Table 1) an eight-processor machine would approach the required fivefold speed-up, while a four-processor machine would reduce turn-around during testing. The estimate conservatively assumes that half of the residual system is parallel, while scaling of the system to this level is irrespective of the frame-processing workload distribution over time. Inlining of some functions is available as a further sequential optimisation.

Fig. 4. Synchronous LVCR process pipeline: acoustic feature frames enter a data manager, which farms the low-level probability calculations out to workers 1..n plus a local worker; completion feedback passes the recognition-network update to a thread manager running the multi-threaded update over the recognition-network data structures, producing confidence-rated n-best candidate utterances.

                         stage 1              stage 2
parallelization (level)  4 procs   8 procs    4 procs   8 procs
state                    ����      ����       ����      ����
node                     ����      ����       ����      ����

Table 1. Speed-up estimate (Amdahl's law)
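For reference, the estimates in Table 1 follow Amdahl's law: with a parallelisable fraction p of a stage and N processors applied to it, the predicted speed-up is

    \[
      S(N) \;=\; \frac{1}{(1-p) + p/N}.
    \]

As an illustrative calculation (the value of p here is assumed, not taken from the table), p = 0.9 with N = 8 gives S ≈ 4.7, the order of the fivefold improvement sought; however many processors are added, the speed-up remains bounded above by 1/(1 - p).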

8 Conclusion

Speech recognition exhibits many features of present-day systems that must be taken into account if parallelisation is used as a performance-enhancing technique:

- The parallelisation should be conservative, i.e. preserve as much of the established structure as possible, given that the structure is well proven and has been arrived at after considerable effort, under a sequential programming model.
- The system is complex, resulting in considerable logistical difficulties at the testing phase, which a parallel structure may ameliorate.
- Object-oriented design is a way of managing complexity but can impose restrictions which should be considered. Given the advantages of object-oriented design in terms of managing large projects, a parallel design cannot ignore this issue.
- Software standardisation is important for this type of system, which means standards for parallel systems such as PVM, MPI and POSIX threads should be used and developed.
- Parallelisation is not the only way to improve performance, and indeed algorithmic innovation in speech recognition, for example tied states, can result in greater improvements.
- The system exhibits algorithmic discontinuity, which requires the combination of two different parallel programming paradigms.

Presented with a large-scale system, the prospect may seem daunting to the system analyst. It is important to examine the dynamic behaviour as well as the static structure of the system. Indeed, doing this allowed a credible parallel architecture to be formed from the BT LVCR, combining two parallel programming paradigms: the data-farm and the multi-threaded shared-memory model. A high-level design with low-bandwidth message-passing has been developed and prototyped to cope with the major part of the processing. The complete parallel system, which is portable, hardly disturbs the existing integrated structure, but with modest hardware outlay is confidently estimated to bring significant performance gain.

Acknowledgements

This work was carried out under EPSRC research contract GR/K���� (Parallel software tools for embedded signal processing applications) as part of the EPSRC Portable Software Tools for Parallel Architectures directed programme.

The software described in this paper was developed at BT Laboratories, Martlesham Heath, UK. The authors are indebted to the BT Speech Technology Unit for permission to publish the experimental work on parallelising speech recognition. Opinions expressed in this paper are the authors' personal views, and should not be assumed to represent the views of BT.

References

1. L. R. Rabiner, B.-H. Juang, and C.-H. Lee. An overview of automatic speech recognition. In C.-H. Lee, F. K. Soong, and K. K. Paliwal, editors, Automatic Speech and Speaker Recognition: Advanced Topics. Kluwer, Boston, 1996.

2. P. C. Woodland, C. J. Leggetter, J. J. Odell, V. Valtchev, and S. J. Young. The 1994 HTK large vocabulary speech recognition system. In ICASSP'95, volume I, 1995.

3. S. Glinski and D. Roe. Spoken language recognition on a DSP array processor. IEEE Transactions on Parallel and Distributed Systems, July 1994.

4. R. Moore. Recognition – the stochastic modelling approach. In C. Rowden, editor, Speech Processing. McGraw-Hill, London.

5. L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, February 1989.

6. L. A. Liporace. Maximum likelihood estimation for multivariate observations of Markov sources. IEEE Transactions on Information Theory, September 1982.

7. W. Turin. Unidirectional and parallel Baum-Welch algorithms. IEEE Transactions on Speech and Audio Processing, November 1998.

8. S. K. Das and M. A. Picheny. Issues in practical large vocabulary isolated word recognition: the IBM Tangora system. In C.-H. Lee, F. K. Soong, and K. K. Paliwal, editors, Automatic Speech and Speaker Recognition: Advanced Topics. Kluwer, Boston, 1996.

9. S. Baker. CORBA Distributed Objects Using Orbix. Addison-Wesley, Harlow, UK, 1997.

10. TakeFive Software GmbH, Stichting Mathematisch Centrum, Amsterdam, The Netherlands. SNiFF+ Release ��� User's Guide and Reference.

11. G. V. Wilson and P. Lu, editors. Parallel Programming Using C++. MIT Press, Cambridge, MA, 1996.

12. B. Beck. Shared-memory parallel programming in C++. IEEE Software, July 1990.

13. Y. Wu and T. G. Lewis. Parallelism encapsulation in C++. In International Conference on Parallel Processing, volume II. Pennsylvania State University.

14. G. D. Forney. The Viterbi algorithm. Proceedings of the IEEE, 61(3):268–278, March 1973.

15. R. Haeb-Umbach and H. Ney. Improvements in beam search for 10,000-word continuous-speech recognition. IEEE Transactions on Speech and Audio Processing, April 1994.

16. S. P. A. Ringland. Application of grammar constraints to ASR using signature functions. In Speech Recognition and Coding, NATO ASI Series F. Springer, Berlin, 1995.

17. S. Hovell. The incorporation of path merging in a dynamic network parser. In ESCA EuroSpeech.

18. S. Austin, R. Schwartz, and P. Placeway. The forward-backward search algorithm. In International Conference on Acoustics, Speech, and Signal Processing, volume 1, 1991.

19. S. Young. A review of large-vocabulary continuous-speech recognition. IEEE Signal Processing Magazine, September 1996.

20. D. Ollason, S. Hovell, and M. Wright. Requirements and design of the new continuous speech recognition parser – the Grid. Technical report, BT Laboratories, Martlesham Heath, Ipswich, UK.

21. S. S. Lumetta and D. E. Culler. Managing concurrent access for shared memory active messages. In IPPS/SPDP'98, 1998. Pages from http://now.CS.berkeley.EDU/Papers/.

This article was processed using the LaTeX macro package with LLNCS style.