
Hindawi Publishing Corporation, EURASIP Journal on Audio, Speech, and Music Processing, Volume 2010, Article ID 546047, 11 pages, doi:10.1155/2010/546047

Research Article

Automatic Recognition of Lyrics in Singing

Annamaria Mesaros and Tuomas Virtanen (EURASIP Member)

Department of Signal Processing, Tampere University of Technology, 33720 Tampere, Finland

Correspondence should be addressed to Annamaria Mesaros, annamaria.mesaros@tut.fi

Received 5 June 2009; Revised 6 October 2009; Accepted 23 November 2009

Academic Editor: Elmar Nöth

Copyright © 2010 A. Mesaros and T. Virtanen. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

The paper considers the task of recognizing phonemes and words from a singing input by using a phonetic hidden Markov model recognizer. The system is targeted to both monophonic singing and singing in polyphonic music. A vocal separation algorithm is applied to separate the singing from polyphonic music. Due to the lack of annotated singing databases, the recognizer is trained using speech and linearly adapted to singing. Global adaptation to singing is found to improve singing recognition performance. Further improvement is obtained by gender-specific adaptation. We also study adaptation with multiple base classes defined by either phonetic or acoustic similarity. We test phoneme-level and word-level n-gram language models. The phoneme language models are trained on the speech database text. The large-vocabulary word-level language model is trained on a database of textual lyrics. Two applications are presented. The recognizer is used to align textual lyrics to vocals in polyphonic music, obtaining an average error of 0.94 seconds for line-level alignment. A query-by-singing retrieval application based on the recognized words is also constructed; in 57% of the cases, the first retrieved song is the correct one.

1. Introduction

Singing is used to produce musically relevant sounds by the human voice, and it is employed in most cultures for entertainment or self-expression. It consists of two main aspects: melodic (represented by the time-varying pitch) and verbal (represented by the lyrics). The sung lyrics convey the semantic information, and both the melody and the lyrics allow us to identify the song.

Thanks to the increased amount of music playing devices and available storage and transmission capacity, consumers are able to find plenty of music in the forms of downloadable music, internet radios, personal music collections, and recommendation systems. There is a need for automatic music information retrieval techniques, for example, for efficiently finding a particular piece of music from a database or for automatically organizing a database.

The retrieval may be based not only on the genre of the music [1], but also on the artist identity [2] (artist does not necessarily mean singer). Studies on singing voice have developed methods for detecting singing segments in polyphonic music [3], detecting solo singing segments in music [4], identifying the singer [5, 6], or identifying two simultaneous singers [7].

As humans can recognize a song by its lyrics, information retrieval based on the lyrics has a significant potential. There are online lyrics databases which can be used for finding the lyrics of a particular piece of music, knowing the title and artist. Also, knowing part of the textual lyrics of a song can help identify the song and its author by searching in lyrics databases. Lyrics recognition from singing would allow searching in audio databases, ideally automatically transcribing the lyrics of a song being played. Lyrics recognition can also be used for automatic indexing of music according to automatically transcribed keywords. Another application for lyrics recognition is finding songs using query-by-singing, based on the recognized sung words to find a match in the database. Most of the previous audio-based approaches to query-by-singing have used only the melodic information from singing queries [8] in the retrieval. In [9], a music retrieval algorithm was proposed which used both the lyrics and melodic information. The lyrics recognition grammar was a finite state automaton constructed from the lyrics in the queried database. In [10], the lyrics grammar was constructed for each tested song. Other works related to retrieval using lyrics include [11, 12].

Recognition of phonetic information in polyphonic music is a barely touched domain. Phoneme recognition in individual frames of polyphonic music was studied in [13], but there has been no work done on large-vocabulary recognition of lyrics in English.

Because of the difficulty of lyrics recognition, many studies have focused on the simpler task of audio and lyrics alignment [14–18], where the textual lyrics to be synchronized with the singing are known. The authors of [15] present a system based on Viterbi forced alignment. A language model is created by retaining only vowels for Japanese lyrics converted to phonemes. In [16], the authors present LyricAlly, a system that first aligns the higher-level structure of a song and then performs a line-level alignment within the boundaries of the detected sections. The line-level alignment uses only a uniform estimated phoneme duration, rather than a phoneme-recognition-based method. The system works by finding vocal segments but not recognizing their content. Such systems have applications in the automatic production of material for entertainment purposes, such as karaoke.

This paper deals with recognition of the lyrics, meaning recognition of the phonemes and words from the singing voice, both in monophonic singing and in polyphonic music, where other instruments are used together with singing. We aim at developing transcription methods for query-by-singing systems where the input of the system is a singing phrase and the system uses the recognized words to retrieve the song from a database.

The basis for the techniques in this paper is automatic speech recognition. Even though there are differences between singing voice and spoken voice (see Section 2.1), experiments show that it is possible to use speech recognition techniques on singing. Section 2.2 presents the speech recognition system. Due to the lack of large enough singing databases to train a recognizer for singing, we use a phonetic recognizer trained on speech and adapt it to singing, as presented in Section 2.3. We use different settings: adapting the models to singing, to gender-dependent models, and to singer-specific models, using different numbers of base classes in the adaptation. Section 2.4 presents the phoneme- and word-level n-gram language models that will be used in the recognition. Experimental evaluation and results are presented in Section 3. Section 4 presents two applications: automatic alignment of audio and lyrics in polyphonic music and a small-scale query-by-singing application. The conclusions and future work are presented in Section 5.

2. Singing Recognition

This section describes the speech recognition techniques used in the singing recognition system. We first review the main differences between speech and singing voice and present the basic structure of the phonetic recognizer architecture, and then the proposed singing adaptation methods and language models.

Figure 1: Example spectrograms of male singing (a) and male speech (b); the axes show time (seconds) versus frequency (Hz). In singing the voice is more continuous, whereas in speech the pitch and formants vary more rapidly in time. In speech the amount of unvoiced segments is higher.

2.1. Singing Voice. Speech and singing convey the same kind of semantic information and originate from the same production physiology. In singing, however, the intelligibility is often secondary to the intonation and musical qualities of the voice. Vowels are sustained much longer in singing than in speech, and independent control of pitch and loudness over a large range is required. The properties of the singing voice have been studied in [19].

In normal speech, the spectrum is allowed to vary freely, and pitch or loudness changes are used to express emotions. In singing, the singer is required to control the pitch, loudness, and timbre.

The timbral properties of a sung note depend on the frequencies at which there are strong and weak partials. In vowels this depends on the formant frequencies, which can be controlled by the length and shape of the vocal tract and the articulators. Different individuals tune their formant frequencies a bit differently for each vowel, but skilled singers can control the pitch and the formant frequencies more accurately.

Another important part of the voice timbre difference between male and female voices seems to be the voice source: the major difference is primarily in the amplitude of the fundamental. The voice spectrum of a male voice has a weaker fundamental than the voice spectrum of a female voice.

The pitch range in a singing phrase is usually higher than in a spoken sentence. In speech the pitch varies all the time, whereas in singing it stays approximately constant during a note (with vibrato being used for artistic singing), as illustrated in Figure 1. Therefore, in singing the variance of the spectrum of a phoneme within a note is smaller compared to speech, while the difference between phonemes sung at different pitches can be much larger. The acoustic features used in the recognizer try to make the representation invariant to pitch changes.

2.2. HMM Phonetic Recognizer. Despite the above-mentioned differences between speech and singing, they still have many properties in common, and it is plausible that singing recognition can be done using the standard technique in automatic speech recognition, a phonetic hidden Markov model (HMM) recognizer. In HMM-based speech recognition it is assumed that the observed sequence of speech feature vectors is generated by a hidden Markov model. An HMM consists of a number of states with associated observation probability distributions and a transition matrix defining transition probabilities between the states. The emission probability density function of each state is modeled by a Gaussian mixture model (GMM).

In the training process, the transition matrix and the means and variances of the Gaussian components in each state are estimated to maximize the likelihood of the observation vectors in the training data. Speaker-independent models can be adapted to the characteristics of a target speaker, and similar techniques can be used for adapting acoustic models trained on speech to singing, as described further in Section 2.3.

Linguistic information about the speech or singing to be recognized can be used to develop language models. They define the set of possible phoneme or word sequences to be recognized and associate probabilities with each of them, improving the robustness of the recognizer.

The singing recognition system used in this work consists of 39 monophone HMMs plus silence and short-pause models. Each phoneme is represented by a left-to-right HMM with three states. The silence model is a fully connected HMM with three states, and the short pause is a one-state HMM tied to the middle state of the silence model. The system was implemented using HTK (The Hidden Markov Model Toolkit, http://htk.eng.cam.ac.uk/).
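As a rough illustration of this topology (the transition probabilities below are placeholders, not values from the paper; in practice they are estimated during training), the phone and silence models differ only in which transitions are allowed:

import numpy as np

def left_to_right_transitions(n_states=3, stay_prob=0.6):
    # Left-to-right phone model: each emitting state may loop on itself
    # or move one step forward; the extra last column is the exit state.
    A = np.zeros((n_states, n_states + 1))
    for i in range(n_states):
        A[i, i] = stay_prob
        A[i, i + 1] = 1.0 - stay_prob
    return A

def fully_connected_transitions(n_states=3, exit_prob=0.1):
    # Silence model: every emitting state can reach every other state.
    A = np.full((n_states, n_states + 1), (1.0 - exit_prob) / n_states)
    A[:, -1] = exit_prob
    return A

phone_A = left_to_right_transitions()      # one such model per phoneme
silence_A = fully_connected_transitions()  # the short-pause model reuses
                                           # (ties) the middle silence state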

As features we use 13 mel-frequency cepstral coefficients (MFCCs) plus delta and acceleration coefficients, calculated in 25 ms frames with a 10 ms hop between adjacent frames.

Figure 2 illustrates an example of MFCC features calculated from a descending scale of notes from G#4 to F#3 (fundamental frequency from 415 Hz to 208 Hz) sung by a male singer with the phoneme /m/. Looking at the values of different-order MFCCs, we can see that the pitch affects the different-order coefficients differently. Using only the low-order cepstral coefficients aims to represent the rough shape of the spectrum (i.e., the formants and phonemes), while making the representation pitch independent. The system also uses cepstral mean normalization [20].
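A minimal sketch of this front end, assuming the librosa library and a hypothetical input file vocals.wav (the authors use HTK; its exact configuration is not reproduced here):

import numpy as np
import librosa

y, sr = librosa.load("vocals.wav", sr=16000)   # hypothetical input file
frame_len = int(0.025 * sr)                    # 25 ms analysis frames
hop_len = int(0.010 * sr)                      # 10 ms hop between frames

# 13 MFCCs per frame plus delta and acceleration (delta-delta) coefficients
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=frame_len, hop_length=hop_len)
delta = librosa.feature.delta(mfcc)
accel = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, delta, accel])     # 39 coefficients per frame

# Cepstral mean normalization: remove the per-coefficient mean over time
features -= features.mean(axis=1, keepdims=True)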

2.3. Adaptation to Singing. Due to the lack of a large enough singing database for training the acoustic models of the recognizer, we first train models for speech and then adapt them linearly to singing.

The acoustic material used for the adaptation is called the adaptation data. In speech recognition the data is typically a small amount of speech from the target speaker. The adaptation is done by finding a set of transforms for the model parameters in order to maximize the likelihood that the adapted models have produced the adaptation data. When the phoneme sequence is known, the adaptation can be done in a supervised manner.

Figure 2: An example of MFCC features calculated from a descending scale of notes from G#4 to F#3 (fundamental frequency from 415 Hz to 208 Hz) sung by a male singer with the phoneme /m/. The solid line represents the 3rd MFCC and the dashed line the 7th MFCC. The 3rd MFCC is clearly affected by the variation in pitch.

Maximum likelihood linear regression (MLLR) [21] is a commonly used technique in speaker adaptation. Given the mean vector μ of a mixture component of a GMM, MLLR estimates a new mean vector as

μ̂ = Aμ + b, (1)

where A is a linear transform matrix and b is a bias vector. In constrained MLLR (CMLLR) [22], the covariance matrix Σ of the mixture component is also transformed as

Σ̂ = AΣAᵀ. (2)

The transform matrix and bias vector are estimated using the EM algorithm. It has been observed that MLLR can compensate for differences in the length of the vocal tract [23].

The same transform and bias vector can be shared between all the Gaussians of all the states (global adaptation). If enough adaptation data is available, multiple transforms can be estimated separately for sets of states or Gaussians. The states or Gaussians can be grouped by either their phonetic similarity or their acoustic similarity. The groups are called base classes.
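A rough numpy sketch of applying one such transform according to (1) and (2); the estimation of A and b with the EM algorithm is omitted, and the array shapes and names are assumptions for illustration only:

import numpy as np

def apply_cmllr(means, covs, A, b):
    # means: (n_gauss, d) mean vectors, covs: (n_gauss, d, d) covariances,
    # A: (d, d) transform matrix, b: (d,) bias vector shared by all
    # Gaussians of a base class (the global case uses a single class).
    new_means = means @ A.T + b        # mu_hat = A mu + b, equation (1)
    new_covs = A @ covs @ A.T          # Sigma_hat = A Sigma A^T, equation (2)
    return new_means, new_covs

# With multiple base classes, one (A, b) pair would be estimated and
# applied per class, for example per phonetic group from Table 1.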

We do the singing adaptation using a two-pass procedure. The usual scenario in speaker adaptation for speech recognition is to use a global transform followed by a second transform with more classes constructed in a data-driven manner, by means of a regression tree [24]. One global adaptation is best suited when we have a small amount of training data or when we need a robust transform [25].

Table 1: Divisions of phonemes into classes by phonetic similarity.

Number of classes   Classes
3                   vowels, consonants, silence/noise
8                   monophthongs, diphthongs, approximants, nasals, fricatives, plosives, affricates, silence/noise
22                  one class per vowel, approximants, nasals, fricatives, plosives, affricates, silence/noise

Speech and singing are composed of 3 large classes of sounds: vowels, consonants, and nonspeech. As nonspeech sounds we consider the silence and pause models of the speech recognizer. According to this rationale, we use 3 classes for adaptation. We also consider 7 broad phonetic classes: monophthongs, diphthongs, approximants, nasals, fricatives, plosives, and affricates, plus nonspeech, and use 8 classes for adaptation. The singing sounds can also be grouped into 22 classes: each vowel mapped to a separate class, one class for each consonant type (approximants, nasals, fricatives, plosives, affricates), and one for the nonspeech category. These 3 grouping methods are summarized in Table 1.

A second pass of the adaptation uses classes determined by acoustic similarity, obtained by clustering the Gaussians of the states. We use clusters formed from the speech models and from the models after one-pass adaptation to singing.

In the initial adaptation experiments [26] we observed that CMLLR performs better than MLLR, and therefore we restrict ourselves to CMLLR in this paper.

2.4. N-Gram Language Models. The linguistic information in the speech or singing to be recognized can be modeled using language models. A language model restricts the possible unit sequences to a set defined by the model. The language model can also provide probabilities for different sequences, which can be used together with the likelihoods of the acoustic HMM model to find the most likely phonetic sequence for an input signal. The language model consists of a vocabulary and a set of rules describing how the units in the vocabulary can be connected into sequences. The units in the vocabulary can be defined at different abstraction levels, such as phonemes, syllables, letters, or words.

An n-gram language model can be used to model the probabilities of unit sequences in the language model. It associates a probability with each subsequence of length n: given the n − 1 previous units w_{i−1}, w_{i−2}, ..., w_{i−n+1}, it defines the conditional probability P(w_i | w_{i−1}, w_{i−2}, ..., w_{i−n+1}) [27]. The probability of a whole sequence can be obtained as the product of the above conditional probabilities over all units i in the sequence. An n-gram of size one is referred to as a unigram, size two is a bigram, and size three is a trigram, while those of higher order are referred to as n-grams. Bigrams and trigrams are commonly used in automatic speech recognition.

It is not possible to include all possible words in a language model. The percentage of out-of-vocabulary (OOV) words affects the performance of the language model, since the recognition system cannot output them. Instead, the system will output one or more words from the vocabulary that are acoustically close to the word being recognized, resulting in recognition errors. While the vocabulary of the speech recognizer should be as large as possible to ensure low OOV rates, increasing the vocabulary size increases the acoustic confusions and does not always improve the recognition results.

A language model can be assessed by its perplexity, which measures the uncertainty in each word based on the language model. It can be viewed as the average size of the word set from which each word recognized by the system is chosen [28, pages 449–450]. The lower the perplexity, the better the language model is able to represent the text. Ideally, a language model should have a small perplexity and a small OOV rate on an unseen text.
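For illustration, a minimal sketch of estimating a bigram model by counting and measuring its perplexity on held-out sequences; the add-one smoothing is an assumption, since the paper does not state which smoothing was used:

import math
from collections import Counter

def train_bigram(sequences):
    # sequences: list of phoneme (or word) lists
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq[:-1], seq[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams):
    V = len(unigrams)                  # add-one smoothing (an assumption)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

def perplexity(sequences, unigrams, bigrams):
    log_sum, n = 0.0, 0
    for seq in sequences:
        for w_prev, w in zip(seq[:-1], seq[1:]):
            log_sum += math.log(bigram_prob(w_prev, w, unigrams, bigrams))
            n += 1
    return math.exp(-log_sum / n)

uni, bi = train_bigram([["dh", "ax", "k", "ae", "t"], ["dh", "ax", "d", "ao", "g"]])
print(perplexity([["dh", "ax", "d"]], uni, bi))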

The actual recognition is based on finding the most likely sequence of units that has produced the acoustic input signal. The likelihood consists of the contributions of the language model likelihood and the acoustic model likelihood. The influence of the language model can be controlled by the grammar scale factor, which multiplies the log likelihood of the language model. The number of words output by the recognizer can be controlled by the word insertion penalty, which penalizes the log likelihood by adding a cost for each word. The values of these parameters have to be tuned experimentally to optimize the recognizer performance.
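The way these two parameters enter the decoding score can be summarized in a toy scoring function (a sketch of the usual formulation; the default values below follow the settings reported later in Section 3.6):

def hypothesis_score(acoustic_loglik, lm_loglik, n_words, s=5.0, p=-10.0):
    # Total log score used to rank hypotheses: acoustic log-likelihood
    # plus the grammar scale factor times the language model log-likelihood,
    # plus the insertion penalty for every word in the output.
    return acoustic_loglik + s * lm_loglik + p * n_words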

In order to test phoneme recognition with a language model, we built unigram, bigram, and trigram language models. As n-grams are used for language recognition, we assume that a phoneme-level language model is characteristic of the English language and cannot differ significantly if estimated from general text or from lyrics text. For this reason, as training data for the phoneme-level language models we used the phonetic transcriptions of the speech database that was used for training the acoustic models.

To construct word language models for speech recognition we have to establish a vocabulary, chosen as the most frequent words from the training text data. In large-vocabulary recognition it is important to choose training text on a similar topic in order to have good coverage of the vocabulary and word combinations. For our work we chose to use song lyrics text, with the goal of keeping a 5 k vocabulary.

2.5. Separation of Vocals from Polyphonic Music. In order to enable the recognition of singing in polyphonic music, we employ the vocal separation algorithm of [29]. The algorithm separates the vocals using the time-varying pitch and enhances them by subtracting a background model. In our study on singer identification from polyphonic music, an earlier version of the algorithm led to a significant improvement [6]. The separation algorithm consists of the following processing steps.


(1) Estimate the notes of the main vocal line using the algorithm of [30]. The algorithm has been trained using singing material, and it is able to distinguish between singing and solo instruments, at least to some degree.

(2) Estimate the time-varying pitch of the main vocal line by picking the prominent local maxima in the pitch salience spectrogram near the estimated notes and interpolate between the peaks.

(3) Predict the frequencies of the overtones by assuming perfect harmonicity and generate a binary mask which indicates the predicted vocal regions in the spectrogram. We use a harmonic mask where a ±25 Hz bandwidth around each predicted partial in each frame is marked as speech (a sketch of this masking step is given after this list).

(4) Learn a time-varying background model by using the nonvocal regions and nonnegative spectrogram factorization (NMF) on the magnitude spectrogram of the original signal. A Hamming window and absolute values of the frame-wise DFT are used to calculate the magnitude spectrogram. We use a specific NMF algorithm which estimates the NMF model parameters using the nonvocal regions only. The resulting model represents the background accompaniment but not the vocals.

(5) The estimated NMF parameters are used to predict the amplitude of the accompaniment in the vocal regions, which is then subtracted from the mixture spectrogram.

(6) The separated vocals are synthesized by assigning the phases of the mixture spectrogram to the estimated magnitude spectrogram of the vocals and generating the time-domain signal by inverse DFT and overlap-add.
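As mentioned in step (3), a sketch of building the binary harmonic mask from a frame-wise pitch track; the FFT size and sample rate below are assumptions, while the ±25 Hz bandwidth follows the description above (the NMF background model and the resynthesis are not shown):

import numpy as np

def harmonic_mask(pitch_track, n_fft=2048, sr=44100, bandwidth=25.0):
    # pitch_track: fundamental frequency in Hz per frame (0 = unvoiced frame)
    freqs = np.arange(n_fft // 2 + 1) * sr / n_fft   # bin center frequencies
    mask = np.zeros((len(freqs), len(pitch_track)), dtype=bool)
    for t, f0 in enumerate(pitch_track):
        if f0 <= 0:
            continue
        # Predict overtone frequencies assuming perfect harmonicity
        harmonics = np.arange(1, int(freqs[-1] // f0) + 1) * f0
        for h in harmonics:
            mask[np.abs(freqs - h) <= bandwidth, t] = True
    return mask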

A more detailed description of the algorithm is given in [29]. The algorithm has been found to produce robust results on realistic music material. It improved the signal-to-noise ratio of the vocals on average by 4.9 dB on material extracted from karaoke DVDs and on average by 2.1 dB on material synthesized by mixing vocals and a MIDI background. The algorithm causes some separation artifacts because of erroneously estimated singing notes (insertions or deletions), and some interference from other instruments. The spectrogram decomposition was found to perform well in learning the background model, since the sounds produced by musical instruments are well represented with the model [31].

Even though the harmonic model for singing used in the algorithm does not directly allow representing unvoiced sounds, the use of the mixture phases carries some information about them. Furthermore, the rate of unvoiced sounds in singing is low, so that the effect of voiced sounds dominates the recognition.

3. Recognition Experiments

We study the effect of the above recognition techniques for recognition of phonemes and words in both clean singing and singing separated from polyphonic music. Different adaptation approaches and both phoneme-level and word-level language models are studied.

3.1. Acoustic Data. The acoustic models of the recognizer were trained using the CMU Arctic speech database (CMU ARCTIC databases for speech synthesis: http://festvox.org/cmuarctic/). For testing and adaptation of the models we used a database containing monophonic singing recordings, 49 fragments (19 male and 30 female) of popular songs, which we denote as vox clean. The lengths of the sung phrases are between 20 and 30 seconds, and they usually consist of a full verse of a song. For the adaptation and testing on clean singing we used m-fold cross-validation, with m depending on the test case. The total amount of singing material is 30 minutes, and it consists of 4770 phoneme instances.

To test the recognition system on polyphonic music we chose 17 songs from commercial music collections. The songs were manually segmented into structurally meaningful units (verse, chorus) to obtain sung phrases having approximately the same durations as the fragments in the monophonic database. We obtained 100 fragments of polyphonic music containing singing and instrumental accompaniment. We denote this database as poly 100. For this testing case, the HMMs were adapted using the entire vox clean database. In order to suppress the effect of the instrumental accompaniment, we applied the vocal separation algorithm described in Section 2.5.

The lyrics of both singing databases were manually annotated for reference. The transcriptions are used in the supervised adaptation procedure and in the evaluation of the automatic lyrics recognition.

3.2. Evaluation. We measure the recognition performance using the correct recognition rate and the accuracy of the recognition. They are defined in terms of the number of substitution errors S, deletion errors D, and insertion errors I, reported for the total number of tested instances N. The correct rate is given as

correct (%) = (N − D − S) / N × 100 (3)

and the accuracy as

accuracy (%) = (N − D − S − I) / N × 100. (4)

The two measures differ only in the number of insertions. In speech recognition, the reported results are usually word error rates. The error rate is defined as

error rate (%) = (D + S + I) / N × 100. (5)
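In code, the three measures are direct counts (a small worked example with made-up numbers):

def recognition_rates(N, S, D, I):
    # N tested instances, S substitutions, D deletions, I insertions
    correct = (N - D - S) / N * 100
    accuracy = (N - D - S - I) / N * 100
    error_rate = (D + S + I) / N * 100
    return correct, accuracy, error_rate

# A system with many insertions can have a reasonable correct rate but a
# negative accuracy, as in some of the nonadapted results reported below.
print(recognition_rates(100, 40, 20, 50))   # (40.0, -10.0, 110.0)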

3.3. Adaptation of Models to Singing Voice. For adapting the models to singing voice, we use a 5-fold setting for the vox clean database, with one fifth of the data used as test data at a time and the rest for adaptation. As each song was sung by multiple singers, the splitting into folds was done so that the same song appeared either in the test or in the adaptation set, not in both. The same singer was allowed in both the adaptation and testing sets. We adapt the models using a supervised adaptation procedure, providing the correct transcription to the system in the adaptation process.

Table 2: Phoneme recognition rates (39 + sil) for clean singing for systems adapted with different numbers of base classes in the first pass and 8 classes in the second pass.

First pass   Correct   Accuracy   Second pass   Correct   Accuracy
nonadapted   33.3%     −6.4%      −             −         −
G            41.2%     20.0%      GT8           41.3%     18.9%
G3           40.4%     19.9%      G3T8          41.4%     20.0%
G8           40.3%     18.7%      G8T8          41.1%     18.9%
G22          38.4%     18.7%      G22T8         40.7%     19.5%

Evaluation of the recognition performance was done without a language model; the number of insertion errors was controlled by the insertion penalty parameter, with its value fixed to p = −10 (for reasons explained in Section 3.6).

Table 2 presents recognition rates for systems adapted to singing voice using different numbers of classes in the first adaptation pass, as presented in Table 1. A single global transform (G) clearly improves the performance of the nonadapted system. A larger number of base classes in the first adaptation pass improves the performance in comparison with the nonadapted system, but the performance decreases compared to the single-class adaptation. Using all the information to estimate a global transform provides a more reliable adaptation than splitting the available information between different classes. In the case of multiple classes it can happen that for some class there is not enough data for estimating a robust enough transform.

A second pass was added, using classes defined by acoustic similarity. Different numbers of classes were clustered using the speech models and also the models already adapted to singing in the first pass. Figure 3 presents the average recognition results for 2 to 20 classes in the two cases. The differences in the adaptation are not statistically significant (the maximum 95% confidence interval for the test cases is 2%), but this might be due to the limited amount of data available for the adaptation. The adaptation classes constructed from the singing-adapted models reflect the characteristics of the signal better and lead to more reliable adaptation. Still, the performance does not vary much as a function of the number of base classes.

Table 2 also presents the recognition performance of systems adapted with different numbers of classes in the first pass and 8 classes in the second pass, using clustering to form these 8 classes. The second pass slightly improves the correct rate of systems where multiple classes were used in the first adaptation pass.

Figure 3: Correct and accuracy rates of systems adapted to singing using global adaptation in the first pass and 2 to 20 base classes in the second pass; the classes are determined by clustering acoustically similar Gaussians from the speech models (marked with circles) and from the models adapted to singing (marked with squares). The baseline of global adaptation is marked with a constant line.

For a better understanding of the effect of the adaptation process, in Table 3 we present the phoneme recognition rates of the nonadapted models and of one set of models adapted to singing, using as test data the speech database and the clean singing database. The system adapted to singing has much lower recognition rates on speech. The adaptation process modifies the models such that the adapted models no longer represent the speech vowels.

Table 3: Phoneme recognition rates (correct % / accuracy %) for speech and singing using nonadapted and singing-adapted models.

Test data   Nonadapted models   Models adapted to singing (G3T8)
speech      41.6/30.4           9.8/9.3
singing     33.3/−6.4           41.4/20.0

3.4. Gender-Dependent Adaptation. The gender differences in singing voices are much more evident than in speech, because of the different singing techniques explained in Section 2.1. Gender-dependent models are used in many cases in speech recognition [32].

Adaptation to the male singing voice is tested on four different male voices. The fragments belonging to the same voice were kept as test data, while the rest of the male singing in vox clean was used to adapt the speech models using a one-pass global adaptation. We do the same for adaptation to the female singing voice, using as adaptation data all the female singing fragments except the tested one.

The results for individual voices for the male and female gender-adapted recognition systems are presented in Table 4, together with the averages over genders. The other column in the table represents recognition results for the same test data using the nonadapted system.


Table 4: Phoneme recognition rates (correct % / accuracy %) for nonadapted and gender-adapted models for 4 different male and female sets.

Data set       Test fragments   Nonadapted   Gender-adapted
Male 1         3                39.4/8.1     59.3/28.8
Male 2         5                33.0/7.8     46.4/25.2
Male 3         3                36.4/6.6     52.2/28.2
Male 4         5                32.9/−10.7   48.0/1.9
Male average                    35.4/2.9     51.5/21.0
Fem 1          3                31.2/−16.1   52.3/10.9
Fem 2          5                33.4/−21.2   58.7/21.2
Fem 3          3                41.4/−11.7   60.7/9.9
Fem 4          4                29.5/−11.2   51.6/24.0
Fem average                     33.9/−15.0   55.8/16.5

The gender-specific adaptation improves the recognition performance for all the singers. Especially the recognition performance for female singers is improved, from negative accuracy values in the case of the nonadapted system. Negative accuracy means an error rate of over 100%, rendering the recognition results unusable. The testing in this case was also done without a language model, using the fixed insertion penalty p = −10 (see Section 3.6 for an explanation).

3.5. Singer-Specific Adaptation. The models adapted to singing voice can be further adapted to a target singer. We tested singer adaptation for three male and three female voices. The adaptation to singing was carried out in a one-pass global step, using all the singing material from vox clean except the target voice. After this, the adapted models were further adapted using another one-pass global adaptation and tested in 3-fold (for Male 1, Male 3, Fem 1, and Fem 3) or 5-fold (for Male 2 and Fem 2) cross-validation, so that one fragment at a time was used in testing and the rest as adaptation data.

Table 5 presents the recognition rates for the six target voices, for the nonadapted, adapted-to-singing, and adapted-to-target-voice systems. On average, the recognition performance of the singer-specific adapted systems is lower than that of the systems adapted to singing in general. The first adaptation estimates a transform from speech to singing voice, but its advantage is lost by trying to adapt the models further to a target singer. This may be due to the very small amount of singer-specific adaptation data, which the second adaptation tends to overfit [25].

There are significant differences between male and female singers, which explains why a gender-dependent recognizer performs better than a gender-independent one. The gender-adapted systems have lower accuracy than the systems adapted to singing, but a higher correct rate. The two situations may need different tuning of the recognition step parameters (here only p) in order to maximize the accuracy of the recognition, but we kept the same value for comparison purposes.

3.6. Language Models and Recognition Results. The phoneme-level language models were trained using the phonetic transcriptions of the speech database that was used for training the acoustic models. The database contains 1132 phonetically balanced sentences, over 48000 phoneme instances.

Table 5: Phoneme recognition rates (correct % / accuracy %) for 3 male and 3 female voice-adapted systems.

Voice    Nonadapted   Adapted to singing   Adapted to voice
Male 1   39.4/8.1     47.1/41.0            30.5/20.4
Male 3   36.4/6.6     42.9/33.1            28.7/12.5
Male 4   32.9/−10.7   36.0/16.2            35.0/20.4
Fem 1    31.2/−16.1   39.3/20.6            27.1/12.5
Fem 2    33.4/−21.2   37.7/22.0            36.4/12.5
Fem 3    41.4/−11.7   47.3/13.0            29.9/1.9

Table 6: Perplexities of bigram and trigram phoneme- and word-level language models on the training text (speech database transcriptions for phonemes, lyrics text for words) and on the test lyrics texts.

Language model    Training text   vox clean   poly 100
Phoneme bigram    11.5            11.8        11.3
Phoneme trigram   6.4             8.4         8.3
Word bigram       90.2            147.1       97.8
Word trigram      53.7            117.8       77.5
OOV %             5.9             2.2         2.5

To test the modeling capabilities of the constructed language models, we evaluate their perplexities on the test text in comparison with their perplexity on the training text. The perplexities of the phoneme bigram and trigram models on the speech training text and on the lyrics text from the vox clean and poly 100 databases are presented in Table 6. For a phoneme language model there is no concern about OOV, since all the phonemes are included in the vocabulary of the LM and of the recognizer. According to the perplexities, our assumption is correct, and the phoneme language model built using speech represents the lyrics text well too.

For constructing a word language model we used the lyrics text of 4470 songs, containing over 1.2 million word instances, retrieved from http://www.azlyrics.com/. From a total of approximately 26000 unique words, a vocabulary of 5167 words was chosen by keeping the words that appeared at least 5 times. The perplexities of the bigram and trigram word-level language models evaluated on the training data and on the lyrics text of the vox clean and poly 100 databases are also presented in Table 6. The percentage of OOV words on the training text mostly represents words in languages other than English, as well as the words that appeared too few times and were removed when choosing the vocabulary. The perplexities of the language models on poly 100 are not much higher than the ones on the training text, meaning that the texts are similar regarding the words used. The vox clean text is less well modeled by this language model. Nonetheless, in almost 4500 songs we could only find slightly over 5000 words that appear more than 5 times; thus, the language model for lyrics of mainstream songs is quite restricted vocabulary-wise.

Table 7: Phoneme recognition rates (correct % / accuracy %) for monophonic singing with no language model, unigram, bigram, or trigram, using gender-adapted models.

LM                  Male        Female
No language model   51.5/21.0   55.8/16.5
Phoneme unigram     44.8/32.8   47.8/29.3
Phoneme bigram      45.3/34.9   50.6/32.0
Phoneme trigram     49.3/16.3   56.2/16.6

In the recognition process, the acoustic models provide a number of hypotheses as output. The language model provides complementary knowledge about how likely those hypotheses are. The balance of the two components in the Viterbi decoding can be controlled using the grammar scale factor s and the insertion penalty parameter p. These parameters are usually set experimentally to values where the numbers of insertion errors and deletion errors in recognition are nearly equal. We fixed the values to s = 5 and p = −10, where the deletion and insertion errors for phoneme recognition using bigrams were approximately equal. No tuning was done for the other language models to maximize the accuracy of the recognition results.

3.6.1. Phoneme Recognition. The average phoneme recognition rates for clean singing voice using different language models are presented in Table 7. The systems used in the test use the gender-adapted models, with the adaptation steps and test settings described in Section 3.4. The parameters s and p are 5 and −10, respectively.

When there is no prior information about the phoneme probabilities (no language model), the rate of recognized phonemes is quite high, but with a low accuracy. Including phoneme probabilities in the recognition process (unigram language model), the accuracy of the recognition improves significantly. The bigram language model gives more control over the output of the recognizer, yielding better rates than the unigram. For the trigram language model we obtained a higher recognition rate but with a lower accuracy of the recognition. This case might also need different tuning of the language model control parameters to maximize the accuracy of the recognition, but we kept the same values for comparison purposes.

The 2000 NIST evaluation of Switchboard corpus automatic speech recognition systems [33] reports error rates of 39%–55% for phoneme recognition in speech, while the lowest error rate (100 − accuracy) in Table 7 is approximately 65%. Even though our singing recognition performance results are clearly lower, we find them encouraging, considering that singing recognition has not been studied before.

Phoneme recognition results for vocal lines separated from polyphonic music are presented in Table 8. We did not use the gender information about the polyphonic material; therefore we used the systems adapted to singing using the entire material from the vox clean database. The separated vocal line is more difficult to recognize because of interference from other sources which have not been properly separated, and also because of artifacts caused by the separation algorithm. In some cases parts of the singing are missing, for example, consonants being removed at the beginning of a word by the separation algorithm, resulting in recognition errors.

Table 8: Phoneme recognition rates (correct % / accuracy %) for the vocal line extracted from polyphonic music, with no language model, unigram, bigram, or trigram, using singing-adapted models.

LM                  Correct %   Accuracy %
No language model   23.5        5.8
Phoneme unigram     18.7        16.7
Phoneme bigram      21.8        19.2
Phoneme trigram     30.2        13.8

Table 9: Word recognition for clean singing and the vocal line extracted from polyphonic music, with bigram and trigram language models.

                LM             Correct %   Accuracy %
Clean singing   Word bigram    23.9        12.4
                Word trigram   21.0        −1.4
Polyphonic      Word bigram    6.8         5.5
                Word trigram   6.5         3.9

3.6.2. Word Recognition. Word recognition of monophonic singing was tested on the vox clean database in the 5-fold setup presented in Section 3.3. We use the word language models presented in Section 3.6, with a vocabulary size of 5167. The recognition results for the bigram and trigram language models are presented in Table 9. Again, the language model and insertion penalty parameters were kept fixed. In this case, the use of the bigram language model offers better results than the trigram. The trigram language model results in negative accuracy for the female test case, meaning there are too many insertion errors. The best result obtained is the correct recognition of one fifth of the words, using the bigram language model. The recognition rates for singing extracted from polyphonic music using the same vocabulary and language models are presented in the same table.

If a closed-vocabulary language model can be constructed from the lyrics of the songs in the database, then such knowledge gives an important advantage for recognition [9]. For example, in the case of the vox clean database, a bigram language model constructed from the lyrics text of the database has a vocabulary of only 185 words (compared to the vocabulary size of 5167 of the previously used language model) and a perplexity of 2.9 on the same text, offering a recognition rate of 55% with 40% accuracy for the singing-adapted models in the 5-fold test case.

The word recognition results are low, with even lower accuracy, and as a speech recognition tool this system fails. Still, for information retrieval purposes, even highly imperfect transcriptions of the lyrics can be useful. By maximizing the rate of correctly recognized words, even at the cost of producing a lot of insertion errors, the results may be useful. In the next section we present two applications for lyrics recognition.

4. Applications

4.1. Automatic Singing-to-Lyrics Alignment. Alignment of singing to lyrics refers to finding the temporal relationship between a possibly polyphonic music audio signal and the corresponding textual lyrics. We further present the system developed in [17].

A straightforward way to do the alignment is by creating a phonetic transcription of the word sequence comprising the text in the lyrics and aligning the corresponding phoneme sequence with the audio using the HMM recognizer. For alignment, the possible paths in the Viterbi search algorithm are restricted to just one string of phonemes, representing the input text.

The polyphonic audio from the poly 100 database was preprocessed to separate the singing voice. The text files are processed to obtain a sequence of words with optional silence, pause, and noise between them. An optional short pause is inserted between every two words in the lyrics. At the end of each line we insert an optional silence or noise event, to account for the voice rest and possible background accompaniment. An example of the resulting grammar for one of the test songs is

[sil | noise] I [sp] BELIEVE [sp] I [sp] CAN [sp] FLY [sil | noise] I [sp] BELIEVE [sp] I [sp] CAN [sp] TOUCH [sp] THE [sp] SKY [sil | noise] I [sp] THINK [sp] ABOUT [sp] IT [sp] EVERY [sp] NIGHT [sp] AND [sp] DAY [sil | noise] SPREAD [sp] MY [sp] WINGS [sp] AND [sp] FLY [sp] AWAY [sil | noise]

where the brackets [ ] enclose options and | denotes alternatives. This way, the alignment algorithm can choose to include pauses and noise where needed. The noise model was separately trained on instrumental sections from different songs, other than the ones in the test database.
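A small sketch of generating such a grammar string from the lyric lines (an illustration of the format above, not the authors' implementation):

def alignment_grammar(lyric_lines):
    # Optional short pause [sp] between words, optional silence or noise
    # [sil | noise] at the start of the lyrics and at the end of every line.
    parts = ["[sil | noise]"]
    for line in lyric_lines:
        parts.append(" [sp] ".join(line.upper().split()))
        parts.append("[sil | noise]")
    return " ".join(parts)

print(alignment_grammar(["I believe I can fly",
                         "I believe I can touch the sky"]))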

The text input contains a number of lines of text, each line corresponding roughly to one singing phrase.

The timestamps for the beginning and end of each line in the lyrics were manually annotated. As a performance measure of the alignment we use the average of the absolute alignment errors in seconds at the beginning and at the end of each lyric line. The absolute alignment errors range from 0 to 9 seconds. The average errors for the different adapted systems are presented in Table 10. The best achieved performance is a 0.94-second average absolute alignment error. Examples of alignments are presented in Figure 4.

Figure 4: Automatic alignment examples for two music pieces. The black line represents the manual annotation and the gray line the automatic alignment output. The errors are calculated at the ends of each black segment.

Table 10: Average alignment errors for different sets of singing-adapted models.

Adapted system   Avg. error (s)   Adapted system   Avg. error (s)
G3               1.27             G3T8             0.97
G8               1.31             G8T8             0.94
G22              1.31             G22T8            1.07

One main reason for misalignments is a faulty output of the vocal separation stage. Some of the songs are from the pop-rock genre, featuring loud instruments as an accompaniment, and the melody transcription (step 1 in Section 2.5) fails to pick the voice signal. In this case, the output contains a mixture of the vocals with some instrumental sounds, but the voice is usually too distorted to be recognizable. In other cases, the errors appear when the transcribed lyrics do not have lines corresponding to the singing phrases, so there are breathing pauses in the middle of a text line. In these cases even the manual annotation of the lyrics can have ambiguity.

The relatively low average alignment error indicates that this approach could be used to produce automatic alignment of lyrics for various applications, such as automatic lyrics display in karaoke.

4.2. Query-by-Singing Based on Word Recognition. In query-by-humming/singing, the aim is to identify a piece of music from its melody and lyrics. In a query-by-humming application, the search algorithm will transcribe the melody sung by the user and will try to find a match of the sung query with a melody from the database. For large databases, the search time can be significantly long. Assuming that we also have the lyrics of the songs we are searching through, the words output by a phonetic recognizer can be searched for in the lyrics text files. This will provide additional information and narrow down the melody search space. Furthermore, the lyrics will be more reliable than the melody in the case of less skilled singers.

The output of the recognition system sometimes offers words that are acoustically very similar to the correct ones, sometimes with a different spelling but the same phonetic transcription. For recognition performance evaluation they count as errors, but for music information retrieval purposes we do not need a perfect transcription of the lyrics. Some examples representing typical recognition results can be found in Table 11.

Table 11: Examples of errors in recognition.

Correct transcription                       Recognized
yesterday                                   yes today
seemed so far away                          seem to find away
my                                          my mama
finding the answer                          fighting the answer
the distance in your eyes                   from this is in your eyes
all the way                                 all away
cause it's a bittersweet symphony           cause I said bittersweet symphony
this life                                   this our life
trying to make ends meet                    trying to maintain sweetest
you're a slave to the money                 ain't gettin' money
then you die                                then you down
I heard you crying loud                     I heard you crying alone
all the way across town                     all away across the sign
you've been searching for that someone      you been searching for someone
and it's me out on the prowl                I miss me I don't apologize
as you sit around                           you see the rhyme
feeling sorry for yourself                  feelin' so free yourself

We built a retrieval system based on sung queries recognized by the system presented in Table 9 (23.93% correct recognition rate), which uses a bigram language model to recognize the clean singing voice in the presented 5-fold experiment. For this purpose, we constructed a lyrics database consisting of the text lyrics of the poly 100 and vox clean databases. We used as test queries the 49 singing fragments of the vox clean database. The recognized words for each query are matched to the content of the lyrics database to identify the queried song.

Table 12: Query-by-singing retrieval results.

                  Top 1   Top 5   Top 10
recognized (%)    57%     67%     71%

For retrieval we use a bag-of-words approach, simply searching for each recognized word in all the text files and ranking the songs according to the number of matched words. We consider a song correctly identified when the queried fragment appears among the first N ranked lyrics files. Table 12 presents the retrieval accuracy for N equal to 1, 5, and 10. The application shows promising results, with the first retrieved song being correct in 57% of the cases.
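A sketch of this bag-of-words ranking; the plain set-membership matching below is an assumption consistent with the description, and the song entries are made up:

def rank_songs(recognized_words, lyrics_db):
    # lyrics_db: dict mapping song id -> lyrics text
    scores = {}
    for song_id, text in lyrics_db.items():
        lyric_words = set(text.lower().split())
        scores[song_id] = sum(1 for w in recognized_words
                              if w.lower() in lyric_words)
    # Rank songs by the number of matched recognized words
    return sorted(scores, key=scores.get, reverse=True)

db = {"song_a": "I believe I can fly", "song_b": "yesterday all my troubles"}
print(rank_songs(["believe", "fly", "today"], db))   # ['song_a', 'song_b']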

5. Conclusions

This paper applied speech recognition methods to recognize lyrics from the singing voice. We attempted to recognize phonemes and words in singing voice from monophonic singing input and from polyphonic music. In order to suppress the effect of the instrumental accompaniment, a vocal separation algorithm was applied.

Due to the lack of large enough singing databases to train a singing recognizer, we used a phonetic recognizer that was trained on speech and applied speaker adaptation techniques to adapt the models to the singing voice. Different adaptation setups were considered. A general adaptation to singing using a global transform was found to provide a system with much higher performance in recognizing sung phonemes than the nonadapted one. Different numbers of base classes in the setup of the adaptation did not have much effect on the system performance. Separate adaptation using male and female data led to gender-dependent models, producing the best performance in phoneme recognition. More specific speaker-dependent adaptation did not improve the results, but this may be due to the limited amount of speaker-specific adaptation data in comparison with the adaptation data used to adapt the speech models to singing in general.

The recognition results are also influenced by the language models used in the recognition process. A phoneme bigram language model built on phonetically balanced speech data was found to increase the accuracy of the recognition by up to 13%, both for the clean singing test cases and for the vocal line extracted from polyphonic music. Word recognition in clean singing using a bigram language model built from lyrics text allows recognition of approximately one fifth of the sung words. In polyphonic music, the results are lower.

Even though the results are far from perfect, they have potential in music information retrieval. Our query-by-singing experiment indicates that a song might be retrieved based on the words that are correctly recognized from a user query. We also demonstrated the capability of the recognition methods in the automatic alignment of singing in polyphonic audio with its textual lyrics, where an average alignment error of 0.94 seconds was obtained.

The constructed applications prove that even such low recognition results can be useful in particular tasks. Still, it is important to find methods for improving the recognition rates. Ideally, a lyrics recognition system should be trained on singing material. We lack a large enough database of monophonic recordings, but we do have at our disposal plenty of polyphonic material. One approach could be to use vocals separated from polyphonic music for training the models. Also, considering that there are millions of songs out there, we know that we only selected a small amount of information to build the word language model. A better selection of the vocabulary could be obtained by using more text in the construction of the language model.

Acknowledgment

This research has been funded by the Academy of Finland.

References

[1] G. Tzanetakis and P. Cook, “Musical genre classification of audio signals,” IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.

[2] S. Downie, K. West, A. Ehmann, and E. Vincent, “The 2005 music information retrieval evaluation exchange (MIREX 2005): preliminary overview,” in Proceedings of the 6th International Conference on Music Information Retrieval (ISMIR ’05), 2005.


[3] S. Z. K. Khine, T. L. Nwe, and H. Li, “Singing voice detection in pop songs using co-training algorithm,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’08), pp. 1629–1632, 2008.

[4] C. Smit and D. P. W. Ellis, “Solo voice detection via optimal cancelation,” in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2007.

[5] W.-H. Tsai and H.-M. Wang, “Automatic singer recognition of popular music recordings via estimation and modeling of solo vocal signals,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 1, pp. 330–341, 2006.

[6] A. Mesaros, T. Virtanen, and A. Klapuri, “Singer identification in polyphonic music using vocal separation and pattern recognition methods,” in Proceedings of the International Conference on Music Information Retrieval (ISMIR ’07), 2007.

[7] W.-H. Tsai, S.-J. Liao, and C. Lai, “Automatic identification of simultaneous singers in duet recordings,” in Proceedings of the International Conference on Music Information Retrieval (ISMIR ’08), 2008.

[8] R. Typke, F. Wiering, and R. C. Veltkamp, “MIREX symbolic melodic similarity and query by singing/humming,” in International Music Information Retrieval Systems Evaluation Laboratory (IMIRSEL), http://www.music-ir.org/mirex/2006/index.php/Main Page.

[9] M. Suzuki, T. Hosoya, A. Ito, and S. Makino, “Music information retrieval from a singing voice using lyrics and melody information,” EURASIP Journal on Advances in Signal Processing, vol. 2007, Article ID 38727, 8 pages, 2007.

[10] A. Sasou, M. Goto, S. Hayamizu, and K. Tanaka, “An auto-regressive, non-stationary excited signal parameter estimation method and an evaluation of a singing-voice recognition,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’05), vol. 1, pp. 237–240, 2005.

[11] H. Fujihara, M. Goto, and J. Ogata, “Hyperlinking lyrics: a method for creating hyperlinks between phrases in song lyrics,” in Proceedings of the 9th International Conference on Music Information Retrieval (ISMIR ’08), 2008.

[12] M. Müller, F. Kurth, D. Damm, C. Fremerey, and M. Clausen, “Lyrics-based audio retrieval and multimodal navigation in music collections,” in Proceedings of the European Conference on Research and Advanced Technology for Digital Libraries, pp. 112–123, 2007.

[13] M. Gruhne, K. Schmidt, and C. Dittmar, “Phoneme recognition in popular music,” in Proceedings of the 8th International Conference on Music Information Retrieval (ISMIR ’07), 2007.

[14] H. Fujihara and M. Goto, “Three techniques for improving automatic synchronization between music and lyrics: fricative detection, filler model, and novel feature vectors for vocal activity detection,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’08), pp. 69–72, 2008.

[15] H. Fujihara, M. Goto, J. Ogata, K. Komatani, T. Ogata, and H. G. Okuno, “Automatic synchronization between lyrics and music CD recordings based on Viterbi alignment of segregated vocal signals,” in Proceedings of the 8th IEEE International Symposium on Multimedia (ISM ’06), pp. 257–264, 2006.

[16] M.-Y. Kan, Y. Wang, D. Iskandar, T. L. Nwe, and A. Shenoy, “LyricAlly: automatic synchronization of textual lyrics to acoustic music signals,” IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 2, pp. 338–349, 2008.

[17] A. Mesaros and T. Virtanen, “Automatic alignment of music audio and lyrics,” in Proceedings of the 11th International Conference on Digital Audio Effects (DAFx ’08), 2008.

[18] C. H. Wong, W. M. Szeto, and K. H. Wong, “Automatic lyrics alignment for Cantonese popular music,” Multimedia Systems, vol. 12, no. 4-5, pp. 307–323, 2007.

[19] J. Sundberg, The Science of the Singing Voice, Northern Illinois University Press, 1987.

[20] B. S. Atal, “Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,” Journal of the Acoustical Society of America, vol. 55, no. 6, pp. 1304–1312, 1974.

[21] C. J. Leggetter and P. C. Woodland, “Flexible speaker adaptation using maximum likelihood linear regression,” in Proceedings of the ARPA Spoken Language Technology Workshop, 1995.

[22] M. J. F. Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Computer Speech and Language, vol. 12, no. 2, pp. 75–98, 1998.

[23] M. Pitz and H. Ney, “Vocal tract normalization equals linear transformation in cepstral space,” IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 930–944, 2005.

[24] D. Pye and P. C. Woodland, “Experiments in speaker normalisation and adaptation for large vocabulary speech recognition,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP ’97), 1997.

[25] C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Computer Speech and Language, vol. 9, no. 2, pp. 171–185, 1995.

[26] A. Mesaros and T. Virtanen, “Adaptation of a speech recognizer to singing voice,” in Proceedings of the 17th European Signal Processing Conference, 2009.

[27] D. Jurafsky and J. H. Martin, Speech and Language Processing, Prentice-Hall, Upper Saddle River, NJ, USA, 2000.

[28] L. R. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, Upper Saddle River, NJ, USA, 1993.

[29] T. Virtanen, A. Mesaros, and M. Ryynänen, “Combining pitch-based inference and non-negative spectrogram factorization in separating vocals from polyphonic music,” in Proceedings of the ISCA Tutorial and Research Workshop on Statistical and Perceptual Audition (SAPA), 2008.

[30] M. P. Ryynänen and A. P. Klapuri, “Automatic transcription of melody, bass line, and chords in polyphonic music,” Computer Music Journal, vol. 32, no. 3, pp. 72–86, 2008.

[31] T. Virtanen, “Monaural sound source separation by non-negative matrix factorization with temporal continuity and sparseness criteria,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 3, 2007.

[32] P. C. Woodland, J. J. Odell, V. Valtchev, and S. J. Young, “Large vocabulary continuous speech recognition using HTK,” in Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP ’94), 1994.

[33] S. Greenberg, S. Chang, and J. Hollenback, “An introduction to the diagnostic evaluation of the Switchboard-corpus automatic speech recognition systems,” in Proceedings of the NIST Speech Transcription Workshop, 2000.
