Vocal fold disorder detection based on continuous speech by using MFCC and GMM

Zulfiqar Ali 1,2, Mansour Alsulaiman 1, Ghulam Muhammad 1, Irraivan Elamvazuthi 2
1 Digital Speech Processing Group, College of Computer and Information Sciences, King Saud University, Saudi Arabia
Email: {zuali, msuliman, ghulam}@ksu.edu.sa
2 Department of Electrical and Electronic Engineering, Universiti Teknologi PETRONAS, Malaysia
Email: [email protected]

Tamer A. Mesallam
ENT Department, College of Medicine, King Saud University, Saudi Arabia
Email: [email protected]

2013 IEEE GCC Conference and Exhibition, November 17-20, Doha, Qatar. 978-1-4799-0724-3/13/$31.00 ©2013 IEEE.

Abstract—Vocal fold voice disorder detection with a sustained vowel has been well investigated by the research community in recent years. Detecting a voice disorder from a sustained vowel is a comparatively easier task than detecting it from continuous speech: the speech signal remains stationary in the case of a sustained vowel, but it varies over time in continuous speech. For this reason, voice disorder detection using continuous speech is challenging and demands more investigation. Moreover, detection with continuous speech is more realistic, because people use continuous speech in their daily conversation, whereas sustained vowels are not used in everyday talk. An accurate voice assessment can provide unique and complementary information for the diagnosis and can be used in the treatment plan. In this paper, the vocal fold disorders cyst, polyp, nodules, paralysis, and sulcus are detected using continuous speech. Mel-frequency cepstral coefficients (MFCC) are used with a Gaussian mixture model (GMM) to build an automatic detection system capable of differentiating normal and pathological voices. The detection rate of the developed detection system with continuous speech is 91.66%.

Keywords-Voice disorder; pathology detection; continuous speech; MFCC; GMM

I. INTRODUCTION

Many voice clinicians use sustained vowel samples rather than continuous speech samples in performing acoustic analysis for their patients. Although some researchers have found that the sustained vowel is optimal for obtaining a voice sample for a variety of reasons, it does not truly represent voice use patterns in daily speech. At the same time, fluctuations of vocal characteristics in relation to voice onset, voice termination, and voice breaks, which are considered crucial in voice quality evaluation, are not fully represented in short signals of phonation such as a sustained vowel. Furthermore, dysphonia symptoms are often more evident in conversational voice production than in sustained vowels, and they are most often exhibited by dysphonic persons themselves in continuous speech. In addition, some voice pathologies, such as adductor spasmodic dysphonia, can be hard to distinguish from a relatively normal voice during sustained vowel production. Moreover, some of the acoustic correlates of an individual's voice result from the influence of the segmental and supra-segmental structure of speech and cannot be represented in a sustained vowel.

Detection is the first crucial step to correctly diagnose and manage voice disorders. Interest in the objective assessment of voice pathology has grown over the last several years, and voice pathology detection and classification is a topic that has interested the international voice community [1]. Objective measurement that includes acoustic analysis is independent of human bias and can assess voice quality more reliably by relating certain parameters to vocal fold behavior; subjective measurement of voice quality, on the other hand, is based on individual experience [2-5]. Many automatic voice disorder systems have been developed for objective evaluation of voice disorders.

An automatic system consists of two important steps: the first is feature extraction, e.g., linear predictive coefficients (LPC) [5, 6], linear predictive cepstral coefficients (LPCC) [7], or Mel-frequency cepstral coefficients (MFCC) [8, 9]; the second is a pattern matching technique, e.g., the Gaussian mixture model (GMM) [10, 11], the hidden Markov model (HMM) [12], the support vector machine (SVM) [13], or artificial neural networks (ANN) [14]. Automatic voice pathology detection systems that use only sustained vowels are developed in [15-18]. Moreover, a few detection systems have been proposed using MFCC [19-23], and MFCC has shown better performance in disease detection than LPC. In [24], various experiments are performed on a voice pathology detection system using the Massachusetts Eye and Ear Infirmary (MEEI) [25] Voice and Speech Laboratory database. Multi-dimensional voice program (MDVP) parameters and MFCC features are extracted from all


sustained vowels /a/ and are fed into various modeling techniques. The highest accuracy with MDVP parameters for the sustained vowel /a/ is 97.67%, with a GMM-based system. MFCC, alone and with pitch, is fed to an HMM for pathology detection; the best obtained accuracy for /a/ with MFCC is 97.75%. In [26], MFCC is extracted from the sustained vowel /a/ of dysphonic and normal voices from the database of the ENT department of Busan National University Hospital, Korea, and used with HMM, GMM, SVM, and ANN for disease detection. The voice pathologies present in this database are cyst, edema, laryngitis, nodule, palsy, polyp, and glottis cancer. The highest pathology detection rate obtained is 95.2%, with GMM. In [27], a locally recorded database containing 29 dysphonic patients and 52 normal persons is developed for disorder detection. In this study, LPC features are fed to a vector quantization (VQ) technique, and 82.6% correct detection of pathology is obtained. In [28], LPC is extracted from 21 vocal fold edema, 21 vocal fold paralysis, and 21 normal samples of the sustained vowel /a/ taken from the MEEI database and input to k-nearest neighbors (KNN) and a neural network (NN). Vocal fold paralysis is distinguished from normal voices with classification rates of 84% and 85% for KNN and NN, respectively. Vocal fold edema is detected with 75% and 83% accuracy with KNN and NN, respectively.

However, in everyday life, people do not use sustained vowels; instead, they use continuous speech. Therefore, for an automatic voice pathology detection system to work in practice, it is better to detect pathology from continuous sentences. There are reports on voice pathology detection systems using continuous speech, but they are far from optimal [29, 30, 31]. In this paper, an automatic voice disorder detection system for continuous speech is developed with MFCC and GMM, and good results are obtained.

The rest of the paper is organized as follows: Section II provides an overview of the speech database; Section III presents the automatic voice detection system; Section IV describes the experimental setup and discusses the results; finally, Section V draws conclusions.

II. SPEECH DATABASE

The speech samples were collected in different sessions at the Communication and Swallowing Disorders Unit, King Abdul Aziz University Hospital, Riyadh, Saudi Arabia, by experienced phoneticians in a soundproof room using a standardized recording protocol. The database collection is one of the major tasks of an ongoing two-year project funded by the National Plans for Science and Technology (NPST), Saudi Arabia. The protocol of the database is designed to avoid various shortcomings of the MEEI database [32]. The project records the speech of patients having vocal fold disorders, and the speech of normal persons after clinically verifying that their voices are normal. The recordings include different types of text: three vowels with onset and offset information, isolated words containing the Arabic digits and some common words, and continuous speech. The selected text covers all Arabic phonemes. All speakers record three utterances of each vowel /a/, /u/, and /i/, while the isolated words and continuous speech are recorded once, to avoid placing a burden on the patients. The sampling frequency of the database is 50 kHz, and the speech is recorded using the Kay Pentax Computerized Speech Lab (CSL Model 4300).

In this paper, continuous speech is used for the detection of voice disorders and is downsampled to 16 kHz for the experiments. The recorded disorders are cyst, polyp, sulcus vocalis, nodules, and paralysis. Twenty-six patients and 12 normal people have been examined in the clinics so far, and part of their recordings is used in this paper. The continuous speech is the first chapter of the Holy Book of the Muslims. The text, with pronunciation and English translation, is provided in Table I. This text was selected because every patient and normal person has memorized it, even if he or she is illiterate.

TABLE I. CONTINUOUS SPEECH FOR VOICE DISORDER DETECTION

Translation in English | Pronunciation | Arabic
In the name of God, the infinitely Compassionate and Merciful. | Bismillaah ar-Rahman ar-Raheem | بسم الله الرحمن الرحيم
Praise be to God, Lord of all the worlds. | Al hamdu lillaahi rabbil 'alameen | الحمد لله رب العالمين
The Compassionate, the Merciful. | Ar-Rahman ar-Raheem | الرحمن الرحيم
Ruler on the Day of Reckoning. | Maaliki yaumid Deen | مالك يوم الدين
You alone do we worship, and You alone do we ask for help. | Iyyaaka na'abudu wa iyyaaka nasta'een | إياك نعبد وإياك نستعين
Guide us on the straight path, | Ihdinas siraatal mustaqeem | اهدنا الصراط المستقيم
The path of those who have received your grace; | Siraatal ladheena an'amta 'alaihim | صراط الذين أنعمت عليهم
Not the path of those who have brought down wrath, nor of those who wander astray. | Ghairil maghduubi 'alaihim waladaaleen | غير المغضوب عليهم ولا الضالين

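The 50 kHz recordings are downsampled to 16 kHz before the experiments. A minimal sketch of that rate conversion, assuming SciPy is available: 50000/16000 reduces to the rational factor 25/8, so the signal is upsampled by 8 and decimated by 25 with a built-in anti-aliasing filter.

```python
import numpy as np
from scipy.signal import resample_poly

def downsample_50k_to_16k(signal):
    """Downsample a 50 kHz recording to 16 kHz (50000/16000 = 25/8)."""
    return resample_poly(signal, up=8, down=25)

# One second of a 440 Hz tone recorded at 50 kHz.
t = np.arange(50000) / 50000.0
x = np.sin(2 * np.pi * 440 * t)
y = downsample_50k_to_16k(x)
print(len(y))  # 16000 samples, i.e. one second at 16 kHz
```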

(Continuous speech → frame blocking → frame windowing → FFT spectrum → Mel filter bank → Mel-weighted spectrum → log compression → DCT → MFCC)

Figure 1. Block diagram of MFCC calculation

III. AUTOMATIC DISORDER DETECTION SYSTEM

The developed automatic detection system has two phases: a feature extraction technique (MFCC) and a pattern matching technique (GMM). Both techniques are briefly described in the following subsections.

A. Mel-frequency Cepstral Coefficients

MFCC simulates the human auditory mechanism and performs reasonably well under adverse conditions. Fig. 1 shows a block diagram of the MFCC calculation. First, the digitized wave data is divided into overlapping frames, where the frame length is 16 milliseconds. This division is needed to analyze the speech in small pseudo-stationary segments.
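The frame-blocking step can be sketched as follows. Only the 16 ms frame length is stated in the paper, so the 8 ms hop (50% overlap) used here is an assumption for illustration.

```python
import numpy as np

def frame_blocking(signal, fs=16000, frame_ms=16, hop_ms=8):
    """Divide a signal into overlapping pseudo-stationary frames.

    frame_ms = 16 follows the paper; the 8 ms hop (50% overlap)
    is an assumed value, since the paper does not state it.
    """
    N = int(fs * frame_ms / 1000)    # 256 samples per frame at 16 kHz
    hop = int(fs * hop_ms / 1000)    # 128-sample frame shift
    n_frames = 1 + (len(signal) - N) // hop
    return np.stack([signal[i * hop : i * hop + N] for i in range(n_frames)])

x = np.random.randn(16000)           # one second of audio at 16 kHz
frames = frame_blocking(x)
print(frames.shape)                  # (124, 256)
```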

The resultant frame is multiplied by a Hamming window to minimize the effect of spectral leakage. The Hamming window has values close to zero toward both ends, ensuring the continuity of the signal across successive frames. The Hamming window is given in Eq. 1:

w(n) = 0.54 − 0.46 cos(2πn / (N − 1)),  0 ≤ n ≤ N − 1   (1)

where N is the number of samples in a frame. The Fourier transform (FT) is applied to the windowed signal to convert the time-domain signal into a frequency-domain signal (spectrum). Triangular band-pass filters (BPFs) are applied to divide the spectrum into certain frequency bands. The center frequencies of the BPFs are spaced on the Mel scale, and the bandwidths correspond to the well-known auditory perception phenomenon called the critical bandwidth. The relation between the Mel scale and the linear scale (Hz) is almost linear up to around 800 Hz and logarithmic beyond that (see Eq. 2):

mel(f) = 2595 log10(1 + f / 700),  m = 1, 2, ..., P   (2)

In Eq. 2, m corresponds to the Mel filter index (P filters are used) and f refers to the frequency in Hz. By applying P = 24 BPFs, the N points of the spectrum are converted to only 24 values. The logarithm is applied to the 24 outputs to make the convolution components additive and to adjust the dynamic range of the spectrum; in this way, the source excitation signal and the vocal tract filter response become additive. The log outputs are then passed through the discrete cosine transform (DCT) to de-correlate the components and reduce the dimension. The output of the DCT is called MFCC. Typically, the 0-th coefficient is ignored and the 1st to 12-th coefficients are retained to represent 12 MFCC. First- and second-order derivatives are calculated from these 12 components to obtain velocity and acceleration coefficients. These derivatives are normally extracted using linear regression, as in Eq. 3.

Δ_t = Σ_{i=1}^{B} i (c_{t+i,m} − c_{t−i,m}) / (2 Σ_{i=1}^{B} i²)   (3)

where Δ_t corresponds to a velocity component at the t-th frame, c_{t,m} stands for the m-th MFCC at the t-th frame, and B is the length of the regression window.
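The regression of Eq. 3 can be sketched directly. B = 2 and edge-padding of the first and last frames are assumptions here, since the paper does not fix the regression window length or the boundary handling.

```python
import numpy as np

def delta(cepstra, B=2):
    """Velocity coefficients per Eq. 3:
    delta_t = sum_{i=1}^{B} i*(c_{t+i,m} - c_{t-i,m}) / (2 * sum_{i=1}^{B} i^2)
    applied independently to each cepstral dimension. Frames are
    edge-padded so every frame gets a delta (a common convention,
    not specified in the paper); B = 2 is an assumed window length.
    """
    T, M = cepstra.shape
    padded = np.pad(cepstra, ((B, B), (0, 0)), mode="edge")
    denom = 2 * sum(i * i for i in range(1, B + 1))
    out = np.zeros_like(cepstra, dtype=float)
    for t in range(T):
        acc = np.zeros(M)
        for i in range(1, B + 1):
            acc += i * (padded[t + B + i] - padded[t + B - i])
        out[t] = acc / denom
    return out

c = np.arange(20, dtype=float).reshape(10, 2)  # toy MFCC: 10 frames, 2 dims
d = delta(c)
# For a linear ramp, interior deltas equal the frame-to-frame slope.
print(d[5])  # [2. 2.]
```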

B. Gaussian Mixture Model

GMM [33] is a state-of-the-art modeling technique that models the distribution of the features in feature space rather than the time sequence of their appearance. Healthy and pathological persons are modeled by GMMs that represent, in a weighted manner, the occurrence of the feature vectors. The well-known method to train a speaker GMM is the Expectation-Maximization (EM) algorithm, in which the model parameters (means, variances, and mixture coefficients) are adapted and tuned to converge to a model giving a maximum log-likelihood value.

The GMM is given by a weighted sum of individual Gaussians:

p(X|λ) = Σ_{i=1}^{M} w_i g(X|μ_i, Σ_i)   (4)

where X is a D-dimensional continuous-valued data vector (i.e., measurements or features), w_i are the mixture weights, and g(X|μ_i, Σ_i) are the component Gaussian densities. Each component density is a D-dimensional Gaussian function of the form

g(X|μ_i, Σ_i) = (1 / ((2π)^{D/2} |Σ_i|^{1/2})) exp(−(1/2) (X − μ_i)' Σ_i^{−1} (X − μ_i))   (5)

with mean vector μ_i and covariance matrix Σ_i. The mixture weights satisfy the constraint Σ_{i=1}^{M} w_i = 1. The GMM is denoted as λ = (w_i, μ_i, Σ_i), i = 1, 2, ..., M.
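The two-model detection scheme (one GMM for normal voices, one for pathological voices, each trained with EM) can be sketched with scikit-learn's GaussianMixture. The Gaussian toy features and diagonal covariances below are illustrative assumptions, not the paper's actual data or configuration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Stand-ins for 12-dimensional MFCC frames; real features would come
# from the extraction stage described in Section III-A.
normal_train = rng.normal(0.0, 1.0, size=(500, 12))
pathol_train = rng.normal(3.0, 1.0, size=(500, 12))

# One GMM per class, trained with EM (4 mixtures, as in the paper's
# smallest configuration; diagonal covariances are an assumption).
gmm_normal = GaussianMixture(n_components=4, covariance_type="diag",
                             random_state=0).fit(normal_train)
gmm_pathol = GaussianMixture(n_components=4, covariance_type="diag",
                             random_state=0).fit(pathol_train)

def detect(frames):
    """Label an utterance by the model with the higher total
    log-likelihood over all of its frames."""
    ll_n = gmm_normal.score_samples(frames).sum()
    ll_p = gmm_pathol.score_samples(frames).sum()
    return "pathological" if ll_p > ll_n else "normal"

test_utt = rng.normal(3.0, 1.0, size=(100, 12))  # pathological-like frames
print(detect(test_utt))  # pathological
```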



IV. EXPERIMENTS AND RESULTS

Different experiments, using different numbers of MFCC and Gaussian mixtures, are performed to observe the accuracy of the developed automatic voice disorder detection system. The database is divided into two parts: 60% of the samples of patients and normal persons are used for training, and the remaining 40% are used for testing. Different numbers of MFCC are extracted: 12, 24, and 36, where the 24 and 36 coefficients consist of 12 MFCC + 12 delta and 12 MFCC + 12 delta + 12 delta-delta coefficients, respectively. The models for disordered and normal speech are generated with different numbers of Gaussian mixtures: 4, 8, 16, and 32. The results of the conducted experiments are presented in Table II.
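The experimental grid above (a 60/40 per-class split, three feature dimensionalities, four mixture counts) can be sketched as follows; the file names are hypothetical placeholders for the database's recordings.

```python
import random

random.seed(0)
# Hypothetical file lists matching the paper's 26 patients and 12 normals.
patient_files = [f"patient_{i:02d}.wav" for i in range(26)]
normal_files = [f"normal_{i:02d}.wav" for i in range(12)]

def split_60_40(files):
    """60% of each class for training, the remaining 40% for testing."""
    files = files[:]
    random.shuffle(files)
    cut = int(round(0.6 * len(files)))
    return files[:cut], files[cut:]

train_p, test_p = split_60_40(patient_files)
train_n, test_n = split_60_40(normal_files)

# The experiment grid: 12/24/36 coefficients x 4/8/16/32 mixtures.
configs = [(n_mfcc, n_mix) for n_mfcc in (12, 24, 36)
           for n_mix in (4, 8, 16, 32)]
print(len(train_p), len(test_p), len(train_n), len(test_n), len(configs))
```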

The performance of the automatic systems is evaluated by the following parameters:

True negative (TN): The system detects normal voice as normal voice.

True positive (TP): The system detects disordered voice as disordered voice.

False negative (FN): The system detects disordered voice as normal voice.

False positive (FP): The system detects normal voice as disordered voice.

Sensitivity (SE): The likelihood that the system detects a disordered voice when the input is a disordered voice [11]:

SE = TP / (TP + FN) × 100

Specificity (SP): The likelihood that the system detects a normal voice when the input is a normal voice [11]:

SP = TN / (TN + FP) × 100

Accuracy (Acc. %): The ratio between correctly detected files and the total number of files.

AUC: The area under the receiver operating characteristics (ROC) curve.
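The sensitivity, specificity, and accuracy definitions above can be computed directly from the confusion counts. The counts in the example are hypothetical, chosen only to reproduce the rates reported in Table II.

```python
def evaluate(tp, tn, fp, fn):
    """Sensitivity, specificity, and accuracy from the confusion
    counts defined above (all as percentages)."""
    se = tp / (tp + fn) * 100.0
    sp = tn / (tn + fp) * 100.0
    acc = (tp + tn) / (tp + tn + fp + fn) * 100.0
    return se, sp, acc

# Hypothetical counts: 8 pathological and 4 normal test samples, with
# 7 pathological and all 4 normal samples detected correctly.
se, sp, acc = evaluate(tp=7, tn=4, fp=0, fn=1)
print(round(se, 2), round(sp, 2), round(acc, 2))  # 87.5 100.0 91.67
```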

TABLE II. PERFORMANCE OF THE DEVELOPED DETECTION SYSTEM

(results identical for 12, 24, and 36 MFCC)

Performance parameter | 4 GMM | 8 GMM | 16 GMM | 32 GMM
Sensitivity | 87.5 | 87.5 | 87.5 | 87.5
Specificity | 100 | 100 | 100 | 100
Accuracy | 91.66 | 91.66 | 91.66 | 91.66
AUC | 90.91 (95% C.I. [0.75, 1.0]; 1-tail p-value 0.000245 < 0.05)

The values of the performance parameters are the same for 12, 24, and 36 MFCC; increasing the number of coefficients has no effect on the performance of the developed system. Similarly, no effect is observed when the number of mixtures in the GMM is increased. The obtained sensitivity of 87.5% expresses the proportion of correctly detected pathological voice samples, and the specificity of 100% indicates that all normal samples are correctly classified. Overall, an accuracy of 91.66% is achieved. The trend of the performance parameters is depicted in Fig. 2. The receiver operating characteristic (ROC) curve, shown in Fig. 3, is a graphical way to assess the system's ability to discriminate between healthy and pathological samples. The area under the ROC curve is 0.9091, which shows the efficiency of the developed system. The p-value (< 0.05) of the t-test illustrates the significant difference between the healthy and pathological data.

Figure 2. Trend of performance parameters for voice disorder detection

Figure 3. ROC curve for voice disorder system


V. CONCLUSION

Many vocal fold detection systems have been developed for voice pathology detection using sustained vowels, with good results. Voice pathology detection systems based on continuous speech would be a better alternative that is easier to use in practice. Therefore, in this paper we presented a voice disorder detection system based on continuous speech. The developed system achieved a very good performance and detection rate.

As discussed in Section I, among cepstral coefficients, MFCC is a good choice for voice disorder detection with a sustained vowel compared to other speech features. The results of the developed system show that MFCC performs well even with continuous speech. Disorder detection with continuous speech can be further investigated by applying other speech features and comparing their performance with MFCC. To build systems with better performance, we need to increase the number of samples in the database; this is what we hope to achieve by the end of the project.

ACKNOWLEDGMENT

This work is supported by the National Plan for Science and Technology in King Saud University under grant number 12-MED2474-02. The authors are grateful for this support.

REFERENCES

[1] G. Muhammad, T. A. Mesallam, K. H. Malki, M. Farahat, M. Alsulaiman, and M. Bukhari, "Formant analysis in dysphonic patients and automatic Arabic digit speech recognition", BioMedical Engineering OnLine, vol. 10, no. 40, pp. 1-12, 2011.

[2] National Institute on Deafness and Other Communication Disorders: Voice, Speech, and Language: Quick Statistics, 2011. Available at http://www.nidcd.nih.gov/health/statistics/vsl/Pages/stats.aspx

[3] Research Chair of Voicing and Swallowing Disorders. Available at http://vas.ksu.edu.sa/en/page-122.html

[4] N. Roy, R.M. Merrill, S. Thibeault, R.A. Parsa, S.D. Gray, and E.M. Smith, “Prevalence of voice disorders in teachers and the general population,” J Speech Lang Hear Res., vol.47, no. 2, pp. 281-93, Apr 2004.

[5] B. S. Atal, "Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and recognition", J. Acoustic. Soc. Amer., vol. 54, no. 6, pp. 1304-1312, 1974.

[6] L. Xugang, and D. Jianwu, “An investigation of dependencies between frequency components and speaker characteristics for text-independent speaker identification”, Speech Communication’ 07, vol. 50, no. 4, pp. 312-322, Oct 2007.

[7] M. A. Anusuya, S. K. Katti, “Front end analysis of speech recognition: a review”, International Journal of Speech Technology, vol. 14, pp. 99-145, Dec. 2010.

[8] L. Rabiner and B.H. Juang, Fundamentals of speech recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.

[9] Z. Ali, M. Aslam, and M.E. Ana María, "A speaker identification system using MFCC features with VQ technique", Proceedings of the 3rd IEEE International Symposium on Intelligent Information Technology Application, pp. 115-119, 2009.

[10] W.J.J. Roberts, and J.P. Willmore, "Automatic speaker recognition using Gaussian mixture models", proceedings of Information, Decision and Control, IDC’99, pp. 465 – 470, 1999.

[11] J.I. Godino-Llorente, P. Gomes-Vilda and M. Blanco-Velasco, "Dimensionality reduction of a pathological voice quality assessment

system based on Gaussian mixture models and short-term cepstral parameters", IEEE Transactions on Biomedical Engineering, vol. 53, no. 10, pp. 1943-1953. Oct. 2006.

[12] L.E. Baum and T. Petrie, “Statistical inference for probabilistic functions of finite state Markov Chains”, Ann. Math. Stat., vol. 37, pp. 1554-1563, 1966.

[13] S. Abe, Support Vector Machines for Pattern Classification. Springer-Verlag, Berlin Heidelberg New York, 2005

[14] T. Ritchings, M. McGillion, and C. Moore, “Pathological voice quality assessment using artificial neural networks,” Med. Eng. Phys., vol. 24, no. 8, pp. 561–564, Sept 2002.

[15] Y. D. Heman-Ackah, R. J. Heuer, D. D. Michael, R. Ostrowski, M. Horman, M. Baroody, J. Hillenbrand, and R. T. Sataloff, "Cepstral peak prominence: a more reliable measure of dysphonia," Ann Otol Rhinol Laryngol., vol. 112, no. 4, pp. 324-333, 2003.

[16] M. Vieira, F. Mclnnes, and M. Jack, “On the influence of laryngeal pathologies on acoustic and electroglottalgraphic jitter measures,” Journal of Acoustic Society of America, vol. 111, no. 2, pp. 1045-1055, 2002.

[17] R.J. Moran, R.B. Reilly, P. Chazal, and P.D. Lacy, "Telephony-based voice pathology assessment using automated speech analysis," IEEE Trans. Biomedical Engineering, vol. 53, no. 3, pp. 468-477, 2006.

[18] R.C. Rabinov, J. Krieman, B.R. Gerratt, and S. Bielamowicz, “Comparing reliability of perceptual ratings of roughness and acoustic measures of jitter,” Journal Speech Hearing Research, vol. 38, pp. 26–32, 1995.

[19] S.C. Costa, B.G. Aguiar Neto, J.M. Fechine, "Pathological voice discrimination using cepstral analysis, vector quantization and hidden markov models", Proceedings of 8th IEEE International Conference on BioInformatics and BioEngineering, BIBE, pp. 1 – 5, 2008.

[20] J.I. Godino-Llorente and P. Gomez-Vilda, “Automatic detection of voice impairments by means of short-term cepstral parameters and neural network based detectors”, IEEE Trans Biomed. Eng., vol. 51, pp. 380–384, 2004.

[21] A. Gelzinis, A. Verikas, and M. Bacauskiene , "Automated speech analysis applied to laryngeal disease categorization", Journal of Computer Methods and Programs in Biomedicine, vol. 91, no. 1, pp. 36-47, July 2008.

[22] A.A. Dibazar, T. W. Berger, and S. Narayanan, “Pathological voice assessment,” Engineering in Medicine and Biology Society, 2006, EMBS '06. 28th Annual International Conference of the IEEE, pp. 1669 –1673, Aug 2006.

[23] J.D. Arias-Londoño, J.I. Godino-Llorente, N. Sáenz-Lechón, and V. Osma-Ruiz, "An improved method for voice pathology detection by means of a Hmm-based feature space transformation", J. Pattern Recognition, vol. 43, no. 9, pp. 3100-3112, Sep 2010.

[24] A.A. Dibazar, S. Narayanan, and T.W. Berger, "Feature analysis for automatic detection of pathological speech", Proceedings of the 2nd Joint Conference of EMBS/BMES, vol. 1, Houston, TX, USA, 2002.

[25] “Disorder Database Model 4337” Massachusetts Eye and Ear Infirmary Voice and Speech Lab, Boston, MA, Jan. 2002.

[26] J. Wang, C. Jo, "Vocal folds disorder detection using pattern recognition method", Proceedings of 29th Annual International Conference of the IEEE EMBS, pp. 3253-3256, Lyon, France, 2007.

[27] D.G. Childers, and K. Sung-Bae, “Detection of laryngeal function using speech and electroglottographic data”, IEEE Trans. Biomed. Eng., vol. 39, no. 1, pp. 19–25, Jan 1992.

[28] M. Marinaki, C. Kotropoulos, I. Pitas, and N. Maglaveras, “Automatic detection of vocal fold paralysis and edema”, Proceedings of ICSLP ’04, Jeju Island, South Korea, Nov. 2004.

[29] F. Klingholz, “Acoustic recognition of voice disorders: A comparative study, running speech versus sustained vowels,” Journal of Acoustic Society of America, vol. 87, pp. 2218–2224, 1990.

[30] J. I. Godino-Llorente, R. Fraile, N. Saenz-Lechon, V. Osma-Ruiz, and P. Gomez-Vilda, "Automatic detection of voice impairments from


text-dependent running speech," Biomedical Signal Processing and Control, vol. 4, pp. 176–182, 2009.

[31] Y. Maryn, C. Dick, C. Vandenbruaene, T. Vauterin, and T. Jacobs, "Spectral, cepstral, and multivariate exploration of tracheoesophageal voice quality in continuous speech and sustained vowels," Laryngoscope, vol. 119, pp. 2384–2394, 2009.

[32] N. Sáenz-Lechón, J.I. Godino-Llorente, Ví. Osma-Ruiz, and P. Gómez-Vilda, “Methodological issues in the development of automatic systems for voice pathology detection,” Biomedical Signal Processing and Control, vol. 1, no. 2, pp. 120-128, April 2006.

[33] D. A. Reynolds, “Gaussian Mixture Models”, In proc. of Encyclopedia of Biometrics, pp. 659-663, 2009.
