IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 22, NO. 12, DECEMBER 2014

An Investigation into Back-end Advancements for Speaker Recognition in Multi-Session and Noisy Enrollment Scenarios

Gang Liu, Student Member, IEEE, and John H. L. Hansen, Fellow, IEEE

Abstract—This study aims to explore the case of robust speaker recognition with multi-session enrollments and noise, with an emphasis on optimal organization and utilization of the speaker information presented in the enrollment and development data. This study has two core objectives. First, we investigate more robust back-ends to address noisy multi-session enrollment data for speaker recognition. This task is achieved by proposing novel back-end algorithms. Second, we construct a highly discriminative speaker verification framework. This task is achieved through intrinsic and extrinsic back-end algorithm modification, resulting in complementary sub-systems. Evaluation of the proposed framework is performed on the NIST SRE2012 corpus. The results not only confirm individual sub-system advancements over an established baseline; the final grand fusion solution also represents a comprehensive overall advancement for the NIST SRE2012 core tasks. Compared with state-of-the-art SID systems on the NIST SRE2012, the novel parts of this study are: 1) exploring a more diverse set of solutions for low-dimensional i-Vector based modeling; and 2) diversifying the information configuration before modeling. These two parts work together, resulting in very competitive performance at a reasonable computational cost.

Index Terms—Classification algorithms, GCDS, PLDA, speaker recognition, universal background support.

I. INTRODUCTION

Speaker recognition/verification, similar to other pattern recognition/verification tasks, depends largely on the enrollment and development utterances from each speaker/class. For nearly two decades, the Speaker Recognition Evaluation (SRE) program hosted by the National Institute of Standards and Technology (NIST) has focused on a single instance (or session) for each speaker enrollment [1], [2] (all the core conditions are single-session training).

Manuscript received July 12, 2013; revised October 30, 2013; accepted August 14, 2014. Date of publication August 26, 2014; date of current version September 26, 2014. This work was supported by the AFRL under Contract FA8750-12-1-0188 (approved for public release, distribution unlimited), and in part by the University of Texas at Dallas from the Distinguished University Chair in Telecommunications Engineering held by J. H. L. Hansen. The authors note that a preliminary investigation on UBS (from Section III) was presented in [44], and an initial evaluation of the combined back-ends (shown in Fig. 4) was presented in [68]. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Rodrigo C. Guido.

The authors are with the Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering and Computer Science, University of Texas at Dallas, Richardson, TX 75252 USA (e-mail: [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TASLP.2014.2352154

Thus, speaker recognition systems, including the universal background model-Gaussian mixture model (UBM-GMM) [3], the generalized linear discriminant sequence kernel support vector machine (GLDS-SVM) [4], the Gaussian supervector SVM (GSV-SVM) [5], [6], [7], [8], and the i-Vector with probabilistic linear discriminant analysis (PLDA) [9]–[13] back-ends¹, have become rather optimized for the speaker detection task assuming a single enrollment utterance. These solutions have produced impressive gains [14], [15], [16], with an accuracy of greater than 99% and corresponding equal error rates of less than 2%.

However, in realistic applications, the performance of speaker verification systems is often limited by the variability between enrollment and verification sessions caused by the following:

1) intraspeaker variability (such as health or aging [17], sleepiness [18], circadian rhythm [19], emotional state [20], [21], [67], and conversational dynamics [22], [23], [24]);

2) interspeaker variability (such as phonetic variation [25], background noise [26], [27], [28], and transmission channel [29]–[32]).

One solution to address both intra- and interspeaker variability is to employ multiple sessions for the enrollment speaker/class. This trend is natural, given the relative ease of access to voice content in the digital era, in which multiple sessions for one speaker can be accumulated over time.

However, even if more than one token/instance is available for a speaker/class, the resulting model is not always guaranteed to represent an improvement, as the solution is influenced by many details. What if the token or information is noisy? Additionally, can we extract more constructive information from the development data beyond the enrollment data? Thus, the question of how to employ instances for maximum benefit is critical.

As a response to the aforementioned concerns, multiple sessions and noisy data were introduced jointly for the first time in 2012 for the SRE. In addition, there is a duration mismatch between the enrollment and test data, which can cause phonetic coverage variations. Enrollment data are allowed to be used for collective modeling. These issues represent significant departures from past evaluations and thus require a vigorous re-design and optimization of the verification framework, using the available enrollment and development data to the greatest extent feasible.

¹'Back-end' in this context refers to the classifier; it does not include score fusion, but it may include score normalization, so that a stand-alone back-end can individually provide a final score and decision for further fusion.



Fig. 1. High-level system flowchart. Five back-ends will be introduced in the following sections (Sections II–V).

As a first exploration, this study will be constrained to the multi-session and noisy aspects of the data. We propose a series of solutions toward finding a generic and effective strategy for utilizing the available enrollment and development data to significantly improve speaker verification in the noisy multi-session enrollment condition.

Multi-session enrollment for speaker recognition has been investigated previously based on corpora such as YOHO [33]–[37], TIMIT and its derivatives [15], [38]–[41], and ROSSI [10], [42], [43]. These datasets contain a limited number of speakers and/or noise variations and are typically collected under controlled conditions (for example, read speech with a limited vocabulary), which precludes large-scale experiments reflecting the more realistic and challenging conditions (for example, spontaneous speech over telephone channels) found in the SRE. There are also published reports based on multi-session corpora such as Switchboard 1 and 2 [3], [70]–[73], but they mainly rely on less effective algorithms (i.e., GMM or GMM-UBM), and the recording conditions are all constrained to relatively clean telephone data. This study, to the best of our knowledge, together with the efforts of other sites participating in SRE2012, is a first attempt at using large, noisy multi-session data and state-of-the-art technology for speaker recognition. Part of this work is based on the methods proposed in [9]–[12], [44], [45], but these methods are further adapted and modified to investigate noisy multi-session scenarios.

This paper is organized as follows: Sections II–V describe the five back-ends utilized in this study and discuss the specific types of information and processing employed. Auxiliary side information is introduced in Section VI, and back-end fusion solutions are addressed in Section VII. A comprehensive experimental paradigm is established and discussed in Section VIII, results are analyzed in Section IX, and the research findings are summarized in Section X. A high-level flowchart of the system is illustrated in Fig. 1.

II. GAUSSIANIZED COSINE DISTANCE SCORING (GCDS)

The classical cosine distance scoring (CDS) for an i-Vector (one type of low-dimensional feature; more details will be provided in Section VIII) speaker verification system can be formulated as follows [45]:

\[
\mathrm{score}(w_1, w_2) = \frac{(A^t w_1)^t (A^t w_2)}{\| A^t w_1 \| \, \| A^t w_2 \|} \qquad (1)
\]

where A is a projection matrix that may come from a within-class covariance normalization (WCCN) or linear discriminant analysis (LDA) projection, 't' indicates the transpose operation, and w_i denotes the i-Vector of the i-th speech utterance. The operations are generally performed in a cascade fashion, where the i-Vector is first projected through the LDA and then transformed through the WCCN matrix, both of which are estimated from a background dataset.

LDA is a widely used technique for dimensionality reduction.

It can also enhance the discrimination of features by finding a set of orthogonal axes that minimize the within-class variation and maximize the between-class variation. One way to achieve this is to use the following optimization criterion:

\[
A = \arg\max_{A} \frac{\left| A^t S_b A \right|}{\left| A^t S_w A \right|} \qquad (2)
\]

where A is the projection matrix consisting of the desired orthogonal axes, S_b is the between-class covariance matrix, and S_w is the within-class covariance matrix. These two matrices are derived as follows, where we assume that a total of C classes are available and a total of n_i sessions are available for the i-th class:

\[
S_b = \sum_{i=1}^{C} (\mu_i - \mu)(\mu_i - \mu)^t \qquad (3)
\]

\[
S_w = \sum_{i=1}^{C} \sum_{j=1}^{n_i} (w_j^i - \mu_i)(w_j^i - \mu_i)^t \qquad (4)
\]

where \mu_i = \frac{1}{n_i}\sum_{j=1}^{n_i} w_j^i is the mean of the i-th class/speaker data and \mu is the global mean of all of the data.

WCCN [46] was originally used for normalizing the kernels in SVMs and was later adopted by Dehak et al. [45]. The WCCN projection matrix B can be derived from a Cholesky decomposition of S_w^{-1} = B B^t, where S_w is calculated as in (4). The WCCN algorithm uses the within-class covariance matrix to normalize the cosine kernel functions to compensate for inter-session variability while guaranteeing conservation of directions in space, whereas LDA alters the directions in the feature space and makes discriminative information more salient by suppressing the less-informative information. Therefore, WCCN is typically adopted after LDA to achieve optimal performance.

The performance of classical LDA-WCCN-CDS methods depends highly on the WCCN projection, which is often difficult to estimate (particularly in noisy and/or channel-mismatched conditions). Therefore, we propose to replace the WCCN with an alternative idea based on Gaussianization, named Gaussianized CDS (GCDS). The algorithm is outlined below (the source code is provided online [47]):

Algorithm 1: Gaussianized Cosine Distance Scoring (GCDS)

Input: enrollment data D1, test data D2, and background data D3, each consisting of i-Vectors indexed by session, speaker, and feature dimension. Note: for the test data D2, we hypothesize that all audio files are spoken by the same speaker to simplify the notation.

Output: a score matrix.

In outline, the procedure Gaussianizes the i-Vectors, projects them through the LDA projection matrix A ('t' denotes the transpose operation), length-normalizes them, computes the cosine scores of the test i-Vectors against the enrollment speaker models, applies score normalization, and outputs the score matrix.

Note: the 'Gaussianization' operation is performed on a vector basis, where the operand is an n-dimensional i-Vector. The 'Length_normalization' operation is performed on a vector basis, where each n-dimensional i-Vector is scaled to unit length. The 'Score_normalization' operation is performed on a vector basis, where the vector being normalized is the score vector of a test file against all speaker models.
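To make the GCDS pipeline concrete, the following is a minimal NumPy sketch under stated assumptions: the exact Gaussianization mapping is not reproduced here, so a per-vector zero-mean/unit-variance standardization is used as a stand-in, and the LDA matrix A is assumed to be pre-trained on the background data D3. Function names such as gcds_scores are illustrative and do not correspond to the released source code [47].

import numpy as np

def standardize_rows(X):
    # Stand-in for the 'Gaussianization' step (assumption): each i-Vector (row)
    # is shifted and scaled to zero mean and unit variance across its dimensions.
    mu = X.mean(axis=1, keepdims=True)
    sd = X.std(axis=1, keepdims=True) + 1e-12
    return (X - mu) / sd

def length_normalize(X):
    # Scale each row (i-Vector) to unit length.
    return X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)

def gcds_scores(enroll, test, A):
    # enroll: (n_speakers, dim) speaker-model i-Vectors; test: (n_test, dim);
    # A: (dim, lda_dim) LDA projection estimated on background data.
    E = length_normalize(standardize_rows(enroll) @ A)   # speaker models
    T = length_normalize(standardize_rows(test) @ A)     # test i-Vectors
    S = T @ E.T                                          # cosine scores
    # Score normalization on a per-test-file (vector) basis.
    return (S - S.mean(axis=1, keepdims=True)) / (S.std(axis=1, keepdims=True) + 1e-12)

Each row of the returned matrix is the normalized score vector of one test file against all speaker models, matching the note above.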

III. UBS-SVM ANTI-MODEL (UBSSVM)

During the SVM modeling process, only the support vectors have an impact on the final classification performance. Furthermore, to build an SVM with only one positive example (the enrollment speaker) and thousands of negative examples (imposter speakers), the unbalanced-data issue must be carefully addressed; otherwise, the final performance may be severely degraded. The typical practice is to start with an individual SVM model training using a common imposter background dataset, with the assumption that the imposters may share a common subspace [69]. This is a natural assumption, but the data imbalance could produce an over-fitting hyperplane for the enrollment speaker.

Based on this reasoning, we propose to employ all of the enrollment speakers as positive samples and all of the imposter background entries as negative samples to train a single binary-class SVM (noted as the universal SVM); next, all of the support vectors from the imposter side form a new dataset that we call the Universal Background Support (UBS) imposter dataset (Fig. 2 illustrates this concept).

Fig. 2. UBS imposter dataset selection. (a) Traditional SVM modeling for a one-enrollment vs. all-imposter set; (b) the proposed UBS-based SVM modeling is based on an all-enrollment vs. all-imposter idea. Here, "X" and "+" entries denote the negative and positive examples, respectively. The circled "X" and "+" entries denote the support vectors for the negative and positive examples, respectively. All circled "X" entries in (b) constitute the UBS imposter dataset.

Because the UBS imposter dataset is derived using information from all enrollment speakers, it is expected to span a more realistic imposter space and thereby prevent the potential over-fitting issue (Fig. 2(a)) that plagues many development stages. The enrollment data are used collectively, which is a new trend introduced in SRE2012 but neglected by many sites.

Once the UBS imposter dataset is established, it can be used to replace the original, entire imposter background dataset to train the individual enrollment speaker SVM models. This procedure extracts the imposter information from a "general" perspective. It is a straightforward method that does not require tuning any complex configuration parameters, as in [74].

Because the UBS imposter dataset selection method is an approach to finding the more general imposter data, it may suffer by neglecting the "specific" imposter information for an enrollment speaker's SVM model. If there is an approach that can integrate a more target-related, specific imposter dataset or structure, then this more comprehensive imposter selection method is expected to contribute to a more discriminative verification system. From a conceptual perspective, the rationale for this comprehensive method is similar to the procedure followed in a traditional UBM with "adaptation", a classic solution used in many previous speaker verification/identification studies to address the sparse training data issue. In the construction of the SVM, each support vector in the model is assigned a weighting coefficient that indicates how much influence a given support vector has on the positioning of the hyperplane. Thus, every group of imposter support vectors in an individual SVM model can be thought of as a possible imposter "adaptation" dataset source. Specifically, for the "UBM" step, the proposed method is used to prepare the initial background imposter dataset, and support vector samples are utilized to accomplish the SVM model "adaptation".

To sum up, the proposed background data selection method is illustrated in Fig. 3 and implemented as follows [44]:


Fig. 3. The flowchart of UBS data selection-based SVM modeling. NSV denotes a negative support vector; set E is the pooling of all NSVs from the SVM_imp^i-type models. Each of the four steps is isolated with a purple dashed line.

Algorithm 2: Universal Background Support data selection-based SVM and Adaptation (denoted UBSSVM-adaptation).

Step 1: Using all of the enrollment speaker data (denoted set P) as positive examples and all of the imposter speaker data (denoted set N) as negative examples, train a universal SVM (denoted SVM_univ).

Step 2: Using only the negative support vectors of SVM_univ and the data of the i-th enrollment speaker, train the i-th speaker model (denoted SVM_UBS^i), where i = 1, ..., S and S is the total number of enrollment speakers.

Step 3: Using set N as negative samples and the data of the i-th enrollment speaker, train the i-th speaker model (denoted SVM_imp^i, resulting in S SVM_imp^i-type models in total).

Step 4: Pool the negative support vectors from SVM_UBS^i and the top k most frequent negative support vectors from all of the S SVM_imp^i models to form a new negative support vector set (denoted set E); using set E and the data of the i-th enrollment speaker, train the i-th speaker model (denoted SVM_adapt^i). (Note: this is termed a "specific" or "adapted" SVM, where k is a parameter that was optimized on the development data following the data-driven algorithm proposed in [74].)

It should be noted that for the last three steps (Steps 2, 3 and 4), the modeling process for the enrollment speakers in each step can be parallelized. Step 3 can be processed side-by-side with the cascade of Step 1 and Step 2 (Fig. 3). Our observation shows that Step 3 takes a time comparable to Steps 1 and 2 and therefore does not incur extra time cost.

Given that some sessions are seriously corrupted by noise and channel distortion, and given the complexity of Step 4, we also propose the following modification of Algorithm 2 to better address potential distortion and make the approach more feasible:

Algorithm 3: Universal Background Support data selection-based SVM in the noisy multi-session scenario (denoted UBSSVM).

Step 1: Average the i-Vectors of the i-th enrollment speaker.

Step 2: Follow the first two steps of Algorithm 2.

Step 3: Apply score normalization (the same as in Section II).

Multi-session averaging is utilized because we observed that this technique yields better performance than not averaging on the development data. This is because the multiple sessions involve noise and channel disturbances, which may lead to a less robust hyperplane. This approach agrees with the common-sense notion that a few high-quality samples of a speaker are preferred over a large number of poor or questionable samples of that speaker. Therefore, "averaging" is employed as one method to obtain a high-quality sample in Step 1 of Alg. 3.

This proposed method utilizes information derived not only from the limited available enrollment data but also from a large amount of imposter data, which helps the verification decision when a non-target trial is encountered. In Section IX, it will be shown that this back-end is very competitive.
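The following is a minimal sketch of the UBS imposter selection (Steps 1 and 2, with the multi-session averaging of Algorithm 3 folded in), using scikit-learn's linear SVM as a stand-in for the SVM trainer used in the paper. Function names such as select_ubs_imposters are illustrative assumptions.

import numpy as np
from sklearn.svm import SVC

def select_ubs_imposters(enroll_ivecs, imposter_ivecs):
    # Step 1: train one universal SVM with all enrollment i-Vectors as the
    # positive class and all imposter i-Vectors as the negative class, then
    # keep only the negative support vectors as the UBS imposter dataset.
    X = np.vstack([enroll_ivecs, imposter_ivecs])
    y = np.concatenate([np.ones(len(enroll_ivecs)), -np.ones(len(imposter_ivecs))])
    svm_univ = SVC(kernel="linear").fit(X, y)
    sv_labels = y[svm_univ.support_]
    ubs = svm_univ.support_vectors_[sv_labels == -1]
    return svm_univ, ubs

def train_speaker_models(enroll_by_speaker, ubs_imposters):
    # Step 2 (Algorithm 3 variant): average each speaker's i-Vectors into a
    # single high-quality sample, then train one SVM per speaker against the UBS set.
    models = {}
    for spk, ivecs in enroll_by_speaker.items():
        pos = np.mean(ivecs, axis=0, keepdims=True)   # multi-session averaging
        X = np.vstack([pos, ubs_imposters])
        y = np.concatenate([np.ones(1), -np.ones(len(ubs_imposters))])
        models[spk] = SVC(kernel="linear").fit(X, y)
    return models

The universal SVM returned by select_ubs_imposters is reused later in Section VI to derive the "genuine degree" quality measure.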

IV. L2-REGULARIZED LOGISTIC REGRESSION (L2LR)

The third back-end approach considers logistic regression (LR), which is useful in many areas, such as document classification and natural language processing (NLP). There is also ongoing research exploring the application of LR to speaker verification. Kernel LR was proposed in [48], [49] for an MFCC frame-level, feature-based speaker verification solution. i-Vector-based LR was recently investigated for the telephone-telephone condition of SRE2010 [12]. Limited studies have investigated its application in the multi-session and noisy case. Therefore, a more comprehensive investigation is warranted here.

Given a set of instance-label pairs (x_i, y_i), i = 1, ..., l, where x_i is an n-dimensional feature vector and y_i is in {-1, +1}, LR binary classification assumes the following probability model:

\[
P(y \mid x) = \frac{1}{1 + \exp\left(-y\,(w^t x + b)\right)} \qquad (5)
\]

To allow a simpler derivation without the bias term b, one often augments each instance with an additional dimension as follows:

\[
x \leftarrow [x^t, 1]^t, \qquad w \leftarrow [w^t, b]^t \qquad (6)
\]

With this, (5) can be updated as follows:

\[
P(y \mid x) = \frac{1}{1 + \exp(-y\, w^t x)} \qquad (7)
\]


The only LR model parameter, w, can now be estimated by maximizing the likelihood as follows:

\[
\max_{w} \; \prod_{i=1}^{l} P(y_i \mid x_i) \qquad (8)
\]

or by minimizing the negative log-likelihood as follows:

\[
\min_{w} \; \sum_{i=1}^{l} \log\left(1 + \exp(-y_i\, w^t x_i)\right) \qquad (9)
\]

Moreover, to obtain good generalization ability, one adds a regularization term \frac{1}{2} w^t w; thus, in this study, we consider the following unconstrained optimization problem of regularized logistic regression:

\[
\min_{w} \; \frac{1}{2} w^t w + C \sum_{i=1}^{l} \log\left(1 + \exp(-y_i\, w^t x_i)\right) \qquad (10)
\]

where C > 0 is a penalty parameter used to balance the two terms in (10). The trust-region Newton method is employed to learn the LR model (i.e., w, by solving (10)); this is also referred to as the primal form of logistic regression (written as L2-regularized LR by solving the primal form, or L2LR-primal), because the dual problem can be solved instead (written as L2LR-dual) [50]. Because our pilot experiment indicated that better results can be achieved through L2LR-primal, only L2LR-primal is employed in the following analysis; it will be written as L2LR to avoid confusion.

In the testing phase, we predict a data point x to be positive if w^t x > 0 and negative otherwise. In this study, LR is extended via the one-versus-the-rest strategy to a multi-class classification model. This model is a special case of conditional random fields and is also called the maximum entropy model in the NLP community. The LIBLINEAR toolkit is employed to deploy the L2LR (the source code having been slightly modified for parallel computation) [51].

In contrast to the two previous approaches (i.e., GCDS and UBSSVM), the i-Vectors of each enrollment speaker are not averaged and LDA is not applied. This solution only employs enrollment data to build the model for the target speaker; no third-party imposter data are involved. That is, for a specific target speaker, all other speakers' data are used as imposter data. The parameters of this classifier are optimized on the development dataset using the hill-climbing method. The scores are again normalized (the same as in Section III).
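The following is a minimal sketch of the L2LR back-end using scikit-learn's liblinear-backed LogisticRegression (primal form, dual=False) as a stand-in for the modified LIBLINEAR build described above; the parallelization and hill-climbing tuning are omitted, and the function names are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_l2lr(enroll_ivecs, speaker_labels, C=1.0):
    # L2-regularized LR solved in the primal form via liblinear; with more than
    # two speakers this is a one-vs-rest model, so for each target speaker all
    # other speakers' enrollment data act as imposter data.
    return LogisticRegression(penalty="l2", C=C, solver="liblinear",
                              dual=False).fit(enroll_ivecs, speaker_labels)

def l2lr_scores(model, test_ivecs):
    # Per-speaker decision values w^t x, followed by per-test-file score
    # normalization as in Sections II and III.
    S = model.decision_function(test_ivecs)   # (n_test, n_speakers) for >2 speakers
    return (S - S.mean(axis=1, keepdims=True)) / (S.std(axis=1, keepdims=True) + 1e-12)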

V. MULTI-SESSION PLDA (PLDA1, PLDA2)

In speaker verification, it is typical that only one instance is available for each enrollment target speaker, a setting in which probabilistic linear discriminant analysis (PLDA) has been shown to be the state-of-the-art back-end [9], [11]. Here, we propose two methods to address the multi-session case with PLDA.

The widely adopted Gaussian PLDA is a generative modeling approach. We assume that we are given n_i recordings/sessions for the i-th speaker and denote the corresponding D-dimensional feature vectors by w_{ij}, j = 1, ..., n_i, i = 1, ..., S, where S is the total number of enrollment speakers. With the above notation, the given feature vector can be decomposed as [11]:

\[
w_{ij} = \mu + V y_i + U x_{ij} + \varepsilon_{ij} \qquad (11)
\]

where V and U are rectangular matrices and represent the eigenvoice and eigenchannel subspaces, respectively. Additionally, y_i and x_{ij} are the speaker and channel factors, respectively, and \varepsilon_{ij} is the residual term. For convenience, (11) can be reformatted as

\[
w_{ij} = s_i + c_{ij} \qquad (12)
\]

Equation (12) means that each instance can be decomposed into two components: a) the speaker component s_i = \mu + V y_i, which depends only on the speaker, and b) the "channel" component c_{ij} = U x_{ij} + \varepsilon_{ij}, which varies among different sessions (here, the term 'channel' is used in a generic sense because it represents all information of non-interest other than the speaker component). The residual term is typically assumed to follow a Gaussian distribution with a zero mean and diagonal covariance. However, if the residual term is assumed to have a full covariance matrix, we can achieve the same modeling capability by removing U. Then, the PLDA model becomes

\[
w_{ij} = \mu + V y_i + \varepsilon_{ij} \qquad (13)
\]

For verification involving two feature vectors w_1 and w_2, we next need to evaluate the following hypotheses:

\[
H_s: \; w_1, w_2 \text{ come from the same speaker}; \qquad H_d: \; w_1, w_2 \text{ come from different speakers} \qquad (14)
\]

which can be unified by calculating the following log-likelihood ratio:

\[
\mathrm{score}(w_1, w_2) = \log \frac{p(w_1, w_2 \mid H_s)}{p(w_1 \mid H_d)\, p(w_2 \mid H_d)} \qquad (15)
\]

A. Before-Scoring Average PLDA (PLDA1)

The i-Vectors of the k-th enrollment speaker are grouped and averaged before applying PLDA for the likelihood scoring. Therefore, we name this approach "before-scoring average PLDA", or simply PLDA1. Supposing that {w_{kj}}, j = 1, ..., n_k, represents all of the enrollment data for the k-th enrollment speaker and that w_t is the test data, the verification score can be formulated as

\[
\mathrm{score}_{\mathrm{PLDA1}}(k) = \mathrm{score}\!\left(\frac{1}{n_k}\sum_{j=1}^{n_k} w_{kj}, \; w_t\right) \qquad (16)
\]

This formulation allows us to use the centroid of multiple instances of each speaker to average out the potential noise and/or channel mismatch. This approach was widely used in SRE2012 and is thus employed as the baseline in this study.


B. Post-Scoring Average PLDA (PLDA2)

Each i-Vector of the enrollment target file is treated as if it originated from a different speaker. After applying PLDA, the scores of the test file against all of the instances of the k-th enrollment speaker are averaged and used as the likelihood score for the trial of the test data coming from the k-th speaker. That is, the verification score can be formulated as follows:

\[
\mathrm{score}_{\mathrm{PLDA2}}(k) = \frac{1}{n_k}\sum_{j=1}^{n_k} \mathrm{score}(w_{kj}, w_t) \qquad (17)
\]

Compared with PLDA1, this approach employs an averaging operation after one pass of scoring based on PLDA and is therefore named "post-scoring average PLDA", or simply PLDA2. This score is equivalent to a majority vote for the verification decision, with the expectation that each individual sample or utterance captures some combination of the acoustic-based speaker characteristics and environmental distortion. This technique can be understood as multi-condition/multi-style training [41] and echoes the multi-session enrollment scenario that is neglected in PLDA1.
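A minimal sketch contrasting the two multi-session strategies is given below. Here plda_llr is a placeholder for the log-likelihood ratio of (15), assumed to be supplied by an already-trained PLDA model (for example, from an existing toolkit); only the averaging logic of (16) and (17) is shown.

import numpy as np

def score_plda1(plda_llr, enroll_ivecs, test_ivec):
    # Before-scoring average (PLDA1): average the speaker's enrollment
    # i-Vectors first, then score the centroid against the test i-Vector, eq. (16).
    centroid = np.mean(enroll_ivecs, axis=0)
    return plda_llr(centroid, test_ivec)

def score_plda2(plda_llr, enroll_ivecs, test_ivec):
    # Post-scoring average (PLDA2): score every enrollment session separately
    # and average the resulting log-likelihood ratios, eq. (17).
    return float(np.mean([plda_llr(w, test_ivec) for w in enroll_ivecs]))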

VI. AUXILIARY SIDE INFORMATION

One of the fundamental questions of automatic speaker verification is how to extract multi-level information from raw data. In most cases, we simply transfer the "human learning" task to a machine by providing raw data blindly to various machine learning-based classifiers. However, there are many examples of non-relevant (or "red-herring") information that does not constructively improve the solution. For example, some data may be extremely noisy and may consume a disproportionate amount of attention, causing the system to fail easily during real testing. One effective solution for this situation is to integrate auxiliary side information, also known as quality measures, into the fusion process [52]–[54]. Unlike traditional methods, where the fusion system is trained on development data and kept fixed during run-time, the idea of auxiliary information fusion is to adapt the fusion for each test case [23]. The signal-to-noise ratio (SNR) [55], segment duration [56], and non-nativeness scores [53] have all been investigated as potential auxiliary side information.

To further expand research in this direction, this study will investigate new auxiliary information. In the UBSSVM described in Section III, we use all of the enrollment samples as positive instances and all of the imposter samples as negative instances to build the universal SVM (denoted SVM_univ). This technique can be applied to measure the "genuine degree" of the test data; the probability that the sample comes from some enrollment speaker increases with an increasing value of the "genuine degree". The "genuine degree" for the i-th test data is calculated as:

\[
GD(t_i) = \omega^t t_i + b \qquad (18)
\]

where t_i is the i-th test data (i.e., i-Vector), \omega is the normal vector to the hyperplane of SVM_univ, and the scalar b defines the position of the hyperplane.
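For a linear SVM, the quantity in (18) is the signed decision value of the classifier, so with scikit-learn it can be read directly from the universal SVM trained in Section III; the sketch below assumes that model and is illustrative rather than the authors' implementation.

def genuine_degree(svm_univ, test_ivecs):
    # Quality measure of eq. (18): omega^t x + b from the universal SVM trained
    # on all enrollment (positive) vs. all imposter (negative) i-Vectors.
    # Larger values suggest the test segment more likely comes from some
    # enrolled speaker; only the test files need to be processed.
    return svm_univ.decision_function(test_ivecs)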

We expect that this can help in addressing the adverse impact of the so-called "Doddington's zoo" (or the Doddington menagerie, or biometric menagerie) [57], in which some animals may be easily imitated (lambs), some may be good at imitating (wolves), and yet others may be particularly difficult to recognize (goats). The existence of lambs and wolves in speaker recognition may cause system performance to degrade with a high rate of false alarms, and that of goats may cause a high rate of misses. The idea behind the universal SVM is to make a collective vote on the genuine degree of the test (or query) data. Based on this assumption, only the genuine degree score of the testing file is derived.

To provide a point of reference, some traditional side information (e.g., duration and SNR) is also explored. For the auxiliary information of duration, two different VAD algorithms are employed because we also wish to investigate whether there is a benefit from potential alternate redundant information. For the VAD auxiliary information, the average active speech duration over all enrollment sessions is used; for the test segments, the active speech duration of the corresponding utterance is used. The duration value is directly used as the quality measure. For the SNR auxiliary information, the recently proposed noise-robust Waveform Amplitude Distribution Analysis (WADA) algorithm is used [58]. One shared characteristic of these two types of auxiliary information is that they measure the quality of the potential model. They are expected to work together with the aforementioned genuine degree score to boost overall system performance.

Here, we wish to highlight some of the differences between our proposed "genuine degree" and traditional side information. Unlike traditional side information (such as SNR and duration), which depends solely on the audio files per se and can be derived before modeling, the "genuine degree" is produced from modeling and therefore depends on both the audio file and the modeling process. However, given that it is a by-product of modeling, it is still grouped with the side information. In terms of time cost: 1) the "genuine degree" is a natural by-product of the modeling phase, so no significant extra time is needed to derive this information; and 2) for the "genuine degree", only the test file needs to be processed, which can be done quickly in the UBS-SVM framework, and the test files can also be processed in parallel. The training files do not need to be processed, which actually gives this approach a time-cost advantage over traditional side information, especially when the training corpus is very large, since traditional side information must usually be computed on both training and test files.

VII. BACK-END FUSION

A majority of fusion approaches to speaker recognition are based on trial-and-error and optimization on given datasets. The success of a particular combination depends on the performance of the individual systems as well as on their complementarity. Ferrer et al. [59] suggested that uncorrelated systems fuse better than correlated ones, which is one of the guidelines behind the construction of the individual back-ends. Special attention is given to diversification of information utilization to promote complementary properties and to avoid model over-fitting.


Fig. 4. i-Vector based speaker verification system block diagram. Data 1 and 2 correspond to the raw feature data for the UBM and the total variability matrix, respectively. 'Audio data' means all of the acoustic data involved in the verification task. PLDA1 is employed as the baseline back-end.

TABLE I
DATA/INFORMATION USAGE AMONG DIFFERENT BACK-ENDS

As noted in the previous sections, in addition to the apparent differences in information processing inside the different back-end systems due to their modeling nature, another significant difference lies in the configuration of outside information supplied to the various back-ends. Some back-ends only use enrollment data and test data for modeling (L2LR), other back-ends exploit imposter data (UBSSVM), and yet other back-ends leverage background data (which may have similar channel variations) to learn the data variability and thus compensate for the environment mismatch (PLDA, GCDS) in potential verification trials. Another difference results from whether we employ "information emphasis" via LDA, in which some discriminative information is preserved at the cost of removing less relevant information. A third difference lies in how the multiple enrollment sessions and their by-products are processed (for example, verification scores involving different sessions of the same enrollment target speaker). Table I summarizes the similarities and differences between the particular back-ends.

In addition, we also propose a new type of auxiliary information extraction method, called the genuine degree, to suppress the "red-herring" effect produced by atypical data. Together with the VAD duration and SNR measures, we expect that these types of auxiliary information can build a more effective speaker verification back-end framework.

We observe that the various back-ends differ in the degree to which speaker information is utilized (whether using imposter data or the full dimension) and in how the instances of each category are utilized (with or without averaging). Different information utilization combines with different modeling. Therefore, using fusion, we expect that the likelihood score will be reinforced toward the correct decision while the contradictory decision will become less likely. The logistic regression algorithm from the BOSARIS toolkit is used for fusion [60].
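The following is a minimal illustration of linear logistic-regression score fusion in the spirit of the BOSARIS toolkit; it is not the toolkit's own API (BOSARIS is a MATLAB package and trains a prior-weighted logistic regression), but a plain scikit-learn stand-in that learns per-subsystem fusion weights on development trials. Quality measures can simply be appended as additional columns of the score matrix.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion(dev_scores, dev_labels):
    # dev_scores: (n_trials, n_subsystems [+ quality measures]);
    # dev_labels: 1 for target trials, 0 for non-target trials.
    return LogisticRegression().fit(dev_scores, dev_labels)

def fuse(fusion_model, eval_scores):
    # Fused log-odds score for each evaluation trial.
    return fusion_model.decision_function(eval_scores)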

VIII. EXPERIMENT SETUP

The proposed methods are validated on the SRE2012 evaluation dataset [2]. The flowchart is illustrated in Fig. 4. The major components are highlighted in the following sub-sections. Due to its popularity, PLDA1 is employed as the baseline back-end.

TABLE II
NUMBER OF SPEAKERS, SEGMENTS AND TRIALS IN THE EVAL SET

A. Front-End Processing

Before feature extraction, all waveforms are first down-sampled to 8 kHz and blocked into 25 ms frames with a 10 ms skip-rate. All our features use 12 cepstral coefficients and log-energy/C0, appended with the first- and second-order time derivatives, thus providing 39-dimensional feature vectors. VAD is employed to remove silence and low-energy speech segments. We utilized four different acoustic features: (a) Mel-frequency cepstral coefficients (MFCCs), normalized by quantile cepstral normalization (QCN)² [61] and low-pass RASTA filtering [62]; (b) rectangular filter-bank cepstral coefficients (RFCCs) [61], processed through feature warping [63]; (c) mean Hilbert envelope coefficients (MHECs) [64]; and (d) perceptual minimum variance distortionless response (PMVDR) coefficients [64].
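As an illustration of the 39-dimensional MFCC front-end described above, the following sketch uses librosa; VAD, QCN, and RASTA post-processing are omitted, and treating the first cepstral coefficient as the log-energy/C0 term is an assumption of this sketch rather than the paper's exact implementation.

import numpy as np
import librosa

def mfcc_39(wav_path):
    # 8 kHz audio, 25 ms frames with a 10 ms skip-rate; 13 base coefficients
    # (C0..C12) plus deltas and delta-deltas give 39 dimensions per frame.
    y, sr = librosa.load(wav_path, sr=8000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
    d1 = librosa.feature.delta(mfcc, order=1)
    d2 = librosa.feature.delta(mfcc, order=2)
    return np.vstack([mfcc, d1, d2]).T   # shape: (n_frames, 39)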

B. i-Vector Extraction

Introduced in [45], the i-Vector is one of the key parts of a state-of-the-art speaker verification system. The model is represented by

\[
M = m + T w \qquad (19)
\]

where T is the total variability space matrix, w is the i-Vector, m is the UBM mean supervector, and M is the supervector derived from the given speech utterance.

The extensive computational requirements of an i-Vector system result mainly from the UBM training and the total variability matrix training. Gender-dependent UBMs with 1,024 mixtures are trained on telephone utterances (see Table II). For the total variability matrix T, the UBM training dataset and additional SRE2012 target speakers' data (both clean and noisy versions) are used. Five iterations are used for EM training [65]. The i-Vector dimension is set to 600. In the i-Vector post-processing module, LDA is applied to reduce the i-Vector to 400 dimensions, and length normalization is then applied (the L2LR solution skips this step; see Table I).

²The codes are available via http://www.utdallas.edu/~hynek/tools.html


TABLE III
CORPORA USED TO ESTIMATE THE SYSTEM COMPONENTS

C. Experimental Dataset

To perform a pilot experiment, the proposed methods are evaluated on both male and female cases for two feature front-ends (MFCC and RFCC). Then, the optimal system is applied to the SRE2012 evaluation [2], where all four front-ends are employed.

Experimental Setup: In preparing the development system, we maintained a close collaboration with the I4U consortium. A total of 1,918 SRE2012 target speakers' utterances were collected from SRE2006-2010, and a training-testing list pair was prepared for evaluation. The training list includes multiple sessions per speaker, and the testing list includes both known and unknown non-target speakers, following the SRE2012 protocol. The trials are designed based on three criteria: a) training and testing files are always from different sessions, b) both telephone and microphone recordings are kept for enrollment, and c) 6 dB and 15 dB noisy versions are created for each segment. The noise sample is randomly chosen from a pool of noise files [65].

Two sets of speaker ID tasks were prepared, namely Dev and Eval. The motivation was to train and test the system on Dev and then run it on Eval to verify whether the methods used on Dev also provide a benefit on Eval. The Eval task was designed to be closer to the actual SRE2012 evaluation, so that the fusion and calibration parameters trained on Dev could also be tested on Eval. This means the Eval task had a larger number of test utterances and trials (see Table II).³ Pilot experiments are based on the Eval set.

³The lists are available via http://cls.ru.nl/~saeidi/file_library/I4U.tgz

Assistive Modeling Dataset: Apart from the necessary target enrollment data and test data, we still need some data as building blocks for some back-ends (hence named the "assistive modeling dataset") for operations such as LDA and SVM training. Specifically, the development dataset for PLDA1 and PLDA2 includes the UBM training dataset and the clean and noisy speaker enrollment data. The UBSSVM development dataset includes only UBM data and has no speaker data that overlaps with the enrollment data. For the GCDS back-end, the UBM training data are used as background data. L2LR does not use a development set and thereby avoids potential information distortion stemming from development sets. Table III summarizes how the specific datasets are used across the various sub-tasks.

Before moving on, we design a small experiment to demonstrate the indispensability of imposter data and the importance of a correct modeling data configuration for a state-of-the-art system. The experiment is based on the male trials of the NIST SRE 2010 telephone train/test condition (condition 5, normal vocal effort). After voice activity detection, 39-dimensional MFCC features are extracted, using a 25 ms window with a 10 ms shift, and Gaussianized using a 3-s sliding window. Gender-dependent diagonal-covariance UBMs with 1,024 mixtures are trained on utterances selected from Switchboard II Phases 2 and 3, Switchboard Cellular Parts 1 and 2, and the NIST 2004, 2005, and 2006 SRE enrollment data (denoted Dataset A). For the TV matrix training, the same dataset (Dataset A) is utilized. 400-dimensional i-Vectors are extracted, whitened, and then length-normalized. For session variability compensation and scoring, we use PLDA. According to the data source used for the UBM and TV matrix, we consider three scenarios: a) enrollment data; b) enrollment data and test data (this violates the NIST protocol; we only mean to illustrate our point); and c) independent third-party data (such as Dataset A; this is also imposter data, since there is no overlap between Dataset A and the SRE 2010 data). The EERs for the three scenarios are 49.7%, 33.4%, and 2.1%, respectively. The first two cases do not rely on background/imposter data and clearly perform very poorly. This clearly demonstrates the pivotal role of the background/imposter data configuration.

SRE2012 Evaluation: Finally, the proposed methods are validated on the actual SRE2012 evaluation data. This study focuses on five core test conditions [2]:

Cond. 1: interview speech without added noise (int-cln);
Cond. 2: phone call speech without added noise (phn-cln);
Cond. 3: interview speech with added noise (int-noi);
Cond. 4: phone call speech with added noise (phn-noi);
Cond. 5: phone call speech collected in a noisy environment (phn-noiE).

IX. RESULTS AND ANALYSIS

A. Sub-System Investigation

First, we wish to validate the performance of GCDS. The results are detailed in Table IV. For simplicity and generic comparison, only the MFCC-based i-Vector is used here (similar results were observed for the other features; see Fig. 5). Score normalization significantly enhances performance for both the classical CDS and the proposed GCDS. We also observe similar trends for the L2LR and UBSSVM back-ends. Thus, for those three back-ends, score normalization is always employed. It is observed that GCDS outperforms CDS significantly for both males and females, which confirms the validity of GCDS.

The performance of the individual back-ends is summarized in Fig. 5 using detection error tradeoff (DET) curves. GCDS and PLDA2 are very competitive in terms of EER. UBSSVM offers the most competitive minDCF in both the male and female cases [2]. L2LR is inferior to the other three top back-ends (i.e., PLDA2, GCDS, and UBSSVM). However, it is observed later (see Section IX-B1 and Fig. 9) that L2LR provides a significant performance gain with respect to fusion.

To further verify the merits of the fusion approaches, we also summarize the relative gain of the different fusion approaches against the averaged PLDA1-based individual front-ends in Fig. 6 ('Avg. 1' in Table V). The front-end based fusion can improve the EER and minDCF by approximately 16% for both genders.


Fig. 5. DET plot for male (Left) and female (Right) conditions. Numbers in parentheses are EER (%) and minDCF (x100), respectively.

TABLE IV
PERFORMANCE COMPARISON BETWEEN CLASSICAL CDS AND GCDS. RELATIVE GAIN IS COMPUTED BETWEEN THE 5TH AND 6TH COLUMNS

Fig. 6. Relative gain of specific fusion combinations vs. the average individual front-end performance metrics (based on back-end 1: PLDA1). Gains 1-3 are obtained from Table V.

However, this benefit is acquired at a high computational cost for the i-Vector extraction pipeline, which involves extracting the raw features, training the UBM, and calculating the total variability matrix for each front-end (usually on the scale of thousands of audio files). Compared with the averaged single front-end and back-end case ('Avg. 1' in Table V), the fusion based on multiple back-ends but a single front-end performs considerably better, achieving relative gains (Gain 2) of 49% and 42% in EER and minDCF, respectively. Finally, when the two front-ends and five back-ends are combined, we have a fusion of 10 systems providing relative gains (Gain 3) of 56.5% and 49.4% in EER and minDCF compared to the case 'Avg. 1' in Table V, where an additional 7% gain over Gain 2 comes from the various features.

Fig. 7. Performance of the four quality measures and their fusion. The first five blocks correspond to the five core test conditions. The last block, denoted "avg.", corresponds to the performance averaged across all five conditions. In each block, the five bars correspond to all of the subsystems fused with the VAD1, VAD2, SNR, and UBSSVM genuine degree scores, and with all four quality measures.

This gain is limited by the similar signal processing procedures behind the various front-end features.

B. Results and Analysis for the Five Core SRE2012 Conditions

The analysis presented thus far is based on the development dataset, which is designed to mimic what might be encountered in the real SRE2012 evaluation. To further validate the proposed approaches and compare them with other sites, we also report results on the five core evaluation conditions of SRE2012. For the development set, we mainly rely on EER and minDCF to validate performance, because we wish to quickly judge the discrimination rather than the calibration of our algorithms [66]. To assist in the potential comparison and analysis, minCprimary will be used as the metric, except where noted otherwise. Here, minCprimary is the minimum detection cost for Cprimary, both of which are defined in [2].


TABLE V
PERFORMANCE COMPARISON OF INDIVIDUAL FRONT-END/BACK-END SYSTEMS AND DIFFERENT FUSION COMBINATION SYSTEMS. 'ALL-1' INDICATES A FUSION SYSTEM BASED ON TWO FRONT-ENDS (MFCC AND RFCC) WITH BACK-END 1 (PLDA1). 'ALL-ALL' INDICATES THE FUSION IS BASED ON THE TWO FRONT-ENDS WITH ALL FIVE BACK-ENDS (A TOTAL OF 2 X 5 SYSTEMS). 'AVG. 1' IS THE AVERAGE PERFORMANCE OF THE BACK-END 1-BASED SINGLE FRONT-END (DARK SHADED AREA) SYSTEMS. 'AVG. 2' IS THE AVERAGE PERFORMANCE OF THE BACK-END-BASED FUSION SYSTEMS (LIGHT SHADED AREA). 'GAIN 1', 'GAIN 2', AND 'GAIN 3' ARE THE RELATIVE GAINS OBTAINED FROM 'ALL-1' VS. 'AVG. 1', 'AVG. 2' VS. 'AVG. 1', AND 'ALL-ALL' VS. 'ALL-1', RESPECTIVELY. BACK-ENDS 1-5 ARE THE FIVE BACK-ENDS CORRESPONDING TO PLDA1, GCDS, L2LR, UBSSVM, AND PLDA2.

Fig. 8. Back-end performance comparison on the five core SRE2012 conditions in the scenarios of (a) without a quality measure (top), and (b) with a quality measure (bottom). The first five blocks correspond to the five core test conditions. The last block, denoted "avg.", corresponds to the performance averaged across all five conditions. In each block, the six bars correspond to the back-ends PLDA1, GCDS, L2LR, UBSSVM, PLDA2, and their fusion.

Individual Sub-System Performance: First, we validate the genuine degree score quality measure derived from UBSSVM and compare it with three traditional quality measures, as illustrated in Fig. 7. The UBSSVM genuine degree score provides better performance than the SNR as auxiliary information in all conditions except condition 4, where it is inferior to the SNR but still provides the second-best individual performance. This suggests that the proposed genuine degree can work as an alternative quality measure. In addition, improvements across all five conditions can be observed by fusing all of the quality measures. Therefore, all four quality measures are employed in the final system evaluations.

Comparisons of the individual back-end performance on the core conditions, both without and with auxiliary information, are shown in Fig. 8(a) and (b), respectively. Fig. 8(b) highlights the improved performance relative to Fig. 8(a) resulting from the quality measures. Fig. 8(a) indicates that although the back-ends L2LR and PLDA2 are competitive overall, L2LR is less effective than PLDA1 in three of the five conditions, and PLDA2 is less effective than PLDA1 in one of the five conditions (i.e., condition 3). Similar trends can be observed in Fig. 8(b). Overall, fusion based on all back-ends outperforms the baseline, that is, PLDA1, by 32.0% and 42.2% in the cases with and without a quality measure, respectively. This result suggests that the quality measures improve performance more for back-ends other than PLDA1.

Due to the lackluster performance of L2LR, one natural response is to determine whether L2LR should be retained. To check the individual impact, a series of leave-one-out rotation experiments is performed. For example, in the scenario "w/o PLDA1", PLDA1 is left out while the remaining four back-ends are utilized in the fusion process. The worse the final performance becomes, the more important the contribution of the removed back-end to the final integrated solution. The results are illustrated in Fig. 9. The case where all five back-ends are used is also provided as a reference. The L2LR solution clearly plays a vital role in the fusion. Furthermore, PLDA1 does not fare well, being the least constructive (as in conditions 1, 2 and 5) and even reducing performance (in condition 3). The other three proposed back-ends all contribute to the fusion by roughly the same degree in terms of the 'avg.' bar block in Fig. 9.

System Gain Flow: Finally, we analyze the gain flow in Table VI step-by-step, starting from the baseline system and moving to the final grand system, where all 80 sub-systems (covering four front-ends, five back-ends, and four quality measures) are involved.


Fig. 9. Back-end performance comparison on the five SRE2012 conditions. The first five blocks correspond to the five core test conditions. The last block, denoted "avg.", corresponds to the performance averaged across all five conditions. In each block, the six bars correspond to "without PLDA1", "without GCDS", "without L2LR", "without UBSSVM", "without PLDA2", and "with all back-ends".

Fig. 10. Gain flow and distribution. Relative gains are computed as in Table VI. The first five blocks correspond to the five core test conditions. The last block, denoted "avg.", corresponds to the performance averaged across all five conditions. In each block, the three bars correspond to Gains 1, 2, and 3, respectively. Gains 1 through 3 are computed as in the last row and last three columns (quality measure or side information).

TABLE VI
SYSTEM GAIN FLOW FROM THE BASELINE SYSTEM TO THE FINAL GRAND SYSTEM. 'GAIN 1', 'GAIN 2', AND 'GAIN 3' ARE THE RELATIVE GAINS OBTAINED FROM STEPS 2, 3, AND 4 VS. STEP 1, RESPECTIVELY. MINCPRIMARY IS DEFINED IN [2].

The relative gain and gain distribution are illustrated in Fig. 10. Table VI indicates that, overall, each additional step yields further improvements. The PLDA1-based front-end fusion fails to produce better performance on condition 3 (where the fusion gain degrades by 5.63%), which highlights the limitation of a purely front-end-based fusion. Otherwise, Steps 2 through 4 offer significant and consistent gains on all conditions (Table VI). Of the total 48.52% improvement, more than half is derived from the back-end fusion. The second largest contribution comes from the front-end fusion. The specific sources of performance gains here correspond to our earlier observations from the development set. Thus, the proposed multi-session back-end framework generalizes to the unseen SRE2012 evaluation data.

Less than 1 hour is needed to complete the back-end system processing, while roughly 48 hours are needed to derive the i-Vectors for the four front-end features (on our cluster computing system with 408 CPUs [65]). Thus, the observed distribution of gain is meaningful for realistic speaker verification applications in which resources are limited.


TABLE VII
FINAL SYSTEM PERFORMANCE ON THE FIVE SRE2012 CONDITIONS. CPRIMARY AND MINCPRIMARY ARE THE COST FUNCTIONS DEFINED IN [2] (SMALLER VALUES ARE BETTER).

to the condition definition (for example, balance of durationand training file number for enrollment speakers, varied envi-ronment noise). It is not advised by NIST to compare perfor-mance across conditions, but only within them [75]. So, wewill only rely on the relative improvement of proposed sys-tems against the baseline for initial analysis. Overall, the pro-posed back-end frame-work helps most in telephone conver-sation speech (cond. 2, 4, and 5), where the relative improve-ment is close to, or higher than, the average relative improve-ment (i.e., 48.52%). As a comparison, the relative improvementon the interview speech is below average (cond. 1 and 3). Thismay be due to the relative high quality speech resulted from in-terview speech style and recordings (for example, less disturbedcontinuous speech from the same speaker, head mounted close-talking microphone), which have less potential room for per-formance gain. It is especially noted that the system offers sig-nificant improvement in cond. 5, where the telephone channelspeech were intentionally collected in a noisy environment. Theother noisy conditions, by contrast, were established with arti-ficially added noise (cond. 3, 4, and 5). These two observationshave a practical impact, since in real life applications telephonespeech occurs more often than interview speech, and naturallycorrupted noise occurs more often than artificially added noise.Final System Performance: The final performance is detailed

in Table VII and is comparable with some of the best systems in SRE2012. The results presented here are based on a single fusion solution, unlike a number of SRE sites whose submitted systems included different fusion configurations. The results confirm the usefulness of the proposed individual system advancements, as well as of the final fusion strategy, in addressing robust speaker ID for the SRE2012 task.

X. CONCLUSION

For almost two decades, the NIST SRE has focused on a single instance (or session) per enrollment for each speaker as the primary task. Consequently, speaker recognition systems have become optimized for a speaker detection task with single-utterance enrollment. This study has aimed to explore robust speaker recognition with multi-session enrollment and noise, with an emphasis on the optimal utilization and organization of the speaker information presented in the data. Two objectives were established. First, we wanted to investigate back-ends that would be more robust to noisy multi-session data. Second, we wanted to construct a highly discriminative back-end framework by fusing several back-ends.

The first goal was achieved by proposing a new back-end algorithm (UBS-SVM, which performs best among all back-ends when averaged across the five conditions), modifying the

traditional back-end algorithms (such as GCDS and PLDA2, which are almost always better than the baseline system), and introducing some seldom-explored back-end algorithms (such as L2LR, which is less impressive on its own but still better than the baseline system in some situations and is very constructive in system fusion). The second goal was an extension of the first and was partially realized by constructing heterogeneous yet potentially complementary individual back-ends that work together to present a more discriminative back-end framework. While the back-ends differ in their internal operation, attention was also given to externally diversifying the information pool and its utilization, which is a starting point in many pattern recognition studies but is often neglected. To further realize the second goal, a novel type of auxiliary information, the genuine degree score, was proposed to reduce over-fitting to atypical data (i.e., "red herrings"). Finally, as an additional effort in diversification, four parallel front-ends were also employed.
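To make the fusion step concrete, the following is a minimal sketch of score-level fusion across heterogeneous back-ends with an auxiliary quality-style input. It illustrates the general technique (linear logistic-regression fusion), not the exact configuration or data used in this study; all array contents are synthetic placeholders.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n_trials = 1000
    # Per-trial scores from four heterogeneous back-ends (synthetic stand-ins
    # for, e.g., PLDA, GCDS, UBS-SVM, and L2LR outputs).
    backend_scores = rng.normal(size=(n_trials, 4))
    # Auxiliary side information per trial (e.g., a quality- or genuine-degree-style score).
    side_info = rng.uniform(size=(n_trials, 1))
    labels = rng.integers(0, 2, size=n_trials)  # 1 = target trial, 0 = non-target

    X = np.hstack([backend_scores, side_info])
    fuser = LogisticRegression()                 # learns one weight per sub-system plus a bias
    fuser.fit(X, labels)
    fused_scores = fuser.decision_function(X)    # fused detection scores for thresholding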

In terms of overall performance, results from both the development data (male and female) and the actual SRE2012 evaluation data were very promising. The back-end framework dramatically improved performance against the baseline system as well as against the front-end-based fusion. Overall, the EER was improved on the development data across both genders, and minCprimary was improved by 48.52% for SRE2012 across the five conditions. In terms of the gain flow and distribution, the contribution from back-end fusion accounts for more than 50% of the total gain (i.e., 48.52%); front-end fusion is the second most important gain source, and the quality measure fusion also contributes significantly. This result suggests that, compared with a state-of-the-art PLDA system, the proposed back-end framework and the information-diversified, organization-based back-end fusion framework can better address the new challenges of multi-session enrollment and noisy speaker verification posed by the SRE2012 evaluation. The final performance is comparable with, and for several test conditions even outperforms, some of the best systems in SRE2012 [2].

Compared with state-of-the-art SID systems on the NIST SRE2012, the novel parts of this study are: 1) exploring a more diverse set of solutions for low-dimensional i-Vector based modeling (most of the top systems are purely PLDA-based [76]–[78]); and 2) diversifying the information used for modeling to increase system complementarity, which is seldom investigated by other researchers. These parts work together to deliver competitive performance with modest computational resources. Compared with other top systems, where the UBM has 2048 mixtures ([76]–[78]) and at least five front-end variations are used ([76], [77]), our system is based on a 1024-mixture UBM with four front-ends, which dramatically reduces the modeling and verification resources because the computational complexity grows linearly with the mixture count [80]. Even when computational resources are limited, significant improvement can be gained from back-end modeling advancements alone, which warrants further effort (Section IX-B-2).
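As a rough back-of-the-envelope check of this resource claim, under the linear-in-mixture-count assumption of [80] and assuming cost also scales with the number of parallel front-ends, the relative cost of a typical top system versus the proposed configuration can be sketched as follows; only the mixture and front-end counts come from the text, and the unit cost is an arbitrary placeholder.

    # Illustrative cost comparison: modeling/verification cost assumed linear in
    # the UBM mixture count [80] and in the number of parallel front-ends.
    def relative_cost(num_mixtures, num_front_ends, unit_cost=1.0):
        return num_mixtures * num_front_ends * unit_cost

    proposed = relative_cost(1024, 4)   # this study: 1024-mixture UBM, 4 front-ends
    typical = relative_cost(2048, 5)    # typical top system: 2048 mixtures, >= 5 front-ends
    print(typical / proposed)           # -> 2.5, i.e., roughly 2.5x the cost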

This study has proposed and demonstrated a generic framework for noisy multi-session tasks, which is not over-tuned for


any specific condition discussed in this study and is not limited to speaker identification tasks alone. For future work, its application to other speech-based identification problems, such as language/dialect identification [79] or emotion identification, could also be explored.

It is noted that the imposter data set needs to be available a priori

for some of the proposed methods (for example, UBS-SVM). In the current mobile computing era, however, the amount of imposter data can grow rapidly. Leveraging this dynamically growing data should also help improve speaker verification performance and therefore warrants further investigation.

ACKNOWLEDGMENT

The authors would like to thank Yun Lei, a former UTDallas-CRSS member, for helpful discussions; members of the I4U group (especially Rahim Saeidi) for their development experimental setup; and CRSS lab members for their help and participation in the NIST SRE-2012 evaluation/competition. The authors would also like to thank the anonymous reviewers for their valuable comments and suggestions to improve the quality of the paper.

REFERENCES
[1] "The NIST year 1997-2010 speaker recognition evaluation plans," 2010. [Online]. Available: http://www.nist.gov
[2] "The NIST year 2012 speaker recognition evaluation plans," 2012. [Online]. Available: http://www.nist.gov
[3] D. A. Reynolds, T. F. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Process., vol. 10, no. 1–3, pp. 19–41, Jan. 2000.

[4] W. Campbell, J. Campbell, D. A. Reynolds, E. Singer, and P. Torres-Carrasquillo, "Support vector machines for speaker and language recognition," Comput. Speech Lang., vol. 20, no. 2–3, pp. 210–229, Apr. 2006.
[5] C. You, K. Lee, and H. Li, "An SVM kernel with GMM-supervector based on the Bhattacharyya distance for speaker recognition," IEEE Signal Process. Lett., vol. 16, no. 1, pp. 49–52, Jan. 2009.
[6] N. Dehak and G. Chollet, "Support vector GMMs for speaker verification," in Proc. Odyssey, San Juan, Puerto Rico, Jun. 2006, pp. 1–4.
[7] W. Campbell, D. Sturim, and D. Reynolds, "Support vector machines using GMM supervectors for speaker verification," IEEE Signal Process. Lett., vol. 13, no. 5, pp. 308–311, May 2006.
[8] K. Lee, C. You, H. Li, T. Kinnunen, and D. Zhu, "Characterizing speech utterances for speaker verification with sequence kernel SVM," in Proc. Interspeech, Brisbane, Australia, Sep. 2008, pp. 1397–1400.
[9] S. J. D. Prince and J. H. Elder, "Probabilistic linear discriminant analysis for inferences about identity," in Proc. ICCV, Rio de Janeiro, Brazil, Oct. 2007, pp. 1–8.
[10] N. Brummer, "EM for probabilistic LDA," Feb. 2010. [Online]. Available: https://sites.google.com/site/nikobrummer
[11] P. Kenny, "Bayesian speaker verification with heavy-tailed priors," in Proc. Odyssey, Brno, Czech Republic, Jun. 2010.

[12] L. Burget, O. Plchot, S. Cumani, O. Glembek, P. Matejka, and N. Brummer, "Discriminatively trained probabilistic linear discriminant analysis for speaker verification," in Proc. ICASSP, Prague, Czech Republic, May 2011, pp. 4832–4835.

[13] J. Suh, S. O. Sadjadi, G. Liu, T. Hasan, K. W. Godin, and J. H. L. Hansen, "Exploring Hilbert envelope based acoustic features in i-Vector speaker verification using HT-PLDA," in Proc. NIST Speaker Recogn. Eval., Atlanta, GA, USA, Dec. 2011.
[14] D. Garcia-Romero and C. Y. Espy-Wilson, "Analysis of i-Vector length normalization in speaker recognition systems," in Proc. Interspeech, Florence, Italy, Aug. 2011, pp. 249–252.

[15] D. A. Reynolds, "Large population speaker identification using clean and telephone speech," IEEE Signal Process. Lett., vol. 2, no. 3, pp. 46–48, Mar. 1995.

[16] D. A. Reynolds, "An overview of automatic speaker recognition technology," in Proc. ICASSP, Orlando, FL, USA, May 2002, pp. 4072–4075.
[17] F. Kelly, A. Drygajlo, and N. Harte, "Speaker verification in score-ageing-quality classification space," Comput. Speech Lang., vol. 27, no. 5, pp. 1068–1084, Aug. 2013.
[18] T. Rahman, S. Mariooryad, S. Keshavamurthy, G. Liu, J. H. L. Hansen, and C. Busso, "Detecting sleepiness by fusing classifiers trained with novel acoustic features," in Proc. Interspeech, Florence, Italy, Aug. 2011, pp. 3285–3288.

[19] M. Artkoski, J. Tommila, and A. M. Laukkanen, "Changes in voice during a day in normal voices without vocal loading," Logopedics Phoniatrics Vocology, vol. 27, no. 3, pp. 118–123, 2002.

[20] B. Schuller, S. Steidl, A. Batliner, E. Noth, A. Vinciarelli, F. Burkhardt, R. V. Son, F. Weninger, F. Eyben, T. Bocklet, G. Mohammadi, and B. Weiss, "The Interspeech 2012 speaker trait challenge," in Proc. Interspeech, Portland, OR, USA, Sep. 2012, pp. 254–257.
[21] G. Liu, Y. Lei, and J. H. L. Hansen, "A novel feature extraction strategy for multi-stream robust emotion identification," in Proc. Interspeech, Makuhari, Japan, Sep. 2010, pp. 26–30.
[22] K. W. Godin and J. H. L. Hansen, "Session variability contrasts in the MARP corpus," in Proc. Interspeech, Makuhari, Japan, Sep. 2010, pp. 298–301.
[23] T. Kinnunen and H. Li, "An overview of text-independent speaker recognition: From features to supervectors," Speech Commun., vol. 52, no. 1, pp. 12–40, Jan. 2010.
[24] G. Liu, Y. Lei, and J. H. L. Hansen, "Dialect identification: Impact of differences between read versus spontaneous speech," in Proc. EUSIPCO, Aalborg, Denmark, Aug. 2010, pp. 2003–2006.
[25] R. J. Vogt, B. J. Baker, and S. Sridharan, "Factor analysis subspace estimation for speaker verification with short utterances," in Proc. Interspeech, Brisbane, Australia, Sep. 2008, pp. 853–856.
[26] G. Liu, Y. Lei, and J. H. L. Hansen, "Robust feature front-end for speaker identification," in Proc. ICASSP, Kyoto, Japan, Mar. 2012, pp. 4233–4236.
[27] Y. Lei, L. Burget, and N. Scheffer, "A noise robust i-vector extractor using vector Taylor series for speaker recognition," in Proc. ICASSP, Vancouver, BC, Canada, May 2013, pp. 6788–6791.
[28] Y. Lei, L. Burget, L. Ferrer, M. Graciarena, and N. Scheffer, "Towards noise-robust speaker recognition using probabilistic linear discriminant analysis," in Proc. ICASSP, Kyoto, Japan, Mar. 2012, pp. 4253–4256.
[29] P. Kenny, G. Boulianne, P. Ouellet, and P. Dumouchel, "Speaker and session variability in GMM-based speaker verification," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 4, pp. 1448–1460, May 2007.
[30] M. Senoussaoui, P. Kenny, N. Dehak, and P. Dumouchel, "An i-Vector extractor suitable for speaker recognition with both microphone and telephone speech," in Proc. Odyssey, Brno, Czech Republic, Jun. 2010, pp. 28–30.
[31] N. Dehak, Z. N. Karam, D. A. Reynolds, R. Dehak, W. M. Campbell, and J. R. Glass, "A channel-blind system for speaker verification," in Proc. ICASSP, Prague, Czech Republic, May 2011, pp. 4536–4539.
[32] L. Burget, N. Brummer, and D. Reynolds, "Robust speaker recognition over varying channels," Johns Hopkins University CLSP Summer Workshop Rep., 2008. [Online]. Available: www.clsp.jhu.edu/workshops/ws08/documents/jhu_report_main.pdf
[33] D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models," Speech Commun., vol. 17, no. 1, pp. 91–108, Aug. 1995.
[34] J. Campbell, "Testing with the YOHO CD-ROM voice verification corpus," in Proc. ICASSP, Detroit, MI, USA, May 1995, pp. 341–344.
[35] J. M. Colombi, D. W. Ruck, T. R. Anderson, S. K. Rogers, and M. Oxley, "Cohort selection and word grammar effects for speaker recognition," in Proc. ICASSP, Atlanta, GA, USA, May 1996, pp. 85–88.
[36] P. Angkititrakul and J. H. L. Hansen, "Discriminative in-set/out-of-set speaker recognition," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 2, pp. 498–508, Feb. 2007.
[37] V. Wan and S. Renals, "Evaluation of kernel methods for speaker verification and identification," in Proc. ICASSP, Orlando, FL, USA, May 2002, pp. 669–672.
[38] J. Campbell and D. A. Reynolds, "Corpora for the evaluation of speaker recognition systems," in Proc. ICASSP, Phoenix, AZ, USA, Mar. 1999, pp. 829–832.
[39] R. Sarikaya, B. L. Pellom, and J. H. L. Hansen, "Wavelet packet transform features with application to speaker identification," in Proc. IEEE Nordic Signal Process. Symp., Vigso, Denmark, Jun. 1998, pp. 81–84.
[40] J. Ming, T. J. Hazen, J. R. Glass, and D. A. Reynolds, "Robust speaker recognition in noisy conditions," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 5, pp. 1711–1723, Jul. 2007.
[41] J. Ming, J. Lin, and F. J. Smith, "A posterior union model with applications to robust speech and speaker recognition," EURASIP J. Appl. Signal Process., vol. 2006, pp. 1–12, Apr. 2006.


[42] H. Lei and N. Mirghafori, "Data selection with kurtosis and nasality features for speaker recognition," in Proc. Interspeech, Florence, Italy, Aug. 2011, pp. 2753–2756.
[43] H. Lei, B. T. Meyer, and N. Mirghafori, "Spectro-temporal Gabor features for speaker recognition," in Proc. ICASSP, Kyoto, Japan, Mar. 2012, pp. 4241–4244.
[44] G. Liu, J. W. Suh, and J. H. L. Hansen, "A fast speaker verification with universal background support data selection," in Proc. ICASSP, Kyoto, Japan, Mar. 2012, pp. 4793–4796.
[45] N. Dehak, P. Kenny, R. Dehak, P. Ouellet, and P. Dumouchel, "Front-end factor analysis for speaker verification," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 4, pp. 788–798, Aug. 2010.
[46] A. Hatch, S. Kajarekar, and A. Stolcke, "Within-class covariance normalization for SVM-based speaker recognition," in Proc. ICSLP, Pittsburgh, PA, USA, Sep. 2006, pp. 1471–1474.
[47] "GCDS algorithm source code," [Online]. Available: www.utdallas.edu/~gang.liu/code.htm
[48] M. Katz, S. E. Krüger, M. Schaffoner, E. Andelic, and A. Wendemuth, "Speaker identification and verification using support vector machines and sparse kernel logistic regression," Adv. Mach. Vis., Image Process., Pattern Anal., vol. 4153, pp. 176–184, 2006.
[49] M. Katz, M. Schaffoner, E. Andelic, S. E. Krüger, and A. Wendemuth, "Sparse kernel logistic regression using incremental feature selection for text-independent speaker identification," in Proc. Odyssey, San Juan, Puerto Rico, Jun. 2006.
[50] C. Lin, R. Weng, and S. Keerthi, "Trust region Newton method for large-scale logistic regression," J. Mach. Learn. Res., vol. 9, pp. 627–650, 2008.
[51] R. Fan, K. Chang, C. Hsieh, X. Wang, and C. Lin, "LIBLINEAR: A library for large linear classification," J. Mach. Learn. Res., vol. 9, pp. 1871–1874, 2008. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/liblinear
[52] D. Garcia-Romero, J. Fierrez-Aguilar, J. Gonzalez-Rodriguez, and J. Ortega-Garcia, "On the use of quality measures for text-independent speaker recognition," in Proc. Odyssey, Toledo, Spain, May 2004, pp. 105–110.
[53] L. Ferrer, M. Graciarena, A. Zymnis, and E. Shriberg, "System combination using auxiliary information for speaker verification," in Proc. ICASSP, Las Vegas, NV, USA, Mar. 2008, pp. 4853–4856.
[54] Y. Solewicz and M. Koppel, "Using post-classifiers to enhance fusion of low- and high-level speaker recognition," IEEE Trans. Audio, Speech, Lang. Process., vol. 15, no. 7, pp. 2063–2071, Sep. 2007.
[55] K. Kryszczuk, J. Richiardi, P. Prodanov, and A. Drygajlo, "Reliability-based decision fusion in multimodal biometric verification systems," EURASIP J. Adv. Signal Process., vol. 2007, no. 1, pp. 1–9, Jan. 2007.
[56] S. Nicolas, L. Ferrer, M. Graciarena, S. Kajarekar, E. Shriberg, and A. Stolcke, "The SRI NIST 2010 speaker recognition evaluation system," in Proc. ICASSP, Prague, Czech Republic, May 2011, pp. 5292–5295.
[57] G. Doddington, W. Liggett, A. Martin, M. Przybocki, and D. Reynolds, "Sheep, goats, lambs and wolves: A statistical analysis of speaker performance," in Proc. ICSLP, Sydney, Australia, Nov. 1998, pp. 1351–1354.
[58] C. Kim and R. Stern, "Robust signal-to-noise ratio estimation based on waveform amplitude distribution analysis," in Proc. Interspeech, Brisbane, Australia, Sep. 2008, pp. 2598–2601.
[59] L. Ferrer, K. Sonmez, and E. Shriberg, "An anticorrelation kernel for improved system combination in speaker verification," in Proc. Odyssey, Stellenbosch, South Africa, Jan. 2008.
[60] N. Brummer and E. Villiers, "The BOSARIS toolkit: Theory, algorithms and code for surviving the new DCF," in Proc. NIST Speaker Recogn. Eval., Atlanta, GA, USA, Dec. 2011.
[61] H. Boril and J. H. L. Hansen, "Unsupervised equalization of Lombard effect for speech recognition in noisy adverse environments," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 6, pp. 1379–1393, Aug. 2010.
[62] H. Boril and J. H. L. Hansen, "UT-Scope: Towards LVCSR under Lombard effect induced by varying types and levels of noisy background," in Proc. ICASSP, Prague, Czech Republic, May 2011, pp. 4472–4475.
[63] J. Pelecanos and S. Sridharan, "Feature warping for robust speaker verification," in Proc. Odyssey, Crete, Greece, Jun. 2001, pp. 213–218.
[64] T. Hasan, S. O. Sadjadi, G. Liu, N. Shokouhi, H. Boril, and J. H. L. Hansen, "CRSS systems for 2012 NIST speaker recognition evaluation," in Proc. ICASSP, Vancouver, BC, Canada, May 2013, pp. 6783–6787.
[65] T. Hasan, G. Liu, S. O. Sadjadi, N. Shokouhi, H. Boril, A. Ziaei, A. Misra, K. W. Godin, and J. H. L. Hansen, "UTD-CRSS systems for 2012 NIST speaker recognition evaluation," in Proc. NIST Speaker Recogn. Eval., Orlando, FL, USA, Dec. 2012.
[66] N. Brummer, A. Strasheim, V. Hubeika, P. Matejka, L. Burget, and O. Glembek, "Discriminative acoustic language recognition via channel-compensated GMM statistics," in Proc. Interspeech, Brighton, U.K., Sep. 2009, pp. 2187–2190.
[67] J. H. L. Hansen, "Analysis and compensation of speech under stress and noise for environmental robustness in speech recognition," Speech Commun., vol. 20, no. 1, pp. 151–173, Nov. 1996.
[68] G. Liu, T. Hasan, H. Boril, and J. H. L. Hansen, "An investigation on back-end for speaker recognition in multi-session enrollment," in Proc. ICASSP, Vancouver, BC, Canada, May 2013, pp. 7755–7759.

[69] N. Brummer, L. Burget, P. Kenny, P. Matějka, E. de Villiers, M. Karafiát, M. Kockmann, O. Glembek, O. Plchot, D. Baum, and M. Senoussaoui, "ABC system description for NIST SRE 2010," in Proc. NIST Speaker Recogn. Eval., Brno, Czech Republic, Jun. 2010, pp. 1–20.

[70] L. Lamel and J. L. Gauvain, "Speaker recognition with the Switchboard corpus," in Proc. ICASSP, Munich, Germany, Apr. 1997, pp. 1067–1070.
[71] D. A. Reynolds, "The effects of handset variability on speaker recognition performance: Experiments on the Switchboard corpus," in Proc. ICASSP, Atlanta, GA, USA, May 1996, pp. 113–116.
[72] M. Przybocki and A. Martin, "NIST speaker recognition evaluation chronicles," in Proc. Odyssey, Toledo, Spain, May 2004, pp. 15–22.
[73] G. N. Ramaswamy, A. Navratil, U. V. Chaudhari, and R. D. Zilca, "The IBM system for the NIST-2002 cellular speaker verification evaluation," in Proc. ICASSP, Hong Kong, Apr. 2003, pp. 61–64.
[74] M. McLaren, B. Baker, R. Vogt, and S. Sridharan, "Improved SVM speaker verification through data-driven background dataset collection," in Proc. ICASSP, Taipei, Taiwan, Apr. 2009, pp. 4041–4044.
[75] C. S. Greenberg, V. M. Stanford, A. F. Martin, M. Yadagiri, G. R. Doddington, J. J. Godfrey, and J. Hernandez, "The 2012 NIST speaker recognition evaluation," in Proc. Interspeech, Lyon, France, Aug. 2013, pp. 1971–1975.
[76] L. Ferrer, M. McLaren, N. Scheffer, Y. Lei, M. Graciarena, and V. Mitra, "A noise-robust system for NIST 2012 speaker recognition evaluation," in Proc. Interspeech, Lyon, France, Aug. 2013, pp. 1981–1985.
[77] D. Colibro, C. Vair, K. Farrell, N. Krause, G. Karvitsky, S. Cumani, and P. Laface, "Nuance-Politecnico di Torino's 2012 NIST speaker recognition evaluation system," in Proc. Interspeech, Lyon, France, Aug. 2013, pp. 1996–2000.
[78] N. Brummer et al., "ABC system description for NIST SRE 2012," in Proc. NIST Speaker Recogn. Eval., Orlando, FL, USA, Dec. 2012.
[79] Q. Zhang, G. Liu, and J. H. L. Hansen, "Robust language recognition based on hybrid fusion," in Proc. Odyssey '14 Speaker Lang. Recogn. Workshop, Joensuu, Finland, Jun. 2014.

[80] O. Glembek, L. Burget, P. Matejka, M. Karafiat, and P. Kenny, "Simplification and optimization of i-vector extraction," in Proc. ICASSP, Prague, Czech Republic, May 2011, pp. 4516–4519.

Gang Liu received the B.S. degree in electrical engineering from Beijing Jiaotong University, Beijing, China, in 2001 and the M.S. degree in biomedical engineering from Tsinghua University, Beijing, China, in 2007. Since 2008, he has been a Graduate Research Assistant pursuing the Ph.D. degree at the Center for Robust Speech Systems (CRSS), University of Texas at Dallas (UTD), Richardson, USA. He served as the lead CRSS-UTDallas student in the NIST Language Recognition Evaluation (LRE) 2011 and the NIST 2014 Speaker Recognition

i-Vector Machine Learning Challenge submission from the CRSS group. He was a recipient of the IBM Best Student Paper Award at IEEE ICASSP-2013, an ISCA travel grant at Interspeech-2013, and the National Award for Outstanding Self-financed Students Abroad from the China Scholarship Council in 2013. His research interests focus on large data purification, feature extraction, and classifier modeling for robust speaker/language identification in adverse conditions. He has authored/co-authored 30 journal and conference papers in the field of speech processing and language technology and has served as a reviewer for more than 20 journal and conference papers.


John H. L. Hansen (S'81–M'82–SM'93–F'07) received the Ph.D. and M.S. degrees in electrical engineering from the Georgia Institute of Technology, Atlanta, Georgia, in 1988 and 1983, respectively, and the B.S.E.E. degree from Rutgers University, College of Engineering, New Brunswick, NJ, in 1982. He joined the University of Texas at Dallas (UTDallas), Erik Jonsson School of Engineering and Computer Science, in the fall of 2005, where he is presently serving as Jonsson School Associate Dean for Research, as well as Professor of Electrical Engineering, and also

holds the Distinguished University Chair in Telecommunications Engineering. He previously served as Department Head of Electrical Engineering from Aug. 2005 to Dec. 2012, overseeing a +4x increase in research expenditures ($4.5M to $22.3M) with a 20% increase in enrollment and the addition of 18 T/TT faculty, growing UTDallas to be the 8th largest EE program in ASEE rankings in terms of degrees awarded. He also holds a joint appointment as Professor in the School of Behavioral and Brain Sciences (Speech & Hearing). At UTDallas, he established the Center for Robust Speech Systems (CRSS), which is part of the Human Language Technology Research Institute. Previously, he served as Dept. Chairman and Professor of the Dept. of Speech, Language and Hearing Sciences (SLHS), and Professor of the Dept. of Electrical & Computer Engineering, at the Univ. of Colorado Boulder (1998-2005), where he co-founded and served as Associate Director of the Center for Spoken Language Research. In 1988, he established the Robust Speech Processing Laboratory (RSPL) and continues to direct research activities in CRSS at UTD. He has been named an IEEE Fellow (2007) for contributions in "Robust Speech Recognition in Stress and Noise," named an International Speech Communication Association (ISCA) Fellow (2010) for contributions on "research for speech processing of signals under adverse conditions," and received The Acoustical Society of America's 25 Year Award (2010) in recognition of his service, contributions, and membership to the Acoustical Society of America. He is currently serving as Technical Committee (TC) Chair and Member of the

IEEE Signal Processing Society Speech-Language Processing Technical Committee (2005-08; 2010-13; elected IEEE SLTC Chairman for 2011-2013), and elected ISCA Distinguished Lecturer (2012). He has also served as a member of the IEEE Signal Processing Society Educational Technical Committee (2005-08; 2008-10). Previously, he served as the Technical Advisor to the U.S. Delegate for NATO (IST/TG-01), IEEE Signal Processing Society Distinguished Lecturer (2005/06), Associate Editor for the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING (1992-99), Associate Editor for the IEEE SIGNAL PROCESSING LETTERS (1998-2000), and Editorial Board Member for the IEEE SIGNAL PROCESSING MAGAZINE (2001-03). He also served as guest editor of the Oct. 1994 special issue on Robust Speech Recognition for the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING. He has served on the Speech Communications Technical Committee for the Acoustical Society of America (2000-03), and is serving as a member of the ISCA (International Speech Communication Association) Advisory Council. His research interests span the areas of digital speech processing, analysis and modeling of speech and speaker traits, speech enhancement, feature estimation in noise, robust speech recognition with emphasis on spoken document retrieval, and in-vehicle interactive systems for hands-free human-computer interaction. He has supervised 67 Ph.D./M.S. thesis candidates (33 Ph.D., 34 M.S./M.A.), was recipient of the 2005 University of Colorado Teacher Recognition Award as voted by the student body, is author/co-author of 529 journal and conference papers and 11 textbooks in the field of speech processing and language technology, coauthor of the textbook Discrete-Time Processing of Speech Signals (IEEE Press, 2000), co-editor of DSP for In-Vehicle and Mobile Systems (Springer, 2004), Advances for In-Vehicle and Mobile Systems: Challenges for International Standards (Springer, 2006), and In-Vehicle Corpus and Signal Processing for Driver Behavior (Springer, 2008), and lead author of the report "The Impact of Speech Under 'Stress' on Military Speech Technology" (NATO RTO-TR-10, 2000). He also organized and served as General Chair for ISCA Interspeech-2002, Sept. 16-20, 2002, and Co-Organizer and Technical Program Chair for IEEE ICASSP-2010, Dallas, TX. He is serving as Co-Chair and Organizer for IEEE SLT-2014, Dec. 7-10, 2014, in Lake Tahoe, NV.