A Deep Neural Network Approach for Voice Activity Detection in Multi-Room Domestic Scenarios


Giacomo Ferroni, Roberto Bonfigli, Emanuele Principi, Stefano Squartini, and Francesco Piazza
Department of Information Engineering

Università Politecnica delle Marche
Via Brecce Bianche, 60131, Ancona, Italy

{g.ferroni,r.bonfigli,e.principi,s.squartini,f.piazza}@univpm.it

Abstract—This paper presents a Voice Activity Detector (VAD) for multi-room domestic scenarios. A multi-room VAD (mVAD) simultaneously detects the time boundaries of a speech segment and determines the room where it was generated. The proposed approach is fully data-driven and is based on a Deep Neural Network (DNN) pre-trained as a Deep Belief Network (DBN) and fine-tuned by a standard error back-propagation method. Six different types of feature sets are extracted and combined from multiple microphone signals in order to perform the classification. The proposed DBN-DNN multi-room VAD (simply referred to as DBN-mVAD) is compared to two other NN-based mVADs: a Multi-Layer Perceptron (MLP-mVAD) and a Bidirectional Long Short-Term Memory recurrent neural network (BLSTM-mVAD). A large multi-microphone dataset, recorded in a home, is used to assess the performance through a multi-stage analysis strategy comprising multiple feature selection stages alternated with network size and input microphone selections. The proposed approach notably outperforms the alternative algorithms in the first feature selection stage and in the network selection one. In terms of area under the precision-recall curve (AUC), the absolute increment with respect to the BLSTM-mVAD is 5.55%, while with respect to the MLP-mVAD it is 2.65%. Hence, only the proposed approach undergoes the remaining selection stages. In particular, the DBN-mVAD achieves significant improvements: in terms of AUC and F-measure, the absolute increments are equal to 10.41% and 8.56% with respect to the first stage of the DBN-mVAD.

Keywords—Deep Belief Network, Deep Neural Network, Long Short-Term Memory, Multi-room Voice Activity Detection, Speech Activity Detection

I. INTRODUCTION

Knowledge of the time boundaries of the human speech portions contained in an audio signal is fundamental in many audio signal processing applications. In Automatic Speech Recognition (ASR) engines, for instance, this information is used to address the word insertion issue which degrades the overall recognition performance [1]. In audio encoding systems, it is successfully exploited to reduce the bandwidth by avoiding the transmission of silence frames [2]. Several speech enhancement algorithms, such as dereverberation [3] and acoustic noise suppression [4], require the application of the enhancing procedure solely in correspondence to non-speech frames. The automatic detection of human speech time boundaries is known as Voice Activity Detection (VAD) or simply speech detection. In a clean signal, i.e., one not corrupted by noise or with a very high signal-to-noise ratio (SNR), the VAD task can be easily accomplished; however, when the signal is corrupted by noise (e.g., babble noise), speech detection is extremely challenging. Despite these hurdles, the development of novel

and more sophisticated VADs has never ceased, due to their fundamental role in more complex systems. A wide variety of VAD approaches has been developed in the last two decades, and in the remaining part of this section a brief overview is provided. A special focus is devoted to VAD in multi-room environments, which is the topic addressed in the present work.

Early VADs typically detect speech by means of an energy threshold, pitch and zero-crossing rate rules [5]. Later approaches are based on the likelihood ratio test (LRT) with Gaussian [6], generalized gamma distribution (GΓD) [7] or complex Laplacian [8] assumptions in the discrete Fourier transform domain. An improved LRT-based VAD exploits the multiple observation LRT (MO-LRT), which is computed using a wider context [9]. More sophisticated VADs are based on autoregressive models (AR-GARCH) [10], Gaussian mixture models (GMM) [11] or the long-term spectral divergence (LTSD) between speech and noise [12]. Common shallow architectures like the Support Vector Machine (SVM) [13], or simple neural networks [14], have also been employed in voice detection. Recently, promising approaches are based on more complex and deep neural networks (DNN) [15].

Robust performance has been achieved in many fields of application by exploiting these VAD techniques. However, the application of VADs in domestic and multi-room environments deserves special attention and further effort. In particular, this scenario requires the localisation of speech events in space in addition to their detection in time. The state-of-the-art approaches require many processing stages to obtain the final decision, and they are often based on a combination of multiple decisions (e.g., by means of majority voting techniques). The approach proposed in [16] is composed of two separate parts: speech detection and speaker localisation. The speech detection part is performed for each microphone and a majority voting method extracts the final decision. The latter is further combined with the speaker localisation output by means of an additional step (i.e., minimum cost criterion, SVM or simple neural network). Three time functions are calculated from the input audio signal in [17]: the Global Coherence Field (GCF), the signal auto-correlation C and the SNR. Then, by thresholding the maxima of GCF and C, hypotheses on speech event time boundaries are extracted. Successively, a further process is computed to validate the hypotheses. Multi-microphone decision fusion within a Viterbi decoding framework is used in [18]. In particular, the sequence of speech/non-speech events is estimated by the Viterbi algorithm applied to

[Fig. 1: General block scheme of the proposed data-driven multi-room VAD: the signals of Mic1 . . . Micn feed the feature extraction stage, a DBN produces n outputs, and independent thresholding ("Th") followed by a hangover scheme yields the n VAD outputs. n indicates the number of target rooms as well as the number of input microphones used, i.e., one from each of the target rooms.]

the combined scores calculated by single-channel event GMMs. The performance achieved by the state-of-the-art multi-room VADs is not directly comparable to that presented in this work, due to the mismatch among the datasets used, also in terms of amount of data.

Given an input audio signal or an audio stream captured by one microphone, traditional VADs aim to extract the speech portions independently of the type or the intensity of the background noise. In order to perform the more complex operation of identifying the room from which a detected speech segment comes, further processing is needed.

An early application of the DBN-mVAD is presented in [19]; in this paper, however, we present an mVAD parametrization by means of a multi-stage procedure targeted to the selection of features, network sizes and audio channels. An mVAD is able to simultaneously detect the speech activity in multiple target rooms by using the microphone signals of these rooms as its inputs. The proposed mVAD, hence, not only accomplishes both the tasks of speech detection and room identification, but it also performs them using a fully data-driven approach. This means that both decisions are entirely delegated to the neural classifier, without any further processing steps. In particular, the classifier input consists of a feature vector, while a simple hangover scheme finalises the classifier outputs. The proposed approach is compared to two other neural network based mVADs: a Multi-Layer Perceptron (MLP-mVAD) and a Bidirectional Long Short-Term Memory neural network (BLSTM-mVAD). The analysis consists of a multi-stage strategy composed of different parameter selections. Experiments were conducted using a large multi-microphone dataset recorded in a smart home equipped with 40 microphones distributed in 5 rooms, in which various scenes were played by different speakers in the Italian language.

The outline of the paper is as follows. A detailed presentation of the proposed approach is provided in Section II, while the alternative neural network based mVADs are described in Section III. Dataset information is presented in Section IV, while the experimental setup is shown in Section V. Section VI presents the achieved results and a related discussion. Finally, Section VII concludes this contribution.

II. DEEP BELIEF NETWORK MULTI-ROOM VAD

The general block scheme of the proposed DBN-mVAD is shown in Fig. 1. The feature extraction stage transforms the input audio signals into real, informative and non-redundant values (i.e., features) which facilitate the subsequent learning task. These features are then fed to the classifier, which is based on a Deep Belief Network (DBN). Each classifier output undergoes

TABLE I: Indexed list of features, their dimensionality and the acronym used during the experiments. The * indicates that the features are extracted using the openSMILE toolkit [20].

Index   Name               Feature size   Acronym
1       EVM wH             1              Ev
2       PITCH *            1              Pi
3       WC-LPE             24             Wc
4       MFCC12_0_D_Z *     26             Mf
5       RASTAPLP_0_D_A *   54             Ra
6       AMS                135            Am

a thresholding operation, and a final hangover scheme leads to the speech activity in each target room.

A. Feature Extraction

The feature extraction stage extracts six different feature sets from the input signals sampled at 16 kHz. Different frame lengths are used, whilst the frame rate is common and equal to 100 Hz. The other feature-related parameters were chosen by means of numerous dedicated evaluations. In the following sections an exhaustive description of each feature type is provided; a general overview is listed in Table I.

1) Envelope-Variance Measure: The intensity envelope of a reverberated signal undergoes a smoothing effect and, as a consequence, the dynamic range of the signal tends to tail off. The Envelope-Variance Measure (EVM) deals with this effect. In our context, the speech coming from other rooms is affected by significant reverberation due to the multiple paths between the source and the microphones inside the room of interest. The original EVM was proposed as a channel selection technique [21]. Here, the extraction process has been slightly modified in order to obtain a temporal evolution of this feature. The original formulation defines a set of sub-band envelopes as the time sequences of non-linearly compressed filter-bank energies (FBE). Similarly to MFCC computation, the speech signal frame energies are computed and the mean value is subtracted in the log domain from each sub-band:

x̄(k, l) = exp[log(x(k, l)) − µx(k)],  (1)

where x(k, l) is the magnitude of the sub-band mel energy, k is the band index, l is the frame index and µx(k) is the mean value of the k-th band estimated along the entire sub-band signal. The variance of a cube-root compressed version of Eq. (1) is obtained as follows:

V(k) = var[x̄(k, l)^(1/3)].  (2)

To obtain a time-varying version of Eq. (2), we compute the variance using a sliding window of length W shifted along each sub-band time sequence:

EVM(k, l) = var[x̄(k, m)^(1/3)],  −W/2 + l ≤ m ≤ W/2 + l.  (3)

Finally, only the voice-band frequencies (i.e., 300–4000 Hz) are retained, by discarding the other frequency content. This feature set will be denoted as EVM wH.
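For illustration, Eqs. (1)-(3) can be sketched in NumPy as follows. This is a minimal rendering, not the authors' implementation: the mel filter-bank front-end is assumed to run externally, and the window length `win` and `eps` floor are illustrative choices of ours.

    import numpy as np

    def evm(fbe, win=50, eps=1e-10):
        """Time-varying Envelope-Variance Measure of Eqs. (1)-(3).

        fbe : (K, L) array of mel filter-bank energies
              (K sub-bands, L frames at the 100 Hz frame rate).
        win : sliding-window length W, in frames.
        Returns a (K, L) array EVM(k, l).
        """
        K, L = fbe.shape
        # Eq. (1): remove each band's mean in the log domain.
        log_e = np.log(fbe + eps)
        x = np.exp(log_e - log_e.mean(axis=1, keepdims=True))
        c = np.cbrt(x)                      # cube-root compression
        out = np.empty((K, L))
        half = win // 2
        for l in range(L):
            lo, hi = max(0, l - half), min(L, l + half + 1)
            # Eqs. (2)-(3): variance over the window centred on frame l.
            out[:, l] = c[:, lo:hi].var(axis=1)
        return out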

2) Pitch: For VAD purposes, pitch information is fundamental since it is an intrinsic feature of the human vocal range. Here, it is extracted according to the Sub-Harmonic Summation (SHS) method [22]. It computes Nf shifts of the input spectrum along the log-frequency axis; each of them is scaled by a compression factor and they are summed up, leading to

a sub-harmonic summation spectrum. A simple peak picking method and a quadratic curve fitting interpolation are applied to identify the F0 value. The pitch set is extracted using a frame size of 50 ms.
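A rough sketch of the SHS principle follows: the magnitude spectrum is resampled onto a log-frequency grid, Nf compressed copies shifted by log2(n) are summed, and the peak is picked. The parameter values (number of shifts, compression factor, search range, grid resolution) are illustrative assumptions, and the quadratic refinement of the peak used in the paper is omitted.

    import numpy as np

    def shs_pitch(frame, fs, n_shifts=15, compression=0.84,
                  fmin=50.0, fmax=500.0, n_bins=288):
        """F0 estimate via Sub-Harmonic Summation on one windowed frame."""
        spec = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(len(frame), 1.0 / fs)
        # Log-frequency grid covering the pitch range and its harmonics.
        logf = np.linspace(np.log2(fmin), np.log2(fmax * n_shifts), n_bins)
        p = np.interp(2.0 ** logf, freqs, spec)
        step = logf[1] - logf[0]
        shs = np.zeros_like(p)
        for n in range(1, n_shifts + 1):
            # Shifting by log2(n) maps the n-th harmonic onto the F0 bin.
            shift = int(round(np.log2(n) / step))
            shs[:len(p) - shift] += (compression ** (n - 1)) * p[shift:]
        valid = logf <= np.log2(fmax)       # peak picking in [fmin, fmax]
        return float(2.0 ** logf[valid][np.argmax(shs[valid])])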

3) WC-LPE Feature: The Wavelet Coefficient (WC) and Linear Prediction Error (LPE) feature set has recently been developed and exploited for audio onset detection [23]. WC-LPE carries information on the non-stationary parts of the signal and can thus facilitate the identification of speech boundaries. It is based on a multi-resolution analysis conducted by means of the Discrete Wavelet Transform (DWT) of the input signal. In the wavelet domain, a set of Linear Prediction Error Filters (LPEFs) is then applied to each sub-band in order to extract the Forward Prediction Errors (FPE). The latter, the wavelet coefficients and the respective first average derivatives constitute the feature set. To guarantee frame alignment with respect to the other feature sets, the reference frequency has been set to 100 Hz.
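The sketch below conveys the WC-LPE idea under heavy simplification: the frame is decomposed with a DWT (PyWavelets is our choice of library, not dictated by [23]) and, in each sub-band, the forward prediction error of a least-squares linear predictor is computed. The per-coefficient values and first average derivatives of [23] are condensed here into per-band energies.

    import numpy as np
    import pywt  # PyWavelets

    def wc_lpe(frame, wavelet="db4", levels=3, order=4):
        """Per-band wavelet-coefficient and prediction-error energies."""
        feats = []
        for band in pywt.wavedec(frame, wavelet, level=levels):
            if len(band) <= order:
                continue
            # Predict band[t+order] from the `order` previous samples.
            X = np.stack([band[i:len(band) - order + i]
                          for i in range(order)], axis=1)
            y = band[order:]
            a, *_ = np.linalg.lstsq(X, y, rcond=None)
            fpe = y - X @ a                 # forward prediction error
            feats += [float(np.sum(band ** 2)), float(np.sum(fpe ** 2))]
        return np.asarray(feats)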

4) Mel-Frequency Cepstral Coefficients: MFCC is a well-known set of features widely employed in audio applications [24]. It is composed of 13 cepstral coefficients plus their first derivatives. Furthermore, the features are mean normalised and the frame size is equal to 25 ms.

5) RASTA-PLP: This feature set is another widespread speech feature, standing for RelAtive SpecTrAl transform - Perceptual Linear Prediction [25]. PLP aims to warp the spectra in order to minimize the differences between speakers while preserving the main speech information. RASTA is an additional filtering operation that smooths short-term noise variations and removes any constant offset due to spectral colouration in the speech channel. The RASTA-PLP set used in this work is composed of 18 cepstral coefficients, including the 0-th one, plus their first and second derivatives. They are extracted using a frame size of 25 ms.

6) Amplitude Modulation Spectrum: The amplitude modulation spectrogram (AMS) is a spectro-temporal feature set. It contains information on both centre and modulation frequencies within each frame. AMS is based on psycho-physical and psycho-physiological studies of the auditory system of mammals during the processing of amplitude modulations. These findings were exploited in [26] for a binaural noise suppression scheme. The extraction process decomposes the input signal into 9 mel-like sub-bands. From each band, 15 coefficients are computed¹, leading to 135 features per frame.

By combining these feature sets for each frame and indicating the i-th feature set as x(i) (cf. "Index" column of Table I), the feature vector is x_l = {x_l(1), x_l(2), . . . , x_l(6)}, where l is the frame index. The final feature matrix is obtained by concatenating x_l frame after frame. A final min-max normalisation is performed to restrict the feature value range to [0, 1]:

x̄_l = (x_l − x_min)/(x_max − x_min),  (4)

where

x_min = min_{1≤l≤L}(x_l),  x_max = max_{1≤l≤L}(x_l),  (5)

x_l is an element of the feature vector at frame index l and L is the total number of frames in the dataset.
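A minimal sketch of the normalisation of Eqs. (4)-(5), applied column-wise to the L × D feature matrix; the guard against constant features is our addition.

    import numpy as np

    def minmax_normalise(features):
        """Rescale every feature dimension of an (L, D) matrix to [0, 1]."""
        x_min = features.min(axis=0)                 # Eq. (5)
        x_max = features.max(axis=0)                 # Eq. (5)
        span = np.where(x_max > x_min, x_max - x_min, 1.0)
        return (features - x_min) / span             # Eq. (4)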

¹The code is available here: http://ecs.utdallas.edu/loizou/speech/software.htm

B. DBN-DNN based Classifier

A deep belief network is a probabilistic generative model composed of several layers of stochastic, latent and typically binary variables, also referred to as hidden units or feature detectors. The top two layers form an associative memory and are linked by undirected, symmetric connections. The other, lower layers receive top-down, directed connections from the layer above. The states of the lowest-layer variables represent the input data vector. The generative weights, modelling dependencies between the units of one layer and those of the layer above, are tuned by exploiting a greedy layer-by-layer training algorithm [27], also known as pre-training. In particular, it attempts to learn dependencies among input features throughout the DBN layers. Afterwards, a discriminative top layer is added and a supervised learning procedure fine-tunes the whole network. With respect to a deep multi-layer perceptron (MLP) network, DBNs can prevent over-fitting and significantly speed up the convergence of the discriminative supervised learning. DBNs are used to initialise the weights of an MLP, leading to much better results than those achieved by a random initialisation of the MLP weights. A network built in this way is called a DBN-DNN [28]; for simplicity, however, we refer to it as a DBN.

A DBN is obtained by stacking several simpler learning modules: Restricted Boltzmann Machines (RBM). The restricted type of Boltzmann Machine does not allow connections among units of the same layer. A typical RBM is shown in Fig. 2. It has two layers of hidden h = {h1, . . . , hJ} and visible v = {v1, . . . , vI} stochastic units. The weight between the i-th visible unit vi and the j-th hidden unit hj is indicated by wij, while b and a are the bias vectors of the visible and hidden layers, respectively. Bernoulli-Bernoulli RBMs have both layers composed of binary units, while Gaussian-Bernoulli RBMs have real-valued visible units. RBMs can be straightforwardly trained by means of the Contrastive Divergence (CD-k) algorithm, which has proven to be a fast method to approximate the gradient of the log-likelihood log p(v; θ), where θ denotes the model parameters. In principle, k → ∞ iterations of Gibbs sampling are needed to obtain the exact value of the gradient. However, one full step is sufficient to obtain a working approximation. After initialising v0 with the input data vector, the full step consists of: sampling h0 ∼ p(h|v0), sampling v1 ∼ p(v|h0) and finally sampling h1 ∼ p(h|v1). Hence, by means of one full step of the Gibbs sampling method, the weight update rule for the RBM is:

∆wij = ε[〈vi hj〉0 − 〈vi hj〉1],  (6)

where ε is the learning rate, and 〈·〉0 and 〈·〉1 denote averages taken at the data and after one full Gibbs step, respectively.
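A minimal NumPy sketch of one CD-1 update for a Bernoulli-Bernoulli RBM, following the Gibbs step and Eq. (6). Using the hidden probabilities rather than their samples in the statistics, and the batch averaging, are common practical choices of ours, not details from the paper.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v0, W, a, b, lr=0.1, rng=None):
        """One CD-1 step. v0: (B, I) visible batch; W: (I, J) weights;
        a: (J,) hidden biases; b: (I,) visible biases."""
        rng = np.random.default_rng(0) if rng is None else rng
        ph0 = sigmoid(v0 @ W + a)                     # h0 ~ p(h | v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T + b)                   # v1 ~ p(v | h0)
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W + a)                     # p(h | v1)
        B = v0.shape[0]
        # Eq. (6): <v h>_0 - <v h>_1, averaged over the batch.
        W += lr * (v0.T @ ph0 - v1.T @ ph1) / B
        b += lr * (v0 - v1).mean(axis=0)
        a += lr * (ph0 - ph1).mean(axis=0)
        return W, a, b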

[Fig. 2: Restricted Boltzmann Machine: visible units v1, . . . , vI with bias vector b, hidden units h1, . . . , hJ with bias vector a, and connecting weights wi,j.]

[Fig. 3: Deep Belief Network obtained by stacking RBMs (RBM1, RBM2, . . .), topped by a discriminative layer.]

RBMs are trained layer by layer using the CD-1 algorithm, leading to a DBN as shown in Fig. 3. Firstly, RBM1 is pre-trained; then the hidden unit activation probabilities of RBM1 become the visible units of RBM2, and the pre-training algorithm is applied to RBM2. This process proceeds iteratively for each layer in the network. Finally, for classification tasks, a further layer can be added on top and the supervised training algorithm is applied to fine-tune the network weights. The DBN-mVAD uses a discriminative top layer with n units. The network outputs range in [0, 1] and, in order to detect the speech activity, they are individually compared to n thresholds (cf. Fig. 1).
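The greedy stacking can be sketched as follows, reusing `sigmoid` and `cd1_update` from the previous snippet. The full-batch loop and fixed epoch count are simplifications, and treating the [0, 1]-normalised features as Bernoulli probabilities stands in for the Gaussian-Bernoulli first layer one would use with real-valued inputs.

    import numpy as np

    def pretrain_dbn(data, layer_sizes, epochs=10, lr=0.1):
        """Greedy layer-by-layer pre-training: train an RBM, feed its
        hidden activation probabilities to the next RBM, repeat."""
        rng = np.random.default_rng(0)
        weights, v = [], data                 # v: (B, I) visible batch
        for J in layer_sizes:
            I = v.shape[1]
            W = rng.normal(0.0, 0.1, (I, J))
            a, b = np.zeros(J), np.zeros(I)
            for _ in range(epochs):
                W, a, b = cd1_update(v, W, a, b, lr=lr, rng=rng)
            weights.append((W, a))
            v = sigmoid(v @ W + a)            # activations feed next RBM
        return weights                        # initialises the DBN-DNN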

C. Hangover Scheme

A simple post-processing of each DBN decision is needed in order to handle isolated speech detections and to mitigate the early non-speech classifications introduced by the thresholding of the DBN outputs during slow transitions from speech to non-speech. This technique is commonly named hangover, and a number of different implementations have been developed. The simplest method, used in this work, exploits a counter. In particular, if there are at least two consecutive speech frames, the counter is set to a predefined value equal to 8. This value was chosen after performing several experiments. Conversely, for each non-speech frame the counter is decreased by 1, and the current non-speech frame is relabelled as speech until the counter reaches zero.
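The counter-based scheme fits in a few lines; this sketch is our reading of the description (re-arm the counter to 8 after at least two consecutive speech frames, drain it on non-speech frames):

    def hangover(frames, hold=8):
        """frames: iterable of 0/1 thresholded decisions; returns the
        smoothed sequence with trailing non-speech frames bridged."""
        out, counter, run = [], 0, 0
        for f in frames:
            if f == 1:
                run += 1
                if run >= 2:
                    counter = hold        # re-arm on sustained speech
                out.append(1)
            else:
                run = 0
                if counter > 0:
                    counter -= 1
                    out.append(1)         # relabel as speech while draining
                else:
                    out.append(0)
        return out

For example, hangover([1, 1, 0, 0, 0]) keeps the three trailing non-speech frames labelled as speech, since the counter is still draining.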

III. OTHER NEURAL NETWORK BASED CLASSIFIERS

For comparison purposes, two additional types of neural network are used as fully data-driven classifiers (i.e., substituting the DBN in Fig. 1): the MLP-mVAD and the BLSTM-mVAD. A general overview is provided in this section.

A. Multi-Layer Perceptron

The MLP is a kind of feed-forward artificial neural network (ANN) having one input layer, one or more hidden layers and one output layer. Input signals flow through the network in the forward direction only. To train such a network, a supervised learning algorithm tunes the parameters (i.e., weights and biases) by means of a labelled dataset. A well-known approach is steepest descent with error back-propagation. It is composed of two steps: the forward pass and the backward pass. These steps are iterated for several epochs until a stop criterion is satisfied. One shortcoming of the MLP, which can significantly affect the training process, is the weight

TABLE II: Main differences between the Real and Simulated subsets.

                     Real    Simulated
Source               human   loudspeaker
Background           quiet   various
Noise Source Rate    low     high
Overlapping Events   no      yes

initialisation, which is usually accomplished by sampling from a zero-mean Gaussian distribution. A second relevant drawback occurs when the network becomes deeper, i.e., has many hidden layers. In particular, it has been shown [29] that the effect of the back-propagated error on the weight update is negligible at the lowest layers (i.e., those close to the input). Both issues, however, can be mitigated by exploiting a layer-by-layer unsupervised pre-training procedure, peculiar to Deep Belief Networks. The MLP used here has n neurons at the output layer (cf. Fig. 1).

B. Bidirectional Long Short-Term Memory Network

A BLSTM-RNN is a recurrent neural network in which the usual non-linear neurons (i.e., sigmoid functions) are replaced by long short-term memory blocks. The LSTM block is composed of one or more self-connected linear memory cells and three multiplicative gates. The memory cell maintains its internal state for a long time, while its content is controlled by the three gates, which act as memory write, read and reset operations. More details can be found in [30]. The recurrent nature of the network provides a kind of memory in the network's internal state, which is exploited to compute the network output. To also deal with the future context, an elegant solution is to duplicate the hidden layers and connect them to the same input and output. The input values and corresponding output targets are thus processed in both a forward and a backward direction. To train such a network, the back-propagation through time (BPTT) algorithm with stochastic gradient descent is used after weight initialisation. The BLSTM-RNN used here performs a multi-class classification task. Each class corresponds to speech or non-speech in every target room, hence the network has K = 2n output units. The n mVAD decisions are obtained by appropriately recombining and thresholding the K output values (cf. Fig. 1).
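The paper does not detail the recombination; one plausible reading, sketched below purely as an assumption, orders the output units as (speech, non-speech) for each room and thresholds each room's normalised speech score.

    import numpy as np

    def recombine(outputs, thresholds):
        """outputs: (L, 2n) BLSTM outputs per frame, assumed ordered as
        (speech, non-speech) per room; thresholds: (n,) per-room values.
        Returns an (L, n) matrix of binary speech decisions."""
        speech = outputs[:, 0::2]
        nonspeech = outputs[:, 1::2]
        score = speech / (speech + nonspeech + 1e-10)
        return (score > thresholds).astype(int)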

IV. DIRHA DATASET

The dataset, provided by the DIRHA project [31], contains signals recorded in an apartment equipped with 40 microphones installed on the walls and the ceiling of each room, as shown in Fig. 4. In everyday interactions, the majority of speech events is expected to occur in the kitchen and the living room, thus these rooms are selected as our target rooms. The whole dataset is composed of two subsets, called Real and Simulated. The main differences are reported in Table II.

In this work only the Simulated dataset is used, because it contains more data (i.e., 80 minutes of audio, compared to 21 minutes of the Real set) and is characterised by a higher noise source rate and a wider variety of background noises. In addition, unlike in the Real dataset, overlapping events (i.e., speech and noises) may occur. The Simulated dataset is composed of 80 scenes in the Italian language, each 60 seconds long².

²This is the subset of the Simulated DIRHA dataset that was publicly available at the time of development.

[Fig. 4: Layout of the experimental set-up for simulated data: (a) floorplan, (b) kitchen (microphones K1R, K2L, K3C, KA5) and (c) living room (microphones L1C, L2R, L3L, L4R, LA4).]

Each scene consists of localised acoustic and speech events on which different real background noises, with random dynamics, are superimposed. Events occur randomly in time and in space, constrained to the grid of positions for which Room Impulse Response (RIR) measurements are available (cf. squares in Fig. 4a). The dataset is built by convolving the signals with the appropriate RIRs. The Real set, on the contrary, is composed of recordings of actually played scenes. The dataset has 44.8 k total speech frames (at a frame rate of 100 Hz), which represent 9.3% of the whole dataset.

V. EXPERIMENTS

The analysis of the proposed mVAD is conducted by means of a multi-stage strategy whose steps are the following:

1) First feature selection
2) Network size selection
3) Second feature selection
4) Microphone selection
5) Third feature selection

The first feature selection stage consists of 63 tests (i.e., all the combinations of the feature sets described in Section II-A) in which the network size is fixed to two hidden layers of 10 units each and the input feature set is varied. The network size selection stage aims to find the best network topology (i.e., number of layers and number of units per layer) given the best feature set. A second feature selection is explored in order to assess or consolidate the best feature set with the new best topology. Finally, different pairs of microphones are evaluated as network input and a third feature selection is applied to the best pair to finalise the analysis. Stages 1)-3) are run using two specific microphones: one from the kitchen (K2L, cf. Fig. 4b) and one from the living room (L1C, cf. Fig. 4c). These two microphones were chosen because they are separated only by the dividing wall between the two rooms.

The experiments are conducted by means of the n-fold cross-validation technique to reduce the performance variance: the dataset is divided into n subsets having approximately the same proportion of each class. Each subset is then held out in turn for testing, training on the remaining part. Here, n = 10 and a validation set is also employed during training; thus, 8, 1 and 1 fold(s) respectively compose the training, validation and test sets.
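A common realisation of this protocol with scikit-learn is sketched below; the fold rotation for the validation set and the placeholder `features`/`labels` arrays are our assumptions, not the paper's exact split.

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    rng = np.random.default_rng(0)
    features = rng.random((1000, 56))          # (frames, feature dim), placeholder
    labels = rng.integers(0, 3, size=1000)     # per-frame class, placeholder

    # Stratified folds keep the class proportions approximately equal.
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    folds = [test for _, test in skf.split(features, labels)]
    for i in range(10):
        test_idx = folds[i]
        val_idx = folds[(i + 1) % 10]          # next fold validates
        train_idx = np.hstack([folds[j] for j in range(10)
                               if j not in (i, (i + 1) % 10)])
        # ... train with early stopping on val_idx, evaluate on test_idx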

TABLE III: Comparison of training algorithm parameters. BP stands for "back-propagation" and BPTT indicates "back-propagation through time".

mVAD Type   Training algorithm   Weight initialisation              Epochs
MLP         BP                   Gaussian distr. (µ = 0, σ = 0.1)   Early Stopping
DBN         CD-1 + BP            DBN pre-training                   200 + Early Stopping
BLSTM       BPTT                 Gaussian distr. (µ = 0, σ = 0.1)   Early Stopping

The metrics used to evaluate the VAD performance are Precision (P), Recall (R) and F-measure (F). They are computed by summing two separate confusion matrices, i.e., one for each room. In particular, focusing on one room, a false positive is a frame erroneously detected as speech when the speech signal is absent in that room, for example a noise or speech frame coming from the other room classified as speech. Conversely, if such a frame is correctly classified as non-speech, it represents a true negative. A true positive is a frame correctly detected as speech when the speech signal is present in the room, while it represents a false negative if it is wrongly classified as non-speech. Precision and Recall are further exploited to create the P-R curve (PRC) by varying each threshold (cf. Section II-B) from 0 to 1. The area under the PRC (AUC) is the metric used to identify the best performance achieved. For the best performing approach, additional VAD-specific metrics are reported: the false alarm rate (FA), the deletion rate (Del) and the overall speech activity detection error (SAD):

Del = Ndel/Nsp,  FA = Nfa/Nnsp,  SAD = (Nfa + β Ndel)/(Nnsp + β Nsp),  (7)

where Ndel, Nfa, Nsp and Nnsp are the total numbers of deletions, false alarms, speech frames and non-speech frames, respectively. The term β = Nnsp/Nsp acts as a regulator for the class unbalance.
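From the pooled confusion counts, the metrics above and Eq. (7) reduce to a few lines; `beta` follows the definition β = Nnsp/Nsp. A minimal sketch:

    def vad_metrics(n_tp, n_fp, n_tn, n_fn):
        """Frame-level metrics from the pooled confusion counts of the
        two rooms: Precision, Recall, F-measure and the rates of Eq. (7)."""
        precision = n_tp / (n_tp + n_fp)
        recall = n_tp / (n_tp + n_fn)
        f_measure = 2 * precision * recall / (precision + recall)
        n_sp, n_nsp = n_tp + n_fn, n_tn + n_fp   # speech / non-speech totals
        n_del, n_fa = n_fn, n_fp                 # deletions / false alarms
        beta = n_nsp / n_sp                      # class-unbalance regulator
        del_rate = n_del / n_sp                  # Eq. (7)
        fa_rate = n_fa / n_nsp
        sad = (n_fa + beta * n_del) / (n_nsp + beta * n_sp)
        return precision, recall, f_measure, del_rate, fa_rate, sad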

The training algorithm depends on the classifier type. Table III shows a comparison of the training algorithm parameters. In addition, the DBN pre-training is performed for 200 epochs; the momentum is set to 0.9 in the BLSTM and MLP based mVADs, whilst it is equal to 0.3 and 0.8 for the DBN-mVAD pre-training and fine-tuning, respectively. The learning rate is equal to 10⁻⁶, 10⁻⁷ and 0.1/(training set length) for the BLSTM-, MLP- and DBN-mVAD, respectively. Finally, the early stopping criterion stops the training after 20 consecutive epochs with no error decrement.

[Fig. 5: AUC (%) of the DBN-mVAD for each combination of n features, where n = {1, . . . , 6}. Feature legend: Am = AMS, Ra = RASTA-PLP, Pi = Pitch, Ev = EVM wH, Wc = WC-LPE, Mf = MFCC. The best combination reaches an AUC of 73.47%.]

These values were chosen as a result of intensive experiments. Two GPU-based toolkits were employed for the experiments: currennt³ for the BLSTM-mVAD and a custom version of GPUMLib⁴ for the DBN- and MLP-mVAD.

VI. RESULTS

The results are presented in this section according to the adopted multi-stage strategy (cf. Section V). In particular, the first feature selection and the network size selection stages are conducted for all three mVADs, whilst the results of the remaining stages are reported solely for the proposed DBN-based approach, as a consequence of the results achieved.

A. First Feature Selection

The first analysis consists of varying the input feature set while keeping the neural network size fixed. In particular, two hidden layers of 10 units each are used for the three neural networks. Fig. 5 shows the AUC values versus the input feature set for the DBN-mVAD. Specifically, for each combination of n features, where n ranges between 1 and 6, the precision/recall curve is built by varying the decision thresholds over 40 values, with the hangover applied. The points connected in Fig. 5 represent the results for feature sets composed of the same number of subsets, n.

The best feature set resulting from Fig. 5 belongs to the group of n = 3 feature subsets: PiMfEv (i.e., Pitch, MFCC and EVM wH). It is composed of 56 features/frame and leads to an AUC = 73.47%. The same analysis is conducted for the BLSTM-based and the MLP-based mVADs. Their best feature sets are MfWcAm (i.e., MFCC, WC-LPE and AMS, 320 features/frame) and MfRaWcAmEv (i.e., MFCC, RASTA-PLP, WC-LPE, AMS and EVM wH, 444 features/frame), respectively. In terms of AUC, the BLSTM-mVAD reaches 53.84% while the MLP-mVAD achieves 51.39%. The DBN-mVAD performance is significantly higher than the best of the other mVADs (i.e., the BLSTM-mVAD, with an absolute AUC improvement of 19.63%).

B. Network Size Selection

A further analysis is conducted by varying the neural network size. In particular, the best feature set for each mVAD is kept fixed and used as input of more than 250 neural networks

³http://sourceforge.net/projects/currennt/
⁴http://gpumlib.sourceforge.net/

having different topologies. However, only the topologies which achieved AUC > 50% are reported, for the sake of conciseness. This selection begins by evaluating several networks having one hidden layer of different sizes. Then, networks with two and three hidden layers of different sizes are progressively added and tested.

Fig. 6 shows four graphs containing a series of PRCs related to the three mVADs. The PRCs are grouped by the size of the first hidden layer: the PRCs of networks having 10, 20 and 25, 30, and 40 units are respectively shown in the top-left, top-right, bottom-left and bottom-right plots. Fig. 6 clearly shows a significant gap between the PRCs of the DBN approach and those of the other two mVADs. The resulting best BLSTM-mVAD has two hidden layers of 40 LSTM units each and achieves AUC = 70.21%. The best MLP-mVAD AUC is 73.11%, obtained with a single hidden layer of 20 neurons. Despite the relevant improvements of these two mVADs with respect to their AUC in Section VI-A, the highest performance is attained by the DBN-mVAD. In particular, a DBN with two hidden layers of 25 and 15 units leads to AUC = 75.76%, which corresponds to absolute increments of 2.29%, 5.55% and 2.65% with respect to the initial DBN, the best BLSTM and the best MLP topologies, respectively.

C. Second Feature Selection

In the light of the results achieved by the proposed approach with respect to the alternatives, the remaining analysis stages are conducted solely on the DBN-mVAD. In order to obtain more reliable results on the best performing feature set given the new topology (i.e., two hidden layers of 25 and 15 units), a second feature selection analysis is performed. The result is not reported for the sake of conciseness; however, it confirmed that the feature set selected in the first analysis, i.e., PiMfEv, gives the highest AUC.

D. Microphone Selection

The fourth stage of the proposed analysis aims to evaluate different microphones as input for the DBN-mVAD. A set of 7 more microphones is considered in addition to the two initially selected: K1R, K3C and KA5 in the kitchen (cf. Fig. 4b) and L2R, L3L, L4R and LA4 in the living room (cf. Fig. 4c). All the possible pairs of those microphones have been evaluated. For the sake of conciseness, only the results that achieve an

TABLE IV: Results of the microphone selection stage. The first row indicates the pair of microphones initially chosen. The * indicates a significant improvement. The "th" column contains a single value because the differences between the room thresholds are negligible.

Microphones   AUC (%)   ∆AUC    P (%)   R (%)   F (%)   th
K2L-L1C       75.76     -       66.9    74.6    70.5    0.57
K1R-LA4       78.1      9.73    70.5    77.0    73.6    0.60
K3C-LA4       78.11     9.74    69.6    77.6    73.4    0.62
KA5-L2R       80.09     11.97   75.5    74.3    74.9    0.65
KA5-L3L       78.31     9.97    74.0    74.2    74.1    0.65
KA5-L4R       79.03     10.79   71.8    76.0    73.8    0.67
KA5-LA4       82.11*    14.13   75.4*   73.6    74.5*   0.65
KA5-L1C       78.25     9.90    68.9    75.7    72.1    0.57
K2L-L2R       76.23     7.51    70.3    73.6    71.9    0.60
K2L-LA4       79.54     11.36   73.9    73.8    73.9    0.67

AUC greater than that of the initial microphone pair are shown in Table IV. From the table, the best combination of microphones is the pair KA5 and LA4. Indeed, the relative AUC increment is 14.13% with respect to the initially selected pair. The improvement is due to the significant increment of the Precision (+8.5%). The microphones composing the best pair are both installed on the ceiling and, although they are omnidirectional, they seem to better capture the speech of their own rooms to the detriment of the speech coming from the other room.

E. Third Feature Selection

The last stage of the analysis consists of a third feature selection, using the network topology found in Section VI-B and the pair of microphones KA5-LA4. The results show that a further significant performance improvement can be achieved using the feature set PiMfRaWcEv (i.e., Pitch, MFCC, RASTA-PLP, WC-LPE and EVM wH, 176 features/frame). Hence, the DBN classifier that achieves the best performance has 176 input units, two hidden layers of 25 and 15 units and one output layer with two units. The corresponding AUC is equal to 83.88%, which corresponds to an absolute +1.77% with respect to the highest AUC in Section VI-D. Given the best PRC, the thresholds for each DBN output are chosen by maximising the F-measure. In particular, the thresholds are both equal to 0.75 and the resulting performance is P = 77.26%, R = 77.55% and F = 77.40%. The statistical significance of the F-measure increment, i.e., +3.75%, has been assessed by a one-tailed z-test with significance level p < 0.0005 [32].

VII. CONCLUSION

We have presented a data-driven voice activity detection approach based on a deep belief network and applied in a multi-room domestic environment (DBN-mVAD). The DBN-mVAD also identifies the room from which the speech segments have been uttered, without additional processing. Regarding the experiments, an optimization procedure consisting of five stages was performed and the results were compared to two mVADs based on a multi-layer perceptron and on a bidirectional long short-term memory neural network. Throughout the multi-stage analysis, we first identified the best feature set for the three mVADs while keeping the network topology fixed. Then, the feature set was kept fixed and a variety of network topologies was evaluated. The DBN-mVAD proved to be the best performing one, with absolute AUC improvements of +5.55% and +2.65% with respect to the

BLSTM-mVAD and the MLP-mVAD, respectively. For this reason, the remaining optimisation stages were conducted solely on the DBN-mVAD. After a second feature selection, performed using the best DBN topology, the microphone selection was conducted and a different pair of microphones led to a relative ∆AUC of about +14.13% with respect to the initially chosen pair. Finally, a third feature selection led to the best performance, i.e., AUC = 83.88%. Compared to the first stage of the DBN-mVAD, an absolute AUC increment of 10.41% has been achieved whilst, in terms of F-measure, the increment is equal to 8.56%. The corresponding SAD achieved by the best DBN-mVAD configuration is 7.4%, whilst the deletion rate and the false alarm rate are equal to 6.1% and 8.6%, respectively.

Future efforts concern: the exploitation of a larger dataset, especially for enhancing the pre-training; the introduction of signal-processing techniques, such as beamforming, to reduce the intensity of the speech coming from non-target rooms; and the employment of more than two microphones as network inputs.

ACKNOWLEDGEMENT

This research is part of the HDOMO 2.0 project funded by the National Research Centre on Ageing (INRCA) in partnership with the Government of the Marche region under the action "Smart Home for Active and Healthy Ageing".

REFERENCES

[1] E. Principi, S. Squartini, F. Piazza, D. Fuselli, and M. Bonifazi, "A distributed system for recognizing home automation commands and distress calls in the Italian language," in Proc. of Interspeech, Lyon, France, 2013, pp. 2049–2053.

[2] ITU, "A silence compression scheme for G.729 optimized for terminals conforming to Recommendation V.70," ITU-T Rec. G.729, Annex B, 1996.

[3] C. Kim, K. K. Chin, M. Bacchiani, and R. M. Stern, "Robust speech recognition using temporal masking and thresholding algorithm," in Proc. of Interspeech, Singapore, Sep. 14-18 2014, pp. 2734–2738.

[4] S. Boll, "Suppression of acoustic noise in speech using spectral subtraction," IEEE Trans. Acoust., Speech, Signal Process., vol. 27, no. 2, pp. 113–120, 1979.

[5] K.-H. Woo, T.-Y. Yang, K.-J. Park, and C. Lee, "Robust voice activity detection algorithm for estimating noise spectrum," Electronics Letters, vol. 36, no. 2, pp. 180–181, 2000.

[6] J. Sohn, N. S. Kim, and W. Sung, "A statistical model-based voice activity detection," IEEE Signal Process. Lett., vol. 6, no. 1, pp. 1–3, 1999.

[7] J. W. Shin, J.-H. Chang, H. S. Yun, and N. S. Kim, "Voice activity detection based on generalized gamma distribution," in Proc. of ICASSP, Philadelphia, PA, Mar. 18-25 2005, pp. 781–784.

[8] J.-H. Chang, N. S. Kim, and S. K. Mitra, "Voice activity detection based on multiple statistical models," IEEE Trans. Signal Process., vol. 54, no. 6, pp. 1965–1976, 2006.

[9] J. Ramírez, J. C. Segura, C. Benítez, L. García, and A. Rubio, "Statistical voice activity detection using a multiple observation likelihood ratio test," IEEE Signal Process. Lett., vol. 12, no. 10, pp. 689–692, 2005.

[10] S. Mousazadeh and I. Cohen, "AR-GARCH in presence of noise: parameter estimation and its application to voice activity detection," IEEE Trans. Audio, Speech, Language Process., vol. 19, no. 4, pp. 916–926, 2011.

[11] J. Wu and X.-L. Zhang, "An efficient voice activity detection algorithm by combining statistical model and energy detection," EURASIP Journal on Advances in Signal Processing, vol. 2011, no. 1, pp. 1–10, 2011.

[Fig. 6: Comparison of Precision-Recall Curves (PRCs) of the three mVADs having different sizes of the neural classifier. The curves are grouped by the network's first layer size, s: the top-left graph shows PRCs for s = 10, the top-right PRCs are related to s = 20 and s = 25, the bottom-left graph contains PRCs for s = 30 units and the bottom-right for s = 40 units. The best-performing curve is highlighted in each plot. Legend AUCs — s = 10: DBN (10) 73.89%, DBN (10,15) 75.22%, DBN (10,15,4) 69.29%, BLSTM (10) 58.44%, BLSTM (10,10) 53.84%, MLP (10) 71.80%, MLP (10,10) 51.38%; s = 20/25: DBN (20) 74.63%, DBN (20,4) 75.47%, DBN (20,4,8) 72.37%, DBN (25,15) 75.76%, DBN (25,15,4) 74.59%, BLSTM (20) 67.51%, BLSTM (20,30) 62.49%, MLP (20) 73.11%, MLP (20,4) 57.30%; s = 30: DBN (30) 75.71%, DBN (30,20) 75.11%, DBN (30,20,8) 73.07%, BLSTM (30) 63.98%, BLSTM (30,40) 66.93%, MLP (30) 62.01%, MLP (30,10) 58.19%; s = 40: DBN (40,20) 75.45%, DBN (40,20,10) 72.10%, DBN (40,4) 68.45%, BLSTM (40) 69.84%, BLSTM (40,40) 70.21%, MLP (40,4) 62.84%, MLP (40,10) 57.80%.]

[12] J. Ramírez, J. C. Segura, C. Benítez, A. De La Torre, and A. Rubio, "Efficient voice activity detection algorithms using long-term speech information," Speech Communication, vol. 42, no. 3, pp. 271–287, 2004.

[13] J. W. Shin, J.-H. Chang, and N. S. Kim, "Voice activity detection based on statistical models and machine learning approaches," Computer Speech & Language, vol. 24, no. 3, pp. 515–530, 2010.

[14] T. Hughes and K. Mierle, "Recurrent neural networks for voice activity detection," in Proc. of ICASSP, Vancouver, BC, Canada, Mar. 26-31 2013, pp. 7378–7382.

[15] X.-L. Zhang and D. Wang, "Boosted deep neural networks and multi-resolution cochleagram features for voice activity detection," in Proc. of Interspeech, Singapore, Sep. 14-18 2014, pp. 1534–1538.

[16] Y. Tachioka, T. Narita, S. Watanabe, and J. Le Roux, "Ensemble integration of calibrated speaker localization and statistical speech detection in domestic environments," in Proc. of HSCMA, Florence, Italy, May 12-14 2014, pp. 162–166.

[17] A. Brutti, M. Ravanelli, P. Svaizer, and M. Omologo, "A speech event detection and localization task for multiroom environments," in Proc. of HSCMA, Florence, Italy, May 12-14 2014, pp. 157–161.

[18] P. Giannoulis, A. Tsiami, I. Rodomagoulakis, A. Katsamanis, G. Potamianos, and P. Maragos, "The Athena-RC system for speech activity detection and speaker localization in the DIRHA smart home," in Proc. of HSCMA, Florence, Italy, May 12-14 2014, pp. 167–171.

[19] G. Ferroni, R. Bonfigli, E. Principi, S. Squartini, and F. Piazza, "Neural Networks Based Methods for Voice Activity Detection in a Multi-room Domestic Environment," in Proc. of EVALITA, Pisa, Italy, Dec. 11 2014, pp. 153–158.

[20] F. Eyben, F. Weninger, F. Groß, and B. Schuller, "Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor," in Proc. of ACM Int. Conf. on Multimedia, Barcelona, Spain, Oct. 21-25 2013, pp. 835–838.

[21] M. Wolf and C. Nadeu, "Channel selection measures for multi-microphone speech recognition," Speech Communication, vol. 57, pp. 170–180, 2014.

[22] D. J. Hermes, "Measurement of pitch by subharmonic summation," The Journal of the Acoustical Society of America, vol. 83, no. 1, pp. 257–264, 1988.

[23] E. Marchi, G. Ferroni, F. Eyben, L. Gabrielli, S. Squartini, and B. Schuller, "Multi-resolution Linear Prediction Based Features for Audio Onset Detection with Bidirectional LSTM Neural Networks," in Proc. of ICASSP, Florence, Italy, May 4-9 2014, pp. 2183–2187.

[24] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357–366, 1980.

[25] H. Hermansky and N. Morgan, "RASTA processing of speech," IEEE Trans. Speech Audio Process., vol. 2, no. 4, pp. 578–589, 1994.

[26] B. Kollmeier and R. Koch, "Speech enhancement based on physiological and psychoacoustical models of modulation perception and binaural interaction," J. Acoust. Soc. Am., vol. 95, no. 3, pp. 1593–1602, 1994.

[27] G. Hinton, S. Osindero, and Y. Teh, "A fast learning algorithm for deep belief nets," Neural Computation, vol. 18, no. 7, pp. 1527–1554, 2006.

[28] L. Deng, "A tutorial survey of architectures, algorithms, and applications for deep learning," APSIPA Trans. on Signal and Information Process., vol. 3, 2014.

[29] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies," in A Field Guide to Dynamical Recurrent Neural Networks, S. C. Kremer and J. F. Kolen, Eds. Wiley-IEEE Press, 2001.

[30] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[31] L. Cristoforetti, M. Ravanelli, M. Omologo, A. Sosi, A. Abad, M. Hagmüller, and P. Maragos, "The DIRHA simulated corpus," in Proc. of LREC, vol. 5, Reykjavik, Iceland, May 26-31 2014.

[32] M. D. Smucker, J. Allan, and B. Carterette, "A comparison of statistical significance tests for information retrieval evaluation," in Proc. of ACM Conf. on Information and Knowledge Management, Lisboa, Portugal, Nov. 6-9 2007, pp. 623–632.