
Citation: Maji, B.; Swain, M.; Mustaqeem. Advanced Fusion-Based Speech Emotion Recognition System Using a Dual-Attention Mechanism with Conv-Caps and Bi-GRU Features. Electronics 2022, 11, 1328. https://doi.org/10.3390/electronics11091328

Academic Editors: Dorota Kamińska, Gholamreza Anbarjafari, Frane Urem, Rui Raposo and Mário Vairinhos

Received: 16 March 2022; Accepted: 20 April 2022; Published: 22 April 2022

Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).


Article

Advanced Fusion-Based Speech Emotion Recognition System Using a Dual-Attention Mechanism with Conv-Caps and Bi-GRU Features

Bubai Maji 1, Monorama Swain 1 and Mustaqeem 2,*

1 Department of Electronics and Communication Engineering, Silicon Institute of Technology, Bhubaneswar 751024, India; [email protected] (B.M.); [email protected] (M.S.)

2 Interaction Technology Laboratory, Department of Software, Sejong University, Seoul 05006, Korea
* Correspondence: [email protected]

Abstract: Recognizing the speaker's emotional state from speech signals plays a very crucial role in human–computer interaction (HCI). Nowadays, numerous linguistic resources are available, but most of them contain samples of a discrete length. In this article, we address the leading challenge in Speech Emotion Recognition (SER), which is how to extract the essential emotional features from utterances of a variable length. To obtain better emotional information from the speech signals and increase the diversity of the information, we present an advanced fusion-based dual-channel self-attention mechanism using convolutional capsule (Conv-Cap) and bi-directional gated recurrent unit (Bi-GRU) networks. We extracted six spectral features (Mel-spectrograms, Mel-frequency cepstral coefficients, chromagrams, the contrast, the zero-crossing rate, and the root mean square). The Conv-Cap module was used to obtain Mel-spectrograms, while the Bi-GRU was used to obtain the rest of the spectral features from the input tensor. The self-attention layer was employed in each module to selectively focus on optimal cues and determine the attention weight to yield high-level features. Finally, we utilized a confidence-based fusion method to fuse all high-level features and pass them through the fully connected layers to classify the emotional states. The proposed model was evaluated on the Berlin (EMO-DB), Interactive Emotional Dyadic Motion Capture (IEMOCAP), and Odia (SITB-OSED) datasets to improve the recognition rate. During experiments, we found that our proposed model achieved high weighted accuracy (WA) and unweighted accuracy (UA) values, i.e., 90.31% and 87.61%, 76.84% and 70.34%, and 87.52% and 86.19%, respectively, demonstrating that the proposed model outperformed the state-of-the-art models using the same datasets.

Keywords: affective computing; bi-directional gated recurrent unit; convolutional capsule network; confidence-based fusion; speech emotion recognition

1. Introduction

The evolution of innovative technologies has made our daily life more comfortable. Consequently, speech is the most convenient communication method with increasing interactions between humans and machines. People can understand other ways of communicating emotions relatively easily, such as body gestures [1], facial expressions [2], and image classification [3], and can successfully detect emotional states (e.g., happiness, anger, sadness, and neutrality). Speech-based emotion recognition tasks are receiving increasing attention in many areas, such as forensic sciences [4], customer support call review and analysis [5], mental health surveillance, intelligent systems, and quality assessments of education [6]. Deep learning technology has accelerated the recognition of emotions from speech, but research on SER still has deficiencies, such as shortages of training data and inadequate model performance [7,8]. Moreover, emotion classification from speech in real time is difficult due to several dependencies, such as speakers, cultures, genders, ages, and dialects [9].


Most importantly, the feature extraction step in an SER task can effectively fill the gap between speech samples and the corresponding emotional states. Various hand-crafted features, such as the pitch, duration, formant, intensity, and linear prediction coefficient (LPC), have been used in SER so far [10,11]. However, these hand-crafted features suffer from a limited capability and low accuracy in building an efficient SER system [12] because they are at a lower level and may not be discriminative enough to predict the respective emotional category [13]. Nowadays, spectral features are more popular and usable than traditional hand-crafted features as they extract more emotional information by considering the time and frequency [14]. Due to these advantages, scholars have carried out a considerable amount of research using spectral features [15–17]. However, these features cannot adequately express the emotions in an utterance. Therefore, to overcome the limitations of traditional hand-crafted features and spectral features, high-level deep feature representations need to be extracted using efficient deep learning algorithms for SER from speech signals [12]. Over the last few years, researchers have introduced different deep learning algorithms using various discriminative features [18–20]. At present, deep learning methods still use several low-level descriptor (LLD) features, which are different from traditional SER features [21]. So, extracting more detailed and relevant emotional information from speech is the first issue we have to address.

Newly emerging deep learning methods may provide possible solutions to this issue. Among them, two of the most common are Convolutional Neural Networks (CNNs) [22] and Recurrent Neural Networks (RNNs), such as long short-term memory (LSTM) and Gated Recurrent Units (GRUs) [23,24]. A model with an RNN and a CNN was built to represent high-level features from low-level data for SER tasks [25]. The CNN represents a high-level feature map, while the RNN extracts long-term temporal contextual information from the low-level acoustic features to identify the emotional class [26]. Several studies [14,27] have been successfully carried out to learn features in speech signal processing using CNNs and RNNs.

Moreover, most speech emotion databases contain only utterance-level class labels, and not all parts of utterances hold emotional information. They contain unvoiced parts, short pauses, background sounds, transitions between phonemes, and so on [28]. Unfortunately, CNNs and RNNs are not able to efficiently deal with this situation and analyze acoustic features extracted from voices [29]. So, we must distinguish the emotionally relevant parts and determine whether the speech frame contains voiced or unvoiced parts. Capsule networks (CapNets) have been used to overcome the drawback of CNNs in capturing spatial information [30]. CapNets have been utilized in various tasks and demonstrated to be effective [31–33]. A capsule-based network was developed for emotion classification tasks [30]. Meanwhile, we applied a self-attention mechanism, emphasizing the capture of prominent features [34]. Because of this, we moved to the self-attention mechanism from the traditional attention mechanism.

Although these models with deep learning approaches are efficient and adaptable to practical situations, there remain some significant challenges to overcome. These are as follows:

• The key challenge is that the durations of the speech signals are not equal. The traditional machine learning model truncates and pads the data to obtain input features with a fixed shape, affecting the model efficiencies;

• The local features obtained by the CNN may not contain relevant data because some emotional information may be hidden in utterances with a variety of durations. Previous studies [35,36] consecutively combined networks for emotion recognition. The superiority of these models lies in the fact that they are very simple to design for segment-level features. However, in consecutive models, the inheritance relationship within the structures may lose some emotional information and may not capture the global contextual features.

Therefore, a novel system is required to independently learn the local and global contextual emotional features in speech utterances of varied length. We were inspired by the RNNs and capsule networks that have been used in many fields to achieve high accuracy and better generalizability, especially in time-series data. To reduce the redundancy and enhance the complementarity, we propose a novel fusion-based dual-channel self-attention deep learning model using a combination of a CNN and a capsule network (Conv-Cap) and a Bi-GRU (shown in Figure 1), as presented in Figure 2. The proposed model effectively uses the advantages of different features in parallel inputs to increase the performance of the system. The final issue is to design a suitably sized input array to meet the needs of other models' performance.


Figure 1. The Bi-GRU architecture with input, hidden, and output layers.


Figure 2. A visual representation of the proposed speech emotion recognition system using capsule and bi-directional GRU networks with a dual self-attention mechanism and a decision-level fusion technique for final classification.

The contributions and innovations of this study can be summarized as follows:

• We propose an efficient speech emotion recognition architecture utilizing a modified capsule network and Bi-GRU modules in two streams with a self-attention mechanism. Our model is capable of learning spatial and temporal cues through low-level features. It automatically models the temporal dependencies. Therefore, we aim to contribute to the SER literature by using different spectral features with an effective parallel input mode and then fusing the sequential learning features;

• We propose and explore for the first time a Conv-Cap and Bi-GRU feature-learning-based architecture for SER systems. To the best of our knowledge, the proposed methods are novel. We pass spectrograms through the Conv-Cap network and Mel-frequency cepstral coefficients, chromagrams, the contrast, the zero-crossing rate, and the root mean square through the Bi-GRU network in parallel to enhance the feature learning process in order to capture more detailed emotional information and to improve baseline methods;

• We propose a dual-channel self-attention strategy to selectively focus on important emotional cues and ensure the system's performance using discriminative features. We use a novel learning strategy in attention layers, which utilize the outputs of the proposed capsule network and sequential network simultaneously and capture the attention weight of each cue. Moreover, the proposed SER system's time complexity is much lower than that of the sole model;

• We demonstrate the significance of a fusion strategy by using a confidence-based fusion technique that ensures the best outcomes for these integrated, separately learned features by using a fully connected network (FCN). We demonstrate the effectiveness and efficiency of our proposed method on the IEMOCAP and EMO-DB databases and our own Odia speech corpus (SITB-OSED). We conducted extensive experimentation and obtained results that outperformed state-of-the-art approaches for emotion recognition.

The remainder of this paper is arranged as follows. A review of the literature on SER is given in Section 2. Section 3 provides the details of our proposed system. Section 4 presents the datasets, the experimental setup, and a comparative analysis that demonstrates the model's efficacy and reliability. Finally, our conclusions and future research directions are provided in Section 5.

2. Literature Review

SER is an active field of research, and researchers have designed numerous strategies over the last few years. Before the adoption of advanced deep learning strategies, researchers mostly applied complex hand-crafted features (e.g., eGeMaps, ComParE, IS09, etc.) for SER with conventional machine learning approaches such as k-nearest neighbor (kNN), support vector machine (SVM), the Gaussian Mixture Model (GMM), and the Hidden Markov Model (HMM) [37–39]. Since then, scholars have been greatly inspired by the growing use of deep learning methods for SER tasks. In [40], the authors implemented the first deep learning model for SER. Then, in [41], the authors utilized a CNN model to learn the salient emotional features. Later, in [42], the authors implemented a CNN model to obtain emotional information from spectrograms to identify the speech emotion [43]. A spectrogram is a two-dimensional visualization of a speech signal, is used in 2D-CNN models to extract high-level discriminative features, and has become more prevalent in this era [44]. Zhao et al. [45] extracted features using different spectrogram dimensions using CNNs and passed them through to an LSTM network. The LSTM network was used to learn global contextual information from the resulting features of the CNN. Kwon et al. [46] employed a new method to determine the best sequence segments using Radial Basis Functions (RBFs) and KNN. The authors used the selected vital sequence of the spectrogram and extracted features using a CNN. The obtained features were normalized and classified using bi-directional long short-term memory (BiLSTM). The researchers found that 2D CNN-LSTM networks work much better than traditional classifiers. In addition to 2D CNN [47] networks, some studies have also used 1D CNNs and achieved satisfying performance in SER. For example, in [48] Mustaqeem et al. proposed a multi-learning framework using a one-dimensional dilated CNN and a bi-directional GRU sequentially. The authors utilized the residual skip connection to learn the discriminative features and the long-term contextual dependencies. In this way, the authors improved the local learned features from the speech signal for SER. In [17], a new model was built using five spectral features (the MFCC, the chromagram, the Mel-scale spectrogram, the spectral contrast, and the Tonnetz representation). The authors used a 1D CNN to learn the features for the classification results. However, transfer learning methods can also train pre-trained networks, such as AlexNet [3] and VGG [49], for SER using spectrograms.

Moreover, some researchers concluded that LSTM performs better when using a CNN to extract high-level features. The LSTM model is well suited to addressing emotional analysis problems, but it is still time-consuming and difficult to train in parallel for large datasets. Cho et al. [24] employed a GRU, which has a shorter training time, has fewer parameters than the LSTM network, and can capture global contextual features. The GRU network is an advanced type of RNN and similar to LSTM [50]. It is easier and less complex to train than the LSTM network, which can help to increase the training efficiency. It has only two gates: an update gate and a reset gate. The function of the update gate is the same as that of the LSTM network's forget gate and input gate, which make decisions on deleted information and newly stored information, respectively. The reset gate decides how to combine the previous information with the newly stored information and determines the degree to which past information will be forgotten [51].

For time step $t$, the input feature sequence of the $n$-th utterance $x_{n,t}$ (where $x$ is the feature vector of the utterance) is encoded into forward ($\overrightarrow{h}_{n,t}$) and backward ($\overleftarrow{h}_{n,t}$) hidden states. Then, the two hidden states are used together at the same time and the output $\overleftrightarrow{h}_{n,t}$ is jointly calculated by them sequentially, making the result more robust. The formulas for the calculation are as follows:

$$\overrightarrow{h}_{n,t} = \overrightarrow{\mathrm{GRU}}\left(W_{x,\overrightarrow{h}}\, x_{n,t} + b_{\overrightarrow{h}}\right) \quad (1)$$

$$\overleftarrow{h}_{n,t} = \overleftarrow{\mathrm{GRU}}\left(W_{x,\overleftarrow{h}}\, x_{n,t} + b_{\overleftarrow{h}}\right) \quad (2)$$

$$\overleftrightarrow{h}_{n,t} = \overrightarrow{h}_{n,t} \oplus \overleftarrow{h}_{n,t} \quad (3)$$

where $W_{x,\overrightarrow{h}}$ and $W_{x,\overleftarrow{h}}$ are the weight matrices of the forward and backward directions of the GRU layer and $b_{\overrightarrow{h}}$ and $b_{\overleftarrow{h}}$ are the biases of the forward and backward directions of the GRU layer, respectively. Compared with the LSTM network, the number of parameters, the complexity of the GRU model, and the experimental costs are reduced [50].
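To make Equations (1)–(3) concrete, the sketch below implements a single GRU step (with the update and reset gates described above, under one common gating convention) and a bidirectional pass in plain NumPy; the dimensions, weight names, and random inputs are illustrative assumptions rather than the authors' actual parameterization.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step: update gate z, reset gate r, candidate state h_tilde."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev + p["bz"])        # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev + p["br"])        # reset gate
    h_tilde = np.tanh(p["Wh"] @ x_t + p["Uh"] @ (r * h_prev) + p["bh"])
    return (1.0 - z) * h_prev + z * h_tilde

def bi_gru(X, p_fwd, p_bwd, hidden):
    """Eqs. (1)-(3): run the sequence forward and backward, then concatenate (⊕)."""
    T = X.shape[0]
    h_f, h_b = np.zeros(hidden), np.zeros(hidden)
    fwd, bwd = [], []
    for t in range(T):                        # forward direction, Eq. (1)
        h_f = gru_step(X[t], h_f, p_fwd)
        fwd.append(h_f)
    for t in reversed(range(T)):              # backward direction, Eq. (2)
        h_b = gru_step(X[t], h_b, p_bwd)
        bwd.append(h_b)
    bwd.reverse()
    return np.concatenate([np.stack(fwd), np.stack(bwd)], axis=-1)  # Eq. (3)

def init_params(in_dim, hidden, rng):
    p = {}
    for k in ("Wz", "Wr", "Wh"):
        p[k] = rng.standard_normal((hidden, in_dim)) * 0.1
    for k in ("Uz", "Ur", "Uh"):
        p[k] = rng.standard_normal((hidden, hidden)) * 0.1
    for k in ("bz", "br", "bh"):
        p[k] = np.zeros(hidden)
    return p

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 61))            # e.g., 120 frames of 61-dim LLD features
H = bi_gru(X, init_params(61, 32, rng), init_params(61, 32, rng), hidden=32)
print(H.shape)                                # (120, 64): forward ⊕ backward states
```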

In addition, combinations of CNN and LSTM models were proposed by Trigeorgis et al. [35] to solve the problem of extracting context-aware emotion-related features and obtain a better representation from the raw audio signal. They did not use any hand-crafted features, such as linear prediction coefficients and MFCCs. The CNN and LSTM network combination has received a great deal of attention compared with the FCN network for various tasks. Similarly, Tzirakis et al. [52] used CNN and LSTM networks to capture the spatial and temporal features of audio data.

Furthermore, the attention mechanism is often used with deep neural networks. Chorowski et al. [53] first proposed a sample attention mechanism for speech recognition. Recurrent neural networks and attention mechanism approaches have made outstanding achievements in emotion recognition and allowed us to focus on obtaining more emotional information [54]. Rajamani et al. [55] adopted a new approach using an attention-mechanism-based Bi-GRU model to learn contextual information and affective influences from previous utterances to help recognize the current utterance's emotion. The authors used low-level descriptor (LLD) features such as MFCCs, the pitch, and statistics. Zhao et al. [56] proposed a method that can automatically learn spatiotemporal representations of speech signals. However, the self-attention mechanism has distinct advantages over the attention mechanism [57]. It can determine the different weights in the frame with different emotional intensities and the autocorrelation between frames [58]. The self-attention mechanism is an upgraded version of the attention mechanism. The formula for the attention algorithm is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{QK^{T}}{\sqrt{D_k}}\right)V \quad (4)$$

In Equation (4), $Q$, $K$, and $V$ denote the queries, keys, and values of the input matrix, and the dimension of the keys is denoted $D_k$. When $Q = K = V$, it is a self-attention process. Self-attention models can learn semantic information, such as word dependencies in sentences, more easily.
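As a minimal sketch of Equation (4), the snippet below computes scaled dot-product self-attention over a sequence of frame features in NumPy; setting the queries, keys, and values to the same matrix reproduces the self-attention case described above. The shapes and the single-head formulation are illustrative assumptions.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Eq. (4): Softmax(Q K^T / sqrt(D_k)) V, applied row-wise."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # frame-to-frame similarity
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # attention weights per frame
    return weights @ V

rng = np.random.default_rng(0)
H = rng.standard_normal((120, 64))    # e.g., 120 Bi-GRU output frames of dimension 64
context = scaled_dot_product_attention(H, H, H)       # self-attention: Q = K = V
print(context.shape)                  # (120, 64)
```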


Although CNNs have provided extremely successful results in the computer vision field and signal classification, they have some limitations [59]. Recently, capsule networks (CapNets) have been proposed to overcome the drawback of CNNs in capturing spatial information. CNNs cannot learn the relationship between the parts of an object and the positions of objects in an image. Additionally, traditional CNNs will lose some semantic information during the forward propagation of the pooling layer, making it difficult to identify the spatial relationship. To overcome these issues, Hinton et al. built capsule networks to maintain the positions of objects within the image and model the spatial relationships of the object's properties [30]. A capsule contains a couple of neurons and is conveyed in the form of vectors. The vector's length exhibits the probability of the activity described by the capsule and also captures the internal representation parameters of the activity. The capsule and a squashing function are used to obtain the vector form of each capsule. The output of capsule $k$ at each time step $t$ is defined as $q_k^t$:

$$q_k^t = d(x_k) \quad (5)$$

where the input to capsule $k$ is $x_k$ and the squashing function is $d(\cdot)$. The squashing function is represented as

$$d(x_k) = \frac{\|x_k\|^2}{1 + \|x_k\|^2}\,\frac{x_k}{\|x_k\|} \quad (6)$$

Here, all inputs of the capsule $x_k$ are the weighted summation of the prediction vectors, represented as $u_{k|l}$, except for the first capsule layer. The prediction vector is calculated from the lower layer of the capsule, whose output $u_l$ is multiplied by a weight matrix $W_{lk}$. Then, the impact of the perspective is modeled by matrix multiplication.

$$x_k = \sum_l c_{lk}\, u_{k|l} \quad (7)$$

where $c_{lk}$ defines the coupling coefficients, which are calculated by the routing softmax function as expressed in Equation (8).

$$c_{lk} = \frac{\exp(a_{lk})}{\sum_j \exp(a_{lj})} \quad (8)$$
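The sketch below, a simplified illustration rather than the authors' implementation, applies the squashing function of Equations (5)–(6) and the routing-softmax coupling of Equations (7)–(8) in a small dynamic-routing loop; the three routing iterations match the setting reported later in Section 3.2.2, while all tensor sizes are assumed for illustration.

```python
import numpy as np

def squash(x, eps=1e-8):
    """Eqs. (5)-(6): scale each vector so its length lies in [0, 1)."""
    norm_sq = np.sum(x ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * x / np.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, iterations=3):
    """u_hat: prediction vectors of shape (n_lower, n_upper, dim).
    Returns upper-capsule outputs of shape (n_upper, dim)."""
    n_lower, n_upper, _ = u_hat.shape
    a = np.zeros((n_lower, n_upper))                       # routing logits a_lk
    for _ in range(iterations):
        a_shift = a - a.max(axis=1, keepdims=True)         # numerical stability
        c = np.exp(a_shift) / np.exp(a_shift).sum(axis=1, keepdims=True)   # Eq. (8)
        x = (c[..., None] * u_hat).sum(axis=0)             # Eq. (7)
        q = squash(x)                                      # Eqs. (5)-(6)
        a += np.einsum("lkd,kd->lk", u_hat, q)             # agreement updates the logits
    return q

rng = np.random.default_rng(0)
u_hat = rng.standard_normal((32, 7, 16))   # 32 lower capsules, 7 emotion capsules, 16-dim
emotion_capsules = dynamic_routing(u_hat, iterations=3)
print(np.linalg.norm(emotion_capsules, axis=-1))  # capsule lengths act as class confidences
```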

Wu et al. [32] implemented a new approach using a capsule network for SER tasks. The authors use spectrograms as the inputs of their model. The capsules are used to minimize the information loss in feature representation and retain more emotional information, which are important in SER tasks. In 2019, Jalal et al. [60] implemented a hybrid model based on BLSTM, a 1D Conv-Cap, and capsule routing layers for SER. Ng and Liu [61] used a capsule-network-based model to encode spatial information from speech spectrograms and analyze the performance under various loss functions on several datasets.

Moreover, fusion strategies are used to fuse the different types of features to make a decision on the classification of different modeling methods and reach a more sensible conclusion. Su et al. [62] developed a model with frame-level acoustic features using an RNN network and an SVM model with HSFs as inputs. The confidence score was determined using the probabilities of the RNN and SVM models. The confidence scores of these two models were averaged for the final classification of each class as the fusion confidence score. Yao et al. [21] used the decision-level fusion method after obtaining the outputs from three classifiers (HSF-DNN, MS-CNN, and LLD-RNN) for final class prediction.

Considering the limited number of speech samples, we also adopted a decision-level fusion method in this study to fulfill the goal of developing a high-performance SER system. We developed a new approach for SER tasks to efficiently process emotional information through the Conv-Cap and Bi-GRU modules from speech features. After that, we used the self-attention layer in each module to explore the autocorrelation of phonemes between the features. In addition, due to the limited amount of data, we simultaneously applied different learning strategies to manage and classify emotional states. Finally, we integrated a confidence-based fusion approach with the probability score of two models to identify the final emotional state. To the best of our knowledge, this is the first time the proposed framework has been implemented in emotion recognition, and we demonstrate that it is more efficient than other state-of-the-art methods.

3. Proposed Methodology

The main architecture of our proposed fusion-based dual-channel self-attention model using Conv-Cap and Bi-GRU networks is illustrated in Figure 2. This architecture requires two different parallel inputs (spectral LLD features and Mel-spectrograms), and they are input simultaneously into the proposed model. Specifically, we passed the Mel-spectrograms through to the Conv-Cap module and the spectral LLDs (Mel-frequency cepstral coefficients, a chromagram, the contrast, the zero-crossing rate, and the root mean square) to the Bi-GRU module. Our proposed dual-channel model extracts information sequentially from Mel-spectrogram and spectral LLD features. Moreover, we employed a self-attention mechanism to concentrate on the more salient emotional information. We separately trained each module with separate inputs. To achieve a higher probability for each class, we applied fusion techniques at the decision level. The details of our proposed method are described in the following sections.

3.1. Data Preprocessing

A feature extraction step is essential to building a successful machine learning model. Relevant features support a well-trained model, whereas irrelevant features considerably obstruct the training process [17]. We utilized the Librosa audio library for the feature extraction process [63]. We extracted six different types of spectral features: Mel-spectrogram, Mel-frequency cepstral coefficient, chromagram, contrast, zero-crossing rate (ZCR), and root mean square (RMS) features. Here, the Conv-Cap module uses the Mel-spectrogram feature, while the rest of the features (the MFCC, chromagram, contrast, ZCR, and RMS features) pass through the Bi-GRU module. The short-time Fourier transform (STFT) with a window length of 32 ms was used to obtain the Mel-spectrograms. We used Mel-spectrograms with 64 Mel bins with speech signals with a sampling rate of 16 kHz for each utterance. The scaled Mel-spectrograms were resized in the time dimension to the appropriate input size of the model. On the other hand, the total dimension of the spectral LLD features was 1 × 61.
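A minimal Librosa sketch of this preprocessing step is shown below, assuming 16 kHz audio, a 32 ms (512-sample) STFT window, and 64 Mel bins as stated above; the hop length, the time pooling of the LLDs, and the split of the 61 dimensions (here 40 MFCCs + 12 chroma + 7 contrast bands + ZCR + RMS) are plausible assumptions, not values reported by the authors.

```python
import librosa
import numpy as np

def extract_features(path, sr=16000, n_fft=512, hop_length=256, n_mels=64):
    y, sr = librosa.load(path, sr=sr)

    # Input to the Conv-Cap branch: log Mel-spectrogram with 64 Mel bins.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)

    # Input to the Bi-GRU branch: spectral LLDs summarized into a 1 x 61 vector (assumed).
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)           # 40 dims (assumed)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)             # 12 dims
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)     # 7 dims
    zcr = librosa.feature.zero_crossing_rate(y)                  # 1 dim
    rms = librosa.feature.rms(y=y)                               # 1 dim
    lld = np.concatenate([f.mean(axis=1) for f in (mfcc, chroma, contrast, zcr, rms)])

    return log_mel, lld   # shapes: (64, n_frames), (61,)
```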

3.2. Module Architecture

3.2.1. Convolutional Neural Network

CNNs have successfully been employed in several emotion-related tasks, such as automatic speech recognition (ASR) and emotion identification [41]. We initially used pre-trained networks such as AlexNet [3] instead of fine-tuning the CNN. The pre-trained AlexNet [3] model required a large amount of data because it was trained on a dataset containing millions of images. In our experiment, the pre-trained model did not provide good accuracy because of the limited amount of speech data. So, we created our own CNN network. We randomly ran models with various numbers of convolutional layers, kernel sizes and filters, and other hyper-parameters several times and selected the optimal number of convolutional layers and other parameters based on the values that gave more efficient and stable results.

The output of the fine-tuned CNN architecture is then passed through to the CapNet module. A key contribution of our work is applying deep features to each capsule state to learn the temporal representation of different abstract features. The CNN network consists of three convolutional layers and two max-pooling layers, as shown in Figure 2.

We directly applied the Mel-spectrograms as inputs into the CNN network to capture each syllable's temporal dependency information. After the first convolutional layer, the rectified linear unit (ReLU) activation function is used with batch normalization and dropout layers. The first convolutional layer consists of 256 filters with a kernel size of 5 × 5 and a stride of 2 × 2. The second and third convolutional layers consist of 128 and 64 filters with kernel sizes of 5 × 5 and 3 × 3, respectively. After the first and second convolutional layers, we employed a max-pooling layer with a size of 2 × 2. To prevent the model from overfitting, we used a dropout rate of 0.2. Batch normalization kept the mean of the feature map close to 0 and the standard deviation close to 1. The final CNN layer produced a feature representation of a size that the CapNet could handle. The details of the capsule operation are described in the next section.
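Under the layer settings listed above (256/128/64 filters, 5 × 5 and 3 × 3 kernels, 2 × 2 pooling, batch normalization, and a dropout rate of 0.2), a Keras sketch of this front end might look as follows; the input spectrogram size, padding choices, and the exact placement of normalization and dropout are assumptions made for illustration.

```python
from tensorflow.keras import layers, models

def build_cnn_front_end(input_shape=(64, 128, 1)):
    """CNN front end that feeds the capsule module: 3 conv layers, 2 max-pool layers."""
    inputs = layers.Input(shape=input_shape)            # e.g., 64 Mel bins x 128 frames
    x = layers.Conv2D(256, (5, 5), strides=(2, 2), activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Dropout(0.2)(x)
    x = layers.MaxPooling2D((2, 2))(x)

    x = layers.Conv2D(128, (5, 5), activation="relu", padding="same")(x)
    x = layers.MaxPooling2D((2, 2))(x)

    x = layers.Conv2D(64, (3, 3), activation="relu", padding="same")(x)
    return models.Model(inputs, x, name="conv_front_end")

cnn = build_cnn_front_end()
cnn.summary()   # the final feature map is reshaped into primary capsules downstream
```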

3.2.2. Capsule Network

In this study, output matrices of the final CNN layer were applied to form a capsule and a squashing function to obtain the vector form of each capsule to minimize the loss of semantic information. We set the iteration number of the routing algorithm to 3. Dynamic routing can retain the local spatial features through the feature transfer method [64]. The capsule's output is squashed, and the vector representations of the features learned by that capsule are presented above in Equations (5) and (6). The routing capsule layer is then linked to the previous layer via the transformation matrices in the same way as the fully connected layer in the CNN. In this layer, the number of capsules is equal to the number of emotion classes, each with a length of 16 dimensions, and the output layer of each capsule is connected to the previous layer's capsule. The previous layer's capsule determines the prediction of the next layer's capsule's output, which is calculated from the outputs of these capsules using Equations (7) and (8) above. The output layer of the capsules determines the posterior probabilities.

3.2.3. Bi-Directional Gated Recurrent Unit

RNNs can incorporate all of the preamble vocabulary in the language information set and handle the sequence length of the signal. However, RNNs have the vanishing and exploding gradient problems. LSTM and GRU networks have "gates" that allow them to selectively affect the information in every state of the model to overcome the above problems.

We consider a Bi-GRU network over the standard one-directional GRU because the Bi-GRU network can handle more sequential data with two directions. A Bi-GRU has two GRU layers that pass the input independently through to the forward and reverse directions as shown in Figure 1.

The proposed dual-channel model uses a Bi-GRU layer to learn global semantic information from the LLD input matrix. We used 256 hidden units in each unidirectional GRU layer and spliced the output into the self-attention layer. Thus, the output node at each time step carries all of the past and future information of the present instant in the input sequence.
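In Keras terms, this branch can be sketched as a bidirectional GRU with 256 units per direction whose concatenated outputs feed a self-attention step; the LLD sequence length, the pooling at the end, and the use of Keras' built-in Attention layer are assumptions made for this illustration, not the authors' exact layer stack.

```python
from tensorflow.keras import layers, models

def build_bigru_branch(timesteps=None, feat_dim=61, units=256):
    """Bi-GRU branch over the spectral LLD sequence, followed by self-attention."""
    inputs = layers.Input(shape=(timesteps, feat_dim))         # variable-length LLD frames
    h = layers.Bidirectional(layers.GRU(units, return_sequences=True))(inputs)
    # Self-attention: queries, keys, and values are all the Bi-GRU outputs (Q = K = V).
    attended = layers.Attention(use_scale=True)([h, h])
    pooled = layers.GlobalAveragePooling1D()(attended)         # one high-level vector
    return models.Model(inputs, pooled, name="bigru_branch")

branch = build_bigru_branch()
branch.summary()
```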

3.2.4. Self-Attention Mechanism

After the operations of the Conv-Cap and Bi-GRU modules, the output of each module is passed through to a self-attention layer to reduce the redundancy of its external information in order to capture a better internal correlation between the features. After the self-attention method is applied, we obtain a one-dimensional high-level feature vector as presented below:

$Q_1 = K_1 = V_1 = H_{sa1}$ for the Conv-Cap layer and $Q_2 = K_2 = V_2 = H_{sa2}$ for the Bi-GRU layer.

3.2.5. Confidence-Based Fusion Method

The outputs from the two self-attention layers are then passed through the fully connected layers and determine the confidence score for each module, which is denoted $C_1(s)$ for the Conv-Cap module and $C_2(s)$ for the Bi-GRU module. Then, we concatenate $C_1(s)$ and $C_2(s)$ to obtain a fusion confidence score, which is denoted $C_f(s)$. While each classifier used different features for specific utterances as inputs, we included two modules to enhance the recognition rate. Specifically, we used the confidence-based decision-level fusion method using the sum of confidence scores mentioned in [21]. The confidence scores were calculated individually from the outputs of the two classifiers using the Softmax function of the fully connected layer and then classified.

Let $C_n(s)$ denote the confidence score for the corresponding emotional class $s$, where $n$ ranges over the two classifiers (the Conv-Cap module and the Bi-GRU module). $C_1(s)$ and $C_2(s)$ are summed to form the fusion confidence score $C_f(s)$ of a given emotion category $s$, which can be determined by Equation (9).

$$C_f(s) = \sum_n C_n(s) = C_1(s) + C_2(s) \quad (9)$$

The emotion class with the highest fusion confidence score will be output as the final predicted result, $p = \arg\max_s C_f(s)$.
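A minimal sketch of this confidence-based decision-level fusion, assuming each branch ends in a softmax over the same set of emotion classes, is shown below; the class list and scores are illustrative values, not outputs of the actual model.

```python
import numpy as np

EMOTIONS = ["anger", "happiness", "neutral", "sadness"]   # e.g., the IEMOCAP setup

def fuse_and_predict(conv_cap_scores, bi_gru_scores):
    """Eq. (9): sum the per-class softmax confidences and pick the arg max."""
    fused = np.asarray(conv_cap_scores) + np.asarray(bi_gru_scores)   # C_f(s)
    return EMOTIONS[int(np.argmax(fused))], fused

# Softmax outputs C_1(s) and C_2(s) from the two branches for one utterance.
c1 = [0.10, 0.55, 0.25, 0.10]   # Conv-Cap module
c2 = [0.05, 0.40, 0.45, 0.10]   # Bi-GRU module
label, fused = fuse_and_predict(c1, c2)
print(label, fused)             # happiness, fused scores [0.15, 0.95, 0.70, 0.20]
```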

4. Experimental Setup and Results

4.1. Dataset Description

To analyze the effect of the proposed model, we performed speaker-independent SER tasks on the IEMOCAP, EMO-DB, and Odia datasets. The Odia database [65], also known as the Odia Speech Emotion dataset (SITB-OSED), was recorded by the Silicon Institute of Technology in Bhubaneswar, India. It was recorded by 12 actors (6 male and 6 female) expressing six different emotions (anger, fear, surprise, sadness, happiness, and disgust). The database contains 7317 Odia utterances in total. The average length of each audio file is 4 s, the sampling frequency is 22.05 kHz, and a 16-bit quantization rate is used. The number of utterances in each class and their participation (%) are described in Table 1. The database will be publicly available at https://www.speal.org/sitb-osed/ (accessed on 24 June 2021).

Table 1. The number of utterances in each class and their participation (%) in the SITB-OSED dataset.

Emotion      No. of Utterances    Participation (%)
Happiness    1197                 16.36
Surprise     1187                 16.22
Anger        1197                 16.36
Sadness      1309                 17.89
Fear         1196                 16.34
Disgust      1231                 16.82

The IEMOCAP [66] dataset contains five sessions and a total of 12 h of audio–video data in English. Each session was recorded by one male and one female speaker in both scripted and improvised scenarios. This database contains 10,039 utterances with an average length of 4.5 s and a sample rate of 16 kHz. There are ten discrete emotions (neutral, happiness, excitement, frustration, disgust, sadness, surprise, anger, fear, and other). In the experiments, we merged the excitement emotion into the happiness emotion and considered four emotions (happiness, neutral, anger, and sadness) for a fair comparison with prior studies. The final dataset contained a total of 5531 utterances, and a detailed description of the four emotions is provided in Table 2.

Table 2. The number of utterances in each class and their participation (%) in the IEMOCAP dataset.

Emotion      No. of Utterances    Participation (%)
Neutral      1708                 30.88
Happiness    1636                 29.58
Anger        1103                 19.94
Sadness      1084                 19.60


We also used the EMO-DB [67] dataset, which has been extensively utilized by researchers in the SER field, in order to make consistent comparisons with previous work. This database contains 535 utterances by 10 German speakers (5 males and 5 females) with seven emotion categories (anger, fear/anxiety, boredom, disgust, sadness, happiness, and neutral) with a sampling rate of 48 kHz. The average length of each sample is approximately 2 to 3 s. Table 3 presents detailed information on each class. For our experiments, all three datasets were down-sampled to 16 kHz.

Table 3. The number of utterances in each class and their participation (%) in the EMO-DB dataset.

Emotion      No. of Utterances    Participation (%)
Happiness    71                   13.27
Anger        127                  23.74
Sadness      62                   11.59
Neutral      79                   14.76
Fear         69                   12.90
Boredom      81                   15.14
Disgust      46                   8.60

4.2. Experimental Setup

The proposed dual-channel structure has several other hyper-parameters. We optimized some of the hyper-parameters to obtain the optimal model.

To train the model, we performed a five-fold cross-validation technique. We split the datasets in the ratio of 80:05:15 (%). A total of 80% of the data were used for training, 5% of the data were used for validation, and the remaining 15% of the data were used for the model evaluation. We calculated the UA and WA and constructed a confusion matrix to analyze the system's performance.
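For reference, weighted accuracy (WA) is the overall proportion of correctly classified utterances, while unweighted accuracy (UA) is the average of the per-class recalls; the sketch below computes both with scikit-learn on toy labels that are not taken from the paper.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

y_true = ["anger", "anger", "sadness", "neutral", "happiness", "happiness"]
y_pred = ["anger", "neutral", "sadness", "neutral", "happiness", "anger"]

wa = accuracy_score(y_true, y_pred)            # weighted accuracy: overall hit rate
ua = balanced_accuracy_score(y_true, y_pred)   # unweighted accuracy: mean per-class recall
cm = confusion_matrix(y_true, y_pred, labels=["anger", "happiness", "neutral", "sadness"])
print(f"WA = {wa:.2%}, UA = {ua:.2%}")
print(cm)
```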

The model used a batch size of 16 and a maximum of 100 epochs, including early stopping callbacks with a patience parameter value of 10 [68]. If the training process did not get better after 10 epochs, it automatically stopped and saved the best model for evaluation. The Adam [69] optimizer was used with a learning rate of 0.0001 and a decay rate of $10^{-6}$. Our experimental configuration was an NVIDIA QUADRO P620 GPU with 16 GB of memory using the Python version 3.7.10 environment, TensorFlow 2.3.0, and the Keras 2.4.0 library.
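This training configuration can be expressed in Keras roughly as below (TensorFlow 2.3 / Keras 2.4 era API); the stand-in model, toy data, monitored metric, and checkpoint path are assumptions, while the batch size, epoch limit, patience, learning rate, and decay come from the text above.

```python
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.optimizers import Adam

# A stand-in classifier; the real model is the fused Conv-Cap / Bi-GRU network.
model = models.Sequential([layers.Input(shape=(61,)),
                           layers.Dense(64, activation="relu"),
                           layers.Dense(4, activation="softmax")])

model.compile(optimizer=Adam(learning_rate=1e-4, decay=1e-6),
              loss="categorical_crossentropy", metrics=["accuracy"])

callbacks = [EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
             ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True)]

# Toy data standing in for the extracted features and one-hot emotion labels.
x = np.random.rand(200, 61).astype("float32")
y = np.eye(4)[np.random.randint(0, 4, 200)]

model.fit(x[:160], y[:160], validation_data=(x[160:], y[160:]),
          batch_size=16, epochs=100, callbacks=callbacks)
```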

4.3. Experimental Results

To illustrate the effect of our dual-channel model, experiments were carried out on two modules with different parallel inputs (Conv-Cap and Bi-GRU). Each module of the model individually identified emotions using different inputs, with the other parameters remaining unchanged. Table 4 provides detailed evaluations of the proposed and individual modules in terms of WA and UA on the three databases. Note that the proposed classifier was more robust in recognizing emotions than the other individual classifiers. The WA of the proposed model was 90.31%, 76.84%, and 87.52%, and the UA of the proposed model was 87.61%, 70.34%, and 86.19%, on the EMO-DB, IEMOCAP, and Odia datasets, respectively. The proposed model exhibited a significant improvement in performance.

The Conv-Cap classifier was observed to be slightly more potent in recognizing emotions than the Bi-GRU classifier, except on the IEMOCAP dataset. This may have been because the duration of the speech samples in the IEMOCAP dataset is about 1 s to 15 s. A CNN requires a specific input size, so a fully convolutional layer may not learn proper emotional information through a sample with a long duration. In contrast, our Bi-GRU module can learn more time-related information due to the increased speech frame number. Using this advantage, the Bi-GRU module performed better than the Conv-Cap module on the IEMOCAP dataset. To further develop the performance, we used parallel training with multiple features as inputs to balance the gap in the emotional information between the two modules. Finally, it was able to learn all of the emotional information of a complete utterance.

Table 4. A comparison of the performance of different architectures of the proposed model in terms of weighted accuracy (WA) and unweighted accuracy (UA) using different datasets.

Database     Model             WA (%)    UA (%)
EMO-DB       Conv-Cap          86.35     82.57
             Bi-GRU            83.44     78.73
             Proposed model    90.31     87.61
IEMOCAP      Conv-Cap          68.27     62.75
             Bi-GRU            70.21     65.68
             Proposed model    76.84     70.34
SITB-OSED    Conv-Cap          83.69     82.30
             Bi-GRU            85.02     83.84
             Proposed model    87.52     86.19

Furthermore, we constructed a confusion matrix to examine the executions of each module and the proposed model. Figure 3 shows the predicted results of the confusion matrix on the EMO-DB dataset for the different modules. According to the experimental results, we passed two different input features through to the dual-channel architecture that may contain some special emotional information, enabling the proposed model to capture the contextual information better. Based on a comparison (Figure 3a–c), we discovered that the proposed model achieved a much better recognition rate than the single modules. The proposed model can capture the emotional information of features better. We can conclude from the confusion matrix that every emotional state was recognized with a good recognition rate (higher than 85%). The highest recognition accuracy was achieved for anger, fear, happiness, and neutral. In contrast, boredom, disgust, and sadness were classified with a moderate recognition accuracy that was similar to that of the two single modules or classifiers. In this way, the proposed model was able to draw on the advantages of the parallel architecture and pay more attention to specific emotions of different features.


Figure 3. The confusion matrix on the EMO-DB dataset. (a) The Conv-Cap module with a UA of 82.57%, (b) the Bi-GRU module with a UA of 78.73%, and (c) the proposed model with a UA of 87.61%.

Figure 4a–c show that, on the IEMOCAP dataset, the recognition rate of all of the modules for the happiness class is low (around 40% to 50%). All of the algorithms found happiness difficult to detect at a reasonable performance rate.


Figure 4. The confusion matrix on the IEMOCAP dataset. (a) The Conv-Cap module with a UA of 62.75%, (b) the Bi-GRU module with a UA of 65.68%, and (c) the proposed model with a UA of 70.34%.

However, the dual-channel model had the highest rates of neutral, sadness, and anger recognition. Regarding the rate of happiness recognition, the method proposed in this paper occupies the middle position. In addition, it is known that, for all types of algorithms, this dataset yields a high recognition rate for neutral and anger and a low recognition rate for sadness and happiness. This is because the training dataset contains a large number of anger samples, so the features of anger are easier to learn. In each emotion category, the number of samples is not balanced.

Another reason for this result may be that actors can express anger and sadness in a good way while performing. Previous works have already reported [32,70] that happiness is difficult to identify, which may be due to the merging of the happiness samples with the excitement samples and its unique emotional characteristics, i.e., the emotion class of happiness depends on contextual contrastive information to a greater extent than the other emotion classes.

Finally, as can be seen from Figure 5a–c, the confusion matrix on the SITB-OSED dataset, the accuracy of the predictions obtained by the proposed dual-channel method is much higher than that of the individual modules. Figure 5a,b show the confusion matrices of the Conv-Cap module and the Bi-GRU module. The Bi-GRU module performed significantly better than the Conv-Cap module in all emotions except sadness. The reason for this may be that the actors did not express this emotion as well as the other emotions while recording, and perhaps the model was not able to capture relevant emotional information. Figure 5c shows the performance of our proposed model, which gives better recognition results for anger, disgust, happiness, surprise, and fear than the individual modules (a classification accuracy of 86.22%, 85.12%, 84.02%, 87.21%, and 83.58%, respectively). The proposed model identified the sadness emotion at a similar rate to the individual modules. From the experimental results, it is evident that the dual-channel model provides better recognition performance than a cascade connection with combined features.

4.4. Comparative Analysis

To further investigate our proposed approach, its recognition accuracy was compared with that of several state-of-the-art methods. Tables 5–7 present a detailed comparison between our proposed methodology and previously reported results on the IEMOCAP, EMO-DB, and SITB-OSED databases. The results from the tables indicate that the proposed system can minimize the gap in emotional features through the CapNet and the self-attention mechanism, and the Bi-GRU can improve the diversity of the information. Note that some previous works only employ the UA, which better reflects imbalances among emotional states when evaluating recognition performance. However, we have presented both the WA and the UA for a fair comparison.


Figure 5. The confusion matrix on the SITB-OSED dataset. (a) The Conv-Cap module with a UA of 82.30%, (b) the Bi-GRU module with a UA of 83.84%, and (c) the proposed model with a UA of 86.19%.

Table 5. Performance (%) comparison between our proposed method and state-of-the-art methods on the EMO-DB dataset.

Database: EMO-DB
Method (Refs.)                  WA (%)   UA (%)
Sun et al. (2015) [71]          81.74    -
Issa et al. (2020) [17]         86.10    -
Chen et al. (2018) [36]         -        82.82
Mustaqeem et al. (2020) [46]    -        85.57
Li et al. (2021a) [29]          85.95    82.06
Jiang et al. (2019) [15]        86.44    84.53
Li et al. (2021b) [58]          83.30    82.10
Chen et al. (2021) [72]         85.42    -
Mustaqeem et al. (2021) [47]    90.01    -
Proposed model                  90.31    87.61

Table 6. Performance (%) comparison between our proposed method and state-of-the-art methods on the IEMOCAP dataset.

Database: IEMOCAP
Method (Refs.)                  WA (%)   UA (%)
Lee et al. (2015) [72]          62.80    63.90
Satt et al. (2017) [70]         68.80    59.40
Li et al. (2018) [73]           71.80    68.10
Wu et al. (2019) [32]           72.73    59.71
Yao et al. (2020) [21]          57.10    58.30
Issa et al. (2020) [17]         64.30    -
Meyer et al. (2021) [26]        -        64.50
Shirian et al. (2021) [74]      64.19    60.31
Rajamani et al. (2021) [55]     66.90    68.30
Mustaqeem et al. (2021) [47]    73.01    -
Proposed model                  76.84    70.34

Table 7. Performance (%) comparison between our proposed method and a state-of-the-art method on the SITB-OSED dataset.

Database: SITB-OSED
Method (Refs.)                  WA (%)   UA (%)
Swain et al. (2021) [65]        62.36    -
Proposed model                  87.52    86.19

Table 5 shows the comparison between our proposed model and seven recently published baseline methods on the EMO-DB dataset. From Table 5, we can see that our proposed model outperforms the baseline methods on the EMO-DB dataset in terms of WA, and the recognition rate is 8.57%, 7.01%, 4.89%, 4.36%, 4.21%, 3.87%, and 0.30% higher compared with [15,17,29,47,58,71,72], respectively. We also reported a UA of 87.61%, which is higher than the UA of five baseline models, i.e., the baseline models with a UA of 85.57%, 84.53%, 82.82%, 82.10%, and 82.06% [15,29,36,46,58].
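The quoted WA margins can be checked directly against the values in Table 5. A quick sketch, assuming each margin is simply the proposed WA (90.31%) minus the WA of a baseline that reports one:

```python
# Baseline WA values copied from Table 5 (only methods that report WA)
baseline_wa = {
    "[71]": 81.74, "[17]": 86.10, "[29]": 85.95, "[15]": 86.44,
    "[58]": 83.30, "[72]": 85.42, "[47]": 90.01,
}
proposed_wa = 90.31
for ref, wa in baseline_wa.items():
    # Differences are percentage points of weighted accuracy
    print(f"{ref}: +{proposed_wa - wa:.2f} points")
```

The resulting set of differences matches the seven margins quoted above.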

Chen et al. [36] used an attention-based convolutional recurrent neural network to learn discriminative features for SER. Jiang et al. [15] used a parallel connection of CNN and LSTM networks that take frame-level features and 3D log Mel-spectrograms simultaneously. This shows the advantages of the parallel model, which provides the best performance. In 2021, Li et al. [29] proposed a directional self-attention mechanism-based BLSTM model to decode features in two directions for SER. Li et al. [58] introduced a new spatiotemporal and frequency-based cascaded attention network with large-margin learning. This cascaded attention network focused on helping the model extract an adequate amount of emotional information from a large spectrogram and demonstrated satisfactory performance in SER.

Secondly, we compared the effectiveness of our proposed method with that of nine baseline methods on IEMOCAP. Lee et al. [72] utilized a Bi-LSTM network for SER with 32 features (including the F0, the zero-crossing rate, the voice probability, the 12-dimensional MFCCs, and their first-order derivatives) as inputs. Another, different approach used a cascade combination of a CNN and LSTM with raw audio spectrograms as inputs [70]. Li et al. [73] used CNN networks with two groups of filters to extract a 2D spectrogram and then fed it to the subsequent convolutional layers. The attention pooling method was employed to learn the ultimate emotional representation. Wu et al. [32] developed a model based on a recurrent CapNet that reviews the spatial relationship of activities in spectrograms for SER. Recently, Meyer et al. [26] implemented a model with a CNN in combination with a BLSTM layer and fully connected layers. The authors integrated a multi-kernel-width CNN, applied a 3D log Mel-spectrogram as an input, and introduced an effective training method. Shirian et al. [74] proposed an efficient Graph Convolution Network (GCN) architecture with the Interspeech 2009 emotion feature set. In 2021, Rajamani et al. [55] employed a novel attention-based rectified linear unit (AReLU) activation function with GRU and Bi-GRU cells. The authors used the pitch, MFCCs, and statistics in each frame of an utterance for SER.

Table 6 shows that the recognition results of the proposed model are impressive compared with all of the above-mentioned baseline models on the IEMOCAP dataset.

Our model clearly outperforms the nine baseline methods in terms of WA (19.74%, 12.54%, 9.94%, 8.04%, 14.04%, 5.04%, 12.65%, 4.11%, and 3.83% higher values). Additionally, the recognition accuracy is 10.94% and 10.63% higher than that in [32,70], respectively, whereas the UA is 5.84% and 2.04% higher than that in [26,55], respectively.

Finally, our proposed model was also verified on the SITB-OSED dataset. Table 7 compares the performance of the proposed model with that of only one recently published baseline model on the SITB-OSED dataset because only one such work has been done previously in the Odia language. In our previous work [65], we achieved a WA of 62.36% by combining CNN and GRU networks and using only 1440 audio samples. For this experiment, we used 7317 recently created utterances to improve the model's stability and accuracy. The proposed model achieved a WA of 87.52% and a UA of 86.19%.

5. Conclusions and Future Research Directions

Most SER models are unable to distinguish between the features that contribute to emotion recognition and the ones that do not. In order to use more emotional information, we simply passed the different features through a hybrid architecture based on the Conv-Cap and Bi-GRU modules at the same time. The proposed model, which extracts Mel-spectrograms and other spectral LLDs, as well as spatial and temporal cues, outperforms traditional machine learning algorithms using the strengths of feature extraction. Each module focuses on a specific emotion class using a dual self-attention mechanism. The Conv-Cap module is capable of considering temporal information, can consider the spatial relationships of activities within the speech segments, and can create an utterance-level feature representation for emotion recognition. At the same time, the Bi-GRU network addresses the temporal dynamics of speech as sequential data. It generates better contextual representations using the bi-directional layer. Our proposed model outperforms several baseline models in terms of both weighted and unweighted accuracy, demonstrating the efficiency of the CapNet on speech data. The results reveal that the proposed framework can better learn high-level abstractions of emotion-related features and overcome the problem with conventional machine learning models using an identical feature input. The experimental results demonstrate good performance, with a weighted accuracy and an unweighted accuracy of 90.31% and 87.61%, 76.84% and 70.34%, and 87.52% and 86.19% on the EMO-DB, IEMOCAP, and SITB-OSED datasets, respectively.
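To make the dual-channel layout described above concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: an ordinary convolutional stack stands in for the Conv-Cap module, a simple additive attention pooling stands in for the self-attention layers, plain concatenation replaces the confidence-based fusion, and all layer sizes, feature dimensions, and names (DualChannelSER, SelfAttention) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Additive self-attention pooling over time steps (illustrative stand-in)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x):                        # x: (batch, time, dim)
        w = torch.softmax(self.score(x), dim=1)  # attention weights over time
        return (w * x).sum(dim=1)                # (batch, dim)

class DualChannelSER(nn.Module):
    """Dual-channel skeleton: a CNN branch for the Mel-spectrogram and a Bi-GRU
    branch for frame-level spectral features, pooled by attention and fused
    before the classifier. A plain CNN replaces the Conv-Cap module here."""
    def __init__(self, n_mels=128, n_lld=33, hidden=128, n_classes=7):
        super().__init__()
        # Channel 1: convolutional front end over the Mel-spectrogram
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),     # collapse frequency axis
        )
        self.att_conv = SelfAttention(64)
        # Channel 2: Bi-GRU over the remaining frame-level spectral features
        self.bigru = nn.GRU(n_lld, hidden, batch_first=True, bidirectional=True)
        self.att_gru = SelfAttention(2 * hidden)
        # Fusion + classifier (concatenation here; the paper fuses by confidence)
        self.classifier = nn.Sequential(
            nn.Linear(64 + 2 * hidden, 128), nn.ReLU(), nn.Linear(128, n_classes)
        )

    def forward(self, mel, lld):            # mel: (B, 1, n_mels, T), lld: (B, T, n_lld)
        c = self.conv(mel)                  # (B, 64, 1, T')
        c = c.squeeze(2).transpose(1, 2)    # (B, T', 64)
        c = self.att_conv(c)                # (B, 64)
        g, _ = self.bigru(lld)              # (B, T, 2*hidden)
        g = self.att_gru(g)                 # (B, 2*hidden)
        return self.classifier(torch.cat([c, g], dim=1))

model = DualChannelSER()
mel = torch.randn(4, 1, 128, 200)   # dummy Mel-spectrogram batch
lld = torch.randn(4, 200, 33)       # dummy frame-level feature batch
print(model(mel, lld).shape)        # torch.Size([4, 7])
```

In the actual model, capsule routing and confidence-based fusion would take the place of the plain CNN stand-in and the concatenation, respectively.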

In the future, we aim to extend the CapNet-based hybrid algorithm with different speech features to further enhance its ability to capture emotion-related characteristics that vary across time. We also aim to apply this architecture to the Odia language with more dialects and to other languages. Moreover, we aim to implement the proposed framework in real-time applications for emotion recognition and gender identification.

Author Contributions: Conception, design, analysis, interpretation of the data, and writing of the manuscript, B.M.; critical revision of the manuscript for important intellectual content, funding, and approval of the final version, M.S. and M.; supervision of the project, M.S. All authors have read and agreed to the published version of the manuscript.

Funding: This work was supported by the DST, Govt. of India, under grant reference no. DST/ICPS/CLUSTER/Data Science/2018/General, Date: 7 January 2019.

Acknowledgments: The authors express their gratitude to the Department of Science and Technology (DST) and the Silicon Institute of Technology, who provided us with an excellent academic environment for this research work. We also thank J. Talukdar for improving the quality of the manuscript.

Conflicts of Interest: The authors state that they have no conflict of interest.

References
1. Wu, J.; Zhang, Y.; Zhao, X. A generalized zero-shot framework for emotion recognition from body gestures. arXiv 2020, arXiv:2010.06362.
2. Alreshidi, A.; Ullah, M. Facial emotion recognition using hybrid features. Informatics 2020, 7, 6. [CrossRef]
3. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–8 December 2012; pp. 1097–1105.
4. Roberts, L.S. A Forensic Phonetic Study of the Vocal Responses of Individuals in Distress. Ph.D. Thesis, University of York, York, UK, 2012.
5. Chakraborty, R.; Pandharipande, M.; Kopparapu, S.K. Knowledge-based framework for intelligent emotion recognition in spontaneous speech. Procedia Comput. Sci. 2016, 96, 587–596. [CrossRef]
6. Vogt, T.; André, E. Improving automatic emotion recognition from speech via gender differentiation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC'06), Genoa, Italy, 22–28 May 2006; pp. 1123–1126.
7. Ishaq, M.; Kwon, S. Short-Term Energy Forecasting Framework Using an Ensemble Deep Learning Approach. IEEE Access 2021, 9, 94262–94271.
8. Mustaqeem; Kwon, S. 1D-CNN: Speech Emotion Recognition System Using a Stacked Network with Dilated CNN Features. Comput. Mater. Contin. 2021, 67, 4039–4059. [CrossRef]
9. Latif, S.; Qayyum, A.; Usman, M.; Qadir, J. Cross lingual speech emotion recognition: Urdu vs. western languages. In Proceedings of the 2018 International Conference on Frontiers of Information Technology (FIT), Islamabad, Pakistan, 17–19 December 2018.
10. Eyben, F.; Scherer, K.S.; Schuller, B.W.; Sundberg, J.; Andre, E.; Busso, C.; Devillers, L.Y.; Epps, J.; Laukka, P.; Narayanan, S.S.; et al. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Trans. Affect. Comput. 2016, 7, 190–202. [CrossRef]
11. Swain, M.; Routray, A.; Kabisatpathy, P. Databases, features and classifiers for speech emotion recognition: A review. Int. J. Speech Technol. 2018, 21, 93–120. [CrossRef]
12. Jahangir, R.; Teh, Y.W.; Hanif, F.; Mujtaba, G. Deep learning approaches for speech emotion recognition: State of the art and research challenges. Multimed. Tools Appl. 2021, 80, 23745–23812. [CrossRef]


13. Zhang, S.; Zhao, X.; Tian, Q. Spontaneous Speech Emotion Recognition Using Multiscale Deep Convolutional LSTM. IEEE Trans. Affect. Comput. 2019, 1–10. [CrossRef]
14. Bertero, D.; Fung, P. A first look into a convolutional neural network for speech emotion detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5115–5119.
15. Abdel-Hamid, L. Egyptian Arabic speech emotion recognition using prosodic, spectral and wavelet features. Speech Commun. 2020, 122, 19–30. [CrossRef]
16. Issa, D.; Demirci, M.F.; Yazici, A. Speech emotion recognition with deep convolutional neural networks. Biomed. Signal Process. Control 2020, 59, 101894. [CrossRef]
17. Badshah, A.M.; Rahim, N.; Ullah, N.; Ahmed, J.; Muhammad, K.; Lee, M.Y.; Kwon, S.; Baik, S.W. Deep features-based speech emotion recognition for smart affective services. Multimed. Tools Appl. 2019, 78, 5571–5589. [CrossRef]
18. Dangol, R.; Alsadoon, A.; Prasad, P.W.C.; Seher, I.; Alsadoon, O.H. Speech Emotion Recognition Using Convolutional Neural Network and Long-Short Term Memory. Multimed. Tools Appl. 2020, 79, 32917–32934. [CrossRef]
19. Senthilkumar, N.; Karpakam, S.; Gayathri Devi, M.; Balakumaresan, R.; Dhilipkumar, P. Speech emotion recognition based on Bi-directional LSTM architecture and deep belief networks. Mater. Today Proc. 2021, 57, 2180–2184. [CrossRef]
20. Yao, Z.; Wang, Z.; Liu, W.; Liu, Y.; Pan, J. Speech emotion recognition using fusion of three multi-task learning-based classifiers: HSF-DNN, MS-CNN and LLD-RNN. Speech Commun. 2020, 120, 11–19. [CrossRef]
21. Abdul Qayyum, A.B.; Arefeen, A.; Shahnaz, C. Convolutional Neural Network (CNN) Based Speech-Emotion Recognition. In Proceedings of the IEEE International Conference on Signal Processing, Information, Communication and Systems, Dhaka, Bangladesh, 28–30 November 2019; pp. 122–125.
22. Tzinis, E.; Potamianos, A. Segment-based speech emotion recognition using recurrent neural networks. In Proceedings of the 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), San Antonio, TX, USA, 23–26 October 2017; IEEE: Manhattan, NY, USA, 2017; pp. 190–195.
23. Cho, K.; Merrienboer, B.V.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078.
24. Neumann, M.; Vu, N.T. Attentive convolutional neural network based speech emotion recognition: A study on the impact of input features, signal length, and acted speech. arXiv 2017, arXiv:1706.00612.
25. Meyer, P.; Xu, Z.; Fingscheidt, T. Improving Convolutional Recurrent Neural Networks for Speech Emotion Recognition. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Virtual, 19–22 January 2021; pp. 365–372.
26. Qamhan, M.A.; Meftah, A.H.; Selouani, S.A.; Alotaibi, Y.A.; Zakariah, M.; Seddiq, Y.M. Speech Emotion Recognition using Convolutional Recurrent Neural Networks and Spectrograms. In Proceedings of the 2020 IEEE Canadian Conference on Electrical and Computer Engineering (CCECE), 30 August–2 September 2020; pp. 1–5.
27. Mao, S.; Ching, P.C.; Lee, T. Enhancing Segment-Based Speech Emotion Recognition by Deep Self-Learning. arXiv 2021, arXiv:2103.16456v1.
28. Li, D.; Liu, J.; Yang, Z.; Sun, L.; Wang, Z. Speech emotion recognition using recurrent neural networks with directional self-attention. Expert Syst. Appl. 2021, 173, 114683. [CrossRef]
29. Sabour, S.; Frosst, N.; Hinton, G. Dynamic routing between capsules. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 3856–3866.
30. Zhang, B.W.; Xu, X.F.; Yang, M.; Chen, X.J.; Ye, Y.M. Cross-domain sentiment classification by capsule network with semantic rules. IEEE Access 2018, 6, 58284–58294. [CrossRef]
31. Wu, L.; Liu, S.; Cao, Y.; Li, X.; Yu, J.; Dai, D.; Ma, X.; Hu, S.; Wu, Z.; Liu, X.; et al. Speech emotion recognition using capsule networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6695–6699.
32. Duarte, K.; Rawat, Y.S.; Shah, M. VideoCapsuleNet: A simplified network for action detection. Advances in Neural Information Processing Systems. arXiv 2018, arXiv:1805.08162.
33. Chen, Q.; Huang, G. A novel dual attention-based BLSTM with hybrid features in speech emotion recognition. Eng. Appl. Artif. Intell. 2021, 102, 104277. [CrossRef]
34. Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 5200–5204.
35. Chen, M.; He, X.; Yang, J.; Zhang, H. 3-D convolutional recurrent neural networks with attention model for speech emotion recognition. IEEE Signal Process. Lett. 2018, 25, 1440–1444. [CrossRef]
36. Mustafa, M.B.; Yusoof, M.A.M.; Don, Z.M.; Malekzadeh, M. Speech emotion recognition research: An analysis of research focus. Int. J. Speech Technol. 2018, 21, 137–156. [CrossRef]
37. Koolagudi, S.G.; Maity, S.; Kumar, V.A.; Chakrabarti, S.; Rao, K.S. IITKGP-SESC: Speech database for emotion analysis. Commun. Comput. Inf. Sci. 2009, 40, 485–492. [CrossRef]
38. Akçaya, M.B.; Oguz, K. Speech emotion recognition: Emotional models, databases, features, preprocessing methods, supporting modalities, and classifiers. Speech Commun. 2020, 116, 56–76. [CrossRef]


39. Han, K.; Yu, D.; Tashev, I. Speech emotion recognition using deep neural network and extreme learning machine. In Proceedings of the Fifteenth Annual Conference of the International Speech Communication Association, Singapore, 14–18 September 2014; pp. 223–227.
40. Mao, Q.; Dong, M.; Huang, Z.; Zhan, Y. Learning salient features for speech emotion recognition using convolutional neural networks. IEEE Trans. Multimed. 2014, 16, 2203–2213. [CrossRef]
41. Khamparia, A.; Gupta, D.; Nguyen, N.G.; Khanna, A.; Pandey, B.; Tiwari, P. Sound classification using convolutional neural network and tensor deep stacking network. IEEE Access 2019, 7, 7717–7727. [CrossRef]
42. Mustaqeem, M.; Kwon, S. Speech Emotion Recognition Based on Deep Networks: A Review. In Proceedings of the Korea Information Processing Society Conference, Seoul, Korea, 14 May 2021.
43. Mustaqeem; Kwon, S. A CNN-assisted enhanced audio signal processing for speech emotion recognition. Sensors 2020, 20, 183.
44. Zhao, Z.; Zheng, Y.; Zhang, Z.; Wang, H.; Zhao, Y.; Li, C. Exploring spatio-temporal representations by integrating attention-based bidirectional-LSTM-RNNs and FCNs for speech emotion recognition. In Proceedings of the INTERSPEECH, Hyderabad, India, 2–6 September 2018; pp. 272–276.
45. Mustaqeem; Muhammad, S.; Kwon, S. Clustering-based speech emotion recognition by incorporating learned features and deep BiLSTM. IEEE Access 2020, 8, 79861–79875. [CrossRef]
46. Mustaqeem; Kwon, S. MLT-DNet: Speech emotion recognition using 1D dilated CNN based on multi-learning trick approach. Expert Syst. Appl. 2021, 167, 114177. [CrossRef]
47. Tursunov, A.; Choeh, J.Y.; Kwon, S. Age and Gender Recognition Using a Convolutional Neural Network with a Specially Designed Multi-Attention Module through Speech Spectrograms. Sensors 2021, 21, 5892. [CrossRef] [PubMed]
48. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
49. Fayek, H.M.; Lech, M.; Cavedon, L. Evaluating deep learning architectures for Speech Emotion Recognition. Neural Netw. 2017, 92, 60–68. [CrossRef] [PubMed]
50. Zhu, Z.; Dai, W.; Hu, Y.; Li, J. Speech emotion recognition model based on Bi-GRU and Focal Loss. Pattern Recogn. Lett. 2020, 140, 358–365. [CrossRef]
51. Tzirakis, P.; Trigeorgis, G.; Nicolaou, M.A.; Schuller, B.W.; Zafeiriou, S. End-to-end multimodal emotion recognition using deep neural networks. IEEE J. Sel. Top. Signal Process 2017, 11, 1301–1309. [CrossRef]
52. Chorowski, J.; Bahdanau, D.; Serdyuk, D.; Cho, K.; Bengio, Y. Attention-Based Models for Speech Recognition. arXiv 2015, arXiv:1506.07503.
53. Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 2227–2231.
54. Rajamani, S.T.; Rajamani, K.T.; Mallol-Ragolta, A.; Liu, S.; Schuller, B. A novel attention-based gated recurrent unit and its efficacy in speech emotion recognition. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 6–11 June 2021; pp. 6294–6298.
55. Zhao, Z.; Bao, Z.; Zhao, Y.; Zhang, Z.; Cummins, N.; Ren, Z.; Schuller, B. Exploring deep spectrum representations via attention-based recurrent and convolutional neural networks for speech emotion recognition. IEEE Access 2019, 7, 97515–97525. [CrossRef]
56. Ishaq, M.; Son, G.; Kwon, S. Utterance-Level Speech Emotion Recognition using Parallel Convolutional Neural Network with Self-Attention Module. In Proceedings of the 1st International Conference on Next Generation Computing Systems-2021, Coimbatore, India, 26–27 March 2021; Volume 1, pp. 1–6.
57. Li, S.; Xing, X.; Fan, W.; Cai, B.; Fordson, P.; Xu, X. Spatiotemporal and frequential cascaded attention networks for speech emotion recognition. Neurocomputing 2021, 448, 238–248. [CrossRef]
58. Toraman, S.; Tuncer, S.A.; Balgetir, F. Is it possible to detect cerebral dominance via EEG signals by using deep learning? Med. Hypotheses 2019, 131, 109315. [CrossRef]
59. Jalal, M.A.; Loweimi, E.; Moore, R.K.; Hain, T. Learning temporal clusters using capsule routing for speech emotion recognition. In Proceedings of the INTERSPEECH, Graz, Austria, 15–19 September 2019; pp. 1701–1705.
60. Ng, A.J.B.; Liu, K.H. The Investigation of Different Loss Functions with Capsule Networks for Speech Emotion Recognition. Sci. Program. 2021, 2021, 9916915. [CrossRef]
61. Su, B.H.; Yeh, S.L.; Ko, M.Y.; Chen, H.Y.; Zhong, S.C.; Li, J.L.; Lee, C.C. Self-assessed affect recognition using fusion of attentional BLSTM and static acoustic features. In Proceedings of the INTERSPEECH, Hyderabad, India, 2–6 September 2018; pp. 536–540.
62. McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O. Librosa: Audio and music signal analysis in python. In Proceedings of the Fourteenth Python in Science Conference, Austin, TX, USA, 6–12 July 2015; pp. 18–24.
63. Chen, Z.; Qian, T. Transfer Capsule Network for Aspect Level Sentiment Classification. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; pp. 547–556.
64. Swain, M.; Maji, B.; Das, U. Convolutional Gated Recurrent Units (CGRU) for Emotion Recognition in Odia Language. In Proceedings of the IEEE EUROCON 19th International Conference on Smart Technologies, Lviv, Ukraine, 6–8 July 2021; pp. 269–273.
65. Busso, C.; Bulut, M.; Lee, C.C.; Kazemzadeh, A.; Mower, E.; Kim, S.; Chang, J.N.; Lee, S.; Narayanan, S.S. IEMOCAP: An interactive emotional dyadic motion capture database. Lang. Resour. Eval. 2008, 42, 335–359. [CrossRef]
66. Burkhardt, F.; Paeschke, A.; Rolfes, M.; Sendlmeier, W.F.; Weiss, B. A database of German emotional speech. In Proceedings of the INTERSPEECH, Lisbon, Portugal, 4–8 September 2005; pp. 1517–1520.


67. Loughrey, J.; Cunningham, P. Using Early Stopping to Reduce Overfitting in Wrapper-Based Feature Weighting; Department of Computer Science, Trinity College Dublin: Dublin, Ireland, 2005.
68. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
69. Satt, A.; Rozenberg, S.; Hoory, R. Efficient emotion recognition from speech using deep learning on spectrograms. In Proceedings of the INTERSPEECH, Stockholm, Sweden, 20–24 August 2017; pp. 1089–1093.
70. Sun, Y.; Wen, G.; Wang, J. Weighted spectral features based on local Hu moments for speech emotion recognition. Biomed. Signal Process. Control 2015, 18, 80–90. [CrossRef]
71. Chen, S.; Zhang, M.; Yang, X.; Zhao, Z.; Zou, T.; Sun, X. The Impact of Attention Mechanisms on Speech Emotion Recognition. Sensors 2021, 21, 7530. [CrossRef]
72. Lee, J.; Tashev, I. High-level feature representation using recurrent neural network for speech emotion recognition. In Proceedings of the INTERSPEECH, Dresden, Germany, 6–10 September 2015; pp. 1537–1540.
73. Li, P.; Song, Y.; McLoughlin, I.; Guo, W.; Dai, L. An attention pooling based representation learning method for speech emotion recognition. In Proceedings of the INTERSPEECH, Hyderabad, India, 2–6 September 2018; pp. 3087–3091.
74. Shirian, A.; Guha, T. Compact graph architecture for speech emotion recognition. In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Virtual, 6–12 June 2021; pp. 6284–6288.