Transformation of Emotions using Pitch as a Parameter for Kannada Speech




Geethashree A.¹ and Dr. D. J. Ravi²

¹Asst. Professor, Vidyavardhaka College of Engineering, Mysore, [email protected]

²Professor, Vidyavardhaka College of Engineering, Mysore, [email protected]

Abstract— Considerable research has been done on the transformation of emotion in speech. It is well known that prosodic features play an important role in expressing emotion, and many studies have analyzed prosodic features and constructed emotional conversion rules. For Kannada, however, few such studies exist. In this paper, emotional conversion is performed using linear modification model (LMM) based techniques. LMM is a prosody conversion method that requires only a small amount of expressive data; here it is evaluated for two target emotions, sadness and fear. First, a Kannada speech database is constructed to study the effects of emotional change. Emotion is an important element in expressive speech synthesis and has been investigated by many researchers. This paper describes the methods used to optimize the database for analysis and study. Pitch information is extracted from the database for the emotions of fear and sadness, pitch analysis is performed on the extracted pitch points, and a general algorithm is devised for converting the neutral state to an emotional state. The experiments compare two expressive styles, fear and sadness, against neutral.

Index Terms— Emotion conversion, Emotional speech database, LMM based emotion conversion.

I. INTRODUCTION

Emotion plays an important role in day-to-day interpersonal human interaction. Recent findings suggest that emotion is integral to our rational and intelligent decisions. A successful solution to this challenging problem would enable a wide range of important applications: correct assessment of an individual's emotional state could significantly improve the quality of emerging, natural-language-based human-computer interfaces [1, 3, 6]. Emotion helps us relate to each other by expressing our feelings and providing feedback, and this important aspect of human interaction needs to be considered in the design of human-machine interfaces [2, 4]. To recognize emotions, a system needs to know not only what information a user conveys but also how it is conveyed: speech signals carry not only words and meanings but also emotions. Acoustically, emotions involve variation in syllable length, loudness, pitch, and the formant frequencies of speech sounds [1, 5, 9]. To build interfaces that are more in tune with users' needs and preferences, it is essential to study how emotion modulates and enhances the verbal and nonverbal channels of human communication. Emotional speech can also provide invaluable clues in forensics. The pitch contour is almost flat for neutral speech, and the average pitch of emotional speech is higher than that of neutral speech. The intensity of a word is lowest for sadness and highest for anger. Phoneme-level duration analysis shows that it is the vowels that capture emotional variation more than the consonants [13, 12]. Pitch is strongly correlated with the fundamental frequency of the sound. It occupies a central place in the study of prosodic attributes, as it is the perceived fundamental frequency of the sound [10, 11, 15]; it differs from the actual fundamental frequency because of overtones inherent in the sound. Pitch contours are essentially curves that trace the pitch of the sound over time. This paper extracts and analyzes patterns in the pitch points of utterances spoken in the fear and sad emotional states, as compared to the same utterances spoken in the neutral state. The observed patterns and norms are then used to devise an algorithm for converting utterances from the neutral state to the fear and sad states [8, 14, 16]. The paper is organized in six sections: Section II reviews related work; Section III details the creation and optimization of the emotional speech database used in the analysis; Section IV presents the patterns observed in the pitch; Section V describes and implements the algorithm devised for the conversion from the neutral state to an emotional state; and Section VI concludes.

DOI: 03.AETS.2014.5.209 © 2014 Association of Computer Electronics and Electrical Engineers. Proc. of Int. Conf. on Recent Trends in Signal Processing, Image Processing and VLSI (ICrtSIV).

II. RELATED WORK

Pitch features of expressive speech have been extensively analyzed over the last few years. From these studies, it is well known that the pitch contour presents distinctive patterns for certain emotional categories. In an exhaustive review, Juslin and Laukka reported some consistent results for the pitch contour across 104 studies of vocal expression. For example, they concluded that the pitch contour is higher and more variable for emotions such as anger and happiness, and lower and less variable for emotions such as sadness. Despite their descriptive power, these observations are not adequate to quantify the discriminative power and the variability of the pitch features. The results obtained by Lieberman and Michaels indicate that the fine structure of the pitch contour is an important emotional cue. Using human perceptual experiments, they showed that the recognition of emotional modes such as bored and pompous decreased when the pitch contour was smoothed. They therefore concluded that small pitch fluctuations, which are usually neglected, convey emotional information. Wang et al. compared the pitch declination conveyed in happy and neutral speech in Mandarin. Using four-word sentences, they studied the pitch patterns at the word level. They concluded that the declination in happy speech is less than in neutral speech, and that the slope of the F0 contour is steeper than in neutral speech, especially at the end of the sentence. Paeschke et al. also analyzed the pitch shape in expressive speech. They proposed different pitch features that might be useful for emotion recognition, such as the steepness of pitch rises and falls and the direction of the pitch contour. Scherer et al. explained these results by distinguishing between linguistic and paralinguistic pitch features.
The authors suggested that gross statistics from the pitch are less connected to the verbal context, so they can be independently manipulated to express the emotional state of the speaker (paralinguistic). The authors also argued that the pitch shape (i.e., rise and fall) is tightly associated with the grammatical (linguistic) structure of the sentence. As an aside, similar interplay with pitch has been observed in facial expressions.

III. EMOTIONAL SPEECH DATABASE

A database consisting of a total of 400 Kannada words and sentences was prepared and used for the analysis. The first step was to record each word and sentence. All recordings were made using Praat [3, 7] at a sampling rate of 44100 Hz with a single (mono) channel, in a silent room with minimal noise. For pitch analysis, the minimum and maximum pitch were set to 75 Hz and 500 Hz. Each speaker recorded every word and sentence in four different emotions: neutral, happiness, fear, and sadness. The words and sentences were chosen to be emotionally rich so that they can be spoken naturally in all four emotions. Once a speaker was in an emotionally charged mood, he or she was asked to record the words and sentences with full emotion, and the recordings were stored on a computer. As part of the optimization process, the files were subjected to three steps:

1. Normalization
2. Noise reduction
3. Auto trim


To normalize is to adjust the volume so that the loudest peak equals the maximum signal level usable in digital audio. Two methods were used to mitigate background noise: first, manual spectral subtraction was performed on each sound file by hand-picking the noise regions; second, the sound was passed through a noise gate to clean up residual noise. As the last part of the optimization process, the files were trimmed to eliminate silent regions at the beginning and end of each utterance. The resulting utterances in the database were of comparable average duration.
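The three optimization steps can be sketched in a few lines. The following is an illustrative Python sketch, not the authors' code: the signal is modeled as a plain list of floats in [-1, 1], and the 0.02 gate and trim threshold is an assumed value.

```python
# Sketch of the database optimization pipeline: normalize, noise-gate, trim.
# The signal model (list of floats) and thresholds are illustrative only.

def normalize(samples, target_peak=1.0):
    """Scale so the loudest peak equals the maximum usable level."""
    peak = max(abs(s) for s in samples)
    return [s * target_peak / peak for s in samples] if peak else samples

def noise_gate(samples, threshold=0.02):
    """Zero out samples below a threshold (a crude gate for residual hiss)."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]

def auto_trim(samples, threshold=0.02):
    """Drop the silent regions at the beginning and end of the utterance."""
    first = next((i for i, s in enumerate(samples) if abs(s) >= threshold), 0)
    last = next((i for i in range(len(samples) - 1, -1, -1)
                 if abs(samples[i]) >= threshold), len(samples) - 1)
    return samples[first:last + 1]

# Example: a quiet recording padded with near-silence on both sides.
raw = [0.001] * 5 + [0.1, -0.25, 0.5, -0.4, 0.2] + [0.001] * 5
clean = auto_trim(noise_gate(normalize(raw)))
```

In practice the same effect was obtained interactively in Praat; the sketch only shows the order of operations the text describes.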

IV. ANALYSIS

In our experiment, the pitch values of each word and sentence recorded in neutral and emotional speech were extracted using Praat [3, 7]. Each recorded word and sentence was divided into three parts (start, middle, and end), using the fact that the pitch values should be continuous. The pitch values at the start and end points, as well as the maximum and minimum pitch values between them, were then extracted for each word and sentence. It may be noticed that the pitch value at the start point can coincide with either the maximum or the minimum pitch value; the same is true for the end point. On comparing the emotional speech with the neutral speech, the two were found to differ: happiness, for example, showed a higher pitch than neutral. Each pitch value of the emotional speech was then noted down manually against the corresponding neutral pitch value, and the start, end, maximum, and minimum pitch values were extracted. The database so constructed was used to study and analyze patterns and similarities in the pitch contour when a sentence was spoken with emotion, as compared to when the same sentence was spoken in the neutral state. For this analysis, pitch points were computed for the utterances, a pattern was found, and an algorithm for the conversion from the neutral state to an emotional state was implemented. The database files for the neutral and emotional states were stored as sequentially numbered files so that scripting could be used to automate the extraction of pitch points. To compute the pitch points, each sound file was loaded in Praat and sent to its "Manipulation" editor.
The pitch information of the utterance was extracted from the manipulation as a "PitchTier" object, which was used to write the pitch information to a text file. Each text file was named sequentially, with the same name as its sound file and a .txt extension. The whole process was repeated for all the sound files using a script written specifically for this purpose, so that after running the script one text file per sound file was generated, each containing all the pitch points of its corresponding sound file as a set of tab-separated time and pitch values. For the analysis, each utterance was partitioned into three sections: the start, middle, and end of the sentence. In each part, the pitch points corresponding to the maximum and the minimum pitch were also computed. Figures 1, 2 and 3 show the graphical pitch tiers of neutral, sadness (recorded) and fear (recorded) speech.
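Assuming each exported text file holds one tab-separated "time&lt;TAB&gt;pitch" pair per line (the exact export format may differ from this), the parsing and three-way partitioning can be sketched in Python as follows. The equal-thirds split and the helper names are illustrative choices; the example reuses the neutral pitch values of Table I.

```python
# Sketch of the pitch-point analysis: parse the exported text files, split
# each utterance into Start/Middle/End, and summarize each part. The file
# layout (one "time\tpitch" pair per line) is an assumption.

def load_pitch_points(lines):
    """Parse (time, pitch) pairs from an exported PitchTier text file."""
    points = []
    for line in lines:
        time_s, pitch_hz = line.split("\t")
        points.append((float(time_s), float(pitch_hz)))
    return points

def partition_and_summarize(points):
    """Split an utterance into Start/Middle/End thirds and report the
    start, end, maximum, and minimum pitch of each part."""
    n = len(points)
    parts = {"Start": points[: n // 3],
             "Middle": points[n // 3 : 2 * n // 3],
             "End": points[2 * n // 3 :]}
    summary = {}
    for name, part in parts.items():
        pitches = [p for _, p in part]
        summary[name] = {"start": pitches[0], "end": pitches[-1],
                         "max": max(pitches), "min": min(pitches)}
    return summary

# Example with the twelve neutral pitch values (pv1-pv12) of Table I,
# at assumed 0.1 s spacing.
neutral = [98.85, 111.40, 91.90, 133.85, 82.4, 102.9,
           79.9, 121.0, 104.6, 125.6, 90.69, 74.95]
lines = ["{:.2f}\t{}".format(0.1 * i, p) for i, p in enumerate(neutral)]
stats = partition_and_summarize(load_pitch_points(lines))
```

A real partition would follow word or phrase boundaries rather than equal thirds, but the computed quantities (start, end, max, min per part) are the ones the analysis uses.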

Figure 1: Illustration of Pitch patterns for neutral sentence

4

Figure 2: Illustration of Pitch patterns for sadness (Recorded) sentence

Figure 3: Illustration of Pitch patterns for fear (Recorded) sentence

Table I shows the pitch values of a sentence in the neutral, sadness (recorded) and fear (recorded) emotions. Table II gives the pitch value of each word of a neutral sentence. Table III gives the maximum, minimum, start and end pitch values of different neutral sentences.

V. EMOTION CONVERSION ALGORITHM AND RESULT

The conversion process is carried out by the following algorithms.

A. Algorithm used for the conversion of a sentence into sadness

The algorithm is based on the distribution of the pitch points between the maximum and minimum pitch frequencies, and on the difference between the maximum and minimum pitch frequencies.
1. If the difference between the maximum and minimum pitch frequencies is less than 140 Hz, shift the entire pitch tier by 20 to 60 Hz, irrespective of the distribution of the pitch frequencies.
2. If the difference between the maximum and minimum pitch frequencies is greater than 140 Hz:
   a. If more pitch points are concentrated towards the minimum pitch frequency, raise all those points to half of the maximum pitch frequency.

5

TABLE I. PITCH VALUES OF A SENTENCE IN DIFFERENT EMOTIONS (HZ)

Pitch value   Neutral   Fear     Sadness
pv1            98.85    183.34   122.8
pv2           111.40    176.48   102.6
pv3            91.90    212.59   175.4
pv4           133.85    188.15    99.6
pv5            82.4     255.47   106.2
pv6           102.9     192.73    82.0
pv7            79.9     239.29   140.75
pv8           121.0     160.55   251.4
pv9           104.6     135.75   269.6
pv10          125.6     383.91   173.7
pv11           90.69    207.7     90.53
pv12           74.95    145.66    84.31

TABLE II. PITCH VALUES OF EACH WORD OF A NEUTRAL SENTENCE (HZ)

Pitch value   (Nanu)   (Oduudu)   (rathri)   (Veleyale)
pv1            98.13    100.11     130.8      91.54
pv2            95.46    122.7      114.0      83.87
pv3           100.06    117.07     104.5      86.63
pv4           107.32    134.26     123.6      84.37
pv5           109.72    134.43     117.6      81.64
pv6           106.78    108.17     113.6      75.49
pv7           109.59    101.05     271.7      80.00
pv8           109.75     87.86     270.8      76.62
pv9           104.69     81.83     277.4      78.69
pv10           94.73    103.51     125.5      76.24
pv11           96.09     95.94     123.5      75.52
pv12           92.96     81.95      92.8      -

   b. If more pitch points are concentrated towards the maximum pitch frequency, identify the minimum within that group, raise all the minimum pitch points to that level, and increase the points concentrated towards the maximum pitch frequency to the maximum level.
   c. If the pitch points are distributed equally between the maximum and minimum pitch frequencies, identify the centre frequency, raise the pitch points lying below the centre frequency to the centre frequency, and increase the points above the centre frequency to their respective peak levels.
3. After the conversion, the difference between the maximum and minimum pitch frequencies should be less than 140 Hz, and 50%-70% of the original pitch tier shape should be maintained.

Figures 4 and 5 illustrate the pitch tier of the sadness emotion (transformed from neutral).
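The sadness rules above can be sketched as follows. This is a hedged sketch, not the authors' implementation: the +40 Hz shift is one illustrative value inside the stated 20-60 Hz range, the "concentration" test simply counts points on either side of the mid frequency, and rule 2(c) and the shape-preservation check of step 3 are omitted for brevity.

```python
# Sketch of the neutral-to-sadness pitch conversion rules (1, 2a, 2b).
# Shift amount and concentration test are illustrative assumptions.

def to_sadness(pitches, shift=40.0):
    f_max, f_min = max(pitches), min(pitches)
    if f_max - f_min < 140.0:
        # Rule 1: shift the whole pitch tier (20-60 Hz; 40 Hz chosen here).
        return [p + shift for p in pitches]
    mid = (f_max + f_min) / 2.0
    low = [p for p in pitches if p < mid]
    high = [p for p in pitches if p >= mid]
    if len(low) > len(high):
        # Rule 2a: points cluster near the minimum -> raise them to half of max.
        return [max(p, f_max / 2.0) for p in pitches]
    # Rule 2b: points cluster near the maximum -> raise the low points to the
    # smallest value of the high group, and push the high group to the maximum.
    floor = min(high)
    return [floor if p < mid else f_max for p in pitches]

# The neutral sentence of Table I: span = 133.85 - 74.95 = 58.9 Hz < 140 Hz,
# so rule 1 applies and the whole tier is shifted.
neutral = [98.85, 111.40, 91.90, 133.85, 82.4, 102.9,
           79.9, 121.0, 104.6, 125.6, 90.69, 74.95]
sad = to_sadness(neutral)
```

Note that rule 1 preserves the pitch tier shape exactly, which is consistent with step 3's requirement that 50%-70% of the original shape be maintained.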

B. Algorithm used for the conversion of a sentence into fear
1. Consider the sadness-converted sentence.
2. Apply a word detection algorithm to the sadness-converted sentence.
3. Identify the set of peak pitch points in each word and shift these points by 30 to 40 Hz.
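A minimal sketch of the sadness-to-fear step, assuming the word boundaries are already known as index ranges (the paper's word detection algorithm is not specified here) and using an upward 35 Hz shift as one value in the stated 30-40 Hz range:

```python
# Sketch of the fear conversion: shift the peak pitch point(s) of each word.
# Word spans and the 35 Hz shift amount are illustrative assumptions.

def to_fear(sad_pitches, word_spans, peak_shift=35.0):
    out = list(sad_pitches)
    for start, end in word_spans:          # one (start, end) index pair per word
        peak = max(out[start:end])
        for i in range(start, end):
            if out[i] == peak:             # the peak pitch point(s) of the word
                out[i] += peak_shift
    return out

# The sadness-converted tier from the previous sketch (rule 1, +40 Hz),
# with four assumed words of three pitch points each.
sad = [138.85, 151.40, 131.90, 173.85, 122.4, 142.9,
       119.9, 161.0, 144.6, 165.6, 130.69, 114.95]
words = [(0, 3), (3, 6), (6, 9), (9, 12)]
fear = to_fear(sad, words)
```

Only the per-word peaks move, which matches the intent of step 3: fear retains the sadness contour but with sharper word-level peaks.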


TABLE III. PITCH VALUES OF DIFFERENT NEUTRAL SENTENCES (HZ)

Sl. No   Sentence                                      Min.    Max.    Start   End
1        (I study during night)                        75      134.9    98.9    75
2        (There is no doubt in it.)                    75.5    139.8   108.8    80.4
3        (Take your son)                               75.3    184.6   141.5    75.3
4        (I will go with him)                          75.4    153.1   152.6    76.8
5        (Aravinda is my student)                      83.6    307.7   111.1    84.8
6        (Tiger which lives in forest doesn't fear)    75.0    159.3   159.3    81.4
7        (Do you believe these people's words)         80.1    237.6   141.4   124.5
8        (You are the eldest son)                      76.8    206.6    79.8    79.1

Figure 4: Pitch tier of the sadness emotion (transformed from neutral)

Figures 6 and 7 illustrate the pitch tier after the transformation of emotion from neutral to fear.


Figure 5: Pitch tier of the sadness emotion (transformed from neutral)

Figure 6: Pitch tier of the fear emotion (transformed from neutral → sadness → fear)

Figure 7: Pitch tier of the fear emotion (transformed from neutral → sadness → fear)


VI. CONCLUSION AND FUTURE WORK

This paper has presented an analysis of different emotions based on the pitch contour, with the goal of obtaining emotional speech from neutral speech.

i. An algorithm for emotion conversion has been described; to obtain sad speech, the pitch is varied by 20 to 60 Hz.

ii. To obtain fearful speech from sad speech, the pitch points of the sad speech are varied by 20 to 40 Hz. The alignment of pitch values by linguistic rules has not been considered; future work will focus on linguistic rules for emotion conversion. The proposed system also has scope for further refinement: only the pitch factor has been considered in our experiments, and the effect of other factors such as spectrum and duration can be investigated further. The experiments were performed on short utterances that are not adequate in number, and the database should be enhanced to improve the results.

REFERENCES

[1] Jianhua Tao, Yongguo Kang, and Aijun Li, "Prosody conversion from neutral speech to emotional speech", IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, July 2006.

[2] Takashi and Norman D. Cook, "Identifying emotion in speech prosody using acoustical cues of harmony", INTERSPEECH, ISCA, 2004.

[3] Paul Boersma and David Weenink, "Praat: doing phonetics by computer" [Online], November 2009. URL: http://www.fon.hum.uva.nl/praat/

[4] W. F. Sendlmeier, M. Kienast, and A. Paeschke, "F0 contours in emotional speech", Proc. ICPhS, Technische Universität Berlin, 1999.

[5] S. J. L. Mozziconacci and D. J. Hermes, "Role of intonational patterns in conveying emotion in speech", Proc. ICPhS, 1999.

[6] S. J. Mozziconacci and D. J. Hermes, "Role of intonation patterns in conveying emotion in speech", Proc. ICPhS 99, pp. 2001-2004, 1999.

[7] Ingmar Steiner, "Automatic speech data processing with Praat", lecture notes, 2007. URL: http://www.coli.uni-saarland.de/~steiner/praat/lecturenotes.pdf

[8] Mimmi Forsell, "Acoustic correlates of perceived emotions in speech", Master of Science thesis, Stockholm, Sweden, 2007.

[9] Carlo Drioli, G. Tisato, P. Cosi, and Fabio Tesser, "Emotions and voice quality: experiments with sinusoidal modeling", in VOQUAL'03, pp. 127-132.

[10] D. Lolive, N. Barbot, and O. Boeffard, "Pitch and duration transformation with non-parallel data", Speech Prosody 2008. URL: http://www.sprosig.isle.illinois.edu

[11] O. W. Kwon, K. L. Chan, J. Hao, et al., "Emotion recognition by speech signals", Eurospeech, Geneva, Switzerland, 2003.

[12] D. J. Ravi and Sudarshan Patilkulkarni, "Kannada text to speech synthesis systems: emotion analysis", International Conference on Natural Language Processing (ICON-2009).

[13] J. Rong, G. Li, and Y.-P. P. Chen, "Acoustic feature selection for automatic emotion recognition from speech", Journal of Information Processing and Management, 2009.

[14] Peerzada Hamid Ahmad, "Transformation of emotions using pitch as a parameter for Hindi speech", Zenith, 2012.

[15] Divya Setia, Maninder Suri, and Anurag Jain, "Emotion conversion in Hindi language", INDIACom-2010.

[16] Anurag Jain, S. S. Agrawal, and Nupur Prakash, "Transformation of emotion based on acoustic features of intonation patterns for Hindi speech", IETE Journal, 2011.