Shruti: An Embedded Text-to-Speech System for Indian Languages

Arijit Mukhopadhyay, Soumen Chakraborty∗, Monojit Choudhury, Anirban Lahiri, Soumyajit Dey, Anupam Basu†‡

∗Dept. of Electrical Engineering, Indian Institute of Technology, Kharagpur
†Dept. of Computer Science and Engineering, Indian Institute of Technology, Kharagpur
‡[email protected]

July 20, 2005

Abstract

In India, the recent increase in the number of people with physical impairments has created a need for low-cost, portable augmentative and alternative communication devices. In this paper we describe a text-to-speech system for Indian languages that accepts text input in two Indian languages, Hindi and Bengali, and produces near-natural audio output. The system has been ported to two common handheld platforms, the Compaq iPaq and the Casio Cassiopeia, both running Microsoft Pocket PC.

1 Introduction

In India the physically impaired population has reached an alarming figure of 8.9 million people, of whom almost 15% suffer from speech and visual impairments. This section of the population depends solely on augmentative and alternative communication techniques [1, 2] for their education and communication. Various tools have been developed for these users, but unfortunately they are built for English and are too costly for the Indian population [3, 4, 5, 6]. In response to this need we have taken up the task of developing low-cost, portable communication tools to aid the speech impaired population in India.

In this paper we describe an Indian language text-to-speech system that accepts text input in two Indian languages, Hindi and Bengali, and produces near-natural audio output. The system runs on a Compaq iPaq PDA built around the Intel StrongARM SA-1110 processor running Microsoft Pocket PC, a customized version of Microsoft's WinCE operating system for mobile and other handheld devices. The system has also been ported to a Casio Cassiopeia built around a MIPS VR4122 processor running Pocket PC. We have built two versions of the system: one that resides in the system memory and another that runs from a storage card.

The paper is organized as follows. Section 2 describes the techniques used for speech synthesis. Section 3 describes the architecture of the text-to-speech system, discusses in detail the technologies involved in its design, and describes the implementation and performance of the system on some common handheld platforms. The paper concludes with a short description of the future work that can be carried out based on this system.

2 Speech Synthesis: Techniques

Speech synthesis involves the algorithmic conversion of input text data to speech waveforms. Speech synthesizers are characterized by the size of the speech units they concatenate, as well as by the method used for encoding, storage and synthesis of the speech. Since all possible utterances of the language need to be modeled, the size of the vocabulary determines the synthesis method. Text-to-speech systems designed for use in a restricted domain always perform better than their general-purpose counterparts; the design of general-purpose systems is complicated by the fact that the sound output needs to be close to natural speech. In this paper we present a general-purpose text-to-speech system for two Indian languages, Hindi and Bengali, chosen because they are the most commonly spoken Indian languages.

There are different approaches to speech synthesis, such as rule-based synthesis, articulatory modelling and the concatenative technique. The best method is the articulatory one [7, 8, 9, 10], in which the human larynx, the main speech production organ, is simulated electronically. However, recent speech research has been directed towards concatenative speech synthesizers because of the difficulties in modeling the human larynx accurately. In concatenative speech synthesis, a set of signal units or specific waveforms are concatenated. In Indian languages the transitions from consonants to vowels differ from those of other languages, and they are phonologically different as well. A unified voice database suffices for Hindi and Bengali, as most of the vowel sounds are common; the only exception is the Hindi a as in kam. The same unified database works unchanged for many Indian languages. In the designed system the waveform concatenation approach, along with the Epoch Synchronous Overlap and Add (ESOLA) [11] technique, has been used as the basis for speech synthesis.

The most common speech units used in systems designed with the concatenative synthesis approach are words, syllables, demi-syllables, phonemes, and sometimes even triphones. In the present system, partnemes are mainly used as units. Partnemes are waveforms that capture the transition from a consonant to a vowel (CV) or from a vowel to a consonant (VC), or that hold only a consonant or only a vowel. For a partneme that is a vowel, only one pitch period, called the epoch, is stored, since one epoch can completely characterize a vowel. The partnemes are spliced from utterances, taking into consideration the combinations of all vowels and consonants.
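To make the partneme inventory concrete, the following sketch decomposes a syllable into the CV-transition, steady-vowel and VC-transition units just described. This is our illustration, not the authors' code: the function name, the (onset, vowel, coda) encoding, and the exact unit names chosen for the syllables of Kharagpur (the example discussed in Section 3.2) are assumptions.

# Minimal sketch (Python, ours): decompose a syllable, given as an
# optional onset consonant, a vowel, and an optional coda consonant,
# into partneme unit names.
def split_into_partnemes(onset, vowel, coda):
    units = []
    if onset:
        units.append(onset + vowel)   # CV transition, e.g. 'kh' + 'a' -> 'kha'
    units.append(vowel)               # steady-state vowel (one stored epoch)
    if coda:
        units.append(vowel + coda)    # VC transition, e.g. 'a' + 'g' -> 'ag'
    return units

# The three syllables of 'Kharagpur' (kha - rag - pur):
for syllable in [("kh", "a", None), ("r", "a", "g"), ("p", "u", "r")]:
    print(split_into_partnemes(*syllable))
# prints ['kha', 'a'], then ['ra', 'a', 'ag'], then ['pu', 'u', 'ur']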

3 The Text-to-Speech System

3.1 Basic Architecture

A schematic diagram of the speech synthesis system is shown in Figure 1. The system consists of two main blocks, A and B, where block A is the language-dependent block and block B is the Indian Language Phonetic Synthesizer (ILPS). Block A consists of an input device, a natural language processor and an intonational and prosodic rule base. The system accepts input text in ITRANS, a notation for expressing Indian language scripts through Roman characters; Figure 5 shows the mapping between the character set and ITRANS. The use of ITRANS was necessitated by the fact that there is no built-in support for Indian language fonts in Microsoft Pocket PC. This problem is solved in WinCE 5.0, which accepts text input in Indian fonts. Language-dependent prosodic and intonation rules are included in block A primarily as a knowledge base. The natural language processor in block A comprises a phoneme parser, which uses either a phonological dictionary with syllable and word markers or a linguistic rule base to produce an output phonemic string. Block B is the synthesizer: it takes as input the output string of phonemes (in ILPS symbols) together with intonation and prosody information from the rule base, and produces the output speech.
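Before any dictionary or rule lookup, the phoneme parser must segment the Roman-script ITRANS input into its multi-letter tokens. The sketch below shows longest-match tokenization over a tiny invented subset of ITRANS; the token list and function are our illustration, not the system's code.

# Longest-match tokenization of ITRANS input (Python sketch, ours).
# Multi-letter tokens such as 'kh' must be tried before their
# single-letter prefixes, hence the longest-first ordering.
ITRANS_TOKENS = ["kh", "gh", "aa", "k", "g", "a", "u", "r", "p"]

def tokenize_itrans(text):
    tokens, i = [], 0
    while i < len(text):
        for t in ITRANS_TOKENS:
            if text.startswith(t, i):
                tokens.append(t)
                i += len(t)
                break
        else:
            raise ValueError("no ITRANS token at position %d in %r" % (i, text))
    return tokens

print(tokenize_itrans("kharagpur"))  # ['kh', 'a', 'r', 'a', 'g', 'p', 'u', 'r']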

3.2 Technologies Used

After the user has keyed in the text (in ITRANS) to be spoken, the Natural Language Processor (NLP) [12] takes the text and analyses it to perform grapheme-to-phoneme conversion. A grapheme is the actual text, whereas a phoneme is a token that directly maps to a signal unit in the voice database, the partneme dictionary. This grapheme-to-phoneme conversion requires morphological and phonological analysis [13]. Thus the text given as input to the NLP is converted into a string of phonemes [14, 15]. Given the name of the city of Kharagpur as input to the NLP, the resulting output is shown as an example in Figure 2. In the figure, the input grapheme string shown in English exactly corresponds to the grapheme string shown in Bengali above it; the vowel a does not have a separate graphemic representation. In the output phonemic string the three syllables of Kharagpur are alternately coloured. A syllable may consist of a consonant-vowel-consonant sequence, as in the second and third syllables, may contain just a consonant and a vowel, or sometimes only a vowel. Phoneme strings are composed of representations of consonants (such as kh, R, p in the figure), of vowels (like a, u), and of consonant-vowel or vowel-consonant transitions (like kha, ag, pu, ur). Each phoneme maps to a single unit of the voice database. In the illustrated example, the phoneme pu maps to a signal which is a transitory waveform from the consonant p to the vowel u; similarly kh, kha, etc. map to particular signals.

The concatenation unit reads the phonemes one by one from the output phoneme string. For each phoneme read, the unit fetches the corresponding sound unit from the voice database and appends it to the end of the output waveform.
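A simplified sketch of this concatenation loop is given below. The real system fetches partneme waveforms from the voice database and joins them with ESOLA, which aligns the overlap to pitch epochs; the fixed-length linear cross-fade here is only an illustrative stand-in, and all names and data are invented.

import math

def crossfade_concat(units, overlap=16):
    # Concatenate waveform units (lists of float samples), blending the
    # tail of the running output with the head of each new unit.
    # This is NOT ESOLA: a real epoch-synchronous join would align the
    # overlap region to pitch epochs instead of a fixed sample count.
    out = list(units[0])
    for unit in units[1:]:
        n = min(overlap, len(out), len(unit))
        for i in range(n):
            w = (i + 1) / (n + 1)   # linear ramp 0 -> 1 across the join
            out[len(out) - n + i] = (1 - w) * out[len(out) - n + i] + w * unit[i]
        out.extend(unit[n:])
    return out

# Toy 'units': a constant segment and a 440 Hz burst at 8 kHz.
a = [0.5] * 64
b = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(64)]
print(len(crossfade_concat([a, b])))  # 112 = 64 + 64 - 16 overlapped samples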

To make the speech output more natural, the speech units have to be algorithmically modified before they are concatenated. The algorithms used in this system for modifying the signal units are briefly described below:

1. The spectral mismatch between the steady-state vowels and the transitory vowels that occur in consonant-vowel or vowel-consonant transitions is one of the most common problems faced in concatenative synthesis. To rectify this problem, the steady-state vowel is regenerated by interpolation from the given terminal pitch period at both ends in the case of a consonant-vowel-consonant sequence, or from one end for a consonant-vowel or vowel-consonant sequence. Though the transitory movements of the spectral structures, particularly the formants, are non-linear, a linear approximation yielded acceptable results.

2. Another common problem faced with the simple concatenation technique is the production of a perceptible, mechanical horn-like sound over and above the normal quality of the speech. This occurs because partneme concatenation produces exactly periodic waves, whereas normal human voice is quasi-periodic. Hence, to make the output sound more natural, the frequency of the vowel or voiced part of the speech is allowed to vary randomly within a small limit (typically 3-4% of the base frequency) rather than being kept constant. (Both steps are sketched in the example following this list.)
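The sketch below illustrates both modifications under simplifying assumptions: pitch periods of equal length and invented function names. It is our construction, not the authors' implementation.

import math
import random

def regenerate_steady_vowel(left_epoch, right_epoch, n_periods):
    # Step 1 (sketch): rebuild the steady-state vowel by linear,
    # sample-by-sample interpolation between the terminal pitch periods.
    # Assumes both epochs have the same length; a real system would
    # first normalise the period lengths.
    periods = []
    for k in range(n_periods):
        w = k / max(n_periods - 1, 1)   # 0 at the left end, 1 at the right
        periods.append([(1 - w) * l + w * r
                        for l, r in zip(left_epoch, right_epoch)])
    return periods

def jitter_periods(periods, max_dev=0.035):
    # Step 2 (sketch): vary each period length randomly within ~3-4% so
    # the voiced output is quasi-periodic rather than exactly periodic
    # (the 'mechanical horn' effect). Lengthening repeats the last
    # sample; shortening truncates.
    out = []
    for p in periods:
        n = round(len(p) * (1 + random.uniform(-max_dev, max_dev)))
        out.append((p + [p[-1]] * n)[:n])   # pad or truncate to n samples
    return out

left = [math.sin(2 * math.pi * i / 80) for i in range(80)]   # toy pitch period
right = [0.6 * s for s in left]                              # weaker far end
steady = jitter_periods(regenerate_steady_vowel(left, right, 10))
print([len(p) for p in steady])   # lengths near 80, varying by a few samples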

Even with the algorithms described so far, the speech output remains flat and expressionless, although it sounds natural. This is because the manner of utterance, i.e. the intonation and prosodic variations, is an integral part of natural speech, so the system must also take intonation and prosody into consideration. Thus, before the units are concatenated, their base frequencies, amplitudes and durations are modified depending on previously observed patterns of utterance. These patterns are captured as rules extracted by studying natural speech. It is worthwhile to note that, among frequency, duration and amplitude, frequency has been observed to have the most profound impact on the naturalness of synthetic voice. Hindi and Bengali text-to-speech systems have been developed using the methods discussed above. These systems were first tested without incorporating the algorithms and then tested again after their implementation; marked improvements in the utterances were noted after the incorporation of these algorithms.
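As a toy illustration of this per-unit modification, the sketch below (our construction, not the system's rule base) raises or lowers the base frequency of one pitch period by resampling it to a new length and applies a plain amplitude gain; duration control, not shown, would repeat or drop whole periods.

def apply_prosody(period, f0_scale=1.0, amp_scale=1.0):
    # Higher base frequency means a shorter pitch period. Crude
    # nearest-neighbour resampling keeps the example short; it is only
    # a stand-in for proper pitch modification.
    new_len = max(1, round(len(period) / f0_scale))
    resampled = [period[min(len(period) - 1, int(i * len(period) / new_len))]
                 for i in range(new_len)]
    return [amp_scale * s for s in resampled]

p = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]   # toy 8-sample period
print(len(apply_prosody(p, f0_scale=1.25)))       # 6 samples: ~25% higher pitch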

3.3 Implementation on Handhelds

The text-to-speech system was first developed on a PC running Microsoft Windows 2000. However, it was soon felt that, to disseminate the technology among the speech impaired population, a low-cost portable version of the system needed to be developed. It was then decided to implement the system on two popular handheld platforms: the Compaq iPaq, built around the SA-1110 processor, and the Casio Cassiopeia, built around the MIPS VR4122 processor, both running Microsoft Pocket PC. The exact system specifications are shown in Table 1. The development on the handheld platforms was carried out in Embedded Visual C++ 3.0, which can generate code for handheld platforms running Microsoft Pocket PC on any of the four processor families MIPS, ARM, StrongARM (SA) and SH3. Keeping the portability of the system in mind, two versions of the system were developed: one where the voice and dictionary files are kept on an external storage card, and another where all the files are kept in the internal memory of the device. The application is downloaded onto the PDA, connected to a desktop PC through the COM or USB port, using Microsoft ActiveSync.

For easy operation of the device, a very simple graphical user interface was designed, shown in Figure 3. The user can select the language using the radio buttons in the "Language" group box. The input text can be keyed into the provided text box. The "Speech" button generates the speech file corresponding to the text keyed in and plays the audio file generated. Due to memory constraints, the speech output file is deleted after the speech is produced. One of the problems faced was that the code generated by Embedded Visual C++ 3.0 for the ARM platform did not run correctly; a workaround was devised and certain changes were made in the source code to make the system run on the Compaq iPaq, which is based on the Intel SA-1110. The change in the code for the text-to-speech system is illustrated in Figure 4. After porting the system to the two handheld platforms, it was extensively tested using inputs of varying sizes, and the running times were measured. The running times of the programs on the Casio Cassiopeia platform, built around the MIPS VR4122 processor, are shown in Table 2.

4 Future Work

The text-to-speech system has already been ported to two Microsoft Pocket PC based PDAs, the Compaq iPaq and the Casio Cassiopeia. Future research is being targeted towards the design of a complete hardware-software system for natural language and speech applications such as this text-to-speech system.

5 Conclusion

In this paper a text-to-speech system using the concatenative speech synthesis technique has been described. This is the first text-to-speech system built specifically for Indian languages. It has been built for two Indian languages, Bengali and Hindi, but the system has been designed in such a manner that it can very easily be extended to other Indian languages as well. For ease of use and portability, the system has been ported to two existing handheld platforms based on two different processor families, the Compaq iPaq and the Casio Cassiopeia. Two versions of the system have been developed, one for use with internal memory and the other for use with a storage card. Current research work is directed towards using this text-to-speech system to develop other applications easing communication for the visually handicapped, the speech impaired and people suffering from neuro-motor disorders.

6 Acknowledgements

We want to thank Microsoft Corporation for providing the RFP award. We also wish to thank the Microsoft Laboratory and the Communication Empowerment Laboratory, Indian Institute of Technology, Kharagpur, and Cal2Cal Corporation, Kolkata.

References

[1] C.A. Pennington and K.F. McCoy, Providing Intelligent Language Feedback for Augmentative Communication Users, Springer-Verlag, 1998.


[2] C.A. Pennington and K.F. McCoy, Providing Intelligent Language Feedback for Augmentative Communication Users, Springer-Verlag, 1998.

[3] P. Green and A.J. Brightman, Independence Day: Designing Computer Solutions for Individuals with Disability, Apple Computer Inc., 1990.

[4] L. Booth, A. Newell, Morris, and I. Ricketts, "PAL: A writing aid for language-impaired users," Augmentative and Alternative Communication, vol. 8, 1992.

[5] P.B. Vanderheyden and C.A. Pennington, An Augmentative Communication Interface Based on Conversational Schemata, Springer-Verlag, 1998.

[6] J. A. VanDyke, “Applying natural language processing to enhance communication,” 1991.

[7] D.H. Klatt, "The Klattalk text-to-speech conversion system," IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 1589–1592, 1982.

[8] B.S. Atal, "A new model for LPC excitation for producing natural-sounding speech at low bit-rate," in IEEE-ICASSP, 1997.

[9] S. Maeda, "A digital simulation method for the vocal tract system," Speech Communication, vol. 1, no. 3, pp. 199–229, 1982.

[10] K.N. Stevens, S. Kasowski, and G. Fant, "An electrical analogue of the vocal tract," JASA, no. 25, pp. 734–742, 1953.

[11] Soumen Chowdhury, Ashok K. Datta, and B.B. Chaudhuri, "On the design of universal speech synthesis in Indian context."

[12] James Allen, Natural Language Understanding, 1995.

[13] R. Kaplan and M. Kay, "Regular models of phonological rule systems," Computational Linguistics, vol. 20, no. 3, pp. 331–378, 1994.

[14] M. Choudhury, "Rule-based grapheme to phoneme mapping for Hindi speech synthesis," in 90th Indian Science Congress of ISCA, Bangalore, 2003.


[15] D. Sen, S. Sen, S.J. Chakraborty, A. Chakraborty, and A. Basu, "An Indian language speech interface for empowering the visually handicapped," in Proceedings of the International Workshop on Frontiers of Research in Speech and Music, Kanpur, 2003, pp. 113–120.

Table 1: System Specifications

FEATURES            CASIO CASSIOPEIA          COMPAQ iPAQ
Processor           MIPS VR4122               Intel SA-1110
CPU Speed           150 MHz                   206 MHz
Display Colors      65K colors, 16-bit TFT    4K colors, 12-bit TFT
RAM                 32 MB                     64 MB
Input Method        Touch-screen, stylus      Touch-screen, stylus
Connectivity        Serial, USB, IrDA         Serial, USB, IrDA
Operating System    Pocket PC                 Pocket PC

Table 2: Execution Times on Cassiopeia (in secs)

String Length    Execution Time (Bengali)    Execution Time (Hindi)
20               4                           2
40               7                           3
60               7                           7
80               9                           8
100              9                           9
150              10                          12
300              Memory Low                  Memory Low


Figure 1: A Schematic Diagram of the Speech Synthesis System

Figure 2: Operation of the NLP


Figure 3: The Graphical User Interface

Figure 4: Change in code for ARM platforms


Figure 5: The character map in ITRANS for Bengali
