Shruti - An Embedded Text-to-Speech System for Indian
Languages
Arijit Mukhopadhyay, Soumen Chakraborty∗, Monojit Choudhury, Anirban Lahiri, Soumyajit Dey, Anupam Basu†‡
July 20, 2005
Abstract
In India the recent increase in the number of people with physical impairments has created a need
for low-cost portable augmentative and alternative communication devices. In this paper
we describe a text-to-speech system for Indian languages which accepts text input in two Indian
languages, Hindi and Bengali, and produces near-natural audio output. This text-to-speech system
has been ported to two common handheld platforms, namely the iPaq from Compaq and the Cassiopeia from
Casio, both running Microsoft Pocket PC.
1 Introduction
In India the physically impaired population has reached an alarming figure of 8.9 million people, of
whom almost 15% suffer from speech and visual impairments. This section of the population depends
solely on Augmentative and Alternative Communication techniques [1] [2] for their education and
communication. Different tools have been implemented for these users, but unfortunately they
are in the English language and are too costly for the Indian population [3] [4] [5] [6]. In response
to their need we have taken up the task of developing low-cost portable communication tools to aid
∗Dept. of Electrical Engineering, Indian Institute of Technology, Kharagpur
†Dept. of Computer Science and Engineering, Indian Institute of Technology, Kharagpur
‡arijitm@cse.iitkgp.ernet.in
the speech-impaired population in India. In this paper we describe an Indian-language text-to-speech
system that accepts text input in two Indian languages, namely Hindi and Bengali, and produces
near-natural audio output. The system runs on a Compaq iPaq PDA built around the Intel StrongARM
SA-1110 processor running Microsoft Pocket PC, a customized version of Microsoft's WinCE operating
system for mobile and other handheld devices. The system has also been ported to a Casio
Cassiopeia built around a MIPS VR4122 processor running Pocket PC. We have built two versions of the
system: one resides in the system memory and the other runs from a storage card. The
paper is organized as follows. Section 2 describes the technique used for speech synthesis. Section 3
describes the architecture of the text-to-speech system; its subsections discuss in detail
the technologies involved in designing the system and its implementation and performance on some
common handheld platforms. The paper concludes with a short description of the future work that can
be carried out based on this system.
2 Speech Synthesis: Techniques
Speech synthesis involves the algorithmic conversion of input text data to speech waveforms. Speech
synthesizers are characterized by the methods used for storage, encoding and synthesis of the speech.
The synthesis method is determined by the vocabulary size, since all possible utterances of the
language need to be modeled. Text-to-speech systems designed for use in a restricted domain always
perform better than their general-purpose counterparts; the design of general-purpose systems is
complicated by the fact that the sound output needs to be close to natural speech. There are
different approaches to speech synthesis, such as rule-based synthesis, articulatory modelling and the
concatenative technique. Articulatory synthesis [7] [8] [9] [10], in which the human larynx, the main
speech production organ, is simulated electronically, is in principle the most faithful method. But
recent speech research has been directed towards concatenative speech synthesizers because of the
difficulty of modeling the human larynx accurately. In concatenative speech synthesis, a set of
signal units or specific waveforms are concatenated. Indian languages differ from one another both
in their consonant-to-vowel transitions and phonologically. Nevertheless, a unified voice database
suffices for Hindi and
Bengali, as most of the vowel sounds are common; the only exception is the Hindi a as
in kam. The same unified database works unchanged for many Indian languages. In the designed
system the waveform concatenation approach along with the Epoch Synchronous Overlap and Add
(ESOLA) [11] technique has been used as the basis for speech synthesis. The system presented here is
a general-purpose text-to-speech system for Hindi and Bengali, which were chosen because they are the
most widely spoken Indian languages. The speech units most commonly used in concatenative synthesis
are words, syllables, demi-syllables, phonemes and sometimes even triphones. The present system
mainly uses partnemes as its units. A partneme is a waveform covering the transition from a consonant
to a vowel (CV) or from a vowel to a consonant (VC), or a lone consonant or lone vowel. For a vowel
partneme only one pitch period, called the epoch, is stored, since a single epoch can completely
characterize a vowel. The partnemes are spliced from recorded utterances covering the combinations
of all vowels and consonants.
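The partneme inventory described above can be sketched as a small segmentation routine. This is a minimal illustration, not the system's actual code: the function name `to_partnemes` and the vowel set are invented for the example, and real partnemes are waveforms rather than string labels.

```python
# Hypothetical sketch: splitting a phoneme sequence into partneme unit
# labels (CV transition, VC transition, lone consonant, lone vowel).
# VOWELS is an illustrative subset, not the full Hindi/Bengali inventory.
VOWELS = {"a", "aa", "i", "u", "e", "o"}

def to_partnemes(phonemes):
    """Turn a phoneme sequence into partneme unit labels.

    A consonant followed by a vowel yields a CV transition unit; a vowel
    followed by a consonant yields a VC unit; the vowel itself is also
    emitted as a steady-state unit (one stored epoch in the system).
    """
    units = []
    for i, ph in enumerate(phonemes):
        prev_ph = phonemes[i - 1] if i > 0 else None
        if ph in VOWELS:
            if prev_ph is not None and prev_ph not in VOWELS:
                units.append(prev_ph + ph)  # CV transition, e.g. "kha"
            units.append(ph)                # steady-state vowel
        else:
            if prev_ph is not None and prev_ph in VOWELS:
                units.append(prev_ph + ph)  # VC transition, e.g. "ag"
            units.append(ph)                # lone consonant
    return units

# Kharagpur, whose CV/VC units (kha, ag, pu, ur) are used as examples later.
print(to_partnemes(["kh", "a", "r", "a", "g", "p", "u", "r"]))
```

Running it on the phonemes of Kharagpur yields, among other units, the transitions kha, ag, pu and ur.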
3 The Text-to-Speech System
3.1 Basic Architecture
A schematic diagram of the speech synthesis system is shown in Figure 1. The system consists of two main
blocks A and B where block A is the language dependent block and block B is the Indian Language
Phonetic Synthesizer (ILPS). Block A consists of an input device, a natural language processor and
an intonational and prosodic rule base. The system accepts input text in ITRANS a notation for
expressing Indian language scripts through Roman characters. 5 shows the mapping between the
Hindi character set and ITRANS. The use of ITRANS was necessitated by the fact that there is no
inbuilt support for Indian language fonts in Microsoft Pocket PC. This problem is solved in WinCE
5.0 which accepts text input in Indian fonts. Language dependent prosodic and intonation rules are
included in this block, primarily as a knowledge base. The natural language processor in block A
comprises a phoneme parser, which uses either a phonological dictionary with syllable and word
markers or a linguistic rule base to produce an output phonemic string. Part B is the synthesizer:
it takes as input the output string of phonemes (in ILPS symbols), together with intonation and
prosody information from the basic rule base, and produces the speech output.
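The two-block flow can be sketched roughly as follows. Everything here is an invented placeholder: the greedy longest-match parse, the toy rule table and the integer "waveforms" stand in for the real phonological dictionary, linguistic rule base and partneme voice database.

```python
# A minimal sketch of the Figure 1 architecture: Block A (language-
# dependent NLP) maps graphemes to phonemes; Block B (ILPS) concatenates
# the signal unit for each phoneme. All data below is illustrative.

def block_a_parse(itrans_text, phoneme_rules):
    """Block A: graphemes -> phoneme string, via greedy longest match."""
    phonemes = []
    i = 0
    while i < len(itrans_text):
        for length in (2, 1):  # try two-letter graphemes like "kh" first
            chunk = itrans_text[i:i + length]
            if chunk in phoneme_rules:
                phonemes.append(phoneme_rules[chunk])
                i += length
                break
        else:
            i += 1  # skip characters with no rule
    return phonemes

def block_b_synthesize(phonemes, voice_db):
    """Block B: look up and concatenate each phoneme's signal unit."""
    output = []
    for ph in phonemes:
        output.extend(voice_db[ph])
    return output

rules = {"kh": "kh", "a": "a", "m": "m"}     # toy rule base
db = {"kh": [1, 2], "a": [3], "m": [4]}      # toy signal units
phs = block_a_parse("kham", rules)
print(phs)                                   # ['kh', 'a', 'm']
print(block_b_synthesize(phs, db))           # [1, 2, 3, 4]
```

The real Block B additionally applies the smoothing and prosody algorithms described in the next subsection before concatenation.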
3.2 Technologies Used
After the user has keyed in the text (in ITRANS) to be spoken, the Natural Language Processor
(NLP) [12] takes that text and analyses it to perform grapheme-to-phoneme conversion. A grapheme
is the actual text whereas the phoneme is a token that directly maps to a signal unit in the voice
database, or partneme dictionary. This grapheme-to-phoneme conversion requires morphological
and phonological analysis [13]. Thus the text given as input to the NLP is converted into a string
of phonemes [14] [15]. Given the name of the city of Kharagpur as input to the NLP, the resulting
output has been shown as an example in Figure 2. In the figure the input grapheme string shown in
English exactly corresponds to the grapheme string shown in Bengali above it. The vowel a does not
have a separate graphemic representation. In the output phonemic string the three syllables of
Kharagpur are alternately coloured. A syllable may consist of a consonant-vowel-consonant sequence,
as in the second and third syllables, or of just a consonant and a vowel, or sometimes only a vowel.
Phoneme strings will be composed of representatives of consonants (such as kh, R, p etc. in the
figure), that of vowels (like a, u) or that of consonant-vowel or vowel-consonant transitions (like kha,
ag, pu, ur, etc.). Each phoneme maps to a single unit of the voice database. In the example that has
been illustrated, the phoneme pu maps to a signal which is a transitory waveform from the consonant
p to the vowel u. Similarly, kh, kha, etc. map to their particular signal units. The concatenation unit
reads the phonemes one by one from the output phoneme string. For each read phoneme, the unit
fetches the corresponding sound unit that the phoneme maps to from the voice database and appends
it to the end of the voice output string. To make the speech output more natural the speech units
have to be algorithmically modified before they are concatenated. The algorithms used in this system
for modifying the signal units are briefly described below:
1. The spectral mismatch between steady-state vowels and the transitory vowels that occur in
consonant-vowel or vowel-consonant transitions is one of the most common problems faced in
concatenative synthesis. To rectify this problem the steady-state vowel is regenerated by interpo-
lation from the given terminal pitch periods at both ends in the case of a consonant-vowel-consonant
sequence, or from one end for a consonant-vowel or vowel-consonant sequence. Though the
transitory movements of the formant structures are non-linear, a linear approximation yielded
acceptable results.
2. Another common problem with the simple concatenation technique is the production of a
perceptible mechanical, horn-like sound over and above the normal quality of the speech. This is
because partneme concatenation produces exactly periodic waves, whereas normal human voice is
quasi-periodic. Hence, to make the output sound more natural, the vowel or voiced part of the
speech is allowed to vary randomly within a small limit (typically 3-4% of the base frequency)
rather than being kept at a constant frequency.
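The quasi-periodicity fix in point 2 can be sketched as follows. This is an illustration only: the function name, the nearest-neighbour resampling, and the sine-cycle "epoch" are invented; the 3-4% jitter band is the one quoted in the text.

```python
# Sketch: instead of repeating a vowel's single stored epoch at a constant
# rate (exactly periodic, horn-like), perturb each period's duration
# randomly within a small band so the result is quasi-periodic.
import math
import random

def synthesize_vowel(epoch, n_periods, jitter=0.035, seed=0):
    """Repeat one pitch period, varying each period's length by +/- jitter."""
    rng = random.Random(seed)
    out = []
    base_len = len(epoch)
    for _ in range(n_periods):
        factor = 1.0 + rng.uniform(-jitter, jitter)   # e.g. a 3.5% band
        new_len = max(1, round(base_len * factor))
        # Nearest-neighbour resampling of the stored epoch to the new length.
        out.extend(epoch[min(base_len - 1, int(i * base_len / new_len))]
                   for i in range(new_len))
    return out

# One synthetic pitch period (a sine cycle) stands in for a stored epoch.
epoch = [math.sin(2 * math.pi * i / 100) for i in range(100)]
voiced = synthesize_vowel(epoch, n_periods=50)
print(len(voiced))  # close to, but generally not exactly, 50 * 100 samples
```

Because each period's length drifts independently, no two periods are identical and the mechanical quality of exact periodicity is broken.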
A further problem is that the speech output obtained so far, though natural-sounding, is flat and
expressionless. This is because the manner of utterance, i.e. the intonation and prosodic variations,
is an integral part of natural speech, so the system must also take intonation and prosody into
consideration. Thus, before the units are concatenated, their base frequencies, amplitudes and
durations are modified according to previously observed patterns of utterance. These patterns are
encoded as rules extracted by studying natural speech. It is worth noting that among frequency,
duration and amplitude, it has been observed
that frequency has the most profound impact on the naturalness of synthetic voice. Hindi and Bengali
text-to-speech systems have been developed using the methods discussed above. These systems were
first tested without the algorithms above and tested again after their implementation; marked
improvements in the utterances were observed after the algorithms were incorporated.
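The prosody step above can be sketched as a small routine that rescales a unit's amplitude and duration before concatenation. The function, the scale factors and the triangular sample unit are invented placeholders; the system's actual rule base was extracted from recorded natural speech.

```python
# Sketch of the prosody modification: scale a unit's amplitude and
# stretch/compress its duration before it is concatenated. For a single
# repeated epoch, changing the duration also changes the base frequency.

def apply_prosody(samples, amp_scale=1.0, dur_scale=1.0):
    """Scale a unit's amplitude and resample it to a new duration."""
    n_out = max(1, round(len(samples) * dur_scale))
    # Nearest-neighbour resampling to the new length.
    stretched = [samples[min(len(samples) - 1, int(i * len(samples) / n_out))]
                 for i in range(n_out)]
    return [s * amp_scale for s in stretched]

unit = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
# Example rule: a sentence-final unit is quieter and slightly longer.
out = apply_prosody(unit, amp_scale=0.8, dur_scale=1.25)
print(len(out), max(out))  # 10 0.8
```

In the real system the factors would come from the intonation and prosodic rule base in Block A rather than being fixed constants.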
3.3 Implementation on Handhelds
The text-to-speech system was first developed on a PC running Microsoft Windows 2000. However, to
disseminate the technology among the speech-impaired population, a low-cost portable version of the
system was needed. It was therefore decided to implement the system on two popular handheld
platforms: the Compaq iPaq, built around the SA-1110 processor, and the Casio Cassiopeia, built
around the MIPS VR4122 processor, both running Microsoft Pocket PC. The exact system specifications
are shown in Table 1. Development for the handheld platforms was carried out in Embedded Visual
C++ 3.0, which can generate code for Pocket PC devices built on any of four processor families,
namely MIPS, ARM, StrongARM and SH3. Keeping the portability of the system in mind, two versions
were developed: one where the voice and dictionary files are kept on an external storage card, and
one where all the files are kept in the internal memory of the device. The application is downloaded
onto the PDA, connected to a desktop PC through the COM or USB port, using Microsoft ActiveSync.
For easy operation of the device a very simple graphical user interface was designed, shown in
Figure 3. The user selects the language using the radio buttons in the “Language” group box.
The input text is keyed into the text box provided. The “Speech” button generates the speech file
corresponding to the keyed-in text and plays the generated audio file. Due to memory constraints
the speech output file is deleted after the speech is produced. One problem faced was that the code
generated by Embedded Visual C++ 3.0 for the ARM platform did not run correctly on the target device.
A workaround was devised and certain changes were made in the source code to make the system run
on the Compaq iPaq, which is based on the Intel SA-1110. The change in the code of the
text-to-speech system is illustrated in Figure 4. After porting the system to the two handheld
platforms, it was extensively tested using inputs of varying sizes, and the running times were
measured. The running times on the Casio Cassiopeia platform, built around the MIPS VR4122
processor, are shown in Table 2.
4 Future Work
The Text-to-Speech system has already been ported to two Microsoft Pocket PC based PDAs namely
the Compaq iPaq and the Casio Cassiopeia. Future research is targeted towards the design of a
complete hardware-software system for natural language and speech applications such as the
text-to-speech system.
5 Conclusion
In this paper a text-to-speech system using the concatenative speech synthesis technique has been
described. This is the first text-to-speech system built specifically for Indian languages. It
currently supports two Indian languages, Bengali and Hindi, but has been designed in such a manner
that it can easily be extended to other Indian languages as well. For ease of use and portability
the system has been ported to two existing handheld platforms based on two different processor
families, the Compaq iPaq and the Casio Cassiopeia. Two versions of the system have been developed,
one for use with internal memory and the other for use with a storage card. Current research is
directed towards using this text-to-speech system to develop other applications easing communication
for the visually handicapped, the speech impaired and people suffering from neuro-motor disorders.
6 Acknowledgements
We thank Microsoft Corporation for providing the RFP award. We also wish to thank the Microsoft
Laboratory and the Communication Empowerment Laboratory, Indian Institute of Technology Kharagpur,
and Cal2Cal Corporation, Kolkata.
References
[1] C.A. Pennington and K.F. McCoy, Providing Intelligent Language Feedback for Augmentative
Communication Users, Springer-Verlag, 1998.
[2] C.A. Pennington and K.F. McCoy, Providing Intelligent Language Feedback for Augmentative
Communication Users, Springer-Verlag, 1998.
[3] P. Green and A.J. Brightman, Independence Day: Designing Computer Solutions for Individuals
with Disability, Apple Computer Inc., 1990.
[4] L. Booth, A. Newell, Morris, and I. Ricketts, “PAL: A writing aid for language-impaired users,”
Augmentative and Alternative Communication, vol. 8, 1992.
[5] P.B. Vanderheyden and C.A. Pennington, An augmentative communication interface based on
conversational schemata, Springer-Verlag, 1998.
[6] J. A. VanDyke, “Applying natural language processing to enhance communication,” 1991.
[7] D.H. Klatt, “The Klattalk text-to-speech conversion system,” in IEEE International Conference
on Acoustics, Speech, and Signal Processing, pp. 1589–1592, 1982.
[8] B.S. Atal, “A new model for lpc excitation for producing natural sounding speech at low bit-rate,”
in IEEE-ICASSP, 1997.
[9] S. Maeda, “A digital simulation method for the vocal tract system,” Speech Communication,
vol. 1, no. 3, pp. 199–229, 1982.
[10] K.N. Stevens, S. Kasowski, and G. Fant, “An electrical analogue of the vocal tract,” JASA,
no. 25, pp. 734–742, 1953.
[11] S. Chowdhury, A.K. Datta, and B.B. Chaudhuri, “On the design of universal speech synthesis
in Indian context.”
[12] James Allen, Natural Language Understanding, 1995.
[13] R. Kaplan and M. Kay, “Regular models of phonological rule systems,” Computational Linguis-
tics, vol. 20, no. 3, pp. 331–378, 1994.
[14] M. Choudhury, “Rule-based grapheme to phoneme mapping for hindi speech synthesis,” in 90th
Indian Science Congress of ISCA, Bangalore, 2003.
[15] D. Sen, S. Sen, S.J. Chakraborty, A. Chakraborty, and A. Basu, “An indian language speech
interface for empowering the visually handicapped,” in Proceedings of International Workshop
on Frontiers of Research in Speech and Music, Kanpur, 2003, pp. 113–120.
Table 1: System Specifications
FEATURES CASIO CASSIOPEIA COMPAQ iPAQ
Processor MIPS VR4122 Intel SA-1110
CPU Speed 150 MHz 206 MHz
Display Colors 65K colors, 16 bit TFT 4K colors, 12 bit TFT
RAM 32MB 64 MB
Input Method Touch-screen, stylus Touch-screen, stylus
Connectivity Serial, USB, IrDA Serial, USB, IrDA
Operating System Pocket PC Pocket PC
Table 2: Execution Times on Cassiopeia (in seconds)
String Length Execution Time(Bengali) Execution Time(Hindi)
20 4 2
40 7 3
60 7 7
80 9 8
100 9 9
150 10 12
300 Memory Low Memory Low