
An interactive Speech Web site in Arabic and English

O. Al-Dakkak(1), N. Ghneim(2), I. Salman(3)
____________________________________________________________________________________________

Abstract: In this paper, we propose our approach to implementing an interactive speech website in Arabic and English. The method relies on the integration of open-source speech recognition, text-to-speech (TTS), and dialogue systems to build our application. The application is targeted at enabling blind people and children to browse a news website with short stories, and can help them enter the digital world in spite of their difficulties.

Keywords: Speech Recognition, Speech Synthesis, JavaScript API, Natural Language Processing, Interactive website.
____________________________________________________________________________________________

1. Introduction:

The field of voice interaction between man and machine involves more and more researchers. The aim of this work is to enable people with written-language interaction difficulties, such as blind people or young children, to interact with the machine and access the contents arranged in a website using their voices only. Many open-source packages are available to help build such applications. We scan some of these packages and choose the best ones, in terms of portability, performance, and quality, to build our application, together with help and remarks from the targeted people to ensure ease of use in both Arabic and English.

This paper presents the different components used in such systems (automatic speech recognition (ASR) systems, text-to-speech (TTS) systems, and dialogue systems); then we introduce our implementation method and the obtained results. Finally, we give our conclusion.

2. ASR system:

Speech recognition means processing a recorded speech signal to determine the words spoken [1]. A further step, understanding the meaning of these words, is needed to implement the voice command associated with them. In our application, we use voice commands to navigate between the pages of our site. Over the years, a number of different methodologies have been proposed for isolated-word and continuous speech recognition. These can usually be grouped into two classes: speaker-dependent and speaker-independent. Speaker-dependent methods usually involve training a system to recognize each of the vocabulary words, uttered once or multiple times by a specific set of speakers [1, 2], while for speaker-independent systems such training methods are generally not applicable and words are recognized by analyzing their inherent acoustic properties [3]. Hidden Markov Models (HMMs) have proven to be highly reliable classifiers for speech recognition applications and have been used extensively with varying degrees of success [4, 5]. In this technique, we build an HMM for each word to be recognized, based on the probability of starting in the first state, the transition probabilities between states, and the emission probabilities of each state. We usually use five-state models for isolated-word recognition, assuming each word consists of roughly five sounds on average. Artificial Neural Network (ANN) techniques have also been demonstrated to be acceptable classifiers for speech recognition [8], trying to imitate the recognition process in the brain of living creatures.
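
To make the HMM word-scoring step concrete, here is a minimal JavaScript sketch (our illustration, not the authors' code) of the forward algorithm for one word model; pi holds the initial-state probabilities, A the transition matrix, and b(s, o) the emission probability of observation o in state s:

    // A minimal sketch (illustration only) of scoring one word with a
    // left-to-right HMM using the forward algorithm.
    function forwardScore(pi, A, b, obs) {
      const N = pi.length;                            // number of states (five here)
      let alpha = pi.map((p, s) => p * b(s, obs[0])); // initialization at t = 0
      for (let t = 1; t < obs.length; t++) {
        alpha = alpha.map((_, s) => {
          let sum = 0;
          for (let r = 0; r < N; r++) sum += alpha[r] * A[r][s]; // paths into state s
          return sum * b(s, obs[t]);                  // emit observation t in state s
        });
      }
      return alpha.reduce((acc, v) => acc + v, 0);    // P(obs | word model)
    }

Recognition then amounts to evaluating this score under each word's model and picking the word with the highest probability.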

3. TTS System

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and it can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations, like phonetic transcriptions, into speech [9, 10]. Synthesized speech can be created by concatenating pieces of recorded speech that are stored in a database. Systems differ in the size of the stored speech units: a system that stores phones or diphones provides the largest output range, but may lack clarity. For specific usage domains, the storage of entire words or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other human voice characteristics to create a completely "synthetic" voice output [11]. The quality of a speech synthesizer is judged by its similarity to the human voice and by its ability to be understood clearly. An intelligible text-to-speech program allows people with visual impairments or reading disabilities to listen to written works on a home computer. Many computer operating systems have included speech synthesizers since the early 1990s, for English and some other languages; however, not many free Arabic TTS systems are available.
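
As a toy illustration of the concatenative approach (not taken from the paper), the sketch below joins pre-recorded unit waveforms from a hypothetical unitDb map; a real synthesizer would also smooth the joins between units:

    // A toy sketch of concatenative synthesis: look up each requested unit
    // (e.g. a diphone) in a hypothetical unitDb map of recorded waveforms
    // and join the samples end to end.
    function synthesize(unitDb, units) {
      const parts = units.map((u) => {
        const wave = unitDb.get(u);                  // Float32Array of samples
        if (!wave) throw new Error("missing unit: " + u);
        return wave;
      });
      const total = parts.reduce((n, w) => n + w.length, 0);
      const out = new Float32Array(total);
      let offset = 0;
      for (const w of parts) { out.set(w, offset); offset += w.length; }
      return out; // raw samples; a real system also smooths the unit joins
    }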


4. Dialogue Systems:

A dialog system or conversational agent (CA) is a computer system intended to converse with a human with a coherent structure. Dialog systems have employed text, speech, graphics, haptics, gestures, and other modes for communication on both the input and output channels. There are many different architectures for dialog systems; which components are included, and how those components divide up responsibilities, differs from system to system. The dialog manager is a principal component of any dialog system: it manages the state of the dialog and the dialog strategy. A typical activity cycle in a dialog system contains the following phases [12] (a minimal code sketch of the cycle is given after the list):

I. The user speaks, and the input is converted to plain text by the system's input recognizer/decoder, which may include:
   a- automatic speech recognizer (ASR)
   b- gesture recognizer
   c- handwriting recognizer

II. The text is analyzed by a natural language understanding (NLU) unit, which may include:
   a- proper name identification
   b- part-of-speech tagging
   c- syntactic/semantic parser

III. The semantic information is analyzed by the dialog manager, which keeps the history and state of the dialog and manages the general flow of the conversation.

IV. Usually, the dialog manager contacts one or more task managers that have knowledge of the specific task domain.

V. The dialog manager produces output using an output generator, which may include:
   a- natural language generator
   b- gesture generator
   c- layout engine

VI. Finally, the output is rendered using an output renderer, which may include:
   a- text-to-speech engine (TTS)
   b- talking head
   c- robot or avatar

Dialog systems that are based on a text-only interface (e.g. text-based chat) contain only stages II-V.
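
As a minimal sketch of this activity cycle for a speech-only system, assuming hypothetical recognize, understand, and speak components:

    // A minimal sketch of the activity cycle for a speech-only dialog system.
    // recognize(), understand() and speak() are hypothetical stand-ins for the
    // ASR (stage I), NLU/dialog manager/generator (stages II-V) and TTS (stage VI).
    async function dialogCycle(recognize, understand, speak) {
      const state = { history: [] };            // dialog state kept by the manager
      let done = false;
      while (!done) {
        const text = await recognize();         // I. user speech -> plain text
        const reply = understand(text, state);  // II-V. analyze, update state, generate
        state.history.push({ user: text, system: reply.text });
        await speak(reply.text);                // VI. render the output as speech
        done = reply.endOfDialog === true;
      }
    }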

5. Implementation:

In this section, we present the approach used to build our system, with its three main components: Text-to-Speech, Spoken Command Recognition, and Web Site Dialogue. The JAD website was designed to be a simple speech-interactive website in Arabic and English. Figure 1 presents the structure of the system.

Figure (1): The structure of the system

5.1. Text-to-Speech system:

Our system synthesizes English and Arabic texts. To perform the synthesis we have used:

I. Speech SDK 5.1 from Microsoft to synthesize English texts [13].


II. Things were more complicated for Arabic. First, we have to input diacritized Arabic texts. These texts are entered into a grapheme-to-phoneme system we built for this purpose. The phonemes are then introduced to the MBROLA system [14] from the TCTS Laboratory1, which concatenates diphones based on the list of phonemes and produces speech. In fact, work has been done to synthesize rough prosody based on punctuation [16].
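
As an illustration of the last step of this pipeline, the sketch below formats a phoneme list as MBROLA .pho input (one phoneme per line: name, duration in milliseconds, optional percent-position/pitch targets); the 80 ms durations and the flat 120 Hz pitch are illustrative assumptions, not the system's actual values:

    // A sketch of turning a phoneme list into MBROLA .pho input.
    // Each line is "phoneme duration-ms [pos% pitch-Hz]".
    function toPho(phonemes, durMs = 80, pitchHz = 120) {
      const lines = phonemes.map((p) => p + " " + durMs + " 50 " + pitchHz);
      return "_ 100\n" + lines.join("\n") + "\n_ 100\n"; // "_" = silence
    }

    // e.g. for the transliterated word "rabit" (رابط):
    const pho = toPho(["r", "a", "b", "i", "t"]);
    // The .pho text is then fed to the MBROLA binary together with an
    // Arabic diphone database to produce the speech waveform.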

5.2. Spoken Command Recognition:

As for the speech recognition, we have used a JavaScript library called Annyang [15], a free library performing multi-speaker recognition. Annyang is a tiny JavaScript library that lets the visitors of a site control the site with voice commands. It has no dependencies, weighs just 2 KB, and is free to use. Annyang supports multiple languages, but it does not support the Arabic language, so we have built an adapter from Arabic into Latin characters. For example, when the user gives the command "رابط", meaning "link", this word is converted to "rabit" and the voice command is then executed.

1 Théorie des Circuits et Traitement du Signal laboratory (TCTS Labs), http://tcts.fpms.ac.be/

Since our application is an interactive site that executes specific voice commands, we use the English-language phonemes (individual sounds) that are best suited to the Arabic sounds, as shown in Table (1):

[Table (1): Arabic phonemes used in the system — a mapping of the Arabic letters and vowel diacritics (short and long) to the Latin phoneme symbols used by the adapter, e.g. ب → B, ف → F, ت → T, ق → Q, ك → K, م → M, ن → N, د → D, س → S, ر → R, و → W.]

In fact, we have just a few words to recognize in both the English and Arabic versions. The list of commands is given in Table 2. In addition, a correction strategy is adopted as follows: when the user utters a word not listed in the commands, the system chooses the nearest command to it and asks the user whether that command is the correct one.
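
The paper does not specify the distance measure behind "nearest command"; below is a minimal sketch using Levenshtein (edit) distance as an illustrative choice:

    // Levenshtein (edit) distance between two strings.
    function levenshtein(a, b) {
      const d = Array.from({ length: a.length + 1 }, (_, i) =>
        Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
      );
      for (let i = 1; i <= a.length; i++) {
        for (let j = 1; j <= b.length; j++) {
          d[i][j] = Math.min(
            d[i - 1][j] + 1,                                   // deletion
            d[i][j - 1] + 1,                                   // insertion
            d[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // substitution
          );
        }
      }
      return d[a.length][b.length];
    }

    function nearestCommand(utterance, commandList) {
      let best = null, bestDist = Infinity;
      for (const c of commandList) {
        const dist = levenshtein(utterance, c);
        if (dist < bestDist) { bestDist = dist; best = c; }
      }
      return best; // the system then asks: "if you mean <best>, say ok"
    }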

5.3. Web Site Dialogue:

This component handles the process of interaction between the user and the site. It begins with a welcoming phrase ("Welcome to Jad dialogue website") and announces the available options. Then either the system recognizes the command, or it recognizes a command similar in pronunciation to an available uttered word, or it does not recognize the command at all. Below are some sections of a dialogue between the user and our system (named "JAD"):

a. Case 1: the system does not recognize the voice command
Jad: Welcome to Jad dialogue website.
User: About France.
Jad: Sorry, your command has not been found. Please enter another command, or say help.

b. Case 2: the system recognizes the voice command
Jad: Welcome to Jad dialogue website.
User: SUMMARY.
Jad: Ok.
Jad: Converting text on the page to Speech.

c. Case 3: the system predicts the command when it is not properly spoken
Jad: Welcome to Jad dialogue website.
User: New.
Jad: If you mean "News", say "ok", or enter another command, or say help.

In Arabic we have similar cases (shown here in English translation):

a. Case 1:
Jad: "Welcome to the interactive Jad website." (مرحباً بك في موقع جاد التفاعلي)
User: "International link." (رابط دولي)
Jad: "Your request was not found. Please enter another command, or say help."

b. Case 2:
Jad: "Welcome to the interactive Jad website."
User: "Political link." (رابط سياسي)
Jad: "Your request has been recognized."
Jad: "The text on the page is being converted to speech."

c. Case 3:
Jad: "Welcome to the interactive Jad website."
User: "Policies." (سياسات)
Jad: "If you mean 'political link', say 'yes', or give a new voice command."


We can move to the Arabic language through the voice command ("Go to Arabic"), and we can go back to the English language through the Arabic voice command for "English".

6. Results:

The evaluation of our system was done by conducting a survey of ten blind people. The survey is composed of the following questions:

1- Does the idea of the application suit you?

2- Do you find it comfortable to use the application?

3- Do you interact with the application easily?

4- Is the quality of the Arabic synthesized speech understandable?

5- Is the quality of the English synthesized speech understandable?

6- Do you have a problem executing the command you want?

7- Do you feel bothered repeating a command again?

8- What kind of website content would you like to be in interactive speech?

We obtained the results below:


Figure (2): Results of questions 1-5

Figure (3): Results of questions 6, 7

Figure (4): Results of question 8


We notice from the previous figures that the general idea of the application suited most of the blind people: they seemed satisfied using the application, and they found the way the application responds to them convenient and easy. However, they all expressed that "the Arabic speech synthesis process needs to be improved". We also notice that there is no problem either with the voice command implementation or with repeating commands. Moreover, most of them would like all websites to be interactive.

On the other hand, as our system is required to be a multi-user speech recognition system, and in order to evaluate the recognition rate, we performed an evaluation test in which we asked 25 people to pronounce a set of Arabic words used in our interface, and we calculated the recognition rate of each word. The results for the Arabic words are shown in Table (2).

Word                  Recognition rate
قصة (story)           75%
سياسي (political)     90%
موجز (summary)        85%
رابط (link)           95%
أول (first)           90%
التالي (next)         90%
نعم (yes)             95%
الثاني (second)       90%

Table (2): Recognition rates of Arabic words

We remark that the recognition rates of our system are good and acceptable for this application.

7. Conclusion:

In this paper, we presented our approach to using Arabic speech synthesis and speech recognition techniques to build a verbal interactive website of great significance to the community, especially to blind and visually impaired computer users, providing them with the capability of browsing websites through speech. However, this work is a first step toward making the web available to the blind, and further work has to be done to integrate the work on automatic prosody generation for Arabic, which is taking place at HIAST (Higher Institute for Applied Sciences and Technology), in order to enhance the quality of synthesized Arabic speech.

8. References

[1] M. B. Herscher, R. B. Cox, An adaptive isolated word speech recognition system, Proc. Conf. on Speech Communication and Processing, Newton, MA, 1972, 89-92.

[2] F. Itakura, Minimum prediction residual principle applied to speech recognition, IEEE Transactions on Acoustics, Speech and Signal Processing, ASSP-23, 1975, 67-72.

[3] V. N. Gupta, J. K. Bryan, J. N. Gowdy, A speaker-independent speech recognition system based on linear prediction, IEEE Transactions on Acoustics, Speech, Signal Processing, ASSP-26, 1978, 27-33.

[4] L. R. Rabiner, A tutorial on Hidden Markov Models and selected applications in speech recognition, Proc. IEEE, 77(2), 1989, 257-286.

[5] A. Betkowska, K. Shinoda, S. Furui, Robust speech recognition using factorial HMMs for home environments, EURASIP Journal on Advances in Signal Processing, Article ID 20593, 2007, 1-10.

[6] M. S. Rafiee, A. A. Khazaei, A novel model characteristics for noise-robust automatic speech recognition based on HMM, Proc. IEEE Int. Conf. on Wireless Communications, Networking and Information Security (WCNIS), 2010, 215-218.

[7] R. Low, R. Togneri, Speech recognition using the probabilistic neural network, Proc. 5th Int. Conf. on Spoken Language Processing, Australia, 1998.

[8] Thiang, S. Wijoyo, Speech recognition using linear predictive coding and artificial neural network for controlling movement of mobile robot, Proc. International Conference on Information and Electronics Engineering, IPCSIT vol. 6, 2011, 179-183.

[9] O. Al-Dakkak, N. Ghneim, Natural Language Processing, Syrian Virtual University, 2010.

[10] D. Paul, R. Parekh, Automated speech recognition of isolated words using neural networks, International Journal of Engineering Science and Technology (IJEST), 3(6), 2011, 4993-5000.

[10] J. Allen, M. S. Hunnicutt, D. Klatt, From Text to Speech: The MITalk System, Cambridge University Press, 1987.

[11] P. Rubin, T. Baer, P. Mermelstein, An articulatory synthesizer for perceptual research, Journal of the Acoustical Society of America, 70(2), 1981, 321-328.

[12] D. Jurafsky, J. H. Martin, Speech and Language Processing, Pearson International Edition, ISBN 978-0-13-504196-3, 2009, Chapter 24.

[13] http://www.microsoft.com/en-us/download/details.aspx?id=10121

[14] http://tcts.fpms.ac.be/synthesis/mbrola.html

[15] http://www.annyangjs.com/

[16] O. Al-Dakkak, N. Ghneim, M. Abou Zliekha, S. Al-Moubayed, Emotion inclusion in an Arabic text-to-speech, Proceedings of the 13th European Signal Processing Conference (EUSIPCO 2005), Antalya, Turkey, 4-8 September 2005.

Authors' information


Oumayma AL DAKKAK is a Research Director at the Higher Institute of Applied Sciences and Technology (HIAST), a tutor at the Syrian Virtual University, and a lecturer at the Information Technology Engineering Faculty, Damascus University. She holds a Doctor of Philosophy degree in Electronic Systems from the Institut de la Communication Parlée - Institut National Polytechnique de Grenoble (ICP-INPG), France. Email: [email protected]

Nada GHNEIM is a tutor at the Syrian Virtual University and a lecturer at the Information Technology Engineering Faculty, Damascus University. She holds a Doctor of Philosophy degree in Language Sciences (Speech Communication) from the Institut de la Communication Parlée, Stendhal University, Grenoble, France. Email: [email protected]

Issam SALMAN is a Web Science master's student. He holds a degree in Information Systems Engineering from the Syrian Virtual University. Email: [email protected]
