
Sensing in Robotic Control

The Robot Musician 'WABOT-2' (WAseda roBOT-2)

Research Group, Waseda University

Ichiro Kato *, Sadamu Ohteru **, Katsuhiko Shirai ***, Seinosuke Narita ***, Shigeki Sugano *, Toshiaki Matsushima **, Tetsunori Kobayashi † and Eizo Fujisawa ††

* Department of Mechanical Engineering

** Department of Applied Physics

*** Department of Electrical Engineering

† Department of Electrical Engineering, Hosei University

†† Information and Communication Systems Laboratory, Toshiba Corp.

Robotics 3 (1987) 143-155, North-Holland

The WABOT-2 is an anthropomorphic robot that plays keyboard instruments, developed by the study group of Waseda University's School of Science and Engineering.

The WABOT-2 is equipped with hands that tap softly on the keys, legs that handle the bass keys and the expression pedal, eyes that read a score, and a mouth and ears to converse with humans. Based on the WABOT-2, the WASUBOT has been developed by Sumitomo Electric Industries Ltd.; its artistic skill was demonstrated in musical performances at the Japanese Government Pavilion at Expo '85.

The present paper summarizes the WABOT-2's motion, visual and vocal subsystems, as well as its supervisory system and singing voice-tracking subsystem.

Keywords: Anthropomorphic robot, Autonomous robot, Multiple degrees of freedom, Dexterity, CCD camera, High-speed image processing, Speech recognition, Speech synthesis.

Ichiro Kato received the B.S. degree in Electrical Engineering from Waseda University in 1950. He is currently a professor at the Department of Mechanical Engineering in Waseda University. Since 1984, he has been Dean of the School of Science and Engineering, Waseda University.

Prof. Kato was Chairman of the Robotics Society of Japan from 1985 to 1986. His interests are bio-mechatronics, including anthropomorphic robots.


Sadamu Ohteru was born in Kyoto, Japan, on May 15, 1921. He received the B.S. degree from Waseda University in 1946, and the Eng.D. degree from Tokyo University in 1958. Since 1964 he has been a Professor at the Department of Applied Physics, Waseda University. He has been actively involved in nonlinear magnetics, biological memory, and stochastic computing research over the past thirty years. His present interests include image processing and its applications.

Katsuhiko Shirai received the Ph.D. degree in electrical engineering from Waseda University in 1968. He is currently a professor at the Department of Electrical Engineering in Waseda University. He is interested in various problems of the man-machine interface, such as speech processing, natural language processing, and computer-aided learning systems.

Seinosuke Narita received the B.S.E.E., M.S.E.E., and Ph.D. degrees from Waseda University, Tokyo, Japan, in 1960, 1962, and 1966, respectively, and was a Fulbright Graduate Exchange Student in the Department of Electrical Engineering, Purdue University, West Lafayette, IN, from 1962 to 1963.

From 1970 to 1971, he was a Visiting Scholar in the Department of Electrical and Computer Engineering, Clarkson College of Technology, Potsdam, NY. Since 1963 he has been a Faculty Member in the Department of Electrical Engineering, Waseda University, where he is currently a Professor. His research interests include robot control, parallel processing, decentralized control, computer-integrated manufacturing, simulation, distributed computer control systems, and supercomputing.

Dr. Narita was Chairman of the IEEE Computer Society's Tokyo Chapter from 1979 to 1981 and is currently Chairman of the IEEE Chapter Operations Committee in the Tokyo Section.




Shigeki Sugano received the B.S. and M.S. degrees in Mechanical Engineering from Waseda University in 1981 and 1983, respectively. Since 1986, he has been an Assistant at the Department of Mechanical Engineering, Waseda University. His research interests include anthropomorphic manipulators and compliance control.

Toshiaki Matsushima received the B.S. and M.S. degrees in applied physics from Waseda University, Tokyo, Japan, in 1983 and 1985, respectively. Since 1986, he has been an Assistant at the Department of Applied Physics, Waseda University. His research interests are computer vision, music information processing and its applications.

Tetsunori Kobayashi was born in Tokyo, Japan in 1957. He received the B.S., M.S. and Ph.D. degrees from Waseda University, Tokyo, Japan, in 1980, 1982 and 1985, respectively, all in electrical engineering. From 1983 to 1985 he was an Assistant at Waseda University. Since 1985 he has been with Hosei University, where he is currently an Assistant Professor in the Department of Electrical Engineering. His research interests include speech processing and natural language processing.

Eizo Fujisawa received the B.E. and M.E. degrees in electrical engineering from Waseda University in 1984 and 1986, respectively. He joined Toshiba Corporation in 1986, and is now a research specialist in the Information and Communication Systems Laboratory. He is a member of the Institute of Electronics, Information and Communication Engineers of Japan, and also a member of the Information Processing Society of Japan.

Introduction: The WABOT Project

Four laboratories in the School of Science and Engineering of Waseda University gathered to set up the 'Bio-Engineering Group', which commenced the WABOT (WAseda roBOT) project in 1970.

The WABOT-1, developed in 1973, was the first full-scale anthropomorphic robot in the world. It consisted of the limb-control system, the vision system and the conversation system. The WABOT-1 had the ability to communicate with a person in Japanese and to measure the distances and directions to objects using its external receptors: artificial ears and eyes, and an artificial mouth. The WABOT-1 walked on its lower limbs, and gripped and transported objects with its hands, which carried tactile sensors. It was estimated that the WABOT-1 had the same mental faculty as a one-and-a-half-year-old child.

In 1980, the four laboratories gathered again and commenced the WABOT-2 project. Playing a keyboard instrument was set up as the intelligent task which the WABOT-2 aimed to realize, since an artistic activity such as playing a keyboard instrument requires human-like intelligence and dexterity. The WABOT-2 is therefore defined as a 'specialist robot', rather than a versatile robot like the WABOT-1.

The robot musician WABOT-2 can converse with a person, read a normal musical score with its eyes, and play an average tune on an electronic organ. The WABOT-2 is also able to accompany a person while it listens to that person singing.

The WABOT-2 is composed of the following five subsystems:

1. Limb Control System


This system consists of anthropomorphic limbs and their control system. When 'score' information is input, the movements of the four limbs are determined automatically, and the fingers of both hands strike the upper and lower keyboards of the electronic organ. The left foot works the bass keyboard while the right foot operates the expression pedal.

2. Vision System


This system is designed to recognize immediately, with a CCD (charge-coupled device) camera, any printed musical score commonly found on the market.

3. Conversation System

This system is designed to enable as natural a conversation as possible between man and the robot. It is divided into two main parts: one for speech recognition and the other for speech synthesis.

4. Singing Voice Tracking System


This system determines the pitch and duration of musical notes from a singing voice, and compares them with the original score. The deviation is passed to the supervisory system, which creates accompaniment in tune with the singer's pitch.

5. Supervisory System

This system performs the interconnections between each of the subsystems, and forms the nucleus that enables them to function as one robot system.

1. Limb Control System

1.1 Introduction

The study of manipulators began in 1967, incorporating the technological assets gained from developing active prostheses three years before.

In 1972, the WAM-4 (Waseda Automatic Manipulator) was developed. The WAM-4 had six degrees of freedom (DOF) in the arm and one DOF in the hand. The right and left hands were both developed as the upper limbs of the intelligent robot WABOT-1. The WAM-4 detected objects using a visual sensor and tactile sensors attached to its fingers, and grasped and transferred objects, or shifted them from one hand to the other, by symmetric bilateral control.

In 1974, a concept for control adaptable to external constraints, named 'Torque-Position Control', was proposed. The development of the WAM-6, with seven DOFs in the arm and two DOFs in the hand, was completed in 1979. In 1980, three-dimensional Torque-Position Control was applied to the WAM-6, making it possible to open a door and to paint a curved surface.

In 1981, developing a dexterous robot that can move quickly, act intelligently, and play a keyboard instrument was set up as a task to examine a robot's faculties (Fig. 1). In 1983, the WAM-7, having seven DOFs in the arm and fourteen DOFs in the fingers, was developed and could play simple tunes. In 1984, four limbs, the WAM-7R (left arm), the WAM-8 (right arm), and the WAM-8L (legs), were successively developed; the four limbs, having a total of fifty DOFs, cooperatively played the electronic organ, tapping fifteen times per second. A software algorithm based on artificial intelligence automatically determined the cooperative movement of the fingers and the arms from musical score information. The WAM-8 consequently had the ability to perform average tunes.

Fig. 1. Construction of WABOT-2.


1.2 Mechanism Parts

1.2.1 Fingers

Since the WABOT-2's design is anthropomorphic (Fig. 2), it has five fingers like a human being (Table 1). The thumb has two DOFs and the other fingers have three DOFs each, totalling fourteen in each hand. Power from DC motors installed inside the body is transmitted through cables and outer tubes to drive the fingers. Each finger is capable of striking 15 keys a second.

1.2.2 Arms

The arms are designed with redundant DOFs, much like a human arm, and are equipped with a total of seven DOFs each. DC motors were selected as the actuators by virtue of their quick response and easy maintenance, and an absolute-type rotary encoder and a DC tacho-generator are used as sensors. The use of carbon-fiber-reinforced plastic as a structural material keeps the arms light. The WAM-8 is a new model whose elbow and shoulder drive actuators have an output about 60% larger than those of the WAM-7R. The maximum movement speed of the WAM-8 wrist is about 1.5 m/sec.

1.2.3 Legs

The right and left legs are designed for expression adjustment and bass keyboard playing, respectively, rather than for locomotion, since the functions necessary for these movements are considered equivalent to those of the arm. Each leg has three DOFs, the minimum necessary for foot positioning, plus one DOF for operating the pedal or bass keyboard (4 DOFs in total).

1.3 Control Parts

1.3.1 Computer System

The WABOT-2 mechanism part has 50 DOFs in total. No other robot contains so many DOFs, and the robot control method is very important for this reason. The microcomputer system of the WABOT-2's limb control system has a three-level hierarchical structure (Fig. 3).

A UNIX system using a 16-bit Z-8001 microcomputer is positioned at the high level; it manages musical score data, decides the movement plans of fingers, arms and legs from the musical score, and calculates the output trajectories for playing.

Fig. 2. WABOT-2 at play.

Table 1. Specifications of WABOT-2.

Shape                 Anthropomorphic

Degrees of freedom    Finger  14 × 2
                      Arm      7 × 2
                      Leg      4 × 2
                      Total   50

Length (mm)           Breadth of the shoulders   460
                      Shoulder-waist             520
                      Leg                        880
                      Stature                   1890

Weight (kg)           Arm and hand      14.5 × 2
                      Leg                8.5 × 2
                      Trunk             34
                      Head (CCD camera)  3
                      Total             83

Structural material   CFRP


Fig. 3. Computer system. (P: position sensor; TG: velocity sensor; M: DC motor; A: servo amplifier.)

At the middle level, two 16-bit Z-8002 microcomputers are positioned, each controlling 25 DOFs using the playing-trajectory data strings sent from the high-level computer. At the lowest level, fifty 8-bit single-chip microcomputers control the 50 DOFs (one per DOF). These microcomputers run the software servo system after interpolating the data.

For programming, the C language is mainly used.
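As a rough sketch of what one lowest-level microcomputer does, the C fragment below interpolates between two trajectory setpoints and runs a simple proportional-derivative software servo. It is only an illustration: the gains, step count and I/O stubs are hypothetical, not taken from the WABOT-2.

    /* Minimal sketch of a one-joint software servo with linear setpoint
     * interpolation, as one lowest-level microcomputer might run it.
     * All gains, rates and I/O stubs are hypothetical. */
    #include <stdio.h>

    #define KP 8.0   /* proportional gain (hypothetical) */
    #define KD 0.5   /* derivative gain (hypothetical) */
    #define STEPS 10 /* servo ticks between two trajectory points */

    static double read_position(void)        { return 0.0; /* encoder stub */ }
    static double read_velocity(void)        { return 0.0; /* tacho stub   */ }
    static void   write_motor(double torque) { printf("torque %.3f\n", torque); }

    /* Interpolate from q0 to q1 and servo the joint along the way. */
    static void track_segment(double q0, double q1)
    {
        int i;
        for (i = 1; i <= STEPS; i++) {
            double qref = q0 + (q1 - q0) * i / STEPS;   /* interpolated setpoint */
            double err  = qref - read_position();
            double u    = KP * err - KD * read_velocity();
            write_motor(u);
        }
    }

    int main(void)
    {
        /* Two consecutive points of a playing trajectory (hypothetical). */
        track_segment(0.00, 0.25);
        track_segment(0.25, 0.10);
        return 0;
    }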

1.3.2 Automatic Path Planning

In playing an instrument, human beings judge the optimum finger and/or arm motions from the score. It is necessary for this robot, too, to possess the capability of producing finger and arm motion plans from the score information input.

In the algorithm, two evaluation functions are introduced. One determines the fingering so that rapid arm motions are avoided as much as possible, since the motions of the robot arms are much more constrained than those of the fingers. The other adjusts the finger-arm coordination in the operation plan: an operation plan is selected in which movement of the arm without coordination with the fingers is avoided, and coordination with the fingers is attained whenever the arm moves.

With this algorithm, it becomes possible to automatically produce a smooth finger and arm motion trajectory plan that minimizes arm movements for a given string of musical notes.
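The paper does not spell out these evaluation functions, but a cost-minimizing search of this flavor can be sketched as dynamic programming over hand positions: each candidate hand position for a note is scored by how far the arm must move to reach it, and the cheapest path through the note string is kept. Everything below, the five-key hand span and the unit movement costs, is an invented toy model, not the WABOT-2 algorithm.

    /* Toy dynamic program: choose a hand position for each note so that
     * arm (hand) movement is minimized while every note stays under some
     * finger.  The hand spans 5 consecutive keys; all costs are invented. */
    #include <stdio.h>

    #define NKEYS 32            /* toy keyboard size */
    #define SPAN   5            /* keys reachable without moving the arm */
    #define BIG    1000000

    int main(void)
    {
        int melody[] = { 7, 9, 11, 12, 16, 14, 12, 7 };  /* key numbers */
        int n = sizeof melody / sizeof melody[0];
        static int cost[8][NKEYS], from[8][NKEYS];
        int i, p, q;

        for (p = 0; p < NKEYS; p++)                       /* first note */
            cost[0][p] = (melody[0] >= p && melody[0] < p + SPAN) ? 0 : BIG;

        for (i = 1; i < n; i++)
            for (p = 0; p < NKEYS; p++) {
                cost[i][p] = BIG;
                if (melody[i] < p || melody[i] >= p + SPAN) continue;
                for (q = 0; q < NKEYS; q++) {             /* previous position */
                    int move = p > q ? p - q : q - p;     /* arm-motion penalty */
                    if (cost[i-1][q] + move < cost[i][p]) {
                        cost[i][p] = cost[i-1][q] + move;
                        from[i][p] = q;
                    }
                }
            }

        /* Recover and print the cheapest hand-position sequence. */
        {
            int best = 0, path[8];
            for (p = 1; p < NKEYS; p++)
                if (cost[n-1][p] < cost[n-1][best]) best = p;
            for (i = n - 1; i >= 0; i--) { path[i] = best; best = from[i][best]; }
            for (i = 0; i < n; i++)
                printf("note %2d: hand at %2d, finger %d\n",
                       melody[i], path[i], melody[i] - path[i] + 1);
        }
        return 0;
    }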

2. Vision System

2.1 Introduction

About ten years ago, the Ohteru Laboratory developed the vision system of the WABOT-1. It aimed at developing eyes for a robot that walks on two legs. The vision system caught sight of an object, and indicated to the robot the object's direction and distance.

Since then, our laboratory has been active in many fields of vision system research: the eye of a robot which has optical sensors in its hand, a high-speed parallel pattern recognition system which automatically traces the center of gravity of moving objects, automatic focusing, image restoration utilizing learning machine techniques, an automated fingerprint classifier, a panoramic video camera with a 360 degree view, an automated tablet inspection system, etc.

We started the development of an automated recognition system for printed music about 5 years ago. At that time, much computer-controlled equipment, for example synthesizers, began to be used to play music. However, the music information for playing was still input manually. An automatic recognition technique for music scores would be attractive in this field, and at the same time we were informed that it would also be useful for translation from printed music to braille. Under these circumstances, we decided to develop a machine that automatically reads printed music through a CCD camera, understands its meaning, and outputs the information necessary for playing.

Several automated recognition systems for printed music have been reported. However, none of these systems are suitable for the vision of music-playing robots, where real-time recognition is required under an extremely poor data environment; a video camera is placed in the robot's head, and music sheets are often curved or otherwise deformed on the music stand.

Fig. 4. Hardware block diagram.

In order to convey the vision system output directly to the robot's playing subsystem as music information, grammatical and harmonic contradictions in the pattern discrimination results are checked using a knowledge base of music notation.

The WABOT-2 vision system, which can recognize in real time not only commercially available printed music scores but also instant-lettering scores, uses an extraction algorithm suited to the structure of music scores, together with special hardware and knowledge-based musical comprehension software. The resulting robot music vision is able to recognize one sheet of commercially available printed nursery-song music, scored in three parts for an electric piano, in 15 seconds with approximately 100% accuracy.

2.2 System Architecture

The hardware block diagram is illustrated in Fig. 4. It comprises the image scanning section with a CCD camera, the frame memory (FM) section, the specialized hardware section (consisting of six parts: a frame memory controller (FMC), a staff detector, three head detectors for the three types of note heads, and a bar detector), and the CPU board section, including a main CPU board, a front-end CPU board, three discrimination CPU boards, and two terminals. These are connected via three different data buses according to the type, speed and quantity of information transferred within the system.

A mechanically scanning CCD camera scans the image at a rate of 2 seconds per frame, with a resolution of 2000 × 3000 pixels at an 8-bit gray level. The scanned image data are stored in the 6 Mbit FM under the control of the FMC, which performs shading correction over the whole viewing area and simultaneously converts the scanned image into a binary image by thresholding.
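The FMC does this in hardware; as a software illustration, the sketch below flat-fields a tiny image against a white-reference frame and then thresholds it. The correction formula, the threshold value and the synthetic data are assumptions, not the FMC's actual design.

    /* Sketch: shading correction then binarization, as the FMC performs in
     * hardware.  The flat-field formula and threshold value are assumptions. */
    #include <stdio.h>

    #define W 8
    #define H 4

    int main(void)
    {
        /* Tiny stand-ins for the scanned image and a white-reference frame
         * capturing the uneven illumination (both hypothetical). */
        unsigned char img[H][W], white[H][W], bin[H][W];
        int x, y;

        for (y = 0; y < H; y++)
            for (x = 0; x < W; x++) {
                white[y][x] = (unsigned char)(150 + 10 * x);  /* brighter to the right */
                img[y][x]   = (unsigned char)((x + y) % 3 ? white[y][x] - 5 : 40);
            }

        for (y = 0; y < H; y++) {
            for (x = 0; x < W; x++) {
                /* Normalize by the reference, then threshold to ink/paper. */
                int corrected = 255 * img[y][x] / white[y][x];
                bin[y][x] = corrected < 128;          /* 1 = ink (dark) */
                putchar(bin[y][x] ? '#' : '.');
            }
            putchar('\n');
        }
        return 0;
    }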

Note heads and bars are detected from the normalized image data, which are transferred by the FMC at a rate of 333 nsec/pixel. Note heads exist only on or between staff lines. Taking advantage of this location restriction, note heads are detected by a correlation method in a short time and with high accuracy. Eight standard patterns are prepared for the various note head shapes, and nine pattern sizes for each category can be selected for the normalization adjustment. Considering that a note head may be slightly shifted from its position, or that a staff is often distorted, an AND operation is used for the correlation instead of an EX-NOR operation.
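The choice of AND over EX-NOR matters because EX-NOR also rewards matching background pixels, so a slightly shifted head or a distorted staff is penalized twice; AND counts only overlapping ink. A toy illustration (the template, image and threshold are invented):

    /* Toy binary template correlation: score = number of ink pixels where
     * template AND image agree.  Template, image and threshold are invented. */
    #include <stdio.h>

    #define TW 4
    #define TH 4
    #define IW 12
    #define IH 6

    static const unsigned char tmpl[TH][TW] = {   /* round note head */
        {0,1,1,0}, {1,1,1,1}, {1,1,1,1}, {0,1,1,0}
    };

    int main(void)
    {
        unsigned char img[IH][IW] = {{0}};
        int x, y, u, v;

        /* Paste one head at (5,1); nearby search positions overlap it partially. */
        for (v = 0; v < TH; v++)
            for (u = 0; u < TW; u++)
                img[1 + v][5 + u] = tmpl[v][u];

        for (y = 0; y + TH <= IH; y++)
            for (x = 0; x + TW <= IW; x++) {
                int score = 0;
                for (v = 0; v < TH; v++)
                    for (u = 0; u < TW; u++)
                        score += tmpl[v][u] & img[y + v][x + u];  /* AND, not XNOR */
                if (score >= 10)                   /* tolerant threshold */
                    printf("head candidate at (%d,%d), score %d\n", x, y, score);
            }
        return 0;
    }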

Fig. 5. Recognition sequence.

The main CPU handles communications with the robot management block, controls the vision system, and analyzes the output from the discrimination CPUs. The front-end CPU controls the specialized hardware and the preprocessing, such as image scanning. The five CPU boards are connected by the multibus and work simultaneously. Two terminals are connected to the main and front-end CPU boards by serial buses. The terminals are used for development and maintenance of software, and for monitoring scanned image data while the vision system is working.

2.3 Recognition Sequence

Fig. 5 shows the recognition sequence. The scanned music score image is converted into binary image data by thresholding and stored in the frame memory. At the same time, the staffs are detected. The location and size of the staffs are important information, not only to determine the score geometry but also to restrict the search area in pattern discrimination for high-speed note processing. Therefore, the staff detection, carried out first, must be accurate. Staff parameters, which consist of information on staff locations, and normalization parameters, which consist of the staff inclinations, the staff area, and the head size, are calculated.
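The staff detector is dedicated hardware, and the paper does not describe its method. One common software approach, shown here purely as an assumption, finds staff lines as image rows whose horizontal projection (ink count) is large:

    /* Sketch of staff-line detection by horizontal projection: rows whose
     * ink count exceeds a threshold are staff-line candidates.  The WABOT-2
     * uses a hardware staff detector; this software method is an assumption. */
    #include <stdio.h>

    #define W 20
    #define H 16

    int main(void)
    {
        unsigned char img[H][W] = {{0}};
        int x, y;

        /* Synthesize five staff lines, two rows apart, starting at row 3. */
        for (y = 3; y <= 11; y += 2)
            for (x = 0; x < W; x++)
                img[y][x] = 1;

        for (y = 0; y < H; y++) {
            int ink = 0;
            for (x = 0; x < W; x++)
                ink += img[y][x];
            if (ink > W * 3 / 4)          /* long horizontal run => staff line */
                printf("staff line at row %d (ink %d)\n", y, ink);
        }
        return 0;
    }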

The image is transferred to the discrimination CPUs while being normalized. At the same time, note heads and bars are detected, and the results are also transferred to the discrimination CPUs together with the staff parameters. Each discrimination CPU detects attribute symbols, such as dots and hooks, using the transferred data. Finally, the discriminated symbols are analyzed using musical syntax to generate the recognition output as music information for playing.

3. Speech I/O System

3.1 Introduction

Since the development of the speech I/O system of the WABOT-1 in 1973, speech research on a wide variety of problems, including the speech production process, synthesis, and recognition, has been carried out in our laboratory. The speech I/O system of the WABOT-2 has been developed on the basis of this continuing research. Various types of speech I/O systems can be considered according to the objectives. The objective of our system is to extract efficiently the necessary information from natural conversation between human beings and the robot. In order to achieve this aim, the following problems should be solved:

1. how to realize appropriate conversation,

2. how to be flexible, allowing sentences to be spoken in various ways,

3. how to synthesize arbitrary sentences,

4. how to perform sufficient real-time processing for smooth conversation.

The development of recent LSI technology enables a great deal of signal processing and analysis, such as LPC coefficient calculation. It is still difficult, however, to organize a total speech conversation system which operates in real time with high performance. We adopted three kinds of microprocessors, one for each kind of processing, to realize the above abilities.

3.2 Organization of the System

The system is composed of six DSPs (µPD7720), three SRPs (Speech Recognition Processors: DP processors, µPD7764) and six multi-microprocessors (8086). An eight-stage pipelined processing scheme is carried out, giving a speed of 40 MIPS. Fig. 6 illustrates the hardware organization of this system.

3.3 Conversation Control

A robot system must receive information about each situation in order to behave intelligently. The system must have sufficient knowledge to anticipate the conversation which follows. Here, this knowledge is described by a state transition network which controls the robot's internal state according to the speech input. Fig. 7 shows a fragment of the state transition network. Using this method, it is easy to keep the conversation on track, and it also allows recovery from faults occurring in the speech recognition part.
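In the spirit of Fig. 7, such a network can be sketched as a table of transition rules. The states, recognized sentences and replies below are paraphrased and heavily simplified; the real network is larger and is driven by the recognizer's output, not by literal strings.

    /* Sketch of conversation control as a state transition table, in the
     * spirit of Fig. 7.  States, inputs and outputs are paraphrased and
     * simplified; the real network is larger. */
    #include <stdio.h>
    #include <string.h>

    enum state { WAITING, SELECTION, PERFORMANCE, FINISHED };

    struct rule {
        enum state from;
        const char *input;      /* recognized sentence (SI) */
        const char *output;     /* synthesized reply (SO)   */
        enum state to;
    };

    static const struct rule rules[] = {
        { WAITING,     "Play for me",       "What do I play for you?", SELECTION   },
        { SELECTION,   "What can you play", "(announce repertory)",    SELECTION   },
        { SELECTION,   "Play Arabesque",    "I play Arabesque",        PERFORMANCE },
        { PERFORMANCE, "(end of tune)",     "(none)",                  FINISHED    },
        { FINISHED,    "Once more",         "I play Arabesque",        PERFORMANCE },
    };

    int main(void)
    {
        enum state s = WAITING;
        const char *dialog[] = { "Play for me", "Play Arabesque", "(end of tune)" };
        size_t i, r;

        for (i = 0; i < sizeof dialog / sizeof dialog[0]; i++)
            for (r = 0; r < sizeof rules / sizeof rules[0]; r++)
                if (rules[r].from == s && strcmp(rules[r].input, dialog[i]) == 0) {
                    printf("SI: %-16s SO: %s\n", dialog[i], rules[r].output);
                    s = rules[r].to;
                    break;
                }
        return 0;
    }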

3.4 Sentence Understanding

In the sentence understanding part, we aimed to allow the speaker flexible speech expression. Consider, as an example, ordering the robot to play the program 'Arabesque'.

Fig. 6. Hardware organization of WABOT-2/CS (speech recognizer, speech synthesizer and conversation controller, connected to the system controller).

Fig. 7. State transition network for conversation control. (SI: speech input; SO: speech output; CO: command output to the system controller.)


Fig. 8. Acceptable sentences network in program selection mode.

If the only acceptable sentence were 'Play Arabesque' and the system did not allow other sentences, the restriction on users would be severe and the system would be hard to use, because one would have to remember all of the commands and procedures. The system must therefore understand the speaker's request in various kinds of expression, like 'Please play', 'Would you play?', or only 'Arabesque'. It is very difficult to realize such recognition with usual word recognition systems. In our system, input speech is predicted and the next network of sentences with phoneme nodes (Fig. 8) is obtained, as follows. First, phoneme recognition is performed and phoneme lattices are obtained. Next, using dynamic programming matching, the uttered sentence is estimated by comparing the phoneme lattices of the input sentence with the paths (sentences) in the network. Besides removing irrelevant responses and processing at high speed, this method appropriately prepares the network expressing the acceptable sentences for each conversational state. The phoneme recognition method in this system adopts a combination technique based on LPC cepstrum coefficients in units of VCV syllables and Bayes decision theory using parameters mapped into the articulatory domain. The parameters in the articulatory domain are effective in treating the phenomenon of co-articulation.
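The matching step can be pictured with a much-simplified sketch: score a noisy recognized phoneme string against each candidate sentence by DP (edit distance) and keep the best. The real system matches phoneme lattices against paths in the sentence network; the candidate strings below are invented romanized stand-ins.

    /* Sketch of DP matching: score a (noisy) recognized phoneme string
     * against each acceptable sentence by edit distance and pick the best.
     * The real WABOT-2 matching works on phoneme lattices and a sentence
     * network; the strings and candidates here are invented. */
    #include <stdio.h>
    #include <string.h>

    static int edit_distance(const char *a, const char *b)
    {
        int la = (int)strlen(a), lb = (int)strlen(b);
        int d[32][32], i, j;                 /* assumes strings < 32 chars */

        for (i = 0; i <= la; i++) d[i][0] = i;
        for (j = 0; j <= lb; j++) d[0][j] = j;
        for (i = 1; i <= la; i++)
            for (j = 1; j <= lb; j++) {
                int sub = d[i-1][j-1] + (a[i-1] != b[j-1]);
                int del = d[i-1][j] + 1, ins = d[i][j-1] + 1;
                d[i][j] = sub < del ? (sub < ins ? sub : ins)
                                    : (del < ins ? del : ins);
            }
        return d[la][lb];
    }

    int main(void)
    {
        /* Candidate paths through the sentence network, as phoneme strings. */
        const char *paths[] = { "pureipurogu", "arabesuku", "mouichido" };
        const char *heard   = "arabezuku";    /* noisy recognition result */
        int i, best = 0;

        for (i = 1; i < 3; i++)
            if (edit_distance(heard, paths[i]) < edit_distance(heard, paths[best]))
                best = i;
        printf("best match: %s\n", paths[best]);
        return 0;
    }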

3.5 The Speech Synthesis Part

3.5.1 Outline of the Speech Synthesis Part

In this system, speech synthesis by rule, which enables speech output of various response sentences, is adopted. For the fundamental speech data, CV-concatenation units (C = consonant, V = vowel) are used. The specifications of the system are shown in Table 2.

3.5.2 The Synthesis Method

The procedure for creating response sentences is shown in Fig. 9. The system receives a code for the sentence from the sentence recognition part, and creates the response sentence by retrieving the word dictionary. This dictionary stores the names of the CV syllabic units necessary for the words, the vowel durations of each unit, and the accent patterns. On the basis of this information, the CV units are controlled and the response sentences are synthesized.
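The dictionary-driven flow can be sketched as follows; every entry (the word names, CV unit names, durations and accent patterns) is invented for illustration and does not reproduce the actual dictionaries.

    /* Sketch of the dictionary-driven synthesis flow: a sentence code is
     * expanded to words, each word to CV syllabic unit names with vowel
     * durations and an accent pattern.  All entries are invented. */
    #include <stdio.h>

    struct word {
        const char *name;
        const char *units[4];   /* CV unit names, NULL-terminated */
        int vowel_ms[4];        /* vowel duration per unit (ms)   */
        const char *accent;     /* accent pattern, e.g. "LH"      */
    };

    static const struct word dict[] = {
        { "wata", { "wa", "ta", NULL },  { 90, 80, 0 },  "LH" },
        { "shi",  { "shi", NULL },       { 100, 0 },     "H"  },
        { "hiki", { "hi", "ki", NULL },  { 70, 90, 0 },  "LH" },
        { "masu", { "ma", "su", NULL },  { 80, 120, 0 }, "HL" },
    };

    /* Sentence dictionary: sentence code -> word indices (-1 terminated). */
    static const int sentence7[] = { 0, 1, 2, 3, -1 };   /* "I play ..." */

    int main(void)
    {
        int w, u;
        for (w = 0; sentence7[w] >= 0; w++) {
            const struct word *wd = &dict[sentence7[w]];
            printf("word %s (accent %s):", wd->name, wd->accent);
            for (u = 0; wd->units[u]; u++)
                printf(" %s/%dms", wd->units[u], wd->vowel_ms[u]);
            putchar('\n');
        }
        return 0;
    }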

3.5.3 Improvement in Quality of Synthesized Speech

Rules to concatenate the speech units and to generate the prosodic patterns are necessary for speech synthesis by rule. Some devices in our system are described below.

One is that residual waves are used as the excitation waves in order to improve the clarity of each unit. Almost the same speech waves as the original ones can be obtained by using the residual waves.

Table 2. Specifications of the synthesis system.

Synthesis parameter        PARCOR coefficients, 12th order, 8-bit quantization
D/A converter              12.5 kHz, 12 bit
Frame length               128 points (10.24 ms)
Number of syllabic units   CV type 110, VV type 30, V type 5


Fig. 9. Process of speech synthesis. (Sentence code → sentence dictionary → word dictionary → parameter interpolation and F0 contour generation → PARCOR filter → D/A converter → low-pass filter. The sentence dictionary holds the number of words in a sentence and the word codes; the word dictionary holds pitch patterns, the number and names of CV units, phoneme durations and interpolation frame counts; the CV syllabic units supply k-parameters and residual waves.)

However, if all residual waves were stored, a large storage area would be necessary. Therefore, the residual waves are used directly only for the consonant part and the transition from the consonant to the vowel, because the intelligibility of the consonant and the smoothness of the transition to the vowel delicately affect the quality of the synthesized speech. For the vowel, a part of the residual waves in the steady portion of the vowel is used as the excitation waves, in order to control the pitch frequency easily.


Next, we propose a new interpolation method to concatenate the CV syllabic units smoothly. Natural synthesized speech cannot be obtained by merely connecting units that were uttered separately. Therefore, an interpolation which can move the formant frequencies the way they move in the original speech, using the sensitivity of the formant frequencies to the k-parameters, is adopted.

The important factors in improving the quality of synthesized speech are not only the intelligibility of the speech units but also the control of the prosodic patterns. Therefore, the prosodic data stored in the word dictionary are generated on the basis of analysis of real speech, so that the synthesized speech resembles human speech.

4. Singing Voice Tracking System and Supervisory System

4.1 Singing Voice Tracking System

This function is intended to generate notation codes from a singing voice, compare them with the actual score to detect deviations in the singing, and give the supervisory system a transposition command to follow the singing. The process of singing follow-up is shown in Fig. 10.

Fig. 10. Process of singing follow-up. (Microphone and amplifier → band-pass filter → detection of fundamental frequency → median filter → coding → matching against the musical score → detection of deviation → generation of transposing command → output.)

First, sudden noise is eliminated from the detected frequencies by median filters, and notation codes are generated. The interval code, one of the notation codes, is generated from the pitch, taken logarithmically, and the integrated value of the amplitude. The duration code must be generated more carefully: pitch fluctuations make it difficult to identify a note as short as a thirty-second note reliably, so the system does not emit duration codes directly. Instead, the number of identical interval codes detected consecutively is used as the sound duration code.

To prevent the resulting notation codes from being subdivided, a sound-duration criterion for normal notes is established, so that any note below the criterion is merged into a normal note.
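Combining these steps, the sketch below median-filters a pitch track, quantizes the log pitch into a semitone interval code, and run-length encodes consecutive identical codes into duration codes, discarding runs below the criterion. The frame rate, reference pitch, criterion and test data are all invented:

    /* Sketch of notation coding from a pitch track: median-filter the
     * pitch, quantize the log-pitch to a semitone interval code, and
     * run-length encode it into duration codes, merging runs that are too
     * short.  Sample rates, thresholds and the test data are invented. */
    #include <stdio.h>
    #include <math.h>

    #define N 16
    #define MIN_RUN 2          /* duration criterion: shorter runs merge */

    static double median3(double a, double b, double c)
    {
        if ((a <= b && b <= c) || (c <= b && b <= a)) return b;
        if ((b <= a && a <= c) || (c <= a && a <= b)) return a;
        return c;
    }

    int main(void)
    {
        /* Detected fundamental frequencies (Hz), with one noise spike. */
        double f0[N] = { 262, 262, 262, 262, 523, 294, 294, 294,
                         294, 294, 330, 330, 330, 330, 330, 330 };
        int code[N], i, run = 1;

        for (i = 0; i < N; i++) {
            double a = f0[i > 0 ? i - 1 : 0], c = f0[i < N - 1 ? i + 1 : N - 1];
            double f = median3(a, f0[i], c);                /* kill sudden spikes */
            code[i] = (int)lround(12.0 * log2(f / 261.63)); /* semitones from C4 */
        }

        for (i = 1; i <= N; i++) {
            if (i < N && code[i] == code[i - 1]) { run++; continue; }
            if (run >= MIN_RUN)                  /* drop sub-criterion blips */
                printf("interval %+d, duration %d frames\n", code[i - 1], run);
            run = 1;
        }
        return 0;
    }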

The difference between the detected interval code and the interval code from the score indicates a deviation in singing. Note that in order to obtain the deviations correctly, the detected notation codes must be exactly associated with the registered notation codes. Notation coding is repeated at a constant frequency determined by the melody speed, so it is possible to locate any position in the melody being played by relating the total coding count from the beginning of the singing to the registered sound duration codes. But the resulting notation codes include codes that are short and unreliable, and comparisons between them may occur too frequently; this increases the processing time and often causes fluctuations in the deviations. Therefore, the notation codes to be compared are limited to those whose overlapping sound length exceeds a given length of time.
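A much-simplified picture of that comparison: walk the detected and registered code strings in parallel by cumulative frame count, and report a deviation only where the overlap of a detected code with a registered code is long enough. The data and the overlap threshold are invented:

    /* Sketch of deviation detection: align detected interval/duration codes
     * with the registered score codes by cumulative frame count, and report
     * the pitch deviation only where the overlap is long enough.  Data and
     * the overlap threshold are invented. */
    #include <stdio.h>

    struct code { int interval; int frames; };

    #define MIN_OVERLAP 3

    int main(void)
    {
        /* Registered score and codes detected from the singing (sharp by 1). */
        struct code score[] = { {0, 4}, {2, 6}, {4, 6} };
        struct code sung[]  = { {1, 5}, {3, 5}, {5, 6} };
        int ns = 3, i = 0, j = 0, ts = 0, tu = 0;

        while (i < ns && j < ns) {
            int end_s = ts + score[i].frames, end_u = tu + sung[j].frames;
            int lo = ts > tu ? ts : tu;
            int hi = end_s < end_u ? end_s : end_u;
            if (hi - lo >= MIN_OVERLAP)        /* ignore short, unreliable overlaps */
                printf("score note %d: deviation %+d semitones (overlap %d)\n",
                       i, sung[j].interval - score[i].interval, hi - lo);
            if (end_s <= end_u) { ts = end_s; i++; }   /* advance the earlier end */
            else                { tu = end_u; j++; }
        }
        return 0;
    }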

Fig. 11. System configuration of WABOT-2. (The supervisory system interconnects the limb control system, the vision system, the conversation system, with its speech synthesis and speech recognition parts, and the singing voice tracking system.)

4.2 Supervisory System

The functions and controls of each subsystem must be examined to prevent problems concerning command details, musical score information, and the state transitions of the robot. The robot must be seen not as a mere group of subsystems but as a single system (Fig. 11), and the robot's control method must be worked out after fully considering the robot's relationship with its external environment.

In an attempt to realize successful control of the subsystems, a 'mimic' WABOT-2, capable of standing in for the operation of each subsystem, was constructed prior to the completion of the actual subsystems. Problems such as communication commands, the coding of musical notes, and the diagnostics for subsystem malfunctions were investigated with it.

The state transitions of the robot are divided into categories according to which subsystems are actively operating. The processing is represented by rules, as partial and independent pieces of knowledge, so that the functions of the robot can easily be expanded as the occasion demands.