Assessing the Applicability of Surface EMG to Tongue Gesture Detection

João Freitas 1,2, Samuel Silva 2, António Teixeira 2, Miguel Sales Dias 1,3

1 Microsoft Language Development Center, Lisboa, Portugal
2 Dep. Electronics Telecommunications & Informatics/IEETA, University of Aveiro, Portugal
3 Instituto Universitário de Lisboa (ISCTE-IUL), ISTAR-IUL, Lisboa, Portugal

[email protected], {sss,ajst}@ua.pt, [email protected]

Abstract. The most promising approaches for surface Electromyography (EMG) based speech interfaces commonly focus on the tongue muscles. Despite interesting results in small-vocabulary tasks, it is still unclear which articulatory gestures these sensors are actually detecting. To address this question, we propose a novel method, based on the synchronous acquisition of surface EMG and Ultrasound Imaging (US) of the tongue, to assess the applicability of EMG to tongue gesture detection. In this context, the US image sequences allow us to gather data on tongue movement over time, providing the grounds for the EMG analysis. Using this multimodal setup, we recorded a corpus that covers several tongue transitions (e.g. back to front) in different contexts. Considering the annotated tongue movement data, the results of the EMG analysis show that tongue transitions can be detected using the EMG sensors, with some variability in terms of sensor positioning and across speakers, and with the possibility of high false-positive rates.

Keywords: tongue gestures, surface electromyography, ultrasound imaging, synchronization, silent speech interfaces.

1 Introduction

Automatic Speech Recognition (ASR) based on the acoustic signal alone has several disadvantages, such as performance degradation in the presence of environmental noise. Moreover, an ASR interface that relies only on the acoustic signal is inappropriate for privacy and non-disturbance scenarios and, importantly, inadequate for users with speech impairments. These limitations of conventional ASR motivated the Silent Speech Interface (SSI) concept: a system that allows speech communication in the absence of an acoustic signal [1]. One of the most promising technologies for implementing an SSI is surface Electromyography (EMG), which, according to previous studies, has achieved good performance rates [2]. These studies use EMG electrodes positioned on the facial muscles responsible for moving the articulators during speech, including electrodes in the upper neck area to capture possible tongue movements. In these approaches, features extracted from the EMG signals are applied directly to the classification problem, in order to distinguish between different speech units (i.e. words or phonemes). However, it remains unclear which tongue movements are actually being detected, little information is available on how different tongue movements behave during speech, and we do not know whether these movements can be correctly identified using surface EMG.

To better understand the capabilities of surface EMG, we need reliable data about the movements of the articulators, particularly the tongue, which is one of the main articulators in human speech production and, for some, the most important of them [3]. Several technologies allow obtaining information about tongue movement during speech (e.g. Real-Time Magnetic Resonance Imaging (RT-MRI) [4], Ultrasound (US) [5], among others [6]), and this information can provide the grounds for EMG analysis. After weighing the pros and cons of several modalities, we decided to use US imaging to determine whether surface EMG sensors positioned in the neck region are able to detect tongue movements. By synchronously combining US with the myoelectric signal of the tongue, we can accurately identify tongue movements and develop more informed EMG classifiers that could be used to complement a multimodal SSI with an articulatory basis. This would provide information about an articulator that is normally hidden by the lips, in the case of a visual approach, or very difficult to extract, as in the case of Ultrasonic Doppler sensing [7]. Ideally, one could consider a neckband that detects tongue gestures, integrated with other modalities such as video, all contributing to a less invasive approach to the characterization of speech production.

The rest of this paper is organized as follows: Section 2 presents brief background on the tongue muscles and a summary of related work on EMG-based SSI and tongue-related studies. Section 3 describes the methodology of this study, namely the corpus, the acquisition system, the applied synchronization solution and the US annotation process. Section 4 reports the first results on tongue movement detection. Finally, Section 5 ends the paper with some concluding remarks and future work.

2 Background and Related Work

2.1 Tongue Muscles

The tongue muscles are classified either as intrinsic muscles, which are responsible for changing the shape of the tongue, or as extrinsic muscles, which change the position of the tongue in the mouth and also shape the tongue to some extent [8]. The intrinsic muscles are the superior and inferior longitudinal, the transverse and the vertical muscles. The extrinsic muscles are the Genioglossus, Hyoglossus, Palatoglossus and Styloglossus. Working together, these muscles create several types of tongue movement, such as tongue tip elevation, depression and deviation to the left and right, relaxation of the lateral margins, central tongue grooving, tongue narrowing, protrusion, retraction, posterior elevation and body depression. More details about the function and exact location of these muscles can be found in [3, 8].

2.2 Related Work

As mentioned above, surface EMG is a common approach in SSI. In order to capture tongue information, some of the most important positions for placing the EMG electrodes are the neck area and the region beneath the chin [2, 9]. Also in the area of SSI, several approaches combine US with video to obtain a more complete representation of the human speech production process [10]. Other approaches able to estimate tongue movements, also used in SSI, include Electromagnetic Articulography (EMA), in which magnets glued to the tongue are tracked by magnetic sensors [11].

In the field of phonetics, studies using intra-oral EMG electrodes attached to the tongue have provided valuable information about the temporal and spatial organization of speech gestures. These studies have analyzed different cases, such as vowel articulation [12] or defective speech gestures, as in the case of aphasia [13]. However, although an electrode placed directly on the articulator yields more accurate information, avoiding some of the muscle cross-talk and superposition, it would be inadequate and unusable for a natural, non-invasive SSI.

There are also studies of the tongue that use other technologies, such as RT-MRI [14], cinefluorography [15], US with a headset that permits natural head movement [5], and Electropalatography (EPG) [6], which provide a very good understanding of tongue shapes and movements during speech.

3 Methods

For this study, we started by synchronously acquiring EMG data along with Ultrasound imaging, a modality able to provide essential information about tongue movement. For that purpose, we created a corpus covering several tongue transitions along the anterior-posterior axis (e.g. front-back and vice-versa), as well as elevation and depression of several parts of the tongue. For future studies, as depicted in Fig. 1, we also captured three additional modalities: Video, Depth information and Ultrasonic Doppler sensing. After acquiring the data and synchronizing all data streams, we determined and characterized the segments that contain tongue movement, based on the US data.

Fig. 1. Recording session and the respective acquisition setup.

3.1 Corpus

To define the corpus for our experiment we considered the following goals: (1) record tongue position transitions; and (2) record sequences where the movement of articulators other than the tongue (lips, mandible and velum) is minimized. Considering these goals, we selected several /vCv/ contexts, varying the backness and closeness of the vowels, and tongue transitions between vowels using /V1V2V1/ vowel sequences, as presented in Table 1.

Table 1. List of prompts and their respective contexts.

Context      Prompts (using the SAMPA phonetic alphabet)
/vCv/        [aka, iki, uku, eko, EkO, iku, aLa, eLe, uLu, ata, iti, utu, eto, EtO, itu, eso, isu, EsO]
/V1V2V1/     [iui, ouo, EOE, eoe, iei]

These combinations include tongue transitions in terms of both vowel closeness and backness. Each selected sequence is composed of the transition /V1V2/ and its opposite movement /V2V1/; for example, the prompt [iui] contains the tongue transition from [i] to [u] and the transition from [u] to [i]. In order to minimize movement of articulators other than the tongue, we did not include bilabial and labio-dental consonants. Some exceptions can be found in the corpus for rounded vowels (i.e. vowels that require lip rounding, such as [u]), in order to include in the corpus the tongue positions associated with these sounds.

Based on pilot recordings, we noticed that speakers became uncomfortable after wearing the US headset for a long time. We therefore focused on the most relevant transitions in order to minimize the length of the recordings. Three repetitions were recorded for each prompt in the corpus and, in each recording, the speaker was asked to say the prompt twice, e.g. "iiitiii…iiitiii", with an interval of around 3 seconds, yielding a total of 6 repetitions per prompt. To facilitate movement annotation, we asked the speakers to sustain each phoneme for at least one second. The prompts were recorded in random order and presented on a computer display, and the participant was instructed to read them when signaled (the prompt background turned green). For each recorded sequence, EMG recording was started before US recording and stopped after the US had been acquired.

The 3 speakers in this study were all male native speakers of European Portuguese, aged 28, 31 and 33 years, with no known history of hearing or speech disorders. Each speaker recorded a total of 81 utterances containing 2 repetitions each, giving a total of 486 observations (81 utterances x 3 speakers x 2 repetitions of each utterance).

3.2 Ultrasound Acquisition

The ultrasound setup comprises a Mindray DP6900 ultrasound system with a 65EC10EA transducer; an ExpressCard/54 video capture card to capture the US video; a microphone connected to a Roland UA-25 external sound card; and a SyncBrightUp unit, which allows synchronization between the audio and the ultrasound video, recorded at 30 frames per second. To ensure that the relative position of the ultrasound probe and the head is kept constant during the acquisition session, a stabilization headset is used [5], securing the ultrasound probe below the participant's chin.

Acquisition is managed by Articulate Assistant Advanced (AAA) [16], which is responsible for recording the audio and ultrasound video and for triggering the SyncBrightUp unit. When triggered, the SyncBrightUp unit introduces synchronization pulses in the audio signal and, for each of those, a white square on the corresponding ultrasound video frames of the sequence.

The synchronization between the audio and the US video is tuned after the recording session, in AAA, by checking the pulses in the audio signal and aligning them with the frames containing bright squares.

3.3 Acquisition of Surface EMG and Other Modalities

The EMG acquisition system (from Plux [17]) uses 5 pairs of surface EMG electrodes connected to a device that communicates with a computer via Bluetooth. The sensors were attached to the skin with an approximate spacing of 2.0 cm between electrode centers for the bipolar configurations. Before placing the surface EMG sensors, each sensor location was cleaned with alcohol. While uttering the prompts, no movement other than that associated with speech production was made. EMG channels 1 and 4 used a monopolar configuration (i.e. one of the electrodes of the respective pair was placed in a location with low or negligible muscle activity), with the reference electrodes placed on the mastoid portion of the temporal bone. The positioning of the EMG electrodes was based on previous work (e.g. [2, 9]) and was somewhat constrained by the ultrasound probe placed beneath the chin, as depicted in Fig. 2.

In terms of pre-processing, the EMG signal was normalized, the 50 Hz power-line interference was removed with a notch filter, and the signal was filtered using Singular Spectrum Analysis (SSA).
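
To make this pre-processing chain concrete, the sketch below applies the same three steps (normalization, 50 Hz notch filtering and a basic SSA reconstruction) to a single EMG channel. The sampling rate, SSA window length and number of retained components are illustrative assumptions, not values reported in this paper.

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

def preprocess_emg(x, fs=1000.0, ssa_window=30, ssa_components=3):
    """Normalize, remove 50 Hz interference and smooth one EMG channel.

    fs, ssa_window and ssa_components are assumed values for illustration.
    """
    # Amplitude normalization (zero mean, unit variance).
    x = (x - np.mean(x)) / np.std(x)

    # Notch filter centered at 50 Hz to remove power-line interference.
    b, a = iirnotch(w0=50.0, Q=30.0, fs=fs)
    x = filtfilt(b, a, x)

    # Basic Singular Spectrum Analysis: embed the signal in a trajectory
    # matrix, decompose it, and keep only the leading components.
    L = ssa_window
    K = len(x) - L + 1
    traj = np.column_stack([x[i:i + L] for i in range(K)])      # L x K matrix
    U, s, Vt = np.linalg.svd(traj, full_matrices=False)
    approx = (U[:, :ssa_components] * s[:ssa_components]) @ Vt[:ssa_components]

    # Diagonal averaging maps the low-rank matrix back to a 1-D signal.
    out = np.zeros(len(x))
    counts = np.zeros(len(x))
    for k in range(K):
        out[k:k + L] += approx[:, k]
        counts[k:k + L] += 1
    return out / counts
```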

Fig. 2. Frontal (left image) and lateral (right image) view of the positioning of the surface EMG sensors (channels 1 to 5).

For collecting Video and Depth data we used a Microsoft Kinect device. For the Ultrasonic Doppler signal we used a custom-built device [18], which also contains a microphone besides the ultrasonic emitter and receiver.

The audio from the main microphone (provided by the SyncBrightUp unit), the Ultrasonic Doppler signal, and a synchronization signal programmed to be emitted automatically by the EMG device at the beginning of each prompt were recorded by an external sound card (TASCAM US-1641). The activation of the output bit flag of the EMG recording device generates a small voltage peak on the recorded synchronization signal. To enhance and detect that peak, a second-order derivative is applied to the signal, followed by an amplitude threshold. To make this peak detectable, the external sound card channel is configured with maximum input sensitivity. After the peak has been detected automatically, we remove the extra samples (those before the peak) in all channels. More details regarding the setup can be found in [19].
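
A minimal sketch of this peak detection is shown below, assuming the synchronization channel has already been read into a NumPy array; the threshold value and offset handling are illustrative choices rather than the exact parameters of the setup.

```python
import numpy as np

def find_sync_peak(sync_channel, threshold=0.1):
    """Locate the voltage peak produced by the EMG device's output bit flag.

    A discrete second-order derivative enhances the peak; the first sample
    whose magnitude exceeds the threshold marks the common start instant.
    """
    d2 = np.diff(sync_channel, n=2)
    candidates = np.flatnonzero(np.abs(d2) > threshold)
    if candidates.size == 0:
        raise ValueError("no synchronization peak found; lower the threshold")
    return int(candidates[0]) + 2  # compensate for the two differencing steps

# Usage (illustrative): trim all channels so that they start at the peak.
# start = find_sync_peak(recording["sync"])
# aligned = {name: channel[start:] for name, channel in recording.items()}
```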

3.4 EMG Synchronization with US

To synchronize the EMG signals with the US video, we use the audio signal provided by the SyncBrightUp unit which, as described in Section 3.2, contains the synchronization pulses and is captured by both the EMG and US setups. Since, within each setup, this signal is synchronized with the remaining data, it can be used to synchronize the data across both. To measure the delay between the two setups, we cross-correlate the two audio tracks, obtain the time lag between them, and remove samples from the EMG signals (which always start first).
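
The lag estimation can be sketched as follows, assuming both audio tracks have been resampled to the same rate; the variable names are illustrative.

```python
import numpy as np
from scipy.signal import correlate, correlation_lags

def emg_us_lag(audio_emg_side, audio_us_side):
    """Lag (in samples) between the two recordings of the SyncBrightUp audio."""
    xc = correlate(audio_emg_side, audio_us_side, mode="full")
    lags = correlation_lags(len(audio_emg_side), len(audio_us_side), mode="full")
    return int(lags[np.argmax(xc)])

# Usage (illustrative): the EMG setup always starts recording first, so the
# extra leading samples are dropped from the EMG-side signals.
# lag = emg_us_lag(emg_audio, us_audio)
# emg_signals = emg_signals[:, lag:]
```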

3.5 US Data Annotation

The ultrasound data was annotated to identify the segments in which tongue movement is present. The goal is to examine the video sequence and tag the segments where the tongue is in motion. Given the large amount of US data, we used an automatic annotation method [4]. This method computes the pixel-wise inter-frame difference between each pair of consecutive video frames in the sequence. When the tongue moves, the inter-frame difference rises, resulting in local maxima (see Fig. 3 for an illustrative example). Starting from these maxima, the second derivative is considered, to the left and to the right, to expand the annotation. The envelope of the speech signal is then used to segment the repetitions within each utterance.
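
The core of this method can be sketched as below: the pixel-wise inter-frame difference is reduced to a single curve whose local maxima serve as movement candidates. The expansion around each maximum using the second derivative and the segmentation by the speech envelope, described above, are not reproduced here, and the peak-picking parameters are assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def interframe_difference(frames):
    """Mean pixel-wise absolute difference between consecutive US frames.

    `frames` has shape (n_frames, height, width), grayscale.
    """
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    return diffs.mean(axis=(1, 2))

def movement_candidates(curve, fps=30, min_separation_s=0.3):
    """Local maxima of the inter-frame difference curve (candidate movements)."""
    min_distance = max(1, int(min_separation_s * fps))
    peaks, _ = find_peaks(curve, distance=min_distance, prominence=np.std(curve))
    return peaks
```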

Considering, for example, one repetition of the sequence "iiiuuuiii", at least four tongue movements are expected: one at the start, for the transition from rest to [i]; one for the transition from [i] to [u] (the tongue moves backwards); one for the transition from [u] to [i] (the tongue moves forward); and one at the end, when the tongue goes from [i] back to the resting position. The two annotated regions in the middle of each repetition therefore correspond to the transitions between sounds, as depicted in Fig. 3. For the movements outside the utterances no movement direction was annotated. To validate the proposed method, 72 tongue movement segments were manually annotated in several sequences belonging to the three speakers. For additional details regarding the automatic annotation of the US sequences, the reader is referred to [20].

Fig. 3. Inter-frame difference curve (in blue) for a speaker uttering "iiiuuuiii". The automatic annotation identifies all segments with tongue movement, and the movement direction for those inside the utterance.

4 Results

To assess the detection of tongue movements, the synchronized data was explored for differences in density functions. Additionally, a detection experiment based on the probability of movement was performed.

4.1 Densities

Using random utterances of each speaker and the annotations from US, the probability density functions of 3 classes were calculated and compared. Two of these classes represent the tongue movements found in each utterance and the third class denotes no tongue movement. Based on preliminary experiments, we noticed that the statistics for each speaker stabilize after 9 utterances. A Gamma distribution was adopted based on the shape of the histograms, and its 2 parameters were estimated using Matlab. As depicted in Fig. 4, differences between the distributions were found for all speakers, with some variation across EMG channels.
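
The paper reports that the two Gamma parameters were estimated in Matlab; an equivalent sketch using SciPy is shown below, with illustrative variable names, where the location parameter is fixed at zero so that only shape and scale are estimated.

```python
from scipy import stats

def fit_gamma(samples):
    """Estimate the shape and scale of a 2-parameter Gamma distribution."""
    shape, _, scale = stats.gamma.fit(samples, floc=0)  # location fixed at 0
    return shape, scale

# Usage (illustrative), per EMG channel and per class:
# shape_mov, scale_mov = fit_gamma(features[movement_mask])
# shape_non, scale_non = fit_gamma(features[~movement_mask])
# pdf_mov = stats.gamma.pdf(grid, shape_mov, scale=scale_mov)
```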

Fig. 4. Density functions for the 5 EMG channels of speakers 1 and 2, including curves for one of the movements (front) and for non-movement.

4.2 Detection Exploratory Experiment

Based on the probability density functions described in the previous section, we estimated the probability of each movement. Considering the probability of the measured EMG (meas) given a movement (mov), p(meas|mov), we can apply Bayes' rule as follows:

\( p(mov \mid meas) = \dfrac{p(meas \mid mov)\, p(mov)}{(1 - p(mov))\, p(meas \mid nonmov) + p(mov)\, p(meas \mid mov)} \)   (1)

The detection threshold was set to 0.5 and p(mov) to 0.3, which, based on an empirical analysis, presented a good balance between detections and false positives. Fig. 5 presents the results for a (well-behaved) utterance.
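
Eq. (1) together with the fitted class densities yields a per-sample detector; a minimal sketch is shown below, assuming Gamma likelihoods for the movement and non-movement classes and using the 0.5 threshold and prior p(mov) = 0.3 reported above.

```python
from scipy import stats

def movement_posterior(meas, mov_params, nonmov_params, p_mov=0.3):
    """p(mov | meas) from Eq. (1), with Gamma likelihoods per class."""
    like_mov = stats.gamma.pdf(meas, mov_params[0], scale=mov_params[1])
    like_non = stats.gamma.pdf(meas, nonmov_params[0], scale=nonmov_params[1])
    numerator = like_mov * p_mov
    denominator = (1.0 - p_mov) * like_non + p_mov * like_mov
    return numerator / denominator

# Usage (illustrative): flag samples whose posterior exceeds the 0.5 threshold.
# detected = movement_posterior(features, (shape_mov, scale_mov),
#                               (shape_non, scale_non)) > 0.5
```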

To assess the applied technique, we compared the detection results with the US annotation in order to obtain information on correct detections, missed detections and false detections. As this processing was done per sample, the outputs were manually analyzed to determine the number of correct detections and the number of misses. False-positive (FP) detections were assessed qualitatively on a 3-level scale (0 = few or none; 1 = some; 2 = many). The best results were attained for Speaker 1 on channel 3, with a detection rate of 80.0% and an average of 67.1% over the 5 channels. The other 2 speakers attained their best detection results, 68.6% and 66.8%, on EMG channels 4 and 5, respectively. There is also a strong variation of results across prompts, as depicted in Fig. 6.

Fig. 5. Example of movement detection. Detected movements are shown at the top, the probability of movement in the middle and the US annotation at the bottom, where forward movement is represented by the segments drawn above the middle line.

Fig. 6. Detection accuracy results with 95% confidence interval for some of the prompts.

In terms of false positives, we noticed that, although Speaker 1 presents the best detection results, this speaker also has a high FP rate, with 38.8% of the utterances having many FP. In that sense, Speaker 2 presents the best balance between correct detections and FP, with 47.1% of the utterances presenting none or few FP. In terms of sensors, the best balance between correct detections and FP was found for channels 2 and 5.

5 Conclusions

This paper describes a novel approach to assessing the capability of surface EMG to detect tongue movements for SSI. The approach uses synchronized US imaging and surface EMG to analyze signals related to tongue movement. For this purpose, a corpus designed specifically with this goal in mind was collected. After synchronously acquiring the modalities and automatically annotating the US data, this new dataset provides the necessary grounds for interpreting the EMG data. The results show that tongue movement can be detected using surface EMG with some accuracy, but with variation across speakers and the possibility of high FP rates, suggesting the need for an adaptive solution. The work presented here can be used to explore combinations of several sensors and to complement existing SSI with articulatory information about the tongue using a non-invasive solution. Future work includes a detailed analysis of the characteristics of individual movements and the development of an EMG-based tongue movement classifier.

Acknowledgements. This work was partially funded by Marie Curie Actions IRIS (ref. 610986, FP7-PEOPLE-2013-IAPP) and by project Cloud Thinking (QREN Mais Centro, ref. CENTRO-07-ST24-FEDER-002031).

References

1. Denby, B., Schultz, T., Honda, K., Hueber, T., Gilbert, J.M., Brumberg, J.S.: Silent speech interfaces. Speech Commun. 52, 270–287 (2010).
2. Wand, M., Schultz, T.: Session-independent EMG-based Speech Recognition. In: International Conference on Bio-Inspired Systems and Signal Processing (BIOSIGNALS), pp. 295–300 (2011).
3. Seikel, J.A., King, D.W., Drumright, D.G.: Anatomy and Physiology for Speech, Language, and Hearing. Delmar Learning (2009).
4. Silva, S., Martins, P., Oliveira, C., Silva, A., Teixeira, A.: Segmentation and Analysis of the Oral and Nasal Cavities from MR Time Sequences. In: Image Analysis and Recognition, Proc. ICIAR, LNCS. Springer (2012).
5. Scobbie, J.M., Wrench, A.A., van der Linden, M.: Head-probe stabilisation in ultrasound tongue imaging using a headset to permit natural head movement. In: Proceedings of the 8th International Seminar on Speech Production, pp. 373–376 (2008).
6. Stone, M., Lundberg, A.: Three-dimensional tongue surface shapes of English consonants and vowels. J. Acoust. Soc. Am. 99, 3728–3737 (1996).
7. Livescu, K., Zhu, B., Glass, J.: On the phonetic information in ultrasonic microphone signals. In: IEEE Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP 2009), pp. 4621–4624. IEEE (2009).
8. Hardcastle, W.J.: Physiology of Speech Production: An Introduction for Speech Scientists. Academic Press, New York (1976).
9. Jorgensen, C., Dusan, S.: Speech interfaces based upon surface electromyography. Speech Commun. 52, 354–366 (2010).
10. Florescu, V.M., Crevier-Buchman, L., Denby, B., Hueber, T., Colazo-Simon, A., Pillot-Loiseau, C., Roussel-Ragot, P., Gendrot, C., Quattrocchi, S.: Silent vs vocalized articulation for a portable ultrasound-based silent speech interface. In: Proceedings of Interspeech 2010, pp. 450–453 (2010).
11. Hofe, R., Ell, S.R., Fagan, M.J., Gilbert, J.M., Green, P.D., Moore, R.K., Rybchenko, S.I.: Evaluation of a silent speech interface based on magnetic sensing. In: Proceedings of Interspeech 2010, pp. 246–249 (2010).
12. Alfonso, P.J., Baer, T.: Dynamics of vowel articulation. Lang. Speech 25, 151–173 (1982).
13. Shankweiler, D., Harris, K.S., Taylor, M.L.: Electromyographic studies of articulation in aphasia. Arch. Phys. Med. Rehabil. 49, 1–8 (1968).
14. Teixeira, A., Martins, P., Oliveira, C., Ferreira, C., Silva, A., Shosted, R.: Real-time MRI for Portuguese: database, methods and applications. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), pp. 306–317 (2012).
15. Kent, R.D.: Some considerations in the cinefluorographic analysis of tongue movements during speech. Phonetica 26, 16–32 (1972).
16. Articulate Assistant Advanced Ultrasound Module User Manual, Revision 212, http://www.articulateinstruments.com/aaa/.
17. Plux Wireless Biosignals, http://www.plux.info/.
18. Freitas, J., Teixeira, A., Vaz, F., Dias, M.S.: Automatic Speech Recognition Based on Ultrasonic Doppler Sensing for European Portuguese. In: Advances in Speech and Language Technologies for Iberian Languages, pp. 227–236. Springer, Berlin Heidelberg (2012).
19. Freitas, J., Teixeira, A., Dias, M.S.: Multimodal Corpora for Silent Speech Interaction. In: 9th Language Resources and Evaluation Conference, pp. 1–5 (2014).
20. Silva, S., Teixeira, A.: Automatic Annotation of an Ultrasound Corpus for Studying Tongue Movement. In: Proc. ICIAR, LNCS 8814, pp. 469–476. Springer, Vilamoura, Portugal (2014).