
Infotainment Devices Control by Eye Gaze and Gesture Recognition Fusion

Tabassam Nawaz, Muhammad Saleem Mian, Hafiz Adnan Habib, Member, IEEE

Abstract — This paper presents a novel concept for controlling consumer devices, such as an MP3 player or other daily-life appliances, by the fusion of eye gaze and gesture recognition methodologies. Such a system can be deployed to virtually control consumer devices anywhere in the home and office. The usability of the system has also been tested with patients in hospital intensive care units, where patients may not be able to operate consumer devices in the regular way. The proposed system consists of a video-processing-based embedded system, a CCD camera, and a situated display. Physical control options are shown on the situated display. Selection and de-selection of the displayed control options are accomplished by analyzing the video sequence captured by the CCD camera. Eye gaze estimation and head gesture recognition algorithms analyze the video sequence and derive the control command that is to be sent to the infotainment or other consumer device.

Index Terms — Virtual consumer appliance control, gesture recognition, gaze direction estimation, recognition fusion, real-time vision system.

I. INTRODUCTION

Control of consumer devices is one of the most common activities in the daily life of any human being. Many alternative possibilities exist for controlling devices and communicating commands to them. Direct interaction, such as physical-contact-based control of a device, is the most common. Remote interaction methods, such as infrared, Bluetooth, and RF-based signaling, are also common possibilities.

Apart from the distinction between direct and indirect control of devices, there is another classification of consumer device controls from the human-machine interaction perspective: perceptual and non-perceptual. Non-perceptual interfaces require physical contact; touch-screen-based systems are the most advanced systems in this category.

Tabassam Nawaz and Hafiz Adnan Habib are with the Faculty of Telecommunication & Information Engineering, University of Engineering & Technology Taxila, 47050, Pakistan.

Muhammad Saleem Mian is with the Electrical Engineering Department, University of Engineering & Technology Lahore, Pakistan.

Tabassam Nawaz and Hafiz Adnan Habib are with the Video & Image Processing Laboratory, Faculty of Telecommunication & Information Engineering, University of Engineering & Technology Taxila, 47050, Pakistan (e-mail: [email protected]; [email protected]; [email protected]).

Perceptual interfaces, by contrast, do not require physical contact; remote operation of devices is possible without the physical medium that is attached for sensing commands in the case of IR remotes. The command generation and sensing process is explicit and non-contact based. Gesture recognition is the most promising approach for the design of perceptual interfaces for human-machine interaction.

II. REAL TIME GESTURE AND GAZE DIRECTION BASED VIRTUAL INFOTAINMENT CONTROL SYSTEM

A gesture may be defined as a physical movement of the hands, arms, face, or body with the intent to convey information or a command. Gesture recognition consists of tracking human movement and interpreting that movement as semantically meaningful commands. Gesture recognition has the potential to be a natural and powerful tool for intuitive interaction between humans and computers [4]. Gesture recognition has been successfully applied in virtual reality, human-computer interaction, game control, robot interaction, remote control of home and office appliances, sign language, activity recognition, human behavior analysis, and training systems. A gesture recognition system is designed in four stages: gesture acquisition, feature extraction, classification, and learning. Gesture acquisition is accomplished by position sensors, motion/rate sensors, or digital imaging. Feature extraction and classification are real-time stages that analyze the acquired gesture, while the learning stage is an off-line activity that learns the relationship between a gesture and information or a command [12].
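The four-stage decomposition can be made concrete with a short sketch. The code below is illustrative only: the function names, the OpenCV-style `camera.read()` call, and the block-average features are assumptions rather than the system described in this paper; the off-line learning stage is represented simply by a dictionary of pre-stored feature templates.

```python
# Minimal sketch of the four-stage gesture pipeline (acquisition, feature
# extraction, classification, off-line learning). All names and the choice of
# features are illustrative assumptions, not the authors' implementation.
import numpy as np

def acquire_frame(camera):
    """Stage 1: gesture acquisition; 'camera' is assumed to behave like an
    OpenCV cv2.VideoCapture object."""
    ok, frame = camera.read()
    return frame if ok else None

def extract_features(frame):
    """Stage 2: reduce a colour frame to a compact feature vector
    (here, 8x8 grayscale block averages as a stand-in feature)."""
    gray = frame.mean(axis=2)
    h, w = gray.shape
    blocks = gray[: h // 8 * 8, : w // 8 * 8].reshape(8, h // 8, 8, w // 8)
    return blocks.mean(axis=(1, 3)).ravel()

def classify(features, templates):
    """Stage 3: nearest-template classification; 'templates' maps gesture
    labels to feature vectors produced by the off-line learning stage."""
    labels = list(templates)
    dists = [np.linalg.norm(features - templates[k]) for k in labels]
    return labels[int(np.argmin(dists))]
```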

Eye gaze is a natural way of interacting with computers and machines. Eye gaze estimation has been studied for diverse applications such as oculography in medicine, blind-spot detection in vehicular safety systems, and human-computer interaction systems for handicapped people. A human being's gaze direction is determined by two components: the pose of the head and the orientation of the eyes within their sockets [5].

Infotainment device controls are laid out on a control panel and distributed among different categories: source (i.e., the type of device, such as CD/DVD player, AM/FM radio, cassette player/recorder, or iPod), play options (e.g., stop, play/pause, forward, reverse, next track, previous track, repeat), sound options (e.g., volume control, equalization, and mode selection such as rock, classic, or jazz), and other advanced functions. All these categories of controls are placed on the control panel in the form of switches and rotary knobs, which are operated by the user's fingers or hands.


This paper presents a novel concept of infotainment device control based upon the fusion of visual gaze and head gesture recognition of the user, as shown in Figure 1.

Figure 1. Infotainment device control comparison: conventional switches and rotary controls operated by hand/finger movement versus gaze focusing and head shake/nod, each resulting in a command to the infotainment device.

In an infotainment control environment, the devices operate in a binary fashion. At most one option from each control category is active at a time, and the remaining options of that category stay inactive. This reflects the binary nature of the control categories.

Traditionally, infotainment devices are controlled by pressing switches or moving/rotating knobs. The proposed system instead estimates the gaze direction to select a menu item shown on the situated display and uses head gesture recognition to activate the selected menu option. Both concepts are compared in Figure 2.

Figure 2. Functional comparison of infotainment device control: hand/finger movement actuating a switch or rotary control versus visual gaze estimation and head gesture recognition fusion, each producing a command to the infotainment device.

The visual gaze and head gestures of the user are captured in a video sequence through a CCD-based video sensor. The traditional control panel is re-organized into discrete labeled images that are displayed before the user in a tree-like structure on the situated display. The system therefore contains a video camera, an embedded processing board, a communication module, and a situated display, along with the infotainment device that is to be controlled through visual gaze estimation and head gesture recognition. The architecture of the system is shown in Figure 3.

Suppose the output of the infotainment control system is defined as $M = \{ m_1, m_2, \ldots, m_i, \ldots, m_L \}$, where $m_i$ is a particular command to be sent to the infotainment device. Each $m_i$ sent to the infotainment device is selected by traversing a tree-like structure. Commands are also bunched into various groups, and the exit command of each group goes to a different sub-system in the infotainment device.
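A hedged sketch of such a command tree is given below; the category and command labels follow Section II loosely, while the identifiers and the flat enumeration of $M$ are illustrative assumptions rather than the actual menu layout.

```python
# Illustrative tree of discrete labeled menu options; leaves are the commands
# m_i that make up M. Labels follow Section II, identifiers are assumed.
MENU_TREE = {
    "Source": {"CD/DVD": "src_cd", "AM/FM Radio": "src_radio", "iPod": "src_ipod"},
    "Play":   {"Play/Pause": "play_pause", "Next Track": "next_track",
               "Previous Track": "prev_track"},
    "Sound":  {"Volume Up": "vol_up", "Volume Down": "vol_down",
               "Mode: Rock": "eq_rock"},
}

def leaf_commands(tree):
    """Enumerate M = {m_1, ..., m_L}: every leaf command reachable in the tree."""
    for value in tree.values():
        if isinstance(value, dict):
            yield from leaf_commands(value)
        else:
            yield value

M = list(leaf_commands(MENU_TREE))   # e.g. ['src_cd', 'src_radio', ...]
```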

Figure 3. Architecture of the proposed system: a CCD video camera captures the user's face and head, the embedded processing board performs visual gaze estimation and head gesture recognition, the situated display shows the discrete labeled menu images for device control, and the resulting control commands are sent to the infotainment device.

A functionality scheme has been devised for operating the infotainment device through the proposed system based upon visual gaze and head gesture recognition fusion. Discrete labeled menu images are displayed on the situated display.

Figure 4. Command selection/de-selection scheme: the video of the user's face and head is analyzed; if the eye gaze stays focused in a command area for longer than a pre-specified time, the system waits for a head shake or nod to traverse back/forward in the menu or to activate/de-activate the command; a non-focused gaze of pre-specified duration, or the absence of a head shake/nod within the pre-specified interval, returns the system to its initial state.

The commands are displayed, and the user takes a visual look at the available commands on the present discrete labeled image.


A particular command is initially selected by focusing the eye gaze on it for a pre-specified time interval. Once the eye gaze has been maintained on a specific subject of the discrete labeled image for the pre-specified time, that command becomes available for selection or de-selection. Selection or de-selection of a command is then performed by a head gesture: a head shake selects the command, and a head nod de-selects it. The same process of focusing the eye gaze followed by a head shake or nod is used to traverse the tree menu structure until an end node of the tree is reached, where commands are sent to the device. The concept is illustrated in Figure 4.
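The dwell-then-gesture scheme of Figure 4 can be summarized as a small state machine. The sketch below assumes per-frame inputs (the menu item currently hit by the gaze, if any, and a detected head gesture) and illustrative thresholds; it is not the authors' implementation.

```python
# Sketch of the command selection/de-selection scheme in Figure 4.
# Inputs per frame: gaze_item (menu item under the gaze, or None) and
# gesture ('shake', 'nod', or None). Thresholds are illustrative assumptions.
DWELL_FRAMES = 30      # "pre-specified time" for gaze focusing
GESTURE_WINDOW = 60    # frames to wait for a head shake/nod after selection

def control_loop(frames, dwell=DWELL_FRAMES, window=GESTURE_WINDOW):
    focused_item, dwell_count, wait_count = None, 0, 0
    state = "IDLE"
    for gaze_item, gesture in frames:
        if state == "IDLE":
            if gaze_item is not None and gaze_item == focused_item:
                dwell_count += 1
            else:
                focused_item, dwell_count = gaze_item, 0
            if dwell_count >= dwell:
                state, wait_count = "ARMED", 0       # item highlighted
        elif state == "ARMED":
            if gesture == "shake":
                yield ("activate", focused_item)     # traverse forward / select
                state, dwell_count = "IDLE", 0
            elif gesture == "nod":
                yield ("deactivate", focused_item)   # traverse back / de-select
                state, dwell_count = "IDLE", 0
            else:
                wait_count += 1
                if wait_count >= window or gaze_item != focused_item:
                    state, dwell_count = "IDLE", 0   # timeout or gaze moved away
```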

III. ALGORITHMIC DESIGN

The algorithmic detail of the eye gaze and gesture recognition fusion based infotainment control is shown in Figure 5. The proposed infotainment control captures the digital video sequence $V_d$ with a mono-vision CCD-based sensor. Frames $F(x, y)$ are extracted, where $(x, y)$ are the spatial coordinates in the image plane.

A. Face Detection

Face detection is the first stage and is responsible for the detection and localization of the face in the extracted frames $F(x, y)$. The objective of face detection is to localize all areas in the image that contain a face, regardless of its 3D position, orientation, and lighting conditions [1]. Faces appear with massive variability in shape and color because of differences among individuals, non-rigidity, facial hair, and styles. Face detection therefore involves tremendous within-class variability, yet it remains a two-class recognition problem: face versus non-face.

This paper adapts the approach described in [7]. The implementation is based upon AdaBoost, which selects a small number of critical visual features. These features act as classifiers and are combined in cascaded form, yielding an extremely efficient classifier [2]. Rectangular features are computed that provide a coarse representation of edges, bars, and simple image structures. These rectangular Haar-like features are computed with the integral image approach, which is highly computationally efficient. An 11-scale pyramidal structure is followed: scanning of the image starts at a 24×24-pixel base scale, and each subsequent scale is 1.25 times the previous one.

AdaBoost combines weak classifiers and boosts the classification of simple learning algorithms. A weak classifier consists of a feature computed over a 24×24-pixel image sub-window, a threshold, and a parity indicating the direction of the inequality sign [7].
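As a point of reference, the cascaded Haar-feature detector can be exercised with OpenCV's pre-trained frontal-face cascade; the sketch below is not the authors' trained cascade, but the 1.25 scale step and the 24×24 base window mirror the parameters quoted above.

```python
# Hedged sketch of cascaded Haar-feature face detection using OpenCV's
# pre-trained frontal-face cascade (an assumption; the paper trains its own).
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame_bgr):
    """Return face rectangles (x, y, w, h) found in frame F(x, y)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return cascade.detectMultiScale(
        gray, scaleFactor=1.25, minNeighbors=5, minSize=(24, 24))
```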

Figure 5. Algorithmic design: frames are extracted from the captured video of the user's face and head, the face and then the eyes are detected, the eye gaze is estimated and mapped to menu clusters in the off-line learned discrete labeled menu images; once the gaze focus exceeds the specified time, the selected menu cluster is highlighted, a head shake traverses forward in the tree menu or activates the command, a head nod traverses back or de-activates the command, and a non-focused gaze or the absence of a head shake/nod within the pre-specified interval returns the system to gaze mapping.


B. Eye Detection

Eye detection is the pre-stage for eye gaze estimation; the aim is to detect and register the eyes in the currently captured frame $F(x, y)$. Many approaches have been proposed in the literature for eye detection, such as template matching, eigenspaces, the hybrid projection function [6], terrain features [13], and texture features [3]. At this stage of processing, it is assumed that the face has already been detected and registered in the current frame. Here, texture features are utilized for the detection of the eyes in the frame. Two texture features are used [3]: a Gabor filter [14] for long horizontal lines and the angular radial transform [15] for emphasizing circular shapes that are darker than their surroundings. The responses of the Gabor filter and the angular radial transform are evaluated for the detection and localization of the eyes in the captured frame $F(x, y)$. Two local maxima, one representative of each eye, are searched for under the constraint that both maxima lie within a specified distance of each other, since the eyes cannot physically be located farther apart.
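A simplified version of this search is sketched below: a horizontal Gabor response over the detected face region, followed by a search for two strong maxima at a plausible inter-ocular separation. The angular radial transform term is omitted, and all parameter values are assumptions.

```python
# Illustrative eye localisation within a detected (grayscale) face region:
# horizontal Gabor response, then two strong maxima at a plausible separation.
# Kernel size, sigma, wavelength, and distance bounds are assumed values.
import cv2
import numpy as np

def locate_eyes(face_gray, min_sep=20, max_sep=80):
    kernel = cv2.getGaborKernel((21, 21), 4.0, 0.0, 10.0, 0.5)  # horizontal stripes
    response = cv2.filter2D(face_gray.astype(np.float32), cv2.CV_32F, kernel)
    flat = np.argsort(response, axis=None)[::-1][:200]          # strongest first
    cands = np.column_stack(np.unravel_index(flat, response.shape))  # (y, x) pairs
    for i, (y1, x1) in enumerate(cands):
        for y2, x2 in cands[i + 1:]:
            if min_sep <= abs(int(x2) - int(x1)) <= max_sep \
                    and abs(int(y2) - int(y1)) < 10:
                return (int(x1), int(y1)), (int(x2), int(y2))   # eye candidates
    return None
```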

C. Eye Gaze Estimation

The objective of the gaze estimation algorithm is to produce a mapping from image coordinates to the output space, or real world [16]. Mathematically, gaze can be defined by

$G_v = \Phi(X)$

where $\Phi$ is the gaze function, $X$ is the feature vector computed from the image domain, and $G_v$ represents the gaze vector, which may be 2D or 3D depending upon the application requirements, as shown graphically in Figure 6.

Figure 6. Gaze vector.
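One common realization of $\Phi$, given here purely as an illustration rather than the authors' method, is a second-order polynomial mapping from an image-domain feature (e.g. a pupil-centre offset) to 2D display coordinates, fitted by least squares from a few calibration points.

```python
# Hedged example of a gaze function Phi: 2nd-order polynomial regression from
# a 2D pupil-centre feature X to 2D display coordinates G_v, fitted from
# calibration correspondences. Feature choice and model are assumptions.
import numpy as np

def poly_terms(px, py):
    return np.array([1.0, px, py, px * py, px * px, py * py])

def fit_gaze_map(pupil_pts, screen_pts):
    """pupil_pts, screen_pts: (N, 2) arrays of calibration pairs, N >= 6."""
    A = np.vstack([poly_terms(px, py) for px, py in pupil_pts])
    coeffs, *_ = np.linalg.lstsq(A, np.asarray(screen_pts, float), rcond=None)
    return coeffs                          # (6, 2) polynomial weights

def gaze_vector(coeffs, px, py):
    """Evaluate G_v = Phi(X) for a new pupil-centre measurement."""
    return poly_terms(px, py) @ coeffs     # (gx, gy) on the display
```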

A number of approaches, as well as specialized hardware, have been proposed for gaze direction estimation. 2D gaze direction can be estimated by electro-oculography [17], pupil tracking [18], artificial neural networks [19], pupil and corneal reflection [20], corneal reflection [21], Purkinje image tracking, and scleral coil tracking techniques. Some systems also include an infrared illumination source. The proposed system, in contrast, is based solely upon a mono-vision sensor without additional illumination.

Gaze direction has also been considered from the application point of view: whether the gaze is estimated for directional information only, or whether the estimated gaze is further investigated through its interaction with real-world objects. Such applications exist in the human-computer interaction domain, where gaze direction is utilized to interact with a display.

D. Eye Gaze Interaction Mapping

Let $G(x, y)$ represent the gaze direction of the user, where $(x, y)$ are the coordinates in the sensor domain. The problem of estimating the gaze direction is solved in three subsequent stages: estimation of the approximate pupil centre, searching for the pupil contour to determine the pupil centre and radius, and searching for the iris contour [10]. Let $D_j$ represent the discrete labeled menu images, where $j = 1, 2, \ldots, m$. Each subscript $j$ denotes a different labeled constituent of the menu image; more specifically, each $j$ is a labeled constituent in the discrete labeled menu image, so that, for example, $D_1$ may represent Source, $D_2$ Volume, and $D_3$ Play. Furthermore, each $D_j$ leads to a subsequent discrete labeled menu image, which is traversed in a tree-structured manner and ends at a particular command to some sub-system of the infotainment device. The gaze direction $G(x, y)$ is estimated by the methodology proposed in [10].
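The interaction mapping itself reduces to testing which labeled region of the situated display the gaze point falls into. The sketch below assumes rectangular regions and illustrative coordinates.

```python
# Mapping an estimated gaze point G(x, y) on the situated display to a
# labeled menu constituent D_j. Region rectangles are illustrative assumptions.
MENU_REGIONS = {                      # D_j -> (x0, y0, x1, y1) in display pixels
    "Source": (0,   0,   320, 160),
    "Volume": (320, 0,   640, 160),
    "Play":   (0,   160, 320, 320),
}

def gaze_to_menu(gx, gy, regions=MENU_REGIONS):
    """Return the label j whose region D_j contains (gx, gy), or None."""
    for label, (x0, y0, x1, y1) in regions.items():
        if x0 <= gx < x1 and y0 <= gy < y1:
            return label
    return None
```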

E. Head Gesture Recognition

Let $F'(x, y)$ represent the pre-processed image in which the face/head area has been marked. Let $H = \{ H_S, H_N \}$ represent the pre-stored templates of the reference head gestures, head shake and head nod respectively. The templates $H$ are learned and stored in an off-line training phase. The head shake gesture $H_S$ is treated as the activation gesture, while the head nod gesture $H_N$ is treated as the de-activation gesture. Let $F_{nt} = \{ f_{t1}, f_{t2}, \ldots, f_{tn} \}$ represent the instantaneous head gesture, which is formed by sampling the head pose parameters over a pre-defined interval. These head pose parameters are the three rotation and three translation parameters in $(x, y, z)$ space; the rotation parameters $(\alpha, \beta, \delta)$ are calculated around the $(x, y, z)$ coordinates [11], and $t$ defines the length of the sampling interval.
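Classification of an instantaneous gesture $F_{nt}$ against the off-line templates can be sketched as a nearest-template comparison over the sampled pose parameters; the distance measure and rejection threshold below are assumptions standing in for whatever matcher the system actually uses.

```python
# Hedged sketch: classify the sampled gesture F_nt against templates H_S
# (shake) and H_N (nod). Each sample f_t is the six-vector of head pose
# parameters (alpha, beta, delta, x, y, z) over the interval t.
import numpy as np

def classify_head_gesture(F_nt, H_S, H_N, reject_thresh=50.0):
    """All arguments: arrays of shape (t, 6) sampled over the same interval."""
    d_shake = float(np.linalg.norm(np.asarray(F_nt) - np.asarray(H_S)))
    d_nod = float(np.linalg.norm(np.asarray(F_nt) - np.asarray(H_N)))
    best, label = min((d_shake, "shake"), (d_nod, "nod"))
    return label if best < reject_thresh else None   # None: no gesture detected
```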

IV. PRACTICAL SETUP

The gaze direction and gesture recognition based infotainment device control algorithm has been prototyped for testing with a standard infotainment device. Video of the user's face is captured, and the gaze direction estimation and head gesture recognition algorithms are computed in real time on the embedded video processing system. The embedded system also has a built-in display, which shows the discrete labeled menu images that are traversed in a tree fashion. Whenever a command has to be sent to the original infotainment device, it is communicated through an infrared mouse, which must be aligned with the infrared receiver of the infotainment device.

Figure 7 shows the PixelLink board camera with a wide-angle lens. The embedded system is based upon an ARM processor with an integrated on-board display.

Figure 7. Hardware system.

V. RESULTS

The proposed algorithm has been tested in operation with the infotainment device. Discrete labeled menu images are presented on the display, and the point of regard of the gaze is computed with respect to the contents of the discrete labeled menu images.

Figure 8. Head shake: motion history over time, plotting the amount of horizontal head gesture motion, vertical head gesture motion, and horizontal gaze direction motion.

Figure 9. Head nod: motion history over time, plotting the amount of horizontal head gesture motion, vertical head gesture motion, and horizontal gaze direction motion.

It is evident from Figure 8 and Figure 9 that the instantaneous motion of the head is highest in the horizontal direction in the case of a head shake and highest in the vertical direction in the case of a head nod. The percentage detection success for the sampled infotainment device control commands is shown in Table 1.

Table 1. System results

#   Command   Detection (%)
1   $D_1$     98.0
2   $D_2$     97.6
3   $D_3$     98.2
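The observation drawn from Figures 8 and 9 suggests a simple discrimination rule, sketched below: classify by whichever direction carries more motion energy over the sampling window. The per-frame displacement inputs are assumed quantities, not the paper's exact features.

```python
# Rule of thumb reflecting Figures 8 and 9: a shake is dominated by horizontal
# head motion, a nod by vertical motion. dx, dy: assumed per-frame horizontal
# and vertical head displacements over the gesture window.
import numpy as np

def shake_or_nod(dx, dy):
    if np.sum(np.square(dx)) > np.sum(np.square(dy)):
        return "shake"    # dominant horizontal motion
    return "nod"          # dominant vertical motion
```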

VI. CONCLUSION

The proposed system is capable of controlling an infotainment device by visual gaze and head gestures. The results showed a reliable and practical system. The contribution of this paper to the research community is a solution for interacting with consumer devices, such as infotainment devices, through visual gaze and gesture recognition. Such a system still has to be tested for pointing during multimedia projector presentations, where a pointing device is normally used. Similarly, it is planned to test the system's performance for infotainment device control in other practical environments, such as infotainment devices in automobiles; the underlying technology may not differ much, but interesting applications can be developed. One shortcoming of the system, which will be explored further by incorporating human-computer interaction theory, is that its response time is slower than that of a standard system in daily life. The underlying reason is not the computational requirements but the overall system delay caused by focusing the eye gaze, performing the head shake/nod, and traversing the menu structure by repeatedly focusing the eye gaze and shaking/nodding the head. However, this shortcoming is tolerable since accurate results are obtained.

REFERENCES

[1] M. Yang, D. J. Kriegman, and N. Ahuja, "Detection of faces in images: a survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 1, Jan. 2002.
[2] Y. Freund and R. E. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory: EuroCOLT '95, pp. 23-37, Springer-Verlag, 1995.
[3] C. W. Park, J. M. Kwak, H. Park, and Y. S. Moon, "An effective method for eye detection based upon texture information," in International Conference on Convergence Information Technology, 2007.
[4] M. M. Cerney and J. M. Vance, "Gesture recognition in virtual environments: a review and framework for future development," Iowa State University Human Computer Interaction Technical Report ISU-HCI-2005-01, Mar. 28, 2005.
[5] J. G. Wang and E. Sung, "Study of eye-gaze estimation," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 32, no. 3, June 2002.
[6] Z. H. Zhou and X. Geng, "Projection functions for eye detection," Pattern Recognition, vol. 37, no. 5, pp. 1049-1056, 2004.
[7] P. Viola and M. Jones, "Robust real-time object detection," in 2nd International Workshop on Statistical and Computational Theories of Vision: Modeling, Learning, Computing and Sampling, Vancouver, Canada, July 13, 2001.
[8] W. Junwen and M. T. Mohan, "Visual modules for head gesture analysis in intelligent vehicle systems," in Intelligent Vehicles Symposium, Tokyo, Japan, June 13-15, 2006.
[9] S. Kawato and N. Tetsutani, "Detection and tracking of eyes for gaze-camera control," IEEE, 2001.
[10] A. Perez and M. L. Cordoba, "A precise eye-gaze detection and tracking system," in WSCG 2003, Plzen, Czech Republic, Feb. 3-7, 2003.
[11] U. M. Erden and S. Sclaroff, "Automatic detection of relevant head gestures in American Sign Language recognition," IEEE, 2002.
[12] H. A. Habib and M. Mufti, "Real time mono vision gesture based virtual keyboard system," IEEE Transactions on Consumer Electronics, Nov. 2006.
[13] J. Wang and L. Yin, "Eye detection under unconstrained background by the terrain feature," in IEEE Conference on Multimedia and Expo, pp. 1528-1531, 2005.
[14] T. S. Lee, "Image representation using 2D Gabor wavelets," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 10, pp. 959-971, Oct. 1996.
[15] B. S. Manjunath, P. Salembier, and T. Sikora, Introduction to MPEG-7: Multimedia Content Description Interface, Wiley, 2002.
[16] A. Villanueva, D. W. Hansen, J. S. Agustin, and R. Cabeza, "Basics of gaze estimation," in 2nd Conference on Communication by Gaze Interaction, Turin, Italy, Sept. 4-5.
[17] J. Gips, P. Olivieri, and J. Tecce, "Direct control of the computer through electrodes placed around the eyes," in Proc. Fifth Int. Conf. Human-Computer Interaction, Orlando, FL: Elsevier, 1993, pp. 630-635.
[18] S. Baluja and D. Pomerleau, "Non-intrusive gaze tracking using artificial neural networks," School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, Tech. Rep. CMU-CS-94-102, 1994.
[19] J. K. P. White, T. E. Hutchinson, and J. M. Carley, "Spatially dynamic calibration of an eye-tracking system," IEEE Trans. Syst., Man, Cybern., vol. 23, pp. 1162-1168, July/Aug. 1993.
[20] T. Cornsweet and H. Crane, "Accurate two-dimensional eye tracker using first and fourth Purkinje images," J. Opt. Soc. Amer., vol. 63, no. 8, pp. 921-928, 1973.
[21] L. Bour, "DMI-search scleral coil," Dept. Neurology, Clinical Neurophysiology, Academic Medical Center, AZUA, Amsterdam, The Netherlands, Tech. Rep. H2-214, 1997.

Tabassam Nawaz is pursuing a Ph.D. degree at the University of Engineering & Technology Taxila, Pakistan. He received Master's degrees in Computer Science and in Computer Engineering in 2001 and 2005, respectively. His interest lies in the development of computer vision algorithms for vision-based human-machine interaction. He has developed algorithms for gaze direction estimation and face recognition, which were prototyped in a human-computer interaction system for patients. He holds the position of Assistant Professor in the Faculty of Telecommunication & Information Engineering.

M. Saleem Mian received his Ph.D. degree from the University of Manchester, UK, in 1998. He received his M.Sc. and B.Sc. in Electrical Engineering in 1979 and 1972, respectively. He has a proven development track record in microwave, consumer electronics, and digital signal processing systems. He has developed a secure telephony system with a proprietary encryption and decryption algorithm. He is currently Professor and Head of Department at UET Lahore. He is supervising five Ph.D. students and has more than forty publications of international repute.

Hafiz A. Habib received his Ph.D. degree from the University of Engineering & Technology Taxila, Pakistan, in 2007. He received his M.Sc. and B.Sc. in Electrical Engineering in 2004 and 2000, respectively. His research interests lie in the design and development of real-time imaging and video systems and gesture recognition systems for human-computer interaction. He currently holds the position of Assistant Professor in the Faculty of Telecommunication & Information Engineering. He is investigating the scope of gesture recognition for the medical industry.
