IEEE TRANSACTIONS ON EVOLUTIONARY COMPUTATION, VOL. 4, NO. 1, APRIL 2000

Visual Routines for Eye Location Using Learning and Evolution

Jeffrey Huang and Harry Wechsler

Abstract—Eye location is used as a test bed for developing navigation routines implemented as visual routines within the framework of adaptive behavior-based AI. While there are many eye location methods, our technique is the first to approach such a location task using navigational routines, and to automate their derivation using learning and evolution, rather than handcrafting them. The adaptive eye location approach seeks first where salient objects are, and then what their identity is. Specifically, eye location involves: 1) the derivation of the saliency attention map, and 2) the possible classification of salient locations as eye regions. The saliency (“where”) map is derived using a consensus between navigation routines encoded as finite-state automata exploring the facial landscape and evolved using genetic algorithms (GA’s). The classification (“what”) stage is concerned with the optimal selection of features, and the derivation of decision trees, using GA’s, to possibly classify salient locations as eyes. The experimental results, using facial image data, show the feasibility of our method, and suggest a novel approach for the adaptive development of task-driven active perception and navigational mechanisms.

Index Terms—Active perception, Baldwin effect, behavior-based AI, decision trees, evolutionary computation, genetic algorithms, navigation, saliency map, visual routines.

I. INTRODUCTION

TO DESIGN an automated face recognition (AFR) system, one needs to address several related problems: 1) detection of a pattern as a face (in the crowd), 2) detection of facial landmarks, 3) identification of the faces, and 4) analysis of facial expression [1]. Face recognition starts with the detection of face patterns in sometimes-cluttered scenes. It then proceeds by normalizing the face images to account for geometrical and illumination changes, using information about the location and appearance of facial landmarks. The next step is to identify the faces using appropriate classification algorithms. The last step postprocesses and disambiguates the results using model-based schemes and logistic feedback. The ability to detect salient facial landmarks is an important component of any face recognition system, in particular for normalization purposes and for the extraction of image features to be used subsequently by the face classifiers. As both the position of the eyes and the interocular distance are relatively constant for most people, locating the eyes plays, first of all, an important role in face normalization, and thus facilitates further localization of facial landmarks. It is eye location that allows one to focus attention on salient facial configurations, to filter out structural noise, and to achieve eventual face recognition.

Manuscript received September 2, 1998; revised February 1, 1999 and March 16, 1999.

J. Huang is with the Department of Computer and Information Science, Indiana University–Purdue University, Indianapolis, IN 46202-5132 USA (e-mail: [email protected]).

H. Wechsler is with the Department of Computer Science, George Mason University, Fairfax, VA 22030-4444 USA (e-mail: [email protected]).

Publisher Item Identifier S 1089-778X(00)00624-X.

The exploration of the facial landscape involves measuring the likelihood of each region to cover one of the two eyes. Salient regions, for which the likelihood is large, are then tested for actual eye location using specific classification rules. The approach is conceptually similar to active perception (AP) [2], and it applies as well to any exploratory task carried out over some unknown terrain landscape. AP has advanced the widely held belief that intelligent data collection, rather than image recovery and reconstruction, is the goal of visual perception in general and vision routines in particular. AP involves a large degree of adaptation, and it provides a mobile and intelligent observer, either human or animat, with the capability to decide where to seek information, what information to pick up, and how to process it.

This paper uses eye location as a test bed for developing navigation routines implemented as visual routines (VR). Navigation routines are meant to provide the animat with the capability to move in a trajectory that successfully crosses areas of interest, and to avoid obstacles likely to be encountered while moving toward some specific target. Eye location is fundamental to face recognition as it provides for image calibration based on the almost constant interocular distance. While there are many eye location methods, our technique is the first to approach such a location task using navigation routines, and to automate the derivation of such routines using learning and evolution rather than handcrafting them. The adaptive eye location approach seeks first where salient things are, and then what their identity is. Specifically, eye location involves: 1) the derivation of the saliency attention map, and 2) the possible classification of salient locations as eye regions. The saliency (“where”) map is derived using a consensus between navigation routines encoded as finite-state automata (FSA) exploring the facial landscape and evolved using genetic algorithms (GA’s). The classification (“what”) stage is concerned with the optimal selection of features and the derivation of decision trees (DT) for confirmation of eye classification using GA’s. As the design of eye location routines would address both attention (“where”) mechanisms and recognition (“what”) schemes, one thus has to address the twin problems of optimal feature selection and classifier design. As searching such a nonlinear space is both complex and expensive, this paper introduces a novel and hybrid adaptive methodology for developing eye location routines, drawing on both learning and evolutionary components. The experimental results obtained using the standard FERET facial image data [3] support our conceptual approach.

II. EYE LOCATION

There are two major approaches for automated eye location. The first approach, the holistic one, conceptually related to template matching, attempts to locate the eyes using global representations. Characteristics of this approach are connectionist methods such as principal component analysis (PCA) using eigenrepresentations [4]. Although location by matching raw images has been successful under limited circumstances, it suffers from the usual shortcomings of straightforward correlation-based approaches, such as sensitivity to eye orientation, size, variable lighting conditions, and noise. The reason for this vulnerability of direct matching methods lies in their attempt to carry out the required classification task in a space of extremely high dimensionality. To overcome the curse of dimensionality, contextual information should be employed as well. As an example, Samaria [5] employs stochastic modeling, using hidden Markov models (HMM’s). When the frontal facial images are sampled using top–bottom scanning, the natural order in which the facial landmarks would appear is encoded using an HMM. The HMM leads to the efficient detection of eye strips, but the task of locating the eyes still requires further processing. As the eye location methods lack size invariance, they either assume a prespecified face size or they have to operate at several multiscale grid resolutions. The use of visual routines in an iterative fashion would address this problem, and implicitly could achieve size invariance.

The second approach for eye detection, the abstractive one, extracts (and measures) discrete local features, while standard pattern recognition techniques are then employed for locating the eyes using these measurements. Yuille [6] describes a complex but generic strategy, characteristic of the abstractive approach, for locating facial landmarks such as the eyes, using the concept of deformable templates. The template is a parameterized geometric model of the face or part of it (mouth and/or eyes) to be recognized, together with a measure of how well it fits the image data, where variations in the parameters correspond to allowable deformations. Lam and Yan [7] extended Yuille’s method for extracting eye features by using corner locations inside the eye windows, which are obtained by means of average anthropometric measures after the head boundary is first located. Deformable templates and the closely related elastic snake models are interesting concepts, but using them can be difficult in terms of learning and implementation, not to mention their computational complexity. Eye features could also be detected using their symmetrical characteristics. As an example, Yeshurun and colleagues [8] developed a generalized symmetry operator for eye detection. Most recently, an attempt has been made to speed up the optimization process involved in symmetry detection using evolutionary computation. Gofman et al. [9] have introduced a global optimization approach for detecting local reflectional symmetry in gray-level images using 2-D Gabor decomposition via soft windowing in the frequency domain. Our approach draws from both the holistic and abstractive eye location approaches, and it develops corresponding visual routines for eye location using both learning and evolution.

III. BEHAVIOR-BASED AI

Behavior-based AI [10] has advanced the idea that, for successful operation (and survival), an intelligent and autonomous agent should: 1) consist of multiple competencies (“routines”), 2) be “open” or “situated” in its environment, and 3) monitor the domain of application and determine, in a competitive fashion, what should be done next while dealing with many conflicting goals simultaneously. Maes [10] has then proposed that autonomous agents (animats) should be designed as sets of modules, each of them having its own specific but limited competence. A similar behavior-based-like approach for pattern classification and navigation tasks has been suggested by the concept of visual routines [11], recently referred to as a visual routine processor (VRP) [12]. The VRP assumes the existence of a set of visual routines that can be applied to base image representations (maps), subject to specific functionalities, and driven by the task at hand. A difficult and still open question regarding the concept of visual routines is the extent to which their design could be automated. Brooks [13], for example, has shown that much of his initial success in building mobile robots has been due to carefully, but manually, chosen and designed behaviors. The robots were cleverly designed as tightly coupled sense-act loops with hierarchical arbitration among the modules.

Moghaddam and Pentland [14] have recently suggested that reactive behavior be implemented in terms of “perceptual intelligence,” so that the sensory input is coupled directly to a (probabilistic) decision-making unit for the purpose of control and action. An autonomous agent is then essentially a finite-state automaton whose feature inputs are chosen by sensors connected to the environment and/or are derived from it, and whose actions operate on the same environment. The automaton decides its actions based on inputs and/or features it “forages,” while the behavior of the controller is learned. The visual routine described in this paper attempts to implement the same “perceptual intelligence” ability with respect to eye location. The important question then for the VRP mentioned earlier is how to automatically craft such visual routines, and how to integrate their outputs. Early attempts, developed using manual design, involved simulations lacking low-level (base) representations and operating on bitmaps only. Ramachandran [15], an advocate of the utilitarian theory of perception, has suggested as an alternative that one could craft such visual routines by evolving a “bag of perceptual tricks” whose survival is dependent on functionality and fitness. This approach, which can be directly traced to the earlier “neural Darwinism” theory of neuronal group selection as a basis for higher brain function [16], suggests natural selection as the major force behind the automatic design of visual routines and their integration. In order to scale up to more complex behavior and greater robustness, one has to look for machine learning and evolutionary algorithms to develop (“learn”) and evolve such behavioral routines to specific task functionalities, such as those supporting perceptual intelligence as described earlier.

IV. ACTIVE PERCEPTION

The flow of visual input consists of huge amounts of time-varying information. It is crucial for both biological vision and automated (“animat”) systems to perceive and comprehend such a constantly changing environment within a relatively short processing time. To cope with such a computational challenge, one should locate and analyze only the information relevant to the current task by focusing quickly on selected areas of the scene as needed. Attention mechanisms are known to balance between computationally expensive parallel techniques and time-intensive serial techniques to simplify computation and reduce the amount of processing. In addition to complexity reasons, efficient attention schemes are also needed to form the basis for behavioral coordination [17].

Active perception leads directly to issues of attention. Sensory, perceptual, and cognitive systems are space–time limited, while the information potentially available to each of them is infinite. Much of attentional selectivity, explained as filter theory in terms of system limitations with respect to both storage and processing capabilities, is mostly concerned with the early selection of spotlights and the late but selective processing of control and recognition. The selective processing of regions with restricted location (or motion) is thus necessary for achieving almost real-time and enhanced performance using limited resources.

Various computational models of active perception in general, and visual attention in particular, have been proposed to filter out some of the input, and thus not only reduce the computational complexity of the underlying processes, but possibly provide a basis for forming invariant and canonical object representations as well. Several biologically motivated models of attentional mechanisms and visual integration have appeared in the literature. For most of these models, every point in the visual field competes for control over attention based on its local conspicuity and the history of the system. A high-level feature integration mechanism is then implemented to select the next center of gaze or fixation point. Koch and Ullman [18] have proposed using a number of elementary maps encoding conspicuous orientation, color, and direction of movement. The maps record particular image representations that offer multiple ways to discriminate information. If each feature map represents a range of possible feature values at each location, then the selection process consists of assigning different interest measures to the elements of the map. Two aspects of the selection process should be considered: 1) the distinctiveness of a feature, i.e., how different it is from other values of the same feature map, and 2) the spatial relations of a location with respect to the other locations of the map, in terms of neighborhood. The conspicuity maps are eventually merged into a single representation, called the saliency map. The most active location of this map is computed by a winner-take-all (WTA) mechanism, and is selected as the next focus of attention. One possible cortex region where conspicuity maps are assembled and merged would be the V4 cortex area, while the saliency map is derived later at the IT level. This paper describes how to derive the saliency map using a hybrid and adaptive approach in which FSA search for salient facial areas in terms of their likelihood to contain the eyes.
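As a concrete illustration of this class of attention models (and not of the method developed in this paper), the minimal sketch below merges a few conspicuity maps into a single saliency map and selects the next fixation point with a winner-take-all rule; the uniform weighting and the per-map normalization are assumptions, since models differ on how the maps are combined.

```python
import numpy as np

def saliency_from_conspicuity(maps, weights=None):
    """Merge same-shaped conspicuity maps into a single saliency map.
    Uniform weighting is an assumption, not a claim about any specific model."""
    maps = [np.asarray(m, dtype=float) for m in maps]
    if weights is None:
        weights = [1.0 / len(maps)] * len(maps)
    normed = []
    for m in maps:
        rng = m.max() - m.min()
        # Normalize each map to [0, 1] so no single feature dominates by scale alone.
        normed.append((m - m.min()) / rng if rng > 0 else np.zeros_like(m))
    return sum(w * m for w, m in zip(weights, normed))

def winner_take_all(saliency):
    """Return the (row, col) of the most active location, the next focus of attention."""
    return np.unravel_index(np.argmax(saliency), saliency.shape)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    orientation, color, motion = (rng.random((32, 32)) for _ in range(3))
    sal = saliency_from_conspicuity([orientation, color, motion])
    print("next fixation:", winner_take_all(sal))
```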

Eye location can be viewed as a face recognition (routine) competency, whose inner workings include screening the facial landscape to detect (“pop out”) salient locations as possible eye locations, and eventually labeling as eyes only those salient areas whose feature distribution “fits” what the eye classifier has been trained to recognize as such. In other words, the first stage of the eye location routine implements attention mechanisms whose goal is to create a saliency “eye” map, while the second stage processes the facial landscape, filtered by the just-created saliency map, for actual eye recognition.

V. PATTERN RECOGNITION AND EVOLUTION

Robust pattern analysis and classification require the integration of various learning processes in a modular fashion. Adaptive systems that employ several strategies can potentially offer significant advantages over single-strategy systems. Since the type of input and acquired knowledge are more flexible, such hybrid systems can be applied to a wider range of problems. As an example, Hinton and Nowlan [19] advocate the hybrid use of evolution and learning, with learning possibly easing the computational burden on evolution. Evolution (genotype adaptation) only has to get close to the goal; (phenotype) learning can then finely tune the behavior [20]. Although Darwinian theory does not allow for the inheritance of acquired characteristics, as Lamarck suggested, learning (as acquired behaviors) can still influence the course of evolution. The Baldwin effect [21], [22] suggests that learning can change the fitness of individuals, and that it eventually alters the course of evolution in terms of both efficiency and overall performance [23]. The (computational) Baldwin effect assumes that local search is employed to estimate the fitness of (chromosome) strings, but that the acquired improvements do not change the genetic encoding of the individual.

This paper develops a hybrid and adaptive architecture for developing visual routines for eye location using learning and evolution. Fogel et al. [24] first introduced a methodology to evolve FSA using evolutionary programming. Johnson et al. [25] have suggested using genetic programming (GP) operating on variable-length programs as the approach for evolving visual routines whose perceptual task was to find the left and right hands of a person from a black and white silhouette (“bitmap”) of the person. Our approach for eye location is based on GA’s and operates on real imagery data rather than bitmaps, as was the case with Johnson et al. [25]. The reason for using GA’s in terms of fixed-length strings (“chromosomes”) rather than GP comes from our belief that a compact way to evolve a successful problem-solving strategy, rather than the elaborate code for implementing the same strategy, is computationally more efficient for now. It should be easier to craft behaviors, such as visual routines, when natural selection is confined to finding only optimal problem-solving strategies rather than writing the genetic instructions for actuating them, i.e., for expressing the phenotype.

VI. LEARNING AND DECISION TREES

Pattern recognition, a difficult but fundamental task for intelligent systems, depends heavily on the particular choice of the features used by the classifier. One usually starts with a given set of features and then attempts to derive an optimal subset of features leading to high classification performance. A standard approach involves ranking the features of a candidate feature set according to some criteria involving second-order statistics (ANOVA) and/or information-theory-based measures such as “infomax,” and then deleting lower-ranked features. Ranking by itself is usually insufficient because the criteria used do not capture possible nonlinear interactions among the features, nor do they measure the effectiveness of the selected features on the actual recognition task itself.

Performance (“fitness”) functions for feature subsets, based on information content, orthogonality, etc., can nevertheless provide baseline feedback for stochastic search methods, resulting in enhanced performance. As an example, GA’s can search the space of all possible subsets of a large set of candidate discrimination features, capture important nonlinear interactions, and thus improve on the quality of the feature subsets produced by ranking methods alone. The effectiveness of the selected features on the actual pattern recognition task can then be assessed by learning appropriate classifiers and measuring their observed performance.

The basic aim of any concept-learning symbolic system supporting pattern recognition and classification is to construct rules for classifying objects given a training set of objects whose class labels are known. The objects belong to only one class and are described by a fixed collection of attributes, each attribute with its own set of discrete values. The classification rules can be derived using C4.5, the most commonly used algorithm for the induction of decision trees (DT) [26]. The C4.5 algorithm uses the entropy as an information-theoretical discriminating measure for building the decision tree. The entropy is a measure of uncertainty (“ambiguity”) and characterizes the intrinsic ability of a set of features to discriminate between classes of different objects. The entropy $H(A)$ for a feature $A$ from the set is given by

$$H(A) = -\sum_{i=1}^{c}\sum_{j=1}^{v}\left[\frac{p_{ij}}{p_{ij}+n_{ij}}\log_{2}\frac{p_{ij}}{p_{ij}+n_{ij}} + \frac{n_{ij}}{p_{ij}+n_{ij}}\log_{2}\frac{n_{ij}}{p_{ij}+n_{ij}}\right]$$

where $c$ is the number of classes and $v$ is the number of distinct values that feature $A$ can take on, while $p_{ij}$ is the number of positive examples in class $i$ for which feature $A$ takes on its $j$th value. Similarly, $n_{ij}$ is the number of negative examples in class $i$ for which feature $A$ takes on its $j$th value. C4.5 determines in an iterative fashion the feature that is the most discriminatory, and then splits the data into two sets of classes as dichotomized by this feature. The next most significant feature of each of the subsets is then used to further split them, and the process is repeated recursively until each of the subsets contains only one kind of labeled (“class”) data. The resulting structure is called a decision tree, where nodes stand for feature discrimination tests while their exit branches stand for those subclasses of labeled examples satisfying the test. An unknown example is classified by starting at the root of the tree, performing the sequential tests, and following the corresponding branches until a leaf (terminal node) is reached, indicating that some class has been decided on. After decision trees are constructed, a tree-pruning mechanism is invoked. Pruning is used to reduce the effect of noise in the learning data. It discards some of the unimportant subtrees and retains those covering the largest number of examples. The tree obtained thus provides a more general description of the learned concept.
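To make the measure concrete, the following minimal sketch (our illustration, not the authors' code) computes the entropy-style measure as reconstructed above from the positive/negative counts; the symbol names ($p_{ij}$, $n_{ij}$) and the base-2 logarithm follow our notation rather than a specification given in the paper.

```python
import numpy as np

def feature_entropy(p, n):
    """Entropy-style ambiguity measure for one feature.
    p[i, j] / n[i, j]: counts of positive / negative examples in class i
    for which the feature takes on its j-th value (notation assumed above).
    Lower values indicate a more discriminating feature."""
    p = np.asarray(p, dtype=float)
    n = np.asarray(n, dtype=float)
    total = p + n
    with np.errstate(divide="ignore", invalid="ignore"):
        fp = np.where(total > 0, p / total, 0.0)
        fn = np.where(total > 0, n / total, 0.0)
        h = -(np.where(fp > 0, fp * np.log2(fp), 0.0)
              + np.where(fn > 0, fn * np.log2(fn), 0.0))
    # Sum over classes i and feature values j.
    return float(h.sum())

if __name__ == "__main__":
    # Two classes, three feature values: this feature separates (+) from (-) fairly well.
    p = [[9, 1, 0], [0, 2, 8]]
    n = [[1, 8, 1], [9, 3, 1]]
    print(round(feature_entropy(p, n), 3))
```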

Fig. 1. Adaptive pattern recognition using learning (DT) and evolution (GA’s).

Fig. 2. Eye location using a two-stage “where” and “what” architecture.

Fig. 1 shows how the interplay among learning, using decision trees, and evolution, using GA’s, takes place in the context of pattern recognition.

VII. EYE LOCATION ARCHITECTURE

Active perception works by systematically organizing the visual tasks in such a manner that, as visual processing progresses in time, the volume of data to attend to is reduced, and computing resources are focused only on salient regions of the image. The adaptive eye location approach seeks first where salient things are, and then what their identity is. Specifically, the eye location architecture, shown in Fig. 2, involves: 1) the derivation of the saliency attention map, and 2) the possible classification of salient locations as eye regions. The saliency (“where”) map is derived using a consensus between navigation routines encoded as FSA exploring the facial landscape and evolved using GA’s. The classification (“what”) stage is concerned with the optimal selection of features and the derivation of DT (decision trees) for the confirmation of eye classification using GA’s. The saliency (“where”) and recognition (“what”) components are briefly described in Sections VII-A and VII-B, respectively, while their specific implementation is deferred to Section VIII.

A. Derivation of the Saliency Map

The derivation of the saliency map (Fig. 3) is addressed in terms of the tasks involved, the computational models, and the corresponding functionalities.


Fig. 3. Architecture of the “where” stage—the saliency map.

The tasks involved include feature extraction, the derivation of conspicuous (“paths”) maps, and the integration of outputs from several visual routines. The corresponding computational model involves the mean, standard deviation, and entropy as feature maps; FSA evolved using GA’s forage the feature maps and derive conspicuous paths, while consensus methods integrate the conspicuous paths into the final saliency map.

B. Eye Recognition

Once the saliency map is derived, the eye recognition component has to decide whether the most salient patches correspond to actual eye locations. Toward that end, the eye recognition component implements a pattern recognition classifier whose task is to make the binary decision between eye versus noneye locations. The eye classifier is implemented as a decision tree whose adaptive derivation takes advantage of the feedback provided by evolution on how to choose an optimal feature subset (see Fig. 1). Specifically, GA’s use the classification performance provided by the decision tree as the feedback needed to bias the search for an optimal feature subset and improve the population’s (“DT’s”) average fitness.

VIII. IMPLEMENTATION

We first describe briefly the data preparation stage in terms of the database used, and the face detection stage preceding the location of facial landmarks such as the eyes. We then address in detail the implementation of eye location, consisting of two stages: 1) the “where” stage for the derivation of the conspicuity and saliency attention maps, and 2) the “what” stage for eye recognition (see Fig. 2 and Section VII). The facial imagery used is drawn from the FERET database, which has been developed over the last several years and has become a standard test bed for face recognition applications [3]. The FERET database consists of 1564 sets comprising 14 126 images. It contains 1199 individuals, possibly wearing glasses. Since large amounts of the face images were acquired during different photo sessions, the lighting conditions, the size of the facial images, and their orientation can vary. In particular, as the frontal poses are slightly rotated to the right or left, the results reported later show that the eye location technique is robust to such slight rotations. The diversity of the FERET database is across gender, race, and age, and it includes 365 duplicate sets taken at different times.

Face processing starts with the detection of face patterns using FERET frontal images taken at a resolution of 256 × 384. The approach used for face detection [27] (Fig. 4) involves three main steps: those of location, cropping, and postprocessing.

The first step finds a rough approximation for the possible location of the face box, the second step refines the box, while the last step normalizes the face image. The first step locates BOX1 surrounding the face, using a simple but fast algorithm in order to focus processing for the next cropping step. The location step consists of three components: 1) histogram equalization, 2) edge detection, and 3) analysis of projection profiles. The cropping step labels each 8 × 8 window from BOX1 as face or nonface using decision trees (DT). The DT are induced (“learned”) from both positive (“face”) and negative (“nonface”) examples expressed in terms of features such as the mean, standard deviation (s.d.), and entropy. The labeled output of the cropping step is postprocessed to derive the face BOX2 in terms of straight boundaries, while the last step normalizes BOX2 and yields a uniformly sized face BOX3 of size 192 × 192 (see Fig. 5), where eye location will take place.
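The projection-profile analysis of the location step lends itself to a compact illustration. The sketch below is ours, not the authors' code: it builds a crude edge map, sums it along rows and columns, and thresholds the two profiles to obtain a rough face box; the gradient-based edge detector and the threshold fraction are assumptions, since the paper does not specify them.

```python
import numpy as np

def face_box_from_profiles(gray, frac=0.2):
    """Rough face box (BOX1) from projection profiles of an edge map.
    The simple difference-based edge detector and the `frac` threshold are
    assumptions; the paper only states that edge detection and profile
    analysis are used."""
    g = np.asarray(gray, dtype=float)
    # Crude edge magnitude from horizontal and vertical differences.
    edges = np.abs(np.diff(g, axis=0, prepend=g[:1])) + \
            np.abs(np.diff(g, axis=1, prepend=g[:, :1]))
    row_profile = edges.sum(axis=1)
    col_profile = edges.sum(axis=0)

    def span(profile):
        idx = np.where(profile > frac * profile.max())[0]
        return (int(idx[0]), int(idx[-1])) if idx.size else (0, len(profile) - 1)

    (top, bottom), (left, right) = span(row_profile), span(col_profile)
    return top, bottom, left, right

if __name__ == "__main__":
    img = np.zeros((384, 256))
    # Textured block standing in for a face region on a 256 x 384 frontal image.
    img[100:300, 60:200] = np.random.default_rng(1).random((200, 140)) * 255
    print(face_box_from_profiles(img))
```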

A. “Where” Stage and the Saliency Map

The tasks involved include feature extraction, the derivation of conspicuous (“paths”) maps, and the integration of outputs from several visual routines. The corresponding computational model (“colony of animats”) defines the mean, standard deviation, and entropy as feature maps; finite-state automata (FSA) evolved using GA’s forage the feature maps and derive conspicuous paths, while consensus integrates the (conspicuous) paths into the final saliency map. Note that different FSA’s are evolved for finding salient left and right eye regions.

1) Feature Maps: The input (BOX3) consists of detected face images whose resolution is 192 × 192 using 8-bit gray levels. To account for illumination changes, the original images are preprocessed using 5 × 5 Laplacian masks. The Laplacian filters out small changes due to illumination, and detects those image transitions usually associated with changes in image contents or contrast. Three feature maps corresponding to the mean, standard deviation (s.d.), and entropy are then computed over 6 × 6 nonoverlapping windows, and then compressed using quantization to yield three 32 × 32 feature maps, each encoded using only four gray levels (2 bits). The original facial images are also resized to the resolution of 32 × 32 for display purposes. Examples of such feature maps are shown in Fig. 6.
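As an illustration of the feature-map computation just described, the following sketch assumes a particular 5 × 5 Laplacian mask and a histogram-based entropy estimate, neither of which is specified in the paper: it filters a 192 × 192 face, computes mean, standard deviation, and entropy over 6 × 6 nonoverlapping windows, and quantizes each resulting 32 × 32 map to four gray levels.

```python
import numpy as np
from scipy.ndimage import convolve

def feature_maps(face, win=6, levels=4):
    """Mean, standard-deviation, and entropy maps over non-overlapping win x win
    windows of a Laplacian-filtered face image, quantized to `levels` gray levels.
    The mask (24 at the center, -1 elsewhere) and the 16-bin entropy estimate
    are assumptions."""
    face = np.asarray(face, dtype=float)
    lap = -np.ones((5, 5)); lap[2, 2] = 24.0          # zero-sum Laplacian-style mask
    filtered = convolve(face, lap, mode="nearest")

    rows, cols = filtered.shape[0] // win, filtered.shape[1] // win
    mean = np.zeros((rows, cols)); sd = np.zeros((rows, cols)); ent = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            patch = filtered[r*win:(r+1)*win, c*win:(c+1)*win]
            mean[r, c], sd[r, c] = patch.mean(), patch.std()
            hist, _ = np.histogram(patch, bins=16)
            p = hist[hist > 0] / hist.sum()
            ent[r, c] = -(p * np.log2(p)).sum()

    def quantize(m):  # compress each map to 2 bits (4 levels)
        lo, hi = m.min(), m.max()
        return np.zeros_like(m, dtype=int) if hi == lo else \
               np.minimum((levels * (m - lo) / (hi - lo)).astype(int), levels - 1)

    return quantize(mean), quantize(sd), quantize(ent)

if __name__ == "__main__":
    face = np.random.default_rng(2).integers(0, 256, size=(192, 192)).astype(float)
    for m in feature_maps(face):
        print(m.shape, m.min(), m.max())   # three 32 x 32 maps with values 0..3
```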

2) Finite-State Automata: The FSA implementation explores the previously derived 32 × 32 feature maps in order to generate trajectories consisting of conspicuous points on the path to salient eye locations. The FSA (“animat”) starts to search toward the eye from a predefined initial point, the center of the bottom line of the resized (32 × 32) face box, corresponding approximately to the center of the chin region. The basic FSA encoding, string-like, resembles a chromosome, and the full structure of the FSA is evolved using GA’s. If PS and NS stand for the present and next state, respectively, INPUT is drawn from the feature maps, and ACTION stands for some specific animat move, then the FSA is fully defined once one derives the transition function: {PS, INPUT} → {NS, ACTION}. The FSA is assumed to start from some initial state IS. The FSA, exploring the 32 × 32 facial landscape, the original face image, and the corresponding three feature maps, consists of eight states.


Fig. 4. Face location.

Fig. 5. Examples of face images.

As the animat (“FSA”) moves across the face, it senses (“forages”) the three precomputed feature maps, whose composite range is encoded using 6 bits (3 feature maps × 2 bits per feature = 64-level INPUT subfield). As measurements are taken, the animat decides on its next state and an appropriate course of action (“move”). As shown in Fig. 7(a), both the present state (PS) and the composite feature being sensed are represented using eight state fields ⟨0⟩–⟨7⟩ and 64 corresponding (feature) INPUT subfields ⟨0⟩–⟨63⟩. The transition (FSA) function then defines the next state (NS) and the move [see Fig. 7(b)]. The animat is prevented from moving backward by design (no backward moves are allowed), and it can choose from five possible lateral or forward directions for a total of eight possible moves [Fig. 7(c)]. Two of the moves are sideways (left and right), while two moves each are allocated to left 45°, straight on, and right 45°. The pointer to the initial state [Fig. 7(a)] and the shaded blocks from the state transition table [Fig. 7(b)], corresponding to the next state and move, are subject to learning through evolution, as described next. The fitness driving evolution is defined as the ability of the LEFT or RIGHT FSA to find its way to the left or right eye within a limited number of moves (fewer than 64) and to home in on the eye within two pixels of the eye center. The GA component is implemented using GENESIS [28], and the standard default parameter settings from GENESIS were used. This results in a constant population size of 50 FSA chromosomes, a two-point crossover rate of 0.6, and a mutation rate of 0.001. As FSA’s become more fit, the LEFT and RIGHT animats eventually learn to move in a trajectory that crosses the eye region. The animats themselves have no capability to “localize” the eyes; this is done by the separate classification (“what”) procedure (Section VIII-B).
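A minimal sketch of the FSA animat described above is given below. The chromosome layout, the decoding of the eight moves into grid offsets, the bit-packing order of the 6-bit composite input, the boundary clamping, and the stopping rule are all our assumptions, since the paper specifies the encoding only at the level of Fig. 7.

```python
import numpy as np

N_STATES, N_INPUTS, N_MOVES = 8, 64, 8
# Assumed (row, col) offsets for the eight moves: two sideways, and two each for
# left-45, straight ahead, and right-45; no backward moves, as stated in the paper.
MOVES = [(0, -1), (0, 1),
         (-1, -1), (-1, -1), (-1, 0), (-1, 0), (-1, 1), (-1, 1)]

def random_chromosome(rng):
    """FSA chromosome: an initial state plus a transition table mapping
    (present state, composite input) -> (next state, move)."""
    return {
        "init_state": int(rng.integers(N_STATES)),
        "next_state": rng.integers(N_STATES, size=(N_STATES, N_INPUTS)),
        "move": rng.integers(N_MOVES, size=(N_STATES, N_INPUTS)),
    }

def composite_input(maps, pos):
    """Pack the three 2-bit feature-map values at `pos` into one 6-bit INPUT (0..63)."""
    r, c = pos
    m0, m1, m2 = (int(m[r, c]) for m in maps)
    return (m0 << 4) | (m1 << 2) | m2

def run_animat(chrom, maps, start, max_moves=64):
    """Walk one animat over the 32 x 32 landscape and return its trajectory."""
    rows, cols = maps[0].shape
    state, pos, path = chrom["init_state"], start, [start]
    for _ in range(max_moves):
        x = composite_input(maps, pos)
        # Both lookups use the present state (PS) and the sensed INPUT.
        state, move = int(chrom["next_state"][state, x]), int(chrom["move"][state, x])
        dr, dc = MOVES[move]
        pos = (min(max(pos[0] + dr, 0), rows - 1), min(max(pos[1] + dc, 0), cols - 1))
        path.append(pos)
        if pos[0] == 0:        # reached the upper boundary of the face box
            break
    return path

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    maps = [rng.integers(0, 4, size=(32, 32)) for _ in range(3)]
    chrom = random_chromosome(rng)
    start = (31, 16)           # center of the bottom line, roughly the chin
    print(len(run_animat(chrom, maps, start)), "positions visited")
```

Evolving such chromosomes with a GA (population 50, two-point crossover 0.6, mutation 0.001, as quoted above) would then score each animat by whether its trajectory passes within two pixels of the target eye.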

3) Derivation of the Saliency Map Using Consensus: The goal of the derivation of the saliency map is to screen out most of the facial landscape as possible eye locations so that the eye recognition stage can operate on fewer, but more promising, data. Note that the derivation of the saliency map takes place only during testing. For robustness purposes, 20 LEFT and 20 RIGHT animats (FSA’s) search the face landscape in parallel while starting from five closely spaced points within the chin region. As a consequence, one has to collect and integrate their conspicuous (“paths”) outputs to eventually determine the most salient eye locations. The motivation for the parallel search approach also comes from the fact that, if one were to let loose trained animats, it is likely that areas of increased traffic, subject to face modeling constraints, would correspond to the eye regions. To realize the parallel search approach, we trained different animats on the same task, LEFT and RIGHT eye location, using random seeds to initialize the FSA model described earlier. Once the (L and R) animats end their travel on the upper boundary of the face image, the L and R traffic densities across the facial landscape are collected, and one generates the (L-R) traffic map with the expectation that the eyes will show up strongly, thus indicating increased image saliency, the nose regions will cancel out, and other facial areas will show only insignificant strength. The consensus method then consists of the following steps. Left and right path traffics are counted for a number of different LEFT and RIGHT animats and several adjacent starting locations, the (L-R) traffic map is generated, and its significant local maxima are then detected using hysteresis and thresholding. This procedure, illustrated step by step in Fig. 8, shows how the colony of animats detects salient eye locations on an unseen face image using 20 LEFT and 20 RIGHT trained animats starting at five nearby chin locations and consensus.
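The consensus step can be sketched as follows (our reading, not the authors' code): trajectories are accumulated into LEFT and RIGHT traffic maps, the (L-R) map is formed (here as an absolute difference), and significant local maxima are kept using a simple hysteresis test. The hysteresis fractions and the 8-neighborhood maximum test are assumptions, since the paper states only that hysteresis and thresholding are used.

```python
import numpy as np

def traffic_map(paths, shape=(32, 32)):
    """Count how often animat trajectories visit each cell of the face landscape."""
    t = np.zeros(shape, dtype=float)
    for path in paths:
        for r, c in path:
            t[r, c] += 1.0
    return t

def salient_points(left_paths, right_paths, low=0.3, high=0.6):
    """Consensus: build the (L-R) traffic map and keep significant local maxima."""
    diff = np.abs(traffic_map(left_paths) - traffic_map(right_paths))
    peak = diff.max()
    if peak == 0:
        return []
    strong, weak = diff >= high * peak, diff >= low * peak
    points = []
    rows, cols = diff.shape
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            nbhd = diff[r-1:r+2, c-1:c+2]
            # Keep weak local maxima only if they touch a strong response (hysteresis).
            if diff[r, c] == nbhd.max() and weak[r, c] and strong[r-1:r+2, c-1:c+2].any():
                points.append((r, c))
    return points

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    def make_paths():
        return [[(int(r), int(c)) for r, c in zip(rng.integers(0, 32, 40),
                                                  rng.integers(0, 32, 40))]
                for _ in range(20)]
    print(len(salient_points(make_paths(), make_paths())), "salient points")
```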

B. “What” Stage—Eye Recognition

The first step in applying GA’s to the problem of feature selection for eye recognition is to map the pattern space into a representation suitable for genetic search. Since the main interest is in representing the space of all possible subsets of the original feature list, the simplest form for image base representations considers each feature as a binary gene. Each individual chromosome is then represented as a fixed-length binary string standing for some subset of the original feature list. A chromosome corresponds to an n-dimensional binary feature vector, where each bit represents the absence or inclusion of the associated feature. The advantage of this representation is that the classical GA operators described earlier (crossover and mutation) can be easily applied to it without any modification. Eye recognition is now activated only on salient eye locations using the hybrid GA–DT architecture (see Fig. 1).
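A minimal sketch of this binary-gene representation and of the classical operators mentioned above follows; the population size matches the GENESIS default quoted earlier, while the stand-in feature matrix and the helper names are ours.

```python
import numpy as np

def decode(chromosome, feature_matrix):
    """A chromosome is a fixed-length binary string; bit i includes/excludes feature i."""
    mask = np.asarray(chromosome, dtype=bool)
    return feature_matrix[:, mask], int(mask.sum())

def crossover(parent_a, parent_b, rng):
    """Two-point crossover on the binary feature mask."""
    i, j = sorted(rng.choice(len(parent_a), size=2, replace=False))
    child = parent_a.copy()
    child[i:j] = parent_b[i:j]
    return child

def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    flips = rng.random(len(chromosome)) < rate
    return np.where(flips, 1 - chromosome, chromosome)

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    n_features = 147
    pop = rng.integers(0, 2, size=(50, n_features))   # population of 50, as in the paper
    X = rng.random((600, n_features))                  # stand-in feature matrix
    subset, k = decode(pop[0], X)
    child = mutate(crossover(pop[0], pop[1], rng), 0.001, rng)
    print(subset.shape, k, child.shape)
```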



Fig. 6. Mean, standard deviation, and entropy feature maps.

Fig. 7. FSA animat: (a) chromosome, (b) transition table, and (c) moves.

Fig. 8. Saliency derivation through traffic consensus.

The strategy used to learn DT able to recognize the eyes is conceptually akin to cross validation (CV) and bootstrapping, as it draws randomly from the original list of face region examples, both positive (“eye”) and negative (“noneye”), and it generates two labeled sets consisting of training and tuning data. The training data are used to induce trees, while the tuning data are used to evaluate the trees and generate appropriate fitness measures in terms of eye recognition accuracy and the number of eye features used. The fitness measure is fed back to the evolution module in order to generate the next subset of eye features. Once the evolution stops, the best tree is frozen for use when animats are actuated on the salient eye locations found earlier. The whole process is initialized by randomly selecting some subset of eye features, and making them available to the decision tree evaluation module. In order to implement the genetic algorithm and decision tree architecture, we use GENESIS (genetic search implementation system) [28] and C4.5 [29], respectively.
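The fitness computation described above can be sketched as follows, with scikit-learn's DecisionTreeClassifier standing in for C4.5 and an assumed weighting between tuning-set accuracy and feature-subset compactness; the paper reports both terms but not how they are combined.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def fitness(chromosome, X, y, n_total, accuracy_weight=0.9, rng_seed=0):
    """Fitness of a feature subset: induce a tree on training data, evaluate it on
    held-out tuning data, and reward both accuracy and compactness of the subset.
    DecisionTreeClassifier is a stand-in for C4.5; the weighting is an assumption."""
    mask = np.asarray(chromosome, dtype=bool)
    if not mask.any():
        return 0.0
    X_tr, X_tu, y_tr, y_tu = train_test_split(
        X[:, mask], y, test_size=0.5, random_state=rng_seed, stratify=y)
    tree = DecisionTreeClassifier(random_state=rng_seed).fit(X_tr, y_tr)
    accuracy = tree.score(X_tu, y_tu)                  # eye recognition accuracy on tuning data
    compactness = 1.0 - mask.sum() / n_total           # reward smaller feature subsets
    return accuracy_weight * accuracy + (1.0 - accuracy_weight) * compactness

if __name__ == "__main__":
    rng = np.random.default_rng(6)
    X = rng.random((600, 147))                         # 120 eye + 480 noneye examples (placeholder values)
    y = np.r_[np.ones(120, dtype=int), np.zeros(480, dtype=int)]
    chrom = rng.integers(0, 2, size=147)
    print(round(fitness(chrom, X, y, n_total=147), 3))
```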

We now describe in detail how one learns the DT for eye recognition. The facial images are normalized to 64 × 64 resolution from the original face BOX3 images of size 192 × 192, based on our determination that 64 × 64 is the minimum size required to induce accurate learning.


Fig. 9. Feature extraction for eye recognition decision trees.

For consistency purposes, the salient locations found earlier on 32 × 32 images are adjusted accordingly to fit the increased dimension where eye recognition takes place. The database for training the DT’s consists of 120 positive (+) eye examples and 480 negative (−) noneye examples, manually cropped and randomly drawn from across the facial landscape. The resolution for both the positive and negative eye examples is 24 × 16. The feature list for each 24 × 16 window consists initially of 147 features. The set of features, each of them measured over 6 × 4 windows using a two-pixel overlap, is (see Fig. 9)

• the means for each window;
• the entropies for each window;
• the means for each window preprocessed using the Laplacian.

The set of 600 examples, 120 (+) eye and 480 (−) noneye examples, is divided into three equal subsets for cross-validation (CV) training and tuning. Once each CV round is completed, its results are passed back as feedback to the corresponding GA module. Note that, as the CV procedure involves additional tuning beyond training the DT, the feedback passed back to evolution as a result of local hill climbing is more than just a simple fitness measurement. The corresponding error rates on the tuning data, including both false positives (false detection of eyes) and false negatives (missed eyes), are shown as fitness measures in Fig. 10 as a function of the generation number for each of the three CV rounds. The feature subset corresponding to the tree derived from the third CV round, which achieved the smallest error rate (4.87%) after 1000 generations, consists of only 60 of the original 147 features. This feature subset is the one used to evaluate the overall performance on the eye location task using all of the candidates suggested by the saliency maps.

Due to the coarse resolution of the saliency map, the “what” (eye recognition) stage might have to process salient but adjacent locations, and as a consequence, multiple but redundant eye locations are found. To account for this effect, postprocessing using winner-take-all (WTA) is employed following the eye recognition stage. The WTA procedure first clusters adjacent eye candidates and finds the cluster centers. The eye locations that do not meet the expected pairwise (horizontal and vertical) ocular distances are then discarded.
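A minimal sketch of this WTA postprocessing follows; the merge radius and the admissible horizontal/vertical interocular distances are illustrative values, not taken from the paper.

```python
import numpy as np

def wta_postprocess(candidates, merge_radius=3, min_dx=8, max_dx=40, max_dy=6):
    """Cluster adjacent eye candidates, keep cluster centers, and retain only pairs
    whose horizontal/vertical separations look like an eye pair. The merge radius
    and the distance bounds are assumptions (pixel units on the 64 x 64 face)."""
    clusters = []
    for r, c in candidates:
        for cl in clusters:
            # Merge the candidate into a nearby running cluster (running mean test).
            if abs(cl["r"] / cl["n"] - r) <= merge_radius and \
               abs(cl["c"] / cl["n"] - c) <= merge_radius:
                cl["r"] += r; cl["c"] += c; cl["n"] += 1
                break
        else:
            clusters.append({"r": float(r), "c": float(c), "n": 1})
    centers = [(cl["r"] / cl["n"], cl["c"] / cl["n"]) for cl in clusters]

    # Keep only centers that can be paired as (left eye, right eye).
    keep = set()
    for i, (r1, c1) in enumerate(centers):
        for j, (r2, c2) in enumerate(centers[i + 1:], start=i + 1):
            if abs(r1 - r2) <= max_dy and min_dx <= abs(c1 - c2) <= max_dx:
                keep.update((i, j))
    return [centers[k] for k in sorted(keep)]

if __name__ == "__main__":
    cands = [(20, 18), (21, 19), (20, 45), (40, 30)]   # two eye clusters plus one stray
    print(wta_postprocess(cands))
```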

Fig. 10. Fitness measures related to feature selection for eye classification.

Fig. 11. Conspicuous paths leading to the left and right eye location found during training.

IX. EXPERIMENTAL RESULTS

We use ten face images drawn from the output of the face detection module to train 20 LEFT and 20 RIGHT FSA’s for deriving the saliency (“where”) map. It takes on the order of 2000 generations before evolution succeeds in generating such an FSA, i.e., 100% performance in terms of animat trajectories starting from the chin area and passing by the eye within two pixels. Fig. 11 shows the paths followed by the animats searching for LEFT or RIGHT eye locations using some of the training images.

During testing, the animats detect salient eye locations on an unseen but already detected face image using 20 LEFT and 20 RIGHT trained FSA’s starting at five nearby chin locations and consensus.

As can be seen from Fig. 12, while all of the eye locations are correctly identified as salient, several false positive eye locations have shown up as well. The 24 × 16 windows, centered on the salient eye locations, are now used to verify the actual presence of the eyes, and to possibly discard the false positive salient locations found earlier. The DT’s are learned using 120 eye and 480 noneye examples drawn from 60 face images for training the eye (“what”) recognition stage (see Section VIII-B). The results of applying a DT to the salient locations are shown in Fig. 13, and one can see that all of the false positive salient locations have now been discarded. As discussed earlier, due to coarse resolution, multiple eye recognitions can still take place at adjacent locations. Postprocessing using WTA, as explained earlier, produces the final results shown in Fig. 14.


Fig. 12. Salient eye locations.

Fig. 13. Verification of eye location candidates.

Fig. 14. Final results for eye location.

One can now see in Fig. 14 that all of the eye locations have been correctly and uniquely identified. On the test dataset of 20 face images, only two eye locations were missed, and no false positives were identified. One should note that the eye location is accurate to within two pixels of the center of the eye for 64 × 64 face images. This accuracy is sufficient for normalization purposes, where the distance between the two eyes and the orientation of the line connecting the two eyes with respect to the horizontal are used to scale and rotate the face image to a standard frontal size.
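For completeness, the normalization step just described can be sketched as follows: given the two detected eye centers, one recovers the scale factor and in-plane rotation needed to bring the face to a standard frontal size. The target interocular distance used here is an arbitrary example value, not taken from the paper.

```python
import numpy as np

def normalization_params(left_eye, right_eye, target_distance=28.0):
    """Scale and in-plane rotation from the two detected eye centers (row, col).
    `target_distance` is an illustrative value for the standard interocular distance."""
    (r1, c1), (r2, c2) = left_eye, right_eye
    dx, dy = c2 - c1, r2 - r1
    distance = float(np.hypot(dx, dy))
    angle = float(np.degrees(np.arctan2(dy, dx)))   # tilt of the inter-eye line
    scale = target_distance / distance
    return scale, angle

if __name__ == "__main__":
    print(normalization_params((20, 18), (21, 45)))
```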

To assess the benefits, if any, of our approach against standard eye location methods, we also ran the same eye location experiment using exhaustive search over the entire face image. The same classification method using DT is used, but now the derivation of the saliency attention map is omitted. The same DT classifier, derived in the “what” stage (Section VIII-B), is applied directly to 24 × 16 windows with a shift of 4 × 4 pixels for eye recognition. The test results, using the same 20 face images as before, yield only 87.5% accuracy: one eye is missed in three images and both eyes in one image. One can explain the loss in performance as a result of considering more candidates for eye location, and not being able to focus on a salient set of face locations. Regarding computational resources, the visual eye location routine reduces the set of eye candidates from 143 windows to an average of 9 windows. Since the FSA navigation and traffic count are extremely fast compared to the time it takes to run the DT classification module, our novel eye location routine enjoys a 12 : 1 time (“speed”) advantage over exhaustive search.

X. CONCLUSION

This paper uses eye location as a test bed for developing navigation routines implemented as visual routines (VR) driven by adaptive behavior-based AI. While there are many eye location methods, our technique is the first to approach such a location task using navigational routines, and to automate the derivation of such routines using evolution and learning rather than handcrafting them. Experimental results, using facial image data, show the feasibility of our two-stage “where” and “what” approach for eye location. The advantage of our architecture is its ability to filter out unlikely eye locations using the inexpensive FSA animat, and to limit the use of the more expensive decision tree (DT) classifier; “where” is cheaper to implement than “what.” One limitation of our current implementation is that it requires the chin location as the starting point for the FSA animat. One extension for future research would consider starting the animat from any location on the facial landscape and assigning it the task of locating the eye.

Our adaptive and hybrid approach also seems suitable for the location of other facial landmarks, for the adaptive development of task-driven active perception and navigational mechanisms, and for the implementation of the collective behavior of a multiagent society. The method described also has the potential to locate landmarks of interest in a scale-invariant fashion, as the FSA would simply iterate further on the same step when the scale is increased.

Our experiments support the Baldwin conjecture that evolution and learning are complementary adaptive processes, and that their hybrid use is mutually beneficial. The Baldwin conjecture has been tested with regard to both the accuracy and the number of features needed to evolve successful classifiers, such as decision trees (DT’s) for eye recognition. Specifically, we have observed that an increased amount of learning, i.e., a larger number of examples used to induce the DT’s, leads to: 1) smaller optimal feature sets, and 2) reduced error rates [27] during evolution.

REFERENCES

[1] A. Samal and P. Iyengar, “Automatic recognition and analysis of human faces and facial expressions: A survey,” Pattern Recognit., vol. 25, pp. 65–77, 1992.

[2] R. Bajcsy, “Active perception,” Proc. IEEE, vol. 76, no. 8, pp. 996–1005, Aug. 1988.

[3] P. J. Phillips, H. Wechsler, J. Huang, and P. Rauss, “The FERET database and evaluation procedure for face-recognition algorithms,” Image Vision Comput. J., vol. 16, no. 5, pp. 295–306, 1998.

[4] A. P. Pentland, B. Moghaddam, and T. Starner, “View-based and modular eigenspaces for recognition,” in Proc. Comput. Vision Pattern Recognit., Seattle, WA, 1994.

[5] F. S. Samaria and A. C. Harter, “Parameterization of a stochastic model for human face identification,” in Proc. 2nd IEEE Workshop Appl. Comput. Vision, Sarasota, FL, 1994.

[6] A. L. Yuille, “Deformable templates for face recognition,” J. Cognitive Neurosci., vol. 3, no. 1, pp. 59–70, 1991.

[7] K. M. Lam and H. Yan, “Locating and extracting the eye in human face images,” Pattern Recognit., vol. 29, no. 5, pp. 771–779, 1996.

[8] Y. Yeshurun, D. Reisfeld, and H. Wolfson, “Symmetry: A context free cue for foveated vision,” in Neural Networks for Pattern Recognition, H. Wechsler, Ed. New York: Academic, 1991, vol. 1.

[9] Y. Gofman and N. Kiryati, “Detecting symmetry in gray level images: The global optimization approach,” in Proc. ICPR, Vienna, Austria, 1996, pp. 889–894.

[10] P. Maes, “Behavior-based AI,” in From Animals to Animats 2, J. A. Meyer, H. L. Roitblat, and S. W. Wilson, Eds. Cambridge, MA: M.I.T. Press, 1992.

[11] S. Ullman, “Visual routines,” Cognition, vol. 18, pp. 97–159, 1984.

[12] I. Horswill, “Visual routines and visual search,” in Proc. Int. Joint Conf. Artif. Intell., Montreal, Canada, 1995.

[13] R. A. Brooks, “Visual map making for a mobile robot,” in Proc. IEEE Int. Conf. Robot. Automat., 1985, pp. 819–824.

[14] B. Moghaddam and A. Pentland, “Probabilistic visual learning for object detection,” in Proc. 5th Int. Conf. Comput. Vision, Cambridge, MA, 1995.

[15] V. S. Ramachandran, “Apparent motion of subjective surfaces,” Perception, vol. 14, pp. 127–134, 1985.

[16] G. M. Edelman, Neural Darwinism. New York: Basic Books, 1987.

[17] A. Allport, “Visual attention,” in Foundations of Cognitive Science, M. I. Posner, Ed. Cambridge, MA: M.I.T. Press, 1989.

[18] C. Koch and S. Ullman, “Shifts in selective visual attention: Toward the underlying neural circuitry,” in Matters of Intelligence, L. M. Vaina, Ed. Amsterdam, The Netherlands: Reidel, 1987.

[19] G. E. Hinton and S. J. Nowlan, “How learning can guide evolution,” Complex Syst., vol. 1, pp. 495–502, 1987.

[20] H. Mühlenbein and J. Kindermann, “The dynamics of evolution and learning toward genetic neural networks,” in Connectionism in Perspective, R. Pfeifer, Z. Schreter, F. Fogelman-Soulie, and L. Steels, Eds. Amsterdam, The Netherlands: Elsevier Science, 1989, pp. 173–197.

[21] D. Whitley, V. S. Gordon, and K. Mathias, “Lamarckian evolution, the Baldwin effect, and function optimization,” in PPSN III, Y. Davidor, H.-P. Schwefel, and R. Manner, Eds. Berlin, Germany: Springer-Verlag, 1994, pp. 6–15.

[22] R. W. Anderson, “Other operators: The Baldwin effect,” in Handbook of Evolutionary Computation, T. Bäck, D. Fogel, and Z. Michalewicz, Eds. New York: Oxford Univ. Press, 1997, ch. 3, pp. 4.1–4.7.

[23] J. Bala, K. De Jong, J. Huang, H. Vafaie, and H. Wechsler, “Using learning to facilitate the evolution of features for recognizing visual concepts,” Evol. Comput., vol. 4, no. 3, pp. 297–312, 1997.

[24] L. J. Fogel, A. J. Owens, and M. J. Walsh, Artificial Intelligence Through Simulated Evolution. New York: Wiley, 1966.

[25] M. P. Johnson, P. Maes, and T. Darrell, “Evolving visual routines,” in Artificial Life IV, R. A. Brooks and P. Maes, Eds. Cambridge, MA: M.I.T. Press, 1994.

[26] J. R. Quinlan, “The effect of noise on concept learning,” in Machine Learning: An Artificial Intelligence Approach, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, Eds. San Mateo, CA: Morgan Kaufmann, 1986, pp. 149–166.

[27] J. Huang, S. Gutta, and H. Wechsler, “Detection of human faces using decision trees,” in Proc. 2nd Int. Conf. Automat. Face and Gesture Recognition (ICAFGR), Killington, VT, 1996.

[28] J. J. Grefenstette, L. Davis, and D. Cerys, Genesis and OOGA: Two Genetic Algorithm Systems. Melrose, MA: TSP, 1991.

[29] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.