Learning Situation Models in a Smart Home


IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 39, NO. 1, FEBRUARY 2009

Oliver Brdiczka, James L. Crowley, and Patrick Reignier

Abstract—This paper addresses the problem of learning situation models for providing context-aware services. Context for modeling human behavior in a smart environment is represented by a situation model describing the environment, its users, and their activities. A framework for acquiring and evolving the different layers of a situation model in a smart environment is proposed. Several learning methods are presented as part of this framework: role detection per entity, unsupervised extraction of situations from multimodal data, supervised learning of situation representations, and evolution of a predefined situation model with user feedback. The situation model serves as frame and support for the different methods, keeping the whole process within an intuitive declarative framework. The proposed methods have been integrated into a complete system for a smart home environment. The implementation is detailed, and two evaluations are conducted in the smart home environment. The obtained results validate the proposed approach.

Index Terms—Context awareness, human-centered computing, machine learning, situation modeling, situation split.

I. INTRODUCTION

SMART environments have enabled the computer observation of human (inter)action within the environment.

Computerized spaces and their devices require situational information, i.e., context [1], to respond correctly to human activity. In order to become context aware, computer systems must thus maintain a model describing the environment, its occupants, and their activities. Situations are semantic abstractions from low-level contextual cues that can be used for constructing such a model of the scene. The situation model [2] and the underlying concepts are motivated by models of human perception of behavior in the environment. Human behavior is described by a finite number of states called situations. These situations are characterized by entities playing particular roles and being in relation within the environment. Perceptual information from the different sensors in the environment is associated with the situations, roles, and relations. The different situations are connected within a network. A path in this network (called a script) describes behavior in the scene. System services to be provided are associated with the different situations in the network.

Manuscript received April 13, 2007; revised September 26, 2007. First published September 16, 2008; current version published January 15, 2009. This paper was recommended by Guest Editor A. Pentland.

O. Brdiczka is with the Palo Alto Research Center, Palo Alto, CA 94304 USA (e-mail: [email protected]).

J. L. Crowley is with the PRIMA Research Group, Institut National de Recherche en Informatique et en Automatique (INRIA) Rhône-Alpes, 38334 Saint Ismier Cedex, France, and also with the Institut National Polytechnique de Grenoble, 38031 Grenoble Cedex 1, France.

P. Reignier is with the PRIMA Research Group, Institut National de Recherche en Informatique et en Automatique (INRIA) Rhône-Alpes, 38334 Saint Ismier Cedex, France, and also with the University Joseph Fourier, 38041 Grenoble Cedex 9, France.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TSMCB.2008.923526

Fig. 1. Example of a simple situation model for a lecture room. Empty, Audience, and Lecture are the available situations. Lecturer and Audience are the available roles, and NotSameAs is the available relation.

Fig. 1 shows a simple example of a situation model for a lecture room. The situations “empty,” “lecture,” and “audience” are characterized by the roles “lecturer” and “audience” as well as the relation “NotSameAs.” System services can then be associated with the situations (e.g., the service “switch on projector” with the situation “lecture”).
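
To make this concrete, the lecture-room model of Fig. 1 can be sketched as a small data structure. This is an illustrative sketch only, not the paper's implementation; the class, the field names, and the transition network are hypothetical.

from dataclasses import dataclass

@dataclass(frozen=True)
class Situation:
    # A situation is characterized by roles, relations, and associated services.
    name: str
    roles: frozenset = frozenset()
    relations: frozenset = frozenset()
    services: tuple = ()

# Fig. 1 example: three situations, two roles, one relation, one associated service.
empty = Situation("Empty")
audience = Situation("Audience", roles=frozenset({"Audience"}))
lecture = Situation("Lecture",
                    roles=frozenset({"Lecturer", "Audience"}),
                    relations=frozenset({"NotSameAs(Lecturer, Audience)"}),
                    services=("switch on projector",))

# The situation network: a path through it (a script) describes behavior in the scene.
network = {empty: [audience], audience: [lecture, empty], lecture: [audience]}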

Human behavior and needs evolve over time. Hence, a context model representing the behavior and needs of the users must also evolve. Machine learning methods are necessary to acquire such a model from observation data and to adapt it according to the changing behavior and needs. System reasoning and behavior must, however, be kept transparent for the users. Hence, it is essential to operate on a human-understandable context model like the situation model, representing user behavior and needs as well as system service execution.

This paper proposes a framework for learning situation models. The objective is to build up and evolve a context model for providing context-aware services in a smart environment. The proposed framework consists of different machine learning methods that acquire and adapt a situation model with different levels of supervision. The approach has been implemented and evaluated in the smart home environment of the PRIMA research group.

II. PROBLEM DEFINITION AND APPROACH

Experts normally define and implement context models according to the needs of the users and the application (Fig. 2). Based on the user needs and the envisaged application, a human engineer specifies and implements the context model. Sensor perceptions, the context model, and the system services to be provided are associated manually.

Human behavior evolves over time. New activities and scenarios emerge in a smart environment, and others disappear. New services must be integrated into the environment, whereas



Fig. 2. Top-down manual specification and implementation of a context model.

Fig. 3. Bottom-up acquisition and evolution of a context model using an automatic framework.

obsolete services should be deleted. Thus, a fixed context model is not sufficient. Experts can construct and adapt context models according to the changing needs of users and application. However, experts are expensive and not always available. Moreover, the environment's intelligence lies in its ability to adapt its operation to accommodate the users. The research challenge is thus to develop machine learning methods for this process, making it possible to automatically acquire and evolve context models reflecting user behavior and needs in a smart environment (Fig. 3). Intelligibility [3] of the employed context model and of the reasoning process is essential if the users are to trust the system.

The proposed approach addresses the problem by providing an intelligible framework for acquiring and evolving an intuitively comprehensible context model of the scene, i.e., the situation model. The methods proposed as part of this framework acquire different layers of the situation model, with different levels of supervision (Fig. 3). The situation model serves as frame and support for the different learning methods, keeping the whole process within an intuitive declarative framework. First, roles are learned and detected by using support vector machines (SVMs)

Fig. 4. Smart room environment as shown by the wide-angle camera.

based on collected data labeled by an expert [4]. Situations are then extracted in an unsupervised manner from observation data using the Jeffrey divergence between sliding histograms [6]. The extracted situation segments can then be used to learn situation labels with user or expert input [5]. The resulting situation model can finally be evolved according to user feedback using the situation split [7].
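
As a concrete illustration of the unsupervised extraction step, the sketch below computes the Jeffrey divergence (the symmetrized Kullback-Leibler divergence) between histograms built over two adjacent sliding windows of discrete observation codes; peaks of this curve suggest situation boundaries. It is a minimal sketch assuming integer observation codes; the window size, smoothing, and peak detection used in [6] may differ.

import numpy as np

def histogram(codes, n_symbols, eps=1e-6):
    # Normalized, lightly smoothed histogram of discrete observation codes.
    h = np.bincount(codes, minlength=n_symbols).astype(float) + eps
    return h / h.sum()

def jeffrey_divergence(p, q):
    # Symmetrized KL divergence: J(p, q) = KL(p||q) + KL(q||p) = sum (p - q) * log(p / q).
    return float(np.sum((p - q) * np.log(p / q)))

def boundary_curve(codes, n_symbols, window=100):
    # Jeffrey divergence between the histograms of the windows just before and
    # just after each time point; high values indicate a likely situation change.
    codes = np.asarray(codes)
    curve = np.zeros(len(codes))
    for t in range(window, len(codes) - window):
        before = histogram(codes[t - window:t], n_symbols)
        after = histogram(codes[t:t + window], n_symbols)
        curve[t] = jeffrey_divergence(before, after)
    return curve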

III. IMPLEMENTATION

In this section, we describe our current implementation. The implementation is based on a 3-D tracking system that creates and tracks targets in our smart home environment. The extracted targets are used to detect individual roles per entity (Section III-B). By using the role values of several entities, observations are generated, which are the input for unsupervised situation extraction. The results of the extraction are used for supervised situation learning. The learned situation model is then the basis for the integration of user preferences, i.e., associating and changing services.

A. Smart Home Environment: Three-Dimensional Tracking System, Microphone Array, and Headset Microphones

The experiments described in the following sections are performed in our laboratory mockup of a living room environment in a smart home. The smart room is equipped with a wide-angle camera (Fig. 4) plus two other normal cameras mounted in the corners of the room. A microphone array mounted against the wall of the smart environment is used for noise detection. The speech of people in the scene is recorded by using headset microphones.

A 3-D video tracking system [8] detects and tracks entities (people) in the scene in real time using multiple cameras (Fig. 5). The tracker itself is an instance of basic Bayesian reasoning combining tracking results from several 2-D trackers [9] running on the video images of each camera. Each camera-detector pair runs on a dedicated processor. All interprocess communications are managed with an object-oriented middleware for service connection [10].

The output of the 3-D tracker is the position (x, y, z) of each detected target as well as the corresponding covariance matrix


Fig. 5. Three-dimensional video tracking system fusing the information of three 2-D trackers to a 3-D representation.

(3 × 3 matrix describing the form of the bounding ellipsoid of the target). Additionally, a velocity vector v can be calculated for each target.
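
The per-target tracker output can be grouped into a small record; the scalar speed used later by the role detection is simply the norm of the velocity vector. The field names below are illustrative, not those of the actual tracker interface.

from dataclasses import dataclass
import numpy as np

@dataclass
class Target:
    # One tracked entity as delivered by the 3-D tracker (illustrative field names).
    position: np.ndarray    # (x, y, z) in meters
    covariance: np.ndarray  # 3 x 3 matrix describing the bounding ellipsoid
    velocity: np.ndarray    # velocity vector v estimated from successive positions

    @property
    def speed(self) -> float:
        # Scalar speed |v| used by the role detection process.
        return float(np.linalg.norm(self.velocity))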

The microphone array is used for noise detection. Based on the energy of the audio streams, we determine whether there is noise in the environment or not (e.g., movement of objects on the table). A real-time speech activity detector [6] analyzes the audio stream of each headset microphone and determines whether the corresponding person speaks or not.
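
The internals of the noise detector are not detailed in the paper; a minimal energy-threshold sketch over short audio frames, with a hypothetical threshold, could look as follows.

import numpy as np

def frame_energy(samples):
    # Mean squared amplitude of one short audio frame.
    return float(np.mean(np.asarray(samples, dtype=float) ** 2))

def detect_noise(frames, threshold=0.01):
    # One boolean per frame: True if the ambient energy exceeds the threshold.
    # The threshold is a hypothetical value; in practice it would be calibrated
    # on silence recordings from the microphone array.
    return [frame_energy(f) > threshold for f in frames]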

The association of the audio streams (microphone number) to the corresponding entity (target) generated by the 3-D tracker is done at the beginning of each recording by a supervisor.

Ambient sound, speech detection, and 3-D tracking are synchronized. As the audio events have a much higher frame rate (62.5 Hz) than the video (up to 25 Hz), we add sound events (no sound, speech, and noise) to each video frame (of each entity).
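
One simple way to realize this, assuming timestamped events, is to annotate each video frame with the most recent sound event; the actual synchronization mechanism is not specified in the paper.

import bisect

def attach_sound_events(video_times, sound_times, sound_events, default="no sound"):
    # For each video frame timestamp, pick the latest sound event at or before it.
    # sound_times must be sorted; sound_events[i] is "no sound", "speech", or "noise".
    annotated = []
    for t in video_times:
        i = bisect.bisect_right(sound_times, t) - 1
        annotated.append(sound_events[i] if i >= 0 else default)
    return annotated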

B. Role Detection per Entity

Role detection is conducted per entity (person) and for each observation frame. The inputs are the extracted properties of each target (position (x, y, z), 3 × 3 covariance matrix, and speed |v|) provided by the 3-D tracking system. The output is one of the role labels (Fig. 8, bottom).

The role detection process consists of three parts (Fig. 6). The first part is based on SVMs. The first approach used only SVMs as a black-box learning method, without considering specific target properties. From the first results obtained in our smart home environment [4], we concluded that, in order to optimize role recognition, we need to reduce the number of classes as well as the target properties used for classification. Additional classes are determined by using specific target properties (speed and interaction distance) and expert knowledge (see parts 2 and 3 of the role detection process).

The first part of the process (Fig. 6, left) takes the covariance matrix values of each target as input. Trained SVMs detect, based on these covariance values, the basic individual roles, namely, “sitting,” “standing,” and “lying down” (Fig. 7).

Fig. 6. Role detection process: (left) SVMs, (middle) target speed, and (right) distance to interaction object.

Fig. 7. Basic individual roles, namely, “standing,” “lying down,” and “sitting,” detected by the SVMs.

The second part of the process (Fig. 6, middle) uses the speed value |v| of each target. Based on empirical values in our smart environment, we can then determine whether the speed of the target is zero, low, medium, or high.

The third part of the process (Fig. 6, right) uses the position (x, y, z) of each target to calculate the distance to an interaction object. In our smart environment, we are interested in the interaction with a table at a known position (the white table in Fig. 4). Thus, we calculate the distance d between the target and the table in the environment. If this distance approaches zero (or drops below zero), the target is interacting with the table.

The results of the different parts of the detection process are combined into roles following the schema in Fig. 8.
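
The exact combination rules of Fig. 8 are not reproduced here, but the three inputs they combine can be sketched as follows. The scikit-learn SVM, the speed thresholds, and the table geometry are assumptions for illustration, not the values used in the paper.

import numpy as np
from sklearn import svm  # assumed dependency; any SVM implementation would do

def train_posture_svm(covariance_features, posture_labels):
    # Part 1: SVM mapping covariance-matrix values to a basic individual role
    # ("sitting", "standing", "lying down"), trained on expert-labeled data [4].
    clf = svm.SVC(kernel="rbf")
    clf.fit(covariance_features, posture_labels)
    return clf

def speed_bucket(speed, low=0.2, medium=0.8, high=1.5):
    # Part 2: map the target speed |v| (m/s) to a symbolic value.
    # The thresholds are hypothetical; the paper uses empirical values.
    if speed < low:
        return "zero"
    if speed < medium:
        return "low"
    return "medium" if speed < high else "high"

def interacts_with_table(position, table_center, table_radius=0.6):
    # Part 3: True if the distance d between target and table approaches (or drops
    # below) zero; the table is approximated by a disc of hypothetical radius (m).
    d = np.linalg.norm(np.asarray(position[:2]) - np.asarray(table_center[:2]))
    return d - table_radius <= 0.0

def detect_role(posture_clf, covariance_features, speed, position, table_center):
    # Combine the three parts; the mapping to the 12 role labels follows Fig. 8.
    posture = posture_clf.predict([covariance_features])[0]
    return posture, speed_bucket(speed), interacts_with_table(position, table_center)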

Based on the role detection results as well as the ambient sound and speech detection, we derive multimodal observation codes for each entity created and tracked by the 3-D tracking system. Twelve individual role values (Fig. 8, bottom) are derived for each entity by the role detection process. Furthermore, the ambient sound detector indicates whether there is noise in the environment or not. The speech activity detector determines whether the concerned entity is speaking or not. This multimodal information is fused into 53 observation codes for each entity. Codes 1–13 (12 role values + 1 error code) are based on the role detection process. These 13 codes are combined with ambient sound detection (codes 27–39 and 40–52) and speech detection per entity (codes 14–26 and 40–52). As ambient sound and speech detection return binary values, 2² · 13 = 52 different code values are necessary to represent role, ambient sound, and speech detection. If we add an observation code value for a nonexistent entity (code 0), we get 53 different observation code values.
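
This numbering can be reproduced with a small encoding function. The formula below is an interpretation consistent with the stated ranges (roles in 1–13, speech shifting the code into 14–26, ambient sound into 27–39, both into 40–52, and 0 reserved for a nonexistent entity); the original implementation may differ.

N_ROLE_CODES = 13  # 12 role values + 1 error code

def entity_observation_code(role_code, speaking, ambient_noise, present=True):
    # Fuse role, speech, and ambient sound detection into one code in [0, 52].
    # role_code is in 1..13; speaking and ambient_noise are booleans;
    # code 0 is reserved for a nonexistent entity.
    if not present:
        return 0
    assert 1 <= role_code <= N_ROLE_CODES
    return role_code + N_ROLE_CODES * int(speaking) + 2 * N_ROLE_CODES * int(ambient_noise)

# Example: role 5 with speech and without ambient noise -> code 18.
assert entity_observation_code(5, True, False) == 18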

As we can have several persons involved in a situation, multimodal observations of several entities are fused by combining the individual observation codes (Fig. 9). The idea is to attribute a code to the combination of two multimodal entity observation codes (without considering their order). The resulting observation code fuses the observation codes of two (or more) entities. In order to fuse the observation codes of more than two entities,


Fig. 8. Schema describing the combination of basic individual role, speed, and distance values into roles (blue arrows refer to “no interaction distance with table;” red arrows refer to “interaction distance with table”).

Fig. 9. Fusion algorithm combining the multimodal observation values (a, b) of two entities. For maxcode = 52, the resulting codes are between 0 and 1430.

Fig. 10. Different parts of the implementation and their evaluation: role detection per entity, unsupervised situation extraction, supervised situation learning, and integration of user preferences.

the fusion can be applied several times, successively fusing all entities.
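
The exact pairing function of Fig. 9 is not given in the text; an order-independent triangular-number encoding that matches the stated range (0 to 1430 for maxcode = 52) is sketched below as an assumption.

def fuse_codes(a, b, maxcode=52):
    # Combine two single-entity observation codes into one multiperson code,
    # ignoring their order. With maxcode = 52 the result lies in [0, 1430].
    assert 0 <= a <= maxcode and 0 <= b <= maxcode
    hi, lo = (a, b) if a >= b else (b, a)  # sort so (a, b) and (b, a) fuse identically
    return hi * (hi + 1) // 2 + lo         # triangular-number pairing

# For more than two entities the fusion is applied successively, e.g.,
# fuse_codes(fuse_codes(c1, c2), c3, maxcode=1430), and so on.
assert fuse_codes(52, 52) == 1430 and fuse_codes(3, 7) == fuse_codes(7, 3)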

IV. EXPERIMENTAL EVALUATION AND RESULTS

In this section, we detail the experimental evaluations that have been conducted as well as the obtained results. We did two different evaluations (Fig. 10).

The aim of Evaluation A was to investigate the quality of one-person and multiperson situation learning and recognition using the proposed framework. Therefore, we recorded several small scenarios showing different situations like “presentation” or “siesta.” The recordings have been segmented automatically, and the situations have been learned by using the methods of the framework. We evaluate the recognition of the one-person and multiperson situations in the scenarios with and without automatic presegmentation. The aim of Evaluation B was to show and validate the combination of the three methods: unsupervised situation extraction, supervised situation learning, and integration of user preferences. Therefore, we recorded

Fig. 11. One-person situations, namely, “individual work” and “siesta” (left side), and multiperson situations, namely, “introduction,” “aperitif,” “presentation,” and “game.”

three long scenarios showing several situations like “aperitif,” “playing game,” or “presentation.” The recordings have first been automatically segmented. Then, the extracted segments have been labeled, and the situations have been learned. Finally, the learned situation model has been evolved with user feedback. We evaluate the recognition of the labeled situations as well as of the situation added via the situation split.

A. Evaluation A

In this section, we aim at investigating the quality of one-person and multiperson situation learning and recognition. Therefore, we made three different recordings of each of the following situations: “siesta,” “an individual working,” “aperitif,” “introduction/address of welcome,” “presentation,” and “playing a game” (Fig. 11). “Introduction/address of welcome,” “aperitif,” “presentation,” and “playing a game” involved two persons, whereas “siesta” and “individual work” concerned only one person. The role detection values have been generated as described in Section III-B. The sequences designated for learning are presegmented, i.e., only the segment containing the pure situation is used for learning. This means that, for recordings containing only one situation, disturbances at the beginning and the end of the recording are automatically removed


Fig. 12. Extracted segments for situation recordings “Aperitif 1,” “Aperitif 2,” and “Aperitif 3.” Segments at the beginning and the end of the recordings will be removed automatically.

TABLE I. Confusion matrix and information retrieval statistics for one-person situation detection without and with presegmentation. The total recognition rate is 93.33% (without presegmentation) and 96.67% (with presegmentation).

(see Fig. 12 for an example). The supervised learning scheme [5] is then used for learning the situation representations from the sequences. We adopt left–right hidden Markov models as the unique learner class for the situations.
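
To make the recognition step concrete, the sketch below scores an observation-code window against a left-right discrete HMM with the forward algorithm and picks the situation whose model explains the window best. Only the scoring side is shown; the number of states and the Baum-Welch training of [5] are assumptions left out here.

import numpy as np

def left_right_transitions(n_states, p_stay=0.6):
    # Transition matrix of a left-right HMM: each state may only stay or advance.
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        A[i, i] = p_stay
        A[i, min(i + 1, n_states - 1)] += 1.0 - p_stay
    return A

def log_likelihood(obs, A, B, start):
    # Forward algorithm in the log domain for a discrete-output HMM.
    # obs: sequence of observation codes; B[state, code]: emission probabilities.
    eps = 1e-12
    alpha = np.log(start + eps) + np.log(B[:, obs[0]] + eps)
    for o in obs[1:]:
        alpha = np.logaddexp.reduce(alpha[:, None] + np.log(A + eps), axis=0)
        alpha = alpha + np.log(B[:, o] + eps)
    return float(np.logaddexp.reduce(alpha))

def recognize(obs, situation_models):
    # situation_models maps a situation label to its trained (A, B, start) triple;
    # the recognized situation is the one with the highest likelihood.
    return max(situation_models, key=lambda s: log_likelihood(obs, *situation_models[s]))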

First, we did an evaluation of the situation detection for one person only (role detection value between 0 and 52). Situation recordings involving two people thus gave two one-person sequences. We did a threefold cross validation, taking two thirds of the sequences as input for supervised learning and the remaining third as the basis for recognition. Table I shows the results. The presegmentation improves the recognition results for the one-person recording sequences. In particular, “aperitif” and “game” can correctly be distinguished, whereas some wrong detections between “introduction” and “presentation” persist.

TABLE II. Confusion matrix and information retrieval statistics for two-person situation detection with and without presegmentation. The total recognition rate is 91.67% (without presegmentation) and 100.00% (with presegmentation).

Additionally, we did an evaluation of the situation detection for two-person situations. Therefore, multiperson observation codes have been generated from the individual role detection values. We again did a threefold cross validation on the situation recognition after a supervised situation learning of the given observation sequences. Table II shows the results. The presegmentation also improves the recognition results for the two-person recordings. As for the one-person situation detection, the situations “aperitif” and “game” can be correctly distinguished with presegmentation. The two-person observation fusion further eliminates wrong detections between “aperitif” and “game,” resulting in a correct situation recognition rate of 100% (Table II). The obtained results indicate that multiperson observation and presegmentation of observation streams are beneficial when learning and recognizing situations.

B. Evaluation B

In this section, we intend to show and validate the combination of the three methods: unsupervised situation extraction, supervised situation learning, and integration of user preferences. Therefore, we evaluated the integral approach on three scenarios recorded in our smart home environment. The scenarios involved up to two persons doing different activities (situations: “introduction/address of welcome,” “presentation,” “aperitif,” “playing a game,” and “siesta” {one person}) in the environment. The role detection values have been generated, as described in Section III-B, by using the 3-D tracker as well as noise and speech detection (headset microphones). The role detection values have then been fused into multiperson observations.

The first step of our proposed approach is to create the initial situation model. We extract the situations from the sensor perceptions, i.e., the observations generated for the targets in the scene, using our automatic segmentor [6]. The automatically extracted segments and the ground truth for the scenarios are


Fig. 13. Extracted situation segments and the corresponding ground truth for scenario 1 (Q = 0.68), scenario 2 (Q = 0.95), and scenario 3 (Q = 0.74).

shown in Fig. 13. The overall segmentation exactitude Q [11] is best for scenario 2. This can be explained by the fact that the algorithm has difficulty distinguishing the ground-truth segments “game” and “aperitif”: in scenarios 1 and 3, “game” and “aperitif” are detected as one segment. Because “playing game” and “aperitif” are separated by “presentation” in scenario 2, these segments can be correctly detected.

The supervised learning scheme [5] is applied to the detected segments. As expert knowledge, we inject the situation labels “introduction,” “presentation,” “group activity” (= “aperitif” or “game”), and “siesta.” We adopt left–right hidden Markov models as the unique learner class for the situations. To evaluate, we did a threefold cross validation, taking the detected segments plus the expert labels of two scenarios as input for learning and the third scenario as the basis for recognition. As our system should be as responsive as possible, we evaluated different window sizes used for recognition. The obtained situation recognition rates are shown in Fig. 14. If we limit the observation time provided for recognition to 10 s (i.e., 250 frames with a frame rate of 25 frames/s), we get a recognition rate of 88.58% (Table III). The recognition rate of “siesta” is poor due to the fact that, in two of the three scenario recordings, wrong targets have been created and detected when a person lies down on the couch, resulting in a disturbance of the existing target properties.

We have now learned an initial situation model with the situations “introduction,” “group activity,” “presentation,” and

Fig. 14. Recognition rate of situations “introduction,” “presentation,” “group activity” (= “aperitif” or “game”), and “siesta” for different recognition window sizes.

TABLE III. Confusion matrix and information retrieval statistics for each situation (observation window size = 250). The overall situation recognition rate is 88.58%.

“siesta.” In order to integrate user preferences into this model, a user can give feedback to our system. The feedback is recorded and associated with the particular frame at which it has been given. The initially learned model is then adapted according to this feedback. For our scenarios, we want to integrate the following services:

S1) introduction ⇒ normal light and no music;
S2) aperitif ⇒ dimmed light and jazz music;
S3) game ⇒ normal light and pop music;
S4) presentation ⇒ dimmed light and no music;
S5) siesta ⇒ dimmed light and yoga music.

The user gives one feedback indicating the corresponding service during each situation. As the initial situation model does not contain any situation-service associations, S1), S4), and S5) can simply be associated with the corresponding situations. For S2) and S3), there is only one situation, i.e., “group activity,” which is too general to associate both distinct services with it. This situation thus needs to be split into subsituations (following the situation split scheme in [7]). The learned situation representation for “group activity” [here, a hidden Markov model (HMM)] is erased, and two distinct situation representations (here, HMMs) for “aperitif” and “game” are learned. The observations necessary to learn these situations are taken around the time points when the user gave the corresponding feedback. The size of the observation window used for learning the new subsituations can be varied. The situation recognition rates for different learning window sizes are shown


Fig. 15. Recognition rate of situations “introduction,” “presentation,” “aperitif,” “game” (after split), and “siesta” for different learning window sizes. The curve is for 250 observations (recognition window size).

TABLE IV. Confusion matrix and information retrieval statistics for each situation (observation window size = 250) after the split of “group activity.” The window size for learning the new subsituations is 400. The overall situation recognition rate is 81.86%.

in Fig. 15. We used a window size of 250 observations for recognition (i.e., 10 s of observation time with a frame rate of 25 frames/s). The curve indicates that a larger learning window size does not always result in a better recognition rate. The total situation recognition rate can even drop with a larger learning window size. This is due to the fact that the best recognition results are obtained when the learning window contains a maximum of observation data characteristic for the concerned situation and a minimum of “foreign” observations, i.e., wrong detections or observations corresponding to other situations. The resulting situation recognition curve tends upward, but it contains local peaks corresponding to learning window sizes with a good tradeoff between characteristic and foreign observations. For our scenario recordings, such a local peak is at a learning window size of 400, i.e., 400 observations around the feedback time points to learn “aperitif” and “game.” The obtained results of the threefold cross validation for recognition window size 250 are detailed in Table IV.
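
The split itself can be sketched as follows: the representation of the too-general situation is erased, and one new model per feedback is trained on the observations around the corresponding feedback time point, using the hypothetical 400-observation learning window. The train_hmm argument stands in for the supervised learning scheme of [5], which is not reproduced here.

def situation_split(models, parent, feedback, observations, train_hmm, learn_window=400):
    # models:       dict mapping situation label -> trained situation representation
    # parent:       label of the too-general situation to split (here "group activity")
    # feedback:     dict mapping new subsituation label -> frame index of the feedback
    # observations: full multiperson observation-code sequence of the scenario
    # train_hmm:    placeholder for the supervised learning scheme of [5]
    del models[parent]                                    # erase the parent representation
    half = learn_window // 2
    for label, t in feedback.items():
        window = observations[max(0, t - half):t + half]  # codes around the feedback point
        models[label] = train_hmm(window)                 # learn a new left-right HMM
    return models

# Usage (labels and time points hypothetical):
# situation_split(models, "group activity",
#                 {"aperitif": t_aperitif, "game": t_game},
#                 observations, train_hmm)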

V. CONCLUSION

Although the obtained results are promising, a truly smart home anticipating the needs and preferences of its users is still not available. The first products that a user could buy in a local computer store and install himself are not mature enough. First,

the sensors necessary for reliable sensing of user activities are still too invasive. Multiple cameras, microphones, or other sensors must be installed and calibrated in the home. These are still not self-installing, are difficult to use, and raise user privacy issues. Second, even though our results are encouraging, the error rates are still too high. Further improvements in detection and learning algorithms are necessary in order to provide a reliable system that could be accepted by a user in his daily life. One way to alleviate this is to provide explanations. When errors occur (and the corresponding system explanations are good), the user could understand and correct wrong system perceptions and reasoning himself.

REFERENCES

[1] A. K. Dey, “Understanding and using context,” Pers. Ubiquitous Comput., vol. 5, no. 1, pp. 4–7, Feb. 2001.

[2] J. L. Crowley, J. Coutaz, G. Rey, and P. Reignier, “Perceptual components for context aware computing,” in Proc. UbiComp, 2002, pp. 117–134.

[3] V. Bellotti and K. Edwards, “Intelligibility and accountability: Human considerations in context-aware systems,” Hum.-Comput. Interact., vol. 16, no. 2, pp. 193–212, 2001.

[4] O. Brdiczka, J. Maisonnasse, P. Reignier, and J. Crowley, “Learning individual roles from video in a smart home,” in Proc. 2nd IET Int. Conf. Intell. Environ., Jul. 2006, vol. 1, pp. 61–69.

[5] O. Brdiczka, P. Yuen, S. Zaidenberg, P. Reignier, and J. Crowley, “Automatic acquisition of context models and its application to video surveillance,” in Proc. ICPR, 2006, pp. 1175–1178.

[6] O. Brdiczka, D. Vaufreydaz, J. Maisonnasse, and P. Reignier, “Unsupervised segmentation of meeting configurations and activities using speech activity detection,” in Proc. 3rd IFIP AIAI Conf., Jun. 2006, pp. 195–203.

[7] O. Brdiczka, P. Reignier, and J. L. Crowley, “Supervised learning of an abstract context model for an intelligent environment,” in Proc. sOc-EUSAI Conf., Oct. 2005, pp. 259–264.

[8] A. Biosca-Ferrer and A. Lux, A Visual Service for Distributed Environments: A Bayesian 3D Person Tracker, PRIMA Internal Report, 2007.

[9] A. Caporossi, D. Hall, P. Reignier, and J. L. Crowley, “Robust visual tracking from dynamic control of processing,” in Proc. 6th IEEE PETS Workshop, May 2004, pp. 23–31.

[10] R. Emonet, D. Vaufreydaz, P. Reignier, and J. Letessier, “O3miscid: An object oriented opensource middleware for service connection, introspection and discovery,” in Proc. 1st IEEE Int. Workshop Serv. Integr. Pervasive Environ., Jun. 2006.

[11] D. Zhang, D. Gatica-Perez, S. Bengio, I. McCowan, and G. Lathoud, “Multimodal group action clustering in meetings,” in Proc. 2nd VSSN Workshop, 2004, pp. 54–62.

Oliver Brdiczka received the diploma degree in computer science from the University of Karlsruhe, Karlsruhe, Germany, the engineer's degree from the Ecole Nationale Supérieure d'Informatique et de Mathématiques Appliqués, Grenoble, France, and the Ph.D. degree from the Institut National Polytechnique de Grenoble (INPG). He pursued his Ph.D. research with the PRIMA Research Group at the INRIA Rhône-Alpes Research Center.

After his Ph.D., he headed the Ambient Collaborative Learning Group at TU Darmstadt, Germany.

He currently holds a position as Scientific Researcher at the Palo Alto Research Center, Palo Alto, CA. His research interests include context modeling, activity recognition, and machine learning.


James L. Crowley leads the PRIMA Research Group at the Institut National de Recherche en Informatique et en Automatique Rhône-Alpes Research Center, Montbonnot (near Grenoble), France. He is a Professor with the Institut National Polytechnique de Grenoble, Grenoble, where he teaches courses in computer vision, signal processing, pattern recognition, and artificial intelligence at the Ecole Nationale Supérieure d'Informatique et de Mathématiques Appliqués, Grenoble. He has edited two books and five special issues of journals, and has authored

over 180 articles on computer vision and mobile robotics. He ranks number 1473 among CiteSeer's most cited authors in Computer Science (August 2006).

Patrick Reignier is a Researcher with the LIG Laboratory, Institut National de Recherche en Informatique et en Automatique Rhône-Alpes Research Center, Montbonnot (near Grenoble), France. He is also an Assistant Professor with the University Joseph Fourier, Grenoble.