
Linking recognition accuracy and user experience in an affective feedback loop

Domen Novak, Member, IEEE, Aniket Nagle, and Robert Riener, Member, IEEE

All authors are with the Sensory-Motor Systems Lab, ETH Zurich, Switzerland. E-mails: [email protected], [email protected], [email protected]. Mailing address: Sensory-Motor Systems Lab, Tannenstrasse 1, CH-8092 Zurich, Switzerland. D. Novak and R. Riener are also with Balgrist University Hospital, Zurich, Switzerland.

Abstract— In an affective feedback loop, the computer maps various measurements to affective variables such as enjoyment, then adapts its behavior based on the recognized affects. The affect recognition is never perfect, and its accuracy (percentage of times the correct affective state is recognized) depends on many factors. However, it is unclear how this accuracy relates to the overall user experience. As recognition accuracy is difficult to control in a real affective feedback loop, we describe a method of simulating recognition accuracy in a game where difficulty is increased or decreased after each round. The game was played by 261 participants at different simulated recognition accuracies. Participants reported their satisfaction with the recognition algorithm as well as their overall game experience. We observed that in such a game, the affective feedback loop must adapt game difficulty with an accuracy of at least 80% to be accepted by users. Furthermore, users who do not enjoy the game are likely to stop playing it rather than continue playing and report low enjoyment. However, the acceptable recognition accuracy may not generalize to other contexts, and studies of affect recognition accuracies in other applications are needed.

Index Terms—Affective computing, computer games, machine learning, physiological computing, user acceptance


1 INTRODUCTION

1.1 Affective feedback loop

Affective computing focuses on recognizing, interpreting and responding to human affects (feelings and emotions) using various sensory modalities: facial expressions, body language, voice, physiological responses, etc. [1] An affective computing system can take the form of a feedback loop: the computer maps measured sensory data to affective variables such as stress, workload or engagement, then continuously adapts its behavior based on the recognized affects [2].

Affective feedback loops have been used in a variety of applications. For example, Liu et al. [3] recognized the player's anxiety level from autonomic nervous system (ANS) responses and used it to adapt difficulty in a game of Pong. They used a similar approach to recognize children's enjoyment while playing with a robot, and had the robot change its behavior in order to maximize enjoyment [4]. Haarmann et al. [5] recognized the level of arousal from heart rate and skin conductance, then used it to adapt the turbulence level in a flight simulator. Similarly, ANS responses have been used to recognize workload level in motor rehabilitation and adapt the exercise in order to optimally challenge the patient [6]. As a final example, affective feedback can be applied to problems such as warning car drivers in case of recognized drowsiness [7], [8].

1.2 Recognition accuracy in affective computing

Regardless of the sensory modality used in the affective feedback loop, a key challenge is to recognize human affect from raw sensor data. A number of recognition methods exist in affective computing, many of them modality-specific [9], [10]. Specifically in affective feedback loops, the dominant approach is classification of recorded data into a set of discrete classes (e.g. low/high workload, low/high enjoyment) using supervised machine learning [10]. The computer then executes a specific command (e.g. increase/decrease difficulty) depending on the class.

For classification, the recognition accuracy is defined as the percentage of correctly recognized classes and, consequently, the percentage of correctly executed commands. This accuracy is calculated with respect to a reference 'correct' class. In affect recognition from facial expressions or voice, the reference is often obtained by asking participants to act out a certain emotion [9]. In affect recognition from physiological responses, which cannot be voluntarily controlled, participants are generally exposed to various stimuli and then asked (verbally or via questionnaires) what they actually 'felt'. This response is then taken as the correct class [10], [11]. Such an approach is also common in affective feedback loops, where participants should be interacting with the system in a natural way rather than intentionally generating affective cues.

To put it another way, recognition accuracy in affective computing is defined as the percentage of times that the class (type of affect) recognized by the computer is the same as the class reported by the participant.
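Expressed as a formula (notation ours, not from the original text), over N recognition events where \hat{y}_i is the class recognized by the computer and y_i is the class reported by the participant:

\[ \mathrm{accuracy} = \frac{100\%}{N} \sum_{i=1}^{N} \mathbf{1}\!\left[\hat{y}_i = y_i\right] \]

where \mathbf{1}[\cdot] is the indicator function.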

1.3 What accuracy do we need?

Recognition accuracy varies depending on the setting, the sensors, the classifier type and whether the classifier is trained on data from the current user or from other users. For example, Healey and Picard [7] achieved an accuracy of 97.4% when classifying three stress levels from ANS responses while Chanel et al. [12] achieved an accuracy of less than 60% when classifying three difficulty levels from ANS responses with the same classifier. In a thorough comparison, AlZoubi et al. [13] demonstrated that different machine learning methods achieve significantly different accuracies even when tested on the same dataset.

Accuracy can be improved by combining multiple sensory modalities. For example, Chanel et al. [12] combined ANS responses with electroencephalography (EEG) and achieved a 3-class accuracy of 63% (compared to 57% with ANS and 56% with EEG). Novak et al. [6] combined ANS responses with task performance and achieved a 2-class accuracy of 89% (compared to 82% with task performance and 68% with ANS). Finally, Koelstra et al. [14] combined EEG and ANS responses with multimedia content analysis. However, collecting additional data increases the financial and temporal investment. The question then becomes: is it worth adding another sensor to obtain, for example, a 5% increase in recognition accuracy?

The question of how recognition accuracy relates to user experience has not been clearly addressed in affective computing, as noted by our recent review [10]. This is partially because the field is still young and studies mostly focus on demonstrating technical feasibility, but also because different applications have different requirements: for example, a drowsiness detection system may need much higher accuracy than a computer game. However, experience from other fields such as intelligent prosthetics shows that higher offline recognition accuracy does not always translate to better user satisfaction, performance or even online recognition accuracy [15].

The general research questions that we wish to explore are therefore:

Q1. In an affective feedback loop, what is the relationship between recognition accuracy and user satisfaction with the recognition?

Q2. In an affective feedback loop, what is the relationship between recognition accuracy and the general user experience?

We specifically chose to explore these questions in the context of computer games, which represent a common application of affective computing. In the context of this paper, we define an affective game as a game where human affect (workload, frustration, anxiety …) is recognized from any affective cue and used to adapt game difficulty. This definition is similar to, though narrower than, classic definitions of affective games [16].

2 MATERIALS AND METHODS

2.1 Linking recognition accuracy and user experience

In a real affective feedback loop, it is difficult to estimate how recognition accuracy relates to user experience, as the accuracy of a recognition algorithm is unpredictable. Furthermore, as each subject would experience a specific accuracy that is not known in advance, a large number of subjects would be required. This is prohibitively time-consuming for measurements such as EEG that have a long setup time. For such an evaluation, it would instead be optimal to use a system where the accuracy can be manually varied by the experimenter.

A possible solution was proposed by van de Laar et al. [17] for a related problem: evaluating brain-computer interface controllability. In their study, the interface is simulated with a simple keyboard. The user presses a button, and the computer either executes the desired command or randomly chooses one of the other possible commands. By varying the probability of the desired command being executed, the experimenter can simulate different controllability levels.

The study above has implications for affect recognition. Most importantly, it states that, paradoxically, actual measurements of affective cues are not necessary in order to study the effects of recognition accuracy in an affective feedback loop. In principle, the effect of accuracy can be evaluated simply by artificially varying the probability of the correct action being taken.

As mentioned, the correct class (and consequently correct action) in an affective feedback loop is often assumed to be the subjective feeling reported by the user in response to stimuli. In an affective feedback loop, these self-report values are generally obtained every time affect recognition is performed [3], [5], [6]. Assuming that the self-report values are correct, we propose a simple way of simulating recognition accuracy: collect the self-report data (correct class), then have a certain probability of recognizing the same class as the self-reported one. By varying this probability, different recognition accuracies can be simulated. Ideally, the human would not know what is being done, and would believe that the recognition is independent of the self-report questions.
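As a minimal sketch, the simulated recognition step could be implemented as follows (Python is used for illustration; the study itself was implemented in Flash, and all names here are our own):

import random

def simulated_recognition(reported_class, all_classes, p_correct):
    """Return the 'recognized' class: with probability p_correct it equals
    the self-reported class, otherwise a randomly chosen different class."""
    if random.random() < p_correct:
        return reported_class
    wrong_classes = [c for c in all_classes if c != reported_class]
    return random.choice(wrong_classes)

# Two-class example (increase vs. decrease difficulty) at a simulated
# recognition accuracy of 75%:
recognized = simulated_recognition("increase", ["increase", "decrease"], 0.75)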

To examine the effect of recognition accuracy on user experience in an affective game, we therefore require a simulated affective game where the user's opinion is recorded at regular intervals, and the computer has a certain probability of agreeing with the recorded opinion. By varying this probability, different recognition accuracies can be simulated as the percentage of agreement between the user and the computer. The principles of an actual and a simulated affective game are shown in Fig. 1.

Fig. 1. The principle of a real affective game (subfigure a) and a simulated affective game where desired affect recognition accuracy can be set manually (subfigure b).

The most basic type of affective computer game allows the computer to choose between two actions: either increase or decrease difficulty [6]. While some also include the option to keep difficulty the same [4], affective games with more than three possibilities are rare due to low recognition accuracies. Our study thus used a game where difficulty is increased or decreased after each game interval.


2.2 The game

The game is a variant of the classic Snake game where the player controls a snake that moves over the playing field at a certain speed (Fig. 2). The player cannot stop the snake, but can turn it left or right by pressing the left and right arrows on the keyboard. At the start of each round, the snake is very short, and a piece of food appears in a random position for the snake to collect. When the snake collides with the food, the food is eaten, the player receives 100 points, the snake grows longer, and a new piece of food appears in a random position. If the snake collides with either itself or the edge of the playing field, it dies and the round ends.

The simulation of an affective game was done as follows. At the start of the game, a value defined as pthresh was generated randomly for each participant between 50% and 100%. This value represented the desired recognition accuracy of a simulated affective feedback loop. The game always began at a default difficulty where the snake needs approximately five seconds to cross the screen. After each round, the participant was asked whether he/she would prefer difficulty to increase or decrease. The answer was given via the keyboard, with no option to stay at the same difficulty level. This represents the self-reported 'correct' class. Once the participant made a choice, the computer had a pthresh probability of agreeing with that choice. When agreeing, difficulty was changed by one level in the participant's preferred direction; otherwise, it was changed in the opposite direction. Higher difficulties correspond to increased snake speed.
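A sketch of this adaptation rule under the assumptions above (again Python rather than the original Flash; the numeric difficulty scale and variable names are our own):

import random

p_thresh = random.uniform(0.5, 1.0)  # per-participant simulated recognition accuracy
difficulty = 5                        # default level (snake crosses the screen in ~5 s)

def adapt_difficulty(difficulty, preferred_direction, p_thresh):
    """preferred_direction is +1 (increase) or -1 (decrease), as reported by
    the participant after a round. With probability p_thresh the computer
    agrees and moves difficulty one level that way; otherwise it moves one
    level in the opposite direction."""
    agrees = random.random() < p_thresh
    step = preferred_direction if agrees else -preferred_direction
    return difficulty + step, agrees

# After each round, e.g. if the participant asked for an increase:
difficulty, agreed = adapt_difficulty(difficulty, +1, p_thresh)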

The computer's choice was displayed to the participant before the next round began. However, participants were informed by the initial instructions that the computer would act according to an intelligent difficulty adaptation algorithm that would not take the player's preference into account. The game claimed that the player's preference would only be recorded to evaluate the performance of the (nonexistent) intelligent algorithm. This simulated an affective game where subjective responses are collected as a reference but are not used to affect difficulty adaptation.

Participants were required to play the game for at least 5 rounds. After the fifth round, they had the option to quit playing, but could play as many rounds as they liked.

2.3 Participants

The game was implemented in Flash and placed online at http://www.seriousgames.ethz.ch/snake.html. It was advertised via alumni mailing lists and online communities (websites such as Facebook, Something Awful and Reddit). Participants were told that the study was about game difficulty adaptation algorithms, but were not informed about the actual purpose of the study. An online implementation was chosen based on recommendations of van de Laar et al. [17]: since the effect of recognition accuracy may be small, many participants are required and can most easily be obtained online.

Participants were first presented with the instructions and asked to fill out an initial questionnaire. They then played the game until they chose to quit (with a minimum of five rounds), and finally were asked to fill out a post-game questionnaire. Of course, participants could also end their participation by closing their web browser. However, any participants who did not fill out both questionnaires or gave obviously dummy answers (e.g. choosing the default option for all answers) were excluded from the analysis. Additionally, participants who obtained an average score of less than 200 points per round were excluded, as they likely did not understand the game or experienced potential technical glitches. In total, 439 participants began playing the game, of which 128 did not complete the questionnaire and 50 were excluded due to low scores. 261 participants were included in the analysis (216 males, 45 females; age 26.0 ± 4.6 years).
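Purely for illustration, the exclusion step could be expressed as the following filter (column names and example rows are hypothetical; the paper does not describe its analysis tooling):

import pandas as pd

participants = pd.DataFrame({
    "completed_questionnaires": [True, True, False],   # filled out both questionnaires
    "dummy_answers":            [False, True, False],  # obviously dummy answers
    "mean_score_per_round":     [450.0, 310.0, 120.0], # average score per round
})
valid = participants[
    participants["completed_questionnaires"]
    & ~participants["dummy_answers"]
    & (participants["mean_score_per_round"] >= 200)
]
# In the reported study, 261 of 439 participants passed these criteria.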

Fig. 2. The Snake game. The playing field is colored black and bordered with gray. The snake is colored green with a blue head. The food (single square) is colored red.

2.4 Questionnaire

2.4.1 Before the game

Participants were asked about their gender, age, and how often they play computer games (options: "never", "less than 1 hour/week", "1-2 hours/week", "2-5 hours/week", and "more than 5 hours/week"). Due to the online nature of the game, the reliability of these responses is lower than in studies where participants visit the lab in person.

Participants were also asked how difficult they prefer games to be and how easily frustrated they are. These two questions were answered using a continuous slider marked "not at all" on one end and "very difficult" or "very easily" on the other end. Slider choices were converted to values between 0 and 100.

2.4.2 After the game

Participants were asked five questions:

- "How fun was the game?"
- "How frustrating was the game?"
- "How satisfied are you with the decisions of the difficulty adaptation algorithm?"
- "Would you recommend this difficulty adaptation algorithm for practical use?"
- "Would you play this game again?"

The first three were answered using continuous sliders marked "not at all" on one end and "very much" on the other. Slider choices were converted to values between 0 and 100. The last two were answered with yes or no.

2.5 Data analysis

We first calculated the agreement between the participant and the difficulty adaptation as the percentage of times the computer adapted difficulty in the direction preferred by the participant. This differs from pthresh since pthresh is defined in advance whereas agreement is calculated post-hoc according to what actually happened. This agreement was used as the stand-in for recognition accuracy in the simulated affective game.
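In code, agreement is simply the post-hoc fraction of rounds in which the applied difficulty change matched the participant's stated preference (a sketch in our own notation):

def agreement_percentage(decisions):
    """decisions: one (preferred_direction, applied_direction) pair per round
    in which the participant stated a preference."""
    agreed = sum(1 for preferred, applied in decisions if preferred == applied)
    return 100.0 * agreed / len(decisions)

# Example: the computer agreed with the participant in three of four
# stated preferences -> 75% agreement.
print(agreement_percentage([(+1, +1), (+1, +1), (-1, +1), (+1, +1)]))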

Participants were separated into six bins according to their agreement percentage: [0 50), [50 60), [60 70), [70 80), [80 90) and [90 100]. For each post-game question, differences between the bins were analyzed using one-way analysis of variance (ANOVA) and post-hoc tests. If the assumptions for one-way ANOVA were not met, ANOVA on ranks was used instead.

Finally, Spearman correlation coefficients between agreement and answers to post-game questions as well as between pre- and post-game answers were calculated.
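The statistical analysis could be sketched as follows (our code with placeholder data; the paper does not specify the software used, and Kruskal-Wallis is taken here as a stand-in for ANOVA on ranks):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
agreement = rng.uniform(0, 100, size=261)     # placeholder values, not the real data
satisfaction = rng.uniform(0, 100, size=261)  # placeholder values, not the real data

# Six bins: [0 50), [50 60), [60 70), [70 80), [80 90), [90 100]
bins = np.digitize(agreement, [50, 60, 70, 80, 90])

groups = [satisfaction[bins == b] for b in range(6)]
f_stat, p_anova = stats.f_oneway(*groups)              # one-way ANOVA
h_stat, p_ranks = stats.kruskal(*groups)               # ANOVA on ranks (Kruskal-Wallis)
rho, p_rho = stats.spearmanr(agreement, satisfaction)  # Spearman correlation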

3 RESULTS AND DISCUSSION

3.1 Game completion and biases

Table 1 shows the distribution of the 261 valid participants into the six bins according to their agreement percentage. Bin 6 (very high agreement) has a much higher number of participants than all the others. This could at first glance be due to biased assignment of pthresh values, but a follow-up analysis of generated pthresh values showed roughly equal distributions.

Instead, the bias in agreement values is due to two factors. First, pthresh is not guaranteed to match agreement, and a participant with a pthresh of 50 can in principle have an agreement value anywhere between 0 and 100. Furthermore, the agreement value depends on the number of rounds the game is played. For example, participants could end the game and answer the post-game questionnaire after the fifth round without stating whether difficulty should increase or decrease. With only four stated difficulty decisions, agreement is then calculated from these four cases. As the participant and computer either agree or disagree in each case, the only possible agreement values for five played rounds are 0 (always disagree), 25, 50, 75 or 100 (always agree). The [60 70) and [80 90) bins therefore have a lower number of participants since they only include participants with at least 6 rounds played.

The second source of bias is that only participants who played at least 5 rounds can answer the post-game questionnaire and be included in the analysis. If we look at the 261 valid participants plus the 128 who did not complete the questionnaire, subjects with low agreement were much more likely not to complete the post-game questionnaire. It was completed by:

- 44% of subjects with agreement below 50,
- 52% of subjects with agreement between 50 and 60,
- 65% of subjects with agreement between 60 and 70,
- 79% of subjects with agreement between 80 and 90,
- 76% of subjects with agreement of 90 or more.

This suggests that participants who are dissatisfied with the difficulty adaptation will simply quit rather than play the game for the required number of rounds. Results are therefore skewed, as fun and frustration values are collected only from participants who were engaged enough to complete the five required rounds. An alternative would have been to present the post-game questionnaire to all participants. However, this is difficult in an online experiment where participants can always close their browser and ignore the questionnaire. Additionally, it is questionable whether, e.g., participants who quit after one round actually provide relevant data.

Readers should note that these biases mean that the following results should be taken with a grain of salt. However, such biases will (in our opinion) represent an important issue for future studies on the topic, and appropriate ways to deal with them should be found.

3.2 Perception of recognition accuracy

Participants' responses to "How satisfied are you with the decisions of the difficulty adaptation algorithm?" and "Would you recommend this difficulty adaptation algorithm for practical use?" are shown in Figs. 3 and 4. The correlation coefficient between agreement and satisfaction with the difficulty adaptation was 0.43 (p < 0.001). Participants therefore clearly do notice recognition accuracy, and their satisfaction with the difficulty adaptation increases steadily with recognition accuracy (Fig. 3).

Fig. 3. Participants' satisfaction with the difficulty adaptation, presented as a function of agreement between the participant and the difficulty adaptation. * indicates p < 0.05, ** indicates p < 0.01, and *** indicates p < 0.001.

TABLE 1
DISTRIBUTION OF VALID PARTICIPANTS INTO BINS

Bin   Agreement   N    % male   Age (years)
1     [0 50)      21   76.2     25.5 ± 4.7
2     [50 60)     30   90.0     26.4 ± 5.4
3     [60 70)     24   85.7     27.0 ± 4.6
4     [70 80)     49   83.7     26.8 ± 3.5
5     [80 90)     41   87.2     26.2 ± 5.9
6     [90 100]    96   79.6     25.4 ± 4.1


Fig. 4. Percentage of participants that would recommend the difficulty adaptation algorithm, presented as a function of agreement between the participant and the difficulty adaptation. * indicates p < 0.05, ** indicates p < 0.01, and *** indicates p < 0.001.

Participants who recommended the difficulty adaptation had a higher agreement value than those who did not (median 85.7% vs 75.0%, p < 0.001). Given these two medians and the large jump between the [70 80) and [80 90) bins in Fig. 4, we can give an approximate suggestion: in an affective game with only two possible classes (corresponding to increase/decrease difficulty), a recognition accuracy of at least 80% is required for participants to recommend the difficulty adaptation. Higher accuracy further increases satisfaction with the difficulty adaptation, as seen in Fig. 3.

3.3 Experience with the game

Participants' responses to the "How fun was the game?" and "Would you play the game again?" questions are shown in Figs. 5 and 6. Correlations between agreement, fun and frustration were not significant. Participants who stated that they would play the game again had a borderline significantly higher agreement value (median 83.3% vs. 75.0%, p = 0.058).

Fig. 5. Amount of fun experienced by participants, presented as a function of agreement between the participant and the difficulty adaptation. * indicates p < 0.05.

Fig. 6. Percentage of participants that would play the game again, presented as a function of agreement between the participant and the difficulty adaptation. * indicates p < 0.05, ** indicates p < 0.01.

At face value, these results suggest that recognition accuracy has little effect on the overall experience with the game, with a mean fun value of approximately 42 (out of 100) for chance recognition and 50 for perfect recognition. However, here we must reiterate that many more subjects with a high agreement actually completed the required five rounds of the game. Therefore, it is likely that the observed differences in user experience are small because subjects who would have reported low fun instead chose not to complete the game.

Furthermore, although there was no direct significant connection between agreement and fun or frustration in the game, satisfaction with the difficulty adaptation was correlated with fun (ρ = 0.33, p < 0.001), and participants who recommended the difficulty adaptation algorithm had higher fun values (median 56.4 vs 33.3, p < 0.001). Overall, this suggests that difficulty adaptation does matter, but that the optimal adaptation choice is not always what is suggested by the participant. This is reinforced by the fact that the most fun and the most willingness to play the game again were seen in participants with agreement between 80% and 90%. A similar result was observed by van de Laar et al. in their evaluation of brain-computer interface controllability, where perfect control was found to be less fun than slightly imperfect control [17]. The result also implies that using self-reported values as (sole) references for difficulty adaptation may actually lead to suboptimal user experience.

3.4 Other significant correlations

The number of game rounds played by the participant was correlated with fun (ρ = 0.22, p < 0.001). Given that many participants with low agreement did not complete the required five rounds, this further supports the explanation that, while no connection was found between agreement and fun, participants who did not enjoy the game simply did not complete the experiment.

Participants who stated that they are more easily frustrated reported higher frustration (ρ = 0.17, p = 0.005) and higher fun (ρ = 0.14, p = 0.024). Participants who stated that they like difficult games played the game for more rounds (ρ = 0.15, p = 0.019).

3.5 Extension to other games and applications

Our study is limited by the use of a simple game where difficulty adaptation only has two possible actions (increase/decrease). Results are not guaranteed to generalize to more complex games with multiple possible actions. In such games, actions taken by the computer may also not be purely wrong or right, but could be divided into, for example, small mistakes or major mistakes, which would require a different evaluation. Results may have also been different if participants had played the game for a longer time or over multiple sessions, something that would be expected with real affective games.

The findings are also not guaranteed to apply to other affective computing situations. For example, in situations such as drowsiness detection during driving, required accuracy would be much higher than in a game, and we would also need to consider the tradeoff between false positives (where a warning to the driver would be annoying) and false negatives (where failure to warn the driver could be catastrophic). A less critical, but nonetheless different situation would be one where participants cannot quit using the system and must continue using it despite poor accuracy.

Despite the limited scope and the biases mentioned in previous sections, our study provides a first rough estimate of acceptable recognition accuracies in affective computing. As the relationship between recognition accuracy and user experience has seen little research in affective computing, the biases and problems encountered also highlight future challenges in what we believe will be a critical issue in future development of affective computing. The proposed methodology also has the benefit of not being limited to a single sensory modality (e.g. physiological monitoring) and could in principle be used to study difficulty adaptation in general, though our main interest is in affective computing.

4 CONCLUSIONS

Our study provides an early look at the relationship between affect recognition accuracy and user experience in affective games. Based on satisfaction with the difficulty adaptation algorithm, the percentage of dropouts and the percentage of people that would recommend the algorithm, the study suggests that at least approximately 80% recognition accuracy is needed in a two-class (increase/decrease difficulty) affective game setup.

Furthermore, though there was no direct correlation between recognition accuracy and fun, there was a significant correlation between satisfaction with the difficulty adaptation algorithm and fun. As 100% accuracy was defined as always doing what the user wanted, this suggests that difficulty adaptation matters, but that simply following the user's preferences may not result in the optimal experience in an affective game.

The methodology used is independent of affect recognition modality, as it avoids using real sensors and can therefore be generalized to any modality (physiological monitoring, facial expression recognition…). However, it is limited to a particular application, and further studies are necessary in different contexts to determine how acceptable recognition accuracy varies with application.

ACKNOWLEDGMENT

The authors wish to thank all study participants as well as all lab members who tested the game during the study. This work was supported by the Swiss National Science Foundation, NCCR Robotics and NCCR Neural Plasticity and Repair, as well as by the Clinical Research Priority Program “NeuroRehab” of the University of Zurich. D. Novak is the corresponding author.

REFERENCES

[1] R.A. Calvo and S. D'Mello, "Affect Detection: An Interdisciplinary Review of Models, Methods and Their Applications," IEEE Trans. Affective Computing, vol. 1, pp. 18-37, 2010.

[2] S. Fairclough, "Fundamentals of Physiological Computing," Interacting with Computers, vol. 21, pp. 133-145, Jan. 2009.

[3] C. Liu, P. Agrawal, N. Sarkar and S. Chen, "Dynamic Difficulty Adjustment in Computer Games Through Real-Time Anxiety-Based Affective Feedback," International Journal of Human-Computer Interaction, vol. 25, pp. 506-529, 2009.

[4] C. Liu, K. Conn, N. Sarkar and W. Stone, "Online Affect Detection and Robot Behavior Adaptation for Intervention of Children with Autism," IEEE Trans. Robotics, vol. 24, pp. 883-896, Aug. 2008.

[5] A. Haarmann, W. Boucsein and F. Schaefer, "Combining Electrodermal Responses and Cardiovascular Measures for Probing Adaptive Automation during Simulated Flight," Applied Ergonomics, vol. 40, pp. 1026-1040, Nov. 2009.

[6] D. Novak, M. Mihelj, J. Ziherl, A. Olenšek and M. Munih, "Psychophysiological Measurements in a Biocooperative Feedback Loop for Upper Extremity Rehabilitation," IEEE Trans. Neural Systems and Rehabilitation Engineering, vol. 19, pp. 400-410, Aug. 2011.

[7] J.A. Healey and R.W. Picard, "Detecting Stress during Real-World Driving Tasks using Physiological Sensors," IEEE Trans. Intelligent Transportation Systems, vol. 6, pp. 156-166, Jun. 2005.

[8] A. Picot, S. Charbonnier and A. Caplier, "On-line Detection of Drowsiness using Brain and Visual Information," IEEE Trans. Systems, Man and Cybernetics Part A: Systems and Humans, vol. 42, pp. 764-775, May 2012.

[9] Z. Zeng, M. Pantic, G.I. Roisman and T.S. Huang, "A Survey of Affect Recognition Methods: Audio, Visual and Spontaneous Expressions," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 31, pp. 39-58, Jan. 2009.

[10] D. Novak, M. Mihelj and M. Munih, "A Survey of Methods for Data Fusion and System Adaptation using Autonomic Nervous System Responses in Physiological Computing," Interacting with Computers, vol. 24, pp. 154-172, May 2012.

[11] A. Schwerdtfeger, "Predicting Autonomic Reactivity to Public Speaking: Don't Get Fixed on Self-Report Data!", International Journal of Psychophysiology, vol. 52, pp. 217-224, May 2004.

[12] G. Chanel, C. Rebetez, M. Bétrancourt and T. Pun, "Emotion Assessment from Physiological Signals for Adaptation of Game Difficulty," IEEE Trans. Systems, Man and Cybernetics Part A: Systems and Humans, vol. 41, pp. 1052-1063, Nov. 2011.

[13] O. AlZoubi, S.K. D'Mello and R.A. Calvo, "Detecting Naturalistic Expressions of Nonbasic Affect using Physiological Signals," IEEE Trans. Affective Computing, vol. 3, pp. 298-310, 2012.

[14] S. Koelstra, C. Mühl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt and I. Patras, "DEAP: A Database for Emotion Analysis using Physiological Signals," IEEE Trans. Affective Computing, vol. 3, pp. 18-31, 2012.

[15] L.J. Hargrove, E.J. Scheme, K.B. Englehart and B.S. Hudgins, "Multiple Binary Classifications via Linear Discriminant Analysis for Improved Controllability of a Powered Prosthesis," IEEE Trans. Neural Systems and Rehabilitation Engineering, vol. 18, pp. 49-57, Feb. 2010.

[16] K.M. Gilleade, A. Dix and J. Allanson, "Affective Videogames and Modes of Affective Gaming: Assist Me, Challenge Me, Emote Me," Proc. 2005 Digital Games Research Conf., pp. 547-554, Jun. 2005.

[17] B. Van de Laar, D.P. Bos, B. Reuderink, M. Poel and A. Nijholt, "How Much Control is Enough? Influence of Unreliable Input on User Experience," IEEE Trans. Cybernetics, vol. 43, pp. 1584-1592, Dec. 2013.

Domen Novak obtained his diploma (2008) and PhD (2011) from the University of Ljubljana, Slovenia, where he was a junior researcher from 2008 to 2012. Since 2012, he has been a postdoctoral fellow at the Sensory-Motor Systems Lab, ETH Zurich, Switzerland. He has published 9 papers as first author and 5 papers as co-author in international journals, as well as a textbook on virtual reality. His research interests include motor and cognitive rehabilitation, human-robot interaction, psychophysiology and machine learning.

Aniket Nagle obtained his bachelor's degree (2007) from the University of Pune, India, and his master's degree (2011) from ETH Zurich, Switzerland. He is currently a PhD student at the Sensory-Motor Systems Lab, ETH Zurich, Switzerland. His research interests include serious games, game design, and the psychology of motivation.

Robert Riener obtained his diploma (1993) and PhD (1997) from the Technical University of Munich. He is currently full professor for Sensory-Motor Systems at ETH Zurich and professor of medicine at the University Hospital Balgrist, University of Zurich. He has published more than 400 peer-reviewed journal and conference articles, 20 books and book chapters and filed 21 patents. He has received more than 15 personal distinctions and awards including the Swiss Technology Award in 2006, the IEEE TNSRE Best Paper Award 2010, and the euRobotics Technology Transfer Awards 2011 and 2012. His research focuses on human-machine interaction, particularly human sensory-motor control and the design of novel user-cooperative robotic devices and virtual reality technologies. The main application areas are the fields of rehabilitation and sports.