PETA – a Pedagogical Embodied Teaching Agent

David Powers, Richard Leibbrandt
AILab, Informatics & Engineering
Flinders University
Adelaide, South Australia
+61-8-8201-3663/3659
{David.Powers, Richard.Leibbrandt}@flinders.edu.au

Martin Luerssen, Trent Lewis
HeadLab, Informatics & Engineering
Flinders University
Adelaide, South Australia
+61-8-8201-3867/3650
{Martin.Luerssen, Trent.Lewis}@flinders.edu.au

Mike Lawson
Centre for Education Futures
Flinders University
Adelaide, South Australia
+61-8-8201-2829
Mike.Lawson@flinders.edu.au

ABSTRACT

We describe a hybrid real and virtual system that monitors and teaches children in an everyday classroom environment without requiring any special virtual reality setup or any knowledge that there is a computer involved. The system is truly pervasive in that it interacts by speech with a child who is playing with normal physical toys. A simulated virtual head provides a focus and the opportunity for microphonological language teaching, whilst a simulated world allows the teacher to demonstrate using her set of blocks – much as a teacher might demonstrate at the front of the class. However, this system allows individual students, or pairs of students, to interact with the task in a computer-free way and receive feedback from PETA, the Teaching Head, as if from their own private tutor.

Categories and Subject Descriptors
H.5.2 [Information Interfaces and Presentation]: User Interfaces – evaluation/methodology, graphical user interfaces, input devices and strategies, interaction styles, natural language, voice I/O.

General Terms
Design, Experimentation, Human Factors

Keywords
CALL, second language teaching, human-computer interfaces, affect-awareness, embodied conversational agents.

1. INTRODUCTION
For effective language learning, immersion in a natural language

environment is regarded as highly desirable, if not essential [8,

13, 14, 15]. Normally, for first language or bilingual language

learning, this is provided naturally by living in an area in which

the target language is spoken. Our focus is schools in Australia

where currently children are seldom exposed to native language

speakers as teachers, and where students seldom achieve

internalization of many of the fundamentals of the target language.

In this paper we will assume our target language is German or

English, as it is for our pilot experiments, but in fact we are

looking at a broader range of languages, including Asian
languages.

As our aim is to create a teaching experience that is as similar as

possible to a natural interaction with a human parent or peer, we

have aimed to minimize the impression of dealing with a

computer and keep the number of peripheral computer

input/output devices employed in the system to a minimum: the

only display devices are a pair of computer screens, and the only

other technical equipment is a set of web cameras.

Communication with the computer is pervasive – rather than

typing at a keyboard children physically manipulate and talk

about real-world objects that are part of a shared reality for both

the computer system and the learner. In this way, the interaction

with the computer tutor is similar to conversing with a human

tutor about things in the immediate “here and now”. This physical

interface constitutes a language-neutral and indirect way of

interacting with a computer, and so may be profitably employed in

other situations where no assumption can be made about the

user’s level of language skill or computer literacy. This is

especially important as we are targeting children at an age where

they are most receptive to language but not necessarily able to

make effective use of a traditional computer interface.

2. THINKING HEAD & ROBOT WORLD
One precursor of this project is the Thinking Head [2], a virtual

anthropomorphic software agent that engages the user in

multimodal dialogue involving a variety of verbal and nonverbal

behaviours. The original Prosthetic Head was commissioned for

performance artist Stelarc, and the latest version is performing at

the National Art Museum of China in association with the Beijing

Olympics. The Head is able to “speak” English sentences via

speech synthesis and facial animation. In addition, the latest

version can display emotion by means of various facial

expressions and through vocal prosody, as well as being capable

of a variety of gestures (see Fig. 1). Users can converse with the

Head by speech or keyboard input; the Head responds by

matching this input against the most appropriate verbal and

nonverbal responses using techniques that are well established for

Embodied Conversational Agents. The Thinking Head initiative

aims for a human-computer interaction that is as natural as a
face-to-face conversation with another person.

A second precursor of the project is provided by the Magrathea and
MicroJaea Robot Worlds [5, 6, 11, 16, 17]. These 3D simulations

were designed to allow a computer to learn linguistic, ontological

and semantic concepts (see Fig. 2) in the context of what has

come to be known as L0 language learning [3].


Figure 1. Thinking Head based on performance artist Stelarc.

Originally designed to explore the role of embodied

conversational agents and virtual reality technology in

performance, it is the forefather of a series of new Thinking Heads

and applications.

3. THE TEACHING HEAD
In the current Teaching Head system, learners interact with the

system by manipulating and speaking about members of a set of

physical objects that are placed within a circumscribed area, and

are surveyed by means of web cameras which track the position of

the objects as they are moved around. Using this video

information, the software system is able to maintain a

dynamically-updated model of where all objects are located in

space. In this way, both learner and teacher are potentially focused

on a shared reality, grounded in the physical world.

The vision system consists of 3 cameras strategically placed to

observe the participant’s interaction with the Teaching Head. One

camera captures the participant’s face and body to gauge

emotional state and focus of attention, and in future versions to

lip-read and track correct articulation. This camera also allows the

Teaching Head to monitor the participant's current focus, e.g.,

tracking the participant’s gaze and adjusting the gaze of the

Teaching Head to look at either the student or the objects the

student is attending to. The two additional cameras monitor the

Physical Arena from orthogonal perspectives, allowing full 3D

localization. Implementation of the vision software is making use

of state-of-the-art classifier techniques [10, 22] and object

tracking algorithms [1].
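
As an illustration of the geometry involved, the sketch below shows how two orthogonal views could be combined into a 3D position estimate. The scaled-orthographic calibration and all names are our own simplifying assumptions, not the deployed calibration code.

```python
# Illustrative sketch only: combining two orthogonal camera views into
# a 3D position estimate, under an assumed scaled-orthographic model.

def locate_3d(top_px, side_px, scale_top, scale_side):
    """top camera looks straight down (image x, y -> world x, y);
    side camera looks along the world x-axis (image x, y -> world y, z).
    Sign conventions and camera offsets are ignored for brevity."""
    x = top_px[0] * scale_top
    z = side_px[1] * scale_side
    # The world y coordinate is seen by both cameras; averaging the two
    # estimates gives a cheap consistency check.
    y = 0.5 * (top_px[1] * scale_top + side_px[0] * scale_side)
    return (x, y, z)
```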

At the same time, a MicroJaea display of a three-dimensional

virtual world is presented for the purpose of demonstrating

hypothetical situations involving these same manipulable objects.

These situations are then described by the Teaching Head in the

target second language. For instance, one might imagine the

teaching of German prepositions using a number of blocks of

various colours. The Head would utter the sentence “der grüne

Block ist unter dem blauen Block” while the rendered scene in

the imaginary arena shows a green block stacked underneath a

blue one.

In Figure 3 we show a schematic of the teaching workspace

showing one possible configuration of screens and cameras. In

fact the layout is something we are still evaluating/optimizing, and

we expect that our standard configuration will have the two

screens angled at around 120°. This allows the Head to look at

what is happening both on its MicroJaea screen and in the semi-

enclosed workspace bounded by the screen on two sides. This is

true in two senses – the Thinking Head's simulated eyes can look
at the subject, a physical object or a simulated object, and the
cameras associated with the workspace can also be focused appropriately –

this can be achieved by fixed cameras with logical panning and

focusing of attention, but we are also evaluating the use of

Logitech QuickCam Sphere webcams that can physically pan as

required.

Figure 3 also illustrates a particular configuration of three

orthogonal cameras, but other configurations are being evaluated.

In particular, with the angled screens each screen can have a

mounted or built in webcam, fixed ahead or panning, and whilst

perfect orthogonality simplifies the mathematics somewhat, there

is no overriding reason to mandate orthogonality and there are

pragmatic advantages in using down-looking cameras from the
two monitors (for the workspace) and an upward/forward-looking
camera from between the monitors (for face tracking).

Computer input is thus implicit from tracking the user’s face,

movements and actions, as well as their speech.

Figure 2. MicroJaea language teaching scenario. Originally

designed to teach the computer English, this Java 3D world and

scripting language is now being redeployed to teach children

English and German.

4. PEDAGOGICAL APPROACH
Krashen [9] distinguished between acquisition and learning, that
is, acquiring language in a spontaneous, uncritical, unconscious

manner, rather than formal language learning, which is concerned

with a conscious inculcation of language rules and facts that are

used by the learner to criticise and correct his/her own

productions. We consider this to be an important distinction and

regard language learning as more akin to a social negotiation [17]

– we learn and indeed invent language and culture in an integrated

fashion that is as much influenced by our interactions with other

learners (peers) as our parents and teachers (models). In a normal

“mother tongue” environment the language of our parents,

teachers and older siblings is similar enough that we do indeed

learn something pretty close to our mother’s language. But where

there is a mix of languages or we have closer association with

peers than models, we tend to learn a creolized amalgam of our

own invention plus multiple linguistic influences.

In a classroom situation, we have relatively few hours available

for language learning each week – sometimes not even one hour

(e.g. one 40-minute lesson a week). This hour is often not well

used from the acquisition perspective. Typical exercises in

primary school are focused on vocabulary learning, drawing

pictures or identifying pictures. At the critical ages for

internalizing the complex morphological and syntactic system of a

language we are simply not providing children with appropriate

models, and we are not requiring them or giving them opportunity

to experiment with language and succeed or fail in expressing

themselves comprehensibly.

A crucial prerequisite for language acquisition is that learners

should be engaged in an activity or discussion with a native

speaker of the target language (German) – this interaction should

ideally be monolingual. Giving away that you can speak their

language (English) reduces the motivation to communicate.

Interspersing English and German into the one conversation

reduces the actual exposure to German and confuses the features

of the languages.

The ideal way of achieving native speaker competence is to have a

particular situation or person where you only communicate in the

target language. In mixed background families or environments

this often relates to a parent, grandparent, governess/au pair, or

other person who naturally speaks a different language. This then

leads to coordinate bilingualism [23] where we see native speaker

competence in the specific contexts where the child was exposed

to that language, and often this is associated with quite different

cultural and personality traits. On the other hand, vocabulary and

registers for other areas of discourse will be lacking – but the

morphological and syntactic infrastructure of language will be

properly in place and exposure to new situations will lead to quick

learning of the appropriate vocabulary and register.

To ensure the required level of motivation and interest on the part

of the learner, as well as the immersion and embodiment that

characterizes learning native level competence, our focus is on

providing intriguing problem solving and collaborative

environments for learning. For this reason, all lessons are

structured as language games, involving physical props, and are

designed to be interesting to learners of the target age.

Because interfacing with the system using a keyboard and mouse

are likely to detract from the naturalness of the interaction, and so

divert learners’ attention from the task at hand, the interface is

designed to be very similar to a normal conversation, with both

parties speaking, and the learner moving physical props around so

as to achieve language game goals such as constructing bridges,

assembling animals or robots from body parts, etc.

The teacher, in this case PETA, the Teaching Head, has an

important role to play in keeping the student interested in the

activity, and gauging understanding by ensuring that the student is

paying attention to the appropriate things. Hence, abilities to

“read” facial expressions and to track visual gaze are

indispensable; these will be determined from video input coming

from a camera directed at the learner during the interaction.

Lastly, the teacher needs to structure the language that he/she uses

in such a way that it is not well beyond the reach of the student,

but also not too simple. The optimal complexity for language

acquisition seems to be language input that is just slightly more

complex than the level that the student has already attained [9,

17]. To this end, our computerized teaching system will be able to

gauge the learner’s current level of development, and to adapt the

complexity of its utterances accordingly.
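
As a sketch of how such adaptation might be implemented, the following illustrates the "slightly beyond current level" policy with an invented level scale and invented utterance templates; both are assumptions for illustration only, not part of the deployed system.

```python
# A sketch of the "just beyond current level" policy suggested by
# [9, 17]: choose teaching utterances one notch above the learner's
# assessed level. The level scale and templates are invented.

TEMPLATES = {
    1: "der Block",                              # bare noun phrase
    2: "der grüne Block",                        # adjective inflection
    3: "der grüne Block ist unter dem Tisch",    # preposition + dative
    4: "leg den grünen Block unter den blauen",  # imperative + accusative
}

def next_utterance(learner_level: int) -> str:
    """Return a template at level i+1, capped at the highest level."""
    target = min(learner_level + 1, max(TEMPLATES))
    return TEMPLATES[target]

print(next_utterance(2))  # -> "der grüne Block ist unter dem Tisch"
```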

Figure 3. The layout of the Teaching Head installation, showing the Teaching Head and Virtual Arena screens, the Physical Arena, the Participant Camera and the Arena Cameras.

5. SOFTWARE SYSTEM COMPONENTS

5.1 Thinking Head
The Thinking Head is a three-dimensional, animated human head

capable of speaking arbitrary English sentences by synthesizing

the sounds of the sentence while moving its facial effectors in an

appropriate manner. The Head is also capable of a wide range of

emotional facial expressions. Users can converse with the Head

by speech or keyboard input; the Head responds by matching this

input against the most appropriate verbal or nonverbal responses.

The Thinking Head initiative aims for a human-computer

interaction that is as natural as a face-to-face conversation with

another person. Nonverbal signals play an important role here, as

agents that correctly perceive and respond with such signals are

more likely to be accepted as equal partners. The use of a Head

that models natural articulation and expression is a unique facet of

this pedagogical software system. The English-speaking version

of the Head is being developed in Australia in the context of an

ARC (Australian Research Council) Thinking Systems project,

whilst the German-speaking version of the Head is being

developed under a DFG (Deutsche Forschungsgemeinschaft)

project.

Currently, Thinking Head emotions/expressions are primarily

controlled by metadata in the form of tags embedded into the

scripts. However, we are also experimenting with semantic

modeling techniques [25-27] involving corpus-based and

thesaurus-based word similarity techniques. Longer term, we are

building the capability of recognizing expressions and modeling

learner attitudes, although due to the data collection and model
training required, this will be introduced gradually during the alpha- and

beta-testing phases with actual school students undertaking real

lessons.
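
The tag syntax below is purely hypothetical – the paper does not specify the actual markup used in the scripts – but it illustrates how expression metadata embedded in a script might be separated from the text passed to the synthesizer.

```python
# Hypothetical tag format (the Thinking Head's real markup is not
# specified here); illustrates separating expression metadata from
# the plain text sent to the speech synthesizer.
import re

script_line = "<smile/> Sehr gut! <nod/> Der grüne Block ist <emph>unter</emph> dem blauen Block."

def split_tags(line):
    """Return (tags for the animation channel, plain text for speech)."""
    tags = re.findall(r"<[^>]+>", line)
    text = re.sub(r"<[^>]+>", "", line)
    return tags, " ".join(text.split())

tags, text = split_tags(script_line)
# tags -> ['<smile/>', '<nod/>', '<emph>', '</emph>']
# text -> 'Sehr gut! Der grüne Block ist unter dem blauen Block.'
```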

5.2 Physical arena
In the current system, the Head will become a virtual participant

in the physical world by creating and monitoring an internal

computer representation of a small section of the world, which

will then be the topic of its verbal exchanges with the learner.

This will be achieved by setting aside a circumscribed area of the

real world and placing it under surveillance by video cameras.

Within this ‘physical arena’ will be placed a number of small

objects/”props”, such as blocks/toy animals, which the learner will

be able to manipulate. Information from the cameras will allow

the computer system to build a computer model of the physical

arena, which can be updated dynamically as the learner moves

objects about. In this way, both learner and teacher are focused on

a shared reality, grounded in the physical world.

The physical arena is a unique characteristic of this system in that

users interact as usual with toys and other real-world props. There

is no obvious interaction with a computer, and the user won’t see

anything that looks like a computer (keyboard, mouse, etc.) other

than the two TV-like screens that bound the far sides of the

physical arena. In Wizard-of-Oz training/evaluation versions of

the experiments the same setup can be used for remotely

monitored teaching and collection of training data. In some such

experiments with Embodied Conversational Agents, people have

refused to believe that there wasn’t a Wizard. As Weizenbaum

[24] learned from the Eliza experience, this suspension of

disbelief does not so much result from the “intelligence” of the

computer but its ability to satisfactorily fulfill the role one

expects. In the case of Eliza, playing the part of a psychologist,

all the computer had to do was reflect back what the patient was

saying, and use the occasional keyword match or random change

of subject to keep the conversation moving. This is still the basic

technique in use in the present Thinking Head. However, for the

Language Teaching Head, we again have a situation where the

“intelligence” required of the system is minimal, and this mainly

relates to being able to track what the learner is doing in the

physical arena.

The physical arena indeed supplies literal grounding [4], which is
argued to be required for true understanding.

5.3 Virtual arena
In addition to the physical arena, there is a parallel ‘virtual arena’,

consisting of a computer screen display of a three-dimensional

scene. The virtual arena will contain virtual depictions of the same

objects that are in the physical arena. Virtual world scenes will be

created and displayed by means of the MicroJaea virtual world

software, developed in the AI Lab at Flinders for the L0 task of
teaching language to the computer.

MicroJaea (Fig. 2) allows a script writer to define entities that will

populate a virtual world, and to create elementary animated

movies involving these entities. The virtual arena is what allows

the system to ‘teach’ L2 to the learner. The system uses the

MicroJaea rendering capabilities to demonstrate to the learner

situations involving the objects in the physical arena, while at the

same time modeling the descriptions of these situations in L2. So,

for instance, one might imagine the teaching of German

prepositions in a situation where the physical arena contains, say,

a number of blocks of various colours. The Head could then utter

the sentence ‘der grüne Block ist unter dem blauen Block’ while

the rendered scene in the imaginary arena shows a green block

stacked underneath a blue one. The focus of teaching could then

involve variations on this basic theme, varying the preposition

used and the objects involved. The virtual arena will also display

text versions of the sentence currently being spoken.
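
MicroJaea's actual scripting syntax is not reproduced here; the following Python stand-in merely sketches the information a scene script for the preposition demonstration would need to carry, and its structure is an assumption.

```python
# Assumed stand-in for a MicroJaea-style scene script (the real
# scripting syntax is not shown in this paper): the entities, the
# spatial relation to render, and the utterance to speak and display.

scene = {
    "entities": [
        {"id": "b1", "shape": "block", "colour": "green"},
        {"id": "b2", "shape": "block", "colour": "blue"},
    ],
    # place the green block directly beneath the blue one
    "relations": [("b1", "under", "b2")],
    # sentence spoken by the Head and shown as on-screen text
    "utterance": "der grüne Block ist unter dem blauen Block",
}
```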

The virtual arena can also be argued to ground, although [4]

would see this as just simulated grounding. However, given that

the learner successfully makes the connection back to the physical

arena, and the computer understands correctly and reflects it in

manipulating the virtual objects, the grounding cycle can be seen

to be complete.

5.4 Vision System
The vision system for the L2 Project consists of 3 cameras

strategically placed to observe the participant’s interaction with

the Teaching Head. One camera (Participant Camera, Figure 3)

captures the participant’s face and body to gauge emotional state,

focus of attention, and allows the Teaching Head to track the
participant, e.g., tracking the participant's gaze and adjusting the
gaze of the Teaching Head to look at the participant and to direct
the participant's attention. The two additional cameras monitor the Physical

Arena from orthogonal perspectives, allowing full 3D

localization. The arena cameras keep track of the current state of

the physical world (e.g. the coloured blocks in Figure 3).

The computational aspects of the vision system are quite complex;
however, many of the algorithms required for participant and
object tracking are implemented in the Intel Open Source
Computer Vision (OpenCV) library. Specifically, we will make
use of the cascade of boosted classifiers using Haar-like features
[10, 22] and the CAMSHIFT object tracking algorithm [1].
These algorithms have been successfully applied to multiple-face
tracking and arbitrary object detection and tracking. The

complexity of the system is also reduced by having a constrained

physical arena.
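
For concreteness, the sketch below exercises the two OpenCV facilities just named – a Haar cascade for face detection and one CAMSHIFT tracking step – using the modern Python bindings; the cascade file, histogram and search window are placeholders rather than the project's actual vision code.

```python
# Sketch of the two OpenCV building blocks named above: Haar-cascade
# face detection [10, 22] and a CAMSHIFT tracking step [1].
import cv2

face_cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")

def detect_faces(frame):
    """Return bounding boxes of faces in a BGR frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

def camshift_step(frame, window, hue_hist):
    """One CAMSHIFT update: back-project the tracked object's hue
    histogram into the frame and let CAMSHIFT relocate the window."""
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    backproj = cv2.calcBackProject([hsv], [0], hue_hist, [0, 180], 1)
    criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
    rotated_box, window = cv2.CamShift(backproj, window, criteria)
    return rotated_box, window
```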

For this project, we have 3D equipment available that helps to
ensure we can achieve our expectations for the Vision System:
3D cameras/scanners, printers and autostereoscopic monitors. The
ability to accurately scan and produce both real and simulated
objects is
essential to the project. We can make use of existing libraries of

3D rendered objects, and produce them both on our 3D printer

and on a 2D or 3D screen. We can also visually capture (using

stereo cameras or laser scanners) 3D objects to produce our own

library of objects based on the desired props and toys.

This has several advantages – not only can we ensure arbitrarily
close correspondence between the physical objects and the
simulated objects, but we can also derive significant advantage in
relation to the vision monitoring. In particular, we are cheating

slightly by ensuring that we can distinguish different objects and

parts by hue and saturation alone. Note that effects due to shadow

largely do not affect hue and saturation but rather define a set of

narrow colour equivalence classes. Similarly, effects due to

changes in lighting are systematic and can be dealt with by white-

balancing. Thus for the early stages, and many of the proposed

scenarios, complex pattern matching is obviated.
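
A minimal sketch of this hue/saturation-based identification follows, assuming OpenCV's 0-179 hue scale; the green calibration range is invented and would in practice be set per prop.

```python
# Minimal sketch of the hue/saturation "cheat": each prop has a
# distinctive colour, so a simple HSV range test finds it despite
# shadows (which mostly change value, not hue/saturation). The hue
# range below is an invented calibration.
import cv2
import numpy as np

def find_prop(frame_bgr, hue_lo, hue_hi, sat_min=80):
    """Return the (x, y) image centroid of the prop, or None."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    lower = np.array([hue_lo, sat_min, 0])   # value left unconstrained:
    upper = np.array([hue_hi, 255, 255])     # this is the shadow tolerance
    mask = cv2.inRange(hsv, lower, upper)
    m = cv2.moments(mask)
    if m["m00"] == 0:
        return None
    return (m["m10"] / m["m00"], m["m01"] / m["m00"])

# e.g. the green block, assuming hue ~40-80 on OpenCV's 0-179 scale:
# centroid = find_prop(frame, 40, 80)
```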

5.5 Language games
Throughout the teaching process, the system will automatically

assess the student’s knowledge in the context of a number of

‘language games’. These activities will be engaging for young

students, and will involve simple tasks such as constructing

towers, bridges, etc. from building blocks, or putting together a

toy animal from detached body parts. During assessment, the

learner is required to demonstrate his/her newly-acquired

knowledge of the second language via manipulation of the

physical objects. In one possible language game, the Head would

speak a sentence in L2, and the learner would be expected to enact

the situation described (e.g. by stacking a blue block on top of a

green one). From the video information obtained from its

cameras, the system will be able to determine whether the correct
situation has been recreated, and will provide appropriate

feedback.
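
As an illustration, a check for one such situation ("X on top of Y") scored from the tracked 3D positions might look like the following; the coordinate conventions and tolerances are invented for the sketch.

```python
# Illustrative correctness check for "X on top of Y", scored from
# tracked 3D positions; tolerances and units are assumptions.

def is_on_top(upper, lower, xy_tol=0.5, z_gap=1.5):
    """True if `upper` sits roughly above `lower`: horizontally aligned
    within xy_tol and higher by less than z_gap, with positions given
    as (x, y, z) tuples in block-sized units."""
    dx = abs(upper[0] - lower[0])
    dy = abs(upper[1] - lower[1])
    dz = upper[2] - lower[2]
    return dx < xy_tol and dy < xy_tol and 0 < dz < z_gap

# e.g. after "leg den blauen Block auf den grünen Block":
# correct = is_on_top(positions["blue"], positions["green"])
```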

Although this interface is immediately extensible to applications

outside teaching, it is particularly well-suited for application to

teaching a second language, as it allows language learning to take

place in a context of social interaction, which has been shown to

improve acquisition [12]. The approach is compatible with

approaches to language teaching in which language acquisition is

developed within a context of social interaction that is learner-

centred, and promotes complex thinking and critical literacy.

Feedback is provided not by text on a screen, but in spoken form

by an engaging, human-like interlocutor. This simulates provision

of a personal language tutor for every student.

Although we are focused on employing puzzles and games to

teach language, the class of tasks used can be much broader than

this might suggest. It should also be noted that we have defined
between-subjects, within-class control experiments in which part of
the class uses the system in their native language, except for the

final evaluation stage of the experiment. Thus puzzles and games

should be compelling even when undertaken in the native

language. The idea is to keep the student engaged and immersed,

to give them tasks and goals to achieve, but these are not

intrinsically linguistic tasks and goals. Rather the use of language

is incidental but necessary to successful collaboration and

achievement of the goals. To the extent that the required level of

language ability is not present, it will be harder to achieve the

goal, but the goal is also graded and matched to the student's
cognitive and problem-solving level.

Many well-known games and toys are candidates here, including

both traditional board games and modern computer games,

solitaire games and challenging puzzles. To use the example of

blocks: Tetris, Lego, the Tower of Hanoi and Jenga can all be regarded as

examples of goal-oriented block play.

5.6 Empathic and phonological feedback
In teaching a language, a good teacher avoids saying “right” or

“wrong”, and aims to be generally encouraging, allowing students

to safely make mistakes, making the learning experience game-

like and fun so that risk-taking is appropriate, and modelling the

ideal form of response (including correct vocabulary, grammar,

pronunciation, inflection, etc.). In such a response the teacher will

tend to very subtly emphasize particular features of the response

that are currently being targeted, whilst perhaps not drawing

attention to other problems that are not currently in focus.

Generally, the teacher’s voice and face will convey an appropriate

degree of approbation. At this stage of our design and

experimentation, we are focussing on the visual cues to affect.

In relation to learning, our initial focus is on grammatical and

morphological learning, but we envisage that a follow-on trial will

be concerned with phonological and articulatory learning, as we

use the head to model the correct way of holding the face,

utilizing the musculature, and hence articulating the phonology

correctly. We envisage that a future version of the system will be

able to render the child’s own voice and face so that the Head can

mirror the child as s/he talks, and then show exactly how it should

look and sound – we will be looking to see whether there is any

advantage in this being in the child’s own face/voice, a same

gender or similar age persona, etc.

In the initial version, empathic capabilities are solely visual – they
are not reflected in the audio output. These abilities include a
broad range of expressions including concern, approval,
happiness, etc., as well as specific gestures such as nods and
winks. In addition, as mentioned earlier, both the head and

the eyes can be directed either to the student (with automatic

tracking using one of the cameras), or to a physical or virtual

object in the shared attention space, in order to direct or confirm

semantic reference. Phonological capabilities are currently

visually expressed by reasonably accurate lip-syncing.

Generally, the facial articulation can not only be controlled in

terms of specificity, but also in degree or subtlety. For performance

purposes it is usual to exaggerate expressivity, but for educational

purposes this is something that is being evaluated separately and

contrastively. In particular, it is hypothesized, based on teacher

training and practice, that it is of value to emphasize the

particular features of the utterance that are in focus, and to control

the degree of approbation finely.

6. EXPERIMENTAL EVALUATION
During experimental testing with users in schools, we will

determine the extent to which particular aspects or components of

the system are crucial to attaining good pedagogical outcomes.

We will investigate whether having a physically grounded

interface is necessary for teaching success. Likewise, we will

investigate the effect of a reduced display of affect and personal

warmth on the part of the Head, and the use of detailed, verbose

explanations during feedback versus more terse responses.

However, a significant feature of this project is that prior to

finalizing the details of the system to be deployed in schools we

are conducting tuning experiments specifically aimed at

determining the effect of affect. This is being measured using

variations of standard techniques for Virtual Reality evaluation,

including VRuse [7] from a general perspective and the

framework of [21] from a pedagogical perspective. These

frameworks do not directly assess the effect of affect, and some

measures are inappropriate due to different assumptions about the

nature of the virtual environments.

Two kinds of initial evaluation are performed: a within-subjects
comparison of performance between two specific affective
conditions, and a between-subjects comparison across
conditions not directly testable in this way due to interference
effects. Two kinds of measures are made in relation to each

condition: performance measures (e.g. using comprehension tests

in our first battery of experiments) and affective measures

(specific Likert questions: I found the Head

Likeable/Engaging/Easy to understand/Life-like/Humorous). In

addition to the evaluation of two conditions per subject (in a

single session with randomized order of conditions and

questions), a number of other questions are asked relating to

background and overall impression to facilitate a proposed factor

analysis.

The initial pair of conditions being investigated within subjects is
the effect of expression in the Head's speech, whilst the initial
pair of conditions being investigated between subjects is the effect
of eye contact/tracking.

This initial evaluation is being followed by two streams of

classroom evaluation, covering multiple design conditions. We

have two specific aims for the evaluation process.

Aim 1: To evaluate the influence of specific computer-assisted

language learning (CALL) components of the Thinking Head

system on language learning outcomes;

Aim 2: To evaluate the influence of specific aspects of student-

teacher interaction in the Thinking Head on language learning

outcomes.

The first stream of evaluation will be conducted as a broad-scale
pilot study encompassing preschool, K-12, and university
evaluation, the major focus of which will be the broad range of
classes offered by the Adelaide School for German. This study is

intended to derive preliminary results about both these questions,

as well as test the age suitability of the various kinds of materials

and scenarios under development. Whilst our overall aim is to

simulate the learning of preschool children, and primary school

students are seen to be the key age groups for our studies, we must

recognize that most language teaching takes place at high school

and as adults. A second feature of this study is that the Adelaide
School for German has a large proportion of students with some
degree of existing German background, so that degree of language
competence can be examined independently of age. Moreover, the
applicability of the approach to people who already have some
kind of natural German-speaking environment will allow degree
of exposure to the specific kinds of teaching scenarios to be
analyzed versus general exposure to German – validating the
teachers' assessment that these German-background students also
fail to master the preposition and case systems and can benefit
from our system.

Following tuning of the system in the light of this pilot study, we

will conduct a full study which will explore many variants of the

system at the focal year 3-4 (8-9 year old) level, as well as

conducting a second series of age varying experiments at

preschool, first-year high school and first-year university, when

there are contingents of students with vastly varying degrees of

exposure. We will specifically conduct three experiments in which

the Thinking Head system is used to teach German prepositions,

in order to evaluate the extent to which the system is successful at

the teaching task, and to identify crucial aspects of the system:

Experiment 1 will involve a comparison of the effects of

teaching students using the full Teaching Head system versus

teaching the same material in a number of traditionally structured

sessions conducted by an experienced German teacher. To this
end, a number of participating classes will be randomly split in

two, with one group being taught German spatial prepositions in a

traditional classroom-based manner, and the other group being

taught by means of the Head. We will conduct this experiment

using students from each of our 4 target groups (preschool, years

3-4 (8-9 years old), first-year secondary school, first-year

university).

Language performance will be assessed by means of a series of

tests of graded difficulty for both receptive and productive

language, including text-based multiple-choice tests, and, notably,

by assessing performance on an unseen language game which will

be conducted using the complete Teaching Head system. (For this

reason, it is important that students who are taught via traditional

methods will also receive exposure to the Thinking Head

interface, playing the same language games as the other group, but

in English rather than German, in order to ensure that they have a

comparable level of familiarity with the interface at test).

Performance of the two groups on the two assessment measures

will be compared for significant differences by means of t-tests.

For each age group, we will require a minimum of 2 classes

which will be split in this way.

Experiment 2 will be concerned with determining which

components of the Teaching Head system are crucial to success,

in line with our Aim 1. The three conditions of interest are: the

full system, the system without the visual Head interface (i.e. the

sounds of words are produced but no visual animated head is

present; No Head), and the system without the physical interface

with blocks and toys (instead these are virtual only and

manipulated using a touch screen or mouse; No physical arena).

Here we will focus on the main target age of 8/9-year-old

schoolchildren only. Classes will be split and groups randomly

assigned to each of these three conditions for the duration of

teaching. Subsequently, evaluation of learning outcomes will be

conducted on each of the groups using the same methods as for

Experiment 1.

We hypothesize that students exposed to the full system will have

a superior knowledge of prepositions to those exposed to reduced

systems.

If testing were to be conducted using the full system only,

students who had been taught using that system would enjoy a

familiarity advantage during testing. To avoid this, students will

be assessed using first the system (full or reduced) on which they

were taught, followed by the one on which they had not been

taught. This allows us to distinguish the effects of the training

interface against the effects of the testing interface.

The significance of the effect of the testing interface will be tested

by paired t-tests within subjects (for each testing condition

individually as well as both combined), and a significant effect of

the teaching interface by unpaired t-tests between subjects, as well

as ANOVAs with time, age and continuous assessment variates.

We will require a minimum of 2 split school classes in each

condition and age group.
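
For illustration, the planned paired and unpaired t-tests could be run with SciPy along the following lines; the score lists are placeholder data standing in for the graded test results.

```python
# Sketch of the planned significance tests using SciPy; all score
# lists below are placeholder data.
from scipy import stats

# Within subjects: the same students assessed on both interfaces (paired).
scores_full = [12, 15, 9, 14, 11]    # tested on the interface they were taught with
scores_other = [10, 13, 9, 12, 10]   # the same students on the other interface
t_paired, p_paired = stats.ttest_rel(scores_full, scores_other)

# Between subjects: independent groups taught under different conditions.
group_full_system = [14, 12, 15, 13]
group_no_head = [11, 10, 13, 9]
t_ind, p_ind = stats.ttest_ind(group_full_system, group_no_head)
```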

Experiment 3 will be concerned with evaluating the effects of

aspects of the student-tutor interaction itself that are crucial to

success (in line with our Aim 2). In particular, we will evaluate

the effects of having markedly reduced affective behaviour on the

part of the Head (Reduced Affect), and of providing terse

responses with little explanation (Reduced Feedback). The

experimental design and evaluation procedures are exactly the

same as for Experiment 2.

We hypothesize that the full Head system will be more efficacious

in teaching German prepositions to students than the reduced

versions.

Qualitative student and teacher evaluation of the system: In

addition to the quantitative evaluation of the use of the system for

language teaching, both student and teacher views about use of

the system and system features will be gathered in all 3

experiments.

The analysis of this qualitative information will form an important

part of the educational evaluation of the system.

Quantitative evaluation measures: It is most important to ensure

unbiased evaluation that takes into account performance levels

that are achievable without using the Teaching Head system or the

specific feature under investigation in any given experiment.

For this reason, careful baselines and controls have been included

in the above experiments, and we will be using Bookmaker

Informedness and Markedness [18-20] to directly determine the

unbiased probabilities of the tested contingencies.

This approach will also directly give us confidence intervals,

significance and power estimates which will be complementary to

the tests used in the outlined experiments and will help determine

the final group sizes.
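
For reference, Informedness and Markedness reduce to simple functions of the 2×2 contingency table, as sketched below; the function names are ours.

```python
# Bookmaker Informedness and Markedness [18-20] computed from a
# 2x2 contingency table of counts (tp, fn, fp, tn).

def informedness(tp, fn, fp, tn):
    """Recall + inverse recall - 1: the probability that a prediction
    is informed relative to chance (bias-free counterpart of recall)."""
    return tp / (tp + fn) + tn / (tn + fp) - 1

def markedness(tp, fn, fp, tn):
    """Precision + inverse precision - 1: the probability that the
    condition is marked by the predictor (counterpart of precision)."""
    return tp / (tp + fp) + tn / (tn + fn) - 1

# e.g. informedness(40, 10, 5, 45) == 0.8 + 0.9 - 1 == 0.7
```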

7. CONCLUSION
At the time of submission, the formal evaluation is only just starting, although preliminary experimental results show a significant effect of the provision of appropriate facial expression in a verbal comprehension task.

Informally, we have found in our preliminary experiments and

demonstrations that the target age group of around 9 years old

does seem to reflect a peak of interest in and willingness to

engage in long conversations with the Thinking Head. We have

found that teachers are very keen on the concept, and have
identified the interaction of prepositions and case in German
as a key area which students tend not to master

by the end of high school, even when the students have a German-

speaking home background as immigrants to Australia – this area

is thus the focus of our current design and evaluation.

Users of the systems, referred to as clients, are generally

impressed by the fact that the head maintains eye contact, and by

the range of expression that is used. The Thinking Head that is the
basis for these experiments is on display at the New Media Arts

Exhibition at the National Art Museum of China in the lead up to

the Beijing Olympics – for this performance purpose expressions

tend to be more exaggerated than natural. However, we are still

conducting experiments on subjects blind to our capability of

controlling these features, and expressions are designed to be as

natural and appropriate as possible for the subject matter, which

has been chosen from standard educational materials targeted to

senior primary school students, although for our initial evaluation

our subjects are first-year university students.

The purpose of the present study is to formally confirm that the

Teaching Head is able to successfully employ supralinguistic

features to achieve demonstrable gains in teaching performance –

unfortunately very little educational software seems to be

evaluated formally, and similarly choices in user interfaces and

virtual reality tend to be assessed post hoc rather than developed

based on a systematic psychological evaluation of the effect of

individual features.

Currently, our focus is on ‘workshopping’ with teachers and

educators to develop the teaching materials/scenarios and refine

the approach and experimental methodology. This is a major

project with international cooperation from several countries, and

with strong interest in extension to further languages. In

presenting in this forum, we are specifically inviting feedback on

our approach and encouraging participation in the project – both

in terms of porting to additional languages and supporting

development of additional teaching content, and in the application

and evaluation of the Teaching Head in different countries and

educational contexts.

We will report elsewhere on results obtained using a read-out-loud
task in which the Thinking Head read out a text with and

without expression, and the user completed both Likert testing of

affect and comprehension questions based on the read material.

Preliminary results show a range of responses to the perceived
difficulty of the material presented, and where difficulty is well
matched to the subject, a significant positive effect of the addition
of appropriate expression by the Thinking Head, but this effect is
not seen for a text perceived as difficult.

8. ACKNOWLEDGEMENTS
The Thinking Head project is funded under ARC SRI

TS0669874, in the context of the Australian Research Council

and National Health and Medical Research Council joint Special

Research Initiative in Thinking Systems. Australian partners
include the University of Western Sydney, Flinders University,
Macquarie University and the University of Canberra; industry
partners include Seeing Machines Ltd; and international partners
include Carnegie Mellon University, Berlin University of
Technology and the Technical University of Denmark.

Development of the German language Head is supported by the

Deutsche Forschungsgemeinschaft (DFG).

We also acknowledge and appreciate the assistance of the Goethe

Institute and numerous schools, teachers and students.

9. REFERENCES
[1] Bradski, G. R. 1998. Computer Vision Face Tracking For

Use in a Perceptual User Interface. Intel Technology Journal,

1998, Q2, 15

[2] Davis, C, Kim, J. Kuratate, T. & Burnham, D. 2007. Making

a thinking-talking head. Proceedings of the International

Conference on Auditory-Visual Speech Processing (AVSP

2007). Hilvarenbeek. The Netherlands.

[3] Feldman, J.A., Lakoff, G., Stolcke, A. & Hollbach Weber, S.

1990. Miniature language acquisition: a touchstone for

cognitive science. Technical report, ICSI, Berkeley, CA.

[4] Harnad, S. 1990. The Symbol Grounding Problem, Physica

D 42: 335-346.

[5] Homes, D. A. 1998. Perceptually Grounded Language

Learning. B.Sc. Honours Thesis, Dept of Computer Science,

Flinders University, Adelaide.

[6] Hume, D. 1984. Creating Interactive Worlds with Multiple

Actors, Computer Science Honours Thesis, Electrical

Engineering and Computer Science, Uni. of NSW, Sydney.

[7] Kalawsky, R.S. 1999. VRUSE-a computerized diagnostic

tool: for usability evaluation of virtual/synthetic environment

systems. Applied Ergonomics, 30, 11-25.

[8] Kindler, A. 2002. Survey of the states' limited English

proficient students and available educational programs and

services, 2000-2001 summary report. National Clearinghouse

for English Language Acquisition & Language Instruction

Educational Programs.

[9] Krashen, S. D. 1981. Principles and Practice in Second

Language Acquisition. English Language Teaching series.

London: Prentice-Hall International UK Ltd.

[10] Lienhart, R. & Maydt, J. 2002 An Extended Set of Haar-like

Features for Rapid Object Detection, IEEE ICIP 2002, 1,

900-903.

[11] Li Santi, M., Leibbrandt, R.E. & Powers, D.M.W. 2007.

Developing 3D Worlds for Language Learning, Australian

Society for Cognitive Science Conference, Adelaide,

Australia.

[12] Lucas, T. 1993. Secondary schooling for students becoming

bilingual. In M. B. Arias & U. Casanova (Eds.), Bilingual

education: politics, practice and research pp. 113-143.

Chicago: University of Chicago Press.

[13] MacWhinney, B. 1997. Second language acquisition and the

Competition Model. In A.M.B. de Groot & J. Kroll (Eds.),

Tutorials in bilingualism: Psychological perspectives pp.

169-199. Mahwah, NJ: Erlbaum.

[14] Montessori, M. 1907. House of children: a planet without

schools or teachers. http://www.montessori.edu/maria.html

accessed 15 Feb 2008.

[15] Piaget, J. 1928/2001. Judgment and reasoning in the child.

New York: Harcourt, Brace and Company.

[16] Powers, D.M.W., Leibbrandt, R.E., Li Santi, M. & Luerssen,

M.H. 2007. A multimodal environment for immersive

language learning - space, time, viewpoint and physics, Joint

HCSNet-HxI Workshop on Human Issues in Interaction and

Interactive Interfaces, Sydney, Australia.

[17] Powers, D.M.W. & Turk, C. 1989. Machine Learning of

Natural Language, Research Monograph, New York/Berlin:

Springer-Verlag.

[18] Powers, D M. W. 2003. Recall and Precision versus the

Bookmaker. International Conference on Cognitive Science,

pp529-534. http://david.wardpowers.info/BM/BMPaper.pdf

accessed 11 May 2008.

[19] Powers, D.M.W. 2007, Evaluation: From Precision, Recall

and F-Factor to ROC, Informedness, Markedness &

Correlation, School of Informatics and Engineering, Flinders
University, Adelaide, Australia, Technical Report SIE-07-001,
December 2007.

[20] Powers, D.M.W. (in press), Evaluation Evaluation, ECAI

July 21-25 2008.

[21] Roussou, M., Gillingham, M., Moher, T. 1998. Evaluation of

an Immersive Collaborative Virtual Learning Environment

for K-12 Education. American Educational Research

Association Annual Meeting (AERA), San Diego, CA.

[22] Viola, P. A. & Jones, M. J. 2001. Rapid object detection

using a boosted cascade of simple features. Proceedings of

the Conference on Computer Vision and Pattern Recognition

(CVPR'01), IEEE Computer Society, 511-518.

[23] Weinreich, U. 1953/1974. Languages in Contact, Berlin:

Walter de Gruyter.

[24] Weizenbaum, J. 1976. Computer Power and Human Reason:

From Judgment To Calculation. San Francisco: W. H.

Freeman.

[25] Yang, D. & Powers, D.M.W. 2006. Word sense

disambiguation using lexical cohesion in the context. Joint

Conference of the International Committee on

Computational Linguistics and the Association for

Computational Linguistics (COLING-ACL) 2006, Sydney,

Australia, 929-936.

[26] Yang, D. & Powers, D.M.W. 2005. Measuring semantic

similarity in the taxonomy of WordNet. Australasian Computer
Science Conference (ACSC2005), pp315-322.

[27] Yang, D. & Powers, D.M.W. 2008. Automatic Thesaurus

Construction, Australasian Computer Science Conference
(ACSC2009), pp147-156.